Polyaxon lets you schedule distributed MPI experiments and supports tracking metrics, outputs, and models.
To use the mpi backend, users first need to install the MPIJob operator.
To enable distributed runs, you need to set the backend field to mpi and update the
You can optionally annotate your experiments with the framework you are using.
The environment section lets you customize the resources and define the topology (replicas) of the experiment.
To define a cluster in Polyaxon with 2 workers, add a replicas subsection to the environment section of your polyaxonfile:
    ...
    framework: mpi
    ...
    environment:
      replicas:
        n_workers: 2
        default_worker:
          resources:
            gpu:
              requests: 1
              limits: 1
Because the MPIOperator does not allow exposing specific resources for individual workers, you can only use the default_worker subsection, which defines the default resources applied to all workers.
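To make the effect of this restriction concrete, here is a minimal Python sketch of how a single default_worker definition applies to every worker. It is not Polyaxon code: the dict mirrors the example above (assuming n_workers and default_worker are both nested under replicas), and the helper expand_worker_resources is a hypothetical name introduced only for illustration.

```python
import copy

# Plain-dict mirror of the environment section from the example above.
# Assumption: n_workers and default_worker both sit under replicas.
environment = {
    "replicas": {
        "n_workers": 2,
        "default_worker": {
            "resources": {"gpu": {"requests": 1, "limits": 1}},
        },
    }
}


def expand_worker_resources(env):
    """Hypothetical helper: list the resources each worker would receive."""
    replicas = env["replicas"]
    default = replicas["default_worker"]["resources"]
    # Every worker gets its own copy of the same default resources;
    # there is no per-worker override to merge in.
    return [copy.deepcopy(default) for _ in range(replicas["n_workers"])]


per_worker = expand_worker_resources(environment)
total_gpus = sum(r["gpu"]["limits"] for r in per_worker)
print(len(per_worker), total_gpus)
```

Because every worker shares the same definition, the total GPU limit of the experiment is simply the per-worker limit multiplied by n_workers, which is worth checking against the cluster's capacity before raising the worker count.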