This guide covers several aspects of a Polyaxon deployment that anyone running Polyaxon in production should consider.
Polyaxon runs on Kubernetes, a tool that many teams are rapidly adopting, but running stateful applications on Kubernetes can be hard.
Polyaxon depends on some core components to function correctly. These include the API, the scheduler, other services for hptuning and monitoring, and third-party services such as a database. In addition to these core components, Polyaxon schedules jobs and experiments for every data scientist using the platform.
To keep the core components highly responsive, we recommend deploying them on nodes separate from those that run users' workloads. This ensures that experiments and jobs won't consume CPU or memory that the database or the API needs to stay responsive.
To achieve this behaviour, Polyaxon provides a node scheduling configuration.
Here's an example of the minimum requirement that we suggest for a production cluster:
    nodeSelectors:
      core:
        polyaxon: core
      experiments:
        polyaxon: experiments
      jobs:
        polyaxon: experiments
      builds:
        polyaxon: builds
      tensorboards:
        polyaxon: experiments
At a minimum, you can use just two selectors, one for the core components and one for the workload, to keep them separated.
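As a rough sketch of the two-selector variant, assuming the same `nodeSelectors` keys as in the example above and a hypothetical `polyaxon: workload` label (the label value is only an illustration, not a Polyaxon convention):

```yaml
# Two-pool layout: core services on one set of nodes, all user workloads on another.
# The label value "workload" is hypothetical; use whatever labels your nodes carry,
# e.g. applied with: kubectl label nodes <node-name> polyaxon=workload
nodeSelectors:
  core:
    polyaxon: core
  experiments:
    polyaxon: workload
  jobs:
    polyaxon: workload
  builds:
    polyaxon: workload
  tensorboards:
    polyaxon: workload
```

This trades the finer-grained separation of builds from experiments for a simpler node pool layout.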
Several teams have advanced setups in which they use Node Selectors, Affinity, and Tolerations to set up the default platform behaviour, and apply custom scheduling per experiment or job when needed. Please refer to this section for the full reference on node scheduling behaviour.
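One way such a setup might look is to pair a node selector with a toleration, so that core nodes can be tainted against everything else. The sketch below assumes the deployment config accepts a `tolerations` section keyed by component in the same way as `nodeSelectors`; the taint key `dedicated` and value `polyaxon-core` are hypothetical. Check the node scheduling reference for the exact keys your version supports.

```yaml
# Assumption: tolerations are keyed per component, mirroring nodeSelectors.
# Pairs with a node taint such as:
#   kubectl taint nodes <node-name> dedicated=polyaxon-core:NoSchedule
nodeSelectors:
  core:
    polyaxon: core
tolerations:
  core:
    - key: dedicated          # hypothetical taint key
      operator: Equal
      value: polyaxon-core    # hypothetical taint value
      effect: NoSchedule
```

The selector keeps core components on the labelled nodes, while the taint keeps any pod without a matching toleration off them.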
If you are running Polyaxon in production mode, we suggest that you keep your database "safe" and highly available. We provide a reference document on how to achieve a highly available database on Polyaxon in this guide.
Stateful applications are very hard to set up correctly on a Kubernetes cluster, so to achieve Postgres HA we suggest setting up an external database with Polyaxon.
We also recommend taking snapshots and backups before going through a migration. This is particularly important if an upgrade contains DB or data migrations.
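A minimal sketch of what pointing Polyaxon at an external Postgres instance might look like. The key names (`postgresql.enabled`, `externalServices.postgresql`) are assumptions about the deployment config, and every value is a placeholder; confirm both against the high-availability database guide for your version.

```yaml
# Assumption: the in-cluster Postgres can be switched off and replaced
# by connection details for a managed or self-hosted external instance.
postgresql:
  enabled: false              # do not deploy the bundled database

externalServices:
  postgresql:
    user: <db-user>           # placeholder
    password: <db-password>   # placeholder
    database: <db-name>       # placeholder
    host: <db-host>           # placeholder
```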
By default, the outputs/artifacts and logs of your experiments and jobs are stored in temporary storage. To make them durable, i.e. available after a node failure, we recommend that you read the following guides:
If you are storing code references on Polyaxon, i.e. you are managing code repos in-cluster, we also recommend durable storage for repos:
Starting from Polyaxon v0.5, we will recommend that our users run all of Polyaxon's services and workloads as a non-root, non-privileged user.
Polyaxon will expose a security context that lets users set the user ID (uid) and group ID (gid) used by its containers.
All mounted volumes will have a filesystem group with the same value as the gid provided by the user.
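In plain Kubernetes terms, this behaviour corresponds roughly to a pod-level security context. The fragment below is an illustrative pod spec, not Polyaxon's own configuration keys; the pod name, image, and the id `2222` are arbitrary example values.

```yaml
# Illustrative Kubernetes pod spec showing a non-root uid/gid and a matching fsGroup.
apiVersion: v1
kind: Pod
metadata:
  name: example-workload      # hypothetical name
spec:
  securityContext:
    runAsUser: 2222           # uid the container processes run as
    runAsGroup: 2222          # primary gid of the container processes
    fsGroup: 2222             # group applied to mounted volumes that support it
  containers:
    - name: main
      image: busybox
      command: ["id"]         # prints the effective uid/gid when the pod runs
```

Setting `fsGroup` to the same value as the gid is what lets a non-root process write to its mounted volumes.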
Debugging workloads on Kubernetes can be challenging. We generally advise our users:
- to start a notebook to get an interactive environment to try their code
- to try their code locally first, before submitting long running jobs to Polyaxon
For our tracking API, Polyaxon respects a setting that disables calls to the API.
We are also in the process of providing Local Runs for Polyaxon, where users will be able to generate Dockerfiles locally, run them, and track the results, similar to the in-cluster behaviour.
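As a hedged example, assuming the setting in question is the `POLYAXON_NO_OP` environment variable read by the Polyaxon client (verify this against the client documentation for your version), a local run could set it through a docker-compose style environment mapping. The service name and image are hypothetical.

```yaml
# Assumption: the tracking client skips API calls when POLYAXON_NO_OP is true.
services:
  train:                        # hypothetical service name
    image: my-training-image    # hypothetical image
    environment:
      POLYAXON_NO_OP: "true"    # quoted so the value stays a string
```

With the variable set, the same training code can run locally without a reachable Polyaxon API.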