Building and Deploying Kubeflow Machine Learning Workflows
Training a machine learning model is a resource-intensive task. It’s possible to try new ideas on a single machine, but as your project scales you will find that you need more power. Unfortunately, scaling up your training isn’t simply a case of throwing more processing power and more data at the system. Data analysts quickly run into the problem of technical debt.
Machine learning code is just one part of your system. As you expand your implementation, you will find that there are many other things to take into account:
- Data collection and verification
- Feature extraction
- Data analysis
- Machine resource management
- Process management
Those issues are often more complex and time-consuming than the actual data processing. There are many tools that claim to handle one or more of the above issues, but handling them all is harder than it first looks. Kubeflow aims to solve that problem by bringing together a suite of the best tools in a way that “just works”.
How to Use Kubeflow to Solve Your Machine Learning Technical Debt
Kubeflow runs in Kubernetes clusters. While it most commonly runs with TensorFlow, it can also use PyTorch and other popular ML frameworks. Setting up Hosted Kubeflow Pipelines is fairly simple:
Configuring Hosted Kubeflow Pipelines
For the purposes of this guide we’ll assume that you’re running on the Google Cloud AI Platform. Create a new Kubeflow Pipelines workflow, and give it the relevant cluster permissions. Run the following in the console, replacing ACCOUNT, ZONE, CLUSTER and NAMESPACE with the details from your GCP setup:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/machine_learning/deepdive/06_structured/pipelines
./setup_auth.sh ACCOUNT ZONE CLUSTER NAMESPACE
You may need to add some IAM permissions after doing this.
You can use AI Platform Notebooks to develop and deploy your code. Create a new instance. In this case we’ll use TensorFlow 2.x, but you can apply the same principles to other frameworks. Choose a descriptive name for the instance, and then make sure the appropriate permissions are set in Machine Configuration -> Permissions.
Download some sample data/models to test your setup with:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Navigate to the notebook and change the pipelines host to the URL of your Kubeflow Pipelines installation. Once that is done, you can deploy the notebook manually by running its cells.
Authoring Your Pipelines
There are two approaches to authoring pipelines. From a data scientist’s point of view, Jupyter notebooks are the easiest to use. Sysadmins tend to prefer Docker. If you’re a data scientist it makes sense to develop in Jupyter notebooks, then let the people in operations do the work of making your models something that they can deploy.
As we said earlier, simply running a cell in a notebook is enough to deploy it. If you have a notebook that can perform a set of actions for part of your ML workflow, how can you turn that into something that runs in a Kubeflow pipeline? Well, any self-contained container that accepts parameters and writes output can be classed as a component. All you need to do is package the notebook so it can run as a Docker image.
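The notion of a component as something that accepts parameters and writes output can be sketched in plain Python. This is only an illustration of that contract, not code from the Kubeflow SDK: the function name `run_component`, the flags `--learning-rate`, `--epochs` and `--output-path`, and the JSON result are all hypothetical.

```python
import argparse
import json
from pathlib import Path


def run_component(args=None):
    """Parse parameters, do some work, and write the result to a file."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline component")
    # Hypothetical parameters; a real component would define its own.
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--output-path", type=Path, required=True)
    opts = parser.parse_args(args)

    # Stand-in for the real work (training, evaluation, and so on).
    result = {"learning_rate": opts.learning_rate, "epochs": opts.epochs}

    # Write the output where the caller asked, so a later step can read it.
    opts.output_path.parent.mkdir(parents=True, exist_ok=True)
    opts.output_path.write_text(json.dumps(result))
    return result
```

Inside a container, the entry point would call `run_component()` with no arguments so that the flags are read from the command line. Because everything the step needs arrives as parameters and everything it produces lands in a file, the same image can be reused by different pipelines.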
Let’s imagine your notebook relies on TensorFlow, Python 3 and GCloud. Your Dockerfile should cover those dependencies:
RUN apt-get update -y && apt-get install --no-install-recommends -y -q python3-dev python3-setuptools python3-pip
RUN python3 -m pip install tensorflow==2.2.0 jupyter papermill
COPY notebook.sh ./
ENTRYPOINT ["bash", "./notebook.sh"]
The entry point for the Docker image is notebook.sh, which contains the instructions to execute the notebook using Papermill:
papermill input.ipynb output.ipynb -f params.yaml --log-output
The file params.yaml contains any relevant configuration information for the notebook.
In the longer term, you may wish to set up a script that copies the notebook from Google Cloud Storage into Kubeflow and then copies the output back to Google Cloud.
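A copy, run, copy script of that kind might be planned along the following lines. This is a minimal sketch that assumes gsutil is available in the container and that both the source and the destination are Cloud Storage URIs. The function names and the bucket paths in the usage note are hypothetical, and the commands are only built here, not executed.

```python
import shlex


def plan_notebook_run(input_uri, output_uri, params_file="params.yaml"):
    """Return the three commands for one notebook run, in order, without executing them."""
    return [
        # 1. Fetch the notebook from Cloud Storage into the container.
        ["gsutil", "cp", input_uri, "input.ipynb"],
        # 2. Execute it with Papermill, using the same flags as notebook.sh.
        ["papermill", "input.ipynb", "output.ipynb", "-f", params_file, "--log-output"],
        # 3. Upload the executed notebook to the output location.
        ["gsutil", "cp", "output.ipynb", output_uri],
    ]


def as_shell(commands):
    """Render the plan as shell lines, quoting each argument."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)
```

Calling `plan_notebook_run("gs://example-bucket/in.ipynb", "gs://example-bucket/out.ipynb")` yields three argument lists that could each be passed to `subprocess.run(cmd, check=True)`, or rendered with `as_shell` and pasted into notebook.sh. Keeping the plan separate from its execution makes the sequence easy to inspect before it touches any storage.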
Adding Your Notebook to Your Pipeline
We have just one notebook here. A major reason for having pipelines is so that you can run multiple notebooks as a part of your pipeline, and re-use them. You can have a simple pipeline that runs just one notebook, though.
Simply define the pipeline as normal, including the name, description, inputs, outputs and parameters, and point to the notebook.
Be aware that while notebooks are agile and easy for data scientists to use, they do add to technical debt. A monolithic notebook defeats the object of using notebooks, since you can’t re-use it as easily. Unfortunately, breaking notebooks up into specific tasks comes with its own issues, as it becomes harder to track dependencies.
Jupyter notebooks are great for development and experimentation, but moving to containers for your mature ML models makes a lot of sense. For rapid development of models, put CI/CD in place to automatically rebuild your ML components when you have new code ready to go.
CI/CD will detect when code has changed and rebuild the Docker images when necessary so that the ML system is always running the latest pipelines. This is fairly easy to set up with GitHub triggers, assuming you have a sensible and clear directory structure for your repo. Whenever a file in the relevant directory is committed, the trigger can fire and run a gcloud build command.
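The step of deciding which images need rebuilding can be sketched as a small helper. It assumes a hypothetical repo layout in which each component lives in its own directory under `components/` (for example `components/train/Dockerfile`); the function name and the layout are illustrative and do not come from any CI product.

```python
from pathlib import PurePosixPath


def components_to_rebuild(changed_files, components_root="components"):
    """Return the component directories touched by a commit, sorted by name."""
    root = PurePosixPath(components_root)
    touched = set()
    for name in changed_files:
        try:
            relative = PurePosixPath(name).relative_to(root)
        except ValueError:
            continue  # the file is outside the components tree
        # Only files inside a component subdirectory count,
        # not files that sit directly under the root.
        if len(relative.parts) >= 2:
            touched.add(relative.parts[0])
    return sorted(touched)
```

Given the list of files changed in a commit, the helper returns only the component directories that need a new image, so a change to the README or to an unrelated directory triggers no rebuild. This is why a clear directory structure matters: the mapping from changed file to image stays a simple path lookup.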
Streamline Your Machine Learning
Kubeflow makes it easy to deploy your machine learning models. Whether you’re running your code as notebooks or in Docker containers, Kubeflow allows you to focus on the model. This is especially important for data scientists because it lets them concentrate on the notebook without having to worry about the hardware that it’s running on. The optimization for the hardware can come later, when the model is ready to be deployed on a larger scale.