Fine-tune and deploy an AI model for inferencing on Azure Kubernetes Service (AKS) with the AI toolchain operator (Preview)
This article shows you how to fine-tune and deploy a language model inferencing workload with the AI toolchain operator add-on (preview) for AKS. You learn how to accomplish the following tasks:
- Set environment variables to reference your Azure Container Registry (ACR) and repository details.
- Create your container registry image push/pull secret to store and retrieve private fine-tuning adapter images.
- Select a supported model and fine-tune it to your data.
- Test the inference service endpoint.
- Clean up resources.
The AI toolchain operator (KAITO) is a managed add-on for AKS that simplifies the deployment and operations for AI models on your AKS clusters. Starting with KAITO version 0.3.1 and above, you can use the AKS managed add-on to fine-tune supported foundation models with new data and enhance the accuracy of your AI models. To learn more about parameter efficient fine-tuning methods and their use cases, see Concepts - Fine-tuning language models for AI and machine learning workflows on AKS.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- Azure CLI version 2.47.0 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see Install Azure CLI.
Prerequisites
- The Kubernetes command-line client, kubectl, installed and configured. For more information, see Install kubectl.
- Configure integration of a new or existing Azure Container Registry (ACR) with your AKS cluster.
- Install the AI toolchain operator add-on on your AKS cluster.
- If you already have the AI toolchain operator add-on installed, update your AKS cluster to the latest version to run KAITO v0.3.1+ and ensure that the AI toolchain operator add-on feature flag is enabled.
Export environment variables
To simplify the configuration steps in this article, you can define environment variables using the following commands. Make sure to replace the placeholder values with your own.
```bash
ACR_NAME="myACRname"
ACR_USERNAME="myACRusername"
REPOSITORY="myRepository"
VERSION="repositoryVersion"
ACR_PASSWORD=$(az acr token create --name $ACR_USERNAME --registry $ACR_NAME --expiration-in-days 10 --repository $REPOSITORY content/write content/read --query "credentials.passwords[0].value" --output tsv)
```
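If a placeholder is left empty, the `az acr token create` command fails with a less obvious error. The following optional sketch (the `require_env` helper is my own addition, not part of the article's steps) fails fast if any required variable is unset before you request the token:

```shell
#!/usr/bin/env sh
# Optional guard: verify each required variable is set and non-empty
# before running the az acr token create command.
require_env() {
  # $1 = variable name; prints an error and returns 1 if unset or empty.
  eval "value=\${$1}"
  if [ -z "$value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

ACR_NAME="myACRname"
ACR_USERNAME="myACRusername"
REPOSITORY="myRepository"

require_env ACR_NAME && require_env ACR_USERNAME && require_env REPOSITORY \
  && echo "all registry variables set"
```

Run the guards before the token command so a typo in one variable name surfaces immediately rather than as a failed Azure CLI call.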
Create a new secret for your private registry
In this example, your KAITO fine-tuning deployment produces a containerized adapter as output, and the KAITO workspace requires a push secret to authorize pushing the adapter image to your ACR.

Generate a new secret that gives the KAITO fine-tuning workspace access to push the model fine-tuning output image to your ACR using the `kubectl create secret docker-registry` command.

```bash
kubectl create secret docker-registry myregistrysecret --docker-server=$ACR_NAME.azurecr.io --docker-username=$ACR_USERNAME --docker-password=$ACR_PASSWORD
```
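For reference, the `kubectl create secret docker-registry` command generates a Secret of type `kubernetes.io/dockerconfigjson`. A sketch of the resulting object looks like the following (the base64 payload shown is a placeholder, not real credentials):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: myregistrysecret
type: kubernetes.io/dockerconfigjson
data:
  # Base64-encoded Docker config containing your ACR server, username,
  # and token password; placeholder value shown here.
  .dockerconfigjson: eyJhdXRocyI6eyI8cGxhY2Vob2xkZXI+Ijp7fX19
```

You can inspect the secret the command actually created with `kubectl get secret myregistrysecret -o yaml`.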
Fine-tune an AI model
In this example, you fine-tune the Phi-3-mini small language model using the QLoRA tuning method by applying the following Phi-3-mini KAITO fine-tuning workspace CRD:

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-phi-3-mini
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: tuning-phi-3-mini-pycoder
tuning:
  preset:
    name: phi-3-mini-128k-instruct
  method: qlora
  input:
    urls:
      - "myDatasetURL"
  output:
    image: "$ACR_NAME.azurecr.io/$REPOSITORY:$VERSION"
    imagePushSecret: myregistrysecret
```
This example uses a public dataset specified by a URL in the input. If you choose an image as the source of your fine-tuning data, refer to the KAITO fine-tuning API specification to adjust the input to pull an image from your ACR.
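As a sketch of that alternative, an image-based input might look like the following. The field names here reflect my reading of the KAITO tuning API and should be verified against the current specification, and `myDataImage` is a hypothetical repository name:

```yaml
tuning:
  preset:
    name: phi-3-mini-128k-instruct
  method: qlora
  input:
    # Pull the fine-tuning dataset from a private image in your ACR
    # instead of a public URL; myDataImage is a hypothetical repository.
    image: $ACR_NAME.azurecr.io/myDataImage:latest
    imagePullSecrets:
      - myregistrysecret
  output:
    image: "$ACR_NAME.azurecr.io/$REPOSITORY:$VERSION"
    imagePushSecret: myregistrysecret
```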
Note
The choice of GPU SKU is critical since model fine-tuning normally requires more GPU memory compared to model inference. To avoid GPU Out-Of-Memory errors, we recommend using NVIDIA A100 or higher tier GPUs.
Apply the KAITO fine-tuning workspace CRD using the `kubectl apply` command.

```bash
kubectl apply -f workspace-tuning-phi-3-mini.yaml
```
Track the readiness of your GPU resources, fine-tuning job, and workspace using the `kubectl get workspace` command.

```bash
kubectl get workspace -w
```
Your output should look similar to the following example output:
```
NAME                          INSTANCE                   RESOURCE READY   INFERENCE READY   JOB STARTED   WORKSPACE SUCCEEDED   AGE
workspace-tuning-phi-3-mini   Standard_NC24ads_A100_v4   True                               True                                3m45s
```
Check the status of your fine-tuning job pods using the `kubectl get pods` command.

```bash
kubectl get pods
```
Note
You can store the adapter output at your specified output location as a container image, or in any storage type supported by Kubernetes.
Deploy the fine-tuned model for inferencing
Now, you use the Phi-3-mini adapter image created in the previous section for a new inference deployment with this model.

The following KAITO inference workspace CRD defines the resources and adapter(s) to deploy on your AKS cluster:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini-adapter
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3-adapter
inference:
  preset:
    name: "phi-3-mini-128k-instruct"
  adapters:
    - source:
        name: kubernetes-adapter
        image: $ACR_NAME.azurecr.io/$REPOSITORY:$VERSION
        imagePullSecrets:
          - myregistrysecret
      strength: "1.0"
```
Note
Optionally, you can pull in several adapters created from fine-tuning deployments with the same model on different data sets by defining additional `source` fields. Run inference with different adapters to compare the performance of your fine-tuned model in varying contexts.
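For example, extending the inference CRD above, two adapters from different fine-tuning runs could be combined like this (the second adapter's name, image tag, and strength are hypothetical values for illustration):

```yaml
inference:
  preset:
    name: "phi-3-mini-128k-instruct"
  adapters:
    - source:
        name: kubernetes-adapter
        image: $ACR_NAME.azurecr.io/$REPOSITORY:$VERSION
        imagePullSecrets:
          - myregistrysecret
      strength: "1.0"
    - source:
        # Hypothetical second adapter fine-tuned on a different data set.
        name: python-adapter
        image: $ACR_NAME.azurecr.io/$REPOSITORY:pythonAdapterVersion
        imagePullSecrets:
          - myregistrysecret
      strength: "0.5"
```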
Apply the KAITO inference workspace CRD using the `kubectl apply` command.

```bash
kubectl apply -f workspace-phi-3-mini-adapter.yaml
```
Track the readiness of your GPU resources, inference server, and workspace using the `kubectl get workspace` command.

```bash
kubectl get workspace -w
```
Your output should look similar to the following example output:
```
NAME                           INSTANCE           RESOURCE READY   INFERENCE READY   JOB STARTED   WORKSPACE SUCCEEDED   AGE
workspace-phi-3-mini-adapter   Standard_NC6s_v3   True             True              True                                5m47s
```
Check the status of your inference workload pods using the `kubectl get pods` command.

```bash
kubectl get pods
```
It might take several minutes for your pods to show the `Running` status.
Test the model inference service endpoint
Check your model inference service and retrieve the service IP address using the `kubectl get svc` command.

```bash
export SERVICE_IP=$(kubectl get svc workspace-phi-3-mini-adapter -o jsonpath='{.spec.clusterIP}')
```
Run your fine-tuned Phi-3-mini model with a sample input of your choice using the `kubectl run` command. The following example asks the generative AI model, "What is AKS?":

```bash
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"What is AKS?\"}"
```
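Escaping JSON inside a double-quoted `-d` argument gets error-prone as prompts grow. A sketch of a safer pattern (the `make_payload` helper is my own, not part of the KAITO tooling) builds the request body with `printf` first:

```shell
#!/usr/bin/env sh
# Build the JSON request body separately, then pass it to curl.
# make_payload is a hypothetical helper, not part of the article's steps.
make_payload() {
  # $1 = prompt text; note this does not escape embedded double quotes.
  printf '{"prompt":"%s"}' "$1"
}

PAYLOAD=$(make_payload "What is AKS?")
echo "$PAYLOAD"

# With a reachable service IP, the request would then look like:
# kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
#   curl -X POST "http://$SERVICE_IP/chat" \
#     -H "accept: application/json" -H "Content-Type: application/json" \
#     -d "$PAYLOAD"
```

Keeping the payload in a variable also makes it easy to swap prompts when comparing adapters.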
Your output might look similar to the following example output:
"Kubernetes on Azure" is the official name. https://learn.microsoft.com/en-us/azure/aks/ ...
Clean up resources
If you no longer need these resources, you can delete them to avoid incurring extra Azure charges. To calculate the estimated cost of your resources, you can use the Azure pricing calculator.
Delete the KAITO workspaces and their allocated resources on your AKS cluster using the `kubectl delete workspace` command.

```bash
kubectl delete workspace workspace-tuning-phi-3-mini
kubectl delete workspace workspace-phi-3-mini-adapter
```
Next steps
- Learn more on how to Fine tune language models with KAITO - AKS Engineering Blog
- Explore MLOps for AI and machine learning workflows and best practices on AKS
- Learn about supported families of GPUs on Azure Kubernetes Service