Fine-tune and deploy an AI model for inferencing on Azure Kubernetes Service (AKS) with the AI toolchain operator (Preview)

This article shows you how to fine-tune a language model and deploy the fine-tuned model for inferencing with the AI toolchain operator add-on (preview) for AKS.

The AI toolchain operator (KAITO) is a managed add-on for AKS that simplifies the deployment and operation of AI models on your AKS clusters. Starting with KAITO version 0.3.1, you can use the AKS managed add-on to fine-tune supported foundation models with new data and improve the accuracy of your AI models. To learn more about parameter-efficient fine-tuning methods and their use cases, see Concepts - Fine-tuning language models for AI and machine learning workflows on AKS.

Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:

  • AKS support policies
  • Azure support FAQ

Before you begin

  • This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
  • Azure CLI version 2.47.0 or later installed and configured. Run az --version to find the version. If you need to install or upgrade, see Install Azure CLI.

Prerequisites

  • The Kubernetes command-line client, kubectl, installed and configured. For more information, see Install kubectl.
  • Configure Azure Container Registry (ACR) integration between your AKS cluster and a new or existing ACR.
  • Install the AI toolchain operator add-on on your AKS cluster.
  • If you already have the AI toolchain operator add-on installed, update your AKS cluster to the latest version to run KAITO v0.3.1+ and ensure that the AI toolchain operator add-on feature flag is enabled.

Export environment variables

To simplify the configuration steps in this article, you can define environment variables using the following commands. Make sure to replace the placeholder values with your own.

ACR_NAME="myACRname"
ACR_USERNAME="myACRusername"
REPOSITORY="myRepository"
VERSION="repositoryVersion'
ACR_PASSWORD=$(az acr token create --name $ACR_USERNAME --registry $ACR_NAME --expiration-in-days 10 --repository $REPOSITORY content/write content/read --query "credentials.passwords[0].value" --output tsv)
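
Before creating the Kubernetes secret in the next section, you can optionally verify that the scoped token credentials work by logging in to the registry with them. This sketch assumes Docker is installed locally, since az acr login uses it to complete the login:

# Validate the token credentials by logging in to the registry.
az acr login --name $ACR_NAME --username $ACR_USERNAME --password $ACR_PASSWORD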

Create a new secret for your private registry

In this example, your KAITO fine-tuning deployment produces a containerized adapter output, and the KAITO workspace requires a new push secret as authorization to push the adapter image to your ACR.

Generate a new secret to provide the KAITO fine-tuning workspace access to push the model fine-tuning output image to your ACR using the kubectl create secret docker-registry command.

kubectl create secret docker-registry myregistrysecret --docker-server=$ACR_NAME.azurecr.io --docker-username=$ACR_USERNAME --docker-password=$ACR_PASSWORD
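
To confirm the secret was created correctly, you can optionally inspect its type. A valid image registry secret reports the kubernetes.io/dockerconfigjson type:

# Expected output: kubernetes.io/dockerconfigjson
kubectl get secret myregistrysecret --output jsonpath='{.type}'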

Fine-tune an AI model

In this example, you fine-tune the Phi-3-mini small language model using the QLoRA tuning method by applying the following KAITO fine-tuning workspace CRD:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-phi-3-mini
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: tuning-phi-3-mini-pycoder
tuning:
  preset:
    name: phi-3-mini-128k-instruct
  method: qlora
  input:
    urls:
      - "myDatasetURL"
  output:
    image: "$ACR_NAME.azurecr.io/$REPOSITORY:$VERSION"
    imagePushSecret: myregistrysecret

This example uses a public dataset specified by a URL in the input. If you choose an image as the source of your fine-tuning data, see the KAITO fine-tuning API specification to adjust the input to pull an image from your ACR.
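
The workspace manifest above references the shell variables you exported earlier, and kubectl doesn't expand them when applying a file. Assuming the manifest is saved as workspace-tuning-phi-3-mini.yaml and envsubst (from the GNU gettext package) is available, you can substitute the values and apply in one step instead of running the plain kubectl apply below:

# Expand $ACR_NAME, $REPOSITORY, and $VERSION, then apply the result.
envsubst < workspace-tuning-phi-3-mini.yaml | kubectl apply -f -

Alternatively, replace the variable references in the file with literal values before applying it.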

Note

The choice of GPU SKU is critical since model fine-tuning normally requires more GPU memory compared to model inference. To avoid GPU Out-Of-Memory errors, we recommend using NVIDIA A100 or higher tier GPUs.

  1. Apply the KAITO fine-tuning workspace CRD using the kubectl apply command.

    kubectl apply -f workspace-tuning-phi-3-mini.yaml
    
  2. Track the readiness of your GPU resources, fine-tuning job, and workspace using the kubectl get workspace command.

    kubectl get workspace -w
    

    Your output should look similar to the following example output:

    NAME                          INSTANCE                   RESOURCE READY   INFERENCE READY   JOB STARTED   WORKSPACE SUCCEEDED   AGE
    workspace-tuning-phi-3-mini   Standard_NC24ads_A100_v4   True                               True                                3m45s
    
  3. Check the status of your fine-tuning job pods using the kubectl get pods command.

    kubectl get pods
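
    To follow the tuning job while it runs, you can tail the logs of its pod. The pod name below is a placeholder; use the name returned by kubectl get pods:

    # Stream logs from the fine-tuning job pod (replace the placeholder name).
    kubectl logs -f <tuning-job-pod-name>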
    

Note

You can store the adapter at your specified output location as a container image or in any storage type supported by Kubernetes.

Deploy the fine-tuned model for inferencing

Now, you use the Phi-3-mini adapter image created in the previous section for a new inferencing deployment with this model.

The following KAITO inference workspace CRD defines the resources and adapter(s) to deploy on your AKS cluster:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini-adapter
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      apps: phi-3-adapter
inference:
  preset:
    name: "phi-3-mini-128k-instruct"
  adapters:
    - source:
        name: kubernetes-adapter
        image: $ACR_NAME.azurecr.io/$REPOSITORY:$VERSION
        imagePullSecrets:
          - myregistrysecret
      strength: "1.0"

Note

Optionally, you can pull in several adapters created from fine-tuning deployments of the same model on different datasets by defining additional "source" fields, as shown in the sketch after this note. You can then run inference with the different adapters to compare the performance of your fine-tuned model in varying contexts.
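
For example, an inference specification that layers two adapters produced from separate fine-tuning runs might look like the following sketch. The second adapter's name, image tag, and both strength values are illustrative:

inference:
  preset:
    name: "phi-3-mini-128k-instruct"
  adapters:
    - source:
        name: kubernetes-adapter
        image: $ACR_NAME.azurecr.io/$REPOSITORY:$VERSION
        imagePullSecrets:
          - myregistrysecret
      strength: "1.0"
    - source:
        name: domain-adapter        # hypothetical second adapter from another tuning run
        image: $ACR_NAME.azurecr.io/$REPOSITORY:v2
        imagePullSecrets:
          - myregistrysecret
      strength: "0.5"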

  1. Apply the KAITO inference workspace CRD using the kubectl apply command.

    kubectl apply -f workspace-phi-3-mini-adapter.yaml
    
  2. Track the readiness of your GPU resources, inference server, and workspace using the kubectl get workspace command.

    kubectl get workspace -w
    

    Your output should look similar to the following example output:

    NAME                           INSTANCE           RESOURCE READY   INFERENCE READY   JOB STARTED   WORKSPACE SUCCEEDED   AGE
    workspace-phi-3-mini-adapter   Standard_NC6s_v3   True             True                            True                  5m47s
    
  3. Check the status of your inferencing workload pods using the kubectl get pods command.

    kubectl get pods
    

    It might take several minutes for your pods to show the Running status.
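
    If the workspace conditions don't become True or the pods stay in a Pending state, describing the workspace surfaces the underlying Kubernetes events, such as GPU scheduling or image pull failures. This sketch uses the workspace name from this walkthrough:

    # Inspect events and conditions for the inference workspace.
    kubectl describe workspace workspace-phi-3-mini-adapter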

Test the model inference service endpoint

  1. Check your model inferencing service and retrieve the service IP address using the kubectl get svc command.

    export SERVICE_IP=$(kubectl get svc workspace-phi-3-mini-adapter -o jsonpath='{.spec.clusterIP}')
    
  2. Run your fine-tuned Phi-3-mini model with a sample input of your choice using the kubectl run command. The following example asks the generative AI model, "What is AKS?":

    kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"What is AKS?\"}"
    

    Your output might look similar to the following example output:

    "Kubernetes on Azure" is the official name.
    https://learn.microsoft.com/en-us/azure/aks/ ...
    

Clean up resources

If you no longer need these resources, you can delete them to avoid incurring extra Azure charges. To calculate the estimated cost of your resources, you can use the Azure pricing calculator.

Delete the KAITO workspaces and their allocated resources on your AKS cluster using the kubectl delete workspace command.

kubectl delete workspace workspace-tuning-phi-3-mini
kubectl delete workspace workspace-phi-3-mini-adapter
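
If you created the scoped ACR token and the registry secret solely for this walkthrough, you can remove them as well. The token name matches the ACR username you exported earlier:

# Remove the registry secret and the scoped ACR token.
kubectl delete secret myregistrysecret
az acr token delete --name $ACR_USERNAME --registry $ACR_NAME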

Next steps