I created an app via Azure AI Studio, using Azure AI Search as the data source. When a user makes a single request, three requests in total are sent to the chat completion model. My chunk size is 256 tokens per chunk, and top_n_documents is set to 3. With that retrieved context plus some extra meta information, I would expect a total prompt size of at most ~1,200 tokens. However, a single request consumes at least ~5K prompt tokens, and I have no idea where the remaining ~3,800 tokens are coming from. It is also effectively impossible to debug, because Microsoft does not expose the request or response bodies in the diagnostic logs of the Azure OpenAI service, even with all log categories enabled, so the actual prompt the model receives is a complete black box. This is extremely inefficient on a deployment capped at 200K TPM: it limits the number of parallel requests and drives up costs.
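For what it's worth, the token usage is at least visible in the API response even though the logs hide the prompt. Here is a minimal sketch of reproducing the studio setup through the openai Python SDK and printing the usage counts; the endpoint names, keys, deployment name, and the exact data_sources parameter shape are assumptions based on the GA "on your data" request format, so adjust them to your resources:

```python
from openai import AzureOpenAI  # pip install openai

# Placeholders; replace with your own resource names and keys.
client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=[{"role": "user", "content": "What does the manual say about X?"}],
    # "On your data": the same Azure AI Search data source the studio app uses.
    # Parameter shape assumed from the GA API version; adjust if it differs.
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://<search-service>.search.windows.net",
                "index_name": "<index-name>",
                "authentication": {"type": "api_key", "key": "<search-key>"},
                "top_n_documents": 3,
            },
        }]
    },
)

# usage is reported per call, so the inflation can at least be measured.
print("prompt tokens:", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
```

This does not reveal what is in the prompt, but it makes the 5K+ prompt token count reproducible outside the studio UI.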
When using Semantic Kernel, querying the data source yourself, and controlling the workflow yourself, the same request would result in perhaps 1K prompt tokens at most (see the sketch below). Something extremely inefficient is going on, which makes this click-ops solution unsuitable for production at high usage.
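To illustrate the manual workflow I mean, here is a sketch using the azure-search-documents and openai packages directly rather than Semantic Kernel itself; the "content" field name and all endpoints, keys, and deployment names are placeholders. Because you retrieve the top 3 chunks yourself and build the prompt explicitly, the prompt size is exactly what you put into it:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents
from openai import AzureOpenAI  # pip install openai

search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<search-key>"),
)

openai_client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

question = "What does the manual say about X?"

# Retrieve only the top 3 chunks (~256 tokens each), mirroring top_n_documents=3.
results = search_client.search(search_text=question, top=3)
context = "\n\n".join(doc["content"] for doc in results)  # assumes a 'content' field

messages = [
    {"role": "system",
     "content": "Answer using only the context below.\n\n" + context},
    {"role": "user", "content": question},
]

response = openai_client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=messages,
)
print(response.choices[0].message.content)
print("prompt tokens:", response.usage.prompt_tokens)
```

With roughly 768 tokens of context plus a short system prompt and the question, this stays around 1K prompt tokens, which is the gap I am asking about.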
Many questions ask the same thing, and there is still no answer from Microsoft:
https://stackoverflow.com/questions/78779006/why-is-the-consumption-of-openai-tokens-in-azure-hybrid-search-100x-higher-in-co
https://learn.microsoft.com/en-us/answers/questions/2103832/high-token-consumption-in-azure-openai-with-your-d?page=1&orderby=Helpful&comment=answer-1904758&translated=false#newest-answer-comment
Please get back to me with a solution.