Unexplainable token usage with Azure AI Search as data source

Mehmetcan Sinir 0 Reputation points
2025-03-06T17:38:01.0433333+00:00

I created an app via Azure AI Studio, using Azure AI Search as the data source. When a user makes a single request, three requests in total are sent to the chat completion model, per single user request that is. My chunk size is 256 tokens per chunk, and the top_n_documents setting is set to 3. With this context and some extra meta information, I would expect a total prompt size of at most 1,200 tokens. However, a single request consumes at least 5K tokens. I have no idea where the remaining ~3,800 tokens are coming from, and it is impossible to debug, as Microsoft does not expose the response body in the diagnostic logs of the Azure OpenAI service, even with all logs activated. So the actual prompt the model is receiving is a complete black box. This is extremely inefficient when using a model with a total TPM quota of 200K tokens: it limits the number of parallel requests and results in high costs.
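To make the gap concrete, here is the back-of-the-envelope arithmetic I am working from; the overhead figure is my own rough guess, and the observed number is what `response.usage.prompt_tokens` reports on the chat completions response:

```python
# Expected prompt size vs. what the service actually consumes.
chunk_tokens = 256       # chunk size per chunk
top_n_documents = 3      # top_n_documents setting
context = chunk_tokens * top_n_documents  # 768 tokens of retrieved chunks
overhead = 432           # rough guess: system prompt, citations, user question
expected = context + overhead             # ~1,200 tokens
observed = 5_000         # reported by response.usage.prompt_tokens
print(f"unaccounted: {observed - expected} tokens")  # ~3,800 tokens
```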

When using Semantic Kernel, querying the data source yourself, and controlling the workflow yourself, the same request would result in perhaps 1K tokens at most. Something extremely inefficient is going on, which makes this ClickOps solution unsuitable for production with high usage.
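For comparison, a minimal hand-rolled retrieval flow with the azure-search-documents and openai SDKs looks like the sketch below. The endpoints, keys, deployment name, and the `content` field name are placeholders; it assumes the index stores the 256-token chunks in a `content` field:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Retrieve only the top 3 chunks myself -- nothing else enters the prompt.
search = SearchClient(
    "https://<search>.search.windows.net",  # placeholder endpoint
    "<index-name>",
    AzureKeyCredential("<search-key>"),
)
question = "What is our refund policy?"
hits = search.search(question, top=3)
context = "\n\n".join(doc["content"] for doc in hits)  # assumes a 'content' field

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)
response = client.chat.completions.create(
    model="<chat-deployment>",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
# 3 x 256 context tokens plus the question and system text: ~1K prompt tokens.
print(response.usage.prompt_tokens)
```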

There are many questions asking the same thing, and still no answer from Microsoft!

https://stackoverflow.com/questions/78779006/why-is-the-consumption-of-openai-tokens-in-azure-hybrid-search-100x-higher-in-co
https://learn.microsoft.com/en-us/answers/questions/2103832/high-token-consumption-in-azure-openai-with-your-d?page=1&orderby=Helpful&comment=answer-1904758&translated=false#newest-answer-comment

Please get back to me with a solution.

Azure OpenAI Service

1 answer

  1. Manas Mohanty 1,850 Reputation points Microsoft External Staff
    2025-03-10T14:05:53.4933333+00:00

    Hi Mehmetcan Sinir,

    TPM usage also depends on the size of the query run against AI Search, in addition to the prompt tokens and output tokens.

    You can view detailed usage (active tokens, processed tokens, etc.) on the Azure OpenAI monitoring page via a custom Kusto query, and general metrics such as processed tokens and active tokens with a filter on the model deployment.

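    For illustration, here is a sketch of such a query run with the azure-monitor-query SDK. The workspace ID is a placeholder, and the metric names (ProcessedPromptTokens, GeneratedTokens) are assumptions you should verify on the Metrics blade of your Azure OpenAI resource:

    ```python
    from datetime import timedelta

    from azure.identity import DefaultAzureCredential  # pip install azure-identity
    from azure.monitor.query import LogsQueryClient    # pip install azure-monitor-query

    # Hourly prompt/completion token totals from the AzureMetrics table.
    QUERY = """
    AzureMetrics
    | where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
    | where MetricName in ("ProcessedPromptTokens", "GeneratedTokens")
    | summarize tokens = sum(Total) by MetricName, bin(TimeGenerated, 1h)
    | order by TimeGenerated asc
    """

    client = LogsQueryClient(DefaultAzureCredential())
    result = client.query_workspace(
        "<log-analytics-workspace-id>",  # placeholder
        QUERY,
        timespan=timedelta(days=1),
    )
    for table in result.tables:
        for row in table.rows:
            print(row)
    ```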

    Here are the suggested steps to reduce token usage (a combined sketch follows the list):

    1. You can reduce the max_tokens setting in the deployment configuration of Azure OpenAI.
    2. You can reduce the chunk size when adding/creating the AI Search index.
    3. You can adjust top_p and top_k to reduce the number of results reflected in the answer.
    4. You can ask the model to keep the answer within a certain word limit (200 or 300 words, etc.).
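
    As a sketch of how these knobs appear in a single On Your Data request, assuming the openai Python SDK; the endpoint, keys, and deployment names are placeholders, and note that in this API the parameter that caps the number of retrieved results is top_n_documents rather than top_k:

    ```python
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<resource>.openai.azure.com",  # placeholder
        api_key="<api-key>",
        api_version="2024-02-01",
    )
    response = client.chat.completions.create(
        model="<chat-deployment>",
        max_tokens=300,  # step 1: cap the completion size
        messages=[
            # step 4: keep the answer within a word limit
            {"role": "system", "content": "Answer in at most 200 words."},
            {"role": "user", "content": "Summarize the refund policy."},
        ],
        extra_body={
            "data_sources": [{
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<search>.search.windows.net",
                    "index_name": "<index>",  # step 2: rebuilt with smaller chunks
                    "authentication": {"type": "api_key", "key": "<search-key>"},
                    "top_n_documents": 3,  # fewer retrieved documents, fewer prompt tokens
                    "strictness": 3,
                },
            }]
        },
    )
    print(response.usage)  # verify the effect on prompt_tokens
    ```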

    Hope it helps.

    Thank you.

