I created an app via Azure AI Studio, using Azure AI Search as the data source. When a user makes a single request, three requests in total are sent to the chat completion model. My chunk size is 256 tokens per chunk, and top_n_documents is set to 3. With that retrieved context plus some extra meta information, I would expect a total prompt size of at most ~1,200 tokens. However, a single request consumes at least ~5K prompt tokens, and I have no idea where the remaining ~3,800 tokens are coming from. It is also effectively impossible to debug, because Microsoft does not expose the request or response bodies in the diagnostic logs of the Azure OpenAI service, even with all log categories enabled, so the actual prompt the model receives is a complete black box. This is extremely inefficient on a deployment capped at 200K TPM: it limits the number of parallel requests and drives up costs.
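For what it's worth, the token usage is at least visible in the API response even though the logs hide the prompt. Here is a minimal sketch of reproducing the studio setup through the openai Python SDK and printing the usage counts; the endpoint names, keys, deployment name, and the exact data_sources parameter shape are assumptions based on the GA "on your data" request format, so adjust them to your resources:

```python
from openai import AzureOpenAI  # pip install openai

# Placeholders; replace with your own resource names and keys.
client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=[{"role": "user", "content": "What does the manual say about X?"}],
    # "On your data": the same Azure AI Search data source the studio app uses.
    # Parameter shape assumed from the GA API version; adjust if it differs.
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": "https://<search-service>.search.windows.net",
                "index_name": "<index-name>",
                "authentication": {"type": "api_key", "key": "<search-key>"},
                "top_n_documents": 3,
            },
        }]
    },
)

# usage is reported per call, so the inflation can at least be measured.
print("prompt tokens:", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
```

This does not reveal what is in the prompt, but it makes the 5K+ prompt token count reproducible outside the studio UI.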
When using Semantic Kernel, querying the data source yourself, and controlling the workflow yourself, the same request would result in perhaps 1K prompt tokens at most (see the sketch below). Something extremely inefficient is going on, which makes this click-ops solution unsuitable for production at high usage.
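To illustrate the manual workflow I mean, here is a sketch using the azure-search-documents and openai packages directly rather than Semantic Kernel itself; the "content" field name and all endpoints, keys, and deployment names are placeholders. Because you retrieve the top 3 chunks yourself and build the prompt explicitly, the prompt size is exactly what you put into it:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents
from openai import AzureOpenAI  # pip install openai

search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<search-key>"),
)

openai_client = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-06-01",
)

question = "What does the manual say about X?"

# Retrieve only the top 3 chunks (~256 tokens each), mirroring top_n_documents=3.
results = search_client.search(search_text=question, top=3)
context = "\n\n".join(doc["content"] for doc in results)  # assumes a 'content' field

messages = [
    {"role": "system",
     "content": "Answer using only the context below.\n\n" + context},
    {"role": "user", "content": question},
]

response = openai_client.chat.completions.create(
    model="<chat-deployment-name>",
    messages=messages,
)
print(response.choices[0].message.content)
print("prompt tokens:", response.usage.prompt_tokens)
```

With roughly 768 tokens of context plus a short system prompt and the question, this stays around 1K prompt tokens, which is the gap I am asking about.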
Many questions ask the same thing, and there is still no answer from Microsoft:
https://stackoverflow.com/questions/78779006/why-is-the-consumption-of-openai-tokens-in-azure-hybrid-search-100x-higher-in-co
https://learn.microsoft.com/en-us/answers/questions/2103832/high-token-consumption-in-azure-openai-with-your-d?page=1&orderby=Helpful&comment=answer-1904758&translated=false#newest-answer-comment
Please get back to me with a solution.