你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

Azure OpenAI Evaluation (预览版)

项目
12/10/2024

大型语言模型的评估是衡量这些模型在不同任务和维度上的性能的关键步骤。这对于微调的模型尤其重要，评估训练的性能提升（或损失）对于这类模型至关重要。全面的评估有助于了解模型的不同版本如何影响应用程序或方案。

Azure OpenAI 评估使开发人员能够创建评估运行，以针对预期的输入/输出对进行测试，从而通过各种关键指标评估模型的性能，例如准确性、可靠性和整体性能。

评估支持

区域可用性

美国东部 2
美国中北部
瑞典中部
瑞士西部

支持的部署类型

Standard
已预配

评估管道

测试数据

你需要将要测试的内容集合到一个事实数据集。数据集创建通常是一个迭代过程，可确保评估在一段时间内与方案保持相关。此事实数据集通常是手动制作的，表示预期的模型行为。该数据集也会被标记，并包括预期的答案。

注意

某些评估测试(如“情绪”和“有效的 JSON 或 XML”)不需要事实数据。

数据源需要采用 JSONL 格式。下面是 JSONL 评估数据集的两个示例:

{"question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.", "subject": "abstract_algebra", "A": "0", "B": "4", "C": "2", "D": "6", "answer": "B", "completion": "B"}
{"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.", "subject": "abstract_algebra", "A": "8", "B": "2", "C": "24", "D": "120", "answer": "C", "completion": "C"}
{"question": "Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5", "subject": "abstract_algebra", "A": "0", "B": "1", "C": "0,1", "D": "0,4", "answer": "D", "completion": "D"}

上传并选择评估文件时，将返回前三行内容的预览:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}

上传并选择评估文件时，将返回前三行内容的预览:

你可以选择任何以前上传的现有数据集，或上传新的数据集。

生成响应(可选)

在评估中使用的提示应与计划在生产中使用的提示相匹配。这些提示提供了模型要遵循的说明。与操场体验类似，你可以创建多个输入，以在提示中添加小样本示例。有关详细信息，请参阅提示工程技术，详细了解提示设计和提示工程中的一些高级技术。

你可以使用 {{input.column_name}} 格式在提示中引用输入数据，其中 column_name 对应于输入文件中列的名称。

后续步骤将使用 {{sample.output_text}} 格式引用在评估过程中生成的输出。

注意

你需要使用双大括号来确保正确引用数据。

模型部署

作为创建评估的一部分，你将选取生成响应时要使用的模型(可选)，以及根据特定测试条件对模型进行评分时要使用的模型。

在 Azure OpenAI 中，你在评估过程中将分配要使用的特定模型部署。你可以通过为每个模型创建单独的评估配置来比较多个部署。这使你可以为每个评估定义特定的提示，从而更好地控制不同模型所需的变体。

你可以评估基本模型部署或优化模型部署。列表中的可用部署取决于你在 Azure OpenAI 资源中创建的部署。如果找不到所需的部署，可以从 Azure OpenAI 评估页创建新的部署。

测试条件

测试条件用于评估目标模型生成的每个输出的有效性。这些测试将输入数据与输出数据进行比较，以确保一致性。你可以灵活地配置不同的条件来测试和测量不同级别的输出的质量和相关性。

入门

在 Azure AI Foundry 门户中选择“Azure OpenAI 评估(预览版)”。若要查看此视图，可能需要首先在受支持的区域中选择一个现有的 Azure OpenAI 资源。
选择“新建评估”
输入评估的名称。默认情况下，将自动生成随机名称，除非你进行编辑并替换它。 >选择“上传新数据集”。

选择将采用 .jsonl 格式的评估。如果需要示例测试文件，可以将这 10 行保存到名为“eval-test.jsonl”的文件:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}      
{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}      
{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}      
{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}      
{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}      
{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}      
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}      
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}

你可以预览文件的前三行:

选择“生成响应”切换按钮。从下拉列表中选择{{item.input}}。这会将评估文件中的输入字段注入单个提示，提示运行新的模型，我们希望能够将该模型与评估数据集进行比较。该模型将采用该输入并生成自己的唯一输出，在这种情况下，该输出将存储在名为“{{sample.output_text}}”的变量中。我们稍后会将该示例输出文本作为我们的测试条件使用。或者，你可以手动提供自己的自定义系统消息和单个消息示例。
根据评估选择要生成响应的模型。如果没有模型，可以创建一个。在本示例中，我们将使用 gpt-4o-mini 的标准部署。

设置/sprocket 符号控制传递给模型的基本参数。目前仅支持以下参数:

温度
最大长度
Top P

无论选择哪种模型，当前最大长度上限都为 2048。

选择“添加测试条件”，然后选择“添加”。
选择“对比”下面的“语义相似性”，然后选择“添加”。>{{item.output}}{{sample.output_text}} 这会将评估 .jsonl 文件中的原始引用输出与给定模型提示基于 {{item.input}} 将生成的输出进行对比。
此时选择“添加”，可以添加其他测试条件，或者选择“创建”以启动评估作业运行。
选择“创建”后，你将进入评估作业的状态页。
创建评估作业后，可以选择该作业以查看其完整的详细信息:
对于语义相似性,“查看输出详细信息”包含可以复制/粘贴通过测试的 JSON 表示形式。

测试条件详细信息

Azure OpenAI 评估提供多个测试条件选项。以下部分提供有关每个选项的其他详细信息。

真实性

通过将提交答案与专家答案进行比较来评估提交答案的事实准确性。

真实性功能通过将提交答案与专家答案进行比较来评估提交答案的事实准确性。使用详细的思考链 (CoT) 提示，评分者可确定提交的答案与专家的答案是相同或是相悖的，或者是专家答案的子集或超集。它无视风格、语法或标点符号的差异，只关注事实内容。在很多情况下，真实性功能非常有用，包括但不限于内容验证和教育工具，可确保 AI 提供的答案的准确性。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

语义相似性

度量模型响应与引用之间的相似度。 Grades: 1 (completely different) - 5 (very similar)。

情绪

尝试识别输出的情感基调。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
You will be presented with a text generated by a large language model. Your job is to rate the sentiment of the text. Your options are:

A) Positive
B) Neutral
C) Negative
D) Unsure

[BEGIN TEXT]
***
[{text}]
***
[END TEXT]

First, write out in a step by step manner your reasoning about the answer to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line

字符串检查

验证输出是否与预期的字符串完全匹配。

字符串检查对允许多种评估条件的两个字符串变量执行各种二进制操作。它有助于验证各种字符串关系，包括相等、包含和特定模式。此评估程序允许区分大小写或不区分大小写的比较。它还为 true 或 false 结果提供指定的分数，允许基于比较结果的自定义评估结果。下面是支持的操作类型:

equals: 检查输出字符串是否完全等于评估字符串。
contains：检查评估字符串是否为输出字符串的子字符串。
starts-with: 检查输出字符串是否以评估字符串开头。
ends-with: 检查输出字符串是否以评估字符串结尾。

注意

在测试条件中设置某些参数时，可以在“变量”和“模板”之间进行选择。如果要引用输入数据中的列，请选择“变量”。如果要提供固定字符串，请选择“模板”。

有效的 JSON 或 XML

验证输出是否为有效的 JSON 或 XML。

匹配架构

确保输出遵循指定的结构。

条件匹配

评估模型的响应是否符合条件。成绩: 通过或失败。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
Your job is to assess the final response of an assistant based on conversation history and provided criteria for what makes a good response from the assistant. Here is the data:

[BEGIN DATA]
***
[Conversation]: {conversation}
***
[Response]: {response}
***
[Criteria]: {criteria}
***
[END DATA]

Does the response meet the criteria? First, write out in a step by step manner your reasoning about the criteria to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. "Y" for yes if the response meets the criteria, and "N" for no if it does not. At the end, repeat just the letter again by itself on a new line.
Reasoning:

文本质量

通过与参考文本进行比较来评估文本的质量。

汇总：

BLEU 评分: 通过使用 BLEU 分数将生成文本与一个或多个高质量的参考翻译进行比较来评估生成的文本的质量。
ROUGE 分数: 通过使用 ROUGE 分数将生成的文本与参考摘要进行比较来评估生成的文本的质量。
余弦值: 也称为余弦相似性，度量两个文本嵌入(如模型输出和参考文本)在含义上的一致性，有助于评估它们之间的语义相似性。这是通过测量它们在矢量空间的距离完成的。

详细信息：

BLEU (双语评估替补)分数常用于自然语言处理 (NLP) 和机器翻译。它广泛应用于文本汇总和文本生成用例。它评估生成的文本与参考文本的匹配程度。 BLEU 分数范围为 0 到 1，分数越高表示质量越好。

ROUGE（以召回为导向的要点评估替补）是用于评估自动汇总和机器翻译的一组指标。它度量生成的文本与参考摘要之间的重叠。 ROUGE 注重使用以召回为导向的措施，来评估生成的文本与参考文本之间的覆盖程度。 ROUGE 分数提供各种指标，包括: • ROUGE-1：生成文本和参考文本之间的单元语法(单字)重叠。 • ROUGE-2: 生成文本和参考文本之间的二元语法(两个连续字词)重叠。 • ROUGE-3: 生成文本和参考文本之间的三元语法(三个连续字词)重叠。 • ROUGE-4: 生成文本和参考文本之间的四元语法(四个连续字词)重叠。 • ROUGE-5: 生成文本和参考文本之间的五元语法(五个连续字词)重叠。 • ROUGE-L: 生成文本和参考文本之间的 L 元语法(L 个连续字词)重叠。文本汇总和文档比较是 ROUGE 的最佳用例之一，尤其是在文本连贯性和相关性至关重要的方案中。

余弦相似性度量两个文本嵌入(如模型输出和参考文本)在含义上的一致性，有助于评估它们之间的语义相似性。与其他基于模型的评估程序相同，需要提供用于评估的模型部署。

重要

此评估程序仅支持嵌入模型:

text-embedding-3-small
text-embedding-3-large
text-embedding-ada-002

自定义提示

使用模型将输出归为一组指定标签的集合。此评估程序使用需要定义的自定义提示。

通过

Azure OpenAI Evaluation (预览版)

评估支持

区域可用性

支持的部署类型

评估管道

测试数据

评估格式

生成响应(可选)

模型部署

测试条件

入门

测试条件详细信息

真实性

语义相似性

情绪

字符串检查

有效的 JSON 或 XML

匹配架构

条件匹配

文本质量

自定义提示

反馈

其他资源