Azure OpenAI Değerlendirmesi (Önizleme)

Makale
01/29/2025

Büyük dil modellerinin değerlendirilmesi, çeşitli görev ve boyutlardaki performanslarını ölçmede kritik bir adımdır. Bu, özellikle eğitimden elde edilen performans artışlarını (veya kayıplarını) değerlendirmenin önemli olduğu hassas ayarlanmış modeller için önemlidir. Kapsamlı değerlendirmeler, modelin farklı sürümlerinin uygulamanızı veya senaryonuzu nasıl etkileyebileceklerini anlamanıza yardımcı olabilir.

Azure OpenAI değerlendirmesi, geliştiricilerin beklenen giriş/çıkış çiftlerine karşı test etmek için değerlendirme çalıştırmaları oluşturmasına olanak tanır ve modelin doğruluk, güvenilirlik ve genel performans gibi temel ölçümlerdeki performansını değerlendirir.

Değerlendirme desteği

Bölgesel kullanılabilirlik

Doğu ABD 2
Orta Kuzey ABD
Orta İsveç
Batı İsviçre

Desteklenen dağıtım türleri

Standart
Genel standart
Veri bölgesi standardı
Sağlanan yönetilen
Genel olarak sağlanan yönetilen
Veri bölgesi sağlanmış-yönetilen

Değerlendirme işlem hattı

Test verileri

Test etmek istediğiniz bir temel gerçeklik veri kümesini derlemeniz gerekir. Veri kümesi oluşturma genellikle değerlendirmelerinizin zaman içindeki senaryolarınızla ilgili kalmasını sağlayan yinelemeli bir işlemdir. Bu temel gerçeklik veri kümesi genellikle el yapımıdır ve modelinizden beklenen davranışı temsil eder. Veri kümesi de etiketlenir ve beklenen yanıtları içerir.

Not

Yaklaşım ve geçerli JSON veya XML gibi bazı değerlendirme testleri temel gerçek verileri gerektirmez.

Veri kaynağınızın JSONL biçiminde olması gerekir. Aşağıda JSONL değerlendirme veri kümelerine iki örnek verilmiştir:

{"question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.", "subject": "abstract_algebra", "A": "0", "B": "4", "C": "2", "D": "6", "answer": "B", "completion": "B"}
{"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.", "subject": "abstract_algebra", "A": "8", "B": "2", "C": "24", "D": "120", "answer": "C", "completion": "C"}
{"question": "Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5", "subject": "abstract_algebra", "A": "0", "B": "1", "C": "0,1", "D": "0,4", "answer": "D", "completion": "D"}

Değerlendirme dosyanızı karşıya yükleyip seçtiğinizde ilk üç satırın önizlemesi döndürülür:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}

Değerlendirme dosyasını karşıya yükleyip seçtiğinizde ilk üç satırın önizlemesi döndürülür:

Önceden yüklenmiş mevcut veri kümelerini seçebilir veya yeni bir veri kümesi yükleyebilirsiniz.

Yanıt oluşturma (isteğe bağlı)

Değerlendirmenizde kullandığınız istem, üretimde kullanmayı planladığınız istemle eşleşmelidir. Bu istemler, modelin izlemesi için yönergeleri sağlar. Oyun alanı deneyimlerine benzer şekilde, isteminize birkaç çekim örneği eklemek için birden çok giriş oluşturabilirsiniz. Daha fazla bilgi için, istem tasarımı ve istem mühendisliğindeki bazı gelişmiş teknikler hakkında ayrıntılı bilgi için bkz. istem mühendisliği teknikleri.

column_name giriş dosyanızdaki sütunların adlarına karşılık geldiği biçimi kullanarak {{input.column_name}} istemler içinde giriş verilerinize başvurabilirsiniz.

Değerlendirme sırasında oluşturulan çıkışlara, biçimi kullanılarak {{sample.output_text}} sonraki adımlarda başvurulacaktır.

Not

Verilerinize doğru başvuru yaptığınızdan emin olmak için çift küme ayracı kullanmanız gerekir.

Model dağıtımı

Değerlendirme oluşturmanın bir parçası olarak, yanıt oluştururken hangi modellerin kullanılacağını (isteğe bağlı) ve belirli test ölçütlerine sahip modelleri puanlarken hangi modellerin kullanılacağını seçersiniz.

Azure OpenAI'de değerlendirmelerinizin bir parçası olarak kullanmak üzere belirli model dağıtımları atayacaksınız. Tek değerlendirme çalıştırmasında birden çok model dağıtımlarını karşılaştırabilirsiniz.

Temel veya ince ayarlı model dağıtımlarını değerlendirebilirsiniz. Listenizde kullanılabilen dağıtımlar, Azure OpenAI kaynağınızda oluşturduğunuz dağıtımlara bağlıdır. İstediğiniz dağıtımı bulamazsanız Azure OpenAI Değerlendirme sayfasından yeni bir dağıtım oluşturabilirsiniz.

Test ölçütleri

Test ölçütleri, hedef model tarafından oluşturulan her çıkışın verimliliğini değerlendirmek için kullanılır. Bu testler, tutarlılık sağlamak için giriş verilerini çıkış verileriyle karşılaştırır. Çıkışın kalitesini ve ilgi düzeyini farklı düzeylerde test etmek ve ölçmek için farklı ölçütler yapılandırma esnekliğine sahipsiniz.

Başlarken

Azure AI Foundry portalında Azure OpenAI Değerlendirmesi (ÖNİzLEME) seçeneğini belirleyin. Bu görünümü bir seçenek olarak görmek için önce desteklenen bir bölgede var olan bir Azure OpenAI kaynağını seçmeniz gerekebilir.
Yeni değerlendirme'yi seçin
Değerlendirmenizin adını girin. Varsayılan olarak, siz düzenleyip değiştirmediğiniz sürece rastgele bir ad otomatik olarak oluşturulur. Yeni veri kümesini karşıya yükle'yi seçin.

Biçiminde olacak .jsonl değerlendirmenizi seçin. Örnek bir test dosyasına ihtiyacınız varsa, bu 10 satırı adlı eval-test.jsonlbir dosyaya kaydedebilirsiniz:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}      
{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}      
{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}      
{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}      
{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}      
{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}      
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}      
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}

Dosyanın ilk üç satırını önizleme olarak görürsünüz:

Yanıtlar'ın altında Oluştur düğmesini seçin. Şablonla oluştur açılan listesinden seçin{{item.input}}. Bu işlem, değerlendirme dosyamızdaki giriş alanlarını değerlendirme veri kümemizle karşılaştırmak istediğimiz yeni bir model çalıştırması için tek tek istemlere ekler. Model bu girişi alır ve bu durumda adlı {{sample.output_text}}bir değişkende depolanacak olan kendi benzersiz çıkışlarını oluşturur. Daha sonra bu örnek çıkış metnini test ölçütlerimizin bir parçası olarak kullanacağız. Alternatif olarak, kendi özel sistem iletinizi ve tek tek ileti örneklerini el ile sağlayabilirsiniz.
Değerlendirmenize göre yanıt oluşturmak istediğiniz modeli seçin. Modeliniz yoksa bir model oluşturabilirsiniz. Bu örnekte standart gpt-4o-minidağıtımı kullanılır.

Ayarlar/dişli simgesi, modele geçirilen temel parametreleri denetler. Şu anda yalnızca aşağıdaki parametreler desteklenir:

Sıcaklık
Maksimum uzunluk
Üst P

Seçtiğiniz modelden bağımsız olarak maksimum uzunluk şu anda 2048'de eşlenir.

Test ölçütü ekle'yi seçin ve Ekle'yi seçin.
Ekle ile altındaki Eklemeyi karşılaştır'ın altında AnlamSal Benzerlik'i {{sample.output_text}}>seçin.{{item.output}} Bu işlem, değerlendirme .jsonl dosyanızdan özgün başvuru çıkışını alır ve modelinize {{item.input}}göre istemler vererek oluşturulacak çıkışla karşılaştırır.
Bu noktada Ekle'yi> seçerek ek test ölçütleri ekleyebilir veya oluştur'u seçerek değerlendirme işi çalıştırmasını başlatabilirsiniz.
Oluştur'u seçtiğinizde değerlendirme işinizin durum sayfasına yönlendirilirsiniz.
Değerlendirme işiniz oluşturulduktan sonra, işin tüm ayrıntılarını görüntülemek için işi seçebilirsiniz:
Anlamsal benzerlik için Görünüm çıkış ayrıntıları , geçiş testlerinizi kopyalayabileceğiniz/yapıştırabileceğiniz bir JSON gösterimi içerir.

Ölçüt ayrıntılarını test etme

Azure OpenAI Değerlendirmesi birden çok test ölçütü seçeneği sunar. Aşağıdaki bölümde her seçenekle ilgili ek ayrıntılar sağlanır.

Olgusallık

Gönderilen yanıtı uzman yanıtıyla karşılaştırarak gerçek doğruluğunu değerlendirir.

Olgusallık, gönderilen yanıtı uzman yanıtıyla karşılaştırarak gerçek doğruluğunu değerlendirir. Not veren, ayrıntılı bir düşünce zinciri (CoT) isteminden yararlanarak gönderilen yanıtın bir alt kümesiyle, üst kümesiyle mi yoksa uzman yanıtıyla mı çakışdığını belirler. Stil, dil bilgisi veya noktalama işaretleri arasındaki farkları göz ardı eder ve yalnızca olgusal içeriğe odaklanır. Olgusallık, içerik doğrulama ve yapay zeka tarafından sağlanan yanıtların doğruluğunu sağlayan eğitim araçları dahil ancak bunlarla sınırlı olmamak üzere birçok senaryoda yararlı olabilir.

İstemin yanındaki açılan listeyi seçerek bu test ölçütlerinin bir parçası olarak kullanılan istem metnini görüntüleyebilirsiniz. Geçerli istem metni:

Prompt
You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Anlamsal benzerlik

Modelin yanıtı ile başvuru arasındaki benzerlik derecesini ölçer. Grades: 1 (completely different) - 5 (very similar).

Duyarlılık

Çıkışın duygusal tonunu belirlemeye çalışır.

İstemin yanındaki açılan listeyi seçerek bu test ölçütlerinin bir parçası olarak kullanılan istem metnini görüntüleyebilirsiniz. Geçerli istem metni:

Prompt
You will be presented with a text generated by a large language model. Your job is to rate the sentiment of the text. Your options are:

A) Positive
B) Neutral
C) Negative
D) Unsure

[BEGIN TEXT]
***
[{text}]
***
[END TEXT]

First, write out in a step by step manner your reasoning about the answer to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line

Dize denetimi

Çıkışın beklenen dizeyle tam olarak eşleşip eşleşmediğini doğrular.

Dize denetimi, çeşitli değerlendirme ölçütlerine olanak sağlayan iki dize değişkeni üzerinde çeşitli ikili işlemler gerçekleştirir. Eşitlik, kapsama ve belirli desenler dahil olmak üzere çeşitli dize ilişkilerini doğrulamaya yardımcı olur. Bu değerlendirici büyük/küçük harfe duyarlı veya büyük/küçük harfe duyarlı olmayan karşılaştırmalara olanak tanır. Ayrıca, doğru veya yanlış sonuçlar için belirtilen notları sağlayarak karşılaştırma sonucuna göre özelleştirilmiş değerlendirme sonuçları sağlar. Desteklenen işlemlerin türü aşağıdadır:

equals: Çıkış dizesinin değerlendirme dizesine tam olarak eşit olup olmadığını denetler.
contains: Değerlendirme dizesinin çıkış dizesinin alt dizesi olup olmadığını denetler.
starts-with: Çıkış dizesinin değerlendirme dizesiyle başlayıp başlamadiğini denetler.
ends-with: Çıkış dizesinin değerlendirme dizesiyle bitip bitmediğini denetler.

Not

Test ölçütlerinizde belirli parametreleri ayarlarken değişkenleşablon arasında seçim yapma seçeneğiniz vardır. Giriş verilerinizdeki bir sütuna başvurmak istiyorsanız değişkeni seçin. Sabit bir dize sağlamak istiyorsanız şablonu seçin.

Geçerli JSON veya XML

Çıktının geçerli JSON veya XML olup olmadığını doğrular.

Şemayla eşleşir

Çıkışın belirtilen yapıyı izlediğinden emin olur.

Ölçüt eşleşmesi

Modelin yanıtının ölçütlerinizle eşleşip eşleşmediğini değerlendirin. Not: Başarılı veya Başarısız.

İstemin yanındaki açılan listeyi seçerek bu test ölçütlerinin bir parçası olarak kullanılan istem metnini görüntüleyebilirsiniz. Geçerli istem metni:

Prompt
Your job is to assess the final response of an assistant based on conversation history and provided criteria for what makes a good response from the assistant. Here is the data:

[BEGIN DATA]
***
[Conversation]: {conversation}
***
[Response]: {response}
***
[Criteria]: {criteria}
***
[END DATA]

Does the response meet the criteria? First, write out in a step by step manner your reasoning about the criteria to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. "Y" for yes if the response meets the criteria, and "N" for no if it does not. At the end, repeat just the letter again by itself on a new line.
Reasoning:

Metin kalitesi

Başvuru metniyle karşılaştırarak metnin kalitesini değerlendirin.

Özet:

BLEU Puanı: Oluşturulan metnin kalitesini, BLEU puanını kullanarak bir veya daha fazla yüksek kaliteli başvuru çevirisi ile karşılaştırarak değerlendirir.
ROUGE Puanı: ROUGE puanlarını kullanarak oluşturulan metnin kalitesini başvuru özetleriyle karşılaştırarak değerlendirir.
Kosinüs: Kosinüs benzerliği, model çıkışları ve başvuru metinleri gibi iki metin ekleme işleminin anlam açısından ne kadar yakın olduğunu ölçer ve aralarındaki anlamsal benzerliğin değerlendirilmesine yardımcı olur. Bu, vektör uzayı içindeki mesafeleri ölçülerek yapılır.

Ayrıntılar:

BLEU (BiLingual Evaluation Understudy) puanı genellikle doğal dil işleme (NLP) ve makine çevirisinde kullanılır. Metin özetleme ve metin oluşturma kullanım örneklerinde yaygın olarak kullanılır. Oluşturulan metnin başvuru metniyle ne kadar yakın eşleştiğini değerlendirir. BLEU puanı 0 ile 1 arasında değişir ve daha yüksek puanlar daha iyi kaliteyi gösterir.

ROUGE (Gisting Değerlendirmesi için Geri Çağırma Odaklı Yedek), otomatik özetleme ve makine çevirisini değerlendirmek için kullanılan bir ölçüm kümesidir. Oluşturulan metin ve başvuru özetleri arasındaki çakışmayı ölçer. ROUGE, oluşturulan metnin başvuru metnini ne kadar iyi kapsadığını değerlendirmek için geri çağırma odaklı ölçülere odaklanır. ROUGE puanı, aşağıdakiler gibi çeşitli ölçümler sağlar: • ROUGE-1: Oluşturulan ve başvuru metni arasındaki tek birimlerin (tek sözcük) çakışması. • ROUGE-2: Oluşturulan ve başvuru metni arasındaki büyük harfler (iki ardışık sözcük) çakışması. • ROUGE-3: Oluşturulan ve başvuru metni arasındaki trigramların (ardışık üç sözcük) çakışması. • ROUGE-4: Oluşturulan ve başvuru metni arasında dört gramlık (ardışık dört sözcük) çakışması. • ROUGE-5: Oluşturulan ve başvuru metni arasındaki beş gramlık (beş ardışık sözcük) çakışması. • ROUGE-L: Oluşturulan ve başvuru metni arasında L-gram (L ardışık sözcükler) çakışması. Metin özetlemesi ve belge karşılaştırması, özellikle metin tutarlılığının ve ilgi düzeyinin kritik olduğu senaryolarda ROUGE için en uygun kullanım örnekleri arasındadır.

Kosinüs benzerliği, model çıkışları ve başvuru metinleri gibi iki metin ekleme işleminin anlam açısından ne kadar yakın olduğunu ölçer ve aralarındaki anlamsal benzerliğin değerlendirilmesine yardımcı olur. Diğer model tabanlı değerlendiricilerde olduğu gibi, değerlendirme için kullanarak bir model dağıtımı sağlamanız gerekir.

Önemli

Bu değerlendirici için yalnızca ekleme modelleri desteklenir:

text-embedding-3-small
text-embedding-3-large
text-embedding-ada-002

Özel istem

Çıktıyı belirtilen etiketler kümesinde sınıflandırmak için modeli kullanır. Bu değerlendirici tanımlamanız gereken özel bir istem kullanır.

Aracılığıyla paylaş

Azure OpenAI Değerlendirmesi (Önizleme)

Değerlendirme desteği

Bölgesel kullanılabilirlik

Desteklenen dağıtım türleri

Değerlendirme işlem hattı

Test verileri

Değerlendirme biçimi

Yanıt oluşturma (isteğe bağlı)

Model dağıtımı

Test ölçütleri

Başlarken

Ölçüt ayrıntılarını test etme

Olgusallık

Anlamsal benzerlik

Duyarlılık

Dize denetimi

Geçerli JSON veya XML

Şemayla eşleşir

Ölçüt eşleşmesi

Metin kalitesi

Özel istem

Geri Bildirim

Ek kaynaklar