
How to define an LLM-as-a-judge evaluator

Evaluating LLM applications can be challenging because they often generate conversational text for which there is no single correct answer.

This guide shows you how to define an LLM-as-a-judge evaluator for offline evaluation using the LangSmith SDK or UI. Note: to run evaluations in real time over production traces, see how to set up online evaluations instead.

Pre-built evaluators

Pre-built evaluators are a useful starting point for setting up evaluations. See Pre-built evaluators for how to use them with LangSmith.

Create your own LLM-as-a-judge evaluator

For complete control over evaluator logic, create your own LLM-as-a-judge evaluator and run it with the LangSmith SDK (Python / TypeScript).

Requires langsmith>=0.2.0

from langsmith import evaluate, traceable, wrappers, Client
from openai import OpenAI
# Assumes you've installed pydantic
from pydantic import BaseModel

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """\
Given the following question, answer, and reasoning, determine if the reasoning \
for the answer is logically valid and consistent with the question and the answer.\
"""

    # Structured output schema so the judge returns a strict boolean verdict.
    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = oai_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": msg},
        ],
        response_format=Response,
    )
    return response.choices[0].message.parsed.reasoning_is_valid

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

ls_client = Client()
dataset = ls_client.create_dataset("big questions")
examples = [
    {"inputs": {"question": "how will the universe end"}},
    {"inputs": {"question": "are we alone"}},
]
ls_client.create_examples(dataset_id=dataset.id, examples=examples)

results = evaluate(
    dummy_app,
    data=dataset,
    evaluators=[valid_reasoning],
)

For more information on writing custom evaluators, see here.
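A custom evaluator does not have to call a model: any function that accepts the example's inputs and the app's outputs (and, optionally, the dataset's reference_outputs) and returns a score can be passed in the evaluators list. As a minimal deterministic sketch (the exact_answer name and the "answer" field keys are illustrative, not part of the SDK):

```python
def exact_answer(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Deterministic evaluator: pass only if the app's answer matches the
    dataset's reference answer, ignoring surrounding whitespace and case."""
    return outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()
```

Such a function could be passed to evaluate alongside (or instead of) valid_reasoning, which is useful for datasets that do have a single correct answer.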

