
How to evaluate an application's intermediate steps

While it is often enough to evaluate only the final output of a task, in some cases you may want to evaluate the intermediate steps of your pipeline.

For example, with retrieval-augmented generation (RAG) you might want to:

  1. Evaluate the retrieval step to ensure that the correct documents are retrieved for the input query.
  2. Evaluate the generation step to ensure that the generated answer is consistent with the retrieved documents.

In this guide we will use a simple, fully custom evaluator for criterion 1 and an LLM-based evaluator for criterion 2, to highlight both scenarios.

To evaluate the intermediate steps of your pipeline, your evaluator function should traverse and process the run/rootRun argument, a Run object that contains your pipeline's intermediate steps.
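As a minimal sketch of what that traversal looks like, here the run tree is mocked with plain dicts; the real langsmith Run object exposes the same hierarchy through its child_runs, inputs, and outputs attributes, and the run names below are hypothetical:

```python
def find_child(run: dict, name: str):
    """Depth-first search for the first descendant run with the given name."""
    for child in run.get("child_runs", []):
        if child["name"] == name:
            return child
        found = find_child(child, name)
        if found is not None:
            return found
    return None

# A mocked run tree shaped like the trace this guide's pipeline produces.
root = {
    "name": "target",
    "child_runs": [
        {
            "name": "qa_pipeline",
            "child_runs": [
                {"name": "generate_wiki_search", "child_runs": []},
                {
                    "name": "retrieve",
                    "inputs": {"query": "LangChain"},
                    "outputs": {"output": []},
                    "child_runs": [],
                },
            ],
        }
    ],
}

retrieve_run = find_child(root, "retrieve")
```

An evaluator then reads `retrieve_run["inputs"]` and `retrieve_run["outputs"]` (attributes, on a real Run) to grade that step in isolation.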

1. Define your LLM pipeline

The RAG pipeline below consists of three steps: 1) generate a Wikipedia query from the input question, 2) retrieve relevant documents from Wikipedia, and 3) generate an answer based on the retrieved documents.

```shell
pip install -U langsmith langchain[openai] wikipedia
```

Requires langsmith>=0.3.13

```python
import wikipedia as wp
from openai import OpenAI
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def generate_wiki_search(question: str) -> str:
    """Generate the query to search in wikipedia."""
    instructions = (
        "Generate a search query to pass into wikipedia to answer the user's question. "
        "Return only the search query and nothing more. "
        "This will be passed in directly to the wikipedia search engine."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question},
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0,
    )
    return result.choices[0].message.content

@traceable(run_type="retriever")
def retrieve(query: str) -> list:
    """Get up to two search wikipedia results."""
    results = []
    for term in wp.search(query, results=10):
        try:
            page = wp.page(term, auto_suggest=False)
            results.append({
                "page_content": page.summary,
                "type": "Document",
                "metadata": {"url": page.url},
            })
        except wp.DisambiguationError:
            pass
        if len(results) >= 2:
            return results
    return results

@traceable
def generate_answer(question: str, context: str) -> str:
    """Answer the question based on the retrieved information."""
    instructions = f"Answer the user's question based ONLY on the content below:\n\n{context}"
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question},
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0,
    )
    return result.choices[0].message.content

@traceable
def qa_pipeline(question: str) -> str:
    """The full pipeline."""
    query = generate_wiki_search(question)
    context = "\n\n".join([doc["page_content"] for doc in retrieve(query)])
    return generate_answer(question, context)
```

This pipeline will produce a trace in LangSmith that records each of the three steps as a child run.

2. Create a dataset and examples to evaluate the pipeline

We are building a very simple dataset with a couple of examples to evaluate the pipeline.

Requires langsmith>=0.3.13

```python
from langsmith import Client

ls_client = Client()
dataset_name = "Wikipedia RAG"

if not ls_client.has_dataset(dataset_name=dataset_name):
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
    examples = [
        {"inputs": {"question": "What is LangChain?"}},
        {"inputs": {"question": "What is LangSmith?"}},
    ]
    ls_client.create_examples(
        dataset_id=dataset.id,
        examples=examples,
    )
```

3. Define your custom evaluators

As mentioned above, we will define two evaluators: one that checks the relevance of the retrieved documents with respect to the input query, and one that checks whether the generated answer hallucinates beyond the retrieved documents. We will use a LangChain LLM wrapper, together with with_structured_output, to define the hallucination evaluator.

The key point is that the evaluator function should traverse the run / rootRun argument to access the pipeline's intermediate steps. The evaluator can then process the inputs and outputs of those intermediate steps to grade them against the desired criteria.

The example uses langchain for convenience; it is not required.

```python
from langchain.chat_models import init_chat_model
from langsmith.schemas import Run
from pydantic import BaseModel, Field

def document_relevance(run: Run) -> bool:
    """Checks if retriever input exists in the retrieved docs."""
    qa_pipeline_run = next(
        r for r in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for r in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    page_contents = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )
    return retrieve_run.inputs["query"] in page_contents
```
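This check is just verbatim substring containment, which can be sanity-checked locally with hypothetical retrieved documents (no LangSmith run needed; the document contents below are made up):

```python
# Hypothetical retrieved documents, shaped like the retriever's output.
docs = [
    {"page_content": "LangChain is an open-source framework for building applications with LLMs."},
    {"page_content": "LangSmith is a companion platform for tracing and evaluating LLM apps."},
]
page_contents = "\n\n".join(doc["page_content"] for doc in docs)

def query_is_relevant(query: str) -> bool:
    """Same crude metric as document_relevance: the query appears verbatim."""
    return query in page_contents
```

Verbatim containment is a deliberately simple proxy for relevance; an embedding- or LLM-based relevance grader is a common upgrade once this scaffolding is in place.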


# Data model
class GradeHallucinations(BaseModel):
"""Binary score for hallucination present in generation answer."""

is_grounded: bool = Field(..., description="True if the answer is grounded in the facts, False otherwise.")

# LLM with structured outputs for grading hallucinations
# For more see: https://python.langchain.com/docs/how_to/structured_output/
grader_llm= init_chat_model("gpt-4o-mini", temperature=0).with_structured_output(
GradeHallucinations,
method="json_schema",
strict=True,
)
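Because the grader's structured output is just a pydantic model, its shape can be sanity-checked locally without any LLM call. This is a hypothetical round-trip of the payload the model is asked to produce, not part of the pipeline itself:

```python
from pydantic import BaseModel, Field

class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in the generated answer."""

    is_grounded: bool = Field(..., description="True if the answer is grounded in the facts, False otherwise.")

# with_structured_output returns an instance of this model;
# simulate that here with a hand-written payload.
grade = GradeHallucinations.model_validate({"is_grounded": True})
```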

```python
def no_hallucination(run: Run) -> bool:
    """Check if the answer is grounded in the documents.

    Return True if there is no hallucination, False otherwise.
    """
    # Get documents and answer
    qa_pipeline_run = next(
        r for r in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for r in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    retrieved_content = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )

    # Construct prompt
    instructions = (
        "You are a grader assessing whether an LLM generation is grounded in / "
        "supported by a set of retrieved facts. Give a binary score 1 or 0, "
        "where 1 means that the answer is grounded in / supported by the set of facts."
    )

    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Set of facts:\n{retrieved_content}\n\nLLM generation: {run.outputs['answer']}"},
    ]

    grade = grader_llm.invoke(messages)
    return grade.is_grounded
```

4. Evaluate the pipeline

Finally, we'll run evaluate with the custom evaluators defined above.

```python
def qa_wrapper(inputs: dict) -> dict:
    """Wrap the qa_pipeline so it can accept the Example.inputs dict as input."""
    return {"answer": qa_pipeline(inputs["question"])}

experiment_results = ls_client.evaluate(
    qa_wrapper,
    data=dataset_name,
    evaluators=[document_relevance, no_hallucination],
    experiment_prefix="rag-wiki-oai",
)
```

The experiment will contain the results of the evaluation, including the scores and comments from the evaluators.

