Confident

DeepEval is a package for unit testing LLMs. With Confident, anyone can build robust language models through faster iterations, using both unit testing and integration testing. It supports each step of the iteration, from synthetic data creation to testing.

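In this guide, we demonstrate how to test and measure the performance of large language models (LLMs). We show how to use our callback to measure performance, and how to define your own metrics and log them to our dashboard.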

DeepEval also offers:

Synthetic data generation
Performance measurement
A dashboard to monitor and review results over time

Installation and Setup

%pip install --upgrade --quiet  langchain langchain-openai langchain-community deepeval langchain-chroma

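Getting API Credentials

To get the DeepEval API credentials, follow these steps:

Go to https://app.confident-ai.com
Click on "Organization"
Copy the API key.

When you log in, you will also be asked to set the implementation name. The implementation name is required and describes the type of implementation. (Think of it as the name you want to give your project. We recommend making it descriptive.)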

!deepeval login

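Setup DeepEval

By default, you can use the DeepEvalCallbackHandler to set up the metrics you want to track. However, metric support is currently limited (more will be added soon). Supported metrics include answer relevancy, configured as follows: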

from deepeval.metrics.answer_relevancy import AnswerRelevancy

# Here we want to make sure the answer is minimally relevant
answer_relevancy_metric = AnswerRelevancy(minimum_score=0.5)
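The `minimum_score` argument acts as a pass/fail threshold. As a minimal sketch of that idea, assuming scores lie in [0, 1] and a metric passes when its score is at least the threshold (the `passes_threshold` helper is hypothetical and not part of DeepEval):

```python
def passes_threshold(score: float, minimum_score: float = 0.5) -> bool:
    """Return True when a score meets the minimum (hypothetical helper)."""
    # Assumption: the comparison is inclusive, so a score equal to the
    # threshold counts as a pass.
    return score >= minimum_score


print(passes_threshold(0.72))
print(passes_threshold(0.31))
```

Raising `minimum_score` makes the check stricter; lowering it lets weaker answers through.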

Get Started

To use the DeepEvalCallbackHandler, we need the implementation_name.

from langchain_community.callbacks.confident_callback import DeepEvalCallbackHandler

deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart", metrics=[answer_relevancy_metric]
)

API Reference: DeepEvalCallbackHandler

Scenario 1: Feeding into LLM

You can then feed it into your LLM with OpenAI.

from langchain_openai import OpenAI

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",
)
output = llm.generate(
    [
        "What is the best evaluation tool out there? (no bias at all)",
    ]
)

API Reference: OpenAI

LLMResult(generations=[[Generation(text='\n\nQ: What did the fish say when he hit the wall? \nA: Dam.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nThe Moon \n\nThe moon is high in the midnight sky,\nSparkling like a star above.\nThe night so peaceful, so serene,\nFilling up the air with love.\n\nEver changing and renewing,\nA never-ending light of grace.\nThe moon remains a constant view,\nA reminder of life’s gentle pace.\n\nThrough time and space it guides us on,\nA never-fading beacon of hope.\nThe moon shines down on us all,\nAs it continues to rise and elope.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ. What did one magnet say to the other magnet?\nA. "I find you very attractive!"', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nThe world is charged with the grandeur of God.\nIt will flame out, like shining from shook foil;\nIt gathers to a greatness, like the ooze of oil\nCrushed. Why do men then now not reck his rod?\n\nGenerations have trod, have trod, have trod;\nAnd all is seared with trade; bleared, smeared with toil;\nAnd wears man's smudge and shares man's smell: the soil\nIs bare now, nor can foot feel, being shod.\n\nAnd for all this, nature is never spent;\nThere lives the dearest freshness deep down things;\nAnd though the last lights off the black West went\nOh, morning, at the brown brink eastward, springs —\n\nBecause the Holy Ghost over the bent\nWorld broods with warm breast and with ah! bright wings.\n\n~Gerard Manley Hopkins", generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nQ: What did one ocean say to the other ocean?\nA: Nothing, they just waved.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text="\n\nA poem for you\n\nOn a field of green\n\nThe sky so blue\n\nA gentle breeze, the sun above\n\nA beautiful world, for us to love\n\nLife is a journey, full of surprise\n\nFull of joy and full of surprise\n\nBe brave and take small steps\n\nThe future will be revealed with depth\n\nIn the morning, when dawn arrives\n\nA fresh start, no reason to hide\n\nSomewhere down the road, there's a heart that beats\n\nBelieve in yourself, you'll always succeed.", generation_info={'finish_reason': 'stop', 'logprobs': None})]], llm_output={'token_usage': {'completion_tokens': 504, 'total_tokens': 528, 'prompt_tokens': 24}, 'model_name': 'text-davinci-003'})

You can then check whether the metric was successful by calling the is_successful() method.

answer_relevancy_metric.is_successful()
# returns True/False

Once you have run that, you should be able to see our dashboard below.

Dashboard

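Scenario 2: Tracking an LLM in a chain without callbacks

To track an LLM in a chain without callbacks, you can plug into it at the end.

We can start by defining a simple chain as shown below.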

import requests
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

text_file_url = "https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt"

openai_api_key = "sk-XXX"

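# Download first and fail early, so a failed request does not leave an empty file
response = requests.get(text_file_url, timeout=30)
response.raise_for_status()
with open("state_of_the_union.txt", "w", encoding="utf-8") as f:
    f.write(response.text)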

loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Chroma.from_documents(texts, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

# Providing a new question-answering pipeline
query = "Who is the president?"
result = qa.run(query)

API Reference: RetrievalQA | TextLoader | OpenAI | OpenAIEmbeddings | CharacterTextSplitter

After defining a chain, you can then manually check for answer similarity.

answer_relevancy_metric.measure(result, query)
answer_relevancy_metric.is_successful()
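The same measure-then-check pattern extends to several query/answer pairs. The sketch below is illustrative only: `KeywordOverlapMetric` is a hypothetical stand-in that mimics the `measure(output, query)` and `is_successful()` calls shown above so the example runs offline, and `evaluate_pairs` is a hypothetical helper, not a DeepEval function.

```python
from typing import Iterable, List, Tuple


class KeywordOverlapMetric:
    """Hypothetical stand-in metric; not part of DeepEval."""

    def __init__(self, minimum_score: float = 0.5) -> None:
        self.minimum_score = minimum_score
        self.score = 0.0

    @staticmethod
    def _words(text: str) -> set:
        # Lowercase words with surrounding punctuation stripped.
        words = {w.strip("?.,!").lower() for w in text.split()}
        words.discard("")
        return words

    def measure(self, output: str, query: str) -> float:
        # Score = fraction of query words that also appear in the output.
        query_words = self._words(query)
        if not query_words:
            self.score = 0.0
        else:
            self.score = len(query_words & self._words(output)) / len(query_words)
        return self.score

    def is_successful(self) -> bool:
        return self.score >= self.minimum_score


def evaluate_pairs(metric, pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Measure each (query, output) pair and record whether it passed."""
    results = []
    for query, output in pairs:
        metric.measure(output, query)
        results.append((query, metric.is_successful()))
    return results


pairs = [
    ("Who is the president?", "The president is the head of state."),
    ("Who is the president?", "I like turtles."),
]
for query, ok in evaluate_pairs(KeywordOverlapMetric(), pairs):
    print(query, "->", "pass" if ok else "fail")
```

In a real run, you would pass `answer_relevancy_metric` and the outputs of `qa.run(...)` in place of the stand-in metric and the hard-coded strings.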

What's next?

You can also create your own custom metrics.
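As a rough sketch of what a custom metric could look like, the class below mirrors the `measure()` / `is_successful()` calls used in this guide. `LengthMetric` and its `minimum_length` parameter are hypothetical; the class does not subclass any DeepEval base class, since that interface is not shown here, so consult the DeepEval documentation for the actual extension point.

```python
class LengthMetric:
    """Hypothetical custom metric: passes when the output is long enough."""

    def __init__(self, minimum_length: int = 10) -> None:
        self.minimum_length = minimum_length
        self.success = False

    def measure(self, output: str, query: str = "") -> int:
        # The score is simply the number of characters in the stripped output.
        length = len(output.strip())
        self.success = length >= self.minimum_length
        return length

    def is_successful(self) -> bool:
        return self.success


metric = LengthMetric(minimum_length=10)
metric.measure("The president addressed Congress.")
print(metric.is_successful())
```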

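DeepEval also offers other features, such as automatically creating unit tests and testing for hallucination.

If you are interested, check out our GitHub repository at https://github.com/confident-ai/deepeval. We welcome any PRs and discussions on how to improve LLM performance.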
