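Build an Extraction Chain

In this tutorial, we will use the tool-calling features of chat models to extract structured information from unstructured text. We will also demonstrate how to use few-shot prompting in this context to improve performance.

Important

This tutorial requires langchain-core>=0.3.20 and only works with models that support tool calling.

Setup

Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a Jupyter notebook. Working through guides in an interactive environment is a great way to understand them better. See here for installation instructions.

Installation

To install LangChain, run: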

Pip:

pip install --upgrade langchain-core

Conda:

conda install langchain-core -c conda-forge

For more details, see our installation guide.

LangSmith

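Many of the applications you build with LangChain contain multiple steps with multiple LLM calls. As these applications grow more complex, it becomes crucial to be able to inspect exactly what is going on inside a chain or agent. The best way to do this is with LangSmith.

After you sign up at the link above, make sure to set your environment variables to start logging traces: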

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

Or, if you are in a notebook, you can set them with:

import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

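The Schema

First, we need to describe what information we want to extract from the text.

We will use Pydantic to define an example schema for extracting personal information.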

from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

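There are two best practices when defining a schema:

Document the attributes and the schema itself: this information is sent to the LLM and is used to improve the quality of information extraction.
Do not force the LLM to make up information! Above we declared the attributes as Optional, which allows the LLM to output None when it does not know the answer.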
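
To make these two practices concrete, the sketch below prints the JSON Schema that Pydantic derives from a trimmed-down copy of Person. It assumes Pydantic v2 (`model_json_schema`). How this schema is finally packaged for a given chat model provider is up to LangChain and the provider, so treat the output as an approximation of what the LLM is shown rather than the exact payload.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )


schema = Person.model_json_schema()

# The class docstring becomes the schema-level description.
print(schema["description"])
# Each field carries its own description.
print(schema["properties"]["name"]["description"])
# No field is required, so any of them can be left out (None).
print(schema.get("required", []))
```

Because every field has a default of None, none of them shows up as required, which is what lets the model decline to fill a field in.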

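Important

For best performance, document the schema well and make sure the model is not forced to return results when the text contains no information to extract.

The Extractor

Let's create an information extractor using the schema we defined above.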

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        # Please see the how-to about improving performance with
        # reference examples.
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)

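API Reference: ChatPromptTemplate | MessagesPlaceholder

We need to use a model that supports function/tool calling.

Please review the documentation for all models that can be used with this API.

Select chat model: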

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

structured_llm = llm.with_structured_output(schema=Person)

Let's test it out:

text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Person(name='Alan Smith', hair_color='blond', height_in_meters='1.83')

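Important

Extraction is generative 🤯

LLMs are generative models, so they can do some pretty cool things, like correctly extracting a person's height in meters even though it was given in feet!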
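
The conversion the model performed implicitly can be checked by hand. The short sketch below is only a sanity check on the arithmetic, not part of the extraction chain; the variable names are illustrative. Since the model does this conversion on its own and can get it wrong, a check like this may be worthwhile for values that matter.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

height_in_feet = 6
height_in_meters = round(height_in_feet * FEET_TO_METERS, 2)
print(height_in_meters)  # 1.83

# The schema types height_in_meters as a string, so the extracted value
# ('1.83' in the output above) needs converting before any comparison.
extracted = "1.83"
print(float(extracted) == height_in_meters)  # True
```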

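We can see the LangSmith trace here. Note that the chat model portion of the trace reveals the exact sequence of messages sent to the model, the tools invoked, and other metadata.

Multiple Entities

In most cases, you should extract a list of entities rather than a single entity.

This is easy to achieve with Pydantic by nesting one model inside another.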

from typing import List, Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ Doc-string for the entity Person.
    # This doc-string is sent to the LLM as the description of the schema Person,
    # and it can help to improve extraction results.

    # Note that:
    # 1. Each field is an `optional` -- this allows the model to decline to extract it!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )


class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

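Important

Extraction results may not be perfect here. Read on to see how to use reference examples to improve extraction quality, and check out our extraction how-to guides for more detail.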

structured_llm = llm.with_structured_output(schema=Data)
text = "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
structured_llm.invoke(prompt)

Data(people=[Person(name='Jeff', hair_color='black', height_in_meters='1.83'), Person(name='Anna', hair_color='black', height_in_meters=None)])

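Tip

When the schema accommodates the extraction of multiple entities, it also allows the model to extract no entities at all, by returning an empty list, if the text contains no relevant information.

This is usually a good thing! It allows you to specify required attributes on an entity without forcing the model to detect that entity.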
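
A minimal sketch of that point, using plain Pydantic validation with no model call. StrictPerson and StrictData are hypothetical variants introduced here for illustration (the tutorial's own Person keeps every field optional); the only change is that name is required.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class StrictPerson(BaseModel):
    """Information about a person (hypothetical variant with a required name)."""

    name: str = Field(description="The name of the person")


class StrictData(BaseModel):
    """Extracted data about people."""

    people: List[StrictPerson]


# An empty list is valid even though `name` is required on each person.
empty = StrictData(people=[])
print(empty)

# A person without a name is rejected, so incomplete entities cannot slip through.
try:
    StrictData(people=[{}])
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The required attribute constrains each entity that is reported, while the list wrapper leaves the model free to report none.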

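We can see the LangSmith trace here.

Reference examples

The behavior of LLM applications can be steered with few-shot prompting. For chat models, this can take the form of a sequence of input and response message pairs that demonstrate the desired behavior.

For example, we can convey the meaning of a symbol with alternating user and assistant messages: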

messages = [
    {"role": "user", "content": "2 🦜 2"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "2 🦜 3"},
    {"role": "assistant", "content": "5"},
    {"role": "user", "content": "3 🦜 4"},
]

response = llm.invoke(messages)
print(response.content)
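
The two answered pairs above are consistent with 🦜 standing for addition. The sketch below checks that hypothesis without calling a model; apply_parrot is a hypothetical helper, and a real model is only expected, not guaranteed, to follow the same pattern.

```python
import operator

# The (input, output) pairs taken from the few-shot messages.
few_shot_pairs = [("2 🦜 2", "4"), ("2 🦜 3", "5")]


def apply_parrot(expression: str) -> int:
    """Evaluate 'a 🦜 b' under the assumption that 🦜 means addition."""
    left, right = expression.split("🦜")
    return operator.add(int(left), int(right))


# Every example agrees with the addition hypothesis.
consistent = all(apply_parrot(q) == int(a) for q, a in few_shot_pairs)
print(consistent)

# Under that hypothesis, the final user message would be answered with 7.
print(apply_parrot("3 🦜 4"))
```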

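Structured output often uses tool calling under the hood. This typically involves generating AI messages that contain tool calls, as well as tool messages that contain the results of those calls. What should a sequence of messages look like in this case?

Different chat model providers impose different requirements on valid message sequences. Some will accept a (repeating) message sequence of the form:

User message
AI message with tool call
Tool message with result

Others require a final AI message containing some sort of response.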
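
The two shapes can be sketched with plain dictionaries. This is an illustration of the ordering only: sketch_example_messages is a hypothetical helper, and the dictionary keys are simplified assumptions that do not match any particular provider's wire format.

```python
import uuid
from typing import Any, Dict, List, Optional


def sketch_example_messages(
    text: str,
    tool_name: str,
    args: Dict[str, Any],
    ai_response: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build one few-shot example as a list of plain message dicts."""
    call_id = str(uuid.uuid4())
    messages: List[Dict[str, Any]] = [
        {"role": "user", "content": text},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": args}],
        },
        # The tool message refers back to the ID of the call it answers.
        {
            "role": "tool",
            "tool_call_id": call_id,
            "content": "You have correctly called this tool.",
        },
    ]
    if ai_response is not None:
        # Some providers expect the sequence to end with a plain AI message.
        messages.append({"role": "assistant", "content": ai_response})
    return messages


example = sketch_example_messages(
    "The ocean is vast and blue.",
    "Data",
    {"people": []},
    ai_response="Detected no people.",
)
print([m["role"] for m in example])
```

With the final response included, each example expands to four messages; without it, three.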

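LangChain includes a utility function, tool_example_to_messages, that generates a valid sequence for most model providers. It simplifies the creation of structured few-shot examples by requiring only Pydantic representations of the corresponding tool calls.

Let's try this out. We can convert pairs of input strings and desired Pydantic objects into a sequence of messages that can be passed to a chat model. Under the hood, LangChain formats the tool calls into the format each provider requires.

Note: this version of tool_example_to_messages requires langchain-core>=0.3.20.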

from langchain_core.utils.function_calling import tool_example_to_messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]


messages = []

for txt, tool_call in examples:
    if tool_call.people:
        # This final message is optional for some providers
        ai_response = "Detected people."
    else:
        ai_response = "Detected no people."
    messages.extend(tool_example_to_messages(txt, [tool_call], ai_response=ai_response))

API Reference: tool_example_to_messages

Inspecting the result, we see that these two example pairs generated eight messages:

for message in messages:
    message.pretty_print()

================================ Human Message =================================

The ocean is vast and blue. It's more than 20,000 feet deep.
================================== Ai Message ==================================
Tool Calls:
  Data (d8f2e054-7fb9-417f-b28f-0447a775b2c3)
 Call ID: d8f2e054-7fb9-417f-b28f-0447a775b2c3
  Args:
    people: []
================================= Tool Message =================================

You have correctly called this tool.
================================== Ai Message ==================================

Detected no people.
================================ Human Message =================================

Fiona traveled far from France to Spain.
================================== Ai Message ==================================
Tool Calls:
  Data (0178939e-a4b1-4d2a-a93e-b87f665cdfd6)
 Call ID: 0178939e-a4b1-4d2a-a93e-b87f665cdfd6
  Args:
    people: [{'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}]
================================= Tool Message =================================

You have correctly called this tool.
================================== Ai Message ==================================

Detected people.

Let's compare performance with and without these messages. For example, let's pass a message from which we intend no people to be extracted:

message_no_extraction = {
    "role": "user",
    "content": "The solar system is large, but earth has only 1 moon.",
}

structured_llm = llm.with_structured_output(schema=Data)
structured_llm.invoke([message_no_extraction])

Data(people=[Person(name='Earth', hair_color='None', height_in_meters='0.00')])

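In this example, the model is liable to generate records of people erroneously.

Because our few-shot examples include "negative" cases, we encourage the model to behave correctly here: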

structured_llm.invoke(messages + [message_no_extraction])

Data(people=[])

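Tip

The LangSmith trace for this run reveals the exact sequence of messages sent to the chat model, the tool calls generated, latency, token counts, and other metadata.

See this guide for more detail on extraction workflows with reference examples, including how to incorporate prompt templates and customize the generation of example messages.

Next steps

Now that you understand the basics of extraction with LangChain, you are ready to move on to the rest of the how-to guides:

Add examples: more detail on using reference examples to improve performance.
Handle long text: what should you do if the text does not fit into the LLM's context window?
Use a parsing approach: use a prompt-based approach to extract with models that do not support tool/function calling.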
