How to summarize text through iterative refinement

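Large language models can summarize text and distill the desired information from it, including large volumes of text. In many cases, especially when the amount of text far exceeds the model's context window, it is helpful (or necessary) to break the summarization task into smaller components.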

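Iterative refinement is one strategy for summarizing long texts. It works as follows:

Split the text into smaller documents;
Summarize the first document;
Refine or update the result based on the next document;
Repeat through the sequence of documents until finished.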
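The steps above can be sketched in plain Python before any framework is involved. This is a minimal illustration, not the LangGraph implementation shown later: `iterative_refine`, `fake_summarize`, and `fake_refine` are hypothetical names, and the two "fake" functions stand in for model calls so the sketch runs offline.

```python
from typing import Callable, List


def iterative_refine(
    docs: List[str],
    summarize: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize the first document, then fold each later one into the summary."""
    if not docs:
        raise ValueError("need at least one document")
    summary = summarize(docs[0])
    for doc in docs[1:]:
        # Each step needs the previous summary, so the loop is sequential.
        summary = refine(summary, doc)
    return summary


# Stand-ins for model calls (assumptions, so the sketch runs without an LLM).
def fake_summarize(doc: str) -> str:
    return doc.rstrip(".")


def fake_refine(existing: str, doc: str) -> str:
    return f"{existing}; {doc.rstrip('.')}"


result = iterative_refine(
    ["Apples are red.", "Blueberries are blue.", "Bananas are yellow."],
    fake_summarize,
    fake_refine,
)
print(result)
```

Because every call to `refine` takes the previous summary as input, the documents cannot be processed independently of one another.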

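Note that this strategy is not parallelized. It is especially effective when understanding a sub-document depends on prior context, for instance when summarizing a novel or a body of text with an inherent sequence.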

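LangGraph, built on top of langchain-core, is well suited to this problem:

LangGraph allows individual steps (such as successive summarizations) to be streamed, giving finer-grained control over execution;
LangGraph's checkpointing supports error recovery, extends to human-in-the-loop workflows, and makes it easier to incorporate into conversational applications.
Because it is assembled from modular components, it is also easy to extend or modify (for example, to incorporate tool calling or other behavior).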

Below, we demonstrate how to summarize text through iterative refinement.

Load chat model

Let's first load a chat model:

Select chat model:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

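Load documents

Next, we need some documents to summarize. Below, we generate some toy documents for illustrative purposes. See the document loader how-to guides and integration pages for additional sources of data. The summarization tutorial also includes an example that summarizes a blog post.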

from langchain_core.documents import Document

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]

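API Reference: Document

Create graph

Below we show a LangGraph implementation of this process:

We generate a simple chain for the initial summary that takes the first document, formats it into a prompt, and runs inference with our LLM.
We generate a second refine_summary_chain that operates on each successive document, refining the initial summary.

We will need to install langgraph: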
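The toy documents above are already short, so the first step of the strategy (splitting a long text into smaller documents) is not exercised. The sketch below shows one simple way to do it; `split_text` is a hypothetical helper written for this illustration, and a real application would more likely rely on a dedicated text splitter.

```python
from typing import List


def split_text(text: str, chunk_size: int = 40) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Chunks hold at most chunk_size characters, except that a single word
    longer than chunk_size is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The word does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = "Apples are red. Blueberries are blue. Bananas are yellow."
contents = split_text(long_text, chunk_size=25)
print(contents)
```

The resulting list of strings has the same shape as the `contents` field that the graph state expects later in this guide.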

pip install -qU langgraph

from typing import List, Literal, TypedDict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.constants import Send
from langgraph.graph import END, START, StateGraph

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [
        ("human", "Write a concise summary of the following: {context}"),
    ]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refining the summary with new docs
refine_template = """
Produce a final summary.

Existing summary up to this point:
{existing_answer}

New context:
------------
{context}
------------

Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])

refine_summary_chain = refine_prompt | llm | StrOutputParser()


# We will define the state of the graph to hold the document
# contents and summary. We also include an index to keep track
# of our position in the sequence of documents.
class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


# We define functions for each node, including a node that generates
# the initial summary:
async def generate_initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        state["contents"][0],
        config,
    )
    return {"summary": summary, "index": 1}


# And a node that refines the summary based on the next document
async def refine_summary(state: State, config: RunnableConfig):
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


# Here we implement logic to either exit the application or refine
# the summary.
def should_refine(state: State) -> Literal["refine_summary", END]:
    if state["index"] >= len(state["contents"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

API Reference: StrOutputParser | ChatPromptTemplate | RunnableConfig | Send | StateGraph

LangGraph can draw the graph structure to help visualize its function:
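To see how the conditional edges drive the loop, the routing logic can be dry-run without LangGraph or a model. This is only a sketch under stated assumptions: `trace_nodes` is a hypothetical helper, the `END` string is a local stand-in for the constant imported from `langgraph.graph`, and the index update mirrors what `refine_summary` returns.

```python
from typing import List, TypedDict

END = "__end__"  # local stand-in for langgraph.graph.END in this dry run


class State(TypedDict):
    contents: List[str]
    index: int
    summary: str


def should_refine(state: State) -> str:
    # Same decision as in the graph: stop once every document has been used.
    return END if state["index"] >= len(state["contents"]) else "refine_summary"


def trace_nodes(contents: List[str]) -> List[str]:
    """Return the order in which nodes would run, without calling a model."""
    # generate_initial_summary always runs first and sets index to 1.
    state: State = {"contents": contents, "index": 1, "summary": ""}
    visited = ["generate_initial_summary"]
    while (next_node := should_refine(state)) != END:
        visited.append(next_node)
        state["index"] += 1  # mirrors the update returned by refine_summary
    return visited


print(trace_nodes(["a", "b", "c"]))
print(trace_nodes(["only one"]))
```

With three documents the refinement node runs twice; with a single document the graph exits right after the initial summary.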

from IPython.display import Image

Image(app.get_graph().draw_mermaid_png())

Invoke graph

We can step through the execution, printing out the summary as it is refined:

async for step in app.astream(
    {"contents": [doc.page_content for doc in documents]},
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)

Apples are characterized by their red color.
Apples are characterized by their red color, while blueberries are known for their blue hue.
Apples are characterized by their red color, blueberries are known for their blue hue, and bananas are recognized for their yellow color.

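The final step contains the summary as synthesized from the entire set of documents.

Next steps

Check out the summarization how-to guides for additional summarization strategies, including those designed for larger volumes of text.

See this tutorial for more detail on summarization.

See also the LangGraph documentation for detail on building with LangGraph.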
