如何对CSV文件进行问答

大型语言模型非常适合用于构建针对各种数据源的问答系统。在本节中，我们将介绍如何基于存储在 CSV 文件中的数据构建问答系统。与操作 SQL 数据库类似，处理 CSV 文件的关键在于让大型语言模型能够访问查询和操作数据的工具。实现这一目标的两种主要方法是：

推荐：将CSV文件加载到SQL数据库中，并使用SQL 教程中概述的方法。
让大型语言模型访问一个可以使用 Pandas 等库与数据交互的 Python 环境。

本指南将涵盖这两种方法。

⚠️ 安全提示 ⚠️

上述两种方法都存在重大风险。使用SQL需要执行模型生成的SQL查询，而使用Pandas之类的库则需要让模型执行Python代码。由于相比沙盒化Python环境，更易于严格限制SQL连接权限并清理SQL查询，我们强烈建议通过SQL与CSV数据交互。 有关通用安全最佳实践的更多信息，请点击此处。

设置

本指南的依赖项：

%pip install -qU langchain langchain-openai langchain-community langchain-experimental pandas

设置所需的环境变量：

# Using LangSmith is recommended but not required. Uncomment below lines to use.
# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

如果尚未拥有，请下载泰坦尼克号数据集：

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv -O titanic.csv

import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.shape)
print(df.columns.tolist())

(887, 8)
['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

SQL

使用 SQL 与 CSV 数据交互是推荐的方法，因为相比任意的 Python 代码，它更容易限制权限并清理查询。

大多数SQL数据库都可以轻松地将CSV文件加载为表（DuckDB、SQLite等）。完成此操作后，您可以使用在SQL教程中概述的所有链和代理创建技术。以下是使用SQLite实现此操作的快速示例：

from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("sqlite:///titanic.db")
df.to_sql("titanic", engine, index=False)

API 参考：SQLDatabase

db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM titanic WHERE Age < 2;"))

sqlite
['titanic']
[(1, 2, 'Master. Alden Gates Caldwell', 'male', 0.83, 0, 2, 29.0), (0, 3, 'Master. Eino Viljami Panula', 'male', 1.0, 4, 1, 39.6875), (1, 3, 'Miss. Eleanor Ileen Johnson', 'female', 1.0, 1, 1, 11.1333), (1, 2, 'Master. Richard F Becker', 'male', 1.0, 2, 1, 39.0), (1, 1, 'Master. Hudson Trevor Allison', 'male', 0.92, 1, 2, 151.55), (1, 3, 'Miss. Maria Nakid', 'female', 1.0, 0, 2, 15.7417), (0, 3, 'Master. Sidney Leonard Goodwin', 'male', 1.0, 5, 2, 46.9), (1, 3, 'Miss. Helene Barbara Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 3, 'Miss. Eugenie Baclini', 'female', 0.75, 2, 1, 19.2583), (1, 2, 'Master. Viljo Hamalainen', 'male', 0.67, 1, 1, 14.5), (1, 3, 'Master. Bertram Vere Dean', 'male', 1.0, 1, 2, 20.575), (1, 3, 'Master. Assad Alexander Thomas', 'male', 0.42, 0, 1, 8.5167), (1, 2, 'Master. Andre Mallet', 'male', 1.0, 0, 2, 37.0042), (1, 2, 'Master. George Sibley Richards', 'male', 0.83, 1, 1, 18.75)]

并创建一个 SQL代理与之交互：

选择聊天模型:

pip install -qU "langchain[openai]"

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_community.agent_toolkits import create_sql_agent

agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

API 参考：create_sql_agent

agent_executor.invoke({"input": "what's the average age of survivors"})

[1m> Entering new SQL Agent Executor chain...[0m
[32;1m[1;3m
Invoking: `sql_db_list_tables` with `{}`


[0m[38;5;200m[1;3mtitanic[0m[32;1m[1;3m
Invoking: `sql_db_schema` with `{'table_names': 'titanic'}`


[0m[33;1m[1;3m
CREATE TABLE titanic (
	"Survived" BIGINT, 
	"Pclass" BIGINT, 
	"Name" TEXT, 
	"Sex" TEXT, 
	"Age" FLOAT, 
	"Siblings/Spouses Aboard" BIGINT, 
	"Parents/Children Aboard" BIGINT, 
	"Fare" FLOAT
)

/*
3 rows from titanic table:
Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
0	3	Mr. Owen Harris Braund	male	22.0	1	0	7.25
1	1	Mrs. John Bradley (Florence Briggs Thayer) Cumings	female	38.0	1	0	71.2833
1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.925
*/[0m[32;1m[1;3m
Invoking: `sql_db_query` with `{'query': 'SELECT AVG(Age) AS Average_Age FROM titanic WHERE Survived = 1'}`


[0m[36;1m[1;3m[(28.408391812865496,)][0m[32;1m[1;3mThe average age of survivors in the Titanic dataset is approximately 28.41 years.[0m

[1m> Finished chain.[0m

{'input': "what's the average age of survivors",
 'output': 'The average age of survivors in the Titanic dataset is approximately 28.41 years.'}

这种方法很容易推广到多个CSV文件，因为我们只需将每个文件加载到数据库中作为独立的表。请参阅下面的多个CSV文件部分。

Pandas

除了SQL之外，我们还可以使用pandas等数据分析库以及大语言模型的代码生成能力来操作CSV数据。同样，除非你已建立全面的防护措施，否则这种做法不适合用于生产环境。因此，我们的代码执行工具和构造函数位于langchain-experimental包中。

Chains

大多数大型语言模型都经过了足够多的pandas Python代码训练，因此只需被要求，就能生成相应的代码：

ai_msg = llm.invoke(
    "I have a pandas DataFrame 'df' with columns 'Age' and 'Fare'. Write code to compute the correlation between the two columns. Return Markdown for a Python code snippet and nothing else."
)
print(ai_msg.content)

\`\`\`python
correlation = df['Age'].corr(df['Fare'])
correlation
\`\`\`

我们可以将这一能力与一个执行 Python 代码的工具相结合，创建一个简单的数据分析流程。首先，我们需要将我们的 CSV 表格加载为数据框，并让该工具能够访问这个数据框：

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.tools import PythonAstREPLTool

df = pd.read_csv("titanic.csv")
tool = PythonAstREPLTool(locals={"df": df})
tool.invoke("df['Fare'].mean()")

API 参考：ChatPromptTemplate | PythonAstREPLTool

32.30542018038331

为了帮助确保我们Python工具的正确使用，我们将使用工具调用：

llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)
response = llm_with_tools.invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)
response

AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_SBrK246yUbdnJemXFC8Iod05', 'function': {'arguments': '{"query":"df.corr()[\'Age\'][\'Fare\']"}', 'name': 'python_repl_ast'}, 'type': 'function'}]}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 125, 'total_tokens': 138}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-1fd332ba-fa72-4351-8182-d464e7368311-0', tool_calls=[{'name': 'python_repl_ast', 'args': {'query': "df.corr()['Age']['Fare']"}, 'id': 'call_SBrK246yUbdnJemXFC8Iod05'}])

response.tool_calls

[{'name': 'python_repl_ast',
  'args': {'query': "df.corr()['Age']['Fare']"},
  'id': 'call_SBrK246yUbdnJemXFC8Iod05'}]

我们将添加一个工具输出解析器，以将函数调用提取为字典：

from langchain_core.output_parsers.openai_tools import JsonOutputKeyToolsParser

parser = JsonOutputKeyToolsParser(key_name=tool.name, first_tool_only=True)
(llm_with_tools | parser).invoke(
    "I have a dataframe 'df' and want to know the correlation between the 'Age' and 'Fare' columns"
)

API 参考：JsonOutputKeyToolsParser

{'query': "df[['Age', 'Fare']].corr()"}

并将其与提示语结合，以便我们只需指定问题，而无需每次调用时都提供数据框信息：

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

\`\`\`
{df.head().to_markdown()}
\`\`\`

Given a user question, write the Python code to answer it. \
Return ONLY the valid Python code and nothing else. \
Don't assume you have access to any libraries other than built-in Python ones and pandas."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])
code_chain = prompt | llm_with_tools | parser
code_chain.invoke({"question": "What's the correlation between age and fare"})

{'query': "df[['Age', 'Fare']].corr()"}

最后，我们将添加我们的 Python 工具，以便生成的代码能够实际执行：

chain = prompt | llm_with_tools | parser | tool
chain.invoke({"question": "What's the correlation between age and fare"})

0.11232863699941621

就这样，我们创建了一个简单数据分析链。可以通过查看 LangSmith 跟踪记录来了解中间步骤：https://smith.langchain.com/public/b1309290-7212-49b7-bde2-75b39a32b49a/r

我们可以在最后再添加一次大型语言模型调用，以生成对话式回复，而不仅仅是回应工具输出。为此，我们需要在提示中添加聊天历史 MessagesPlaceholder：

from operator import itemgetter

from langchain_core.messages import ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough

system = f"""You have access to a pandas dataframe `df`. \
Here is the output of `df.head().to_markdown()`:

\`\`\`
{df.head().to_markdown()}
\`\`\`

Given a user question, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas.
Respond directly to the question once you have enough information to answer it."""
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system,
        ),
        ("human", "{question}"),
        # This MessagesPlaceholder allows us to optionally append an arbitrary number of messages
        # at the end of the prompt using the 'chat_history' arg.
        MessagesPlaceholder("chat_history", optional=True),
    ]
)


def _get_chat_history(x: dict) -> list:
    """Parse the chain output up to this point into a list of chat history messages to insert in the prompt."""
    ai_msg = x["ai_msg"]
    tool_call_id = x["ai_msg"].additional_kwargs["tool_calls"][0]["id"]
    tool_msg = ToolMessage(tool_call_id=tool_call_id, content=str(x["tool_output"]))
    return [ai_msg, tool_msg]


chain = (
    RunnablePassthrough.assign(ai_msg=prompt | llm_with_tools)
    .assign(tool_output=itemgetter("ai_msg") | parser | tool)
    .assign(chat_history=_get_chat_history)
    .assign(response=prompt | llm | StrOutputParser())
    .pick(["tool_output", "response"])
)

API 参考：ToolMessage | StrOutputParser | MessagesPlaceholder | RunnablePassthrough

chain.invoke({"question": "What's the correlation between age and fare"})

{'tool_output': 0.11232863699941616,
 'response': 'The correlation between age and fare is approximately 0.1123.'}

这是本次运行的 LangSmith 跟踪记录： https://smith.langchain.com/public/14e38d70-45b1-4b81-8477-9fd2b7c07ea6/r

代理

对于复杂问题，让大型语言模型能够迭代执行代码，并保持之前执行的输入和输出，这将非常有帮助。这就是代理（Agents）发挥作用的地方。它们允许大型语言模型决定需要调用工具的次数，并跟踪迄今为止已完成的执行记录。create_pandas_dataframe_agent 是一个内置代理，可轻松处理数据框：

from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    llm, df, agent_type="openai-tools", verbose=True, allow_dangerous_code=True
)
agent.invoke(
    {
        "input": "What's the correlation between age and fare? is that greater than the correlation between fare and survival?"
    }
)

API 参考：create_pandas_dataframe_agent

[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Age', 'Fare']].corr().iloc[0,1]"}`

[0m[36;1m[1;3m0.11232863699941621[0m[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df[['Fare', 'Survived']].corr().iloc[0,1]"}`

[0m[36;1m[1;3m0.2561785496289603[0m[32;1m[1;3mThe correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.

Therefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).[0m

[1m> Finished chain.[0m

{'input': "What's the correlation between age and fare? is that greater than the correlation between fare and survival?",
 'output': 'The correlation between Age and Fare is approximately 0.112, and the correlation between Fare and Survival is approximately 0.256.\n\nTherefore, the correlation between Fare and Survival (0.256) is greater than the correlation between Age and Fare (0.112).'}

这是本次运行的 LangSmith 跟踪信息：https://smith.langchain.com/public/6a86aee2-4f22-474a-9264-bd4c7283e665/r

多个CSV文件

要处理多个CSV文件（或数据框），我们只需将多个数据框传递给我们的Python工具。我们的 create_pandas_dataframe_agent 构造函数可以开箱即用地实现这一点，我们可以传入一个数据框列表，而不仅仅是一个。如果我们自己构建链式结构，可以这样做：

df_1 = df[["Age", "Fare"]]
df_2 = df[["Fare", "Survived"]]

tool = PythonAstREPLTool(locals={"df_1": df_1, "df_2": df_2})
llm_with_tool = llm.bind_tools(tools=[tool], tool_choice=tool.name)
df_template = """\`\`\`python
{df_name}.head().to_markdown()
>>> {df_head}
\`\`\`"""
df_context = "\n\n".join(
    df_template.format(df_head=_df.head().to_markdown(), df_name=df_name)
    for _df, df_name in [(df_1, "df_1"), (df_2, "df_2")]
)

system = f"""You have access to a number of pandas dataframes. \
Here is a sample of rows from each dataframe and the python code that was used to generate the sample:

{df_context}

Given a user question about the dataframes, write the Python code to answer it. \
Don't assume you have access to any libraries other than built-in Python ones and pandas. \
Make sure to refer only to the variables mentioned above."""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{question}")])

chain = prompt | llm_with_tool | parser | tool
chain.invoke(
    {
        "question": "return the difference in the correlation between age and fare and the correlation between fare and survival"
    }
)

0.14384991262954416

这是本次运行的 LangSmith 跟踪信息：https://smith.langchain.com/public/cc2a7d7f-7c5a-4e77-a10c-7b5420fcd07f/r

沙箱代码执行

有许多工具，如E2B和Bearly，可提供沙箱环境以执行Python代码，从而实现更安全的代码执行链和代理。

下一步

对于更高级的数据分析应用，我们推荐查看：

SQL 教程: 与 SQL 数据库和 CSV 文件一起工作时遇到的许多挑战都适用于任何结构化数据类型，因此即使您使用 Pandas 进行 CSV 数据分析，阅读 SQL 技巧也很有帮助。
工具使用: 有关在使用调用工具的链和代理时的一般最佳实践指南
代理: 了解构建大型语言模型代理的基础知识。
集成：沙盒环境如 E2B 和 Bearly，工具类如 SQLDatabase，相关代理如 Spark DataFrame 代理。

⚠️ 安全提示 ⚠️​

设置​

SQL​

Pandas​

Chains​

代理​

多个CSV文件​

沙箱代码执行​

下一步​

⚠️ 安全提示 ⚠️

设置

SQL

Pandas

Chains

代理

多个CSV文件

沙箱代码执行

下一步