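How to deal with large databases when doing SQL question-answering
To write valid queries against a database, we need to give the model the table names, table schemas, and feature values it will query over. When there are many tables, many columns, and/or high-cardinality columns, we cannot dump the full information about the database into every prompt. Instead, we must find ways to dynamically insert only the most relevant information into the prompt.
In this guide we demonstrate methods for identifying such relevant information and feeding it into a query-generation step. We will cover:
- Identifying a relevant subset of tables;
- Identifying a relevant subset of column values.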
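As a rough way to gauge when these techniques become necessary, the sketch below counts the tables in a database and the distinct values in a column using Python's built-in sqlite3 module. The in-memory toy table and the helper name `column_cardinality` are illustrative assumptions and are not part of the guide's chain.

```python
import sqlite3


def column_cardinality(conn, table, column):
    """Return the number of distinct non-null values in table.column."""
    # Identifiers cannot be bound as query parameters, so they are quoted here.
    sql = f'SELECT COUNT(DISTINCT "{column}") FROM "{table}"'
    return conn.execute(sql).fetchone()[0]


# Toy in-memory database standing in for a real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Accept",), ("Aerosmith",), ("Accept",)],
)

# List the user tables, then measure one column's cardinality.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)
print(column_cardinality(conn, "Artist", "Name"))
```

A large table count suggests filtering which schemas reach the prompt; a large distinct-value count in a text column suggests retrieving candidate values instead of listing them all.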
Setup
First, get the required packages and set the environment variables:
%pip install --upgrade --quiet langchain langchain-community langchain-openai
# Uncomment the below to use LangSmith. Not required.
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
# os.environ["LANGSMITH_TRACING"] = "true"
The examples below use a SQLite connection with the Chinook database. Follow these installation steps to create Chinook.db in the same directory:
- Save this file as Chinook_Sqlite.sql
- Run sqlite3 Chinook.db
- Run .read Chinook_Sqlite.sql
- Test SELECT * FROM Artist LIMIT 10;
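If the sqlite3 command-line shell is not available, the same steps can probably be reproduced from Python. This is a minimal sketch, assuming Chinook_Sqlite.sql has already been saved in the current directory; the helper name `build_sqlite_db` is hypothetical.

```python
import sqlite3
from pathlib import Path


def build_sqlite_db(sql_path, db_path):
    """Create a SQLite database file by executing a SQL script."""
    # utf-8-sig tolerates a leading byte-order mark, if the script has one.
    script = Path(sql_path).read_text(encoding="utf-8-sig")
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(script)
        conn.commit()
    finally:
        conn.close()


# Only build when the script is present and the database does not exist yet.
if Path("Chinook_Sqlite.sql").exists() and not Path("Chinook.db").exists():
    build_sqlite_db("Chinook_Sqlite.sql", "Chinook.db")
```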
Now that Chinook.db is in our directory, we can interface with it using the SQLAlchemy-driven SQLDatabase class:
from langchain_community.utilities import SQLDatabase
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM Artist LIMIT 10;"))
sqlite
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
[(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]
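Many tables
One of the main pieces of information we need to include in the prompt is the schemas of the relevant tables. When we have very many tables, we cannot fit all of the schemas into a single prompt. In such cases, we can first extract the names of the tables related to the user input, and then include only their schemas.
One easy and reliable way to do this is tool calling. Below, we show how to use this feature to obtain output that conforms to a desired format (in this case, a list of table names). We use the chat model's .bind_tools method to bind a tool in Pydantic format, and feed the result into an output parser to reconstruct the object from the model's response.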
pip install -qU "langchain[openai]"
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
class Table(BaseModel):
    """Table in SQL database."""

    name: str = Field(description="Name of table in SQL database.")
table_names = "\n".join(db.get_usable_table_names())
system = f"""Return the names of ALL the SQL tables that MIGHT be relevant to the user question. \
The tables are:
{table_names}
Remember to include ALL POTENTIALLY RELEVANT tables, even if you're not sure that they're needed."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{input}"),
    ]
)
llm_with_tools = llm.bind_tools([Table])
output_parser = PydanticToolsParser(tools=[Table])
table_chain = prompt | llm_with_tools | output_parser
table_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})
[Table(name='Genre')]
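This works reasonably well. However, as we'll see below, we actually need a few other tables too (the final query also joins Track, Album, and Artist). It is hard for the model to know this from the user question alone. In this case, we can simplify the model's job by grouping the tables together. We ask the model only to choose between the categories "Music" and "Business", and then take care of selecting all the relevant tables ourselves: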
system = """Return the names of any SQL tables that are relevant to the user question.
The tables are:
Music
Business
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{input}"),
    ]
)
category_chain = prompt | llm_with_tools | output_parser
category_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})
[Table(name='Music'), Table(name='Business')]
from typing import List
def get_tables(categories: List[Table]) -> List[str]:
    tables = []
    for category in categories:
        if category.name == "Music":
            tables.extend(
                [
                    "Album",
                    "Artist",
                    "Genre",
                    "MediaType",
                    "Playlist",
                    "PlaylistTrack",
                    "Track",
                ]
            )
        elif category.name == "Business":
            tables.extend(["Customer", "Employee", "Invoice", "InvoiceLine"])
    return tables
table_chain = category_chain | get_tables
table_chain.invoke({"input": "What are all the genres of Alanis Morisette songs"})
['Album',
'Artist',
'Genre',
'MediaType',
'Playlist',
'PlaylistTrack',
'Track',
'Customer',
'Employee',
'Invoice',
'InvoiceLine']
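The category lookup above could also be written as a dictionary, which makes it easier to ignore category names the model invents and to avoid duplicate table names. This is only a sketch: `CATEGORY_TABLES` and `tables_for_categories` are hypothetical names, and the function takes plain category-name strings rather than the Table objects used above.

```python
from typing import Dict, Iterable, List

# Hypothetical grouping of the Chinook tables, mirroring the two categories above.
CATEGORY_TABLES: Dict[str, List[str]] = {
    "Music": [
        "Album",
        "Artist",
        "Genre",
        "MediaType",
        "Playlist",
        "PlaylistTrack",
        "Track",
    ],
    "Business": ["Customer", "Employee", "Invoice", "InvoiceLine"],
}


def tables_for_categories(category_names: Iterable[str]) -> List[str]:
    """Expand category names into table names, skipping unknowns and duplicates."""
    tables: List[str] = []
    for name in category_names:
        for table in CATEGORY_TABLES.get(name, []):
            if table not in tables:
                tables.append(table)
    return tables


# A repeated category and an unknown one ("Podcasts") are both handled quietly.
print(tables_for_categories(["Music", "Music", "Podcasts"]))
```

With Table objects as input, the names could be extracted first, for example `tables_for_categories(c.name for c in categories)`.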
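Now that we have a chain that can output the relevant tables for any query, we can combine it with create_sql_query_chain, which accepts a list table_names_to_use that determines which table schemas are included in the prompt: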
from operator import itemgetter
from langchain.chains import create_sql_query_chain
from langchain_core.runnables import RunnablePassthrough
query_chain = create_sql_query_chain(llm, db)
# Convert "question" key to the "input" key expected by current table_chain.
table_chain = {"input": itemgetter("question")} | table_chain
# Set table_names_to_use using table_chain.
full_chain = RunnablePassthrough.assign(table_names_to_use=table_chain) | query_chain
query = full_chain.invoke(
{"question": "What are all the genres of Alanis Morisette songs"}
)
print(query)
SELECT DISTINCT "g"."Name"
FROM "Genre" g
JOIN "Track" t ON "g"."GenreId" = "t"."GenreId"
JOIN "Album" a ON "t"."AlbumId" = "a"."AlbumId"
JOIN "Artist" ar ON "a"."ArtistId" = "ar"."ArtistId"
WHERE "ar"."Name" = 'Alanis Morissette'
LIMIT 5;
db.run(query)
"[('Rock',)]"
We can see the LangSmith trace for this run here.
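To make the effect of restricting table names more concrete, here is a minimal sketch that collects the CREATE TABLE statements for only a chosen subset of tables, using the built-in sqlite3 module. It is an illustration of the idea, not how SQLDatabase builds its table info; the toy schema and the helper name `table_info` are assumptions.

```python
import sqlite3
from typing import List


def table_info(conn: sqlite3.Connection, table_names: List[str]) -> str:
    """Return the CREATE TABLE statements for just the requested tables."""
    if not table_names:
        return ""
    placeholders = ", ".join("?" for _ in table_names)
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        f"WHERE type = 'table' AND name IN ({placeholders}) ORDER BY name",
        table_names,
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


# Toy schema standing in for Chinook (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT, GenreId INTEGER);
    CREATE TABLE Invoice (InvoiceId INTEGER PRIMARY KEY, Total REAL);
    """
)

# Only the Genre and Track schemas would be placed in the prompt.
print(table_info(conn, ["Genre", "Track"]))
```

The prompt then grows with the number of selected tables rather than with the size of the whole database.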
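We have seen how to dynamically include a subset of table schemas in a prompt within a chain. Another possible approach to this problem is to let an agent decide for itself when to look up tables by giving it a tool to do so. You can see an example of this in the SQL: Agents guide.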
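High-cardinality columns
To filter on columns that contain proper nouns such as addresses, song names, or artists, we first need to double-check the spelling so that the data is filtered correctly.
One naive strategy is to create a vector store containing all the distinct proper nouns that exist in the database. We can then query that vector store for each user input and inject the most relevant proper nouns into the prompt.
First, we need the unique values for each entity we want, so we define a function that parses the query result into a list of elements: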
import ast
import re
def query_as_list(db, query):
    res = db.run(query)
    res = [el for sub in ast.literal_eval(res) for el in sub if el]
    res = [re.sub(r"\b\d+\b", "", string).strip() for string in res]
    return res
proper_nouns = query_as_list(db, "SELECT Name FROM Artist")
proper_nouns += query_as_list(db, "SELECT Title FROM Album")
proper_nouns += query_as_list(db, "SELECT Name FROM Genre")
len(proper_nouns)
proper_nouns[:5]
['AC/DC', 'Accept', 'Aerosmith', 'Alanis Morissette', 'Alice In Chains']
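To see what the parsing step does without a database connection, the sketch below applies the same two transformations to a hand-written result string. The function name `parse_rows` and the sample string are illustrative assumptions; the sample mimics the stringified list of row tuples that the function above parses.

```python
import ast
import re


def parse_rows(result: str) -> list:
    """Flatten a stringified list of row tuples and strip standalone numbers."""
    rows = ast.literal_eval(result)  # e.g. [('AC/DC',), (None,)]
    values = [el for row in rows for el in row if el]  # drop None and empty strings
    # Remove digit runs that form whole words, then trim surrounding whitespace.
    return [re.sub(r"\b\d+\b", "", value).strip() for value in values]


sample = (
    "[('AC/DC',), (None,), ('BBC Sessions [Disc 1] [Live]',), "
    "('1997 Black Light Syndrome',)]"
)
print(parse_rows(sample))
# ['AC/DC', 'BBC Sessions [Disc ] [Live]', 'Black Light Syndrome']
```

Note that the regular expression also removes numbers inside a title, such as the disc number here, so the stored strings may differ slightly from the database values.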
Now we can embed all of our values and store them in a vector database:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
vector_db = FAISS.from_texts(proper_nouns, OpenAIEmbeddings())
retriever = vector_db.as_retriever(search_kwargs={"k": 15})
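The retrieval idea can be tried without an embeddings API by swapping in plain string similarity. The sketch below uses difflib from the standard library as a stand-in; it is not the vector-store approach used in this guide, and unlike the retriever it expects the misspelled name itself rather than the whole question. The helper name `closest_names` is hypothetical.

```python
import difflib
from typing import List


def closest_names(query: str, candidates: List[str], k: int = 3) -> List[str]:
    """Return up to k candidates whose spelling is closest to the query."""
    # Compare case-insensitively, then map back to the original spelling.
    by_lower = {name.lower(): name for name in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=k, cutoff=0.6
    )
    return [by_lower[m] for m in matches]


names = ["AC/DC", "Accept", "Aerosmith", "Alanis Morissette", "Alice In Chains"]
print(closest_names("elenis moriset", names))
```

Embeddings are likely to cope better with full questions and with names that differ by more than a few characters, which is why the guide uses a vector store.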
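Next, we put together a query construction chain that first retrieves values from the vector database and inserts them into the prompt: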
from operator import itemgetter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
system = """You are a SQLite expert. Given an input question, create a syntactically
correct SQLite query to run. Unless otherwise specified, do not return more than
{top_k} rows.
Only return the SQL query with no markup or explanation.
Here is the relevant table info: {table_info}
Here is a non-exhaustive list of possible feature values. If filtering on a feature
value make sure to check its spelling against this list first:
{proper_nouns}
"""
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])
query_chain = create_sql_query_chain(llm, db, prompt=prompt)
retriever_chain = (
    itemgetter("question")
    | retriever
    | (lambda docs: "\n".join(doc.page_content for doc in docs))
)
chain = RunnablePassthrough.assign(proper_nouns=retriever_chain) | query_chain
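Conceptually, RunnablePassthrough.assign passes the input dictionary through and adds one computed key to it before the next step runs. The plain-Python analogy below illustrates that data flow only; it is not LangChain's implementation, and `assign` and `fake_retriever` are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict


def assign(**steps: Callable[[Dict[str, Any]], Any]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Return a function that copies its input dict and adds one key per step."""

    def run(inputs: Dict[str, Any]) -> Dict[str, Any]:
        computed = {key: step(inputs) for key, step in steps.items()}
        return {**inputs, **computed}

    return run


# Hypothetical stand-in for the retriever chain: newline-joined candidate names.
def fake_retriever(inputs: Dict[str, Any]) -> str:
    return "\n".join(["Alanis Morissette", "Alice In Chains"])


add_proper_nouns = assign(proper_nouns=fake_retriever)
print(add_proper_nouns({"question": "What are all the genres of elenis moriset songs"}))
```

The query chain then receives both the original question and the proper_nouns string, which fills the {proper_nouns} slot in the system prompt.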
To try out our chain, let's see what happens when we filter on "elenis moriset" (a misspelling of Alanis Morissette), both without and with retrieval:
# Without retrieval
query = query_chain.invoke(
{"question": "What are all the genres of elenis moriset songs", "proper_nouns": ""}
)
print(query)
db.run(query)
SELECT DISTINCT g.Name
FROM Track t
JOIN Album a ON t.AlbumId = a.AlbumId
JOIN Artist ar ON a.ArtistId = ar.ArtistId
JOIN Genre g ON t.GenreId = g.GenreId
WHERE ar.Name = 'Elenis Moriset';
''
# With retrieval
query = chain.invoke({"question": "What are all the genres of elenis moriset songs"})
print(query)
db.run(query)
SELECT DISTINCT g.Name
FROM Genre g
JOIN Track t ON g.GenreId = t.GenreId
JOIN Album a ON t.AlbumId = a.AlbumId
JOIN Artist ar ON a.ArtistId = ar.ArtistId
WHERE ar.Name = 'Alanis Morissette';
"[('Rock',)]"
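We can see that with retrieval we are able to correct the spelling from "Elenis Moriset" to "Alanis Morissette" and get back a valid result.
Another possible approach to this problem is to let an agent decide for itself when to look up proper nouns. You can see an example of this in the SQL: Agents guide.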