
ArxivRetriever

arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

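This notebook shows how to retrieve scientific articles from Arxiv.org and convert them into the Document format used downstream.

For detailed documentation of all ArxivRetriever features and configurations, see the API reference.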

Integration details

Retriever | Source | Package
ArxivRetriever | Scholarly articles on arxiv.org | langchain_community

Setup

If you want automated tracing of individual queries, you can also set your LangSmith API key by uncommenting the lines below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Installation

This retriever lives in the langchain-community package. We also need the arxiv dependency:

%pip install -qU langchain-community arxiv

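Instantiation

ArxivRetriever parameters include:

  • load_max_docs (optional): default=100. Limits the number of downloaded documents. Downloading all 100 documents takes time, so use a smaller number when experimenting. There is currently a hard limit of 300.
  • load_all_available_meta (optional): default=False. By default, only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, Summary. If True, the other fields are downloaded as well.
  • get_full_documents: boolean, default False. Determines whether to fetch the full text of the documents.

See the API reference for more details.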

from langchain_community.retrievers import ArxivRetriever

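retriever = ArxivRetriever(
    load_max_docs=2,
    # Fetch the full text rather than only the abstract. This mode is assumed
    # to need a PDF parser such as PyMuPDF to be installed.
    get_full_documents=True,
)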
API reference: ArxivRetriever
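
The limits described in the parameter list can be checked before the retriever is built. The following is only a minimal sketch: `build_retriever_kwargs` and `ARXIV_HARD_LIMIT` are hypothetical names that are not part of langchain_community, and the value 300 is simply the hard limit quoted above. The sketch runs offline and only prepares the keyword arguments.

```python
# Hypothetical helper (not part of langchain_community): validates settings
# before they are passed on as ArxivRetriever(**kwargs).
ARXIV_HARD_LIMIT = 300  # hard limit quoted in the parameter list above


def build_retriever_kwargs(load_max_docs=100, load_all_available_meta=False,
                           get_full_documents=False):
    """Return a kwargs dict for ArxivRetriever, rejecting out-of-range sizes."""
    if not 1 <= load_max_docs <= ARXIV_HARD_LIMIT:
        raise ValueError(
            f"load_max_docs must be between 1 and {ARXIV_HARD_LIMIT}, "
            f"got {load_max_docs}"
        )
    return {
        "load_max_docs": load_max_docs,
        "load_all_available_meta": load_all_available_meta,
        "get_full_documents": get_full_documents,
    }


# A small value keeps experiments fast.
experiment_kwargs = build_retriever_kwargs(load_max_docs=2)
print(experiment_kwargs)
```

The resulting dictionary could then be unpacked into the constructor, for example `ArxivRetriever(**experiment_kwargs)`.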

Usage

ArxivRetriever supports retrieval by article identifier:

docs = retriever.invoke("1605.08386")
docs[0].metadata  # meta-information of the Document
{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',
'Published': datetime.date(2016, 5, 26),
'Title': 'Heat-bath random walks with Markov bases',
'Authors': 'Caprice Stanley, Tobias Windisch'}
docs[0].page_content[:400]  # the content of the Document
'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'
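
The Entry ID in the metadata above is an abs URL that ends in the article identifier plus a version suffix. When the bare identifier is needed, for example to pass it back to `retriever.invoke`, it can be split off with a regular expression. This is a sketch under stated assumptions: `split_entry_id` is a hypothetical helper, and it only handles new-style identifiers such as `1605.08386`.

```python
import re

# New-style arXiv identifiers look like "1605.08386", optionally followed by
# a version suffix such as "v1". Old-style identifiers (for example
# "hep-th/9901001") are deliberately not handled in this sketch.
ARXIV_ID_PATTERN = re.compile(r"(?<!\d)(\d{4}\.\d{4,5})(?:v(\d+))?$")


def split_entry_id(entry_id):
    """Return (identifier, version) parsed from a bare ID or an abs URL.

    version is None when the string carries no "vN" suffix.
    """
    match = ARXIV_ID_PATTERN.search(entry_id.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {entry_id!r}")
    identifier, version = match.groups()
    return identifier, int(version) if version else None


print(split_entry_id("http://arxiv.org/abs/1605.08386v1"))  # ('1605.08386', 1)
print(split_entry_id("1605.08386"))  # ('1605.08386', None)
```

A natural-language query such as "What is the ImageBind model?" raises `ValueError` here, which gives a simple way to tell the two query styles apart.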

ArxivRetriever also supports retrieval based on natural language text:

docs = retriever.invoke("What is the ImageBind model?")
docs[0].metadata
{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',
'Published': datetime.date(2023, 5, 31),
'Title': 'ImageBind: One Embedding Space To Bind Them All',
'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}
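
Because every returned Document carries the same metadata keys, results are easy to sort and list. The sketch below runs offline on the two metadata dictionaries copied from the sample outputs above. In a real run they would come from `doc.metadata`. `summarize` is a hypothetical helper name.

```python
import datetime

# Metadata copied from the sample outputs above.
results = [
    {
        "Entry ID": "http://arxiv.org/abs/1605.08386v1",
        "Published": datetime.date(2016, 5, 26),
        "Title": "Heat-bath random walks with Markov bases",
        "Authors": "Caprice Stanley, Tobias Windisch",
    },
    {
        "Entry ID": "http://arxiv.org/abs/2305.05665v2",
        "Published": datetime.date(2023, 5, 31),
        "Title": "ImageBind: One Embedding Space To Bind Them All",
        "Authors": "Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, "
                   "Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra",
    },
]


def summarize(metadata):
    """Build a one-line listing: date | title | first author."""
    first_author = metadata["Authors"].split(", ")[0]
    return f'{metadata["Published"].isoformat()} | {metadata["Title"]} | {first_author}'


# Newest article first.
lines = [
    summarize(m)
    for m in sorted(results, key=lambda m: m["Published"], reverse=True)
]
print("\n".join(lines))
```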

Use within a chain

Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

pip install -qU "langchain[openai]"

import getpass
import os

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
"""Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    # Join the text of all retrieved documents into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("What is the ImageBind model?")
'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'
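
The `format_docs` step in the chain above decides what the model sees as context. One possible variant, shown here only as a sketch, labels each document with its title and trims it to a character budget, which may help when full-text retrieval returns long documents. The names `format_docs_with_titles` and `max_chars` are hypothetical, and the small `Document` dataclass is a stand-in for the LangChain Document class so that the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Offline stand-in for the LangChain Document class (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_titles(docs, max_chars=2000):
    """Join documents into one context string, labelling each with its title.

    max_chars is a hypothetical per-document budget that keeps long
    documents from overflowing the prompt.
    """
    sections = []
    for doc in docs:
        title = doc.metadata.get("Title", "Untitled")
        sections.append(f"Title: {title}\n{doc.page_content[:max_chars]}")
    return "\n\n".join(sections)


docs = [
    Document(
        "Graphs on lattice points are studied whose edges come from a finite "
        "set of allowed moves of arbitrary length.",
        {"Title": "Heat-bath random walks with Markov bases"},
    ),
    Document("x" * 5000),  # no metadata, far longer than the budget
]
context = format_docs_with_titles(docs, max_chars=50)
print(context)
```

In the chain, this function could take the place of `format_docs`, for example `{"context": retriever | format_docs_with_titles, ...}`.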

API reference

For detailed documentation of all ArxivRetriever features and configurations, see the API reference.