ArxivRetriever
arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
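This notebook shows how to retrieve scientific articles from Arxiv.org and convert them into the Document format used downstream.
For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.
Integration details
| Retriever | Source | Package |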
|---|---|---|
| ArxivRetriever | Scholarly articles on arxiv.org | langchain_community |
Setup
If you want automated tracing of individual queries, you can also set your LangSmith API key by uncommenting the lines below:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
Installation
This retriever lives in the langchain-community package. We also need the arxiv dependency:
%pip install -qU langchain-community arxiv
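Instantiation
ArxivRetriever parameters include:
- optional load_max_docs: default=100. Use it to limit the number of downloaded documents. Downloading all 100 documents takes time, so use a small number while experimenting. There is currently a hard limit of 300.
- optional load_all_available_meta: default=False. By default, only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, Summary. If True, the other fields are downloaded as well.
- get_full_documents: boolean, default False. Determines whether to fetch the full text of the documents.
See the API reference for more details.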
from langchain_community.retrievers import ArxivRetriever
retriever = ArxivRetriever(
    load_max_docs=2,
    get_full_documents=True,
)
API Reference: ArxivRetriever
Usage
ArxivRetriever supports retrieval by article identifier:
docs = retriever.invoke("1605.08386")
docs[0].metadata # meta-information of the Document
{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',
'Published': datetime.date(2016, 5, 26),
'Title': 'Heat-bath random walks with Markov bases',
'Authors': 'Caprice Stanley, Tobias Windisch'}
docs[0].page_content[:400] # the content of the Document
'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'
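The metadata dictionary shown above is plain Python data, so it can be reshaped for display or logging. The following is a minimal sketch, assuming the four keys from that output are present; `format_citation` is a hypothetical helper written for illustration and is not part of langchain_community.

```python
import datetime

# Sample metadata copied from the output shown above.
metadata = {
    "Entry ID": "http://arxiv.org/abs/1605.08386v1",
    "Published": datetime.date(2016, 5, 26),
    "Title": "Heat-bath random walks with Markov bases",
    "Authors": "Caprice Stanley, Tobias Windisch",
}


def format_citation(meta: dict) -> str:
    """Build a one-line citation string (hypothetical helper, not a library API)."""
    year = meta["Published"].year  # "Published" is a datetime.date in the output above
    return f'{meta["Authors"]} ({year}). {meta["Title"]}. {meta["Entry ID"]}'


citation = format_citation(metadata)
print(citation)
```

If other metadata fields are requested through load_all_available_meta, a helper like this would need to tolerate the extra keys, which this sketch does by reading only the four it uses.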
ArxivRetriever also supports retrieval based on natural language text:
docs = retriever.invoke("What is the ImageBind model?")
docs[0].metadata
{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',
'Published': datetime.date(2023, 5, 31),
'Title': 'ImageBind: One Embedding Space To Bind Them All',
'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}
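Because the same invoke call accepts both identifiers and free text, an application may want to know in advance which kind of query it is about to send. The sketch below is an illustrative heuristic only: `looks_like_arxiv_id` and `ARXIV_ID_PATTERN` are hypothetical names, the pattern covers only new-style identifiers such as the two appearing above, and it is not claimed to match how the retriever itself classifies queries.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, with an optional
# version suffix such as "v2". Old-style identifiers (for example
# "hep-th/9901001") are deliberately not covered by this sketch.
ARXIV_ID_PATTERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_id(query: str) -> bool:
    """Return True if the query has the shape of a new-style arXiv identifier."""
    return bool(ARXIV_ID_PATTERN.match(query.strip()))


for query in ["1605.08386", "2305.05665v2", "What is the ImageBind model?"]:
    kind = "identifier" if looks_like_arxiv_id(query) else "free text"
    print(f"{query!r}: {kind}")
```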
Use within a chain
Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains.
We will need an LLM or chat model:
Select a chat model:
pip install -qU "langchain[openai]"
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.
Context: {context}
Question: {question}"""
)
def format_docs(docs):
    # Join the retrieved documents' text, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("What is the ImageBind model?")
'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'
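To see what reaches the model in the chain above, it can help to trace the first two steps without any network calls. The sketch below is a plain-Python approximation, not the LangChain implementation: `FakeDoc` is a hypothetical stand-in that models only the page_content attribute, the sample texts are placeholders, and `str.format` is used here in place of the prompt template's own rendering.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved Document; only page_content is modeled."""
    page_content: str


def format_docs(docs):
    # Same joining logic as in the chain: one blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


# Same template text as the prompt used in the chain.
TEMPLATE = """Answer the question based only on the context provided.
Context: {context}
Question: {question}"""

docs = [FakeDoc("First abstract."), FakeDoc("Second abstract.")]
question = "What is the ImageBind model?"

context = format_docs(docs)
prompt_text = TEMPLATE.format(context=context, question=question)
print(prompt_text)
```

With load_max_docs=2 the joined context stays short; a larger value means more text is packed into the prompt, which may matter for the model's context window.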
API reference
For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.