Ontotext GraphDB

Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL.

This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.

GraphDB LLM Functionalities

GraphDB supports some LLM integration functionalities, as described here:

gpt-queries

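Magic predicates to ask an LLM for text, a list, or a table using data from your knowledge graph (KG)
Query explanation
Result explanation, summarization, rephrasing, translation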

retrieval-graphdb-connector

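Indexing of KG entities in a vector database
Supports any text embedding algorithm and vector database
Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, and Lucene
Automatic synchronization of changes in RDF data to the KG entity index
Supports nested objects (no UI support in GraphDB version 10.5)
Serializes KG entities to text like this (e.g. for a Wines dataset):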

Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.

talk-to-graph

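A simple chatbot that uses a defined KG entity index

For this tutorial, we won't use the GraphDB LLM integration, but rather SPARQL generation from NLQ. We'll use the Star Wars API (SWAPI) ontology and dataset, which you can examine here.

Setting up

You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a docker compose setup that populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.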

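Install Docker. This tutorial was created using Docker version 24.0.7, which bundles Docker Compose. For earlier Docker versions you may need to install Docker Compose separately.
Clone the GitHub repository langchain-graphdb-qa-chain-demo into a local folder on your machine.
Start GraphDB with the following script, executed from the same folder: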

docker build --tag graphdb .
docker compose up -d graphdb

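You need to wait a few seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is loaded automatically into the langchain repository. The local SPARQL endpoint http://localhost:7200/repositories/langchain can be used to run queries. You can also open the GraphDB Workbench in your favourite web browser at http://localhost:7200/sparql, where you can run queries interactively.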
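Before wiring up LangChain, it can help to confirm that the endpoint answers queries. The sketch below uses only the Python standard library and follows the SPARQL 1.1 Protocol (a form-encoded POST with a `query` parameter). The helper names `build_select_request` and `run_select` are illustrative and not part of GraphDB or LangChain, and the counting query is just an example that assumes nothing about the dataset.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the `langchain` repository mentioned above.
ENDPOINT = "http://localhost:7200/repositories/langchain"


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol POST request for a SELECT query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_select(endpoint: str, query: str) -> list[dict]:
    """Send the query and return the result bindings (needs a running GraphDB)."""
    request = build_select_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]


# Illustrative query: count all triples visible in the repository.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

# Uncomment once the database is up:
# print(run_select(ENDPOINT, COUNT_QUERY))
```

The network call is left commented out so the snippet can be pasted into a notebook cell before the container has finished starting.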

Set up the working environment

If you use conda, create and activate a new conda environment, e.g.:

conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa

Install the following libraries:

pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4

Run Jupyter with

jupyter notebook

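Specifying the ontology

For the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). The schema can be provided through one of two parameters of the OntotextGraphDBGraph class:

query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend storing the ontology in its own named graph, which makes it easier to retrieve only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), i.e. it also includes the incoming class links. For large graphs with a million instances, this is not efficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix, N-Quads.

In either case, the ontology dump should:

Include enough information about classes, properties, and the attachment of properties to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), as well as taxonomies (important individuals).
Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.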

from langchain_community.graphs import OntotextGraphDBGraph

# feeding the schema using a user construct query

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)
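The named-graph recommendation can be captured in a small helper that builds the CONSTRUCT query string. This is only a sketch: `ontology_construct_query` is a hypothetical name, not a LangChain function, and the graph IRI is the one used in this tutorial's example.

```python
def ontology_construct_query(graph_iri: str) -> str:
    """Return a CONSTRUCT query that copies every triple from one named graph."""
    # Reject characters that are not allowed inside a SPARQL IRI reference,
    # so the IRI cannot break out of the angle brackets.
    forbidden = '<>"{}|^`\\'
    if any(ch in forbidden or ch.isspace() for ch in graph_iri):
        raise ValueError(f"Not a valid IRI: {graph_iri!r}")
    return f"CONSTRUCT {{?s ?p ?o}} FROM <{graph_iri}> WHERE {{?s ?p ?o}}"


query = ontology_construct_query("https://swapi.co/ontology/")
print(query)
```

The resulting string can be passed as the query_ontology argument.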

API Reference: OntotextGraphDBGraph

# feeding the schema using a local RDF file

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)

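Either way, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is the most compact form and the easiest for the LLM to remember.

The Star Wars ontology is a bit unusual in that it includes many specific triples about classes, e.g. that the species :Aleena lives on <planet/38>, is a subclass of :Reptile, has certain typical characteristics (average height, average lifespan, skin color), and that specific individuals (characters) are representatives of that class: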

@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class, :Species ;
    rdfs:label "Aleena" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Reptile, :Sentient ;
    :averageHeight 80.0 ;
    :averageLifespan "79" ;
    :character <https://swapi.co/resource/aleena/47> ;
    :film <https://swapi.co/resource/film/4> ;
    :language "Aleena" ;
    :planet <https://swapi.co/resource/planet/38> ;
    :skinColor "blue", "gray" .

    ...

To keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, set the environment variables 'GRAPHDB_USERNAME' and 'GRAPHDB_PASSWORD' before initializing OntotextGraphDBGraph.

import os

os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)
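A small pre-flight check can catch a half-configured environment before the graph object is created. The sketch below reflects what is convenient for a script, not how OntotextGraphDBGraph itself reads the variables; `read_graphdb_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional, Tuple


def read_graphdb_credentials(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (username, password), or None when no credentials are configured."""
    username = env.get("GRAPHDB_USERNAME")
    password = env.get("GRAPHDB_PASSWORD")
    if username is None and password is None:
        return None  # nothing configured: treat the instance as unsecured
    if username is None or password is None:
        # Only one of the two variables is set, which is almost certainly a mistake.
        raise ValueError("Set both GRAPHDB_USERNAME and GRAPHDB_PASSWORD, or neither.")
    return username, password


# Example with an explicit mapping instead of the real environment.
sample_env = {
    "GRAPHDB_USERNAME": "graphdb-user",
    "GRAPHDB_PASSWORD": "graphdb-password",
}
print(read_graphdb_credentials(sample_env))
```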

Question Answering against the StarWars dataset

We can now use the OntotextGraphDBQAChain to ask some questions.

import os

from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of its larger context window.
# The 'gpt-4-1106-preview' model name will be deprecated in the future and replaced by 'gpt-4-turbo' or similar,
# so consult the OpenAI API docs at https://platform.openai.com/docs/models for the correct name.

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)

API Reference: OntotextGraphDBQAChain | ChatOpenAI

Let's ask a simple one.

chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
  ?planet rdfs:label "Tatooine" ;
          :climate ?climate .
}

> Finished chain.

'The climate on Tatooine is arid.'

And a slightly more complicated one.

chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
    chain.output_key
]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?climate
WHERE {
  ?character rdfs:label "Luke Skywalker" .
  ?character :homeworld ?planet .
  ?planet :climate ?climate .
}

> Finished chain.

"The climate on Luke Skywalker's home planet is arid."

We can also ask more complicated questions, such as

chain.invoke(
    {
        chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
    }
)[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a :Film .
  ?film :boxOffice ?boxOfficeValue .
  BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}

> Finished chain.

'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'

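Chain modifiers

The Ontotext GraphDB QA chain allows prompt refinement to further improve your QA chain and enhance the overall user experience of your app.

"SPARQL Generation" prompt

This prompt is used to generate the SPARQL query from the user question and the KG schema.

sparql_generation_prompt

Default value: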

  GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
  Write a SPARQL SELECT query for querying a graph database.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  Use only the classes and properties provided in the schema to construct the SPARQL query.
  Do not use any classes or properties that are not explicitly provided in the SPARQL query.
  Include all necessary prefixes.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The question delimited by triple backticks is:
  ```
  {prompt}
  ```
  """
  GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
      input_variables=["schema", "prompt"],
      template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
  )
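To see what text reaches the LLM, it helps to render a template by hand. The sketch below uses plain `str.format` rather than LangChain's PromptTemplate, and the shortened template and the one-class schema fragment are made up for illustration; only the two placeholder names, `schema` and `prompt`, come from the default prompt.

```python
# Shortened, made-up template with the same two placeholders as the default.
TEMPLATE = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "Schema (Turtle):\n{schema}\n"
    "Question:\n{prompt}\n"
)

# Tiny illustrative schema fragment; the real schema comes from the graph object.
schema = (
    "@prefix : <https://swapi.co/vocabulary/> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    ":Film a owl:Class ."
)
question = "What is the climate on Tatooine?"

# Fill both placeholders; braces inside the values are left untouched by format().
rendered = TEMPLATE.format(schema=schema, prompt=question)
print(rendered)
```

A custom template presumably needs to keep both placeholders, since the default PromptTemplate declares `schema` and `prompt` as its input variables.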

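"SPARQL Fix" prompt

Sometimes the LLM generates a SPARQL query with syntax errors, missing prefixes, and so on. The chain tries to amend this by prompting the LLM to correct the query, up to a limited number of times.

sparql_fix_prompt

Default value: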

  GRAPHDB_SPARQL_FIX_TEMPLATE = """
  This following SPARQL query delimited by triple backticks
  ```
  {generated_sparql}
  ```
  is not valid.
  The error delimited by triple backticks is
  ```
  {error_message}
  ```
  Give me a correct version of the SPARQL query.
  Do not change the logic of the query.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  """
  
  GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
      input_variables=["error_message", "generated_sparql", "schema"],
      template=GRAPHDB_SPARQL_FIX_TEMPLATE,
  )

max_fix_retries

Default value: 5
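The generate, validate, fix cycle bounded by max_fix_retries can be pictured as a simple loop. The sketch below is not the chain's actual implementation: `generate_with_fixes` and its three callables are hypothetical stand-ins for the LLM calls and the query validation.

```python
from typing import Callable, Optional


def generate_with_fixes(
    generate: Callable[[], str],
    validate: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_fix_retries: int = 5,
) -> str:
    """Return a query that passes `validate`, asking `fix` to repair it if needed.

    `validate` returns None for a valid query, or an error message otherwise.
    """
    query = generate()
    error = validate(query)
    retries = 0
    while error is not None and retries < max_fix_retries:
        query = fix(query, error)  # in a real chain this step would prompt the LLM
        error = validate(query)
        retries += 1
    if error is not None:
        raise ValueError(f"Query still invalid after {retries} fix attempts: {error}")
    return query


# Toy stand-ins: the "LLM" forgets the PREFIX line, the "fixer" adds it back.
PREFIX = "PREFIX : <https://swapi.co/vocabulary/>\n"
broken = "SELECT ?c WHERE { ?p :climate ?c }"

fixed = generate_with_fixes(
    generate=lambda: broken,
    validate=lambda q: None if q.startswith("PREFIX") else "Undefined prefix ':'",
    fix=lambda q, err: PREFIX + q,
)
print(fixed)
```

Bounding the loop keeps a query that the model cannot repair from consuming an unlimited number of LLM calls.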

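"Answering" prompt

This prompt is used to answer the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to use only the information from the returned results. If the result set is empty, the LLM should say that it can't answer the question.

qa_prompt

Default value: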

  GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
  You are an assistant that creates well-written and human understandable answers.
  The information part contains the information provided, which you can use to construct an answer.
  The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
  Make your response sound like the information is coming from an AI assistant, but don't add any information.
  Don't use internal knowledge to answer the question, just say you don't know if no information is available.
  Information:
  {context}
  
  Question: {prompt}
  Helpful Answer:"""
  GRAPHDB_QA_PROMPT = PromptTemplate(
      input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
  )

Once you're finished playing with QA with GraphDB, you can shut down the Docker environment by running docker compose down -v --remove-orphans from the directory containing the Docker Compose file.
