
Llama.cpp

llama-cpp-python is a Python binding for llama.cpp.

It supports inference for many LLMs, which can be accessed on Hugging Face.

This notebook goes over how to run llama-cpp-python within LangChain.

Note: new versions of llama-cpp-python use GGUF model files (see here).

This is a breaking change.

To convert existing GGML models to GGUF, you can run the following in llama.cpp:

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin

Installation

There are different options on how to install the llama-cpp package:

  • CPU usage
  • CPU + GPU (using one of many BLAS backends)
  • Metal GPU (MacOS with Apple Silicon chip)

CPU only installation

%pip install --upgrade --quiet  llama-cpp-python

Installation with OpenBLAS / cuBLAS / CLBlast

llama.cpp supports multiple BLAS backends for faster processing. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package for the desired BLAS backend (source).

Example installation with cuBLAS backend:

!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python

IMPORTANT: If you have already installed a CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:

!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Installation with Metal

llama.cpp supports Apple silicon as a first-class citizen, optimized via ARM NEON, Accelerate and Metal frameworks. Use the FORCE_CMAKE=1 environment variable to force the use of cmake and install the pip package with Metal support (source).

Example installation with Metal support:

!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

IMPORTANT: If you have already installed a CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:

!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

Installation with Windows

It is stable to install the llama-cpp-python library by compiling from source. You can follow most of the instructions in the repository itself, but there are some Windows-specific instructions which might be useful.

Requirements to install llama-cpp-python:

  • Git
  • Python
  • CMake
  • Visual Studio Community (make sure you install this with the following settings)
    • Desktop development with C++
    • Python development
    • Linux embedded development with C++
  1. Clone the git repository recursively to get the llama.cpp submodule as well
git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git
  2. Open up a command prompt and set the following environment variables.
set FORCE_CMAKE=1
set CMAKE_ARGS=-DGGML_CUDA=OFF

If you have an NVIDIA GPU, make sure DGGML_CUDA is set to ON.

Compiling and installing

Now you can cd into the llama-cpp-python directory and install the package:

python -m pip install -e .

IMPORTANT: If you have already installed a CPU-only version of the package, you need to reinstall it from scratch. Consider the following command:

!python -m pip install -e . --force-reinstall --no-cache-dir

Usage

Make sure you are following all instructions to install all the necessary model files.

You don't need an API_TOKEN as you will run the LLM locally.

It is worth understanding which models are suitable to be used on the desired machine.

TheBloke's Hugging Face models have a Provided files section that lists the RAM required to run models of different quantisation sizes and methods (eg: Llama2-7B-Chat-GGUF).

This github issue is also relevant to finding the right model for your machine.
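
For example, here is a minimal sketch of downloading a GGUF file with the huggingface_hub client. The repo id matches the model linked above, but the exact filename and quantisation are illustrative, so pick one your RAM can handle:

# Assumes `pip install huggingface_hub`; the filename below is an example quantisation.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",  # pick a file listed under "Provided files"
)
print(model_path)  # pass this path to LlamaCpp(model_path=...)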

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

Consider using a template that suits your model! Check the model's page on Hugging Face, etc., to get the correct prompting template.

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

CPU

Example using a LLaMA 2 7B model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
question = """
Question: A rap battle between Stephen Colbert and John Oliver
"""
llm.invoke(question)

Stephen Colbert:
Yo, John, I heard you've been talkin' smack about me on your show.
Let me tell you somethin', pal, I'm the king of late-night TV
My satire is sharp as a razor, it cuts deeper than a knife
While you're just a british bloke tryin' to be funny with your accent and your wit.
John Oliver:
Oh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.
My show is the one that people actually watch and listen to, not just for the laughs but for the facts.
While you're busy talkin' trash, I'm out here bringing the truth to light.
Stephen Colbert:
Truth? Ha! You think your show is about truth? Please, it's all just a joke to you.
You're just a fancy-pants british guy tryin' to be funny with your news and your jokes.
While I'm the one who's really makin' a difference, with my sat
``````output

llama_print_timings: load time = 358.60 ms
llama_print_timings: sample time = 172.55 ms / 256 runs ( 0.67 ms per token, 1483.59 tokens per second)
llama_print_timings: prompt eval time = 613.36 ms / 16 tokens ( 38.33 ms per token, 26.09 tokens per second)
llama_print_timings: eval time = 10151.17 ms / 255 runs ( 39.81 ms per token, 25.12 tokens per second)
llama_print_timings: total time = 11332.41 ms
"\nStephen Colbert:\nYo, John, I heard you've been talkin' smack about me on your show.\nLet me tell you somethin', pal, I'm the king of late-night TV\nMy satire is sharp as a razor, it cuts deeper than a knife\nWhile you're just a british bloke tryin' to be funny with your accent and your wit.\nJohn Oliver:\nOh Stephen, don't be ridiculous, you may have the ratings but I got the real talk.\nMy show is the one that people actually watch and listen to, not just for the laughs but for the facts.\nWhile you're busy talkin' trash, I'm out here bringing the truth to light.\nStephen Colbert:\nTruth? Ha! You think your show is about truth? Please, it's all just a joke to you.\nYou're just a fancy-pants british guy tryin' to be funny with your news and your jokes.\nWhile I'm the one who's really makin' a difference, with my sat"

Example using a LLaMA v1 model

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})


1. First, find out when Justin Bieber was born.
2. We know that Justin Bieber was born on March 1, 1994.
3. Next, we need to look up when the Super Bowl was played in that year.
4. The Super Bowl was played on January 28, 1995.
5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.
``````output

llama_print_timings: load time = 434.15 ms
llama_print_timings: sample time = 41.81 ms / 121 runs ( 0.35 ms per token)
llama_print_timings: prompt eval time = 2523.78 ms / 48 tokens ( 52.58 ms per token)
llama_print_timings: eval time = 23971.57 ms / 121 runs ( 198.11 ms per token)
llama_print_timings: total time = 28945.95 ms
'\n\n1. First, find out when Justin Bieber was born.\n2. We know that Justin Bieber was born on March 1, 1994.\n3. Next, we need to look up when the Super Bowl was played in that year.\n4. The Super Bowl was played on January 28, 1995.\n5. Finally, we can use this information to answer the question. The NFL team that won the Super Bowl in the year Justin Bieber was born is the San Francisco 49ers.'

GPU

If the installation with a BLAS backend was correct, you will see a BLAS = 1 indicator in the model properties.

Two of the most important parameters for use with GPU are:

  • n_gpu_layers - determines how many layers of the model are offloaded to your GPU.
  • n_batch - how many tokens are processed in parallel.

Setting these parameters correctly will dramatically improve the evaluation speed (see wrapper code for more details).

n_gpu_layers = -1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
llm_chain = prompt | llm
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.invoke({"question": question})


1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.

2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.

3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.

So, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl.
``````output

llama_print_timings: load time = 427.63 ms
llama_print_timings: sample time = 115.85 ms / 164 runs ( 0.71 ms per token, 1415.67 tokens per second)
llama_print_timings: prompt eval time = 427.53 ms / 45 tokens ( 9.50 ms per token, 105.26 tokens per second)
llama_print_timings: eval time = 4526.53 ms / 163 runs ( 27.77 ms per token, 36.01 tokens per second)
llama_print_timings: total time = 5293.77 ms
"\n\n1. Identify Justin Bieber's birth date: Justin Bieber was born on March 1, 1994.\n\n2. Find the Super Bowl winner of that year: The NFL season of 1993 with the Super Bowl being played in January or of 1994.\n\n3. Determine which team won the game: The Dallas Cowboys faced the Buffalo Bills in Super Bowl XXVII on January 31, 1993 (as the year is mis-labelled due to a error). The Dallas Cowboys won this matchup.\n\nSo, Justin Bieber was born when the Dallas Cowboys were the reigning NFL Super Bowl."

Metal

If the installation with Metal was correct, you will see a NEON = 1 indicator in the model properties.

The most important GPU parameters are:

  • n_gpu_layers - determines how many layers of the model are offloaded to your Metal GPU.
  • n_batch - how many tokens are processed in parallel; the default is 8, so set it to a larger number.
  • f16_kv - for some reason, Metal only supports True; otherwise you will get an error such as Asserting on type 0 GGML_ASSERT: .../ggml-metal.m:706: false && "not implemented".

Setting these parameters correctly will dramatically improve the evaluation speed (see wrapper code for more details).

n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

The console log will show the following to indicate that Metal was enabled properly:

ggml_metal_init: allocating
ggml_metal_init: using MPS
...

You can also check Activity Monitor by watching the GPU usage of the process; the CPU usage will drop dramatically after turning on n_gpu_layers=1.

For the first call to the LLM, the performance may be slower due to the model compilation on the Metal GPU.

Grammars

We can use grammars to constrain model outputs and sample tokens based on the rules defined in them.

To demonstrate this concept, we've included sample grammar files that will be used in the examples below.

Creating GBNF grammar files can be time-consuming, but if you have a use case where output schemas are important, there are two tools that can help:

  • An online grammar generator app that converts TypeScript interface definitions to a gbnf file.
  • A Python script for converting a JSON schema to a gbnf file. For example, you can create a pydantic object, generate its JSON schema using the .schema_json() method, and then use this script to convert it to a gbnf file (a minimal sketch follows this list).
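
As an illustration of the second option, here is a minimal sketch: the Person model is hypothetical, and .schema_json() assumes the pydantic v1-style API referenced above; the printed schema is what you would feed into the JSON-schema-to-GBNF conversion script.

from typing import List

from pydantic import BaseModel


class Person(BaseModel):
    # Hypothetical schema, used only to illustrate the workflow
    name: str
    age: int
    interests: List[str]


# Generate the JSON schema for the model; save it to a file and pass that file
# to the json-schema-to-grammar script to obtain a .gbnf grammar.
print(Person.schema_json(indent=2))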

In the first example, supply the path to the specified json.gbnf file in order to produce JSON:

n_gpu_layers = 1  # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
    grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/json.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("Describe a person in JSON format:")
{
"name": "John Doe",
"age": 34,
"": {
"title": "Software Developer",
"company": "Google"
},
"interests": [
"Sports",
"Music",
"Cooking"
],
"address": {
"street_number": 123,
"street_name": "Oak Street",
"city": "Mountain View",
"state": "California",
"postal_code": 94040
}}
``````output

llama_print_timings: load time = 357.51 ms
llama_print_timings: sample time = 1213.30 ms / 144 runs ( 8.43 ms per token, 118.68 tokens per second)
llama_print_timings: prompt eval time = 356.78 ms / 9 tokens ( 39.64 ms per token, 25.23 tokens per second)
llama_print_timings: eval time = 3947.16 ms / 143 runs ( 27.60 ms per token, 36.23 tokens per second)
llama_print_timings: total time = 5846.21 ms

We can also supply list.gbnf to return a list:

n_gpu_layers = 1
n_batch = 512
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
    grammar_path="/Users/rlm/Desktop/Code/langchain-main/langchain/libs/langchain/langchain/llms/grammars/list.gbnf",
)
%%capture captured --no-stdout
result = llm.invoke("List of top-3 my favourite books:")
["The Catcher in the Rye", "Wuthering Heights", "Anna Karenina"]
``````output

llama_print_timings: load time = 322.34 ms
llama_print_timings: sample time = 232.60 ms / 26 runs ( 8.95 ms per token, 111.78 tokens per second)
llama_print_timings: prompt eval time = 321.90 ms / 11 tokens ( 29.26 ms per token, 34.17 tokens per second)
llama_print_timings: eval time = 680.82 ms / 25 runs ( 27.23 ms per token, 36.72 tokens per second)
llama_print_timings: total time = 1295.27 ms