vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving, offering:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
This notebook goes over how to use a LLM with LangChain and vLLM.
To use it, you should have the vllm python package installed.
%pip install --upgrade --quiet vllm -q
from langchain_community.llms import VLLM
llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # mandatory for hf models
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)
print(llm.invoke("What is the capital of France ?"))
API Reference: VLLM
INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.00it/s]
What is the capital of France ? The capital of France is Paris.
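Because vLLM continuously batches incoming requests, several prompts can also be sent in a single call through the standard Runnable batch method. The sketch below is only illustrative; any list of prompt strings works the same way:

outputs = llm.batch(
    ["What is the capital of France ?", "What is the capital of Germany ?"]
)
print(outputs)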
Integrate the model in an LLMChain
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Who was the US president in the year the first Pokemon game was released?"
print(llm_chain.invoke(question))
API Reference: LLMChain | PromptTemplate
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.34s/it]
1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.
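Note that LLMChain is deprecated in recent LangChain releases; the same chain can also be expressed with the prompt | llm composition, which accepts the same input:

chain = prompt | llm
print(chain.invoke({"question": question}))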
Distributed Inference
vLLM supports distributed tensor-parallel inference and serving.
To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
from langchain_community.llms import VLLM
llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # mandatory for hf models
)
llm.invoke("What is the future of AI?")
API Reference: VLLM
Quantization
vLLM supports awq quantization. To enable it, pass quantization to vllm_kwargs.
llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)
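The quantized model is then invoked like any other VLLM instance; the prompt below is just an illustration:

print(llm_q.invoke("What is quantization in machine learning?"))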
OpenAI-Compatible Server
vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
This server can be queried in the same format as the OpenAI API.
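One way to start such a server (assuming a recent vLLM installation) is vLLM's OpenAI-compatible API server entrypoint; adjust the model name and port to your deployment:

python -m vllm.entrypoints.openai.api_server --model tiiuae/falcon-7b --port 8000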
OpenAI-Compatible Completion
from langchain_community.llms import VLLMOpenAI
llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))
API Reference: VLLMOpenAI
a city that is filled with history, ancient buildings, and art around every corner
LoRA Adapter
LoRA adapters can be used with any vLLM model that implements SupportsLoRA.
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest
llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
LoRA_ADAPTER_PATH = "path/to/adapter"
# LoRARequest takes an adapter name, an integer adapter id, and the adapter path
lora_adapter = LoRARequest("lora_adapter", 1, LoRA_ADAPTER_PATH)

print(
    llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)
API Reference: VLLM