Intel Weight-Only Quantization
Weight-only quantization of Hugging Face models with Intel Extension for Transformers pipelines
Hugging Face models can be run locally with weight-only quantization through the WeightOnlyQuantPipeline class.
The Hugging Face Model Hub hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, on an online platform where people can easily collaborate and build ML together.
These can be called from LangChain through this local pipeline wrapper class.
To use, you should have the transformers python package installed, as well as pytorch and intel-extension-for-transformers.
%pip install transformers --quiet
%pip install intel-extension-for-transformers
Model Loading
Models can be loaded by specifying model parameters using the from_model_id method. The model parameters include the WeightOnlyQuantConfig class from intel_extension_for_transformers.
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
hf = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
Models can also be loaded by passing in an existing transformers pipeline directly.
from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
)
hf = WeightOnlyQuantPipeline(pipeline=pipe)
Create Chain
With the model loaded into memory, you can compose it with a prompt to form a chain.
from langchain_core.prompts import PromptTemplate
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
chain = prompt | hf
question = "What is electroencephalography?"
print(chain.invoke({"question": question}))
CPU Inference
Currently, intel-extension-for-transformers only supports CPU device inference, with Intel GPU support coming soon. When running on a machine with a CPU, you can specify device="cpu" or device=-1 to place the model on the CPU device; see the sketch after the example below.
The default is -1 for CPU inference.
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
chain = prompt | llm
question = "What is electroencephalography?"
print(chain.invoke({"question": question}))
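To set the device explicitly rather than relying on the default, pass the device argument described above. A minimal sketch, assuming from_model_id accepts the device keyword exactly as this section describes:

from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline

# A minimal sketch: place the model on the CPU explicitly.
# device=-1 (the default) and device="cpu" are equivalent here.
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm_cpu = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    device=-1,  # or device="cpu"
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)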
Batch CPU Inference
You can also run inference on the CPU in batch mode.
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
chain = prompt | llm.bind(stop=["\n\n"])
questions = []
for i in range(4):
    questions.append({"question": f"What is the number {i} in french?"})
answers = chain.batch(questions)
for answer in answers:
    print(answer)
Data Types Supported by intel-extension-for-transformers
We support quantizing weights to the following data types for storage (weight_dtype in WeightOnlyQuantConfig):
- int8: Uses 8-bit data type.
- int4_fullrange: Uses the -8 value of the int4 range, compared with the normal int4 range [-7, 7].
- int4_clip: Clips values to the int4 range and sets values outside it to zero.
- nf4: Uses the normalized float 4-bit data type.
- fp4_e2m1: Uses the regular float 4-bit data type. "e2" means 2 bits are used for the exponent, and "m1" means 1 bit is used for the mantissa.
While these techniques store weights in 4 or 8 bits, computation still happens in float32, bfloat16, or int8 (compute_dtype in WeightOnlyQuantConfig); see the sketch after this list.
- fp32: Uses the float32 data type for computation.
- bf16: Uses the bfloat16 data type for computation.
- int8: Uses the 8-bit data type for computation.
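For example, storing weights in NF4 while computing in bfloat16 combines one entry from each list above. A minimal sketch using only the weight_dtype and compute_dtype parameters named in this section:

from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

# Store weights as normalized float 4-bit; compute in bfloat16.
conf = WeightOnlyQuantConfig(weight_dtype="nf4", compute_dtype="bf16")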
Supported Algorithms Matrix
Quantization algorithms supported by intel-extension-for-transformers (algorithm in WeightOnlyQuantConfig):

| Algorithm | PyTorch | LLM Runtime |
|---|---|---|
| RTN | ✔ | ✔ |
| AWQ | ✔ | stay tuned |
| TEQ | ✔ | stay tuned |
RTN: A quantization method that is very intuitive. It does not require additional datasets and is a very fast quantization method. Generally speaking, RTN converts the weights into a uniformly distributed integer data type, although some algorithms, such as QLoRA, propose a non-uniform NF4 data type and prove its theoretical optimality.
AWQ: Proved that protecting only 1% of the salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the distribution of activations and weights per channel. The salient weights are also multiplied by a large scale factor before quantization in order to preserve them.
TEQ: A trainable equivalent transformation that preserves FP32 precision in weight-only quantization. It is inspired by AWQ and provides a new solution for searching for the optimal per-channel scaling factor between activations and weights.
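Since RTN requires no calibration data, it is the quickest algorithm to try first. A minimal sketch selecting it through the algorithm field named above, paired with one of the storage data types listed earlier:

from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

# RTN: fast round-to-nearest quantization, no extra dataset required.
conf = WeightOnlyQuantConfig(weight_dtype="int4_fullrange", algorithm="RTN")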