Exporting a Qwen3 model with rkllm-toolkit
Some Rockchip boards come with an NPU; the RK3576, for example, offers 6 TOPS of compute. Rockchip provides an official SDK for NPU acceleration. Using Qwen3 as an example, this article shows how to export a model to the rkllm format.
Create the environment
# Only tested with a Python 3.10 environment
uv init rkllm-export -p 3.10
cd rkllm-export
# It is recommended to edit pyproject.toml first and point uv at a domestic PyPI mirror (see the sketch after these commands)
wget https://github.com/airockchip/rknn-llm/raw/refs/heads/main/rkllm-toolkit/rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl
uv add rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl
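On a mainland-China network, a domestic PyPI mirror makes dependency resolution much faster. A minimal sketch of the pyproject.toml addition, assuming the TUNA mirror (any other mirror URL works the same way):

# Assumption: TUNA mirror; swap in whichever index you prefer.
[[tool.uv.index]]
name = "tuna"
url = "https://pypi.tuna.tsinghua.edu.cn/simple"
default = true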
Download the model
# Download the model with modelscope
uv add modelscope
uv run modelscope download Qwen/Qwen3-4B
The contents of main.py are as follows:
from rkllm.api import RKLLM
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
model_path = os.path.expanduser('~/.cache/modelscope/hub/models/Qwen/Qwen3-4B') # directory of the model downloaded above; no trailing slash, so os.path.basename() below yields 'Qwen3-4B'
llm = RKLLM()
# Load model
# Use 'export CUDA_VISIBLE_DEVICES=0' to specify GPU device
# device options ['cpu', 'cuda']
# dtype options ['float32', 'float16', 'bfloat16']
# Using 'bfloat16' or 'float16' can significantly reduce memory consumption but at the cost of lower precision
# compared to 'float32'. Choose the appropriate dtype based on your hardware and model requirements.
ret = llm.load_huggingface(
    model=model_path,
    model_lora=None,
    device='cuda',
    dtype="float32",
    custom_config=None,
    load_weight=True
)
if ret != 0:
    print('Load model failed!')
    exit(ret)
# Build model
dataset = "./data_quant.json" # 这里没有用到dataset
# Json file format, please note to add prompt in the input,like this:
# [{"input":"Human: 你好!\nAssistant: ", "target": "你好!我是人工智能助手KK!"},...]
# Different quantization methods are optimized for different algorithms:
# w8a8/w8a8_gx is recommended to use the normal algorithm.
# w4a16/w4a16_gx is recommended to use the grq algorithm.
qparams = None # Use extra_qparams
target_platform = "RK3576"
optimization_level = 1
quantized_dtype = "w8a8" # 默认使用w8a8 + normal,使用w4a16 + grq模型会更小,但精度也会变小
quantized_algorithm = "normal"
# quantized_dtype = "w4a16"
# quantized_algorithm = "grq"
num_npu_core = 2 # the RK3588 has 3 NPU cores, so pass 3 there
ret = llm.build(
    do_quantization=True,
    # dataset=dataset,
    optimization_level=optimization_level,
    quantized_dtype=quantized_dtype,
    quantized_algorithm=quantized_algorithm,
    target_platform=target_platform,
    num_npu_core=num_npu_core,
    extra_qparams=qparams,
    hybrid_rate=0,
    max_context=16384 # 16384 is currently the maximum
)
if ret != 0:
    print('Build model failed!')
    exit(ret)
# Export rkllm model
ret = llm.export_rkllm(
    os.path.join(model_path, f"{os.path.basename(model_path)}_{target_platform}_{quantized_dtype}.rkllm")
)
if ret != 0:
    print('Export model failed!')
    exit(ret)
Code source: the export example scripts in the airockchip/rknn-llm repository.
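If you want the smaller w4a16 model mentioned in the comments, the build call changes roughly as sketched below; if you also supply a calibration file, it should follow the JSON format shown in main.py's comments (the ./data_quant.json path here is a hypothetical example).

# Hedged variant: 4-bit weights with the grq algorithm, giving a smaller but less precise model.
# "./data_quant.json" is a hypothetical calibration file in the format described in main.py's comments.
ret = llm.build(
    do_quantization=True,
    dataset="./data_quant.json",
    optimization_level=1,
    quantized_dtype="w4a16",
    quantized_algorithm="grq",
    target_platform="RK3576",
    num_npu_core=2,
    extra_qparams=None,
    hybrid_rate=0,
    max_context=16384
)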
Run the export
uv run main.py
# exports to ~/.cache/modelscope/hub/models/Qwen/Qwen3-4B/Qwen3-4B_RK3576_w8a8.rkllm
The export uses a fair amount of memory, so keep an eye on how much RAM you have free.
Using a GPU speeds up the export.
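If no CUDA-capable GPU is available, the conversion can also run on the CPU, just more slowly; a minimal sketch of the changed loader call, using only the device options listed in the toolkit's comments:

# CPU fallback (slower); every other argument stays the same as in main.py above.
ret = llm.load_huggingface(
    model=model_path,
    model_lora=None,
    device='cpu',
    dtype="float32",
    custom_config=None,
    load_weight=True
)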