Some Rockchip boards have an NPU; the RK3576, for example, offers 6 TOPS of compute. Rockchip provides an official SDK for NPU acceleration. This post uses Qwen3 as an example to export a model in the rkllm format.

Create the environment

# only tested with a Python 3.10 environment
uv init rkllm-export -p 3.10
cd rkllm-export

# it is recommended to edit pyproject.toml first and switch to a domestic PyPI mirror (see the example after this block)

wget https://github.com/airockchip/rknn-llm/raw/refs/heads/main/rkllm-toolkit/rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl
uv add rkllm_toolkit-1.2.1-cp310-cp310-linux_x86_64.whl
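For the mirror, uv can read an alternative package index from pyproject.toml. A minimal sketch of that configuration (the TUNA mirror URL is just one common choice, not a requirement):

[[tool.uv.index]]
name = "tuna"
url = "https://pypi.tuna.tsinghua.edu.cn/simple"
default = true

After adding the wheel, uv run python -c "from rkllm.api import RKLLM" is a quick way to confirm the toolkit imports cleanly.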

Download the model

# download the model with modelscope
uv add modelscope
uv run modelscope download Qwen/Qwen3-4B
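If you need the exact local path of the downloaded model (it becomes model_path in main.py below), a minimal sketch using modelscope's Python API:

from modelscope import snapshot_download

# returns the local cache directory of the model, downloading it first if necessary
model_dir = snapshot_download('Qwen/Qwen3-4B')
print(model_dir)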

The contents of main.py are as follows:

from rkllm.api import RKLLM
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

model_path = os.path.expanduser('~/.cache/modelscope/hub/models/Qwen/Qwen3-4B')  # your downloaded model directory; expanduser resolves '~', and no trailing slash so os.path.basename() works below
llm = RKLLM()

# Load model
# Use 'export CUDA_VISIBLE_DEVICES=0' to specify GPU device
# device options ['cpu', 'cuda']
# dtype  options ['float32', 'float16', 'bfloat16']
# Using 'bfloat16' or 'float16' can significantly reduce memory consumption but at the cost of lower precision
# compared to 'float32'. Choose the appropriate dtype based on your hardware and model requirements.
ret = llm.load_huggingface(
    model=model_path,
    model_lora=None,
    device='cuda',
    dtype="float32",
    custom_config=None,
    load_weight=True
)
if ret != 0:
    print('Load model failed!')
    exit(ret)

# Build model

dataset = "./data_quant.json"  # 这里没有用到dataset
# Json file format, please note to add prompt in the input,like this:
# [{"input":"Human: 你好!\nAssistant: ", "target": "你好!我是人工智能助手KK!"},...]

# Different quantization methods are optimized for different algorithms:
# w8a8/w8a8_gx:   the normal algorithm is recommended.
# w4a16/w4a16_gx: the grq algorithm is recommended.
qparams = None  # Use extra_qparams

target_platform = "RK3576"

optimization_level = 1

quantized_dtype = "w8a8"  # 默认使用w8a8 + normal,使用w4a16 + grq模型会更小,但精度也会变小
quantized_algorithm = "normal"

# quantized_dtype = "w4a16"
# quantized_algorithm = "grq"

num_npu_core = 2  # RK3576 has 2 NPU cores; for RK3588 you can pass 3

ret = llm.build(
    do_quantization=True,
    # dataset=dataset,
    optimization_level=optimization_level,
    quantized_dtype=quantized_dtype,
    quantized_algorithm=quantized_algorithm,
    target_platform=target_platform,
    num_npu_core=num_npu_core,
    extra_qparams=qparams,
    hybrid_rate=0,
    max_context=16384  # 16384 is currently the maximum
)
if ret != 0:
    print('Build model failed!')
    exit(ret)

# Export rkllm model
ret = llm.export_rkllm(
    os.path.join(model_path, f"{os.path.basename(model_path)}_{target_platform}_{quantized_dtype}.rkllm")
)
if ret != 0:
    print('Export model failed!')
    exit(ret)

Code source:

rknn-llm
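The calibration dataset is left unused in main.py, but if you want to pass one to llm.build(dataset=...), the JSON described in the comments can be generated with a small sketch like this (the sample pair is just the placeholder from the comment):

import json

# write data_quant.json in the format expected by llm.build(dataset=...):
# each entry carries the full prompt in "input" and the reference answer in "target"
samples = [
    {"input": "Human: 你好!\nAssistant: ", "target": "你好!我是人工智能助手KK!"},
    # add more representative prompt/response pairs from your own use case
]
with open("data_quant.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)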

Run the export

uv run main.py
# the model is exported to ~/.cache/modelscope/hub/models/Qwen/Qwen3-4B/Qwen3-4B_RK3576_w8a8.rkllm

The export uses a fair amount of RAM (a 4B model loaded in float32 already takes about 16 GB), so make sure you have enough memory headroom; if you are short, a swap file can help (sketch below).
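A generic Linux sketch for adding temporary swap (the 16G size is just an example, adjust to your machine):

sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile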

Using a GPU speeds up the export.