news 2026/6/13 13:47:39

CANN/AMCT DeepSeek-V4-Flash量化流程

作者头像

张小明

前端开发工程师

1.2k 24
文章封面图
CANN/AMCT DeepSeek-V4-Flash量化流程

DeepSeek-V4-Flash End-to-End Post-Training Quantization Workflow

【免费下载链接】amctAMCT是CANN提供的昇腾AI处理器亲和的模型压缩工具仓。项目地址: https://gitcode.com/cann/amct

This document introduces how to use AMCT to complete end-to-end accuracy evaluation, post-training quantization, and deployment weight export for DeepSeek-V4-Flash.

Overall Overview

This document completes the following steps:

  • Environment preparation and verification.
  • Model weight confirmation.
  • BF16 model accuracy evaluation as baseline.
  • Evaluate accuracy loss under direct conversion quantization.
  • Extract calibration dataset based on quantization targets.
  • Perform PTQ based on calibration dataset, and save quantization parameters.
  • Evaluate quantization model accuracy loss, export quantization parameters.

This sample is constructed and demonstrated through Python code snippets. Each step prints commands throughshow_cmd. If you decide to directly run the current step, you can switch torun_cmd(..., dry_run=False).

1. Environment Preparation

This chapter is the environment preparation stage, including environment installation methods and whether dependencies have been successfully installed.

import sys import subprocess from pathlib import Path REPO_ROOT = Path("/path/to/AMCT") REQUIREMENTS_TXT = REPO_ROOT / "requirements.txt" print(f"Python executable: {sys.executable}") print(f"Python version: {sys.version.split()[0]}") print(f"Requirements file: {REQUIREMENTS_TXT}") # Uncomment the next line when the environment still needs dependency installation. # subprocess.run([sys.executable, "-m", "pip", "install", "-r", str(REQUIREMENTS_TXT)], check=True) import torch import transformers print(f"torch version: {torch.__version__}") print(f"transformers version: {transformers.__version__}") print("Environment sanity check passed.")

2. New Model Import and Model Weight Preparation

If you encounter a model that has not yet been adapted in the code repository, you also need to adapt the new model according to a fixed pattern. This chapter is based on DeepSeek-v4-Flash that has been adapted in the code repository for weight preparation and parameter setting.

Since all processes in the current code repository are based onbfloat16, if official weights are in FP8 or FP4 format, you need to first convert weights tobfloat16. Specifically use the deploy interface and set parametergranularity = tensor. For converted bf16 weights, if you want to deploy later, you can also deploy with one click. If you want to continue quantizing to different data formats with different bit widths, you can perform PTQ based on converted bf16 weights.

DeepSeek-v4-Flash official open source uses mixed FP8+FP4 weights, so this sample needs to first convert tobfloat16weights.

from pathlib import Path import shlex import subprocess REPO_ROOT = Path("/path/to/AMCT") MODEL_PATH = Path("/path/to/deepseek-v4-flash") CALIB_SOURCE_DIR = Path("/path/to/calibration_or_pileval_source") OUTPUT_ROOT = REPO_ROOT / "outputs" / "deepseek_v4_flash_walkthrough" EXPORT_DIR = OUTPUT_ROOT / "export" MODEL_NAME = "deepseek_v4" DEVICE = "npu:0" GRANULARITY = "tensor" SEQ_LEN = 4096 NSAMPLES = 128 ALL_QUANT_TARGETS = ["mlp", "moe", "attn-linear", "attn-cache"] PTQ_TARGETS = ["moe"] QUANT_DTYPE = "int" PTQ_ALGOS = ["lwc"] BF16_BIT_CONFIG = REPO_ROOT / "amct_pytorch/configs/bf16.yaml" W8A8_BIT_CONFIG = REPO_ROOT / "amct_pytorch/configs/w4a8.yaml" MLP_MOE_PARAM_DIR = OUTPUT_ROOT / "ptq_params" / MODEL_NAME / "mlp_moe" for path in [OUTPUT_ROOT, EXPORT_DIR, MLP_MOE_PARAM_DIR]: path.mkdir(parents=True, exist_ok=True) def pretty_cmd(parts): return " \ ".join(shlex.quote(str(part)) for part in parts) def show_cmd(parts): print(pretty_cmd(parts)) def run_cmd(parts, cwd=REPO_ROOT, dry_run=True): print(pretty_cmd(parts)) if not dry_run: subprocess.run([str(part) for part in parts], cwd=str(cwd), check=True) required_files = [ MODEL_PATH / "config.json", MODEL_PATH / "model.safetensors.index.json", ] optional_files = [ MODEL_PATH / "tokenizer.json", MODEL_PATH / "tokenizer_config.json", ] print(f"REPO_ROOT exists: {REPO_ROOT.exists()} -> {REPO_ROOT}") print(f"MODEL_PATH exists: {MODEL_PATH.exists()} -> {MODEL_PATH}") print() for path in required_files: print(f"[required] {path}: {path.exists()}") for path in optional_files: print(f"[optional] {path}: {path.exists()}") NEW_MODEL_PATH = Path("/path/to/deepseek-v4-flash-bf16") NEW_MODEL_PATH.mkdir(parents=True, exist_ok=True) baseline_cmd = [ "python", "amct_pytorch/cli/llm/deploy.py", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--eval_mode", "bf16", "--output_dir", NEW_MODEL_PATH, ] show_cmd(baseline_cmd)

3. Model Accuracy Baseline Test

This chapter is the BF16 accuracy evaluation before quantization, serving as the accuracy baseline for subsequent quantization experiments, used to evaluate quantization accuracy loss.

baseline_dir = OUTPUT_ROOT / "01_baseline_eval" baseline_dir.mkdir(parents=True, exist_ok=True) baseline_cmd = [ "python", "-m", "amct_pytorch.eval", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--eval_mode", "bf16", "--bit_config", BF16_BIT_CONFIG, "--seq_len", str(SEQ_LEN), "--output_dir", baseline_dir, "--wikitext_final_out", baseline_dir / "wikitext", ] show_cmd(baseline_cmd) # run_cmd(baseline_cmd, dry_run=False)

4. Direct Conversion Quantization Accuracy Loss Evaluation

This section evaluates the accuracy performance of the model under direct conversion quantization without introducing quantization algorithms, providing data reference for subsequent application of quantization algorithms.

direct_quant_dir = OUTPUT_ROOT / "02_direct_quant_eval" direct_quant_dir.mkdir(parents=True, exist_ok=True) direct_quant_cmd = [ "python", "-m", "amct_pytorch.eval", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--eval_mode", "quant", "--quant_dtype", QUANT_DTYPE, "--bit_config", W8A8_BIT_CONFIG, "--seq_len", str(SEQ_LEN), "--output_dir", direct_quant_dir, "--wikitext_final_out", direct_quant_dir / "wikitext", "--quant_target", *ALL_QUANT_TARGETS, ] show_cmd(direct_quant_cmd) # run_cmd(direct_quant_cmd, dry_run=False)

5. Typical Calibration Data Preparation

This chapter prepares calibration data for PTQ in AMCT.

extract_ptq_data.pyonly supports a singlequant_target, so this section will construct instructions for eachquant_targetseparately and save extracted inputs in corresponding paths.

calib_output_root = OUTPUT_ROOT / "03_calibration_data" calib_output_root.mkdir(parents=True, exist_ok=True) extract_cmds = {} for target in PTQ_TARGETS: target_output_dir = calib_output_root / target.replace("-", "_") target_output_dir.mkdir(parents=True, exist_ok=True) cmd = [ "python", "-m", "amct_pytorch.extract_ptq_data", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--quant_target", target, "--data_dir", CALIB_SOURCE_DIR, "--output_dir", target_output_dir, "--nsamples", str(NSAMPLES), "--seq_len", str(SEQ_LEN), ] extract_cmds[target] = cmd for target, cmd in extract_cmds.items(): print(f"\n[{target}]") show_cmd(cmd) # for cmd in extract_cmds.values(): # run_cmd(cmd, dry_run=False)

6. Post-Training Quantization

This section performs PTQ for each quantization targetquant_targetand saves learned parameters for subsequent accuracy evaluation.

ptq.pyonly supports a singlequant_target, so this section will construct instructions for eachquant_targetseparately and save learned parameters in corresponding paths.

ptq_output_root = OUTPUT_ROOT / "04_ptq_runs" ptq_output_root.mkdir(parents=True, exist_ok=True) param_dir_by_target = { "mlp": MLP_MOE_PARAM_DIR, "moe": MLP_MOE_PARAM_DIR, "attn-linear": ATTN_LINEAR_PARAM_DIR, "attn-cache": ATTN_CACHE_PARAM_DIR, } ptq_cmds = {} for target in PTQ_TARGETS: target_output_dir = ptq_output_root / target.replace("-", "_") target_output_dir.mkdir(parents=True, exist_ok=True) cmd = [ "python", "-m", "amct_pytorch.ptq", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--quant_target", target, "--quant_dtype", QUANT_DTYPE, "--bit_config", W8A8_BIT_CONFIG, "--data_dir", calib_output_root / target.replace("-", "_"), "--output_dir", target_output_dir, "--seq_len", str(SEQ_LEN), ] if PTQ_ALGOS: cmd.extend(["--algos", *PTQ_ALGOS]) if target in {"mlp", "moe"}: cmd.extend(["--moe_mlp_param_dir", MLP_MOE_PARAM_DIR]) elif target == "attn-linear": cmd.extend(["--attn_linear_param_dir", ATTN_LINEAR_PARAM_DIR]) elif target == "attn-cache": cmd.extend(["--attn_cache_param_dir", ATTN_CACHE_PARAM_DIR]) ptq_cmds[target] = cmd for target, cmd in ptq_cmds.items(): print(f"\n[{target}]") show_cmd(cmd) # for cmd in ptq_cmds.values(): # run_cmd(cmd, dry_run=False)

7. Direct Conversion Quantization Accuracy Evaluation Based on Post-Training Quantization

After completing PTQ, add quantization algorithms in direct conversion quantization accuracy evaluation, compare with baseline test and direct conversion quantization accuracy without quantization algorithms, and verify quantization algorithm effectiveness.

calibrated_eval_dir = OUTPUT_ROOT / "05_calibrated_quant_eval" calibrated_eval_dir.mkdir(parents=True, exist_ok=True) calibrated_eval_cmd = [ "python", "-m", "amct_pytorch.eval", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--eval_mode", "quant", "--quant_dtype", QUANT_DTYPE, "--bit_config", W8A8_BIT_CONFIG, "--seq_len", str(SEQ_LEN), "--output_dir", calibrated_eval_dir, "--wikitext_final_out", calibrated_eval_dir / "wikitext", "--quant_target", *ALL_QUANT_TARGETS, "--moe_mlp_param_dir", MLP_MOE_PARAM_DIR, "--attn_linear_param_dir", ATTN_LINEAR_PARAM_DIR, "--attn_cache_param_dir", ATTN_CACHE_PARAM_DIR, ] show_cmd(calibrated_eval_cmd) # run_cmd(calibrated_eval_cmd, dry_run=False)

8. Quantization Model Export

This section exports deployable quantization weights, updated configuration files, and auxiliary files to an independent output directory for direct use by downstream inference repositories.

export_cmd = [ "python", "-m", "amct_pytorch.deploy", "--model", MODEL_PATH, "--model_name", MODEL_NAME, "--device", DEVICE, "--granularity", GRANULARITY, "--quant_target", *ALL_QUANT_TARGETS, "--quant_dtype", QUANT_DTYPE, "--bit_config", W8A8_BIT_CONFIG, "--output_dir", EXPORT_DIR, "--moe_mlp_param_dir", MLP_MOE_PARAM_DIR, "--attn_linear_param_dir", ATTN_LINEAR_PARAM_DIR, "--attn_cache_param_dir", ATTN_CACHE_PARAM_DIR, ] show_cmd(export_cmd) # run_cmd(export_cmd, dry_run=False)

【免费下载链接】amctAMCT是CANN提供的昇腾AI处理器亲和的模型压缩工具仓。项目地址: https://gitcode.com/cann/amct

创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考

版权声明: 本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若内容造成侵权/违法违规/事实不符,请联系邮箱:809451989@qq.com进行投诉反馈,一经查实,立即删除!
网站建设 2026/6/13 13:44:02

AI动态简报之算力基建篇(2026.06.06)

⚡ 第1条:英伟达Q1营收816亿美元同比增85%,但"一家独大"时代走向终结——AI芯片进入多极竞争核心信息:英伟达FY2027 Q1营收816亿美元(同比85%),其中数据中心营收752亿美元(占比92.1%&…

作者头像 李华
网站建设 2026/6/6 14:33:22

OCR 火车票识别 API 服务介绍与使用考量

引言 光学字符识别(OCR)技术在自动化数据处理领域扮演着重要角色。针对火车票这类结构化票据的识别,能够大幅提升信息录入的效率和准确性。本文将基于 https://apizero.cn/marketplace/ocr-train-ticket 这一服务页面,介绍 OCR 火…

作者头像 李华
网站建设 2026/6/6 14:33:12

Arduino直流电机控制:DRV8833模块原理、接线与编程实战

1. 项目概述与核心价值玩机器人或者做个小车,控制电机是绕不开的一环。直流电机控制听起来简单,无非就是让它转、停、快慢和正反转,但真上手时,选对驱动芯片、接对线、写对代码,每一步都可能让你卡半天。我之前用过经典…

作者头像 李华