Complete Guide to Deploying Baichuan-M2-32B-GPTQ-Int4 on Ubuntu 20.04 - 程序员充电站


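Baichuan-M2-32B, recently open-sourced by Baichuan Intelligence (百川智能), has drawn considerable attention in medical reasoning. It scores very well on the HealthBench evaluation set, even outperforming some larger models. The most appealing part is that, after GPTQ-Int4 quantization, this 32B-parameter model can run on a single RTX 4090, which is good news for researchers and developers who want to try a medical LLM.

In practice, though, I found that many people run into environment problems during deployment: mismatched CUDA driver versions, Python dependency conflicts, model loading failures. Every step has its pitfalls. It took me quite a while on Ubuntu 20.04 to get the whole pipeline working.

This article collects the pitfalls I hit and the lessons I learned, and walks you through a complete deployment of Baichuan-M2-32B-GPTQ-Int4 on Ubuntu 20.04. I start from basic system setup and work step by step through model loading and performance tuning, so that you can follow along and get it running.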

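1. System Environment Preparation and Checks

The first and most important step in deploying a large model is making sure the system environment is configured correctly. Many deployment failures trace back to an environment that was not ready.

1.1 Hardware and System Requirements

First check whether your machine meets the basic requirements. Baichuan-M2-32B-GPTQ-Int4 is quantized, but it still places real demands on hardware.

Minimum configuration:

GPU: NVIDIA RTX 4090 24GB (single card) or a card with more VRAM
RAM: 64GB of system memory (32GB can work, but performance may suffer)
Storage: at least 100GB of free disk space (the model files are about 20GB, plus the Python environment and caches)
Operating system: Ubuntu 20.04 LTS or later

Recommended configuration:

GPU: RTX 4090 24GB or A100 40GB/80GB
RAM: 128GB or more
Storage: NVMe SSD with 500GB or more

You can quickly check your system configuration with the following commands: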

# Show GPU information
nvidia-smi
# Show memory information
free -h
# Show disk space
df -h

With less than 24GB of VRAM the model may fail to load, or may only run with a very small batch size, which hurts usability.
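The VRAM check can also be scripted. Below is a minimal sketch that parses the CSV output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` and flags cards that reach the 24GB class. The threshold `MIN_VRAM_MIB` and the helper names are my own assumptions, not part of any library; the threshold is set a little under 24 x 1024 MiB because nominal 24GB cards report slightly less than that.

```python
import subprocess

# Assumed threshold: nominal "24GB" cards report slightly under 24576 MiB
MIN_VRAM_MIB = 24_000


def parse_gpu_memory(csv_text):
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


def gpus_meeting_minimum(gpus, min_mib=MIN_VRAM_MIB):
    """Return the names of GPUs with at least min_mib of VRAM."""
    return [name for name, mem in gpus if mem >= min_mib]


# Demonstration on captured text instead of a live call
sample = "NVIDIA GeForce RTX 4090, 24564\nNVIDIA GeForce RTX 3080, 10240\n"
print(gpus_meeting_minimum(parse_gpu_memory(sample)))
```

On a real machine, `gpus_meeting_minimum(query_gpus())` runs the same check against the installed cards.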

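1.2 CUDA Driver Installation and Verification

The NVIDIA driver and CUDA are the foundation of GPU computing, and version mismatches are one of the most common causes of deployment failure.

Check the CUDA version supported by the current driver:

nvidia-smi | grep "CUDA Version"

Note that this field shows the highest CUDA version the installed driver supports, not the version of an installed CUDA Toolkit. For Baichuan-M2-32B-GPTQ-Int4 I recommend CUDA 12.1 or later. If the version reported here is lower than 12.1, upgrade first.

Install CUDA 12.1 (if not already installed):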

# Add the official NVIDIA repository first
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Install CUDA 12.1
sudo apt-get install -y cuda-12-1
# Set environment variables (appended to ~/.bashrc)
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify the installation
nvcc --version

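After installation, reboot so that the driver is loaded correctly:

sudo reboot

After the reboot, run nvidia-smi again and confirm that both the driver version and the CUDA version are shown as expected.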
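If you want to automate that confirmation, here is a small sketch that pulls the `CUDA Version: X.Y` field out of the nvidia-smi banner and compares it against the 12.1 minimum recommended above. The function names are hypothetical, and the sample banner line is only an illustration of the format, not output from my machine.

```python
import re

MIN_CUDA = (12, 1)  # minimum recommended in this section


def parse_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(smi_output, minimum=MIN_CUDA):
    """True when the driver reports support for at least `minimum`."""
    version = parse_cuda_version(smi_output)
    return version is not None and version >= minimum


# Illustrative banner line in the format nvidia-smi prints
header = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(parse_cuda_version(header), driver_supports(header))
```

Comparing tuples rather than strings matters here: as strings, "12.10" would sort before "12.2".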

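1.3 Python Environment Setup

Large-model deployments are fairly strict about Python versions and dependency isolation. I strongly recommend creating a dedicated environment with conda or venv to avoid dependency conflicts.

Install Miniconda (if not already installed):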

# Download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run the installer
bash Miniconda3-latest-Linux-x86_64.sh
# Follow the prompts to finish, then reload the shell so conda is available
source ~/.bashrc

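Create a dedicated Python environment:

# Create a Python 3.10 environment named baichuan-m2
conda create -n baichuan-m2 python=3.10 -y
# Activate the environment
conda activate baichuan-m2

Python 3.10 is a safe choice that balances compatibility and performance. Much older versions may lack features that newer libraries need, while the newest versions can run into library compatibility problems.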

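2. Installing and Configuring Dependencies

With the environment ready, the next step is installing the dependencies. Pay close attention to version compatibility here; many problems originate at this step.

2.1 Installing PyTorch and the CUDA Toolkit

The PyTorch build must match the CUDA version, otherwise GPU acceleration will not work.

Install PyTorch built for CUDA 12.1: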

# Install PyTorch with conda (recommended, handles CUDA dependencies automatically)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Or install with pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

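Verify that PyTorch can see the GPU:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
# The next two calls raise an error when no GPU is visible, so guard them
if torch.cuda.is_available():
    print(f"Current GPU index: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

If everything is in order, CUDA is reported as available and your GPU is identified correctly. If CUDA is reported as unavailable, the PyTorch build probably does not match the installed driver, and PyTorch needs to be reinstalled.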

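2.2 Installing the Inference Libraries

Deploying Baichuan-M2-32B-GPTQ-Int4 relies mainly on two libraries, vLLM and Transformers. vLLM provides an efficient inference engine, and Transformers is Hugging Face's model-loading library.

Install vLLM and Transformers: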

# Install vLLM (supports GPTQ-quantized models)
pip install vllm
# Install Transformers
pip install transformers
# Install other required dependencies
pip install accelerate sentencepiece protobuf

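Version compatibility notes:

Use vLLM 0.4.0 or later (treat this as a minimum and check the model card for the versions it currently recommends)
Use Transformers 4.37.0 or later
pip install vllm installs the PyTorch version that vLLM is built against, which may replace the build installed in section 2.1, so re-run the GPU check afterwards
If you hit compatibility problems, try pinning versions:
```
pip install vllm==0.4.0 transformers==4.37.0
```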
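To see what actually ended up in the environment, the installed versions can be compared against the minimums quoted above. This is a rough sketch using only the standard library; `version_tuple` is a deliberately simple parser of my own (it keeps the first three numeric components and ignores suffixes such as `.post1`), not a replacement for a full version parser.

```python
import re
from importlib import metadata

# Minimum versions quoted in the notes above
MINIMUMS = {"vllm": "0.4.0", "transformers": "4.37.0"}


def version_tuple(version):
    """Turn '0.4.0.post1' into (0, 4, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = re.match(r"\d+", piece)
        parts.append(int(digits.group()) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so 0.10.1 counts as newer than 0.4.0."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(check_installed())
```

A value of `None` in the report means the package is missing from the active environment, which usually points to the wrong conda environment being active.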

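2.3 Downloading and Preparing the Model

Baichuan-M2-32B-GPTQ-Int4 can be downloaded from the Hugging Face Hub or from ModelScope. Given network conditions in mainland China, I recommend ModelScope there, as it is much faster.

Use ModelScope as the download source (recommended for users in mainland China):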

# Tell vLLM to download models from ModelScope
export VLLM_USE_MODELSCOPE=True
# Append it to ~/.bashrc so it does not have to be set every time
echo 'export VLLM_USE_MODELSCOPE=True' >> ~/.bashrc

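Download the model manually (optional):

If you prefer to download the model to a local directory first, so that loading does not trigger a fresh download, use the following commands: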

# Download with huggingface-cli
pip install huggingface-hub
huggingface-cli download baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --local-dir ./Baichuan-M2-32B-GPTQ-Int4
# Or use git (requires git-lfs)
git lfs install
git clone https://huggingface.co/baichuan-inc/Baichuan-M2-32B-GPTQ-Int4

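The model files total about 20GB, so the download time depends on your network speed. After the download finishes, remember to check the integrity of the files.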
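One way to do that integrity check is to hash each file and compare it with a checksum you trust. The sketch below streams files through SHA-256 so that multi-gigabyte weight shards do not have to fit in memory. The `expected` mapping is an assumption: you would fill it with the checksums published alongside the model files, and the file names in it are whatever your download actually contains.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(model_dir, expected):
    """Return the file names that are missing or whose hash differs.

    `expected` maps file names to hex SHA-256 strings (supplied by you).
    """
    bad = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches("./Baichuan-M2-32B-GPTQ-Int4", expected)` means every listed file is present and matches; any name it returns is a candidate for re-downloading.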

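3. Deploying and Loading the Model

With the environment configured and the dependencies installed, the model can finally be loaded. This part covers two common approaches: running the vLLM API server directly, and loading the model from a Python script.

3.1 Starting an API Service with vLLM

vLLM ships with a ready-to-use API server, which is the simplest and quickest way to deploy.

Start the vLLM server: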

# Basic launch command
python -m vllm.entrypoints.openai.api_server \
    --model baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 \
    --served-model-name baichuan-m2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.9

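Parameter notes:

--model: model path or Hugging Face model ID
--served-model-name: the model name exposed by the API service
--trust-remote-code: trust code shipped with the model repository (needed for Baichuan models)
--max-model-len: maximum context length; 131072 enables long inputs, but the KV cache for that many tokens has to fit in the VRAM left after the weights are loaded, so on a 24GB card expect to lower this value (see section 4.1)
--gpu-memory-utilization: fraction of GPU memory vLLM may use; 0.9 means 90% of VRAM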
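How --max-model-len and --gpu-memory-utilization interact is easier to see with a back-of-the-envelope calculation. The sketch below estimates how many tokens of KV cache fit in the VRAM left after the weights. The architecture numbers (64 layers, 8 KV heads, head dimension 128, 2-byte cache entries) are assumptions on my part, and the 20GB weight size is simply the download size quoted earlier; read the real values from the model's config.json before relying on the result. Activation memory and framework overhead are ignored, so the true figure is lower.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(vram_gib, utilization, weights_gib, bytes_per_token):
    """Tokens that fit in the budget left after the weights (overheads ignored)."""
    budget_gib = vram_gib * utilization - weights_gib
    return max(0, int(budget_gib * GIB // bytes_per_token))


# ASSUMED architecture values; replace them with the ones in config.json
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
print("KV bytes per token:", per_token)
print("Tokens that fit on a 24GB card:", max_cached_tokens(24, 0.9, 20, per_token))
```

Under these assumptions the budget is a few thousand tokens, far below 131072, which is why lowering --max-model-len is the first thing to try when the server refuses to start on a 24GB card.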

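Advanced launch options (for performance tuning):

python -m vllm.entrypoints.openai.api_server \
    --model baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 \
    --served-model-name baichuan-m2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --block-size 16 \
    --swap-space 4 \
    --enforce-eager

--tensor-parallel-size: tensor-parallel degree; set to 1 for a single GPU
--block-size: token block size of the paged KV cache; affects memory use and performance
--swap-space: CPU swap space per GPU in GiB, used to hold the KV cache of preempted requests in system memory; it does not make room for the model weights
--enforce-eager: force eager mode (no CUDA graph capture), which avoids some compatibility problems and saves some VRAM at the cost of speed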

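Verify the API service:

Once started, the service listens on port 8000 by default. Test it with curl:

curl http://localhost:8000/v1/models

If model information is returned, the service started successfully. You can also open http://localhost:8000/docs in a browser to view the API documentation.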
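Beyond listing models, you will usually want to send an actual question. Here is a minimal client sketch for the OpenAI-compatible /v1/chat/completions endpoint, using only the standard library. It assumes the server was started as shown above, so the model name is the `baichuan-m2` value passed to --served-model-name; the helper names and the `max_tokens` default are my own choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default port mentioned above


def build_chat_payload(question, model="baichuan-m2", max_tokens=512):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }


def chat(question, base_url=BASE_URL, timeout=120):
    """POST the question and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What should I pay attention to with high blood pressure?"))` sends one request. Nothing is sent at import time, so the module is safe to load on a machine without the server.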

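3.2 Loading the Model Directly from a Python Script

If you need to use the model directly inside a Python program rather than through the API service, load it like this:

Basic loading code: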

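from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
    trust_remote_code=True,
    max_model_len=131072,  # lower this if the KV cache does not fit in VRAM
    gpu_memory_utilization=0.9,
    enforce_eager=True,
)

# Sampling parameters. "</think>" is deliberately not a stop string:
# stopping there would end generation before the final answer is written.
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    stop=["</s>"],
)

# Prepare the inputs
prompts = [
    "User: I have been feeling tired and dizzy lately. What could be the cause?\nAssistant:",
    "User: What medicine should I take for a cold?\nAssistant:",
]

# Generate the replies
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Input: {prompt}")
    print(f"Output: {generated_text}")
    print("-" * 50)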

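Handling the chain of thought (Chain-of-Thought):

Baichuan-M2 can generate a chain of thought, which is especially useful for medical reasoning. The example below extracts the reasoning and the final answer separately: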

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"
# Token id of </think>; it can be double-checked with
# tokenizer.convert_tokens_to_ids("</think>")
THINK_END_ID = 151668

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Initialize the model
llm = LLM(model=MODEL_ID, trust_remote_code=True, max_model_len=131072)


def build_medical_prompt(question):
    """Build a chat prompt with thinking mode switched on."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        thinking_mode="on",  # enable thinking mode
    )


# Example medical question
medical_question = (
    "Patient: 45-year-old male with a 5-year history of hypertension. Over the past week "
    "he has had chest tightness and shortness of breath that worsen with activity and "
    "ease with rest. What could this be?"
)

# Generate the reasoning and the answer
prompt = build_medical_prompt(medical_question)
sampling_params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate([prompt], sampling_params)

# vLLM returns only the newly generated tokens here, so there is no
# prompt prefix to slice off
output_ids = list(outputs[0].outputs[0].token_ids)

# Split the reasoning from the final answer at the last </think> token
try:
    index = len(output_ids) - output_ids[::-1].index(THINK_END_ID)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
final_answer = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Reasoning:")
print(thinking_content)
print("\nFinal answer:")
print(final_answer)
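When you only have the decoded text, for example a reply that came back through the API service, the same split can be done on the string instead of on token ids. This is a small sketch that assumes the `</think>` marker survives decoding in the text you receive; `split_thinking` is a hypothetical helper, not part of vLLM or Transformers.

```python
THINK_END = "</think>"


def split_thinking(text, marker=THINK_END):
    """Return (reasoning, answer); reasoning is '' when the marker is absent."""
    head, found, tail = text.rpartition(marker)
    if not found:
        return "", text.strip()
    # Drop an opening tag if the model emitted one
    return head.replace("<think>", "").strip(), tail.strip()


reasoning, answer = split_thinking(
    "<think>Exertional chest tightness with hypertension.</think>\nPlease see a cardiologist."
)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Splitting at the last marker mirrors the token-level code above, which also searches from the end of the output.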

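4. Performance Tuning and Troubleshooting

Getting the model to run is only the first step. Making it fast and stable takes some tuning. Here I share some experience from actual use.

4.1 VRAM Optimization Strategies

Even with 4-bit quantization, running a 32B model on a 24GB RTX 4090 is challenging. Here are some ways to reduce VRAM usage:

1. Adjust vLLM's memory-management parameters: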

llm = LLM(
    model="baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
    trust_remote_code=True,
    max_model_len=32768,          # a shorter context saves VRAM
    gpu_memory_utilization=0.85,  # slightly lower, to leave headroom
    swap_space=8,                 # more CPU swap space (GiB)
    block_size=8,                 # smaller KV-cache block size
    enable_prefix_caching=True,   # enable prefix caching
)

2. Use paged attention (PagedAttention):

vLLM uses PagedAttention by default, but you can tune the related parameters:

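# Pass the PagedAttention-related flags when starting the API service.
# Without chunked prefill, vLLM requires --max-num-batched-tokens to be at
# least --max-model-len, so set the two together.
python -m vllm.entrypoints.openai.api_server \
    --model baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 \
    --trust-remote-code \
    --block-size 16 \
    --enable-prefix-caching \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 4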

3. Adjust the batch size:

Adjust the batch size according to the VRAM you have:

# Settings for limited VRAM
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,  # shorter generations
    ignore_eos=False,
)

# Process only a few prompts per call
prompts = ["Question 1", "Question 2"]  # two at a time
outputs = llm.generate(prompts, sampling_params)
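When the list of prompts is longer than the batch size you can afford, it helps to feed them through in fixed-size slices. The sketch below shows the idea with a stand-in generate function so it runs anywhere; `chunked` and `generate_in_batches` are hypothetical helpers, and with vLLM you would pass something like `lambda batch: llm.generate(batch, sampling_params)` as `generate_fn`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_batches(generate_fn, prompts, batch_size=2):
    """Call generate_fn on small batches and collect results in order."""
    results = []
    for batch in chunked(prompts, batch_size):
        results.extend(generate_fn(batch))
    return results


# Stand-in for the model call, so the sketch runs without a GPU
def fake_generate(batch):
    return [f"answer to: {prompt}" for prompt in batch]


questions = ["Q1", "Q2", "Q3", "Q4", "Q5"]
print(generate_in_batches(fake_generate, questions, batch_size=2))
```

Results come back in the same order as the prompts, so they can be zipped with the original questions afterwards.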

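4.2 Inference Speed Optimization

Besides VRAM, inference speed also matters in real use. Here are some ways to speed things up:

1. Use continuous batching (Continuous Batching):

vLLM enables continuous batching by default, and you can tune it through these parameters: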

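llm = LLM(
    model="baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
    trust_remote_code=True,
    max_num_batched_tokens=4096,  # more tokens per batch
    max_num_seqs=8,               # more concurrent sequences
    # First-come-first-served is vLLM's default scheduling order,
    # so no extra argument is needed for it
)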

2. Adjust the generation parameters:

sampling_params = SamplingParams(
    temperature=0.7,          # moderate temperature, balances creativity and consistency
    top_p=0.9,                # nucleus sampling
    top_k=50,                 # limit the number of candidate tokens
    repetition_penalty=1.1,   # repetition penalty
    skip_special_tokens=True, # drop special tokens from the output text
    spaces_between_special_tokens=False,
)

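3. Warm up the model:

After the model is first loaded, one warm-up inference can speed up subsequent requests:

# Warm up right after loading the model
warmup_prompts = ["Warm-up question"] * 2
warmup_outputs = llm.generate(warmup_prompts, SamplingParams(max_tokens=10))
print("Model warm-up complete")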

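4.3 Common Problems and Solutions

You may run into various problems during deployment. Here are some common ones and how to fix them:

Problem 1: CUDA out of memory

This is the most common problem and means VRAM is insufficient.

Solutions:

Reduce max_model_len (for example from 131072 to 32768)
Lower gpu_memory_utilization (for example from 0.9 to 0.8), which helps mainly when other processes share the GPU
Reduce the batch size
Enable swap space: --swap-space 8
Use a smaller block-size (such as 8 or 16)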
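The first item, shrinking max_model_len, is usually a matter of trial and error. The sketch below generates a halving ladder of candidate lengths and tries them in order until one loads. It is only an illustration with a simulated loader: `first_working` and the way a memory error is recognized are assumptions of mine, and in practice each attempt with a real model is better run in a fresh process, since a failed load may leave GPU memory allocated.

```python
def candidate_context_lengths(start=131072, floor=4096):
    """Yield start, start/2, start/4, ... while the value is at least floor."""
    length = start
    while length >= floor:
        yield length
        length //= 2


def first_working(load_fn, candidates, is_memory_error):
    """Return (value, result) for the first candidate that load_fn accepts."""
    for value in candidates:
        try:
            return value, load_fn(value)
        except Exception as exc:
            # Only swallow out-of-memory failures; re-raise anything else
            if not is_memory_error(exc):
                raise
    raise RuntimeError("no candidate context length fit in memory")


# Simulated loader: pretend anything above 8192 tokens runs out of memory
def fake_loader(max_model_len):
    if max_model_len > 8192:
        raise MemoryError("simulated out of memory")
    return f"loaded with max_model_len={max_model_len}"


print(first_working(
    fake_loader,
    candidate_context_lengths(),
    lambda exc: isinstance(exc, MemoryError),
))
```

The same ladder works as a list of values to try by hand with --max-model-len.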

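Problem 2: Model loading fails with "KeyError" or "AttributeError"

This is usually caused by corrupted model files or incompatible versions.

Solutions:

Re-download the model files
Check that the vLLM and Transformers versions are compatible
Try the --enforce-eager flag
Clear the cache: rm -rf ~/.cache/huggingface (this deletes every model cached by Hugging Face, not only this one)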

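Problem 3: Generation is very slow

If inference is clearly slower than expected, the configuration may be at fault.

Solutions:

Monitor GPU utilization: nvidia-smi -l 1
Make sure inference runs on the GPU rather than the CPU
Tune max_num_batched_tokens and max_num_seqs
Check whether other processes are using the GPU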

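Problem 4: Garbled or abnormal Chinese output

Baichuan-M2 is optimized mainly for Chinese, but encoding problems can still occur.

Solutions:

Make sure the system locale is UTF-8: export LANG=C.UTF-8
Set the encoding in Python: import sys; sys.stdout.reconfigure(encoding='utf-8')
Check that the tokenizer loaded its Chinese vocabulary correctly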

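Problem 5: Cannot connect to the API service

The vLLM API service has started but cannot be reached.

Solutions:

Check whether the port is already in use: netstat -tlnp | grep 8000 (or ss -tlnp | grep 8000 if netstat is not installed)
Try a different port: --port 8080
Check the firewall: sudo ufw allow 8000
Check the service log: read the vLLM output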
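A quick way to tell "nothing is listening" apart from "the server is listening but misbehaving" is a plain TCP connection test. This is a minimal sketch using the standard library; `is_port_open` is a hypothetical helper and the defaults match the port used in this article.

```python
import socket


def is_port_open(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True when something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts
        return False


print("vLLM port reachable:", is_port_open())
```

If this prints False on the server itself, the process is not listening and the vLLM log is the place to look. If it is True locally but the service is unreachable from another machine, the firewall or the bind address is the more likely cause.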

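5. Practical Application Examples

To show how Baichuan-M2-32B can be used in real scenarios, I prepared a few medical examples.

5.1 Medical Q&A System

Below is a simple implementation of a medical Q&A system: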

from typing import Dict, List, Optional

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = (
    "You are a professional medical assistant. Give professional medical advice "
    "based on the user's description.\n"
    "Note: your answers are for reference only and cannot replace a diagnosis by a "
    "qualified doctor. If symptoms are severe, seek medical care promptly."
)


class MedicalQASystem:
    def __init__(self, model_path: str = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"):
        """Initialize the medical Q&A system."""
        print("Loading the medical model...")
        self.llm = LLM(
            model=model_path,
            trust_remote_code=True,
            max_model_len=32768,
            gpu_memory_utilization=0.85,
        )
        # Load the tokenizer once instead of on every request
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        # "</think>" is not a stop string, otherwise the answer that follows
        # the reasoning would never be generated
        self.sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=1024,
            stop=["</s>", "User:"],
        )
        print("Model loaded!")

    def build_prompt(self, question: str, history: Optional[List[Dict]] = None) -> str:
        """Build the chat prompt."""
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        # Add history, keeping only the last 5 turns
        for item in (history or [])[-5:]:
            messages.append({"role": "user", "content": item["question"]})
            messages.append({"role": "assistant", "content": item["answer"]})
        # Add the current question
        messages.append({"role": "user", "content": question})
        # Convert to the format the model expects
        return self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    @staticmethod
    def clean_answer(text: str) -> str:
        """Drop any reasoning before </think> and a leading role label."""
        answer = text.split("</think>")[-1]
        return answer.replace("Assistant:", "").strip()

    def ask(self, question: str, history: Optional[List[Dict]] = None) -> Dict:
        """Answer a medical question."""
        try:
            prompt = self.build_prompt(question, history)
            outputs = self.llm.generate([prompt], self.sampling_params)
            answer = self.clean_answer(outputs[0].outputs[0].text)
            return {
                "success": True,
                "question": question,
                "answer": answer,
                "suggestion": "This advice is for reference only. Seek medical care promptly if you feel unwell.",
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "suggestion": "The system cannot process your request right now. Please try again later.",
            }

    def batch_ask(self, questions: List[str]) -> List[Dict]:
        """Answer a batch of questions."""
        prompts = [self.build_prompt(q) for q in questions]
        outputs = self.llm.generate(prompts, self.sampling_params)
        return [
            {"question": q, "answer": self.clean_answer(o.outputs[0].text)}
            for q, o in zip(questions, outputs)
        ]


# Usage example
if __name__ == "__main__":
    # Initialize the system
    qa_system = MedicalQASystem()

    # Single question
    question = "I keep having insomnia: I cannot fall asleep at night and feel drained during the day. Any advice?"
    result = qa_system.ask(question)
    if result["success"]:
        print(f"Question: {result['question']}")
        print(f"Answer: {result['answer']}")
    else:
        print(f"Error: {result['error']}")
    print(f"Note: {result['suggestion']}")

    # Batch of questions
    questions = [
        "What medicine should I take for a cold?",
        "What should people with high blood pressure watch out for?",
        "How can diabetes be prevented?",
    ]
    print("\nBatch Q&A results:")
    batch_results = qa_system.batch_ask(questions)
    for i, res in enumerate(batch_results, 1):
        print(f"\n{i}. Question: {res['question']}")
        print(f"   Answer: {res['answer'][:100]}...")

5.2 Symptom Analysis and Preliminary Diagnosis

For more complex symptom analysis, use chain-of-thought mode:

import re
from typing import Dict

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"

# Load the tokenizer and the model once, not on every call
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
llm = LLM(model=MODEL_ID, trust_remote_code=True)


def analyze_symptoms(patient_info: Dict) -> Dict:
    """Analyze a patient's symptoms and give preliminary advice."""
    # Build a detailed symptom description
    symptom_text = f"""
Basic patient information:
- Age: {patient_info.get('age', 'unknown')}
- Sex: {patient_info.get('gender', 'unknown')}
- Medical history: {patient_info.get('medical_history', 'none')}

Current symptoms:
{patient_info.get('symptoms', '')}

Duration: {patient_info.get('duration', 'unknown')}
Severity: {patient_info.get('severity', 'unknown')}
Accompanying symptoms: {patient_info.get('accompanying_symptoms', 'none')}
"""

    # Build the analysis prompt
    analysis_prompt = f"""As a medical expert, analyze the patient below and provide:
1. Likely conditions
2. Recommended tests
3. Interim measures
4. Urgency of seeing a doctor (high/medium/low)

Patient information:
{symptom_text}

Answer in this format:
【Likely conditions】...
【Recommended tests】...
【Interim measures】...
【Urgency】...

Note: your analysis is for reference only; the final diagnosis must be made by a qualified doctor."""

    # Use chain-of-thought mode
    messages = [{"role": "user", "content": analysis_prompt}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        thinking_mode="on",  # enable thinking mode
    )

    # Generate the analysis. There is no "</think>" stop string:
    # generation has to continue past the reasoning to reach the answer.
    sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=2048)
    outputs = llm.generate([prompt], sampling_params)
    # Keep only the text after the reasoning block
    full_response = outputs[0].outputs[0].text.split("</think>")[-1]

    # Parse the 【section】 headers; content may follow on the same line
    response_parts = {}
    current_section = None
    for line in full_response.split("\n"):
        line = line.strip()
        match = re.match(r"^【(.+?)】\s*(.*)$", line)
        if match:
            current_section = match.group(1)
            response_parts[current_section] = match.group(2)
        elif current_section and line:
            response_parts[current_section] += "\n" + line
    response_parts = {key: value.strip() for key, value in response_parts.items()}

    return {
        "patient_info": patient_info,
        "analysis": response_parts,
        "disclaimer": "This analysis was generated by an AI model and is for reference only. "
                      "Seek medical care promptly if you feel unwell.",
    }


# Usage example
patient_case = {
    "age": "45",
    "gender": "male",
    "medical_history": "hypertension for 5 years, on regular medication",
    "symptoms": "Chest tightness and shortness of breath over the past week, worse with "
                "activity and relieved by rest, with mild chest pain.",
    "duration": "1 week",
    "severity": "moderate",
    "accompanying_symptoms": "occasional dizziness, fatigue",
}

result = analyze_symptoms(patient_case)
print("Symptom analysis report:")
print("=" * 50)
for key, value in result["analysis"].items():
    print(f"\n{key}:")
    print(value)
print("\n" + "=" * 50)
print(result["disclaimer"])

5.3 Integrating with Existing Systems

If you want to integrate Baichuan-M2 into an existing system, here is a Flask API example:

import queue
import threading
import time
import uuid

from flask import Flask, jsonify, request
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"

app = Flask(__name__)

# Global model pool and request queue
model_pool = []
request_queue = queue.Queue()

# Load the tokenizer once at startup rather than on every request
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


class ModelWorker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.llm = None
        self.sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
        self.load_model()

    def load_model(self):
        """Load the model."""
        print(f"Worker {self.worker_id}: loading model...")
        self.llm = LLM(
            model=MODEL_ID,
            trust_remote_code=True,
            max_model_len=32768,
            gpu_memory_utilization=0.8,
        )
        print(f"Worker {self.worker_id}: model loaded")

    def process(self, prompt):
        """Handle one request."""
        try:
            outputs = self.llm.generate([prompt], self.sampling_params)
            return outputs[0].outputs[0].text.strip()
        except Exception as e:
            return f"Processing failed: {e}"


def init_model_pool(pool_size=1):
    """Initialize the model pool.

    Every worker loads a full copy of the model, so keep pool_size at 1
    unless each worker has its own GPU.
    """
    for i in range(pool_size):
        model_pool.append(ModelWorker(i))
    print(f"Model pool initialized with {pool_size} worker(s)")


def worker_thread(worker):
    """Worker loop: take tasks off the queue and run them."""
    while True:
        try:
            task_id, prompt, callback = request_queue.get(timeout=1)
        except queue.Empty:
            continue
        try:
            start_time = time.time()
            result = worker.process(prompt)
            callback({
                "task_id": task_id,
                "result": result,
                "worker_id": worker.worker_id,
                "response_time": time.time() - start_time,
            })
        except Exception as e:
            print(f"Worker {worker.worker_id} error: {e}")


# Initialize the pool and start one thread per worker
init_model_pool()
for worker in model_pool:
    threading.Thread(target=worker_thread, args=(worker,), daemon=True).start()


@app.route("/health", methods=["GET"])
def health_check():
    """Health check."""
    return jsonify({
        "status": "healthy",
        "model": "Baichuan-M2-32B-GPTQ-Int4",
        "workers": len(model_pool),
        "queue_size": request_queue.qsize(),
    })


@app.route("/ask", methods=["POST"])
def ask_question():
    """Q&A endpoint."""
    data = request.get_json(silent=True) or {}
    question = data.get("question", "")
    history = data.get("history", [])
    if not question:
        return jsonify({"error": "The question must not be empty"}), 400

    # Build the prompt, keeping the last 3 turns of history
    messages = []
    for h in history[-3:]:
        messages.append({"role": "user", "content": h.get("question", "")})
        messages.append({"role": "assistant", "content": h.get("answer", "")})
    messages.append({"role": "user", "content": question})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Create and submit the task; the worker puts its result on result_queue
    task_id = uuid.uuid4().hex
    result_queue = queue.Queue()
    request_queue.put((task_id, prompt, result_queue.put))

    # Wait for the result (with a timeout)
    try:
        result = result_queue.get(timeout=30)
    except queue.Empty:
        return jsonify({"error": "Request timed out"}), 504

    return jsonify({
        "success": True,
        "task_id": task_id,
        "answer": result["result"],
        "worker_id": result["worker_id"],
        "response_time": result["response_time"],
        "disclaimer": "This answer was generated by AI, is for reference only, and cannot replace professional medical advice.",
    })


if __name__ == "__main__":
    print("Starting the medical Q&A API service...")
    print("Visit http://localhost:5000/health to check the service status")
    app.run(host="0.0.0.0", port=5000, threaded=True)

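This Flask service exposes a simple REST API that is easy to integrate into existing systems. Requests are placed on a queue and picked up by worker threads, each of which owns a model instance. Because every worker loads a full copy of the model, a single-GPU machine should run a single worker.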
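The history handling in the /ask endpoint is a good candidate for a small, separately testable function. Here is a sketch of that step in isolation: it keeps the most recent turns and produces the `messages` list that a chat template expects. `build_messages` and its parameters are hypothetical names of mine; the `question`/`answer` keys match the request format used in the Flask example.

```python
def build_messages(question, history=None, max_turns=3, system_prompt=None):
    """Build a chat message list from the latest turns plus the new question."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # A slice of [-0:] would keep everything, so handle zero explicitly
    recent = (history or [])[-max_turns:] if max_turns > 0 else []
    for turn in recent:
        messages.append({"role": "user", "content": turn.get("question", "")})
        messages.append({"role": "assistant", "content": turn.get("answer", "")})
    messages.append({"role": "user", "content": question})
    return messages


history = [
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
    {"question": "q3", "answer": "a3"},
    {"question": "q4", "answer": "a4"},
]
for message in build_messages("q5", history, max_turns=2):
    print(message)
```

Capping the history this way keeps the prompt length predictable, which matters when max_model_len has been reduced to fit the GPU.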

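6. Summary

Having worked through the whole deployment, you should now have Baichuan-M2-32B-GPTQ-Int4 running on Ubuntu 20.04. In my experience the key points are: make sure the CUDA driver and PyTorch versions match, configure vLLM's memory parameters sensibly, and adjust the batch size to your hardware.

In actual use the model performs well on medical Q&A, and its grasp of Chinese medical scenarios is particularly solid. Keep in mind, though, that it is an AI model: its advice is for reference only and cannot replace a diagnosis by a qualified doctor. For specific medication or treatment plans, always consult a medical professional.

Deploying large models is real technical work, and unexpected problems will come up. If you run into other issues while following this tutorial, or have better tuning suggestions, please share them. The strength of a technical community lies in people helping each other solve problems.

One last reminder: running the model uses a lot of VRAM and memory, so watch cooling and system stability during long sessions. In summer especially, make sure the server is well cooled to avoid throttling or hardware damage from overheating.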
