Stop Cutting Out Images by Hand! Batch-Process Image Segmentation Datasets with Labelme + Python Scripts (Full Code Included)

Labelme Automation in Practice: Batch-Processing Image Segmentation Datasets with Python Scripts

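In computer vision projects, data annotation is often the most time-consuming stage. With hundreds or even thousands of images to label, manual work is not only slow, but fatigue also tends to lower annotation quality. Labelme is an open-source image annotation tool with a friendly graphical interface, but it lacks batch-processing capabilities. This article shows how to extend Labelme with Python scripts to automate the full workflow, from annotation to dataset conversion.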

1. Environment Setup and Basic Preparation

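Before starting the automated workflow, make sure the development environment is set up correctly. Python 3.7 or later is recommended, with the following dependencies installed: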

pip install labelme pyqt5 numpy pillow opencv-python imgviz -i https://pypi.tuna.tsinghua.edu.cn/simple

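Labelme exposes its core functionality through a command-line interface, which is what makes automation possible. You can verify the basic installation with the following commands: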

# Test the Labelme installation
labelme --version
# Test the JSON conversion feature
labelme_json_to_dataset --help

To make the later batch operations easier, organize the project directory as follows:

project_root/
├── raw_images/        # raw images
├── labeled_json/      # JSON files produced by annotation
├── datasets/          # converted datasets
│   ├── VOC/           # VOC format
│   └── COCO/          # COCO format
└── scripts/           # automation scripts
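The layout above can be created in one step. The following is a minimal sketch using only the standard library; `init_project` and `PROJECT_DIRS` are illustrative names, not part of Labelme.

```python
from pathlib import Path

# Sub-directories taken from the layout above, relative to the project root.
PROJECT_DIRS = [
    "raw_images",
    "labeled_json",
    "datasets/VOC",
    "datasets/COCO",
    "scripts",
]


def init_project(root):
    """Create the project skeleton and return the directory paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created
```

Calling `init_project("project_root")` a second time leaves existing directories and their contents untouched.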

2. Batch Annotation and JSON Generation

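With a large number of images, opening each one by hand is clearly impractical. Labelme's command-line arguments allow a semi-automated annotation workflow: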

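import subprocess
from pathlib import Path

def batch_labeling(image_dir, output_dir):
    """Open the Labelme annotation window for each image in turn."""
    image_dir = Path(image_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for img_file in sorted(image_dir.glob("*.jpg")):
        out_json = output_dir / f"{img_file.stem}.json"
        # Quote the paths so that file names containing spaces survive the shell
        cmd = f'labelme "{img_file}" -O "{out_json}" --autosave'
        # Blocks until the window is closed, then moves on to the next image
        subprocess.run(cmd, shell=True)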

This basic script:

Loads each image in the directory automatically
Enables autosave of the JSON file
Keeps the annotation itself interactive
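Annotation sessions are usually interrupted and resumed, so it helps to skip images that already have a JSON file. The following is a minimal sketch under the assumption that each JSON file shares its image's base name; `pending_images` and `IMAGE_SUFFIXES` are hypothetical names introduced here.

```python
from pathlib import Path

# Assumed set of image formats; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def pending_images(image_dir, json_dir):
    """Return images that have no same-named JSON file yet, sorted by path."""
    image_dir, json_dir = Path(image_dir), Path(json_dir)
    done = {p.stem for p in json_dir.glob("*.json")}
    return sorted(
        p
        for p in image_dir.iterdir()
        if p.is_file()
        and p.suffix.lower() in IMAGE_SUFFIXES
        and p.stem not in done
    )
```

Looping over `pending_images("raw_images", "labeled_json")` instead of `image_dir.glob("*.jpg")` lets the annotation loop pick up where the previous session stopped.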

Advanced tip: the --labels parameter predefines the label list, which keeps annotations consistent:

cmd = f'labelme "{img_file}" -O "{output_dir / img_file.stem}.json" --labels labels.txt --nodata'

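Tip: adding --nodata reduces the JSON file size because the image data is not embedded in the file. The JSON then refers to the image only by path, so keep the image files where the JSON expects them.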

3. Batch JSON Conversion and Format Standardization

The JSON files produced by Labelme must be converted into a format that training code can consume. Common conversion targets include:

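Conversion type	Output	Use case
Native conversion	PNG label map	Simple segmentation tasks
VOC format	Multi-directory layout	Pascal VOC compatibility
COCO format	Single JSON file	Large datasets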
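To make the "single JSON file" row concrete, here is a minimal sketch that merges Labelme JSON files into one COCO-style dictionary using only the standard library. It is an illustration, not Labelme's own converter: `labelme_to_coco` and `polygon_area` are hypothetical helpers, only polygon shapes are handled, and the area is computed from the polygon outline rather than from a rasterized mask.

```python
import json
from pathlib import Path


def polygon_area(points):
    """Shoelace formula for a simple polygon given as [[x, y], ...]."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def labelme_to_coco(json_files, class_names):
    """Build a COCO-style dict from Labelme JSON files (polygons only)."""
    cat_ids = {name: i + 1 for i, name in enumerate(class_names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for img_id, json_file in enumerate(sorted(map(Path, json_files)), start=1):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        coco["images"].append({
            "id": img_id,
            "file_name": data["imagePath"],
            "width": data["imageWidth"],
            "height": data["imageHeight"],
        })
        for shape in data["shapes"]:
            pts = shape["points"]
            # Skip non-polygon shapes and labels outside class_names
            if shape.get("shape_type", "polygon") != "polygon" or len(pts) < 3:
                continue
            if shape["label"] not in cat_ids:
                continue
            xs = [p[0] for p in pts]
            ys = [p[1] for p in pts]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[shape["label"]],
                "segmentation": [[c for p in pts for c in p]],  # flat x, y list
                "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
                "area": polygon_area(pts),
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

The result can be written with `json.dump(coco, f)` into the datasets/COCO directory. Shapes with unknown labels are skipped silently here; a stricter variant could raise instead.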

3.1 Batch Conversion to the Native Format

Labelme's bundled labelme_json_to_dataset converts one JSON file per call, so a thread pool is used to run it over a whole batch:

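import concurrent.futures
import subprocess
from pathlib import Path

def convert_to_dataset(json_files, output_dir):
    """Convert JSON files to dataset folders using a thread pool."""
    output_dir = Path(output_dir)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for json_file in map(Path, json_files):
            output_path = output_dir / json_file.stem
            cmd = f'labelme_json_to_dataset "{json_file}" -o "{output_path}"'
            # check=True turns a non-zero exit status into CalledProcessError
            futures.append(executor.submit(subprocess.run, cmd, shell=True, check=True))
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raises any error from the worker thread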
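With hundreds of files it is often more useful to let the whole batch finish and report the failures at the end. The following is a minimal sketch of that pattern; `run_batch` is a hypothetical helper, and each command is assumed to be an argument list rather than a shell string.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_batch(commands, max_workers=4):
    """Run each command (an argument list) and return the ones that failed.

    Each failure is a (command, returncode, stderr) tuple. A command that
    cannot be started at all is reported with returncode -1.
    """
    def run_one(cmd):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:  # e.g. executable not found
            return cmd, -1, str(exc)
        return cmd, proc.returncode, proc.stderr

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(run_one, commands))
    return [r for r in results if r[1] != 0]
```

For the conversion step, the command list could be built as `[["labelme_json_to_dataset", str(p), "-o", str(out_dir / p.stem)] for p in json_files]`, where `out_dir` is whatever output directory the project uses. Threads are sufficient because the real work happens in the child processes.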

3.2 Enhanced VOC Conversion

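The labelme2voc.py example script from the Labelme repository sometimes needs extra functionality. The core of an improved version looks like this: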

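from pathlib import Path

def enhance_labelme2voc(input_dir, output_dir, labels_file):
    """Enhanced VOC conversion."""
    output_dir = Path(output_dir)
    # Create the standard VOC directory layout
    voc_dirs = ['Annotations', 'ImageSets', 'JPEGImages', 'SegmentationClass']
    for d in voc_dirs:
        (output_dir / d).mkdir(parents=True, exist_ok=True)
    # Process each JSON file
    # NOTE: process_single_file and generate_colormap are helpers not shown in this article
    for json_file in sorted(Path(input_dir).glob('*.json')):
        # Conversion logic
        img_data = process_single_file(json_file, output_dir)
        # Append the sample name to ImageSets (mode 'a' adds to any existing file)
        with open(output_dir / 'ImageSets' / 'trainval.txt', 'a') as f:
            f.write(f"{json_file.stem}\n")
    # Generate the color map file
    generate_colormap(output_dir, labels_file)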

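Key improvements:

Generates the complete VOC directory structure
Maintains the ImageSets split file automatically
Adds color map information
Supports multi-threaded processing (not shown in the excerpt above)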
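Maintaining the ImageSets files usually also means splitting the samples into training and validation sets. The following is a minimal sketch of a reproducible split; `write_image_sets`, the 20% validation ratio, and the file names train.txt and val.txt are assumptions made for illustration.

```python
import random
from pathlib import Path


def write_image_sets(json_dir, image_sets_dir, val_ratio=0.2, seed=0):
    """Write train.txt, val.txt and trainval.txt from the JSON file stems."""
    stems = sorted(p.stem for p in Path(json_dir).glob("*.json"))
    shuffled = stems[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a reproducible split
    n_val = int(len(shuffled) * val_ratio)
    splits = {
        "val": sorted(shuffled[:n_val]),
        "train": sorted(shuffled[n_val:]),
        "trainval": stems,
    }
    image_sets_dir = Path(image_sets_dir)
    image_sets_dir.mkdir(parents=True, exist_ok=True)
    for name, items in splits.items():
        # write_text overwrites, so re-running does not duplicate entries
        (image_sets_dir / f"{name}.txt").write_text(
            "".join(f"{s}\n" for s in items), encoding="utf-8"
        )
    return splits
```

Because the files are rewritten on every run, the lists always reflect the current contents of the JSON directory.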

4. Advanced Techniques and Quality Control

Batch processing calls for particular attention to data quality. Here are several practical techniques:

4.1 Automatic Visual Verification

Generating preview images of the annotations helps spot labeling problems quickly:

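import cv2
from pathlib import Path

def generate_visualizations(json_dir, output_dir):
    """Generate annotation preview images for visual comparison."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for json_file in sorted(Path(json_dir).glob('*.json')):
        # visualize_annotation is a helper not shown in this article;
        # it is expected to return an image array that cv2.imwrite can save
        img_viz = visualize_annotation(json_file)
        output_path = output_dir / f"{json_file.stem}_viz.jpg"
        cv2.imwrite(str(output_path), img_viz)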

4.2 Label Consistency Check

Check in bulk that the labels in all JSON files conform to the agreed set:

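import json

def validate_labels(json_files, allowed_labels):
    """Return the JSON files that contain at least one label outside allowed_labels."""
    error_files = []
    for json_file in json_files:
        with open(json_file, encoding="utf-8") as f:
            data = json.load(f)
        for shape in data['shapes']:
            if shape['label'] not in allowed_labels:
                error_files.append(json_file)
                break  # one bad label is enough to flag the file
    return error_files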
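When fixing annotations it also helps to know which labels are wrong, not only which files. The following is a minimal sketch that reads the allowed set from a labels file with one label per line and reports the offending labels per file; `load_allowed_labels` and `find_unknown_labels` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_allowed_labels(labels_file):
    """Read one label per line, ignoring blank lines."""
    text = Path(labels_file).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def find_unknown_labels(json_dir, allowed_labels):
    """Map each offending JSON file name to the set of labels not allowed."""
    allowed = set(allowed_labels)
    report = {}
    for json_file in sorted(Path(json_dir).glob("*.json")):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        unknown = {s["label"] for s in data.get("shapes", [])} - allowed
        if unknown:
            report[json_file.name] = unknown
    return report
```

Typos such as "Cat" versus "cat" show up directly in the report, which makes them quick to correct in the Labelme GUI.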

4.3 Dataset Statistics

Generate a statistical report on the annotated data:

import json
from collections import defaultdict
from pathlib import Path

def generate_stats_report(json_dir):
    """Collect basic dataset statistics."""
    stats = {
        'total_images': 0,
        'per_class_counts': defaultdict(int),
        'avg_objs_per_image': 0
    }
    for json_file in Path(json_dir).glob('*.json'):
        with open(json_file, encoding="utf-8") as f:
            data = json.load(f)
        stats['total_images'] += 1
        for shape in data['shapes']:
            stats['per_class_counts'][shape['label']] += 1
    # Guard against an empty directory to avoid division by zero
    if stats['total_images']:
        stats['avg_objs_per_image'] = sum(stats['per_class_counts'].values()) / stats['total_images']
    return stats

5. Integrating and Optimizing the Full Workflow

Combine the modules above into an end-to-end automated workflow:

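class LabelmeAutoProcessor:
    # The methods called below (validate_dirs, batch_labeling, convert_to_voc,
    # quality_check, generate_report) are not shown in this excerpt
    def __init__(self, config):
        self.config = config
        self.validate_dirs()

    def run_pipeline(self):
        # 1. Batch annotation
        if self.config['do_labeling']:
            self.batch_labeling()
        # 2. Format conversion
        if self.config['to_voc']:
            self.convert_to_voc()
        # 3. Quality check
        if self.config['do_qa']:
            self.quality_check()
        # 4. Generate the statistics report
        self.generate_report()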
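The class above reads its switches from a config dictionary but does not show how that dictionary is checked. The following is a minimal sketch of one possible validation step. The flag names come from run_pipeline, while the directory keys image_dir, json_dir, and output_dir are hypothetical and introduced only for illustration.

```python
from pathlib import Path

# Hypothetical directory keys; the boolean switches are the ones run_pipeline reads.
INPUT_DIR_KEYS = ("image_dir",)
OUTPUT_DIR_KEYS = ("json_dir", "output_dir")
FLAG_KEYS = ("do_labeling", "to_voc", "do_qa")


def validate_config(config):
    """Check the config, create output directories, return a normalized copy."""
    missing = [k for k in INPUT_DIR_KEYS + OUTPUT_DIR_KEYS if k not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    # Flags that are absent default to False
    normalized = {k: bool(config.get(k, False)) for k in FLAG_KEYS}
    for key in INPUT_DIR_KEYS:
        path = Path(config[key])
        if not path.is_dir():
            raise FileNotFoundError(f"{key} does not exist: {path}")
        normalized[key] = path
    for key in OUTPUT_DIR_KEYS:
        path = Path(config[key])
        path.mkdir(parents=True, exist_ok=True)
        normalized[key] = path
    return normalized
```

Failing early on a missing input directory avoids opening the annotation loop or a long conversion run only to discover the problem halfway through.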

Characteristics of the optimized workflow: