Stop Just Tuning K! A Hands-On Guide to Dynamically Visualizing the Whole K-Means Clustering Process with Python's Matplotlib

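Building a Dynamic K-Means Visualization Lab with Matplotlib

When data points are scattered like stars across the night sky, K-Means is the guide that finds each of them a home. Traditional teaching, however, often stops at static diagrams, so learners miss the most fascinating part of the algorithm: the paths the centroids trace as they jump from iteration to iteration, and the moments when points waver before switching clusters. This article walks you through building a dynamic visualization lab with Matplotlib, so you can watch the algorithm adjust itself until it uncovers the structure hidden in the data.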

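1. Environment Setup and Data Generation

Before writing any code, we need the core tools of the Python data science ecosystem. Rather than importing a ready-made dataset, we will control every step, starting with data generation: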

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from matplotlib.animation import FuncAnimation

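Synthetic data is an ideal starting point for understanding a clustering algorithm. The make_blobs function creates a dataset with a clear cluster structure, which lets us check our implementation against a known ground truth: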

# Generate 2-D data with 3 clearly separated clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Raw unlabeled data")
plt.show()

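Table: key make_blobs parameters

Parameter	Description	Suggested value
n_samples	Total number of samples	100-1000
centers	Number of cluster centers	As needed
cluster_std	Standard deviation of each cluster	0.5-1.5
random_state	Random seed	Fix a value for reproducibility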

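Tip: in real projects you may need to standardize the data first (for example with StandardScaler), especially when features are measured on very different scales. K-Means relies on Euclidean distance, so a feature with a large numeric range would otherwise dominate the cluster assignment.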
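To make the tip concrete, here is a minimal NumPy sketch of z-score standardization, which is roughly what StandardScaler computes with its default settings. The standardize helper and the toy height/income values are hypothetical and only serve as an illustration:

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance (z-score)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0] = 1.0  # avoid dividing by zero; constant columns become all zeros
    return (X - mean) / std

# Hypothetical data: height in metres vs. income in dollars.
# Without scaling, income would dominate every Euclidean distance.
raw = np.array([[1.60, 30000.0],
                [1.75, 52000.0],
                [1.82, 41000.0],
                [1.68, 87000.0]])
scaled = standardize(raw)

print(scaled.mean(axis=0).round(6))  # each column is centered near 0
print(scaled.std(axis=0).round(6))   # each column has unit spread
```

After scaling, both features contribute on a comparable footing to the distances that K-Means uses.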

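2. Implementing the Core K-Means Algorithm

The best way to understand an algorithm is to implement it yourself. We will build a complete K-Means class, with a focus on saving the intermediate states that the visualization needs: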

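class VisualKMeans:
    def __init__(self, k=3, max_iter=100, random_state=None):
        self.k = k
        self.max_iter = max_iter
        # Seedable generator; pass random_state for reproducible runs
        self.rng = np.random.default_rng(random_state)
        self.centroids_history = []  # centroid positions at each iteration
        self.labels_history = []     # label assignments at each iteration

    def initialize_centroids(self, X):
        # Pick k distinct data points at random as the initial centroids
        indices = self.rng.choice(len(X), self.k, replace=False)
        return X[indices]

    def compute_distances(self, X, centroids):
        # Euclidean distance from every point to every centroid, shape (n_samples, k)
        return np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)

    def assign_clusters(self, distances):
        # Assign each point to its nearest centroid
        return np.argmin(distances, axis=1)

    def update_centroids(self, X, labels):
        # New centroid = mean of the cluster. An empty cluster keeps its old
        # centroid, because the mean of zero points would be NaN.
        return np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else self.centroids[i]
            for i in range(self.k)
        ])

    def fit(self, X):
        # Reset the histories so that calling fit again starts clean
        self.centroids_history = []
        self.labels_history = []
        self.centroids = self.initialize_centroids(X)

        for _ in range(self.max_iter):
            distances = self.compute_distances(X, self.centroids)
            labels = self.assign_clusters(distances)

            # Save the current state for visualization
            self.centroids_history.append(self.centroids.copy())
            self.labels_history.append(labels.copy())

            new_centroids = self.update_centroids(X, labels)

            # Converged once the centroids stop moving
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

        return self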

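This implementation deliberately keeps centroids_history and labels_history to record the state at every step, which is the key to the dynamic visualization. Compared with scikit-learn's implementation, our version gives up some performance optimizations but gains a complete record of the iterations.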
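One thing a recorded history makes possible is measuring how far the centroids move between iterations. The sketch below assumes the history is a list of (k, n_features) arrays, like centroids_history; the centroid_shifts helper and the toy history are hypothetical:

```python
import numpy as np

def centroid_shifts(centroids_history):
    """Largest distance moved by any centroid between consecutive snapshots."""
    history = np.asarray(centroids_history, dtype=float)  # (n_iter, k, n_features)
    if len(history) < 2:
        return np.array([])
    step_vectors = np.diff(history, axis=0)  # movement of each centroid per step
    return np.linalg.norm(step_vectors, axis=2).max(axis=1)

# Hypothetical history: two centroids over three recorded iterations
history = [
    np.array([[0.0, 0.0], [4.0, 4.0]]),
    np.array([[3.0, 4.0], [4.0, 4.0]]),
    np.array([[3.0, 4.0], [4.0, 5.0]]),
]
shifts = centroid_shifts(history)
print(shifts.tolist())  # → [5.0, 1.0]
```

Shifts that shrink toward zero indicate that the run is converging.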

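3. Static Step-by-Step Visualization

Before building the animation, we first show key iterations as static plots. This approach works particularly well in slide decks and technical reports: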

def plot_kmeans_steps(X, model, step=0):
    centroids = model.centroids_history[step]
    labels = model.labels_history[step]

    plt.figure(figsize=(10, 6))

    # Data points, colored by their current label
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis', alpha=0.6)

    # Current centroids
    plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.8, marker='X')

    # From the second step on, draw how each centroid moved since the previous step
    if step > 0:
        prev_centroids = model.centroids_history[step-1]
        for i in range(model.k):
            plt.plot([prev_centroids[i, 0], centroids[i, 0]],
                     [prev_centroids[i, 1], centroids[i, 1]],
                     'r--', linewidth=1)

    plt.title(f'Iteration {step+1}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Train the model and show the key steps
model = VisualKMeans(k=3).fit(X)

# First few steps plus the last one; clamp and de-duplicate the indices
# in case the run converged in fewer than three iterations
last = len(model.centroids_history) - 1
selected_steps = sorted({min(s, last) for s in (0, 1, 2, last)})
for step in selected_steps:
    plot_kmeans_steps(X, model, step)

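These plots make three things clear:

How the random initial centroids shape the first cluster assignment
How each iteration moves the centroids toward the center of their clusters
How, at convergence, the centroids come to rest at the center of each cluster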

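4. Creating the Animated Visualization

Static plots are already persuasive, but an animation makes the process even easier to grasp. We will use Matplotlib's animation module to build an animated view of the iterations: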

def create_kmeans_animation(X, model):
    fig, ax = plt.subplots(figsize=(10, 6))
    n_frames = len(model.centroids_history)

    def update(step):
        ax.clear()
        centroids = model.centroids_history[step]
        labels = model.labels_history[step]

        # Data points
        scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis', alpha=0.6)

        # Centroids
        centroid_plot = ax.scatter(centroids[:, 0], centroids[:, 1],
                                   c='red', s=200, alpha=0.8, marker='X')

        # Each centroid's full trajectory up to the current step
        for i in range(model.k):
            for s in range(step):
                start = model.centroids_history[s][i]
                end = model.centroids_history[s+1][i]
                ax.plot([start[0], end[0]], [start[1], end[1]], 'r--', linewidth=1)

        ax.set_title(f'K-Means clustering - iteration {step+1}/{n_frames}')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
        return scatter, centroid_plot

    anim = FuncAnimation(fig, update, frames=n_frames, interval=800, blit=False)
    plt.close()  # keeps notebooks from also showing a static copy of the figure
    return anim

# Create and save the animation
anim = create_kmeans_animation(X, model)
anim.save('kmeans_animation.gif', writer='pillow', fps=2)

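The animation generated by this code shows:

The randomly chosen initial centroids
Data points being assigned to their nearest centroid
Centroids moving to the mean position of their cluster
Data points being reassigned based on the new centroids
The process repeating until the centroids stabilize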
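The steps above can be traced by hand on a tiny dataset. The following sketch is not the VisualKMeans class itself; kmeans_step and the four toy points are hypothetical and small enough to verify on paper:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One assignment step plus one update step; returns (labels, new_centroids)."""
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = np.array([
        # An empty cluster keeps its previous centroid
        X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
        for i in range(len(centroids))
    ])
    return labels, new_centroids

# Two points on the left, two on the right
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
init = np.array([[0.0, 0.0], [10.0, 0.0]])  # initial centroids picked from the data

labels1, centroids1 = kmeans_step(X_toy, init)
print(labels1.tolist())     # → [0, 0, 1, 1]
print(centroids1.tolist())  # → [[0.0, 1.0], [10.0, 1.0]]

# A second step changes nothing, so the run has converged
labels2, centroids2 = kmeans_step(X_toy, centroids1)
print(np.allclose(centroids1, centroids2))  # → True
```

Each frame of the animation corresponds to one such step: the colors show the labels and the red markers show the centroids.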

Note: saving the animation as a GIF requires the pillow library: pip install pillow

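5. Visual Diagnostics and Choosing K

Dynamic visualization is more than a teaching aid; it is also a powerful way to diagnose how the algorithm behaves. Comparing the clustering results for different values of K gives a more intuitive feel for what the elbow method captures: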

def plot_kmeans_for_ks(X, k_values):
    plt.figure(figsize=(15, 10))
    n_rows = (len(k_values) + 1) // 2  # two panels per row

    for i, k in enumerate(k_values, 1):
        model = VisualKMeans(k=k).fit(X)
        final_labels = model.labels_history[-1]
        final_centroids = model.centroids_history[-1]

        plt.subplot(n_rows, 2, i)
        plt.scatter(X[:, 0], X[:, 1], c=final_labels, s=50, cmap='viridis')
        plt.scatter(final_centroids[:, 0], final_centroids[:, 1],
                    c='red', s=200, alpha=0.8, marker='X')
        plt.title(f'Clustering result for K={k}')

    plt.tight_layout()
    plt.show()

# Try different values of K
plot_kmeans_for_ks(X, [2, 3, 4, 5])

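Typical behavior for different values of K:

K too small: clearly distinct clusters are forced together
K appropriate: the natural groups in the data are captured sensibly
K too large: a single natural cluster is split unnecessarily, or a cluster appears that holds only a few outliers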

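In a typical run, these plots show that K=3 recovers the true structure we used to generate the data, K=2 merges two clusters that should be separate, and K=4 and K=5 over-segment the data. Because the centroids are initialized at random, an unlucky start can leave even K=3 stuck in a poor local optimum, so it is worth rerunning a few times before drawing conclusions.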
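The elbow method can also be expressed numerically. The sketch below assumes the usual criterion, the within-cluster sum of squared distances (often called inertia), and uses a compact stand-alone K-Means with random restarts rather than the VisualKMeans class. The helpers simple_kmeans, inertia, and best_inertia, and the synthetic three-blob data, are all hypothetical:

```python
import numpy as np

def simple_kmeans(X, k, seed=0, max_iter=100):
    """Plain Lloyd iterations from a random start; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = distances.argmin(axis=1)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Final labels consistent with the returned centroids
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return distances.argmin(axis=1), centroids

def inertia(X, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

def best_inertia(X, k, n_restarts=20):
    """Lowest inertia over several random starts, to dodge poor local optima."""
    return min(inertia(X, *simple_kmeans(X, k, seed=s)) for s in range(n_restarts))

# Synthetic data: three well-separated Gaussian blobs (assumed for this demo)
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [8.0, 0.0], [4.0, 7.0]])
X_demo = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in blob_centers])

inertias = {k: best_inertia(X_demo, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(f'K={k}: inertia={value:.1f}')
```

On data like this, inertia should fall steeply up to K=3 and only slowly afterwards; that bend in the curve is the "elbow".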

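6. Advanced Visualization Techniques and Practical Advice

With the basics in place, we can make the presentation more polished and professional:

Multi-view comparison: show the raw data, the iteration process, and the final result in a single figure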

def plot_kmeans_summary(X, model):
    fig = plt.figure(figsize=(15, 5))

    # Raw data
    ax1 = fig.add_subplot(131)
    ax1.scatter(X[:, 0], X[:, 1], s=50)
    ax1.set_title('Raw data')

    # Iteration process (first and last recorded steps)
    ax2 = fig.add_subplot(132)
    ax2.scatter(X[:, 0], X[:, 1], c=model.labels_history[0], s=50, cmap='viridis')
    ax2.scatter(model.centroids_history[0][:, 0], model.centroids_history[0][:, 1],
                c='red', s=200, marker='X')
    ax2.set_title('First iteration')

    ax3 = fig.add_subplot(133)
    ax3.scatter(X[:, 0], X[:, 1], c=model.labels_history[-1], s=50, cmap='viridis')
    ax3.scatter(model.centroids_history[-1][:, 0], model.centroids_history[-1][:, 1],
                c='red', s=200, marker='X')
    ax3.set_title(f'Final iteration ({len(model.centroids_history)} in total)')

    plt.tight_layout()
    plt.show()

plot_kmeans_summary(X, model)

Practical advice: