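Logistic regression analysis of breast cancer data. Using the open breast tumor dataset collected by University of Wisconsin Hospitals (USA) as an example, apply logistic regression to classify the test-set samples as benign or malignant tumors, and analyze the confusion matrix of the predictions. The dataset has 699 samples, with 444 benign cases (65%) and 239 malignant cases (35%); the data format is shown in the table. (Note that 444 + 239 = 683, not 699; these two counts appear to describe the records that remain after rows with missing values are removed.)

Below are the medical meaning and value interpretation of each field in the breast cancer dataset (based on the classic Wisconsin breast cancer dataset; the fields listed match its "Original" version rather than the "Diagnostic" one).

1. Sample code number. Meaning: unique identifier of the case. Values: a numeric ID (no medical meaning; used only to identify the sample).
2. Clump Thickness. Meaning: thickness of the clumps formed by aggregated cells. Values: 1-10. Interpretation: the larger the value, the thicker the clump (higher likelihood of malignancy).
3. Uniformity of Cell Size. Meaning: how uniform the cell sizes are. Values: 1-10. Interpretation: the larger the value, the greater the variation in cell size (higher likelihood of malignancy).
4. Uniformity of Cell Shape. Meaning: how consistent the cell shapes are. Values: 1-10. Interpretation: values …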

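Below is a code implementation that follows the experiment steps, together with an analysis of the results.

```python
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression

# Load the dataset (the file name breast_cancer_data.csv is an assumption)
data = pd.read_csv('breast_cancer_data.csv', header=None)
data.columns = [
    'Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
    'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
    'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class'
]

# Inspect the data (info() prints its summary itself)
data.info()

# Handle missing or abnormal data: missing entries are marked with '?'
data = data.replace('?', np.nan)  # replace '?' with NaN
data['Bare Nuclei'] = data['Bare Nuclei'].astype(float)  # convert to float
# Fill missing values with the median. Assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about.
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].median())

# Count the 2 (benign) and 4 (malignant) values in the 'Class' column and show a pie chart
class_counts = data['Class'].value_counts()
# Build the labels from the index so they always match the order of the counts
labels = class_counts.index.map({2: 'Benign (2)', 4: 'Malignant (4)'})
plt.pie(class_counts, labels=labels, autopct='%1.1f%%')
plt.title('Class Distribution')
plt.show()

# Split the data into features X (2-D) and target y (the 'Class' column, 1-D)
X = data.drop(['Sample code number', 'Class'], axis=1)
# Convert the labels from 2 and 4 to 0 (benign) and 1 (malignant)
y = data['Class'].map({2: 0, 4: 1})

# Split into training and test sets; the test set is 30% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model and predict on the test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate the logistic regression model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the accuracy, confusion matrix, and classification report
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Interpreting the results
# Confusion matrix (rows = actual class, columns = predicted class;
# label 0 = benign, label 1 = malignant is the positive class):
# [[TN, FP],
#  [FN, TP]]
# TN: predicted benign and actually benign
# FP: predicted malignant but actually benign
# FN: predicted benign but actually malignant
# TP: predicted malignant and actually malignant
# Classification report:
# precision: of the samples predicted as positive, the fraction that are truly positive.
# recall: of the samples that are actually positive, the fraction correctly predicted as positive.
# f1-score: the harmonic mean of precision and recall.
```

Additional points to note

1. In preprocessing, filling missing values with the median is a common approach, and the median stays reasonable when the distribution is skewed (the mean does not). For a discrete feature such as `Bare Nuclei`, filling with the mode or dropping the affected rows are alternatives worth comparing.
2. In the logistic regression model, setting `max_iter` too low can prevent the solver from converging; adjust it to the data at hand.
3. If the dataset is imbalanced (for example, far fewer malignant samples than benign ones), consider oversampling, undersampling, or adjusting class weights (for example `class_weight='balanced'` in `LogisticRegression`) to improve model performance.

[2025-06-19 12:02:10 | AI写代码神器 | answered for 889 points]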
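The confusion-matrix and classification-report terms can be checked by hand on a small example. The sketch below is plain Python with made-up labels: `y_true`, `y_pred`, and the helper names `confusion_counts` and `precision_recall_f1` are hypothetical and not part of the code above. It treats malignant (1) as the positive class and arranges the counts the way scikit-learn's `confusion_matrix` does, with actual classes on rows and predicted classes on columns, which gives `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (tn, fp, fn, tp), treating `positive` as the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # malignant, predicted malignant
        elif actual == positive:
            fn += 1  # malignant, predicted benign
        elif predicted == positive:
            fp += 1  # benign, predicted malignant
        else:
            tn += 1  # benign, predicted benign
    return tn, fp, fn, tp


def precision_recall_f1(fp, fn, tp):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only (0 = benign, 1 = malignant)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]  # rows = actual class, columns = predicted class
precision, recall, f1 = precision_recall_f1(fp, fn, tp)

print(matrix)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With real predictions, the same four counts can be read from scikit-learn with `confusion_matrix(y_test, y_pred).ravel()`, which for a binary problem returns them in the order `tn, fp, fn, tp`.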
