🐍 I. Python Libraries and Functions

1. Pandas

  • `pd.read_csv('path')`: load a CSV dataset
  • `pd.read_excel('path')`: load an Excel file
  • `pd.read_csv('path', encoding='gbk')`: load a GBK-encoded (Chinese) text file; note that `pd.read_excel` takes no `encoding` argument, so encoding applies only to text formats such as CSV
  • `data.head(n)`: show the first n rows
  • `data.dtypes`: show the dtype of each column
  • `data.info()`: basic information about the table structure
  • `data.isnull().sum()`: count missing values per column
  • `data.dropna()`: drop rows containing missing values
  • `data.dropna(subset=['col'])`: drop rows with missing values in the given column
  • `data.drop_duplicates()`: drop duplicate rows
  • `data.duplicated().sum()`: count duplicate rows
  • `data.drop(columns=['col'])`: drop the given column
  • `data.drop(['col'], axis=1)`: drop a column (axis=1 means columns)
  • `pd.to_numeric(data['col'], errors='coerce')`: convert a column to numeric, turning unparsable values into NaN
  • `pd.to_datetime(data['col'])`: convert to datetime
  • `pd.get_dummies(data, drop_first=True)`: convert categorical variables to dummy variables
  • `data.fillna(method='ffill')`: fill missing values with the previous value (forward fill; deprecated in recent pandas, prefer `data.ffill()`)
  • `data.fillna(method='bfill')`: fill missing values with the next value (backward fill; prefer `data.bfill()`)
  • `data['col'].fillna(data['col'].mode()[0])`: fill a column's missing values with its mode
  • `data.apply(lambda x: x.fillna(x.mode()[0]))`: fill each column's missing values with that column's mode
  • `data['col'].astype(int)`: cast to integer
  • `data['col'].astype(float)`: cast to float
  • `data['col'].value_counts()`: count the distribution of values
  • `data.groupby('col').mean()`: group and take the mean
  • `data.groupby('col')['col2'].agg(['count','mean'])`: group and aggregate (count + mean)
  • `data.groupby(['col1','col2'])['col3'].mean().unstack()`: multi-level group means, with the inner level pivoted into columns
  • `data.groupby('col').size()`: group and count rows
  • `data.describe()`: summary statistics
  • `data.quantile(0.25)` / `data.quantile(0.75)`: compute the first/third quartile
  • `data.shape[0]`: number of rows
  • `data.rename(columns={'old':'new'}, inplace=True)`: rename columns
  • `data.to_csv('path', index=False)`: save as CSV without the index
  • `data.to_csv('path', index=False, sep='\t')`: save as a tab-separated text file
  • `data['col'].between(a, b)`: test whether values fall in [a, b] (data plausibility audit)
  • `pd.cut(data['col'], bins=bins, labels=labels, right=False)`: bin continuous data into intervals (left-closed, right-open)
  • `data.columns.str.strip()`: strip whitespace around column names
  • `data.select_dtypes(include=['float64','int64']).columns`: select the numeric columns
  • `data.applymap(lambda x: x.strip() if isinstance(x, str) else x)`: apply a function to every element (e.g. strip strings); deprecated in pandas 2.1+ in favor of `DataFrame.map`
  • `data['col'].apply(lambda x: ...)`: apply a function to a column
  • `data['col'].dt.days`: extract the day count from a date difference
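
A minimal sketch tying several of the calls above together; the tiny in-memory frame and its column names (`city`, `temp`) are made up for illustration:

```python
import pandas as pd

# Hypothetical mini dataset with one duplicate row and one missing value.
data = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'B'],
    'temp': [20.0, 20.0, None, 31.0, 33.0],
})

data = data.drop_duplicates()                            # drop the duplicated 'A' row
data['temp'] = data['temp'].fillna(data['temp'].mean())  # mean-fill the gap
means = data.groupby('city')['temp'].mean()              # per-city average
```

After deduplication the overall mean of the remaining values (20, 31, 33) is 28, which fills the gap before grouping.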

2. NumPy

  • `np.array(img, dtype=np.float32)`: convert to a NumPy array
  • `np.expand_dims(arr, axis=0)`: add a dimension (batch/channel axis)
  • `np.argmax(arr)`: index of the maximum value
  • `np.transpose(arr, (2, 0, 1))`: reorder dimensions (HWC → CHW)
  • `np.where(condition, x, y)`: elementwise conditional, x where the condition holds, otherwise y
  • `np.inf`: positive infinity (useful as an open upper bin edge for `pd.cut`)
  • `arr.reshape((1,) + arr.shape)`: prepend a batch dimension
  • `arr.astype(np.float32)`: cast the array's dtype
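
A quick sketch of the dimension juggling these calls are used for in the inference sections below; the 4x4 image is a made-up placeholder:

```python
import numpy as np

# Hypothetical 4x4 RGB image in HWC (height, width, channel) layout.
img = np.zeros((4, 4, 3), dtype=np.uint8)

arr = img.astype(np.float32)           # cast to float32
chw = np.transpose(arr, (2, 0, 1))     # HWC -> CHW
batched = np.expand_dims(chw, axis=0)  # prepend batch dimension -> (1, 3, 4, 4)
```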

3. Scikit-learn

  • `StandardScaler()`: standardization (mean 0, standard deviation 1)
  • `scaler.fit_transform(data)`: fit and apply the scaling
  • `MinMaxScaler()`: normalization (rescale to the 0-1 range)
  • `LabelEncoder()`: label encoding (categories to integers)
  • `label_encoder.fit_transform(data['col'])`: label-encode a column
  • `train_test_split(X, y, test_size=0.2, random_state=42)`: split into training and test sets
  • `LinearRegression()`: linear regression model
  • `LogisticRegression(max_iter=1000)`: logistic regression classifier
  • `DecisionTreeRegressor(random_state=42)`: decision tree regressor
  • `RandomForestRegressor(n_estimators=100, random_state=42)`: random forest regressor (100 trees)
  • `Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])`: chain several steps into one pipeline
  • `mean_squared_error(y_test, y_pred)`: mean squared error (MSE)
  • `mean_absolute_error(y_test, y_pred)`: mean absolute error (MAE)
  • `r2_score(y_test, y_pred)`: coefficient of determination R²
  • `classification_report(y_test, y_pred, zero_division=1)`: classification evaluation report
  • `model.score(X_train, y_train)`: model score (accuracy for classifiers, R² for regressors)
  • `accuracy_score(y_test, y_pred)`: accuracy
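
A self-contained sketch of the split + pipeline + score pattern; the synthetic noise-free data is invented here so the fit is essentially exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noise-free linear data: y = 3*x0 - 2*x1 + 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
pipeline.fit(X_train, y_train)
r2 = pipeline.score(X_test, y_test)  # R^2 on the held-out split
```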

4. XGBoost

  • `XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=5)`: XGBoost regressor
  • `XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=5, subsample=0.8, colsample_bytree=0.8)`: XGBoost regressor with row/column subsampling
  • `xgb_model.fit(X_train, y_train)`: train the model
  • `xgb_model.predict(X_test)`: predict with the model
  • `xgb_model.score(X_train, y_train)`: model score

5. imblearn (handling class imbalance)

  • `SMOTE(random_state=42)`: SMOTE oversampling for imbalanced data
  • `smote.fit_resample(X_train, y_train)`: resample the data

6. Matplotlib

  • `plt.figure(figsize=(12, 8))`: set the figure size
  • `plt.subplot(rows, cols, index)`: create a subplot
  • `plt.title('title')`: set the title
  • `plt.xlabel('label')` / `plt.ylabel('label')`: set axis labels
  • `plt.xticks(fontproperties=font)`: set the tick label font
  • `plt.legend(prop=font)`: set the legend font
  • `plt.tight_layout()`: auto-adjust subplot spacing
  • `plt.show()`: display the figure
  • `plt.scatter(x, y)`: scatter plot
  • `df.plot(kind='bar', stacked=True)`: stacked bar chart (note: `plt.bar` itself has no `stacked` argument; stacking with raw `plt.bar` uses `bottom`, while `stacked=True` is a pandas plotting option)
  • `plt.pie(data, autopct='%1.1f%%', startangle=90)`: pie chart
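
A small sketch of the subplot/layout/save flow; it assumes the headless `Agg` backend and writes to a temporary file so no display is needed:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render without a display
import matplotlib.pyplot as plt
import os
import tempfile

fig = plt.figure(figsize=(6, 4))
plt.subplot(1, 2, 1)
plt.scatter([1, 2, 3], [4, 5, 6])
plt.title('scatter')
plt.subplot(1, 2, 2)
plt.pie([30, 70], autopct='%1.1f%%', startangle=90)
plt.title('pie')
plt.tight_layout()

out_path = os.path.join(tempfile.gettempdir(), 'demo_plot.png')
plt.savefig(out_path)  # use plt.show() instead when a display is available
plt.close(fig)
```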

7. Matplotlib Chinese Font Setup

import matplotlib.font_manager as fm
font_path = 'C:/Windows/Fonts/simhei.ttf'
my_font = fm.FontProperties(fname=font_path)
plt.title('标题', fontproperties=my_font)
plt.xlabel('标签', fontproperties=my_font)
plt.xticks(fontproperties=my_font)
plt.legend(prop=my_font)

8. Seaborn

  • `sns.boxplot(x=data['col'])`: box plot (useful for outlier detection)

9. PIL (image processing)

  • `Image.open('path')`: open an image file
  • `Image.open('path').convert('L')`: convert to grayscale
  • `Image.open('path').convert('RGB')`: convert to RGB
  • `img.resize((width, height), Image.BILINEAR)`: resize (bilinear interpolation)
  • `img.resize((width, height), Image.LANCZOS)`: high-quality resize (`Image.ANTIALIAS` was removed in Pillow 10; `LANCZOS` is its replacement)
  • `img.crop((left, top, right, bottom))`: crop the image
  • `img.size`: get the image width and height

10. ONNX Runtime (model inference)

  • `ort.InferenceSession('model_path')`: load an ONNX model
  • `session.get_inputs()[0].name`: name of the model input
  • `session.get_outputs()[0].name`: name of the model output
  • `session.run(None, {input_name: data})`: run inference
  • `session.run([output_name], {input_name: data})`: run inference with an explicit output name

11. OpenCV (cv2)

  • `cv2.imread('path')`: read an image
  • `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)`: convert BGR to RGB
  • `cv2.resize(img, (width, height))`: resize an image
  • `cv2.rectangle(img, (x1,y1), (x2,y2), color, thickness)`: draw a rectangle
  • `cv2.imwrite('path', img)`: save an image

12. Pickle/joblib (saving and loading models)

  • `pickle.dump(model, file)`: save a model to a file
  • `pickle.load(file)`: load a model from a file
  • `joblib.dump(model, 'path')`: save a model
  • `joblib.load('path')`: load a model

13. SciPy (scientific computing)

  • `scipy.special.softmax(x, axis=-1)`: compute a softmax probability distribution
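
A tiny check of what softmax produces: positive values that sum to 1 and preserve the ordering of the logits.

```python
import numpy as np
import scipy.special

logits = np.array([1.0, 2.0, 3.0])
probs = scipy.special.softmax(logits, axis=-1)
# probs sums to 1 and the largest logit keeps the largest probability
```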

14. datetime

  • `datetime(2024, 9, 1)`: create a specific date
  • `(date1 - date2).days`: day count of a scalar datetime difference; for pandas Series the accessor form `(s1 - s2).dt.days` is needed
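
The scalar/Series distinction above in a runnable sketch:

```python
from datetime import datetime

import pandas as pd

# Scalar datetimes: the difference is a timedelta, read days via .days.
d1 = datetime(2024, 9, 1)
d2 = datetime(2024, 8, 1)
scalar_days = (d1 - d2).days

# pandas Series of datetimes: the difference needs the .dt accessor.
s1 = pd.to_datetime(pd.Series(['2024-09-01']))
s2 = pd.to_datetime(pd.Series(['2024-08-01']))
series_days = (s1 - s2).dt.days
```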

15. os (file system operations)

  • `os.path.exists(path)`: check whether a path exists
  • `os.makedirs(path)`: create directories
  • `os.listdir(path)`: list files in a directory
  • `os.path.join(path, filename)`: join path components

16. Web Scraping Tools

  • BeautifulSoup: parse HTML and XML documents
  • Scrapy: fast, efficient web crawling framework
  • Selenium: drive a browser to handle dynamic pages

17. Data Cleaning Tools

  • Pandas: powerful data processing and analysis
  • NumPy: high-performance scientific computing
  • OpenRefine: open-source interactive data cleaning
  • Dask: parallel computing on large-scale data

18. Label Studio (data annotation tool)

  • Text annotation: named entity recognition, sentiment analysis, relation extraction
  • Image annotation: image classification, object detection, image segmentation
  • Audio annotation: speech recognition, event extraction
  • Video annotation: action recognition, event extraction

📖 II. Conceptual Knowledge Points

1. Data Cleaning Workflow

  • Missing values: inspect (`isnull().sum()`), then drop (`dropna()`) or fill (`fillna`)
  • Outliers: detect and handle with the IQR (interquartile range) method
  • Duplicates: inspect (`duplicated().sum()`) and drop (`drop_duplicates()`)
  • Type conversion: strings to numeric (`to_numeric`), dates (`to_datetime`), etc.
  • Standardization/normalization: bring data onto a common scale for analysis

2. Data Audit Workflow

  • Completeness audit: check each field for missing and duplicate values
  • Plausibility audit: validate ranges against business rules (e.g. age 18-70, income > 2000)
  • `between()`: `data['col'].between(lower, upper)` tests plausibility
  • Flag implausible records: create boolean columns marking valid/invalid rows
  • Verify the cleaning: confirm the cleaned data is complete, plausible, and fit for modeling

3. Data Annotation Workflow

  • Feature selection: pick the features most useful for prediction given the business need
  • Target labeling: identify the target variable explicitly (e.g. mpg, SeriousDlqin2yrs)
  • Data split: training vs. test sets (typically 80%/20% or 70%/30%)

4. Choosing a Machine Learning Model

  • Regression: linear regression, decision tree, random forest, and XGBoost regressors
  • Classification: logistic regression, decision tree, and random forest classifiers
  • Class imbalance: SMOTE oversampling
  • Pipeline: combine scaling and model training into one workflow

5. Model Evaluation Metrics

  • Regression: MSE (mean squared error), MAE (mean absolute error), R² (coefficient of determination)
  • Classification: accuracy, precision, recall, F1-score
  • Model score: `model.score()` on the training/test set

6. Designing a Business Data Pipeline

  • Collection: sensor data, user-behavior data, medical data, etc.
  • Processing: cleaning, transformation, standardization
  • Analysis: descriptive statistics, visualization
  • Application: prediction, classification

7. ONNX Inference Workflow

  1. Load the model and label file
  2. Load and preprocess the image (resize, normalize, reorder dimensions)
  3. Run inference
  4. Decode the output (argmax for the predicted class)
  5. Apply softmax for a probability distribution

8. Image Preprocessing Steps

  1. Resize to the model's input size (resize)
  2. Convert the color space (RGB/grayscale)
  3. Center-crop (e.g. 256 → 224)
  4. Normalize (divide by 255.0, subtract the mean, divide by the std)
  5. Reorder dimensions (HWC → CHW)
  6. Add a batch dimension (reshape)
  7. Ensure the dtype is float32

9. Designing a Module Optimization Plan

  1. Identify the problems users report
  2. Analyze their root causes
  3. Design the optimization plan (key steps)
  4. State the expected improvement

10. Object Detection Workflow (3.2.5)

  1. Load the ONNX model and class labels
  2. Iterate over the image directory
  3. Read and preprocess each image (BGR → RGB, resize, normalize, reorder dimensions)
  4. Run inference to get confidences and bounding boxes
  5. Post-process: NMS (non-maximum suppression), confidence-threshold filtering
  6. Draw detection boxes and labels
  7. Save the result images

11. NMS (Non-Maximum Suppression) & IoU

  • Goal: remove overlapping detection boxes, keeping the best one
  • Procedure: sort by confidence → take the top-scoring box → compute IoU against the rest → drop boxes with IoU above the threshold → repeat
  • IoU (intersection over union) = intersection area / union area
  • The NMS threshold is typically 0.5
def calculate_iou(box1, box2):
    # box format: [x1, y1, x2, y2]
    inter_x1 = max(box1[0], box2[0])
    inter_y1 = max(box1[1], box2[1])
    inter_x2 = min(box1[2], box2[2])
    inter_y2 = min(box1[3], box2[3])
    inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    iou = inter_area / (area1 + area2 - inter_area)
    return iou
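
The NMS loop itself is not written out above; a minimal greedy sketch under the same [x1, y1, x2, y2] box convention (the `nms` function name and signature are illustrative, not from a library):

```python
def iou(box1, box2):
    # box format: [x1, y1, x2, y2]
    inter_x1 = max(box1[0], box2[0])
    inter_y1 = max(box1[1], box2[1])
    inter_x2 = min(box1[2], box2[2])
    inter_y2 = min(box1[3], box2[3])
    inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter_area / (area1 + area2 - inter_area)

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy NMS: keep the highest-scoring box, drop heavy overlaps, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two boxes with IoU 0.81 collapse to the higher-scoring one, while a disjoint box survives.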

12. Binning Data into Intervals

  • Use `pd.cut()` to bin continuous data into intervals
  • Define a `bins` edge list and a `labels` list
  • `right=False` makes the intervals left-closed, right-open
  • Common cases: BMI ranges, age ranges, etc.

13. Writing a Training Outline (4.1.x / 4.2.x)

  • Outline structure: learning objectives → content → hands-on practice → summary → homework
  • Annotation types: text, image, video, and audio annotation
  • Annotation tools: installing and using Label Studio
  • Scraping tools: BeautifulSoup, Scrapy, Selenium
  • Cleaning tools: Pandas, NumPy, OpenRefine, Dask
  • Data collection design: API interfaces, sensors, social media, questionnaires
  • Security and compliance: encrypted storage, access control, anonymization, GDPR/HIPAA compliance

III. Quick Reference: Common Fill-in-the-Blank Code

Loading data

data = pd.read_csv('path.csv')
data = pd.read_excel('path.xlsx')

Missing values

data.isnull().sum()
data.dropna()
data = data.dropna()
data.fillna(method='ffill')  # forward fill
data.fillna(method='bfill')  # backward fill
data.apply(lambda x: x.fillna(x.mode()[0]))  # fill with the mode

Type conversion

pd.to_numeric(data['col'], errors='coerce')
data['col'].astype(int)
data['col'].astype(float)
pd.to_datetime(data['col'])

Plausibility audit

data['col'].between(lower, upper)  # test whether values fall in the range
data['is_valid'] = data['Age'].between(18, 70)  # example

Binning into intervals

bins = [0, 18.5, 24, 28, np.inf]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
data['new_col'] = pd.cut(data['col'], bins=bins, labels=labels, right=False)

Conditional assignment

data['new_col'] = np.where(condition, value_if_true, value_if_false)
# e.g.: data['RiskLevel'] = np.where(data['DaysInHospital']>7, 'HighRisk', 'LowRisk')

Standardization / normalization

scaler = StandardScaler()
data[cols] = scaler.fit_transform(data[cols])
scaler = MinMaxScaler()
data[cols] = scaler.fit_transform(data[cols])
# manual standardization: (data - data.mean()) / data.std()

Train/test split

X = data.drop(columns=['target'])
X = data[selected_features]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model training

model = LinearRegression()
model.fit(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model = DecisionTreeRegressor(random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model = XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=5)
pipeline = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
pipeline.fit(X_train, y_train)

Prediction and evaluation

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=1)

Saving models and data

data.to_csv('path.csv', index=False)
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
joblib.dump(model, 'path.pkl')

ONNX inference

ort_session = ort.InferenceSession('model.onnx')
input_name = ort_session.get_inputs()[0].name
ort_inputs = {input_name: input_data}
ort_outs = ort_session.run(None, ort_inputs)
predicted_label = np.argmax(ort_outs[0])

Image preprocessing

image = Image.open('path').convert('L')  # grayscale
image = Image.open('path').convert('RGB')  # RGB
image = image.resize((28, 28))
image_array = np.array(image, dtype=np.float32)
image_array = np.expand_dims(image_array, axis=0)  # add a batch dimension
# ResNet-style preprocessing
image = image.resize((256, 256), Image.BILINEAR)
image = image.crop((left, top, left+224, top+224))  # center crop
image = np.array(image).astype(np.float32) / 255.0
image = (image - mean) / std  # normalize
image = np.transpose(image, (2, 0, 1))  # HWC -> CHW
image = image.reshape((1,) + image.shape)  # add a batch dimension

Object detection (3.2.5)

orig_image = cv2.imread('path')
image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (320, 240))
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
image = np.transpose(image, [2, 0, 1])
image = np.expand_dims(image, axis=0)
confidences, boxes = session.run(None, {input_name: image})
cv2.rectangle(orig_image, (x1,y1), (x2,y2), (255,255,0), 4)
cv2.imwrite('output_path', orig_image)

📊 IV. IQR Outlier Detection

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# outlier definition: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
data_cleaned = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]

⚖️ V. SMOTE for Imbalanced Data

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

🔢 VI. Softmax Probability Computation

import scipy.special
probabilities = scipy.special.softmax(output, axis=-1)
# get the Top-5 predictions
top5_idx = np.argsort(probabilities)[-5:][::-1]
top5_prob = probabilities[top5_idx]

📋 1.1.x Business Data Processing Workflow Design

1.1.1 Smart Healthcare System - patient_data.csv
import pandas as pd
import numpy as np

data = pd.read_csv('patient_data.csv')

# 1. Count patients hospitalized more than 7 days and their share
data['RiskLevel'] = np.where(data['DaysInHospital']>7, 'HighRisk', 'LowRisk')
risk_counts = data['RiskLevel'].value_counts()
high_risk_ratio = risk_counts['HighRisk'] / len(data)
low_risk_ratio = risk_counts['LowRisk'] / len(data)

# 2. Share of high-risk patients in each BMI range
bmi_bins = [0, 18.5, 24, 28, np.inf]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
data['BMIRange'] = pd.cut(data['BMI'], bins=bmi_bins, labels=bmi_labels, right=False)
bmi_risk_rate = data.groupby('BMIRange')['RiskLevel'].apply(lambda x: (x == 'HighRisk').mean())
bmi_patient_count = data['BMIRange'].value_counts()

# 3. Share of high-risk patients in each age range
age_bins = [0, 26, 36, 46, 56, 66, np.inf]
age_labels = ['<=25', '26-35', '36-45', '46-55', '56-65', '>65']
data['AgeRange'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels, right=False)
age_risk_rate = data.groupby('AgeRange')['RiskLevel'].apply(lambda x: (x == 'HighRisk').mean())
age_patient_count = data['AgeRange'].value_counts()
1.1.2 Smart Agriculture System - sensor_data.csv
import pandas as pd
import numpy as np
data = pd.read_csv('sensor_data.csv')

# 1. Per-sensor statistics
sensor_stats = data.groupby('SensorType')['Value'].agg(['count','mean'])

# 2. Temperature and humidity averages by location
location_stats = data[data['SensorType'].isin(['Humidity','Temperature'])]\
    .groupby(['Location','SensorType'])['Value'].mean().unstack()

# 3. Cleaning and outlier handling
data['is_abnormal'] = np.where(
    ((data['SensorType'] == 'Temperature') & ((data['Value'] < -10) | (data['Value'] > 50))) |
    ((data['SensorType'] == 'Humidity') & ((data['Value'] < 0) | (data['Value'] > 100))),
    True, False)
data['Value'].fillna(method='ffill', inplace=True)
data['Value'].fillna(method='bfill', inplace=True)
cleaned_data = data.drop(columns=['is_abnormal'])
cleaned_data.to_csv('cleaned_sensor_data.csv', index=False)
1.1.3 Credit Assessment for a Financial Institution - credit_data.csv
import pandas as pd
import numpy as np
data = pd.read_csv('credit_data.csv')

# 1. Completeness audit
missing_values = data.isnull().sum()
duplicate_values = data.duplicated().sum()

# 2. Plausibility audit
data['is_age_valid'] = data['Age'].between(18, 70)
data['is_income_valid'] = data['Income'] > 2000
data['is_loan_amount_valid'] = data['LoanAmount'] < (data['Income'] * 5)
data['is_credit_score_valid'] = data['CreditScore'].between(300, 850)
validity_checks = data[['is_age_valid','is_income_valid','is_loan_amount_valid','is_credit_score_valid']].all(axis=1)
data['is_valid'] = validity_checks

# 3. Cleaning
cleaned_data = data[data['is_valid']]
cleaned_data = cleaned_data.drop(columns=['is_age_valid','is_income_valid','is_loan_amount_valid','is_credit_score_valid','is_valid'])
cleaned_data.to_csv('cleaned_credit_data.csv', index=False)
1.1.4 E-commerce User Behavior - user_behavior_data.csv
import pandas as pd
import numpy as np
data = pd.read_csv('user_behavior_data.csv')

# Cleaning
data = data.dropna()
data['Age'] = data['Age'].astype(int)
data['PurchaseAmount'] = data['PurchaseAmount'].astype(float)
data['ReviewScore'] = data['ReviewScore'].astype(int)
data = data[(data['Age'].between(18, 70)) & (data['PurchaseAmount'] > 0) & (data['ReviewScore'].between(1, 5))]
data['PurchaseAmount'] = (data['PurchaseAmount'] - data['PurchaseAmount'].mean()) / data['PurchaseAmount'].std()
data['ReviewScore'] = (data['ReviewScore'] - data['ReviewScore'].mean()) / data['ReviewScore'].std()
data.to_csv('cleaned_user_behavior_data.csv', index=False)

# Statistics
purchase_category_counts = data.groupby(['PurchaseCategory']).size()
gender_purchase_amount_mean = data.groupby(['Gender'])['PurchaseAmount'].mean()
bins = [18, 26, 36, 46, 56, 66, np.inf]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
data['AgeGroup'] = pd.cut(data['Age'], bins=bins, labels=labels, right=False)
age_group_counts = data['AgeGroup'].value_counts().sort_index()
1.1.5 Intelligent Transportation System - vehicle_traffic_data.csv
import pandas as pd
import numpy as np
data = pd.read_csv('vehicle_traffic_data.csv')

# Cleaning
data = data.dropna()
data['Age'] = data['Age'].astype(int)
data['Speed'] = data['Speed'].astype(float)
data['TravelDistance'] = data['TravelDistance'].astype(float)
data['TravelTime'] = data['TravelTime'].astype(float)
data = data[(data['Age'].between(18, 70)) & (data['Speed'].between(0, 200)) &
            (data['TravelDistance'].between(1, 1000)) & (data['TravelTime'].between(1, 1440))]
data.to_csv('cleaned_vehicle_traffic_data.csv', index=False)

# Statistics
traffic_event_counts = data.groupby(['TrafficEvent']).size()
gender_stats = data.groupby(['Gender'])[['Speed','TravelDistance','TravelTime']].mean()
age_bins = [18, 26, 36, 46, 56, 66, np.inf]
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
data['AgeGroup'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels, right=False)
age_group_counts = data['AgeGroup'].value_counts()

🧹 2.1.x Data Cleaning and Annotation Workflow Design

2.1.1 Fuel Efficiency Model - Cleaning and Annotation
import pandas as pd
data = pd.read_csv('auto-mpg.csv')
print(data.head())
print(data.isnull().sum())
data = data.dropna()
data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')
data = data.dropna()

from sklearn.preprocessing import StandardScaler
numerical_features = ['displacement', 'horsepower', 'weight', 'acceleration']
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])

from sklearn.model_selection import train_test_split
selected_features = ['cylinders','displacement','horsepower','weight','acceleration','model year','origin']
X = data[selected_features]
y = data['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cleaned_data = X.copy()
cleaned_data['mpg'] = y
cleaned_data.to_csv('2.1.1_cleaned_data.csv', index=False)
2.1.2 Low-Carbon Lifestyle Behaviors - Cleaning and Annotation
data = pd.read_excel('低碳生活数据集.xlsx')
initial_row_count = data.shape[0]
data = data.dropna()
final_row_count = data.shape[0]
data = data.drop_duplicates()

scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
X = data[selected_features]
y = data['target_col']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
cleaned_data = pd.concat([X, y], axis=1)
cleaned_data.to_csv('2.1.2_cleaned_data.csv', index=False)
2.1.3 Credit Scoring Model - IQR Outliers + Normalization
data = pd.read_csv('Finance.csv')
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data_cleaned = data[~((data[numeric_cols] < (Q1 - 1.5 * IQR)) | (data[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
duplicates = data_cleaned.duplicated()
data_cleaned = data_cleaned[~duplicates]

scaler = MinMaxScaler()
data_cleaned[numeric_cols] = scaler.fit_transform(data_cleaned[numeric_cols])
X = data_cleaned.drop(columns=['SeriousDlqin2yrs'])
y = data_cleaned['SeriousDlqin2yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
data_cleaned.to_csv('2.1.3_cleaned_data.csv', index=False)

🤖 2.2.x Model Development and Testing

2.2.1 Logistic Regression - Credit Scoring
data = pd.read_csv('finance.csv')
X = data.drop(['SeriousDlqin2yrs', 'Unnamed: 0'], axis=1)
y = data['SeriousDlqin2yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
with open('2.2.1_model.pkl', 'wb') as file:
    pickle.dump(model, file)
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=1)
accuracy = accuracy_score(y_test, y_pred)

# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model.fit(X_resampled, y_resampled)
y_pred_resampled = model.predict(X_test)
2.2.2 Pipeline + Random Forest - Fuel Efficiency
df = pd.read_csv('auto-mpg.csv')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df = df.dropna()
X = df[selected_features]
y = df['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
2.2.3 Random Forest + XGBoost - Exercise Prediction
df = pd.read_csv('fitness_data.csv')
X = pd.get_dummies(df[['Your gender','How important is exercise to you?','How healthy do you consider yourself?']])
y = df['Your age'].apply(lambda x: int(x.split(' ')[0]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
train_score = rf_model.score(X_train, y_train)
test_score = rf_model.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=5, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
2.2.4 Linear Regression + XGBoost - Low-Carbon Behavior Prediction
data = pd.read_csv('低碳数据集.csv')
data_cleaned = data.drop(columns=['序号','所用时间'])
data_cleaned = pd.get_dummies(data_cleaned, drop_first=True)
X = data_cleaned.drop(columns=['target_col'])
y = data_cleaned['target_col']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
joblib.dump(model, '2.2.4_model.pkl')
y_pred = model.predict(X_test)

xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, max_depth=5, subsample=0.8, colsample_bytree=0.8)
xgb_model.fit(X_train, y_train)
y_pred_xg = xgb_model.predict(X_test)
2.2.5 Decision Tree Regression - Step Count Prediction
df = pd.read_csv('fitness_analysis.csv')
X = pd.get_dummies(df[feature_cols])
y = df['daily_steps']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

👁️ 3.2.x Model Inference and Interaction Workflows

3.2.1 ResNet Image Recognition - Top-5 Classification
import onnxruntime as ort
import numpy as np
import scipy.special
from PIL import Image

def preprocess_image(image, resize_size=256, crop_size=224, mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]):
    image = image.resize((resize_size, resize_size), Image.BILINEAR)
    w, h = image.size
    left = (w - crop_size) / 2
    top = (h - crop_size) / 2
    image = image.crop((left, top, left + crop_size, top + crop_size))
    image = np.array(image).astype(np.float32) / 255.0
    image = (image - mean) / std
    image = np.transpose(image, (2, 0, 1))
    image = image.reshape((1,) + image.shape)
    return image

session = ort.InferenceSession('resnet.onnx')
with open('labels.txt') as f:
    labels = [line.strip() for line in f.readlines()]
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
image = Image.open('img_test.jpg').convert('RGB')
processed_image = preprocess_image(image)
output = session.run([output_name], {input_name: processed_image.astype(np.float32)})[0]
probabilities = scipy.special.softmax(output, axis=-1)
top5_idx = np.argsort(probabilities[0])[-5:][::-1]
top5_prob = probabilities[0][top5_idx]
3.2.2 MNIST Handwritten Digit Recognition
import onnxruntime
import numpy as np
from PIL import Image

ort_session = onnxruntime.InferenceSession('mnist.onnx')
image = Image.open('img_test.png').convert('L')
image = image.resize((28, 28))
image_array = np.array(image, dtype=np.float32)
image_array = np.expand_dims(image_array, axis=0)  # add channel dimension -> (1, 28, 28)
image_array = np.expand_dims(image_array, axis=0)  # add batch dimension -> (1, 1, 28, 28)
ort_inputs = {ort_session.get_inputs()[0].name: image_array}
ort_outs = ort_session.run(None, ort_inputs)
predicted_class = np.argmax(ort_outs[0])
3.2.3 Facial Expression Recognition - emotion-ferplus
emotion_table = {'neutral':0, 'happiness':1, 'surprise':2, 'sadness':3, 'anger':4, 'disgust':5, 'fear':6, 'contempt':7}
ort_session = ort.InferenceSession('emotion-ferplus.onnx')
input_data = preprocess('img_test.png')
ort_inputs = {ort_session.get_inputs()[0].name: input_data}
ort_outs = ort_session.run(None, ort_inputs)
predicted_label = np.argmax(ort_outs[0])
idx_to_emotion = {v: k for k, v in emotion_table.items()}  # reverse the name -> index map
predicted_emotion = idx_to_emotion[predicted_label]
3.2.4 Flower Recognition
session = ort.InferenceSession('flower-detection.onnx')
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
with open('labels.txt') as f:
    labels = [line.strip() for line in f.readlines()]
image = Image.open('flower_test.png').convert('RGB')
processed_image = preprocess_image(image)
output = session.run([output_name], {input_name: processed_image.astype(np.float32)})[0]
accuracy = scipy.special.softmax(output, axis=-1)
predicted_idx = np.argmax(accuracy[0])
prob_percentage = accuracy[0][predicted_idx] * 100
predicted_label = labels[predicted_idx]
3.2.5 Face Detection - version-RFB-320
import os, cv2, numpy as np, onnxruntime as ort

class_names = [name.strip() for name in open('voc-model-labels.txt').readlines()]
ort_session = ort.InferenceSession('version-RFB-320.onnx')
input_name = ort_session.get_inputs()[0].name
result_path = "./detect_imgs_results_onnx"
if not os.path.exists(result_path):
    os.makedirs(result_path)

for file_path in os.listdir("imgs"):
    img_path = os.path.join("imgs", file_path)
    orig_image = cv2.imread(img_path)
    image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (320, 240))
    image_mean = np.array([127, 127, 127])
    image = (image - image_mean) / 128
    image = np.transpose(image, [2, 0, 1])
    image = np.expand_dims(image, axis=0).astype(np.float32)
    confidences, boxes = ort_session.run(None, {input_name: image})
    # post-process + draw boxes + save

📝 1.x Business Workflow Design (1.1.x + 1.2.x)

1.1.1 Business Data Processing Workflow for a Smart Healthcare System

30 min, coding task

Dataset: patient_data.csv | Fields: PatientID, Age, BMI, BloodPressure, Cholesterol, DaysInHospital

(1) Count patients hospitalized > 7 days and their share (high risk / low risk)

(2) Share and count of high-risk patients per BMI range (underweight/normal/overweight/obese)

(3) Share and count of high-risk patients per age range

Deliverables: 1.1.1.html, 1.1.1-1.jpg, 1.1.1-2.jpg, 1.1.1-3.jpg

1.1.2 Business Data Collection and Processing Workflow for a Smart Agriculture System

30 min, coding task

Dataset: sensor_data.csv | Fields: SensorID, Timestamp, SensorType, Value, Location

(1) Count and mean of readings for each sensor type

(2) Mean temperature and humidity for each location

(3) Cleaning: flag outliers → fill missing values → save cleaned_sensor_data.csv

Deliverables: 1.1.2.html, cleaned_sensor_data.csv, 1.1.2-1.jpg, 1.1.2-2.jpg

1.1.3 Business Data Audit Workflow for a Financial Credit Assessment System

30 min, coding task

Dataset: credit_data.csv | Fields: CustomerID, Name, Age, Income, LoanAmount, LoanTerm, CreditScore, Default

(1) Completeness audit: missing + duplicate values

(2) Plausibility audit: age 18-70, income > 2000, loan < income * 5, credit score 300-850

(3) Clean and save cleaned_credit_data.csv

Deliverables: 1.1.3.html, cleaned_credit_data.csv, 1.1.3-1.jpg, 1.1.3-2.jpg

1.1.4 Data Collection and Processing Workflow for an E-commerce User Behavior Analysis System

30 min, coding task

Dataset: user_behavior_data.csv

(1) Collection: read the data + print the first 5 rows

(2) Cleaning: missing values → type conversion → outliers → standardization → save

(3) Statistics: users per purchase category / mean purchase amount by gender / users per age group

Deliverables: 1.1.4.html, cleaned_user_behavior_data.csv, 3 screenshots

1.1.5 Data Collection, Processing, and Audit Workflow for an Intelligent Transportation System

30 min, coding task

Dataset: vehicle_traffic_data.csv

(1) Collection (2) Cleaning and preprocessing (3) Plausibility audit (4) Statistics

Deliverables: 1.1.5.html, cleaned_vehicle_traffic_data.csv, 5 screenshots

1.2.1~1.2.5 Business Module Optimization (written tasks)

30 min, written task

1.2.1 Customer review sentiment recognition

1.2.2 Elderly health monitoring

1.2.3 Smart financial services

1.2.4 Automatic selling-point generation

1.2.5 Tencent Cloud digital human

Common format: (1) list the problems with explanations (2) optimization plan with steps and expected effects

🔬 2.x Data Cleaning/Annotation and Model Development (2.1.x + 2.2.x)

2.1.1~2.1.5 Data Cleaning and Annotation Workflow Design

20 min, coding task

2.1.1 Fuel efficiency model - standardization + feature selection + split

2.1.2 Low-carbon lifestyle - missing values + duplicates + standardization + split

2.1.3 Credit scoring model - IQR outliers + normalization + split

2.1.4 Medical research data - date handling + normalization + visualization

2.1.5 Health and nutrition consulting - missing values + type conversion + LabelEncoder + pie chart

Deliverables: x.x.x.html, x.x.x.docx

2.2.1~2.2.5 Model Development and Testing

20 min, coding task

2.2.1 Logistic regression + SMOTE - credit scoring

2.2.2 Pipeline linear regression + random forest - fuel efficiency

2.2.3 Random forest + XGBoost - exercise prediction

2.2.4 Linear regression + XGBoost - low-carbon behavior prediction

2.2.5 Decision tree regression - step count prediction

Deliverables: x.x.x.html, x.x.x.docx

🎯 3.x Product Analysis and Model Inference (3.1.x + 3.2.x)

3.1.1~3.1.5 Smart Product Data Analysis and Optimization (written tasks)

20 min, written task

3.1.1 Smart speaker - usage habits / feature-use frequency / response time

3.1.2 Smart lighting system - brightness preferences / scene frequency / response time

3.1.3 Smart health band - activity patterns / attention to metrics / sync performance

3.1.4 Smart health monitoring system - activity cycles / metric preferences / response accuracy

3.1.5 Smart home environment control - environment preferences / response time / energy analysis

Common format: (1) analysis report (2) three optimization directions with solutions

3.2.1~3.2.5 Model Inference and Interaction Workflow Design

20 min, coding task

3.2.1 ResNet image recognition - Top-5 classification (resnet.onnx)

3.2.2 MNIST handwritten digit recognition (mnist.onnx)

3.2.3 Facial expression recognition (emotion-ferplus.onnx)

3.2.4 Flower recognition (flower-detection.onnx)

3.2.5 Face detection (version-RFB-320.onnx)

Common: (1) complete the code + screenshots (2) optimal human-computer interaction mode/workflow (docx)

Deliverables: x.x.x.html, x.x.x.docx, x.x.x-1.jpg

📚 4.x Training Outlines and Guidance Plans (4.1.x + 4.2.x)

4.1.1~4.1.5 Training Outline Writing (written tasks)

10 min, written task

4.1.1 Label Studio annotation tool training

4.1.2 Web scraping tool training (BeautifulSoup/Scrapy/Selenium)

4.1.3 Data cleaning tool training (Pandas/NumPy/OpenRefine/Dask)

4.1.4 Pandas data cleaning training

4.1.5 Python data visualization training

Structure: learning objectives → content → hands-on practice → summary → homework

4.2.1~4.2.5 Data Collection and Processing Guidance Plans (written tasks)

10 min, written task

4.2.1 Smart retail analytics system

4.2.2 AI-assisted medical imaging diagnosis system

4.2.3 AI smart security surveillance system

4.2.4 Autonomous vehicle perception system

4.2.5 Digital preservation of cultural heritage