MG Public Knowledge Hub

Use Kaggle Titanic - Machine Learning from Disaster (Python)

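Load and preview the data

import pandas as pd

This imports the pandas library under the alias pd, the conventional shorthand.

pandas is the most widely used data analysis library in Python. It handles tabular data and reads formats such as Excel and CSV.

If you have not installed pandas yet, run this first: pip install pandas

df = pd.read_csv("train.csv")

The pandas function read_csv() reads a CSV file, here the Titanic dataset train.csv.

df is the table you loaded (a DataFrame); you can think of it as an Excel sheet.

📌 Note:

This line reads train.csv from the current working directory by default.

If the file path is wrong, the program raises FileNotFoundError.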
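A minimal sketch of guarding against a wrong path, assuming the file may not sit in the working directory. The helper name load_titanic is hypothetical and not part of the original tutorial or of pandas.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_titanic(path: str = "train.csv") -> Optional[pd.DataFrame]:
    """Read the CSV at `path`, or return None with a hint if it is missing.

    `load_titanic` is a hypothetical helper for illustration only.
    """
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        # Show where Python actually looked, which is usually the real problem.
        print(f"Could not find {path!r}; current folder is {Path.cwd()}")
        return None


df = load_titanic("train.csv")
```

If the message appears, either move train.csv into the printed folder or pass the full path to the file.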

df.head()

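Shows the first five rows of the DataFrame.

This is a quick way to check the structure of the data: column names, values, and whether the formatting looks right.

You can also choose the number of rows, for example df.head(10).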
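As a small sketch of a custom row count, the toy table below stands in for the Titanic data. Its values are made up for illustration; only the column names mirror train.csv.

```python
import pandas as pd

# Toy stand-in for the Titanic table; the values are made up for illustration.
df = pd.DataFrame({
    "PassengerId": range(1, 13),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
})

first_ten = df.head(10)   # first 10 rows instead of the default 5
last_three = df.tail(3)   # tail() works the same way from the end

print(first_ten)
print(last_three)
```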

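Step 1: Understand the data structure

df.info()

Purpose: inspect the structure of the whole DataFrame.

The output includes:

The name of each column

The number of non-null values (Non-Null Count)

The data type (such as int64, float64, object)

The total number of rows

From this you can tell:

Which columns have missing values

Whether each column's type is suitable for modeling (for example, text columns need to be converted to numbers)

🧩 Example output: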
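A hedged sketch of producing such a report yourself, using a toy frame with made-up values rather than the real dataset. It also shows that info() prints its report instead of returning it, so the text is captured through the buf parameter.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror train.csv.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
})

# info() prints its report and returns None, so capture it in a buffer
# if you want the text as a string.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

In the printed report, a Non-Null Count below the total number of entries marks a column with missing values.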

df.describe()

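Purpose: view summary statistics for the numeric columns.

By default, statistics are computed only for numeric columns (int, float), including:

count (number of non-missing values)

mean (average)

std (standard deviation)

min / max (minimum / maximum)

25%, 50%, 75% quantiles

From this you can tell:

Whether there are extreme values (outliers)

Where the data is centered (are the mean and the median close?)

How the data is spread (a large standard deviation means large variation)

🧩 Example output: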
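The comparison of mean and median can be sketched on a toy column. The fares below are made up, with one deliberately extreme value; they are not taken from train.csv.

```python
import pandas as pd

# Made-up fares with one extreme value to show how describe() exposes it.
df = pd.DataFrame({"Fare": [7.0, 8.0, 8.0, 9.0, 500.0]})

stats = df.describe()
print(stats)

# A mean far above the median (the 50% row) hints at a right-skewed column.
mean_fare = stats.loc["mean", "Fare"]
median_fare = stats.loc["50%", "Fare"]
print(f"mean={mean_fare:.1f}, median={median_fare:.1f}")
```

Here the single large fare pulls the mean well above the median, which is the pattern to look for when checking for outliers.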

df.isnull().sum()

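Purpose: count the missing values (NaN) in each column.

.isnull() returns a boolean table with the same shape as df (True marks a missing value)

.sum() adds up each column, giving the number of missing values per column

From this you can tell:

Which columns are missing a lot of data (such as Cabin)

Whether these columns need to be filled in or dropped

🧩 Example output: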
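A minimal sketch of the count on a toy frame with made-up values. The share column is an addition for illustration, assuming that the fraction of missing values is easier to judge than the raw count.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror train.csv.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

missing_count = df.isnull().sum()    # missing values per column
missing_share = df.isnull().mean()   # fraction missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```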

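Step 2: Start cleaning the data

🟠 1. Handle missing values

First, look at the missing values per column with df.isnull().sum().

Suppose the output looks roughly like this:

✅ Next cleaning step: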
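One common way to clean these columns is sketched below. The strategy is an assumption, not confirmed by the original text: fill Age with its median, fill Embarked with its most frequent value, and drop Cabin because most of it is missing. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror train.csv.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Assumed strategy (not confirmed by the original text):
# numeric column: fill with the median
df["Age"] = df["Age"].fillna(df["Age"].median())
# categorical column: fill with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
# mostly empty column: drop it
df = df.drop(columns=["Cabin"])

print(df)
```

Assigning the result back (rather than using inplace=True on a column) avoids pandas warnings about modifying a copy.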

✅ After cleaning, run the check again:
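A short sketch of that re-check, using a toy frame with made-up values that stands in for the cleaned df. The variable is_clean is a hypothetical name added for illustration.

```python
import pandas as pd

# Toy cleaned frame with made-up values, standing in for the cleaned df.
df = pd.DataFrame({
    "Age": [22.0, 26.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "S"],
})

remaining = df.isnull().sum()
print(remaining)

# True when every column has zero missing values left.
is_clean = bool((remaining == 0).all())
print("no missing values left:", is_clean)
```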

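🟠 2. Convert categorical variables to numbers (category encoding)

Sex column: binary (male / female)

Embarked column: three categories (S, C, Q)

Use one-hot encoding:

Notes:

drop_first=True drops one category per column, the first in sorted order (here "C"), to avoid multicollinearity

This leaves two new columns: Embarked_Q and Embarked_S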
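The encoding described above can be sketched with pd.get_dummies, which is the pandas function that takes drop_first. Encoding Sex in the same call is an assumption; the toy values are made up.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror train.csv.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# drop_first=True removes the first category in sorted order for each column:
# "female" for Sex and "C" for Embarked.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both Embarked_Q and Embarked_S is a passenger who embarked at C, so no information is lost by dropping that column.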

✅ Check the result after cleaning

Last update: 2025-02-26
