"Data Science Notes from Scratch" Day #9: Feature Engineering. 🙋 What Is a Feature? | by Ethan Chen | Jun, 2025

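Imagine this scenario:

You want to adopt a puppy. Once you are at the shelter, how do you describe the dog you want to the staff?

You might say: "I want a cute puppy."
But that description is too vague, and the staff may not understand your preference right away.

If you say instead: "I want a cute-looking puppy with brown fur and a black muzzle."
Now the staff can find a matching dog much more easily.

In this example, pieces of information that can be observed or measured, such as "brown fur" and "black muzzle," are called "features."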

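Feature engineering means:
"transforming, processing, or creating" features from raw data so that a machine learning model can understand and learn from them more easily.

These new features do not necessarily come straight from the raw data. Instead, they are produced by:
– Transforming (e.g., applying a mathematical transformation)
– Splitting (e.g., breaking an address into city/district)
– Combining (e.g., creating a new variable: weight (kg) / height (m)² = BMI)
– Encoding (converting categorical variables into numbers)
– Selecting (picking out the features that help)

all of which strengthen the model's ability to learn.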
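
As a minimal sketch of the "combining" step, the snippet below derives BMI from height and weight with pandas. The column names (`height_cm`, `weight_kg`, `bmi`) and the sample values are assumptions made for illustration, not data from the article.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative assumptions
people = pd.DataFrame({
    "height_cm": [160, 170, 180],
    "weight_kg": [55, 68, 80],
})

# BMI = weight (kg) / height (m) squared, so convert centimeters to meters first
height_m = people["height_cm"] / 100
people["bmi"] = people["weight_kg"] / height_m ** 2

print(people.round(2))
```

The new `bmi` column carries information that neither raw column expresses on its own, which is the point of a combined feature.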

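💡 Put simply, features are how a model perceives the data, and feature engineering is the set of techniques that lets the model see more clearly and more accurately.

Feature engineering is also one step in the machine learning workflow. It usually follows data cleaning and further transforms the data into a form better suited to model training.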

1. Numeric Feature Processing

# Standardization
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [10, 20, 30]})
scaler = StandardScaler()
df['value_scaled'] = scaler.fit_transform(df[['value']])
# Normalization
scaler = MinMaxScaler()
df['value_normalized'] = scaler.fit_transform(df[['value']])
# Binning
df['bins'] = pd.cut(df['value'], bins=[0, 15, 25, 35], labels=['low', 'medium', 'high'])
# Log transform
df['log_value'] = np.log1p(df['value'])  # log(1 + x), avoids log(0)
print(df)
# Output
   value  value_scaled  value_normalized    bins  log_value
0     10     -1.224745               0.0     low   2.397895
1     20      0.000000               0.5  medium   3.044522
2     30      1.224745               1.0    high   3.433987

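Explanation

Standardization and normalization both rescale the data (to zero mean and unit variance, and to the [0, 1] range, respectively), which helps the model learn.
Binning is a technique for discretizing numeric data.
There are many log-style transformations; they all aim to reduce the influence of extreme values (very large or very small) on the data.
– E.g., the original values are [10, 100, 1000, 10000]
– After taking the base-10 log, they become [1, 2, 3, 4]
– Besides the training data, these transformations are also often used to transform the predicted values; that is not covered in depth here.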
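
The transforms described above can also be written out by hand with NumPy, which makes the underlying formulas visible. This is a sketch under assumptions: the sample values are illustrative, and the population standard deviation is used so that the result matches what `StandardScaler` computes.

```python
import numpy as np

# Illustrative values spanning several orders of magnitude
values = np.array([10.0, 100.0, 1000.0, 10000.0])

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ndarray.std() defaults to the population standard deviation (ddof=0).
standardized = (values - values.mean()) / values.std()

# Min-max normalization: rescale to the [0, 1] range
normalized = (values - values.min()) / (values.max() - values.min())

# Base-10 log compresses the gap between small and large values
log10_values = np.log10(values)

# log1p and expm1 are inverses of each other, so values predicted on the
# log scale can be mapped back to the original scale
restored = np.expm1(np.log1p(values))

print(standardized)
print(normalized)
print(log10_values)
print(restored)
```

The round trip through `np.log1p` and `np.expm1` is the usual pattern when a model is trained on a log-transformed target and its predictions need to be reported in the original units.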

2. Categorical Feature Processing

# Label encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'gender': ['M', 'F', 'F', 'M']})
le = LabelEncoder()
df['gender_encoded'] = le.fit_transform(df['gender'])
print(df)
# Output
  gender  gender_encoded
0      M               1
1      F               0
2      F               0
3      M               1

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded)
# Output
   gender_encoded  gender_F  gender_M
0               1     False      True
1               0      True     False
2               0      True     False
3               1     False      True

# Frequency encoding
freq = df['gender'].value_counts(normalize=True)
df['gender_freq'] = df['gender'].map(freq)
print(df)
# Output
  gender  gender_encoded  gender_freq
0      M               1          0.5
1      F               0          0.5
2      F               0          0.5
3      M               1          0.5

# Text vectorization (TF-IDF)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['dog cat fish', 'dog dog cat', 'fish bird']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(text_data)
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf)
# Output
       bird       cat       dog      fish
0  0.000000  0.577350  0.577350  0.577350
1  0.000000  0.447214  0.894427  0.000000
2  0.795961  0.000000  0.000000  0.605349

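Explanation

All data fed into a model must be numeric, so every categorical feature has to be converted to numbers first.

Label encoding: converts the labels ["A", "B", "C"] to integers such as [0, 1, 2]
– Pro: adds no extra feature columns
– Con: the integers imply an order that the categories may not actually have, so it may not be suitable
One-hot encoding: converts the labels ["A", "B", "C"] to [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
– Pro: keeps the categories independent of one another
– Con: adds feature columns, so the total number of features can balloon
Frequency encoding: not suitable when the categories occur with similar frequencies, since they then map to nearly the same value (as with M and F above, which both become 0.5)
Text vectorization: a technique commonly used in natural language processing tasks; not discussed here
– Represents text as vectors
– If you are interested, look up material on "Word Embedding."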
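
The ordering caveat for label encoding can be seen in a small sketch. The `size` column and its S < M < L order are assumptions made for illustration; `LabelEncoder` assigns integers by sorted label order, which need not match the real-world order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered category: S < M < L
sizes = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# LabelEncoder sorts the labels alphabetically (L=0, M=1, S=2),
# so the integers do not follow the actual size order
sizes["size_label"] = LabelEncoder().fit_transform(sizes["size"])

# An explicit mapping keeps the intended order
size_order = {"S": 0, "M": 1, "L": 2}
sizes["size_ordinal"] = sizes["size"].map(size_order)

# One-hot encoding implies no order at all, at the cost of one column per category
one_hot = pd.get_dummies(sizes["size"], prefix="size")

print(sizes)
print(one_hot.columns.tolist())
```

When a category has a natural order, an explicit mapping is one way to preserve it; when it does not, one-hot encoding avoids suggesting an order that is not there.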

3. Feature Creation

# Date decomposition
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-06-20'])})
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
print(df)
# Output
        date  year  month  day
0 2024-01-01  2024      1    1
1 2024-06-20  2024      6   20

# Statistical features
df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B'],
    'price': [10, 20, 15, 25]
})
df['city_avg_price'] = df.groupby('city')['price'].transform('mean')
print(df)
# Output
  city  price  city_avg_price
0    A     10            15.0
1    A     20            15.0
2    B     15            20.0
3    B     25            20.0

# Interaction features
df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B'],
    'price': [10, 20, 15, 25]
})
# Average price in each city
df['city_avg_price'] = df.groupby('city')['price'].transform('mean')
# Ratio of each row's price to the average price of its city
df['price_per_unit'] = df['price'] / df['city_avg_price']
print(df)
# Output
  city  price  city_avg_price  price_per_unit
0    A     10            15.0        0.666667
1    A     20            15.0        1.333333
2    B     15            20.0        0.750000
3    B     25            20.0        1.250000

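Explanation

Feature creation, as the name suggests, means building new features of your own to use.

Date decomposition: a date can be split into several columns such as year/month/day/week/quarter
– Brings time information into non-time-series models
Statistical features: compute statistics of a given column, such as the mean, mode, median, and variance, and use them as new features.
– Central tendency: mean, mode, median
– Dispersion: variance
Interaction features: use domain knowledge to create new features
– E.g., price and quantity can be combined into a unit price
– These often have a stronger influence on the target column Y being predicted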
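
The items above mention a few features that the earlier code does not show: week and quarter, median and variance, and a unit price. A minimal sketch of each follows. The `orders` and `prices` tables and all of their column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical order data; column names are illustrative assumptions
orders = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-06-20"]),
    "total_price": [100.0, 240.0],
    "quantity": [4, 8],
})

# Date parts beyond year/month/day
orders["quarter"] = orders["date"].dt.quarter
orders["day_of_week"] = orders["date"].dt.dayofweek  # Monday = 0
orders["week"] = orders["date"].dt.isocalendar().week  # ISO week number

# Interaction feature from domain knowledge: price per unit
orders["unit_price"] = orders["total_price"] / orders["quantity"]

# Group-level statistics other than the mean
prices = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "price": [10, 20, 15, 25],
})
prices["city_median_price"] = prices.groupby("city")["price"].transform("median")
prices["city_price_var"] = prices.groupby("city")["price"].transform("var")  # sample variance

print(orders)
print(prices)
```

Using `transform` rather than `agg` keeps one value per original row, so the statistic can be attached directly as a new column.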

4. Feature Selection

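import pandas as pd

# Simulated data
data = {
    'height_cm': [160, 165, 170, 175, 180],
    'weight_kg': [55, 60, 68, 72, 80]
}
df = pd.DataFrame(data)
# Compute the Pearson correlation coefficient
correlation = df.corr(method='pearson')
print(correlation)
# Output
           height_cm  weight_kg
height_cm   1.000000   0.995350
weight_kg   0.995350   1.000000

Explanation

A value of 0.995, close to 1, means that height and weight are strongly positively correlated.
df.corr() uses the Pearson method by default; "spearman" or "kendall" can be used instead to measure rank-based (monotonic, possibly nonlinear) relationships.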
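
The difference between the methods shows up when a relationship is monotonic but not linear. The sketch below uses made-up data in which `y` is the cube of `x`; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y increases with x, but not along a straight line
curve = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
curve["y"] = curve["x"] ** 3

# Pearson measures linear association, so it falls below 1 here
pearson = curve.corr(method="pearson").loc["x", "y"]

# Spearman compares ranks, so a perfectly monotonic relationship scores 1
spearman = curve.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

A large gap between the two coefficients is a hint that the relationship is monotonic but curved, which a log or power transform of the feature may straighten out.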

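Feature selection methods fall into many different categories and are comparatively complex. Commonly used approaches include Lasso Regression, Random Forest, and XGBoost; simpler approaches pick features using correlation coefficients and collinearity.
This will be covered separately. Beginners can start with the simpler correlation coefficient approach.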
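
One simple way to act on correlation and collinearity is to flag features that are almost perfectly correlated with a feature already kept. The sketch below is one possible convention, not the author's method: the helper name `find_redundant_features`, the 0.9 threshold, and the sample columns are all assumptions.

```python
import numpy as np
import pandas as pd


def find_redundant_features(features, threshold=0.9):
    """Return columns whose absolute Pearson correlation with an
    earlier column exceeds the threshold (hypothetical helper)."""
    corr = features.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Hypothetical features: b is a linear function of a, c is unrelated
features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],
    "c": [5, 1, 4, 2, 3],
})

redundant = find_redundant_features(features, threshold=0.9)
selected = features.drop(columns=redundant)

print(redundant)
print(selected.columns.tolist())
```

The threshold is a judgment call, and which feature of a correlated pair to drop depends on column order here, so in practice the choice is usually reviewed with domain knowledge.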
