Here are three battle-tested methods for handling missing data, which you can apply depending on your use case:
1. Retrieve the Missing Data from the Source
Best for: Internal company datasets or real-time data pipelines.
Example: If bmi is missing, contact the healthcare team collecting the data and request a patch or update.
Pros:
- Highest accuracy.
- Preserves dataset integrity.
Cons:
- Not always feasible.
- Can be time-consuming and bureaucratic.
2. Drop Rows with Missing Values
Best for: Large datasets where missing data is minimal.
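If the source team does send a patch, folding it back into the dataset might look like the sketch below. The patient_id key, the sample values, and the shape of the patch are all assumptions made for illustration; your real patch will depend on how the data owners deliver it.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing bmi value; patient_id is an assumed key.
dataset = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "bmi": [22.5, np.nan, 30.1],
})

# Hypothetical patch supplied by the team that collects the data.
patch = pd.DataFrame({"patient_id": [102], "bmi": [27.4]})

# Align both frames on the key, then fill only the missing cells from the patch.
# Values already present in the dataset are left untouched.
patched = (
    dataset.set_index("patient_id")
    .combine_first(patch.set_index("patient_id"))
    .reset_index()
)

remaining_missing = int(patched["bmi"].isna().sum())
print(patched)
print("bmi values still missing:", remaining_missing)
```

combine_first only fills gaps, so a patch cannot silently overwrite values that were already recorded.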
dataset.dropna(inplace=True)
Pros:
- Simple and fast.
- Clean data without assumptions.
Cons:
- You lose data, possibly including valuable patterns.
- Can bias the model if the missing data isn't random.
3. Impute Missing Values
Best for: When the missingness is small and data patterns are consistent.
For numerical values you can use mean():
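Before dropping anything, it can help to measure how much you are about to lose. The sketch below uses a small made-up dataset (the columns and values are assumptions) to check the share of missing values per column and then drops rows only where bmi is missing, rather than where any column is missing.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
dataset = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "bmi": [22.5, np.nan, 30.1, 27.8],
    "region": ["north", "south", None, "east"],
})

# Share of missing values per column (0.0 means complete, 1.0 means empty).
missing_share = dataset.isna().mean()
print(missing_share)

# Drop rows only where bmi is missing; rows missing just region are kept.
cleaned = dataset.dropna(subset=["bmi"])
rows_dropped = len(dataset) - len(cleaned)
print("rows dropped:", rows_dropped)
```

Restricting dropna to the columns your model actually needs usually removes fewer rows than a blanket call.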
dataset['bmi'] = dataset['bmi'].fillna(dataset['bmi'].mean())
For categorical values you can use mode(), though other imputation strategies exist as well:
dataset['region'] = dataset['region'].fillna(dataset['region'].mode()[0])
Pros:
- Keeps all rows.
- Allows model training without interruption.
Cons:
- Injects artificial data.
- May dilute data quality or hide underlying issues.
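One way to soften the last two cons is to record which values were imputed before you fill them. The sketch below is a variation on the snippets above, not a replacement for them: it swaps in the median for the mean (the median is less sensitive to outliers) and adds a flag column. The sample values and the bmi_was_missing column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
dataset = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "region": ["north", "south", None, "north"],
})

# Flag imputed rows first so the information is not lost after filling.
# bmi_was_missing is a hypothetical column name.
dataset["bmi_was_missing"] = dataset["bmi"].isna()

# Numerical column: fill with the median of the observed values.
dataset["bmi"] = dataset["bmi"].fillna(dataset["bmi"].median())

# Categorical column: fill with the most frequent observed value.
dataset["region"] = dataset["region"].fillna(dataset["region"].mode()[0])

print(dataset)
```

The flag column lets you check later whether imputed rows behave differently from observed ones, or lets a model learn from the missingness itself.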