Seamless Data Preprocessing: the Magic of sklearn Pipelines
A common mistake among data scientists is not using sklearn’s pipelines as the default way of writing preprocessing jobs.
A pipeline chains sequential transformation steps into a single object, so the whole preprocessing job is fitted and applied with one call.
Three advantages:
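- Cleaner, shorter, and more reusable code
- More robust: the steps always run in the same order, with parameters learned in a single fit
- More production-ready: the whole preprocessing chain is one object that can be reused on new data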
Here’s a quick example:
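import pandas as pd
from sklearn import compose, impute, pipeline, preprocessing

# df is assumed to be an already loaded pandas DataFrame with an "island" column
# Isolating numerical features
numerical_columns = [column for column in df.columns if df[column].dtype in ["int64", "float64"]]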
# Creating a preprocessing pipeline for numerical features (mean missing value imputation + standard scaling)
numerical_preprocessor = pipeline.Pipeline(steps=[
    ("imputer", impute.SimpleImputer(strategy="mean")),
    ("scaler", preprocessing.StandardScaler()),
])
# Creating a preprocessing pipeline for categorical features (most frequent imputation + one-hot encoding)
categorical_preprocessor = pipeline.Pipeline(steps=[
    ("imputer", impute.SimpleImputer(strategy="most_frequent")),
    ("onehot", preprocessing.OneHotEncoder(handle_unknown="ignore")),
])
# Routing each group of columns to its own preprocessing pipeline
preprocessor = compose.ColumnTransformer(
    transformers=[
        ("numerical_preprocessor", numerical_preprocessor, numerical_columns),
        ("categorical_preprocessor", categorical_preprocessor, ["island"]),
    ]
)
# Applying the preprocessing pipeline
X = pd.DataFrame(preprocessor.fit_transform(df))
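Even simpler and faster to write: make_pipeline() builds the same kind of pipeline without requiring you to name each step (see its entry in the sklearn documentation).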
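As a rough illustration of the difference, the sketch below builds the numerical preprocessing pipeline both ways. The small input array is made up for the example. With make_pipeline(), step names are generated from the lowercased class names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit version: every step gets a name chosen by hand
explicit = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Shorthand version: names are derived from the class names
shorthand = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

step_names = [name for name, _ in shorthand.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']

# Both pipelines apply the same transformations (made-up data with a missing value)
data = np.array([[1.0], [np.nan], [3.0], [5.0]])
same_result = np.allclose(explicit.fit_transform(data), shorthand.fit_transform(data))
print(same_result)  # True
```

Explicit names remain useful when you need to reference a step later, for example when tuning its parameters, so the shorthand is mostly a convenience for quick jobs.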