Unleash the Potential of Data with ARFS — Part 2: Basic Feature Selection Automated
Hey Data Enthusiasts! 🌟 It’s me again
Have you ever felt lost in the labyrinth of data analysis, trying to figure out which features in your dataset truly matter? Fear not, because today we’re diving into the magical world of Basic Feature Selection automation using ARFS! 🧙♂️📊
As you may recall from our previous trip, we’re well on our way to automating the basic feature selection steps, neatly packaged into a single object. These basics may seem a bit dull, but they are only the first step! In the next story, I’ll showcase the features that truly set ARFS apart from other feature selection packages.
Feature selection is crucial for optimizing machine learning models. The ARFS (All Relevant Feature Selection) library provides robust tools to refine datasets, ensuring accurate and efficient models.
Tackling Missing Values
High percentages of missing values can compromise your dataset’s integrity. ARFS offers the `MissingValueThreshold` selector to drop features whose share of missing values exceeds a given threshold.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import gc
import arfs
import arfs.preprocessing as arfspp
import arfs.feature_selection as arfsfs
from arfs.utils import (
_make_corr_dataset_regression,
_make_corr_dataset_classification,
)
# generate a toy regression dataset: X (predictors), y (target), w (sample weights)
X, y, w = _make_corr_dataset_regression()
data = X.copy()
data["target"] = y
# significant regressors
x_vars = ["var0", "var1", "var2", "var3", "var4"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)
# noise
x_vars = ["var5", "var6", "var7", "var8", "var9", "var10"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)
plt.show()
from arfs.feature_selection import MissingValueThreshold
selector = MissingValueThreshold(0.05)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
Eliminating Zero-Variance Features
Features with a single unique value (zero variance) offer no predictive power. ARFS’s `UniqueValuesThreshold` removes columns whose number of unique values falls below a threshold.
from arfs.feature_selection import UniqueValuesThreshold
selector = UniqueValuesThreshold(threshold=2)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
Filtering High Cardinality Features
Categorical features with numerous unique values can lead to overfitting. Use the `CardinalityThreshold` selector to manage high cardinality, especially in tree-based models.
from arfs.feature_selection import CardinalityThreshold

# high cardinality for categorical predictors
# unsupervised learning, doesn't need a target
selector = CardinalityThreshold(threshold=100)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
Managing Highly Correlated Features
Highly correlated features carry redundant information and can skew results. The `CollinearityThreshold` selector in ARFS keeps things balanced by removing features that are excessively associated with others.
For a deep dive, consult the dedicated story, which explains all the components and how to customize the association functions.
from arfs.feature_selection import CollinearityThreshold
selector = CollinearityThreshold(threshold=0.85, n_jobs=1)
X_trans = selector.fit_transform(X)
f = selector.plot_association()
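Since `CollinearityThreshold` is built on the same scikit-learn-style base class as the other selectors, it should expose the same attributes, so you can inspect it the same way:
# same inspection pattern as the other selectors
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")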
Evaluating Predictive Power
Assessing the predictive power of features is paramount. ARFS’s `VariableImportance` selector fits a gradient boosting machine (LightGBM) and uses Shapley values to rank features by importance, discarding those that contribute the least.
lgb_kwargs = {"objective": "rmse", "zero_as_missing": False}
selector = VariableImportance(
verbose=2, threshold=0.99, lgb_kwargs=lgb_kwargs, fastshap=False
)
X_trans = selector.fit_transform(X=X, y=y, sample_weight=w)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
All at once — Sklearn Pipeline
The selectors follow the scikit-learn base classes, so they are compatible with scikit-learn tooling in general. This has several advantages: the workflow is easier to maintain, easier to version, more flexible, and faster, since unnecessary columns are removed before the computationally demanding steps.
ARFS is quite flexible and mimics scikit-learn: you can easily build a column selector for pipelining purposes.
from arfs.preprocessing import dtype_column_selector

# pick the categorical/object/bool columns, excluding "nice_guys"
cat_features_selector = dtype_column_selector(
dtype_include=["category", "object", "bool"],
dtype_exclude=[np.number],
pattern=None,
exclude_cols=["nice_guys"],
)
cat_features_selector(X)
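Because `dtype_column_selector` is callable on `X` and mimics scikit-learn’s `make_column_selector`, you should be able to plug it straight into a `ColumnTransformer`. A minimal sketch under that assumption:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# assumes dtype_column_selector returns the matching column names when called on X
# one-hot encode the columns picked by the ARFS column selector, pass the rest through
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), cat_features_selector)],
    remainder="passthrough",
)
X_encoded = preprocessor.fit_transform(X)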
Now let’s wrap multiple steps into a single pipeline. Note that it handles sample weights, which is convenient for Poisson, Gamma, Tweedie, and other non-Gaussian regressions.
from sklearn.pipeline import Pipeline
from arfs.preprocessing import OrdinalEncoderPandas

basic_fs_pipeline = Pipeline(
    [
("missing", MissingValueThreshold(threshold=0.05)),
("unique", UniqueValuesThreshold(threshold=1)),
("cardinality", CardinalityThreshold(threshold=10)),
("collinearity", CollinearityThreshold(threshold=0.75)),
("encoder", OrdinalEncoderPandas()),
(
"lowimp",
VariableImportance(
verbose=2, threshold=0.99, lgb_kwargs=lgb_kwargs, encode=False
),
),
]
)
X_trans = basic_fs_pipeline.fit_transform(
X=X, y=y, collinearity__sample_weight=w, lowimp__sample_weight=w
)
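As a quick sanity check, compare the number of features going in and coming out (assuming the pipeline returns a pandas DataFrame, which the `OrdinalEncoderPandas` step suggests):
# X_trans is expected to be a pandas DataFrame here
print(f"Number of features before the pipeline : {X.shape[1]}")
print(f"Number of features after the pipeline : {X_trans.shape[1]}")
print(f"The selected features are : {list(X_trans.columns)}")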
Easy tracking and reporting
Whenever your manager asks questions, use
from arfs.feature_selection import make_fs_summary
make_fs_summary(basic_fs_pipeline)
and go grab a coffee ☕
Conclusion
All the basics in one go, with a convenient one-liner for the reporting part!
But get ready, data enthusiasts! Our next story dives into intriguing and innovative approaches to model-agnostic feature selection. Stay tuned!