Unleash the Potential of Data with ARFS — Part 2: Basic Feature Selection Automated

Thomas Bury
4 min read · Jun 15, 2024


Image by Microsoft Designer

Hey Data Enthusiasts! 🌟 It's me again!

Have you ever felt lost in the labyrinth of data analysis, trying to figure out which features in your dataset truly matter? Fear not, because today we’re diving into the magical world of Basic Feature Selection automation using ARFS! 🧙‍♂️📊

As you may recall from our previous trip, we're well on our way to automating the basic feature selection steps, neatly packaged into a single object. These steps are a bit boring, but they are only the first stop! In the next story, I'll showcase the features that truly set ARFS apart from other feature selection packages.

Feature selection is crucial for optimizing machine learning models. The ARFS (All Relevant Feature Selection) library provides robust tools to refine datasets, ensuring accurate and efficient models.

Tackling Missing Values

High percentages of missing values can compromise your dataset’s integrity. ARFS offers the `MissingValueThreshold` selector to exclude features with excessive gaps.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import gc

import arfs
import arfs.preprocessing as arfspp
import arfs.feature_selection as arfsfs
from arfs.utils import (
    _make_corr_dataset_regression,
    _make_corr_dataset_classification,
)

# toy regression dataset shipped with ARFS: features, target and sample weights
X, y, w = _make_corr_dataset_regression()
data = X.copy()
data["target"] = y

# significant regressors
x_vars = ["var0", "var1", "var2", "var3", "var4"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)

# noise
x_vars = ["var5", "var6", "var7", "var8", "var9", "var10"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)

plt.plot();

from arfs.feature_selection import MissingValueThreshold

selector = MissingValueThreshold(threshold=0.05)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
Image by Author

Eliminating Zero-Variance Features

Zero-variance features offer no predictive power. ARFS’s `UniqueValuesThreshold` helps you remove these uninformative columns.

from arfs.feature_selection import UniqueValuesThreshold
selector = UniqueValuesThreshold(threshold=2)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

Filtering High Cardinality Features

Categorical features with numerous unique values can lead to overfitting. Use the `CardinalityThreshold` selector to manage high cardinality, especially in tree-based models.

# high cardinality for categorical predictors
# unsupervised learning, doesn't need a target
from arfs.feature_selection import CardinalityThreshold

selector = CardinalityThreshold(threshold=100)
X_trans = selector.fit_transform(X)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

Managing Highly Correlated Features

High correlation between features can skew results. The `CollinearityThreshold` selector in ARFS helps you maintain balance by removing excessively correlated features.

For a deep dive, consult the dedicated story explaining all the components and how to customize the association functions.

from arfs.feature_selection import CollinearityThreshold
selector = CollinearityThreshold(threshold=0.85, n_jobs=1)
X_trans = selector.fit_transform(X)
f = selector.plot_association()
Image by Author
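If you want a quick list of what was flagged as redundant, the usual selector attributes are enough; a small sanity check using only the attributes shown above:

kept = set(selector.get_feature_names_out())
dropped = [c for c in selector.feature_names_in_ if c not in kept]
print(f"Kept as-is : {sorted(kept)}")
print(f"Dropped as collinear : {dropped}")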

Evaluating Predictive Power

Assessing the predictive power of features is paramount. ARFS’s `VariableImportance` selector uses gradient boosting machines and Shapley values to rank features by importance.

from arfs.feature_selection import VariableImportance

lgb_kwargs = {"objective": "rmse", "zero_as_missing": False}
selector = VariableImportance(
    verbose=2, threshold=0.99, lgb_kwargs=lgb_kwargs, fastshap=False
)
X_trans = selector.fit_transform(X=X, y=y, sample_weight=w)
print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

All at once — Sklearn Pipeline

The selectors follow the scikit-learn base classes, so they are compatible with scikit-learn in general. This has several advantages: the pipeline is easier to maintain, easier to version, and more flexible, and it runs faster because unnecessary columns are removed before the computationally demanding steps.

ARFS is quite flexible and mimics sklearn. You can easily build a column selector for pipelining purposes.

from arfs.preprocessing import dtype_column_selector

cat_features_selector = dtype_column_selector(
    dtype_include=["category", "object", "bool"],
    dtype_exclude=[np.number],
    pattern=None,
    exclude_cols=["nice_guys"],
)

cat_features_selector(X)
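Since `dtype_column_selector(...)` returns a callable that yields the matching column names when applied to a DataFrame, it can be dropped into a scikit-learn `ColumnTransformer` much like `make_column_selector`. A minimal sketch; the encoder choice here is mine, not part of ARFS:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

preprocess = ColumnTransformer(
    transformers=[
        (
            "cat",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            cat_features_selector,  # callable resolving to the categorical columns
        ),
    ],
    remainder="passthrough",
)
preprocess.fit_transform(X)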

Now let's wrap multiple steps into a single pipeline. Note that it handles sample weights, which is convenient for Poisson, Gamma, Tweedie, and other non-Gaussian regressions.

from sklearn.pipeline import Pipeline
from arfs.preprocessing import OrdinalEncoderPandas

basic_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("cardinality", CardinalityThreshold(threshold=10)),
        ("collinearity", CollinearityThreshold(threshold=0.75)),
        ("encoder", OrdinalEncoderPandas()),
        (
            "lowimp",
            VariableImportance(
                verbose=2, threshold=0.99, lgb_kwargs=lgb_kwargs, encode=False
            ),
        ),
    ]
)

X_trans = basic_fs_pipeline.fit_transform(
    X=X, y=y, collinearity__sample_weight=w, lowimp__sample_weight=w
)
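Because the ARFS selectors keep everything in pandas, the pipeline output is still a DataFrame (at least in the versions I've used), so a quick before/after sanity check takes two lines:

print(f"{X.shape[1]} columns in, {X_trans.shape[1]} columns out")
print(list(X_trans.columns))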

Easy tracking and reporting

Whenever your manager asks questions, use

from arfs.feature_selection import make_fs_summary

make_fs_summary(basic_fs_pipeline)
Image by Author. NaN means either “not a feature selection step” or “was eliminated at a previous step”

and go grab a coffee ☕

Conclusion

All the basics in one go, with a convenient one-liner for the reporting part!

But get ready, data enthusiasts! Our next story dives into intriguing and innovative approaches to model-agnostic feature selection. Stay tuned!


Written by Thomas Bury

Physicist by passion and training, Data Scientist and MLE for a living (it's fun too), interdisciplinary by conviction.