Unleash the Potential of Data with ARFS — Part 1: Uncover Hidden Relationships 🔍🚀

8 min read · Jan 8, 2024

Hey Data Enthusiasts! 🌟

Have you ever felt lost in the labyrinth of data analysis, trying to figure out which features in your dataset truly matter? Fear not, because today we’re diving into the magical world of Association and Feature Selection using ARFS! 🧙‍♂️📊

In this guide, we’ll explore the art of computing associations between different data types — be it categorical-categorical, numerical-numerical, or categorical-numerical. It’s all about understanding those sneaky relationships and dependencies hidden within your data. 🤔💡

But wait, there’s more! We’ll also get our hands dirty with real examples, showcasing how these associations can be your secret weapon for feature selection. Whether you’re exploring correlations in continuous data or deciphering dependencies in categorical data, these functions are your robust toolkit for uncovering the intricate web of relationships in your data. 🕸️🔍

And for the cherry on top, we’ll show you how to turbocharge these computations with parallel processing. Imagine handling batches of columns like a data superhero! 🦸‍♀️💨

Ready to transform your exploratory data analysis and predictive modeling? Let’s dive in and unravel the mysteries of your dataset! 🌊🔮

For a deep dive into the tutorial, head over to ARFS Documentation — Association and Feature Selection. Let’s turn your data challenges into opportunities! 🚀🌈

Deciphering the Mosaic of Data Interconnections 🧩📈

Association analysis encompasses various techniques that quantify the strength and direction of relationships between variables. By delving into these relationships, we gain valuable insights into the underlying structure of our data, enabling us to make informed decisions in various data analysis tasks.

ARFS uses a comprehensive suite of association measures tailored for different combinations of data types:

Categorical-Categorical: Theil’s U statistic, an asymmetric measure that captures the strength of association between categorical variables.
Numerical-Numerical: Spearman correlation coefficient, a non-parametric measure that assesses the monotonic relationship between continuous variables.
Categorical-Numerical: Correlation ratio, treated as a symmetric measure, which quantifies how much of the numerical variable's variance is explained by the categories (it is not limited to linear relationships).

However, you can also define your own; stay tuned for more details.
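To make the asymmetry of Theil's U and the idea behind the correlation ratio concrete, here is a minimal, dependency-free sketch written from the textbook definitions. It is not the ARFS implementation (which also supports sample weights); the names `theils_u_sketch` and `correlation_ratio_sketch` and the toy data are made up for illustration.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(x) of a sequence of categories."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(x | y)."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())


def theils_u_sketch(x, y):
    """U(x | y): share of the uncertainty in x that is removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


def correlation_ratio_sketch(categories, values):
    """Eta: sqrt(between-group variance / total variance) of the numerical values."""
    groups = {}
    for cat, val in zip(categories, values):
        groups.setdefault(cat, []).append(val)
    grand_mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return (between / total) ** 0.5 if total > 0 else 0.0


# Toy data: the title fully determines the sex, but not the other way round
title = ["Mr", "Mr", "Mrs", "Miss", "Master", "Mrs"]
sex = ["male", "male", "female", "female", "male", "female"]

u_sex_given_title = theils_u_sketch(sex, title)  # 1.0: title tells you the sex
u_title_given_sex = theils_u_sketch(title, sex)  # below 1: sex leaves the title uncertain
print(u_sex_given_title, u_title_given_sex)
```

The two directions give different values, which is why the categorical-categorical block of an association matrix does not have to be symmetric.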

Setting Sail with the Titanic Data: A Twist of Random Predictors! 🛳️🎲

Imagine being aboard the iconic Titanic, navigating through a sea of data. But here’s the fun part: we’re stirring in some wild cards, a dash of random predictors, to spice up the journey! It’s like adding a pinch of mystery to an already intriguing adventure 🌊✨.

import pandas as pd
import numpy as np

from arfs.utils import load_data
from arfs.feature_selection.unsupervised import CollinearityThreshold
from arfs.association import asymmetric_function, xy_to_matrix
from arfs.association import (association_matrix,
 correlation_ratio_matrix,
 _callable_association_matrix_fn,
 correlation_ratio,
 weighted_corr,
 wcorr_matrix,
 theils_u_matrix,
 weighted_theils_u,
 association_series,
 _callable_association_series_fn)

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target
y = y.astype(int)
X.head()

|   | pclass | sex    | embarked | random_cat | is_alone | title  | age    | family_size | fare    | random_num |
|---|--------|--------|----------|------------|----------|--------|--------|-------------|---------|------------|
| 0 | 1      | female | S        | Fry        | 1        | Mrs    | 29.0000 | 0.0         | 211.3375 | 0.496714   |
| 1 | 1      | male   | S        | Bender     | 0        | Master | 0.9167 | 3.0         | 151.5500 | -0.138264  |
| 2 | 1      | female | S        | Thanos     | 0        | Mrs    | 2.0000 | 3.0         | 151.5500 | 0.647689   |
| 3 | 1      | male   | S        | Morty      | 0        | Mr     | 30.0000 | 3.0         | 151.5500 | 1.523030   |
| 4 | 1      | female | S        | Morty      | 0        | Mrs    | 25.0000 | 3.0         | 151.5500 | -0.234153  |

Using ARFS to compute the association series and matrix (the association being a generalization of the correlation coefficient) is a breeze:

association_series(
    X=X,
    target="age",
    normalize=False,
    n_jobs=1,
    handle_na="drop",
)

| Feature     | Association |
|-------------|-------------|
| age         | 1.000000    |
| title       | 0.403618    |
| pclass      | 0.375524    |
| is_alone    | 0.222841    |
| fare        | 0.163930    |
| embarked    | 0.098631    |
| sex         | 0.057398    |
| random_cat  | 0.037237    |
| random_num  | -0.035203   |
| family_size | -0.139715   |

assoc_m = association_matrix(X=X, n_jobs=1)
xy_to_matrix(assoc_m)

|             | age       | embarked  | family_size | fare      | is_alone  | pclass    | random_cat | random_num | sex       | title     |
|-------------|-----------|-----------|-------------|-----------|-----------|-----------|------------|------------|-----------|-----------|
| age         | 0.000000  | 0.098631  | -0.196996   | 0.171521  | 0.222841  | 0.375524  | 0.037237   | -0.041389  | 0.057398  | 0.403618  |
| embarked    | 0.098631  | 0.000000  | 0.104125    | 0.300998  | 0.006505  | 0.100167  | 0.006122   | 0.052706   | 0.011014  | 0.012525  |
| family_size | -0.196996 | 0.104125  | 0.000000    | 0.226465  | 0.785592  | 0.059053  | 0.064754   | -0.019169  | 0.188583  | 0.438517  |
| fare        | 0.171521  | 0.300998  | 0.226465    | 0.000000  | 0.175140  | 0.602869  | 0.070161   | -0.024327  | 0.185484  | 0.196217  |
| is_alone    | 0.222841  | 0.010056  | 0.785592    | 0.175140  | 0.000000  | 0.002527  | 0.005153   | 0.010023   | 0.031810  | 0.172022  |
| pclass      | 0.375524  | 0.080502  | 0.059053    | 0.602869  | 0.001314  | 0.000000  | 0.003934   | 0.052158   | 0.007678  | 0.029489  |
| random_cat  | 0.037237  | 0.002545  | 0.064754    | 0.070161  | 0.001386  | 0.002035  | 0.000000   | 0.064163   | 0.000818  | 0.002632  |
| random_num  | -0.041389 | 0.052706  | -0.019169   | -0.024327 | 0.010023  | 0.052158  | 0.064163   | 0.000000   | 0.045685  | 0.066008  |
| sex         | 0.057398  | 0.013678  | 0.188583    | 0.185484  | 0.025554  | 0.011864  | 0.002444   | 0.045685   | 0.000000  | 0.976945  |
| title       | 0.403618  | 0.010986  | 0.438517    | 0.196217  | 0.097605  | 0.032183  | 0.005553   | 0.066008   | 0.690022  | 0.000000  |
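The long-to-wide reshaping step can be pictured with plain pandas. This is only a sketch: it assumes the long-format frame has the columns "row", "col" and "val" (the layout that custom association functions are expected to return), the toy values are made up, and it is not the actual code of `xy_to_matrix`.

```python
import pandas as pd

# Hypothetical long-format associations: one line per (row, col) pair
long_df = pd.DataFrame(
    {
        "row": ["age", "fare", "age", "pclass"],
        "col": ["fare", "age", "pclass", "age"],
        "val": [0.17, 0.17, 0.38, 0.38],
    }
)

# Pivot to a square matrix; pairs that were never computed
# (including the diagonal here) are filled with 0
wide = long_df.pivot(index="row", columns="col", values="val").fillna(0.0)
print(wide)
```

The wide layout is handy for eyeballing or plotting a heatmap, while the long layout is easier to sort and filter.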

Crafting Your Association! ✨📐

In ARFS, the `association_series` and `association_matrix` functions are like your personal data wizards, computing association values with a sprinkle of parallel processing (when `n_jobs` is greater than 1, that is). Under the hood, `_callable_association_series_fn` and `_callable_association_matrix_fn` are the trusty assistants that apply any generic function you hand them to pairs of columns to compute the association coefficients. Let’s brew some custom associations! 🧙‍♂️🔢

However, the input functions must have a well-defined structure:

from arfs.association import symmetric_function

@symmetric_function
def input_function_computing_coefficient_values(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the [DESCRIPTION HERE] for series x with respect to series y.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
    sample_weight : array-like, optional
        Sample weights, if the coefficient supports them. The default is None.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the [COEFFICIENT NAME] between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" represent the names of the series x and y, respectively,
        and "val" is the coefficient value. If `as_frame` is False, returns the coefficient as a float.
    """

    if x.name == y.name:
        # A feature is perfectly associated with itself
        c = 1
    else:
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Compute the coefficient from df and store it in c
        [CUSTOM CODE HERE, RETURNING THE COEFFICIENT VALUE c]

    if as_frame:
        # Symmetry: the value is computed once and reported for both (x, y) and (y, x)
        return pd.DataFrame(
            {"row": [x.name, y.name], "col": [y.name, x.name], "val": [c, c]}
        )
    else:
        return c

Let’s illustrate with an example, using the Predictive Power Score (PPS), which is asymmetric:

import ppscore as pps

@asymmetric_function
def ppscore_arfs(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    The PPS is a score that shows the predictive relationship between two variables.
    This function calculates the PPS of x predicting y. If the series have the same name,
    the function assumes they are identical and returns a score of 1.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
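    sample_weight : array-like, optional
        Not used by the PPS; kept so the signature matches what ARFS expects.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.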

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the PPS between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" represent the names of the series x and y, respectively,
        and "val" is the PPS score. If `as_frame` is False, returns the PPS score as a float.
    """

    # Accept single-column DataFrames as well: extract the underlying Series
    if (
        isinstance(x, pd.DataFrame)
        and isinstance(y, pd.DataFrame)
        and x.shape[1] == 1
        and y.shape[1] == 1
    ):
        x = x.iloc[:, 0]
        y = y.iloc[:, 0]

    if x.name == y.name:
        score = 1
    else:
        # Merge x and y into a single DataFrame, as expected by ppscore
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Compute the PPS of x predicting y and extract the score
        score = pps.score(df, "x", "y")["ppscore"]

    if as_frame:
        return pd.DataFrame({"row": x.name, "col": y.name, "val":score}, index=[0])
    else:
        return score

The custom association function is ready to use:

d = association_matrix(
    X=X,
    n_jobs=1,
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs)

xy_to_matrix(d)

|             | age      | embarked | family_size | fare     | is_alone | pclass   | random_cat | random_num | sex      | title    |
|-------------|----------|----------|-------------|----------|----------|----------|------------|------------|----------|----------|
| age         | 0.00000  | 0.000000 | 0.000000    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| embarked    | 0.00000  | 0.000000 | 0.000000    | 0.000000 | 0.000000 | 0.132714 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| family_size | 0.00000  | 0.000000 | 0.000000    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| fare        | 0.00000  | 0.000000 | 0.603721    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| is_alone    | 0.00000  | 0.000012 | 0.324128    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.166500 | 0.208732 |
| pclass      | 0.00000  | 0.000012 | 0.000000    | 0.188409 | 0.000000 | 0.000000 | 0.0        | 0.000606   | 0.000000 | 0.000000 |
| random_cat  | 0.00000  | 0.000012 | 0.000000    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| random_num  | 0.00000  | 0.000000 | 0.000000    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.000000   | 0.000000 | 0.000000 |
| sex         | 0.00000  | 0.000012 | 0.000000    | 0.000000 | 0.000000 | 0.000000 | 0.0        | 0.001554   | 0.000000 | 0.796922 |
| title       | 0.04168  | 0.000012 | 0.000000    | 0.000000 | 0.267387 | 0.000000 | 0.0        | 0.001815   | 0.984495 | 0.000000 |

Link to Feature Selection 🛠️🌟

Embark on a journey through the arfs.feature_selection.unsupervised module, where the CollinearityThreshold selector quietly uses the association matrix behind the scenes. You can swap out the default association functions for your very own creations, for a truly bespoke feature selection experience!

selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.5,
).fit(X)

print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

f = selector.plot_association(figsize=(4, 4))

The features going in the selector are : ['pclass' 'sex' 'embarked' 'random_cat' 'is_alone' 'title' 'age'
 'family_size' 'fare' 'random_num']
The support is : [ True  True  True  True  True False  True  True False  True]
The selected features are : ['pclass' 'sex' 'embarked' 'random_cat' 'is_alone' 'age' 'family_size'
 'random_num']

The custom association matrix, image by author.
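To get a feel for how an association threshold can turn into a feature selection, here is a small greedy sketch in plain Python. It is an assumption-laden illustration, not the algorithm ARFS runs inside CollinearityThreshold: the function `drop_collinear_sketch`, the rule "drop the feature with the largest total association first", and the toy values are all hypothetical.

```python
def drop_collinear_sketch(assoc, threshold):
    """Greedily drop features until no remaining pair exceeds the threshold.

    assoc maps (row, col) pairs to association values; missing pairs count as 0.
    Returns the kept feature names in their order of first appearance.
    """
    kept = []
    for row, col in assoc:
        for name in (row, col):
            if name not in kept:
                kept.append(name)

    def strength(a, b):
        # For asymmetric measures, keep the stronger of the two directions
        return max(abs(assoc.get((a, b), 0.0)), abs(assoc.get((b, a), 0.0)))

    while True:
        # Features involved in at least one pair above the threshold
        offenders = [
            f for f in kept
            if any(strength(f, g) > threshold for g in kept if g != f)
        ]
        if not offenders:
            return kept
        # Hypothetical rule: remove the offender most associated with the others
        worst = max(offenders, key=lambda f: sum(strength(f, g) for g in kept if g != f))
        kept.remove(worst)


# Toy association values between three hypothetical features
toy_assoc = {("a", "b"): 0.9, ("a", "c"): 0.3, ("b", "c"): 0.1}
selected = drop_collinear_sketch(toy_assoc, threshold=0.5)
print(selected)  # ['b', 'c']
```

Here "a" and "b" are strongly associated, and "a" goes first because it is also the more associated of the two with "c". Other tie-breaking rules are equally defensible, so treat the output as an illustration of the idea only.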

You can find all the resources in:

the GitHub repo
the documentation
Install: pip install -U arfs

— — — —

Remember, the thousand-mile journey of data exploration starts with a single step (or click!). Happy data diving, and stay tuned for Part 2! 🎉📈

Written by Thomas Bury
