EDA

Exploratory Data Analysis (EDA) is the first step to complete before starting model development. This module contains common utilities for EDA on tabular data. Most of these functions cover plotting, data quality checks, and computing summary statistics.

get_ecdf(a)

Compute the empirical cumulative distribution function (ECDF) of a.

Parameters:

Name Type Description Default
a list, Array, or pd.Series

Array to compute ECDF on.

required

Returns:

Name Type Description
x array

Sorted version of given array in ascending order.

y array

Cumulative probability of each value in the sorted array.
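To make the return values concrete, here is a minimal sketch of an ECDF computation with NumPy. It is not the module's implementation: the name ecdf_sketch is hypothetical, and the i / n convention for the cumulative probability is an assumption (the real function may handle ties or missing values differently).

```python
import numpy as np


def ecdf_sketch(a):
    """Hypothetical stand-in for get_ecdf: return sorted values and cumulative probabilities."""
    x = np.sort(np.asarray(a))
    # Assumed convention: the i-th smallest value (1-indexed) gets probability i / n.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf_sketch([3, 1, 2, 4])
print(x)  # sorted values
print(y)  # cumulative probabilities, ending at 1.0
```

Under this convention, y[i] is the fraction of observations less than or equal to x[i].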

na_percentages(df, formatted=True)

Compute percentage of missing values in df columns.

Parameters:

Name Type Description Default
df DataFrame

DataFrame to compute missing-value percentages for.

required
formatted bool

Whether to return styled/formatted dataframe or raw percentages.

True

Returns:

Name Type Description
res Series or DataFrame

Percentages of missing values in each column.
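The raw percentages can be reproduced with plain pandas, which may help when checking the output. The sketch below is an assumption about what the function computes, not its actual code. In particular, the string formatting here only approximates the "formatted" result, which the documentation describes as a styled DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data: "age" has 2 of 4 values missing, "city" has 1 of 4, "id" has none.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 31, np.nan],
        "city": ["NYC", "LA", None, "SF"],
        "id": [1, 2, 3, 4],
    }
)

# Fraction of missing entries per column, scaled to a percentage.
raw = df.isna().mean() * 100

# Assumed human-readable variant; the real function may style the output differently.
formatted = raw.map(lambda p: f"{p:.1f}%")

print(raw)
print(formatted)
```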

plot_corr_matrix(df, method='pearson', figsize=(12, 6))

Plot a heatmap of the correlation matrix computed with the given method.

Parameters:

Name Type Description Default
df DataFrame

DataFrame whose column correlations are computed.

required
method str

Method of correlation.

'pearson'
figsize tuple

Figure size.

(12, 6)
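As a rough illustration of what such a plot contains, the sketch below computes a correlation matrix with pandas and draws it with matplotlib. The toy columns, the colormap, and the use of imshow are all assumptions made for this example. The module's own rendering may differ, for instance by annotating cells or masking the upper triangle.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# "two_x" is perfectly correlated with "x"; "noise" is independent of both.
df = pd.DataFrame({"x": x, "two_x": 2 * x, "noise": rng.normal(size=200)})

corr = df.corr(method="pearson")

fig, ax = plt.subplots(figsize=(12, 6))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="pearson correlation")
```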

plot_ecdf(a, xlabel='X')

Plot the empirical cumulative distribution function (ECDF) of a.

Parameters:

Name Type Description Default
a list, Array, or pd.Series

Array to compute ECDF on.

required
xlabel str

X-axis label of the plot.

"X"
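A plot of this kind can be approximated directly with matplotlib. This is a sketch under assumptions: the marker style and the "ECDF" y-axis label are choices made for the example, not documented behavior of plot_ecdf.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

a = [2.5, 0.5, 1.5, 3.5, 1.0]

# Sorted values on the x-axis, cumulative probability i / n on the y-axis.
x = np.sort(np.asarray(a))
y = np.arange(1, len(x) + 1) / len(x)

fig, ax = plt.subplots()
(line,) = ax.plot(x, y, marker=".", linestyle="none")
ax.set_xlabel("X")  # plays the role of the xlabel parameter
ax.set_ylabel("ECDF")
```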

plot_featurebased_hier_clustering(X, feature_names=None, linkage_method='single', figsize=(16, 12))

Plot feature-based hierarchical clustering computed from the Spearman correlation matrix.

Parameters:

Name Type Description Default
X ndarray | DataFrame

Data to compute hierarchical clustering on.

required
feature_names ndarray | list | None

Feature names to use as labels in the plot.

None
linkage_method str

Method for calculating the distance between clusters.

"single"
figsize tuple

Figure size.

(16, 12)
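The general technique can be sketched with SciPy: compute the Spearman correlation between features, turn it into a distance, and build a linkage. How the module converts correlations to distances is not documented here, so the 1 - |correlation| distance below is an assumption, as are the toy data and feature names.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
base = rng.normal(size=300)
# Columns 0 and 1 are nearly identical; columns 2 and 3 are independent noise.
X = np.column_stack(
    [base, base + 0.01 * rng.normal(size=300), rng.normal(size=300), rng.normal(size=300)]
)
feature_names = ["a", "a_copy", "b", "c"]

# Spearman correlation matrix between the columns (features) of X.
corr = spearmanr(X).correlation

# Assumed conversion: strongly correlated features are "close" to each other.
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)

Z = hierarchy.linkage(condensed, method="single")
# no_plot=True returns the leaf ordering without drawing; drop it to draw the dendrogram.
dendro = hierarchy.dendrogram(Z, labels=feature_names, no_plot=True)
print(dendro["ivl"])
```

Features that merge early in the dendrogram are highly correlated, which makes a plot like this useful for spotting redundant features.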

plot_pca_var_explained(pca_transformer, figsize=(12, 6))

Plot the individual and cumulative variance explained by each PCA component.

Parameters:

Name Type Description Default
pca_transformer PCA

Fitted PCA transformer.

required
figsize tuple

Figure size.

(12, 6)
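The quantities behind such a plot are available on a fitted scikit-learn PCA object. The sketch below assumes pca_transformer is a sklearn.decomposition.PCA instance, which the documentation does not state explicitly. The bar-plus-step layout is likewise an assumption about the figure's appearance.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Five features with decreasing scale, so the leading components dominate.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca = PCA().fit(X)
individual = pca.explained_variance_ratio_  # fraction of variance per component
cumulative = np.cumsum(individual)  # running total, reaching 1.0 at the last component

components = np.arange(1, len(individual) + 1)
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(components, individual, alpha=0.6, label="individual")
ax.step(components, cumulative, where="mid", label="cumulative")
ax.set_xlabel("Principal component")
ax.set_ylabel("Fraction of variance explained")
ax.legend()
```

Reading the cumulative curve against a target such as 95% is a common way to choose how many components to keep.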