EDA
Exploratory Data Analysis (EDA) is the first step to complete before starting model development. This module contains common utilities for EDA on tabular data. Most of these functions cover plotting, data quality checks, and computing statistics on the data.
get_ecdf(a)
Compute the empirical cumulative distribution function of `a`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a | list, Array, or pd.Series | Array to compute ECDF on. | required |
Returns:
Name | Type | Description |
---|---|---|
x | array | Sorted version of the given array in ascending order. |
y | array | Cumulative probability of each value in the sorted array. |
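The computation described above can be sketched with NumPy alone. This is a minimal illustration, not the module's implementation: the name `ecdf_sketch` is hypothetical, and the convention that the i-th smallest of n values gets probability i / n is an assumption.

```python
import numpy as np


def ecdf_sketch(a):
    """Return the sorted values of `a` and their cumulative probabilities.

    Hypothetical stand-in for get_ecdf; assumes the i / n convention.
    """
    x = np.sort(np.asarray(a))
    # The i-th smallest value (1-based) has cumulative probability i / n.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf_sketch([3, 1, 2, 4])
# x holds 1, 2, 3, 4 and y holds 0.25, 0.5, 0.75, 1.0
```

Reading the result: `y[i]` is the fraction of observations less than or equal to `x[i]`.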
na_percentages(df, formatted=True)
Compute the percentage of missing values in `df` columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe to compute missing-value percentages for. | required |
formatted | bool | Whether to return a styled/formatted dataframe or raw percentages. | True |
Returns:
Name | Type | Description |
---|---|---|
res | Series or DataFrame | Percentages of missing values in each column. |
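As a rough illustration of the raw (unformatted) result, the per-column percentages can be computed directly with pandas. The helper name `na_percentages_sketch` is hypothetical, and the styled output returned when `formatted=True` is not reproduced here.

```python
import numpy as np
import pandas as pd


def na_percentages_sketch(df):
    """Percentage of missing values per column, as a raw Series.

    Hypothetical stand-in for the raw-percentages case only.
    """
    # isna() marks missing cells; the column mean is the missing fraction.
    return df.isna().mean() * 100


df = pd.DataFrame(
    {
        "age": [25, np.nan, 40, np.nan],
        "city": ["NY", "LA", None, "SF"],
    }
)
res = na_percentages_sketch(df)
# "age" has 2 of 4 values missing (50%), "city" has 1 of 4 (25%)
```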
plot_corr_matrix(df, method='pearson', figsize=(12, 6))
Plot a heatmap of the `method` correlation matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe to compute correlations on. | required |
method | str | Method of correlation. | 'pearson' |
figsize | tuple | Figure size. | (12, 6) |
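A comparable heatmap can be sketched with pandas and matplotlib. The name `corr_heatmap_sketch`, the color map, and the fixed color range of -1 to 1 are assumptions made for illustration; the module's own plot may look different.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def corr_heatmap_sketch(df, method="pearson", figsize=(12, 6)):
    """Draw a correlation heatmap for the numeric columns of `df`.

    Hypothetical stand-in for plot_corr_matrix.
    """
    corr = df.select_dtypes("number").corr(method=method)
    fig, ax = plt.subplots(figsize=figsize)
    # Fix the color range so the same color always means the same correlation.
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(im, ax=ax, label=f"{method} correlation")
    return corr, ax


rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + 0.1 * rng.normal(size=200),  # strongly tied to "a"
        "c": rng.normal(size=200),  # independent noise
    }
)
corr, ax = corr_heatmap_sketch(df)
```

Here `a` and `b` show up as a strongly correlated pair, while `c` stays near zero against both.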
plot_ecdf(a, xlabel='X')
Plot the empirical cumulative distribution of `a`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a | list, Array, or pd.Series | Array to compute ECDF on. | required |
xlabel | str | X-axis label of the plot. | "X" |
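A plot of this kind can be sketched as a scatter of sorted values against their cumulative probabilities. The name `plot_ecdf_sketch`, the marker style, and the y-axis label are assumptions; only the `xlabel` parameter comes from the signature above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np


def plot_ecdf_sketch(a, xlabel="X"):
    """Plot the empirical CDF of `a` as points.

    Hypothetical stand-in for plot_ecdf; assumes the i / n convention.
    """
    x = np.sort(np.asarray(a))
    y = np.arange(1, len(x) + 1) / len(x)
    fig, ax = plt.subplots()
    ax.plot(x, y, marker=".", linestyle="none")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("ECDF")
    return ax


rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=10, size=500)  # synthetic sample
ax = plot_ecdf_sketch(heights, xlabel="Height (cm)")
```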
plot_featurebased_hier_clustering(X, feature_names=None, linkage_method='single', figsize=(16, 12))
Plot feature-based hierarchical clustering based on the Spearman correlation matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray or DataFrame | Data to compute hierarchical clustering on. | required |
feature_names | ndarray, list, or None | Feature names to use as labels when plotting. | None |
linkage_method | str | Method for calculating the distance between clusters. | "single" |
figsize | tuple | Figure size. | (16, 12) |
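One common way to build such a plot is to turn the Spearman correlation matrix into a distance matrix and feed it to SciPy's hierarchical clustering. The sketch below assumes the distance 1 - |rho|, which may differ from what the module uses; the name `feature_clustering_sketch` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr


def feature_clustering_sketch(
    X, feature_names=None, linkage_method="single", figsize=(16, 12)
):
    """Cluster features by Spearman correlation and draw a dendrogram.

    Hypothetical stand-in; assumes distance = 1 - |rho| and 3+ features.
    """
    corr, _ = spearmanr(X)  # columns are treated as variables
    corr = (corr + corr.T) / 2  # enforce exact symmetry
    dist = np.clip(1 - np.abs(corr), 0, None)
    np.fill_diagonal(dist, 0.0)
    # linkage expects the condensed (upper-triangle) distance vector.
    Z = hierarchy.linkage(squareform(dist, checks=False), method=linkage_method)
    fig, ax = plt.subplots(figsize=figsize)
    dendro = hierarchy.dendrogram(
        Z, labels=feature_names, orientation="right", ax=ax
    )
    return Z, dendro


rng = np.random.default_rng(0)
f0 = rng.normal(size=200)
f2 = rng.normal(size=200)
X = np.column_stack(
    [
        f0,
        f0 + 0.05 * rng.normal(size=200),  # near-duplicate of f0
        f2,
        -f2 + 0.05 * rng.normal(size=200),  # near-mirror of f2
    ]
)
names = ["f0", "f1", "f2", "f3"]
Z, dendro = feature_clustering_sketch(X, feature_names=names)
```

Features that merge low in the dendrogram are highly correlated (positively or negatively) and are candidates for removal or grouping.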
plot_pca_var_explained(pca_transformer, figsize=(12, 6))
Plot the individual and cumulative variance explained by each PCA component.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pca_transformer | PCA | Fitted PCA transformer. | required |
figsize | tuple | Figure size. | (12, 6) |
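Assuming `pca_transformer` is a fitted scikit-learn `PCA` object, a plot like this can be sketched from its `explained_variance_ratio_` attribute. The name `pca_variance_sketch` and the choice of bars for individual variance with a step line for cumulative variance are illustrative assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def pca_variance_sketch(pca_transformer, figsize=(12, 6)):
    """Plot individual and cumulative explained variance ratios.

    Hypothetical stand-in for plot_pca_var_explained.
    """
    ratios = pca_transformer.explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    components = np.arange(1, len(ratios) + 1)
    fig, ax = plt.subplots(figsize=figsize)
    ax.bar(components, ratios, alpha=0.6, label="Individual")
    ax.step(components, cumulative, where="mid", label="Cumulative")
    ax.set_xlabel("Principal component")
    ax.set_ylabel("Explained variance ratio")
    ax.set_xticks(components)
    ax.legend()
    return ratios, cumulative


rng = np.random.default_rng(0)
# Synthetic data with very different scales per feature.
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
pca = PCA().fit(X)
ratios, cumulative = pca_variance_sketch(pca)
```

The cumulative line is useful for choosing how many components to keep, for example the smallest number that reaches a target share of the variance.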