EDA

Exploratory Data Analysis (EDA) is the first step to complete before starting model development. This module contains common utilities for EDA on tabular data. Most of these functions cover plotting, data quality checks, and computing summary statistics.

`get_ecdf(a)`

Compute the empirical cumulative distribution function (ECDF) of `a`.

Parameters:

Name	Type	Description	Default
`a`	`list, Array, or pd.Series`	Array to compute ECDF on.	required

Returns:

Name	Type	Description
`x`	`array`	Sorted version of given array in ascending order.
`y`	`array`	Cumulative probability of each value in the sorted array.
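
The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the module's actual implementation: `ecdf_sketch` is a hypothetical name, and the `i / n` convention for the cumulative probabilities is an assumption.

```python
import numpy as np


def ecdf_sketch(a):
    """Hypothetical stand-in for get_ecdf; the real implementation may differ."""
    # Sort the values in ascending order.
    x = np.sort(np.asarray(a))
    # Assumed convention: the i-th smallest value gets probability i / n.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf_sketch([3, 1, 2, 4])
print(x)  # sorted values: 1, 2, 3, 4
print(y)  # cumulative probabilities: 0.25, 0.5, 0.75, 1.0
```

Each `y[i]` is the fraction of observations less than or equal to `x[i]`, so the last value is always 1.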

`na_percentages(df, formatted=True)`

Compute the percentage of missing values in each column of `df`.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame to compute missing-value percentages for.	required
`formatted`	`bool`	Whether to return styled/formatted dataframe or raw percentages.	`True`

Returns:

Name	Type	Description
`res`	`Series or DataFrame`	Percentages of missing values in each column.
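
As a rough illustration of what the raw (unformatted) result might look like, the sketch below computes per-column missing percentages with plain pandas. The toy data, the column names, and the string formatting are assumptions for illustration only; the function's actual styled output is not shown in this page.

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical columns).
df = pd.DataFrame(
    {
        "age": [25, np.nan, 40, 31],
        "city": ["NY", None, None, "LA"],
        "score": [1.0, 2.0, 3.0, 4.0],
    }
)

# Fraction of missing entries per column, scaled to a percentage.
raw = df.isna().mean() * 100

# One possible human-readable rendering (assumed format).
pretty = raw.map("{:.1f}%".format)

print(raw)
print(pretty)
```

Here `age` has 25% missing values, `city` has 50%, and `score` has none.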

`plot_corr_matrix(df, method='pearson', figsize=(12, 6))`

Plot a heatmap of the correlation matrix of `df`, computed with the given `method`.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame to compute correlations on.	required
`method`	`str`	Correlation method.	`'pearson'`
`figsize`	`tuple`	Figure size.	`(12, 6)`
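
A correlation heatmap of this kind can be approximated with pandas and matplotlib, as in the sketch below. It is not the module's implementation: the synthetic data, the colormap, and the use of `imshow` are assumptions chosen to keep the example self-contained.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: column "d" is built to correlate strongly with "a".
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["d"] = 2 * df["a"] + rng.normal(scale=0.1, size=100)

# Pairwise correlation matrix using the chosen method.
corr = df.corr(method="pearson")

# Draw the matrix as a heatmap with a fixed [-1, 1] color scale.
fig, ax = plt.subplots(figsize=(12, 6))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="pearson correlation")
fig.savefig("corr_matrix.png")
```

Fixing the color scale to `[-1, 1]` keeps colors comparable across datasets, which is a design choice of this sketch rather than a documented behavior.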

`plot_ecdf(a, xlabel='X')`

Plot the empirical cumulative distribution function (ECDF) of `a`.

Parameters:

Name	Type	Description	Default
`a`	`list, Array, or pd.Series`	Array to compute ECDF on.	required
`xlabel`	`str`	Label for the x-axis of the plot.	`"X"`
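
The sketch below shows one way such a plot could be produced with matplotlib. The sample data, the marker style, and the y-axis label are assumptions for illustration; the module's own plot styling may differ.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 200 draws from an exponential distribution.
rng = np.random.default_rng(42)
a = rng.exponential(scale=2.0, size=200)

# ECDF: sorted values against their cumulative proportions.
x = np.sort(a)
y = np.arange(1, len(x) + 1) / len(x)

fig, ax = plt.subplots()
ax.plot(x, y, marker=".", linestyle="none")
ax.set_xlabel("Waiting time")  # plays the role of the xlabel argument
ax.set_ylabel("ECDF")
fig.savefig("ecdf.png")
```

Unlike a histogram, an ECDF plot needs no bin width, so every observation is visible.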

`plot_featurebased_hier_clustering(X, feature_names=None, linkage_method='single', figsize=(16, 12))`

Plot feature-based hierarchical clustering computed from the Spearman correlation matrix.

Parameters:

Name	Type	Description	Default
`X`	`ndarray \| DataFrame`	Data to compute hierarchical clustering.	required
`feature_names`	`ndarray \| list \| None`	Feature names to use as labels with plotting.	`None`
`linkage_method`	`str`	Method for calculating the distance between clusters.	`"single"`
`figsize`	`tuple`	Figure size.	`(16, 12)`
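
A common way to build this kind of plot is to turn Spearman correlations into distances and feed them to SciPy's hierarchical clustering. The sketch below follows that approach; the `1 - |correlation|` distance, the synthetic data, and the feature names are assumptions, and the module may compute distances differently.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Synthetic data: f1 tracks f0 closely, f2 and f3 are independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 2)),
])
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical labels

# Spearman rank correlation between all pairs of columns.
corr, _ = spearmanr(X)
corr = (corr + corr.T) / 2  # enforce exact symmetry
np.fill_diagonal(corr, 1.0)

# Assumed distance: strongly correlated features are close together.
distance = 1 - np.abs(corr)
Z = hierarchy.linkage(squareform(distance, checks=False), method="single")

fig, ax = plt.subplots(figsize=(16, 12))
dendro = hierarchy.dendrogram(Z, labels=feature_names, ax=ax, leaf_rotation=90)
fig.savefig("feature_clustering.png")
```

Features that merge low in the dendrogram are nearly redundant, which can guide feature selection.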

`plot_pca_var_explained(pca_transformer, figsize=(12, 6))`

Plot the individual and cumulative variance explained by each PCA component.

Parameters:

Name	Type	Description	Default
`pca_transformer`	`PCA`	Fitted PCA transformer.	required
`figsize`	`tuple`	Figure size.	`(12, 6)`
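
The quantities behind such a plot are available on a fitted scikit-learn `PCA` object. The sketch below plots them directly with matplotlib; the synthetic data and the bar-plus-step layout are assumptions and may not match the module's actual figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose columns have very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Fit PCA, keeping all components.
pca = PCA().fit(X)

# Per-component share of variance and its running total.
individual = pca.explained_variance_ratio_
cumulative = np.cumsum(individual)
components = np.arange(1, len(individual) + 1)

fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(components, individual, label="Individual")
ax.step(components, cumulative, where="mid", label="Cumulative")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.legend()
fig.savefig("pca_var_explained.png")
```

The cumulative curve is useful for choosing how many components to keep, for example the smallest number that reaches a target share of the variance.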