Data

Tabular data manipulation and splitting utilities.

This module provides advanced data splitting functionality for tabular datasets, with support for stratified splitting, group-aware splitting, and automatic validation of class distribution requirements.

Classes:

  • DataSplitter : class

    A configurable data splitter with support for train/validation/test splits, stratification, group-aware splitting, and automatic size adjustment.

Functions:

  • get_data_splits : function

    Convenience function for one-time data splitting operations.

  • _suggest_sizes : function

    Internal utility for adjusting split sizes to ensure stratification feasibility.
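
The internals of _suggest_sizes are not shown on this page; the sketch below (an illustration, not the actual implementation) captures the feasibility constraint such a utility has to respect: under stratification, each split should expect at least one sample of every class, so the smallest split fraction must be at least 1 divided by the count of the rarest class.

>>> import numpy as np
>>> y = np.array(['rare'] * 4 + ['common'] * 96)
>>> counts = np.unique(y, return_counts=True)[1]
>>> print(1 / counts.min())  # smallest feasible split fraction
0.25
>>> # A 70/15/15 split is infeasible here: a 15% split expects only
>>> # 0.15 * 4 = 0.6 rare samples, so sizes must be adjusted (e.g., 0.5/0.25/0.25).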

Notes

The module is designed to handle common challenges in machine learning data preparation, including:

  • Ensuring all classes have sufficient samples in each split
  • Automatic size adjustment when requested splits are infeasible
  • Group-aware splitting to prevent data leakage
  • Comprehensive validation with helpful error messages

The splitting functions are compatible with both NumPy arrays and pandas DataFrames, making them suitable for various data science workflows.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from cmn_ai.tabular.data import DataSplitter, get_data_splits
>>>
>>> # Create sample data
>>> X = pd.DataFrame({'feature1': range(100), 'feature2': range(100, 200)})
>>> y = np.random.choice(['A', 'B', 'C'], size=100)
>>>
>>> # Basic stratified split
>>> X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(
...     X, y, train_size=0.7, val_size=0.15, test_size=0.15
... )
>>>
>>> # Group-aware splitting
>>> groups = np.random.choice(['group1', 'group2', 'group3'], size=100)
>>> splits = get_data_splits(X, y, groups=groups, stratify=False)
>>>
>>> # Using DataSplitter for repeated operations
>>> # ((X1, y1) and (X2, y2) stand in for two datasets of your own)
>>> splitter = DataSplitter(train_size=0.8, val_size=0.1, test_size=0.1)
>>> splits1 = splitter.split(X1, y1)
>>> splits2 = splitter.split(X2, y2)

DataSplitter

Advanced data splitter for train/validation/test sets with stratification support.

This class provides a comprehensive solution for splitting tabular datasets with support for stratified splitting, group-aware splitting, automatic size adjustment, and validation of class distribution requirements. Designed for reusable splitting configurations across multiple datasets.

Parameters:

  • train_size (float, default: 0.7 ) –

    Fraction of data for training set (0 < train_size < 1).

  • val_size (float, default: 0.15 ) –

    Fraction of data for validation set (0 < val_size < 1).

  • test_size (float, default: 0.15 ) –

    Fraction of data for test set (0 < test_size < 1). train_size + val_size + test_size must equal 1.0.

  • stratify (bool, default: True ) –

    Whether to preserve class proportions in each split. Incompatible with group-aware splitting.

  • shuffle (bool, default: True ) –

    Whether to shuffle data before splitting.

  • random_state (int or None, default: None ) –

    Random seed for reproducible splits.

  • min_class_count_check (bool, default: True ) –

    Whether to validate that all classes have sufficient samples for the requested split sizes; when a requested split is infeasible, adjusted sizes are suggested automatically.

Attributes:

  • train_size (float) –

    Configured training set fraction.

  • val_size (float) –

    Configured validation set fraction.

  • test_size (float) –

    Configured test set fraction.

  • stratify (bool) –

    Configured stratification setting.

  • shuffle (bool) –

    Configured shuffle setting.

  • random_state (int or None) –

    Configured random state.

  • min_class_count_check (bool) –

    Configured class count validation setting.

Methods:

  • split

    Split data into train/validation/test sets with optional parameter overrides.

Notes

The DataSplitter is designed to handle several common challenges in ML data preparation:

  • Class imbalance: Automatically detects when requested split sizes would result in missing classes and suggests minimal adjustments.
  • Group leakage: Supports group-aware splitting to prevent data leakage when samples are not independent (e.g., time series, grouped data); a verification sketch follows this list.
  • Flexibility: Allows parameter overrides per split operation while maintaining default configuration.
  • Validation: Comprehensive input validation with informative error messages.
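
To make the group-leakage guarantee above concrete, the snippet below uses return_indices (documented under split) to verify that no group label appears in more than one split; this check is illustrative, not part of the library.

>>> import numpy as np
>>> import pandas as pd
>>> from cmn_ai.tabular.data import DataSplitter
>>> X = pd.DataFrame({'feature': range(100)})
>>> y = np.random.choice(['A', 'B'], size=100)
>>> groups = np.repeat(range(10), 10)  # 10 groups of 10 samples each
>>> splitter = DataSplitter(stratify=False, random_state=0)
>>> train_idx, val_idx, test_idx = splitter.split(
...     X, y, groups=groups, return_indices=True
... )
>>> set(groups[train_idx]).isdisjoint(groups[val_idx])
True
>>> set(groups[train_idx]).isdisjoint(groups[test_idx])
True
>>> set(groups[val_idx]).isdisjoint(groups[test_idx])
True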

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from cmn_ai.tabular.data import DataSplitter
>>>
>>> # Basic usage with default settings
>>> splitter = DataSplitter()
>>> X = pd.DataFrame({'feature': range(100)})
>>> y = np.random.choice(['A', 'B', 'C'], 100)
>>> X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)
>>> # Custom configuration for imbalanced data
>>> splitter = DataSplitter(
...     train_size=0.8, val_size=0.1, test_size=0.1,
...     stratify=True, random_state=42
... )
>>> splits = splitter.split(X, y)
>>> # Group-aware splitting for time series
>>> groups = np.repeat(range(10), 10)  # 10 groups of 10 samples each
>>> splits = splitter.split(X, y, groups=groups, stratify=False)
>>> # Parameter override for specific split
>>> splits = splitter.split(X, y, train_size=0.9, val_size=0.05, test_size=0.05)

__init__(train_size=0.7, val_size=0.15, test_size=0.15, stratify=True, shuffle=True, random_state=None, min_class_count_check=True)

Initialize DataSplitter with configuration for train/validation/test splitting.

Parameters:

  • train_size (float, default: 0.7 ) –

    Fraction of data allocated to training set. Must be in range (0, 1). Together with val_size and test_size, it must sum to 1.0.

  • val_size (float or None, default: 0.15 ) –

    Fraction of data allocated to validation set. Must be in range (0, 1). Can be set to None to auto-infer from train_size and test_size.

  • test_size (float or None, default: 0.15 ) –

    Fraction of data allocated to test set. Must be in range (0, 1). Can be set to None to auto-infer from train_size and val_size.

  • stratify (bool, default: True ) –

    Whether to preserve class distribution proportions across splits. When True, each split will maintain approximately the same class ratios as the original dataset (illustrated in the sketch after this parameter list). Not compatible with group-based splitting.

  • shuffle (bool, default: True ) –

    Whether to shuffle the data before splitting. Recommended for most use cases unless data order is important (e.g., time series).

  • random_state (int or None, default: None ) –

    Seed for random number generator to ensure reproducible splits. If None, splits will be different each time.

  • min_class_count_check (bool, default: True ) –

    Whether to validate that each class has sufficient samples for the requested split sizes. When True, will automatically suggest adjusted sizes if stratification would fail due to insufficient samples in rare classes.
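
As a quick check of the proportion preservation described for stratify above (a sketch; exact per-split counts depend on rounding inside the library):

>>> import numpy as np
>>> import pandas as pd
>>> from cmn_ai.tabular.data import DataSplitter
>>> X = pd.DataFrame({'feature': range(100)})
>>> y = np.array(['A'] * 60 + ['B'] * 40)
>>> splitter = DataSplitter(stratify=True, random_state=0)
>>> X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)
>>> ratio = (np.asarray(y_train) == 'A').mean()  # ~0.60, matching the full dataset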

Raises:

  • ValueError

    If train_size + val_size + test_size != 1.0 (after auto-inference).
    If any size parameter is not in range (0, 1).
    If stratify=True but the target has insufficient samples per class.

Examples:

>>> # Standard 70/15/15 split with stratification
>>> splitter = DataSplitter()
>>> # Custom proportions for small datasets
>>> splitter = DataSplitter(
...     train_size=0.6, val_size=0.2, test_size=0.2,
...     random_state=42
... )
>>> # Non-stratified splitting for regression or grouped data
>>> splitter = DataSplitter(stratify=False, shuffle=True)
>>> # Disable class count validation for advanced use cases
>>> splitter = DataSplitter(min_class_count_check=False)

split(X, y=None, *, train_size=None, val_size=None, test_size=None, stratify=None, shuffle=None, random_state=None, return_indices=False, groups=None, min_class_count_check=None)

Split data into train, validation, and test sets with optional parameter overrides.

This method applies the configured splitting strategy to the provided data, with options to override instance defaults for specific operations. Supports both stratified and group-aware splitting with automatic validation.

Parameters:

  • X (array-like of shape (n_samples, n_features)) –

    Feature matrix to split. Accepts NumPy arrays, pandas DataFrames, or other array-like structures with indexing support.

  • y (array-like of shape (n_samples,), default: None ) –

    Target vector for supervised learning. Required for stratified splitting. Accepts NumPy arrays, pandas Series, or lists.

  • train_size (float, default: None ) –

    Override instance train_size for this split operation. Must be in range (0, 1); together with val_size and test_size it must sum to 1.0.

  • val_size (float, default: None ) –

    Override instance val_size for this split operation. Can be None to auto-infer from other sizes.

  • test_size (float, default: None ) –

    Override instance test_size for this split operation. Can be None to auto-infer from other sizes.

  • stratify (bool, default: None ) –

    Override instance stratify setting. If True, maintains class proportions across splits. Cannot be used with groups parameter.

  • shuffle (bool, default: None ) –

    Override instance shuffle setting. Whether to shuffle data before splitting.

  • random_state (int, default: None ) –

    Override instance random_state for this operation. Ensures reproducible splits when provided.

  • return_indices (bool, default: False ) –

    If True, returns indices of split samples instead of the data itself. Useful for custom data handling or debugging.

  • groups (array-like of shape (n_samples,), default: None ) –

    Group labels for group-aware splitting. When provided, ensures samples from the same group don't appear in different splits. Not compatible with stratification; passing groups together with stratify=True raises ValueError.

  • min_class_count_check (bool, default: None ) –

    Override instance min_class_count_check setting. Whether to validate class distribution feasibility and suggest adjustments if needed.

Returns:

  • tuple of arrays or indices

    If return_indices=False (default):

      (X_train, X_val, X_test, y_train, y_val, y_test) if y is provided;
      (X_train, X_val, X_test) if y is None.

    If return_indices=True:

      (train_indices, val_indices, test_indices) as NumPy arrays.

Raises:

  • ValueError

    If split sizes don't sum to 1.0 after inference.
    If stratification is requested but infeasible due to the class distribution.
    If groups is provided together with stratify=True.
    If X and y have different lengths.

Warns:

  • UserWarning

    When classes barely meet the minimum sample requirements for stratification.

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from cmn_ai.tabular.data import DataSplitter
>>> splitter = DataSplitter(train_size=0.7, val_size=0.15, test_size=0.15)
>>> # Basic stratified split
>>> X = pd.DataFrame({'feature': range(100)})
>>> y = np.random.choice(['A', 'B', 'C'], 100)
>>> X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)
>>> # Override parameters for specific split
>>> X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(
...     X, y, train_size=0.8, val_size=0.1, test_size=0.1, random_state=42
... )
>>> # Group-aware splitting
>>> groups = np.repeat(range(10), 10)  # 10 groups of 10 samples each
>>> splits = splitter.split(X, y, groups=groups, stratify=False)
>>> # Return indices instead of data
>>> train_idx, val_idx, test_idx = splitter.split(
...     X, y, return_indices=True
... )
>>> # Unsupervised data (no y)
>>> X_train, X_val, X_test = splitter.split(X)

get_data_splits(X, y=None, *, train_size=0.7, val_size=0.15, test_size=0.15, stratify=True, shuffle=True, random_state=None, return_indices=False, groups=None, min_class_count_check=True)

Convenient one-call function for train/validation/test data splitting.

This function provides a simple interface for splitting datasets with advanced features including stratification, group-aware splitting, automatic class distribution validation, and size adjustment suggestions. For multiple split operations on different datasets, consider using DataSplitter directly for better performance.

Parameters:

  • X (array-like of shape (n_samples, n_features)) –

    Feature matrix to split. Supports NumPy arrays, pandas DataFrames, or any array-like structure with indexing capabilities.

  • y (array-like of shape (n_samples,), default: None ) –

    Target vector for supervised learning. Required when stratify=True. Supports NumPy arrays, pandas Series, or lists.

  • train_size (float, default: 0.7 ) –

    Proportion of data allocated to training set. Must be in range (0, 1). Together with val_size and test_size, it must sum to 1.0.

  • val_size (float or None, default: 0.15 ) –

    Proportion of data allocated to validation set. Must be in range (0, 1). Can be None to auto-infer from train_size and test_size.

  • test_size (float or None, default: 0.15 ) –

    Proportion of data allocated to test set. Must be in range (0, 1). Can be None to auto-infer from train_size and val_size.

  • stratify (bool, default: True ) –

    Whether to preserve class distribution proportions across all splits. When True, each split maintains approximately the same class ratios as the original dataset. Not compatible with group-aware splitting; set stratify=False when groups is provided.

  • shuffle (bool, default: True ) –

    Whether to shuffle data before splitting. Recommended for most use cases except when sample order is meaningful (e.g., time series data).

  • random_state (int or None, default: None ) –

    Seed for random number generator to ensure reproducible splits. If None, results will vary between calls.

  • return_indices (bool, default: False ) –

    If True, returns sample indices instead of actual data splits. Useful for custom data handling or when working with complex data structures.

  • groups (array-like of shape (n_samples,), default: None ) –

    Group labels for group-aware splitting. When provided, ensures that samples with the same group label don't appear across different splits, preventing data leakage in grouped data scenarios.

  • min_class_count_check (bool, default: True ) –

    Whether to validate class distribution feasibility for stratified splitting. When True, automatically suggests adjusted split sizes if any class has insufficient samples for the requested proportions.

Returns:

  • tuple of arrays or indices

    If return_indices=False (default):

      (X_train, X_val, X_test, y_train, y_val, y_test) if y is provided;
      (X_train, X_val, X_test) if y is None.

    If return_indices=True:

      (train_indices, val_indices, test_indices) as NumPy arrays.

Raises:

  • ValueError

    If train_size + val_size + test_size != 1.0 (after auto-inference).
    If any size parameter is not in range (0, 1).
    If stratify=True but there are insufficient samples per class for the requested splits.
    If groups is provided together with stratify=True.
    If X and y have mismatched lengths.

Warns:

  • UserWarning

    When classes barely meet the minimum requirements for stratified splitting.

See Also

DataSplitter : Class-based interface for repeated splitting operations.

sklearn.model_selection.train_test_split : Scikit-learn's two-way splitting function.

Notes

This function internally creates a DataSplitter instance and calls its split() method. For applications requiring multiple split operations with the same configuration, using DataSplitter directly is more efficient as it avoids repeated parameter validation and object instantiation.

The stratification algorithm ensures that rare classes are handled gracefully by automatically suggesting feasible split sizes when the requested proportions would result in empty classes in some splits.
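
Given the note above, a direct call to get_data_splits should match building a DataSplitter with the same configuration and calling split; the sketch below assumes that equivalence and that a fixed random_state makes splits reproducible:

>>> import numpy as np
>>> import pandas as pd
>>> from cmn_ai.tabular.data import DataSplitter, get_data_splits
>>> X = pd.DataFrame({'feature': range(100)})
>>> y = np.random.choice(['A', 'B'], size=100)
>>> splits_a = get_data_splits(X, y, random_state=0)
>>> splits_b = DataSplitter(random_state=0).split(X, y)
>>> all(np.array_equal(a, b) for a, b in zip(splits_a, splits_b))
True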

Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from cmn_ai.tabular.data import get_data_splits
>>> # Basic stratified splitting
>>> X = pd.DataFrame({'feature1': range(100), 'feature2': range(100, 200)})
>>> y = np.random.choice(['A', 'B', 'C'], size=100)
>>> X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(X, y)
>>> print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
Train: 70, Val: 15, Test: 15
>>> # Custom proportions with reproducible results
>>> splits = get_data_splits(
...     X, y, train_size=0.8, val_size=0.1, test_size=0.1, random_state=42
... )
>>> # Group-aware splitting for time series or clustered data
>>> groups = np.repeat(range(20), 5)  # 20 groups of 5 samples each
>>> X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(
...     X, y, groups=groups, stratify=False
... )
>>> # Unsupervised data splitting
>>> X_train, X_val, X_test = get_data_splits(X, shuffle=True, random_state=123)
>>> # Get indices for custom data handling
>>> train_idx, val_idx, test_idx = get_data_splits(
...     X, y, return_indices=True, random_state=456
... )
>>> # Handle imbalanced data with automatic size adjustment
>>> y_imbalanced = np.array(['rare'] * 2 + ['common'] * 98)
>>> try:
...     splits = get_data_splits(X, y_imbalanced)
... except ValueError as e:
...     print("Automatic suggestion provided:", str(e))