Data
Data utilities for PyTorch-based machine learning workflows.
This module provides utilities for working with data in PyTorch-based machine learning pipelines. It includes functions and classes for creating and managing DataLoaders, moving tensors and collections between devices, collating and preprocessing data, discovering data files on the file system, splitting and labeling datasets, and organizing data items in containers.
The module is designed to work seamlessly with PyTorch's Dataset and DataLoader classes, as well as Hugging Face datasets.
Functions:
- get_dls : Create training and validation DataLoaders.
- to_device : Copy tensor(s) to a specified device.
- to_cpu : Copy tensor(s) to the CPU.
- collate_dict : Create a collate function for a Hugging Face Dataset dictionary.
- collate_device : Create a collate function that moves the batch to a specified device.
- compose : Apply transformations in sequence to an input.
- get_files : Get filenames in a path with specified extensions.
- random_splitter : Randomly split items with a specified probability.
- grandparent_splitter : Split items based on directory structure.
- split_by_func : Split items into train/valid lists using a function.
- parent_labeler : Label a file based on its parent directory.
- label_by_func : Label split data using a labeling function.

Classes:
- DataLoaders : Container for training and validation DataLoaders.
- ListContainer : Extended list with improved representation.
- ItemList : Base class for all types of datasets.
- SplitData : Split an ItemList into train and validation data lists.
- LabeledData : Create labeled data with input and target ItemLists.
Notes
This module is part of the cmn_ai library and provides high-level abstractions for common data processing tasks in machine learning workflows.
Examples:
>>> from cmn_ai.utils.data import get_files, to_device, compose, SplitData, LabeledData
>>> from cmn_ai.utils.data import ItemList, grandparent_splitter, parent_labeler
>>> # File operations
>>> files = get_files('./data', extensions=['.txt', '.csv'])
>>> # Device operations
>>> import torch
>>> tensor = torch.randn(3, 3)
>>> tensor_on_gpu = to_device(tensor, 'cuda')
>>> # Function composition
>>> def add_one(x): return x + 1
>>> def multiply_two(x): return x * 2
>>> result = compose(5, [add_one, multiply_two]) # Returns 12
>>> # Data splitting
>>> items = ItemList(['train/cat/1.jpg', 'valid/dog/2.jpg', 'train/cat/3.jpg'])
>>> split_data = SplitData.split_by_func(items, grandparent_splitter)
>>> print(f"Train: {len(split_data.train)}, Valid: {len(split_data.valid)}")
>>> # Labeled data
>>> x_items = ItemList(['image1.jpg', 'image2.jpg'])
>>> y_items = ItemList(['cat', 'dog'])
>>> labeled_data = LabeledData(x_items, y_items)
>>> x, y = labeled_data[0] # Returns ('image1.jpg', 'cat')
>>> # Labeling with function
>>> items = ItemList(['data/cat/1.jpg', 'data/dog/2.jpg'])
>>> labeled = LabeledData.label_by_func(items, parent_labeler)
>>> print(labeled.y[0], labeled.y[1]) # 'cat' 'dog'
DataLoaders
Container for training and validation DataLoaders.
A convenience class that holds training and validation DataLoaders and provides easy access to them.
Attributes:
- train (DataLoader) – Training DataLoader.
- valid (DataLoader) – Validation DataLoader.
Examples:
>>> train_dl = DataLoader(train_ds, batch_size=32)
>>> valid_dl = DataLoader(valid_ds, batch_size=32)
>>> dls = DataLoaders(train_dl, valid_dl)
>>> dls.train # Access training DataLoader
>>> dls.valid # Access validation DataLoader
__init__(*dls)
Initialize DataLoaders with training and validation DataLoaders.
Parameters:
- *dls (DataLoader, default: ()) – List of DataLoaders. The first is assumed to be the training DataLoader and the second the validation DataLoader.
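A minimal usage sketch (the (xb, yb) unpacking assumes the underlying datasets yield input/target pairs):

Examples:
>>> dls = DataLoaders(train_dl, valid_dl)
>>> for xb, yb in dls.train:
...     pass  # each iteration yields one training batch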
from_dd(dd, batch_size=32, **kwargs)
classmethod
Create DataLoaders from Hugging Face Dataset dictionary.
Parameters:
- dd (DatasetDict) – Hugging Face Dataset dictionary. Must have at least two datasets: train and valid/test datasets.
- batch_size (int, default: 32) – Batch size passed to DataLoader.
- **kwargs (dict, default: {}) – Additional keyword arguments passed to DataLoader.

Returns:
- DataLoaders – DataLoaders instance with train and validation DataLoaders.
Examples:
>>> from datasets import DatasetDict
>>> dd = DatasetDict({'train': train_ds, 'validation': valid_ds})
>>> dls = DataLoaders.from_dd(dd, batch_size=32)
ItemList
Bases: ListContainer
Base class for all types of datasets such as image, text, etc.
A container class that holds items and provides functionality for applying transformations and retrieving items with transformations applied.
Attributes:
- path (Path) – Path of the items that were used to create the list.
- tfms (callable, optional) – Transformations to apply on items before returning them.
Examples:
>>> items = ItemList(['file1.jpg', 'file2.jpg'], path='./data')
>>> items[0] # Returns transformed item
__getitem__(idx)
Get item(s) with transformations applied.
Parameters:
- idx (int or slice) – Index or slice to retrieve.

Returns:
- Any or list of Any – Item(s) with transformations applied.
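A quick illustration (a sketch assuming tfms is a single callable applied to each item on access):

Examples:
>>> items = ItemList([1, 2, 3], tfms=lambda x: x * 10)
>>> items[0]   # transformation applied on access
10
>>> items[:2]  # a slice returns the transformed items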
__init__(items, path='.', tfms=None, **kwargs)
Initialize ItemList with items, path, and transformations.
Parameters:
- items (Sequence) – Items to create the list from.
- path (str or Path, default: ".") – Path of the items that were used to create the list.
- tfms (callable, default: None) – Transformations to apply on items before returning them.
- **kwargs (dict, default: {}) – Additional attributes to set on the instance.
get(item)
Get item without transformations.
Every class that inherits from ItemList has to override this method to provide custom item retrieval logic.

Parameters:
- item (Any) – Item to retrieve.

Returns:
- Any – Retrieved item.
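For instance, an image dataset might override get to load the file only when accessed (an illustrative sketch; ImageList and the PIL-based loading are hypothetical, not part of this module):

Examples:
>>> from PIL import Image
>>> class ImageList(ItemList):
...     def get(self, item):
...         # item is a file path; open the image lazily on access
...         return Image.open(item)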
LabeledData
Create labeled data with input and target ItemLists.
A container class that holds input (x) and target (y) ItemLists and provides functionality for processing and retrieving labeled data.
Attributes:
- x (ItemList) – Input items to the model.
- y (ItemList) – Label items.
- proc_x (Processor or iterable of Processor, optional) – Input items processor(s).
- proc_y (Processor or iterable of Processor, optional) – Label items processor(s).
Examples:
>>> x_items = ItemList(['image1.jpg', 'image2.jpg'])
>>> y_items = ItemList(['cat', 'dog'])
>>> labeled_data = LabeledData(x_items, y_items)
>>> x, y = labeled_data[0] # Get first labeled example
__getitem__(idx)
Get labeled example(s) at index.
Parameters:
- idx (int or slice) – Index or slice to retrieve.

Returns:
- tuple or list of tuple – Labeled example(s) as (x, y) pairs.
__init__(x, y, proc_x=None, proc_y=None)
Initialize LabeledData with input and target ItemLists.
Parameters:
- x (ItemList) – Input items to the model.
- y (ItemList) – Label items.
- proc_x (Processor or iterable of Processor, default: None) – Input items processor(s).
- proc_y (Processor or iterable of Processor, default: None) – Label items processor(s).
__len__()
Return the number of labeled examples.
label_by_func(item_list, label_func, proc_x=None, proc_y=None)
classmethod
Label an ItemList using a labeling function.
Parameters:
- item_list (ItemList) – The ItemList to be labeled.
- label_func (callable) – The function to be used for labeling.
- proc_x (callable, default: None) – The processor to be applied to the input data.
- proc_y (callable, default: None) – The processor to be applied to the label data.

Returns:
- LabeledData – The labeled ItemList.
Examples:
>>> items = ItemList(['image1.jpg', 'image2.jpg'])
>>> labeled = LabeledData.label_by_func(items, parent_labeler)
process(item_list, proc)
Apply the given processor(s) to an ItemList.
x_obj(idx)
Get input object at index after deprocessing.
Parameters:
- idx (int) – Index of the input object to retrieve.

Returns:
- Any – The input object at index idx after reversing all processors in proc_x (deprocessing).
y_obj(idx)
Get label object at index after deprocessing.
Parameters:
- idx (int) – Index of the label object to retrieve.

Returns:
- Any – The label object at index idx after reversing all processors in proc_y (deprocessing).
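As an illustration, with a label processor that encodes classes as integers, y_obj recovers the original label (a sketch; CategoryProcessor is a hypothetical processor with a deprocess method):

Examples:
>>> labeled = LabeledData(x_items, y_items, proc_y=CategoryProcessor())
>>> labeled.y[0]      # processed label, e.g. 0
>>> labeled.y_obj(0)  # original label, e.g. 'cat'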
ListContainer
Bases: UserList
Extended list with improved representation.
Extends the built-in list by changing how the list is created from the given items and by changing repr to show the first 10 items along with the total number of items and the class name. This is the base class from which other containers inherit.
Attributes:
- data (list) – The underlying list data.
Examples:
>>> container = ListContainer([1, 2, 3, 4, 5])
>>> print(container)
ListContainer: (5 items)
[1, 2, 3, 4, 5]
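With more than 10 items, the representation is truncated (a sketch; the exact output format may differ):
>>> big = ListContainer(range(100))
>>> print(big)  # shows the class name, the total count, and only the first 10 items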
__init__(items)
Initialize ListContainer with items.
Parameters:
- items (Any) – Items to create the list from.
SplitData
Split ItemList into train and validation data lists.
A container class that holds training and validation ItemLists and provides functionality for creating DataLoaders from them.
Attributes:
- train (ItemList) – Training items.
- valid (ItemList) – Validation items.
Examples:
>>> train_items = ItemList(['train1.jpg', 'train2.jpg'])
>>> valid_items = ItemList(['valid1.jpg', 'valid2.jpg'])
>>> split_data = SplitData(train_items, valid_items)
>>> train_dl, valid_dl = split_data.to_dls(batch_size=32)
__getattr__(k)
Delegate attribute access to training ItemList.
__init__(train, valid)
Initialize SplitData with training and validation ItemLists.
__setstate__(data)
Set state when unpickling.
split_by_func(item_list, split_func)
classmethod
Split ItemList by splitter function.
Parameters:
- item_list (ItemList) – ItemList to split.
- split_func (callable) – Function to use for splitting items.

Returns:
- SplitData – SplitData object with train and validation ItemLists.
Examples:
>>> items = ItemList(['train/cat/1.jpg', 'val/dog/2.jpg'])
>>> split_data = SplitData.split_by_func(items, grandparent_splitter)
to_dls(batch_size=32, **kwargs)
Create training and validation DataLoaders.
Parameters:
- batch_size (int, default: 32) – Batch size for the DataLoaders.
- **kwargs (dict, default: {}) – Additional keyword arguments passed to DataLoader.

Returns:
- tuple[DataLoader, DataLoader] – Training and validation DataLoaders.
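A usage sketch (num_workers stands in for any DataLoader keyword argument):

Examples:
>>> split_data = SplitData(train_items, valid_items)
>>> train_dl, valid_dl = split_data.to_dls(batch_size=64, num_workers=2)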
collate_device(device)
Create a collate function that moves batch to specified device.
Parameters:
- device (str or device) – Device to copy the batch to.

Returns:
- callable – Collate function that returns the batch on the specified device.
Examples:
>>> collate_fn = collate_device('cuda')
>>> dataloader = DataLoader(dataset, collate_fn=collate_fn)
collate_dict(*keys)
Create a collate function for a Dataset dictionary.
Creates a collate function that extracts specified keys from a batch and applies PyTorch's default collate function.
Parameters:
- *keys (str, default: ()) – Keys to extract from the batch dictionary.

Returns:
- callable – Collate function that returns a tuple of collated inputs.
Examples:
>>> collate_fn = collate_dict('input_ids', 'attention_mask', 'labels')
>>> dataloader = DataLoader(dataset, collate_fn=collate_fn)
compose(x, funcs, *args, order='order', **kwargs)
Apply transformations in sequence to input.
Applies the transformations in funcs to the input x in the specified order. Functions are sorted by their order attribute if present.
Parameters:
- x (Any) – Input to transform.
- funcs (callable or iterable of callables) – Function(s) to apply to the input.
- *args (tuple, default: ()) – Positional arguments passed to each function.
- order (str, default: 'order') – Attribute name used to determine function order.
- **kwargs (dict, default: {}) – Keyword arguments passed to each function.

Returns:
- Any – Transformed input.
Examples:
>>> def add_one(x): return x + 1
>>> def multiply_two(x): return x * 2
>>> result = compose(5, [add_one, multiply_two]) # Returns 12
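Because functions are sorted by the order attribute, execution order can be controlled explicitly (a sketch assuming ascending sort on that attribute):
>>> def shift(x): return x + 1
>>> shift.order = 0
>>> def scale(x): return x * 2
>>> scale.order = 1
>>> compose(5, [scale, shift])  # shift runs first: (5 + 1) * 2
12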
get_dls(train_ds, valid_ds, batch_size, **kwargs)
Create training and validation DataLoaders.
Creates two DataLoaders: one for training and one for validation. The validation DataLoader has twice the batch size and doesn't shuffle data.
Parameters:
- train_ds (Dataset) – Training dataset.
- valid_ds (Dataset) – Validation dataset.
- batch_size (int) – Batch size for the training DataLoader. The validation DataLoader will use batch_size * 2.
- **kwargs (dict, default: {}) – Additional keyword arguments passed to the DataLoader constructor.

Returns:
- tuple[DataLoader, DataLoader] – A tuple containing (train_dataloader, valid_dataloader).
Examples:
>>> train_ds = MyDataset(train_data)
>>> valid_ds = MyDataset(valid_data)
>>> train_dl, valid_dl = get_dls(train_ds, valid_ds, batch_size=32)
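Based on the description above, this is roughly equivalent to the following (a sketch; shuffling the training set is an assumption consistent with the validation loader not shuffling):
>>> train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
>>> valid_dl = DataLoader(valid_ds, batch_size=64, shuffle=False)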
get_files(path, extensions=None, include=None, recurse=False)
Get filenames in path with specified extensions.
Get filenames under path that have a suffix in extensions, optionally recursing into subdirectories.
Parameters:
- path (str or Path) – Path of the root directory to search for files.
- extensions (str or iterable of str, default: None) – Suffixes of filenames to look for (with or without the dot).
- include (iterable of str, default: None) – Top-level directory(ies) under path to use for searching files.
- recurse (bool, default: False) – Whether to search subdirectories recursively.

Returns:
- list of Path – List of file paths that end with the specified extensions under path.
Examples:
>>> files = get_files('./data', extensions=['.jpg', '.png'])
>>> files = get_files('./data', extensions='.txt', recurse=True)
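The include parameter restricts the search to specific top-level directories (a sketch combining the documented parameters):
>>> files = get_files('./data', extensions='.jpg', include=['train'], recurse=True)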
grandparent_splitter(f_name, valid_name='valid', train_name='train')
Split items based on directory structure.
Split items based on whether they fall under validation or training directories. This assumes that the directory structure is train/label/items or valid/label/items.
Parameters:
- f_name (str or Path) – Item's filename.
- valid_name (str, default: "valid") – Name of the directory that holds the validation items.
- train_name (str, default: "train") – Name of the directory that holds the training items.

Returns:
- bool or None – True if the item is in validation, False if in training, None if neither.
Examples:
>>> splitter = lambda f: grandparent_splitter(f, 'val', 'train')
>>> is_valid = splitter('train/cat/image.jpg') # Returns False
>>> is_valid = splitter('val/dog/image.jpg') # Returns True
label_by_func(splitted_data, label_func, proc_x=None, proc_y=None)
Label split data using a labeling function.
Parameters:
- splitted_data (SplitData) – The split data to be labeled.
- label_func (callable) – The function to be used for labeling.
- proc_x (callable, default: None) – The processor to be applied to the input data.
- proc_y (callable, default: None) – The processor to be applied to the label data.

Returns:
- SplitData – The labeled split data.
Examples:
>>> split_data = SplitData(train_items, valid_items)
>>> labeled_split = label_by_func(split_data, parent_labeler)
parent_labeler(f_name)
Label a file based on its parent directory.
Parameters:
- f_name (str or Path) – Filename whose parent directory provides the label.

Returns:
- str – Name of the parent directory.
Examples:
>>> label = parent_labeler('data/cat/image.jpg') # Returns 'cat'
>>> label = parent_labeler('data/dog/image.png') # Returns 'dog'
random_splitter(f_name, p_valid=0.2)
Randomly split items with specified probability.
Randomly assigns each item to the validation set with probability p_valid.

Parameters:
- f_name (str) – Item's filename. Not used here, but kept for API consistency with other splitters.
- p_valid (float, default: 0.2) – Probability of the item being in the validation set.

Returns:
- bool – True if the item is in validation, False otherwise (training).
Examples:
>>> splitter = lambda f: random_splitter(f, p_valid=0.3)
>>> is_valid = splitter('file.jpg') # Returns True or False
split_by_func(items, func)
Split items into train/valid lists using a function.
Parameters:
- items (Iterable) – Items to be split into train/valid.
- func (callable) – Split function. Should return True for validation items, False for training items, and None to exclude items.

Returns:
- tuple[list, list] – Train and valid item lists.
Examples:
>>> files = ['train/cat/1.jpg', 'val/dog/2.jpg', 'train/cat/3.jpg']
>>> train, valid = split_by_func(files, grandparent_splitter)
to_cpu(x)
Copy tensor(s) to CPU.
If a tensor is already on the CPU, returns the tensor itself; otherwise, returns a copy of the tensor on CPU.
Parameters:
- x (Tensor or Iterable[Tensor] or Mapping[str, Tensor]) – Tensor or collection of tensors to move to the CPU.

Returns:
- Tensor or Iterable[Tensor] or Mapping[str, Tensor] – Copied tensor(s) on the CPU.
Examples:
>>> tensor = torch.randn(3, 3, device='cuda')
>>> tensor_on_cpu = to_cpu(tensor)
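Collections of tensors are handled as well, per the signature above:
>>> batch = {'input': tensor, 'target': tensor}
>>> batch_on_cpu = to_cpu(batch)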
to_device(x, device=DEFAULT_DEVICE)
Copy tensor(s) to specified device.
Recursively moves tensors and collections of tensors to the specified device. If a tensor is already on the target device, returns the tensor itself.
Parameters:
- x (Tensor or Iterable[Tensor] or Mapping[str, Tensor]) – Tensor or collection of tensors to move to the device.
- device (str or device, default: 'cuda' if available else 'cpu') – Device to copy the tensor(s) to.

Returns:
- Tensor or Iterable[Tensor] or Mapping[str, Tensor] – Copied tensor(s) on the specified device.

Notes
This function may fail if iterables contain non-tensor objects that don't have a .to() method.
Examples:
>>> tensor = torch.randn(3, 3)
>>> tensor_on_gpu = to_device(tensor, 'cuda')
>>> batch = {'input': tensor, 'target': tensor}
>>> batch_on_gpu = to_device(batch, 'cuda')