Data

This module collects most of the utilities for working with data in general and with PyTorch data structures in particular. It covers everything from composing transforms and creating train/valid DataLoaders to splitting and labeling datasets.

DataLoaders

Create train/valid DataLoaders.

__init__(*dls)

Parameters:

Name Type Description Default
dls

List of DataLoaders.

()

from_dd(dd, batch_size, **kwargs) classmethod

Create train/valid data loaders from a Hugging Face (HF) DatasetDict.

Parameters:

Name Type Description Default
dd DatasetDict

HF Dataset dictionary. Must contain at least two datasets: a train dataset and a valid/test dataset.

required
batch_size int

batch size passed to DataLoader.

required

Returns:

Type Description
tuple[DataLoader]

Train/valid data loaders.

ItemList

Bases: ListContainer

Base class for all types of datasets such as image, text, etc.

__init__(items, path='.', tfms=None, **kwargs)

Parameters:

Name Type Description Default
items Sequence

Items to create list.

required
path str | Path

Path of the items that were used to create the list.

"."
tfms Callable | None

Transformations to apply on items before returning them.

None

get(item)

Every class that inherits from ItemList has to override this method.

new(items, cls=None)

Create a new instance of the ItemList with items.

Parameters:

Name Type Description Default
items Sequence

The items to create the list from.

required
cls ItemList | None

The class to instantiate. If None, the same class will be used.

None

Returns:

Type Description
ItemList

The new instance of the ItemList.
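The relationship between get, tfms, and new can be illustrated with a small stand-in class. This is a minimal sketch under stated assumptions, not the library's implementation: SketchItemList, UpperCaseList, and the _load helper are hypothetical names, and it assumes that indexing calls get on the stored item and then applies tfms.

```python
from collections import UserList
from pathlib import Path


class SketchItemList(UserList):
    """Toy stand-in for ItemList (hypothetical, for illustration only)."""

    def __init__(self, items, path=".", tfms=None):
        super().__init__(list(items))
        self.path = Path(path)
        self.tfms = tfms

    def get(self, item):
        # Subclasses override this to turn a stored item into an object.
        return item

    def new(self, items, cls=None):
        # Reuse the current class, path, and transforms unless cls is given.
        cls = type(self) if cls is None else cls
        return cls(items, self.path, tfms=self.tfms)

    def _load(self, item):
        # Assumption: get runs first, then the optional transform.
        obj = self.get(item)
        return self.tfms(obj) if self.tfms is not None else obj

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return [self._load(o) for o in self.data[idx]]
        return self._load(self.data[idx])


class UpperCaseList(SketchItemList):
    def get(self, item):
        return item.upper()


words = UpperCaseList(["a", "b", "c"], tfms=lambda s: s + "!")
first = words[0]
subset = words.new(["x", "y"])
```

Here subset keeps the class and the transform of words, which is the behavior the description of new suggests when cls is None.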

LabeledData

Create labeled data and expose both x and y as item lists after passing them through all processors.

__init__(x, y, proc_x=None, proc_y=None)

Parameters:

Name Type Description Default
x ItemList

Input items to the model.

required
y ItemList

Label items.

required
proc_x Processor | Iterable[Processor] | None

Input items processor(s).

None
proc_y Processor | Iterable[Processor] | None

Label items processor(s).

None

label_by_func(item_list, label_func, proc_x=None, proc_y=None) classmethod

Label an ItemList using a labeling function.

Parameters:

Name Type Description Default
item_list ItemList

The ItemList to be labeled.

required
label_func Callable

The function to be used for labeling.

required
proc_x Callable | None

The processor to be applied to the input data.

None
proc_y Callable | None

The processor to be applied to the label data.

None

Returns:

Type Description
LabeledData

The labeled ItemList.

process(item_list, proc)

Applies processors to an ItemList.

Parameters:

Name Type Description Default
item_list ItemList

The ItemList to process.

required
proc Processor | Iterable[Processor]

The processor or list of processors to apply.

required

Returns:

Type Description
ItemList

The processed ItemList.

x_obj(idx)

Returns the input object at index idx after applying all processors in proc_x.

Parameters:

Name Type Description Default
idx int

Index of the input object to retrieve.

required

Returns:

Type Description
Any

The input object at index idx after applying all processors in proc_x.

y_obj(idx)

Returns the label object at index idx after applying all processors in proc_y.

Parameters:

Name Type Description Default
idx int

Index of the label object to retrieve.

required

Returns:

Type Description
Any

The label object at index idx after applying all processors in proc_y.

ListContainer

Bases: UserList

Extends the built-in list by changing how the list is created from the given items and by changing repr to show the first 10 items along with the total number of items and the class name. This is the base class that other containers inherit from.

__init__(items)

Parameters:

Name Type Description Default
items Any

Items to create list from.

required
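A container of this kind can be sketched on top of collections.UserList. The following is only an illustration: SketchListContainer is a hypothetical name, the exact repr layout is an assumption, and so is the rule that a single non-iterable item (or a string) is wrapped in a one-element list.

```python
from collections import UserList


class SketchListContainer(UserList):
    """Toy stand-in for ListContainer; the real repr format may differ."""

    def __init__(self, items):
        # Assumption: a lone item or a string becomes a one-element list.
        if isinstance(items, (str, bytes)) or not hasattr(items, "__iter__"):
            items = [items]
        super().__init__(list(items))

    def __repr__(self):
        # Show at most the first 10 items, plus the class name and the total.
        shown = ", ".join(repr(o) for o in self.data[:10])
        suffix = ", ..." if len(self.data) > 10 else ""
        return f"{type(self).__name__}: ({len(self.data)} items)\n[{shown}{suffix}]"


numbers = SketchListContainer(range(25))
print(numbers)
```

Truncating the repr keeps interactive sessions readable when the container holds thousands of filenames.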

SplitData

Split an ItemList into train and validation item lists.

__init__(train, valid)

Parameters:

Name Type Description Default
train ItemList

Training items.

required
valid ItemList

Validation items.

required

split_by_func(item_list, split_func) classmethod

Split an item list using a splitter function and return a SplitData object.

to_dls(batch_size=32, **kwargs)

Returns a tuple of training and validation DataLoaders using train and valid datasets.

collate_device(device)

Collate inputs from a batch and copy them to device.

Parameters:

Name Type Description Default
device str | device

Device to copy batch to.

required

Returns:

Type Description
Callable

Wrapper function that returns tuple of collated inputs.

collate_dict(ds)

Collate inputs from an HF Dataset dictionary and return the inputs after applying PyTorch's default collate function.

Parameters:

Name Type Description Default
ds DatasetDict

HF Dataset dictionary.

required

Returns:

Type Description
Callable

Wrapper function that returns tuple of collated inputs.

compose(x, funcs, *args, order='order', **kwargs)

Applies transformations in funcs to the input x in order.
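One plausible reading of the order parameter is that it names an attribute on each callable that serves as a sort key. The sketch below follows that reading. It is an assumption, not confirmed by this page: compose_sketch is a hypothetical name, and callables that lack the attribute are assumed to sort as 0.

```python
def compose_sketch(x, funcs, *args, order="order", **kwargs):
    """Apply each callable in funcs to x, lowest `order` attribute first."""
    if callable(funcs):
        funcs = [funcs]
    # Assumption: callables without the attribute are treated as order 0.
    for func in sorted(funcs, key=lambda f: getattr(f, order, 0)):
        x = func(x, *args, **kwargs)
    return x


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Function attributes stand in for whatever carries the ordering.
add_one.order = 1
double.order = 0

# double runs first (order 0), then add_one (order 1).
result = compose_sketch(3, [add_one, double])
```

Sorting by an attribute lets transforms declare their own position, so callers can pass them in any order.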

get_dls(train_ds, valid_ds, batch_size, **kwargs)

Returns two dataloaders: one for training and one for validation. The validation dataloader has twice the batch size and doesn't shuffle data.

get_files(path, extensions=None, include=None, recurse=False)

Get the filenames under path whose extension is in extensions, optionally recursing into subdirectories.

Parameters:

Name Type Description Default
path str | Path

Path for the root directory to search for files.

required
extensions str | Iterable[str] | None

Suffixes of filenames to look for.

None
include Iterable[str] | None

Top-level directory or directories under path in which to search for files.

None
recurse bool

Whether to search subdirectories recursively.

False

Returns:

Type Description
list[str]

List of filenames under path that end with extensions.
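The parameters above can be illustrated with a standard-library sketch. It is not the library's implementation: get_files_sketch is a hypothetical name. Case-insensitive suffix matching, sorted output, and applying include only when recurse is True are all assumptions.

```python
import os
import tempfile
from pathlib import Path


def get_files_sketch(path, extensions=None, include=None, recurse=False):
    """Collect files under path, optionally filtered by suffix (toy version)."""
    path = Path(path)
    if isinstance(extensions, str):
        extensions = [extensions]
    # None means "keep every file"; suffixes are compared case-insensitively.
    wanted = None if extensions is None else {e.lower() for e in extensions}

    def keep(name):
        return wanted is None or Path(name).suffix.lower() in wanted

    if not recurse:
        return sorted(str(p) for p in path.iterdir() if p.is_file() and keep(p.name))

    found = []
    for root, dirs, files in os.walk(path):
        if include is not None and Path(root) == path:
            # Prune the walk to the requested top-level directories.
            dirs[:] = [d for d in dirs if d in include]
        found.extend(str(Path(root) / f) for f in files if keep(f))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["train/cat/a.jpg", "train/dog/b.png", "valid/cat/c.jpg", "notes.txt"]:
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()
    jpg_names = [Path(f).name for f in get_files_sketch(root, ".jpg", recurse=True)]
    train_rel = [
        Path(f).relative_to(root).as_posix()
        for f in get_files_sketch(root, include=["train"], recurse=True)
    ]
    top_level = [Path(f).name for f in get_files_sketch(root)]
```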

grandparent_splitter(f_name, valid_name='valid', train_name='train')

Split items based on whether they fall under validation or training directories. This assumes that the directory structure is train/label/items or valid/label/items.

Parameters:

Name Type Description Default
f_name str | Path

Item's filename.

required
valid_name str

Name of the directory that holds the validation items.

"valid"
train_name str

Name of the directory that holds the training items.

"train"

Returns:

Type Description
bool | None

True if the item is in validation else False (training). If neither, returns None.
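For the train/label/items layout described above, the decision depends only on the name of the item's grandparent directory. A minimal sketch, with the hypothetical name grandparent_splitter_sketch, could look like this:

```python
from pathlib import Path


def grandparent_splitter_sketch(f_name, valid_name="valid", train_name="train"):
    """Return True for validation, False for training, None for anything else."""
    grandparent = Path(f_name).parent.parent.name
    if grandparent == valid_name:
        return True
    if grandparent == train_name:
        return False
    return None


in_valid = grandparent_splitter_sketch("data/valid/cat/001.jpg")
in_train = grandparent_splitter_sketch("data/train/dog/002.jpg")
elsewhere = grandparent_splitter_sketch("data/extra/dog/003.jpg")
```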

label_by_func(splitted_data, label_func, proc_x=None, proc_y=None)

Label split data using label_func.

Parameters:

Name Type Description Default
splitted_data SplitData

The split data to be labeled.

required
label_func Callable

The function to be used for labeling.

required
proc_x Callable | None

The processor to be applied to the input data.

None
proc_y Callable | None

The processor to be applied to the label data.

None

Returns:

Type Description
SplitData

The labeled split data.

parent_labeler(f_name)

Label a file based on its parent directory.

Parameters:

Name Type Description Default
f_name str | Path

Filename whose parent directory is used as the label.

required

Returns:

Type Description
str

Name of the parent directory.
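With pathlib, this kind of labeling is a one-liner. The sketch below uses the hypothetical name parent_labeler_sketch and assumes the label is the name of the immediate parent directory, as the description states.

```python
from pathlib import Path


def parent_labeler_sketch(f_name):
    """Use the name of the file's immediate parent directory as its label."""
    return Path(f_name).parent.name


label = parent_labeler_sketch("data/train/dog/002.jpg")
same_label = parent_labeler_sketch(Path("data/train/dog/002.jpg"))
```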

random_splitter(f_name, p_valid=0.2)

Randomly assign each item to the validation set with probability p_valid.

Parameters:

Name Type Description Default
f_name str

Item's filename. Not used here, but left for API consistency with other splitters.

required
p_valid float

Probability of the item to be in the validation set.

0.2

Returns:

Type Description
bool

True if the item is in validation else False (training).
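A splitter with this signature can be sketched using the standard random module. The name random_splitter_sketch is hypothetical, and drawing one uniform number per item is an assumption about how the probability is realized.

```python
import random


def random_splitter_sketch(f_name, p_valid=0.2):
    """Return True (validation) with probability p_valid; f_name is ignored."""
    return random.random() < p_valid


# Seeding makes the split reproducible across runs.
random.seed(0)
flags = [random_splitter_sketch(f"item_{i}") for i in range(10_000)]
valid_fraction = sum(flags) / len(flags)
```

Because each item is drawn independently, the validation share is only approximately p_valid, and the split changes between runs unless the random seed is fixed.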

split_by_func(items, func)

Split items into train/valid lists using func.

Parameters:

Name Type Description Default
items Iterable

Items to be split into train/valid.

required
func Callable

Split function to split items.

required

Returns:

Type Description
tuple[list, list]

Train and valid item lists.
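Combined with a splitter that returns True, False, or None, the split can be sketched as follows. The names split_by_func_sketch and by_top_dir are hypothetical, and dropping items for which the function returns None is an assumption.

```python
def split_by_func_sketch(items, func):
    """Return (train, valid) lists; items mapped to None land in neither."""
    train, valid = [], []
    for item in items:
        flag = func(item)
        if flag is None:
            # Assumption: items that are neither train nor valid are dropped.
            continue
        (valid if flag else train).append(item)
    return train, valid


def by_top_dir(name):
    # Hypothetical splitter: True for valid/, False for train/, else None.
    top = name.split("/")[0]
    return {"valid": True, "train": False}.get(top)


files = ["train/cat/1.jpg", "valid/cat/2.jpg", "train/dog/3.jpg", "test/dog/4.jpg"]
train, valid = split_by_func_sketch(files, by_top_dir)
```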

to_cpu(x)

Copy tensor(s) to CPU. If a tensor is already on the CPU, returns the tensor itself; otherwise, returns a copy of the tensor.

to_device(x, device=default_device)

Copy tensor(s) to device. If the tensor is already on the device, returns the tensor itself.

Parameters:

Name Type Description Default
x Tensor | Iterable[Tensor] | Mapping[str, Tensor]

Tensor or collection of tensors to move to device.

required
device str | device

Device to copy the tensor to.

'cuda:0' if available else 'cpu'

Returns:

Name Type Description
out Tensor | Iterable[Tensor] | Mapping[str, Tensor]

Copied tensor(s) on the device.