Data

This module collects most of the utilities for working with data in general and with PyTorch data structures in particular. It includes functions ranging from composing transforms and creating train/valid DataLoaders to splitting and labeling Datasets.

`DataLoaders`

Create train/valid DataLoaders.

`__init__(*dls)`

Parameters:

Name	Type	Description	Default
`dls`		List of DataLoaders.	`()`

`from_dd(dd, batch_size, **kwargs)` `classmethod`

Create train/valid data loaders from HF Dataset dictionary.

Parameters:

Name	Type	Description	Default
`dd`	`DatasetDict`	HF Dataset dictionary.	required
`batch_size`	`int`	Batch size passed to the DataLoader.	required

Returns:

Type	Description
`tuple[DataLoader]`	Train/valid data loaders.

`ItemList`

Bases: ListContainer

Base class for all types of datasets such as image, text, etc.

`__init__(items, path='.', tfms=None, **kwargs)`

Parameters:

Name	Type	Description	Default
`items`	`Sequence`	Items to create list.	required
`path`	`str \| Path`	Path of the items that were used to create the list.	`"."`
`tfms`	`Callable \| None`	Transformations to apply on items before returning them.	`None`

`get(item)`

Every class that inherits from ItemList has to override this method.

`new(items, cls=None)`

Create a new instance of the ItemList with items.

Parameters:

Name	Type	Description	Default
`items`	`Sequence`	The items to create the list from.	required
`cls`	`ItemList \| None`	The class to instantiate. If None, the same class will be used.	`None`

Returns:

Type	Description
`ItemList`	The new instance of the `ItemList`.

`LabeledData`

Create labeled data and expose both x and y as item lists after passing them through all processors.

`__init__(x, y, proc_x=None, proc_y=None)`

Parameters:

Name	Type	Description	Default
`x`	`ItemList`	Input items to the model.	required
`y`	`ItemList`	Label items.	required
`proc_x`	`Processor \| Iterable[Processor] \| None`	Input items processor(s).	`None`
`proc_y`	`Processor \| Iterable[Processor] \| None`	Label items processor(s).	`None`

`label_by_func(item_list, label_func, proc_x=None, proc_y=None)` `classmethod`

Label an ItemList using a labeling function.

Parameters:

Name	Type	Description	Default
`item_list`	`ItemList`	The ItemList to be labeled.	required
`label_func`	`Callable`	The function to be used for labeling.	required
`proc_x`	`Callable \| None`	The processor to be applied to the input data.	`None`
`proc_y`	`Callable \| None`	The processor to be applied to the label data.	`None`

Returns:

Type	Description
`LabeledData`	The labeled ItemList.

`process(item_list, proc)`

Applies processors to an ItemList.

Parameters:

Name	Type	Description	Default
`item_list`	`ItemList`	The ItemList to process.	required
`proc`	`Processor \| Iterable[Processor]`	The processor or list of processors to apply.	required

Returns:

Type	Description
`ItemList`	The processed ItemList.

`x_obj(idx)`

Returns the input object at index idx after applying all processors in proc_x.

Parameters:

Name	Type	Description	Default
`idx`	`int`	Index of the input object to retrieve.	required

Returns:

Type	Description
`Any`	The input object at index idx after applying all processors in proc_x.

`y_obj(idx)`

Returns the label object at index idx after applying all processors in proc_y.

Parameters:

Name	Type	Description	Default
`idx`	`int`	Index of the label object to retrieve.	required

Returns:

Type	Description
`Any`	The label object at index idx after applying all processors in proc_y.

`ListContainer`

Bases: UserList

Extends the built-in list by changing how the list is created from the given items and by changing repr to return the first 10 items along with the total number of items and the class name. This is the base class from which other containers inherit.

`__init__(items)`

Parameters:

Name	Type	Description	Default
`items`	`Any`	Items to create list from.	required
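
The repr behaviour described above can be illustrated with a minimal sketch. `ListContainerSketch` is a hypothetical stand-in written for this example, not the library's code; wrapping a single non-iterable item in a list and the exact repr layout are assumptions.

```python
from collections import UserList


class ListContainerSketch(UserList):
    """Hypothetical stand-in for ListContainer; not the library's code."""

    def __init__(self, items):
        # Assumption: a string or a single non-iterable item is wrapped in a list.
        if isinstance(items, (str, bytes)) or not hasattr(items, "__iter__"):
            items = [items]
        super().__init__(list(items))

    def __repr__(self):
        # Show the class name, the total count, and at most the first 10 items.
        body = ", ".join(repr(o) for o in self.data[:10])
        suffix = ", ..." if len(self.data) > 10 else ""
        return f"{type(self).__name__} ({len(self.data)} items)\n[{body}{suffix}]"


container = ListContainerSketch(range(25))
single = ListContainerSketch(5)
print(container)
```

Because the sketch subclasses `UserList`, indexing, slicing, and `len` keep working as they do for a regular list.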

`SplitData`

Split Item list into train and validation data lists.

`__init__(train, valid)`

Parameters:

Name	Type	Description	Default
`train`	`ItemList`	Training items.	required
`valid`	`ItemList`	Validation items.	required

`split_by_func(item_list, split_func)` `classmethod`

Split an item list using a splitter function and return a SplitData object.

`to_dls(batch_size=32, **kwargs)`

Returns a tuple of training and validation DataLoaders using train and valid datasets.

`collate_device(device)`

Collate inputs from a batch and copy them to device.

Parameters:

Name	Type	Description	Default
`device`	`device`	Device to copy batch to.	required

Returns:

Type	Description
`Callable`	Wrapper function that returns tuple of collated inputs.

`collate_dict(ds)`

Collate inputs from an HF Dataset dictionary and return a list of inputs after applying PyTorch's default collate function.

Parameters:

Name	Type	Description	Default
`ds`	`DatasetDict`	HF Dataset dictionary.	required

Returns:

Type	Description
`Callable`	Wrapper function that returns tuple of collated inputs.

`compose(x, funcs, *args, order='order', **kwargs)`

Applies transformations in funcs to the input x in order.
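
As a rough illustration of ordered composition, the sketch below sorts the functions by an attribute and applies them one after another. `compose_sketch` is hypothetical; treating `order` as the name of the attribute to sort by, and defaulting a missing attribute to 0, are assumptions about how the real function behaves.

```python
def compose_sketch(x, funcs, *args, order="order", **kwargs):
    """Hypothetical stand-in for compose; not the library's code."""
    if funcs is None:
        return x
    # Assumption: a single callable is treated as a one-element list.
    if callable(funcs):
        funcs = [funcs]
    # Assumption: functions are sorted by the attribute named by `order`,
    # and functions without that attribute sort as 0.
    for func in sorted(funcs, key=lambda f: getattr(f, order, 0)):
        x = func(x, *args, **kwargs)
    return x


def add_one(x):
    return x + 1


def double(x):
    return x * 2


add_one.order = 1  # applied second
double.order = 0   # applied first

result = compose_sketch(3, [add_one, double])  # double(3) = 6, then add_one(6) = 7
```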

`get_dls(train_ds, valid_ds, batch_size, **kwargs)`

Returns two DataLoaders: one for training and one for validation. The validation DataLoader uses twice the batch size and does not shuffle the data.
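
A minimal sketch of this behaviour using PyTorch's `DataLoader` is shown below. `get_dls_sketch` is a hypothetical stand-in; shuffling the training loader and forwarding `**kwargs` to both loaders are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def get_dls_sketch(train_ds, valid_ds, batch_size, **kwargs):
    """Hypothetical stand-in for get_dls; not the library's code."""
    return (
        # Assumption: the training loader shuffles its data.
        DataLoader(train_ds, batch_size=batch_size, shuffle=True, **kwargs),
        # Validation uses twice the batch size and keeps the original order.
        DataLoader(valid_ds, batch_size=batch_size * 2, shuffle=False, **kwargs),
    )


train_ds = TensorDataset(torch.arange(32).float().unsqueeze(1), torch.zeros(32))
valid_ds = TensorDataset(torch.arange(16).float().unsqueeze(1), torch.zeros(16))
train_dl, valid_dl = get_dls_sketch(train_ds, valid_ds, batch_size=4)
xb, yb = next(iter(valid_dl))  # first validation batch holds 8 items
```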

`get_files(path, extensions=None, include=None, recurse=False)`

Get the filenames under path whose suffix is in extensions, optionally recursing into subdirectories.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path for the root directory to search for files.	required
`extensions`	`str \| Iterable[str] \| None`	Suffixes of filenames to look for.	`None`
`include`	`Iterable[str] \| None`	Top-level directory or directories under `path` in which to search for files.	`None`
`recurse`	`bool`	Whether to search subdirectories recursively.	`False`

Returns:

Type	Description
`list[str]`	List of filenames that end with `extensions` under `path`.
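
One way such a search could work is sketched below with the standard library only. `get_files_sketch` is hypothetical; case-insensitive suffix matching, skipping hidden files and directories, sorted output, and applying `include` only when recursing are all assumptions.

```python
import os
import tempfile
from pathlib import Path


def get_files_sketch(path, extensions=None, include=None, recurse=False):
    """Hypothetical stand-in for get_files; not the library's code."""
    path = Path(path)
    if isinstance(extensions, str):
        extensions = [extensions]
    # Assumption: suffix matching is case-insensitive.
    suffixes = None if extensions is None else {e.lower() for e in extensions}

    def _keep(name):
        # Skip hidden files; keep every file when no extensions are given.
        if name.startswith("."):
            return False
        return suffixes is None or Path(name).suffix.lower() in suffixes

    if not recurse:
        return sorted(str(p) for p in path.iterdir() if p.is_file() and _keep(p.name))

    found = []
    for depth, (root, dirs, files) in enumerate(os.walk(path)):
        if depth == 0 and include is not None:
            # Restrict the walk to the requested top-level directories.
            dirs[:] = [d for d in dirs if d in include]
        else:
            dirs[:] = [d for d in dirs if not d.startswith(".")]
        found.extend(str(Path(root) / f) for f in files if _keep(f))
    return sorted(found)


# Build a small throwaway tree to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("train/cat/a.jpg", "valid/cat/b.jpg", "notes.txt"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()

    def rel_paths(files):
        return [Path(f).relative_to(root).as_posix() for f in files]

    top_level = rel_paths(get_files_sketch(root))
    all_jpgs = rel_paths(get_files_sketch(root, ".jpg", recurse=True))
    train_jpgs = rel_paths(get_files_sketch(root, ".jpg", include=["train"], recurse=True))
```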

`grandparent_splitter(f_name, valid_name='valid', train_name='train')`

Split items based on whether they fall under validation or training directories. This assumes that the directory structure is train/label/items or valid/label/items.

Parameters:

Name	Type	Description	Default
`f_name`	`str \| Path`	Item's filename.	required
`valid_name`	`str`	Name of the directory that holds the validation items.	`"valid"`
`train_name`	`str`	Name of the directory that holds the training items.	`"train"`

Returns:

Type	Description
`bool \| None`	True if the item is in validation else False (training). If neither, returns None.
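
The directory check described above can be sketched with `pathlib`. `grandparent_splitter_sketch` is a hypothetical stand-in that follows the documented return values; the example paths are made up.

```python
from pathlib import Path


def grandparent_splitter_sketch(f_name, valid_name="valid", train_name="train"):
    """Hypothetical stand-in for grandparent_splitter; not the library's code."""
    # For train/label/item.ext, the parent is "label" and the grandparent is "train".
    grandparent = Path(f_name).parent.parent.name
    if grandparent == valid_name:
        return True
    if grandparent == train_name:
        return False
    return None


in_valid = grandparent_splitter_sketch("data/valid/cat/1.jpg")
in_train = grandparent_splitter_sketch("data/train/dog/2.jpg")
in_neither = grandparent_splitter_sketch("data/test/dog/3.jpg")
```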

`label_by_func(splitted_data, label_func, proc_x=None, proc_y=None)`

Label split data using label_func.

Parameters:

Name	Type	Description	Default
`splitted_data`	`SplitData`	The split data to be labeled.	required
`label_func`	`Callable`	The function to be used for labeling.	required
`proc_x`	`Callable \| None`	The processor to be applied to the input data.	`None`
`proc_y`	`Callable \| None`	The processor to be applied to the label data.	`None`

Returns:

Type	Description
`SplitData`	The labeled split data.

`parent_labeler(f_name)`

Label a file based on its parent directory.

Parameters:

Name	Type	Description	Default
`f_name`	`str \| Path`	Filename to get the parent directory.	required

Returns:

Type	Description
`str`	Name of the parent directory.
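
For a layout such as train/label/item, the label is simply the name of the item's parent directory. The sketch below shows this with `pathlib`; `parent_labeler_sketch` and the example path are hypothetical.

```python
from pathlib import Path


def parent_labeler_sketch(f_name):
    """Hypothetical stand-in for parent_labeler; not the library's code."""
    # The parent directory of the file holds the label name.
    return Path(f_name).parent.name


label = parent_labeler_sketch("data/train/cat/1.jpg")
```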

`random_splitter(f_name, p_valid=0.2)`

Randomly assign items to the validation set with probability p_valid.

Parameters:

Name	Type	Description	Default
`f_name`	`str`	Item's filename. Not used here, but left for API consistency with other splitters.	required
`p_valid`	`float`	Probability of the item to be in the validation set.	`0.2`

Returns:

Type	Description
`bool`	True if the item is in validation else False (training).
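
A minimal sketch of such a splitter draws one uniform random number per item. `random_splitter_sketch` is hypothetical, and the use of Python's `random` module (rather than another random source) is an assumption.

```python
import random


def random_splitter_sketch(f_name, p_valid=0.2):
    """Hypothetical stand-in for random_splitter; not the library's code."""
    # f_name is ignored; it is accepted only to match the other splitters.
    return random.random() < p_valid


random.seed(0)  # seeded only to make this demonstration repeatable
flags = [random_splitter_sketch(f"item_{i}.jpg") for i in range(10_000)]
valid_share = sum(flags) / len(flags)  # close to p_valid for a large sample
```

Since each item is drawn independently, the validation share only approximates `p_valid`, and the split changes between runs unless the random seed is fixed.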

`split_by_func(items, func)`

Split items into train/valid lists using func.

Parameters:

Name	Type	Description	Default
`items`	`Iterable`	Items to be split into train/valid.	required
`func`	`Callable`	Split function to split items.	required

Returns:

Type	Description
`tuple[list, list]`	Train and valid item lists.
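
The sketch below shows how a splitter that returns True, False, or None could be used to partition items. `split_by_func_sketch` and `in_valid_dir` are hypothetical; dropping items for which the splitter returns None is an assumption.

```python
def split_by_func_sketch(items, func):
    """Hypothetical stand-in for split_by_func; not the library's code."""
    train, valid = [], []
    for item in items:
        flag = func(item)
        if flag is None:
            # Assumption: items that belong to neither split are dropped.
            continue
        (valid if flag else train).append(item)
    return train, valid


def in_valid_dir(f_name):
    # Simple illustrative splitter based on directory names in the path.
    parts = f_name.split("/")
    if "valid" in parts:
        return True
    if "train" in parts:
        return False
    return None


files = ["train/cat/1.jpg", "valid/cat/2.jpg", "test/cat/3.jpg", "train/dog/4.jpg"]
train_items, valid_items = split_by_func_sketch(files, in_valid_dir)
```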

`to_cpu(x)`

Copy tensor(s) to CPU. If a tensor is already on the CPU, returns the tensor itself; otherwise, returns a copy of the tensor.

`to_device(x, device=default_device)`

Copy tensor(s) to device. If the tensor is already on the device, returns the tensor itself.

Parameters:

Name	Type	Description	Default
`x`	`Tensor \| Iterable[Tensor] \| Mapping[str, Tensor]`	Tensor or collection of tensors to move to device.	required
`device`	`str`	Device to copy the tensor to.	`'cuda:0'` if available, else `'cpu'`

Returns:

Name	Type	Description
`out`	`Tensor \| Iterable[Tensor] \| Mapping[str, Tensor]`	Copied tensor(s) on the `device`.
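
Handling a single tensor, an iterable of tensors, or a mapping of tensors could look like the recursive sketch below. `to_device_sketch` is hypothetical; rebuilding iterables with their original type and returning a plain dict for mappings are assumptions. `Tensor.to` returns the tensor itself when it is already on the requested device with the same dtype.

```python
from collections.abc import Mapping

import torch


def to_device_sketch(x, device="cpu"):
    """Hypothetical stand-in for to_device; not the library's code."""
    if isinstance(x, torch.Tensor):
        # Tensor.to returns the same object when no copy is needed.
        return x.to(device)
    if isinstance(x, Mapping):
        # Assumption: mappings come back as a plain dict.
        return {k: to_device_sketch(v, device) for k, v in x.items()}
    # Assumption: lists and tuples are rebuilt with their original type.
    return type(x)(to_device_sketch(o, device) for o in x)


t = torch.zeros(2)
moved = to_device_sketch({"x": t, "pair": [t, t]}, "cpu")
```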
