Data
This module includes most of the utilities for working with data in general and with PyTorch data in particular. It includes functions that range from composing transforms and getting train/valid DataLoaders to splitting and labeling Datasets.
DataLoaders
Create train/valid DataLoaders.
__init__(*dls)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dls | | List of DataLoaders. | () |
from_dd(dd, batch_size, **kwargs)
classmethod
Create train/valid data loaders from HF Dataset dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dd | DatasetDict | HF Dataset dictionary. Must have at least two datasets: train and valid/test datasets. | required |
batch_size | int | Batch size passed to DataLoader. | required |
Returns:
Type | Description |
---|---|
tuple[DataLoader] | Train/valid data loaders. |
ItemList
Bases: ListContainer
Base class for all types of datasets such as image, text, etc.
__init__(items, path='.', tfms=None, **kwargs)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
items | Sequence | Items to create list. | required |
path | str \| Path | Path of the items that were used to create the list. | "." |
tfms | Callable \| None | Transformations to apply on items before returning them. | None |
get(item)
Every class that inherits from ItemList has to override this method.
new(items, cls=None)
Create a new instance of the ItemList with items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
items | Sequence | The items to create the list from. | required |
cls | ItemList \| None | The class to instantiate. If None, the same class will be used. | None |
Returns:
Type | Description |
---|---|
ItemList | The new instance of the |
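The subclassing contract described above (override get, use new to build a sibling list) can be illustrated with a minimal, self-contained sketch. ItemListSketch and UpperCaseList are hypothetical stand-ins written for this example, not the library's classes; how the real class stores path and applies tfms is an assumption.

```python
from collections import UserList
from pathlib import Path


class ItemListSketch(UserList):
    """Hypothetical stand-in for ItemList (assumed behavior, not the library's code)."""

    def __init__(self, items, path=".", tfms=None, **kwargs):
        super().__init__(list(items))
        self.path = Path(path)
        self.tfms = tfms

    def get(self, item):
        # Subclasses override this to turn a raw item into a usable object.
        return item

    def new(self, items, cls=None):
        # Reuse the current class unless another one is requested.
        cls = type(self) if cls is None else cls
        return cls(items, self.path, tfms=self.tfms)


class UpperCaseList(ItemListSketch):
    """Hypothetical subclass that overrides get."""

    def get(self, item):
        return str(item).upper()


words = UpperCaseList(["a", "b", "c"], path="data")
subset = words.new(["b"])
print(type(subset).__name__, subset.path, subset.get(subset[0]))
```

In this sketch, new keeps the subclass, path, and tfms of the original list, so a subset created during splitting behaves like the list it came from.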
LabeledData
Create labeled data and expose both x and y as item lists after passing them through all processors.
__init__(x, y, proc_x=None, proc_y=None)
label_by_func(item_list, label_func, proc_x=None, proc_y=None)
classmethod
Label an ItemList using a labeling function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item_list | ItemList | The ItemList to be labeled. | required |
label_func | Callable | The function to be used for labeling. | required |
proc_x | Callable \| None | The processor to be applied to the input data. | None |
proc_y | Callable \| None | The processor to be applied to the label data. | None |
Returns:
Type | Description |
---|---|
LabeledData | The labeled ItemList. |
process(item_list, proc)
x_obj(idx)
Returns the input object at index idx after applying all processors in proc_x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the input object to retrieve. | required |
Returns:
Type | Description |
---|---|
Any | The input object at index idx after applying all processors in proc_x. |
y_obj(idx)
Returns the label object at index idx after applying all processors in proc_y.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the label object to retrieve. | required |
Returns:
Type | Description |
---|---|
Any | The label object at index idx after applying all processors in proc_y. |
ListContainer
Bases: UserList
Extends the built-in list by changing how the list is created from the given items and by changing repr to return the first 10 items along with the total number of items and the class name. This is the base class that other containers inherit from.
__init__(items)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
items | Any | Items to create list from. | required |
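A minimal sketch of the repr behavior described above, assuming a UserList subclass. ListContainerSketch is a hypothetical name, and the exact repr layout of the real class is not given in this page, so the format below is an assumption.

```python
from collections import UserList


class ListContainerSketch(UserList):
    """Hypothetical stand-in for ListContainer; the real repr layout may differ."""

    def __init__(self, items):
        # Assumption: any iterable of items is materialized into a plain list.
        super().__init__(list(items))

    def __repr__(self):
        # Show the class name, the total count, and at most the first 10 items.
        head = ", ".join(repr(item) for item in self.data[:10])
        tail = ", ..." if len(self.data) > 10 else ""
        return f"{type(self).__name__} ({len(self.data)} items)\n[{head}{tail}]"


container = ListContainerSketch(range(25))
print(container)
```

Truncating the repr keeps interactive sessions readable when a container holds thousands of filenames.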
SplitData
Split an item list into train and validation data lists.
__init__(train, valid)
split_by_func(item_list, split_func)
classmethod
Split an item list using a splitter function and return a SplitData object.
to_dls(batch_size=32, **kwargs)
Returns a tuple of training and validation DataLoaders using train and valid datasets.
collate_device(device)
Collate the inputs from a batch and copy them to device.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | str \| device | Device to copy batch to. | required |
Returns:
Type | Description |
---|---|
Callable | Wrapper function that returns tuple of collated inputs. |
collate_dict(ds)
Collate inputs from an HF Dataset dictionary and return a list of inputs after applying PyTorch's default collate function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ds | DatasetDict | HF Dataset dictionary. | required |
Returns:
Type | Description |
---|---|
Callable | Wrapper function that returns tuple of collated inputs. |
compose(x, funcs, *args, order='order', **kwargs)
Applies transformations in funcs to the input x in order.
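One plausible reading of the signature is that each function may carry an attribute named by order, and functions are applied sorted by that attribute. The sketch below implements that reading; compose_sketch is a hypothetical name, and both the sorting rule and the default of 0 for functions without the attribute are assumptions.

```python
def compose_sketch(x, funcs, *args, order="order", **kwargs):
    """Hypothetical stand-in for compose (assumed sorting behavior)."""
    # Assumption: sort by the attribute named by `order`, defaulting to 0.
    # sorted() is stable, so functions with equal keys keep their list order.
    for func in sorted(funcs or [], key=lambda f: getattr(f, order, 0)):
        x = func(x, *args, **kwargs)
    return x


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Without order attributes the list order is used: (3 + 1) * 2.
unordered = compose_sketch(3, [add_one, double])

# With order attributes, double (order 1) runs before add_one (order 2): 3 * 2 + 1.
double.order = 1
add_one.order = 2
ordered = compose_sketch(3, [add_one, double])
print(unordered, ordered)
```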
get_dls(train_ds, valid_ds, batch_size, **kwargs)
Returns two dataloaders: one for training and one for validation. The validation dataloader has twice the batch size and doesn't shuffle data.
get_files(path, extensions=None, include=None, recurse=False)
Get filenames in path that have one of the extensions in extensions, starting from path and optionally recursing into subdirectories.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str \| Path | Path for the root directory to search for files. | required |
extensions | str \| Iterable[str] \| None | Suffixes of filenames to look for. | None |
include | Iterable[str] \| None | Top-level directory or directories under | None |
recurse | bool | Whether to search subdirectories recursively. | False |
Returns:
Type | Description |
---|---|
list[str] | List of filenames that ends with |
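The parameters above can be sketched with the standard library alone. get_files_sketch is a hypothetical stand-in; the description of include is incomplete on this page, so treating it as a filter on the top-level directories that are descended into during a recursive search is an assumption, as is returning the results sorted.

```python
import os
import tempfile
from pathlib import Path


def get_files_sketch(path, extensions=None, include=None, recurse=False):
    """Hypothetical stand-in for get_files (assumed semantics for include)."""
    path = Path(path)
    if isinstance(extensions, str):
        extensions = [extensions]
    suffixes = None if extensions is None else {ext.lower() for ext in extensions}

    def keep(name):
        # Keep every file when no extensions are given.
        return suffixes is None or Path(name).suffix.lower() in suffixes

    if not recurse:
        return sorted(str(p) for p in path.iterdir() if p.is_file() and keep(p.name))

    found = []
    for root, dirs, files in os.walk(path):
        if include is not None and Path(root) == path:
            # Assumption: include limits which top-level directories are visited.
            dirs[:] = [d for d in dirs if d in include]
        found.extend(str(Path(root) / f) for f in files if keep(f))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["train/cat/1.jpg", "train/dog/2.jpg", "valid/cat/3.jpg", "notes/readme.txt"]:
        file = root / rel
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()
    images = get_files_sketch(root, extensions=".jpg", include=["train", "valid"], recurse=True)
    names = [Path(f).name for f in images]
    top_level = get_files_sketch(root)
print(names)
```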
grandparent_splitter(f_name, valid_name='valid', train_name='train')
Split items based on whether they fall under validation or training directories. This assumes that the directory structure is train/label/items or valid/label/items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
f_name | str \| Path | Item's filename. | required |
valid_name | str | Name of the directory that holds the validation items. | "valid" |
train_name | str | Name of the directory that holds the training items. | "train" |
Returns:
Type | Description |
---|---|
bool \| None | True if the item is in validation else False (training). If neither, returns None. |
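The three-way return value can be sketched with pathlib. grandparent_splitter_sketch is a hypothetical stand-in that follows the documented train/label/items layout; the library's actual implementation may differ.

```python
from pathlib import Path


def grandparent_splitter_sketch(f_name, valid_name="valid", train_name="train"):
    """Hypothetical stand-in for grandparent_splitter."""
    # In train/label/item, the grandparent of the item is the split directory.
    grandparent = Path(f_name).parent.parent.name
    if grandparent == valid_name:
        return True
    if grandparent == train_name:
        return False
    return None


in_valid = grandparent_splitter_sketch("data/valid/cat/1.jpg")
in_train = grandparent_splitter_sketch("data/train/dog/2.jpg")
in_neither = grandparent_splitter_sketch("data/extra/dog/3.jpg")
print(in_valid, in_train, in_neither)
```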
label_by_func(splitted_data, label_func, proc_x=None, proc_y=None)
Label split data using label_func.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
splitted_data | SplitData | The split data to be labeled. | required |
label_func | Callable | The function to be used for labeling. | required |
proc_x | Callable \| None | The processor to be applied to the input data. | None |
proc_y | Callable \| None | The processor to be applied to the label data. | None |
Returns:
Type | Description |
---|---|
SplitData | The labeled split data. |
parent_labeler(f_name)
Label a file based on its parent directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
f_name | str \| Path | Filename to get the parent directory. | required |
Returns:
Type | Description |
---|---|
str | Name of the parent directory. |
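A minimal sketch of labeling by parent directory, assuming pathlib is enough to express it. parent_labeler_sketch is a hypothetical stand-in, not the library's code.

```python
from pathlib import Path


def parent_labeler_sketch(f_name):
    """Hypothetical stand-in for parent_labeler."""
    # The label is the name of the directory that directly contains the file.
    return Path(f_name).parent.name


labels = [parent_labeler_sketch(f) for f in ["data/train/cat/1.jpg", "data/train/dog/2.jpg"]]
print(labels)
```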
random_splitter(f_name, p_valid=0.2)
Randomly assign each item to the validation set with probability p_valid.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
f_name | str | Item's filename. Not used here, but left for API consistency with other splitters. | required |
p_valid | float | Probability of the item to be in the validation set. | 0.2 |
Returns:
Type | Description |
---|---|
bool | True if the item is in validation else False (training). |
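A minimal sketch of the idea, assuming the decision is a single uniform draw compared against p_valid. random_splitter_sketch is a hypothetical stand-in; the seed is set only to make the demonstration repeatable.

```python
import random


def random_splitter_sketch(f_name, p_valid=0.2):
    """Hypothetical stand-in for random_splitter."""
    # f_name is ignored; it is kept so the signature matches other splitters.
    return random.random() < p_valid


random.seed(0)
flags = [random_splitter_sketch(f"img_{i}.jpg") for i in range(10_000)]
valid_fraction = sum(flags) / len(flags)
print(f"validation fraction: {valid_fraction:.3f}")
```

Because each item is drawn independently, the validation share is only approximately p_valid, and repeated calls on the same item can give different answers unless the random seed is fixed.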
split_by_func(items, func)
Split items into train/valid lists using func.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
items | Iterable | Items to be split into train/valid. | required |
func | Callable | Split function to split items. | required |
Returns:
Type | Description |
---|---|
tuple[list, list] | Train and valid item lists. |
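A sketch of how a splitter that returns True, False, or None could drive the split. split_by_func_sketch and by_top_dir are hypothetical names; dropping items for which the splitter returns None is an assumption, since this page does not say how such items are handled.

```python
def split_by_func_sketch(items, func):
    """Hypothetical stand-in for split_by_func."""
    train, valid = [], []
    for item in items:
        is_valid = func(item)
        if is_valid is None:
            # Assumption: items the splitter cannot place are dropped.
            continue
        (valid if is_valid else train).append(item)
    return train, valid


def by_top_dir(name):
    # Hypothetical splitter: True for valid/, False for train/, None otherwise.
    return {"valid": True, "train": False}.get(name.split("/")[0])


files = ["train/cat/1.jpg", "valid/cat/2.jpg", "train/dog/3.jpg", "misc/4.jpg"]
train, valid = split_by_func_sketch(files, by_top_dir)
print(train, valid)
```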
to_cpu(x)
Copy tensor(s) to CPU. If a tensor is already on the CPU, returns the tensor itself; otherwise, returns a copy of the tensor.
to_device(x, device=default_device)
Copy tensor(s) to device. If the tensor is already on the device, returns the tensor itself.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Tensor \| Iterable[Tensor] \| Mapping[str, Tensor] | Tensor or collection of tensors to move to device. | required |
device | str \| device | Device to copy the tensor to. | 'cuda:0' if available else 'cpu' |
Returns:
Name | Type | Description |
---|---|---|
out | Tensor \| Iterable[Tensor] \| Mapping[str, Tensor] | Copied tensor(s) on the |