Processors
Data processors for deep learning pipelines.
This module provides processor classes designed to transform data before it is fed into deep learning models. Processors implement common data preprocessing patterns such as tokenization, numericalization, and vocabulary management for text and other sequential data.
The processors follow a consistent interface pattern:
- Initialize with configuration parameters
- Fit to training data to learn vocabulary/transformations
- Transform data to numerical representations
- Reverse transform (decode) numerical data back to original form
All processors support:
- Vocabulary management with automatic building from training data
- Frequency filtering to remove tokens that fall below a minimum frequency threshold (see the sketch after this list)
- Size limits to respect maximum vocabulary size constraints
- Reserved tokens for special tokens (padding, unknown, etc.)
- Immediate usability when processors have predefined vocabularies
- Modern typing with Python 3.10+ union syntax
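A minimal sketch of how frequency filtering and the unknown token interact, using the NumericalizeProcessor documented below; the training data is made up, and exact index assignments after fitting are implementation-dependent, so the checks only test vocabulary membership:

>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> processor = NumericalizeProcessor(min_freq=2, max_vocab=10)
>>> _ = processor.fit([["hello", "world"], ["hello", "there"], ["world", "peace"]])
>>> "hello" in processor.vocab   # appears twice, passes min_freq=2
True
>>> "peace" in processor.vocab   # appears once, filtered out
False
>>> "<unk>" in processor.vocab   # the unknown token is always included
True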
Classes:
- Processor – Base class for all processors. Defines the processor interface.
- NumericalizeProcessor – Converts tokens to numerical indices using a learned vocabulary. Supports frequency filtering, vocabulary size limits, and reserved tokens. Can be initialized with a predefined vocabulary for immediate use.
Examples:
Basic token numericalization workflow:
>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> processor = NumericalizeProcessor(min_freq=2, max_vocab=1000)
>>> training_data = [["hello", "world"], ["hello", "there"], ["world", "peace"]]
>>> processor.fit(training_data)
>>> indices = processor.process(["hello", "world"])
>>> print(indices)
[1, 2]
>>> tokens = processor.deprocess(indices)
>>> print(tokens)
['hello', 'world']
Using a predefined vocabulary:
>>> vocab = ["<unk>", "<pad>", "hello", "world", "goodbye"]
>>> processor = NumericalizeProcessor(vocab=vocab)
>>> indices = processor.process(["hello", "world"])
>>> print(indices)
[2, 3]
Including reserved tokens in vocabulary:
>>> processor = NumericalizeProcessor(
... max_vocab=100,
... min_freq=1,
... reserved_tokens=["<pad>", "<eos>", "<sos>"]
... )
>>> processor.fit(training_data)
Notes
Processors include comprehensive error handling (illustrated after this list):
- ValueError is raised when attempting to use unfitted processors
- Unknown tokens are gracefully handled (returns unk_token index)
- Out-of-bounds indices are gracefully handled (returns unk_token)
- Vocabulary size constraints are strictly enforced
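For instance, hedging on the exact error message, the failure and fallback behavior looks like this (the vocabulary below is made up for illustration):

>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> NumericalizeProcessor().process(["hello"])   # not fitted yet
Traceback (most recent call last):
    ...
ValueError: ...
>>> fitted = NumericalizeProcessor(vocab=["<unk>", "hello"])
>>> fitted.process("goodbye") == fitted.unk_idx  # unknown token falls back to unk_token
True
>>> fitted.deprocess(999)                        # out-of-bounds index falls back to unk_token
'<unk>'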
Implementation details (see the sketch after this list):
- Vocabulary building respects both frequency thresholds and size limits
- Reserved tokens are prioritized over data tokens when space is limited
- The unk_token is always included and has the highest priority
- All processors use modern Python type hints with union syntax (|)
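A hypothetical sketch of these priorities under a tight size limit; the data and reserved tokens are made up, and the relative ordering of vocabulary entries is an assumption, but the documented rules determine which tokens survive:

>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> processor = NumericalizeProcessor(
...     max_vocab=4,
...     min_freq=1,
...     reserved_tokens=["<pad>", "<eos>"],
... )
>>> _ = processor.fit([["alpha", "beta", "alpha", "gamma"]])
>>> len(processor)                                        # size limit strictly enforced
4
>>> {"<unk>", "<pad>", "<eos>"} <= set(processor.vocab)   # unk and reserved tokens kept first
True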
See Also
cmn_ai.utils.utils : Utility functions used by processors
cmn_ai.text.data : Text data handling utilities
NumericalizeProcessor
Bases: Processor
A processor that converts tokens to numerical indices and vice versa.
This processor builds a vocabulary from input tokens and provides methods to convert between tokens and their corresponding numerical indices.
Parameters:
- vocab (List[str] | None, default: None) – Pre-defined vocabulary. If None, vocabulary will be built from input data.
- max_vocab (int, default: 60000) – Maximum vocabulary size.
- min_freq (int, default: 2) – Minimum frequency threshold for tokens to be included in vocabulary.
- reserved_tokens (str | List[str] | None, default: None) – Reserved tokens to always include in vocabulary (e.g., special tokens).
- unk_token (str, default: "<unk>") – Token to use for unknown/out-of-vocabulary items.
Attributes:
- vocab (List[str]) – The vocabulary list mapping indices to tokens.
- token_to_idx (Dict[str, int]) – Mapping from tokens to their indices.
- idx_to_token (Dict[int, str]) – Mapping from indices to their tokens.
- is_fitted (bool) – Whether the processor has been fitted with data.
Examples:
>>> processor = NumericalizeProcessor(min_freq=1)
>>> tokens = [["hello", "world"], ["hello", "there"]]
>>> indices = processor(tokens)
>>> processor.deprocess(indices[0])
['hello', 'world']
Properties:
- idx_to_token (property) – Get the index-to-token mapping.
- is_fitted (property) – Check if the processor has been fitted.
- token_to_idx (property) – Get the token-to-index mapping.
- unk (property) – Alias for unk_idx for backward compatibility.
- unk_idx (property) – Get the index of the unknown token.
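For example, with a predefined vocabulary (made up for illustration) the mappings are available immediately:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor.unk_idx
0
>>> processor.token_to_idx["hello"]
2
>>> processor.idx_to_token[3]
'world'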
__call__(items)
Process a list of items, building vocabulary if needed.
Parameters:
- items (List[str | List[str]]) – List of tokens or token sequences to process.
Returns:
- List[int | List[int]] – List of corresponding indices or index sequences.
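For example, with a predefined vocabulary (indices follow the position of each token in the supplied list, as in the module-level examples):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor([["hello", "world"], ["world", "hello"]])
[[2, 3], [3, 2]]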
__getitem__(tokens)
Get indices for tokens using bracket notation.
Parameters:
- tokens (str | List[str]) – Token or list of tokens.
Returns:
- int | List[int] – Corresponding index or list of indices.
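For example, using the same illustrative vocabulary as above:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor["hello"]
2
>>> processor[["hello", "world"]]
[2, 3]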
__len__()
Return the size of the vocabulary.
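For example, assuming the vocabulary size is simply the length of the predefined list:

>>> len(NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"]))
4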
deprocess(indices)
Convert indices back to tokens.
Parameters:
- indices (int | List[int]) – Index or list of indices to convert.
Returns:
- str | List[str] – Corresponding token or list of tokens.
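For example (illustrative vocabulary):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor.deprocess(3)
'world'
>>> processor.deprocess([2, 3])
['hello', 'world']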
fit(items)
Fit the processor to the data without processing.
Parameters:
- items (List[str | List[str]]) – Items to build vocabulary from.
Returns:
- NumericalizeProcessor – Self for method chaining.
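Because fit returns self, construction and fitting can be chained, for example:

>>> data = [["hello", "world"], ["hello", "there"]]
>>> processor = NumericalizeProcessor(min_freq=1).fit(data)
>>> processor.is_fitted
True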
get_index(token)
Get index for a given token.
Parameters:
- token (str) – Token to look up.
Returns:
- int – Corresponding index.
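For example (illustrative vocabulary; unknown tokens fall back to the unk index):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor.get_index("world")
3
>>> processor.get_index("missing") == processor.unk_idx
True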
get_token(idx)
Get token for a given index.
Parameters:
- idx (int) – Index to look up.
Returns:
- str – Corresponding token.
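For example (illustrative vocabulary; out-of-bounds indices fall back to the unknown token):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor.get_token(2)
'hello'
>>> processor.get_token(10)
'<unk>'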
process(items)
Convert tokens to indices.
Parameters:
- items (str | List[str]) – Token or list of tokens to convert.
Returns:
- int | List[int] – Corresponding index or list of indices.
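For example (illustrative vocabulary), a single token yields a single index and a list yields a list:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world"])
>>> processor.process("hello")
2
>>> processor.process(["hello", "world"])
[2, 3]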
Processor
Base class for all processors.
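As a purely hypothetical sketch of the interface described at the top of this module, a custom processor could subclass Processor and provide its own process/deprocess pair; the method names mirror NumericalizeProcessor and are assumptions about the base-class contract (including that no further abstract methods must be overridden), not a verbatim copy of it:

>>> from cmn_ai.utils.processors import Processor
>>> class UppercaseProcessor(Processor):
...     """Hypothetical processor: uppercases tokens and lowercases them back."""
...     def process(self, items):
...         return [token.upper() for token in items]
...     def deprocess(self, items):
...         return [token.lower() for token in items]
>>> UppercaseProcessor().process(["hello", "world"])
['HELLO', 'WORLD']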