Processors

Data processors for deep learning pipelines.

This module provides processor classes designed to transform data before it is fed into deep learning models. Processors implement common data preprocessing patterns such as tokenization, numericalization, and vocabulary management for text and other sequential data.

The processors follow a consistent interface pattern (see the sketch after this list):

  1. Initialize with configuration parameters
  2. Fit to training data to learn vocabulary/transformations
  3. Transform data to numerical representations
  4. Reverse transform (decode) numerical data back to original form
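
A minimal sketch of how these steps map onto NumericalizeProcessor's constructor, fit, process, and deprocess (fuller workflows appear in the Examples section below):

>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> processor = NumericalizeProcessor(min_freq=1)        # 1. initialize
>>> processor.fit([["good", "day"], ["good", "night"]])  # 2. fit to training data
>>> indices = processor.process(["good", "day"])         # 3. transform tokens to indices
>>> processor.deprocess(indices)                         # 4. decode indices back to tokens
['good', 'day']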

All processors support:

  • Vocabulary management with automatic building from training data
  • Frequency filtering that drops tokens below a minimum frequency threshold
  • Size limits that cap the vocabulary at a maximum number of tokens
  • Reserved tokens that are always included in the vocabulary (padding, unknown, etc.)
  • Immediate usability when processors have predefined vocabularies
  • Modern typing with Python 3.10+ union syntax

Classes:

  • Processor

    Base class for all processors. Defines the processor interface.

  • NumericalizeProcessor

    Converts tokens to numerical indices using a learned vocabulary. Supports frequency filtering, vocabulary size limits, and reserved tokens. Can be initialized with a predefined vocabulary for immediate use.

Examples:

Basic token numericalization workflow:

>>> from cmn_ai.utils.processors import NumericalizeProcessor
>>> processor = NumericalizeProcessor(min_freq=2, max_vocab=1000)
>>> training_data = [["hello", "world"], ["hello", "there"], ["world", "peace"]]
>>> processor.fit(training_data)
>>> indices = processor.process(["hello", "world"])
>>> print(indices)
[1, 2]
>>> tokens = processor.deprocess(indices)
>>> print(tokens)
['hello', 'world']

Using a predefined vocabulary:

>>> vocab = ["<unk>", "<pad>", "hello", "world", "goodbye"]
>>> processor = NumericalizeProcessor(vocab=vocab)
>>> indices = processor.process(["hello", "world"])
>>> print(indices)
[2, 3]

Including reserved tokens in vocabulary:

>>> processor = NumericalizeProcessor(
...     max_vocab=100,
...     min_freq=1,
...     reserved_tokens=["<pad>", "<eos>", "<sos>"]
... )
>>> processor.fit(training_data)
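
Assuming the unknown token is placed first and the reserved tokens immediately after it (the priority order described in the notes below), the head of the resulting vocabulary would look roughly like:

>>> processor.vocab[:4]  # illustrative; exact ordering depends on the implementation
['<unk>', '<pad>', '<eos>', '<sos>']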

Notes

Processors include comprehensive error handling (see the example after this list):

  • ValueError is raised when attempting to use unfitted processors
  • Unknown tokens are gracefully handled (returns unk_token index)
  • Out-of-bounds indices are gracefully handled (returns unk_token)
  • Vocabulary size constraints are strictly enforced
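
For example, with the predefined vocabulary from the examples above (where "<unk>" sits at index 0), unknown tokens and out-of-bounds indices both fall back to the unknown token; the outputs shown are illustrative:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.process(["hello", "not-in-vocab"])  # unknown token maps to the <unk> index
[2, 0]
>>> processor.deprocess([3, 99])                  # out-of-bounds index maps back to unk_token
['world', '<unk>']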

Implementation details (see the sketch after this list):

  • Vocabulary building respects both frequency thresholds and size limits
  • Reserved tokens are prioritized over data tokens when space is limited
  • The unk_token is always included and has the highest priority
  • All processors use modern Python type hints with union syntax (|)
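
A small sketch of the size limit and reserved-token priority, assuming max_vocab counts the unknown and reserved tokens toward the limit:

>>> tiny = NumericalizeProcessor(max_vocab=3, min_freq=1, reserved_tokens=["<pad>"])
>>> tiny.fit([["red", "green", "blue", "red"]])
>>> len(tiny)  # <unk> and <pad> take priority, leaving room for a single data token
3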

See Also

cmn_ai.utils.utils : Utility functions used by processors
cmn_ai.text.data : Text data handling utilities

NumericalizeProcessor

Bases: Processor

A processor that converts tokens to numerical indices and vice versa.

This processor builds a vocabulary from input tokens and provides methods to convert between tokens and their corresponding numerical indices.

Parameters:

  • vocab (List[str] | None, default: None ) –

    Pre-defined vocabulary. If None, vocabulary will be built from input data.

  • max_vocab (int, default: 60000 ) –

    Maximum vocabulary size.

  • min_freq (int, default: 2 ) –

    Minimum frequency threshold for tokens to be included in vocabulary.

  • reserved_tokens (str | List[str] | None, default: None ) –

    Reserved tokens to always include in vocabulary (e.g., special tokens).

  • unk_token (str, default: "<unk>" ) –

    Token to use for unknown/out-of-vocabulary items.

Attributes:

  • vocab (List[str]) –

    The vocabulary list mapping indices to tokens.

  • token_to_idx (Dict[str, int]) –

    Mapping from tokens to their indices.

  • idx_to_token (Dict[int, str]) –

    Mapping from indices to their tokens.

  • is_fitted (bool) –

    Whether the processor has been fitted with data.

Examples:

>>> processor = NumericalizeProcessor(min_freq=1)
>>> tokens = [["hello", "world"], ["hello", "there"]]
>>> indices = processor(tokens)
>>> processor.deprocess(indices[0])
['hello', 'world']

idx_to_token property

Get the index-to-token mapping.

is_fitted property

Check if the processor has been fitted.

token_to_idx property

Get the token-to-index mapping.

unk property

Alias for unk_idx for backward compatibility.

unk_idx property

Get the index of the unknown token.
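
These properties can be inspected directly once a vocabulary exists; continuing with the predefined vocabulary from the examples above (values shown are illustrative):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.is_fitted           # usable immediately with a predefined vocabulary
True
>>> processor.token_to_idx["world"]
3
>>> processor.idx_to_token[3]
'world'
>>> processor.unk_idx             # index of the unknown token, "<unk>" here
0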

__call__(items)

Process a list of items, building vocabulary if needed.

Parameters:

  • items (List[str | List[str]]) –

    List of tokens or token sequences to process.

Returns:

  • List[int | List[int]]

    List of corresponding indices or index sequences.

__getitem__(tokens)

Get indices for tokens using bracket notation.

Parameters:

  • tokens (str | List[str]) –

    Token or list of tokens.

Returns:

  • int | List[int]

    Corresponding index or list of indices.
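
A brief illustration of bracket lookup, using the same predefined vocabulary as the examples above:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor["goodbye"]             # single token -> single index
4
>>> processor[["hello", "goodbye"]]  # list of tokens -> list of indices
[2, 4]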

__len__()

Return the size of the vocabulary.
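
For example, with a five-token predefined vocabulary:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> len(processor)
5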

deprocess(indices)

Convert indices back to tokens.

Parameters:

  • indices (int | List[int]) –

    Index or list of indices to convert.

Returns:

  • str | List[str]

    Corresponding token or list of tokens.
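
A short example with the illustrative five-token vocabulary used above:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.deprocess(4)       # single index -> single token
'goodbye'
>>> processor.deprocess([2, 3])  # list of indices -> list of tokens
['hello', 'world']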

fit(items)

Fit the processor to the data without processing.

Parameters:

  • items (List[str | List[str]]) –

    Items to build vocabulary from.

Returns:

get_index(token)

Get index for a given token.

Parameters:

  • token (str) –

    Token to look up.

Returns:

  • int

    Corresponding index.
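
For example, with the illustrative predefined vocabulary used above:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.get_index("hello")
2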

get_token(idx)

Get token for a given index.

Parameters:

  • idx (int) –

    Index to look up.

Returns:

  • str

    Corresponding token.
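
For example:

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.get_token(3)
'world'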

process(items)

Convert tokens to indices.

Parameters:

  • items (str | List[str]) –

    Token or list of tokens to convert.

Returns:

  • int | List[int]

    Corresponding index or list of indices.
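
For example, with the same illustrative vocabulary (per the signature above, a single token yields a single index):

>>> processor = NumericalizeProcessor(vocab=["<unk>", "<pad>", "hello", "world", "goodbye"])
>>> processor.process("hello")               # single token -> single index
2
>>> processor.process(["world", "goodbye"])  # list of tokens -> list of indices
[3, 4]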

Processor

Base class for all processors.
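
A minimal, hypothetical subclass to illustrate the interface, assuming subclasses override process and deprocess as NumericalizeProcessor does (the exact abstract methods of Processor may differ):

>>> from cmn_ai.utils.processors import Processor
>>> class LowercaseProcessor(Processor):
...     """Hypothetical processor that lowercases string tokens."""
...     def process(self, items):
...         # forward transform: lowercase every token
...         return [token.lower() for token in items]
...     def deprocess(self, items):
...         # lowercasing is lossy, so the reverse transform returns items unchanged
...         return items
>>> LowercaseProcessor().process(["Hello", "World"])
['hello', 'world']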