Processing Module Reference
This section provides a detailed API reference for all modules related to processing datasets.
Binarize
Bases: Processor
A class for binarizing rating values in a dataset based on a given threshold.
This class processes a dataset wrapped in a DataRec object and modifies the rating column based on the specified threshold. If implicit is set to True, rows with ratings below the threshold are removed, and the rating column is dropped. Otherwise, ratings are binarized to either over_threshold or under_threshold values.
Source code in datarec/processing/binarizer.py
binary_threshold
property
Returns the rating threshold used to distinguish positive interactions.
over_threshold
property
Returns the value assigned to ratings at or above the threshold.
under_threshold
property
Returns the value assigned to ratings below the threshold.
__init__(threshold, implicit=False, over_threshold=1, under_threshold=0)
Initializes the Binarize object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | The threshold for binarization. | required |
implicit | bool | If True, removes rows below the threshold and drops the rating column. | False |
over_threshold | (int, float) | The value assigned to ratings equal to or above the threshold. | 1 |
under_threshold | (int, float) | The value assigned to ratings below the threshold. | 0 |
Source code in datarec/processing/binarizer.py
run(datarec)
Binarizes the rating values in the given dataset based on a threshold.
If implicit is True, removes rows where the rating is below the threshold and drops the rating column. If implicit is False, replaces the rating values with binary values (over_threshold if >= threshold, under_threshold otherwise).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The input dataset wrapped in a DataRec object. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the processed dataset. |
Source code in datarec/processing/binarizer.py
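The two modes described above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the `binarize` function and the `(user, item, rating)` tuple layout are assumptions made for illustration, while the parameter names and defaults mirror the documented constructor.

```python
def binarize(rows, threshold, implicit=False, over_threshold=1, under_threshold=0):
    """Binarize (user, item, rating) tuples against a threshold (illustrative only)."""
    if implicit:
        # Implicit mode: keep interactions at or above the threshold, drop the rating.
        return [(user, item) for user, item, rating in rows if rating >= threshold]
    # Explicit mode: map every rating to one of the two binary values.
    return [
        (user, item, over_threshold if rating >= threshold else under_threshold)
        for user, item, rating in rows
    ]


ratings = [("u1", "i1", 5.0), ("u1", "i2", 2.0), ("u2", "i1", 3.0)]
explicit_result = binarize(ratings, threshold=3.0)
implicit_result = binarize(ratings, threshold=3.0, implicit=True)
```

A rating equal to the threshold counts as positive in both modes, which matches the "equal to or above" wording of over_threshold.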
ColdFilter
Bases: Processor
A filtering class to retain only cold users or cold items, i.e., those whose number of interactions in the original DataRec dataset is at most the value of the interactions parameter.
Source code in datarec/processing/cold.py
__init__(interactions, mode='user')
Initializes the ColdFilter object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
interactions | int | The maximum number of interactions a user or item can have to be retained. | required |
mode | str | Filtering mode, either "user" for cold users or "item" for cold items. | 'user' |
Raises:
Type | Description |
---|---|
TypeError | If |
ValueError | If |
Source code in datarec/processing/cold.py
run(datarec)
Filters the dataset to keep only cold users or cold items with at most self.interactions interactions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The input dataset wrapped in a DataRec object. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object containing only the filtered users or items. |
Source code in datarec/processing/cold.py
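As a rough illustration of the cold filter, the sketch below counts interactions per user or per item and keeps only the rows of entities at or below the limit. The `cold_filter` function and the `(user, item)` tuple layout are hypothetical and stand in for the DataRec-based implementation. Input validation is omitted on purpose.

```python
from collections import Counter


def cold_filter(rows, interactions, mode="user"):
    """Keep rows whose user (or item) has at most `interactions` rows (illustrative only)."""
    # Position of the grouping key inside each (user, item) tuple.
    index = {"user": 0, "item": 1}[mode]
    counts = Counter(row[index] for row in rows)
    return [row for row in rows if counts[row[index]] <= interactions]


events = [("u1", "i1"), ("u1", "i2"), ("u1", "i3"), ("u2", "i1"), ("u3", "i2")]
cold_users = cold_filter(events, interactions=1, mode="user")
cold_items = cold_filter(events, interactions=1, mode="item")
```

Here u1 has three interactions and is dropped in user mode, while in item mode only i3 (a single interaction) survives.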
KCore
This class filters a dataset based on a minimum number of records (core) for each group defined by a specific column.
Source code in datarec/processing/kcore.py
__init__(column, core)
Initializes the KCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The column name used to group the data (e.g., user or item). | required |
core | int | The minimum number of records required for each group to be kept. | required |
Raises:
Type | Description |
---|---|
TypeError | If 'core' is not an integer. |
Source code in datarec/processing/kcore.py
run(dataset)
Filters the dataset by keeping only groups with at least the specified number of records.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | DataFrame | The dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataFrame | A new dataframe with groups filtered by the core condition. |
Raises:
Type | Description |
---|---|
ValueError | If 'self._column' is not in the dataset. |
Source code in datarec/processing/kcore.py
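The core condition can be sketched in plain Python as follows. This is an assumption-laden illustration, not the library code: rows are modeled as dictionaries instead of a DataFrame, and the `k_core` function name is hypothetical. The two error cases follow the Raises tables above.

```python
from collections import Counter


def k_core(rows, column, core):
    """Keep rows whose `column` value appears in at least `core` rows (illustrative only)."""
    if not isinstance(core, int):
        raise TypeError("core must be an integer")
    if any(column not in row for row in rows):
        raise ValueError(f"column {column!r} is not in the dataset")
    # Count records per group, then keep the groups that reach the core size.
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= core]


interactions = [
    {"user": "u1", "item": "i1"},
    {"user": "u1", "item": "i2"},
    {"user": "u2", "item": "i1"},
]
user_2_core = k_core(interactions, column="user", core=2)
```

Only u1 has two records, so the single u2 row is removed.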
UserKCore
Bases: Processor
Filters a dataset based on a minimum number of records (core) for each user.
This class applies a KCore filter on the user column of the dataset.
Source code in datarec/processing/kcore.py
__init__(core)
Initializes the UserKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
core | int | The minimum number of records required for each user to be kept. | required |
Raises:
Type | Description |
---|---|
TypeError | If 'core' is not an integer. |
Source code in datarec/processing/kcore.py
run(datarec)
Filters the dataset by user, applying the KCore filter, and returns a new DataRec object containing the filtered data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The DataRec object containing the dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the filtered data. |
Source code in datarec/processing/kcore.py
ItemKCore
Bases: Processor
Filters a dataset based on a minimum number of records (core) for each item.
This class applies a KCore filter on the item column of the dataset.
Source code in datarec/processing/kcore.py
__init__(core)
Initializes the ItemKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
core | int | The minimum number of records required for each item to be kept. | required |
Raises:
Type | Description |
---|---|
TypeError | If "core" is not an integer. |
Source code in datarec/processing/kcore.py
run(datarec)
Filters the dataset by item, applying the KCore filter, and returns a new DataRec object containing the filtered data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The DataRec object containing the dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the filtered data. |
Source code in datarec/processing/kcore.py
IterativeKCore
Iteratively filters a dataset based on a set of columns and minimum core values.
This class applies KCore filters to multiple columns and iteratively removes groups that do not meet the core requirement until no further changes occur.
Source code in datarec/processing/kcore.py
__init__(columns, cores)
Initializes the IterativeKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | list | A list of column names to apply the KCore filter on. | required |
cores | list of int or int | The minimum number of records required for each column to be kept. | required |
Raises:
Type | Description |
---|---|
TypeError | If 'cores' is not a list or an integer. |
Source code in datarec/processing/kcore.py
run(dataset)
Iteratively applies the KCore filters on the dataset until no changes occur, then returns the filtered dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | DataFrame | The dataset to be iteratively filtered. | required |
Returns:
Type | Description |
---|---|
DataFrame | The filtered dataset after all iterations. |
Source code in datarec/processing/kcore.py
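The "until no further changes occur" loop can be sketched as a fixed-point iteration. The code below is an illustrative plain-Python version, not the library's implementation: rows are dictionaries, the function names are hypothetical, and applying a single integer core to every column is an assumption about how an int value of cores is interpreted.

```python
from collections import Counter


def k_core(rows, column, core):
    """Single k-core pass on one column (illustrative helper)."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= core]


def iterative_k_core(rows, columns, cores):
    """Repeat the per-column passes until the dataset stops shrinking."""
    if isinstance(cores, int):
        # Assumption: one integer means the same core for every column.
        cores = [cores] * len(columns)
    elif not isinstance(cores, list):
        raise TypeError("cores must be a list or an integer")
    while True:
        size_before = len(rows)
        for column, core in zip(columns, cores):
            rows = k_core(rows, column, core)
        if len(rows) == size_before:
            return rows


pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1"), ("u3", "i3")]
interactions = [{"user": user, "item": item} for user, item in pairs]
stable = iterative_k_core(interactions, columns=["user", "item"], cores=2)
```

In this example the first pass removes the only interaction with i3, which leaves u3 with a single record, so a second pass is needed to remove u3 as well. A third pass changes nothing and ends the loop.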
UserItemIterativeKCore
Bases: Processor
Iteratively filters a dataset based on both user and item columns with specified core values.
This class applies the IterativeKCore filter to both the user and item columns of the dataset.
Source code in datarec/processing/kcore.py
__init__(cores)
Initializes the UserItemIterativeKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cores | list or int | A list of core values for the user and item columns. | required |
Raises:
Type | Description |
---|---|
TypeError | If "cores" is not a list or an integer. |
Source code in datarec/processing/kcore.py
run(datarec)
Applies the iterative KCore filter to both user and item columns, and returns a new DataRec object containing the filtered data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The DataRec object containing the dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the filtered data. |
Source code in datarec/processing/kcore.py
NRoundsKCore
Filters a dataset based on a minimum number of records (core) for each column over multiple rounds.
This class applies KCore filters iteratively over a specified number of rounds.
Source code in datarec/processing/kcore.py
__init__(columns, cores, rounds)
Initializes the NRoundsKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | list | A list of column names to apply the KCore filter on. | required |
cores | list of int or int | The minimum number of records required for each column to be kept. | required |
rounds | int | The number of rounds to apply the filtering process. | required |
Raises:
Type | Description |
---|---|
TypeError | If 'cores' is not a list or an integer. |
TypeError | If 'rounds' is not an integer. |
Source code in datarec/processing/kcore.py
run(dataset)
Applies the KCore filters over the specified number of rounds and returns the filtered dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | DataFrame | The dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataFrame | The dataset after filtering over the specified number of rounds. |
Source code in datarec/processing/kcore.py
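A fixed number of rounds differs from the iterative variant only in its stopping rule. The following plain-Python sketch illustrates this under the same assumptions as before: dictionary rows, hypothetical function names, and a single integer core reused for every column.

```python
from collections import Counter


def k_core(rows, column, core):
    """Single k-core pass on one column (illustrative helper)."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= core]


def n_rounds_k_core(rows, columns, cores, rounds):
    """Apply the per-column passes exactly `rounds` times."""
    if isinstance(cores, int):
        # Assumption: one integer means the same core for every column.
        cores = [cores] * len(columns)
    elif not isinstance(cores, list):
        raise TypeError("cores must be a list or an integer")
    if not isinstance(rounds, int):
        raise TypeError("rounds must be an integer")
    for _ in range(rounds):
        for column, core in zip(columns, cores):
            rows = k_core(rows, column, core)
    return rows


pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i1"), ("u3", "i3")]
interactions = [{"user": user, "item": item} for user, item in pairs]
one_round = n_rounds_k_core(interactions, ["user", "item"], cores=2, rounds=1)
two_rounds = n_rounds_k_core(interactions, ["user", "item"], cores=2, rounds=2)
```

With one round, u3 still appears with a single record because the item pass ran after the user pass. The second round removes it, so the result after a fixed number of rounds is not guaranteed to satisfy every core condition.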
UserItemNRoundsKCore
Bases: Processor
Filters a dataset based on both user and item columns with specified core values over multiple rounds.
This class applies the NRoundsKCore filter to both the user and item columns of the dataset.
Source code in datarec/processing/kcore.py
__init__(cores, rounds)
Initializes the UserItemNRoundsKCore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cores | (int, list) | A list of core values for the user and item columns. | required |
rounds | int | The number of rounds to apply the filtering process. | required |
Raises:
Type | Description |
---|---|
TypeError | If 'cores' is not a list or an integer. |
TypeError | If 'rounds' is not an integer. |
Source code in datarec/processing/kcore.py
run(datarec)
Applies the NRoundsKCore filter to both user and item columns over multiple rounds, and returns a new DataRec object containing the filtered data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The DataRec object containing the dataset to be filtered. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the filtered data. |
Source code in datarec/processing/kcore.py
Processor
Utility class for handling the output of preprocessing steps on DataRec objects.
This class provides functionality to build a new DataRec from transformation results while updating the processing pipeline accordingly.
Source code in datarec/processing/processor.py
output(datarec, result, step_info)
staticmethod
Create a new DataRec object from a transformation result and update the processing pipeline with a new step.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The original | required |
result | DataFrame | The result of the transformation. | required |
step_info | dict | Metadata of the transformation. | required |
Returns:
Type | Description |
---|---|
DataRec | A new |
Source code in datarec/processing/processor.py
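The general pattern, returning a new object that carries the transformed data plus an extended pipeline log, might look like the sketch below. Everything here is hypothetical: `MiniDataRec`, its `data` and `pipeline` fields, and the keys of the step_info dictionary are stand-ins chosen for illustration and do not describe the real DataRec class.

```python
from dataclasses import dataclass, field


@dataclass
class MiniDataRec:
    """Hypothetical stand-in for DataRec: a dataset plus a log of processing steps."""
    data: list
    pipeline: list = field(default_factory=list)


def output(datarec, result, step_info):
    """Build a new object from `result` and append `step_info` to a copied pipeline."""
    # Copying the pipeline leaves the original object untouched.
    return MiniDataRec(data=result, pipeline=[*datarec.pipeline, step_info])


original = MiniDataRec(data=[1, 2, 3, 4])
step = {"operation": "FilterByRatingThreshold", "params": {"rating_threshold": 3}}
filtered = output(original, [3, 4], step)
```

Because each step returns a fresh object, the original dataset and its pipeline history remain available after processing.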
FilterByRatingThreshold
Bases: Processor
Filters the dataset by removing interactions with a rating below a given threshold.
Source code in datarec/processing/rating.py
__init__(rating_threshold)
Initializes the FilterByRatingThreshold object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rating_threshold | float | The minimum rating required for an interaction to be kept. | required |
Raises:
Type | Description |
---|---|
ValueError | If |
Source code in datarec/processing/rating.py
run(datarec)
Filters interactions with a rating below the threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The input dataset wrapped in a DataRec object. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the processed dataset. |
Source code in datarec/processing/rating.py
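A minimal sketch of the threshold filter, assuming `(user, item, rating)` tuples and a hypothetical `filter_by_rating_threshold` function in place of the DataRec-based class:

```python
def filter_by_rating_threshold(rows, rating_threshold):
    """Keep interactions whose rating is at least `rating_threshold` (illustrative only)."""
    return [row for row in rows if row[2] >= rating_threshold]


ratings = [("u1", "i1", 5.0), ("u1", "i2", 2.5), ("u2", "i1", 3.0)]
kept = filter_by_rating_threshold(ratings, rating_threshold=3.0)
```

Unlike Binarize, this filter leaves the surviving rating values unchanged.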
FilterByUserMeanRating
Bases: Processor
Filters the dataset by removing interactions with a rating below the user's average rating.
This filter calculates the average rating given by each user and removes interactions where the rating is below that average.
Source code in datarec/processing/rating.py
run(datarec)
Filters interactions with a rating below the user's mean rating.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The input dataset wrapped in a DataRec object. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the processed dataset. |
Source code in datarec/processing/rating.py
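The per-user mean filter can be sketched in two steps: compute each user's average rating, then drop the rows below it. The function name and the `(user, item, rating)` tuple layout are assumptions for illustration only.

```python
from collections import defaultdict


def filter_by_user_mean_rating(rows):
    """Drop interactions rated below the mean rating of the same user (illustrative only)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, _item, rating in rows:
        totals[user] += rating
        counts[user] += 1
    means = {user: totals[user] / counts[user] for user in totals}
    # Ratings equal to the user's mean are kept; only those strictly below are removed.
    return [row for row in rows if row[2] >= means[row[0]]]


ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u1", "i3", 1.0),
    ("u2", "i1", 4.0), ("u2", "i2", 4.0),
]
above_mean = filter_by_user_mean_rating(ratings)
```

The threshold adapts to each user: a rating of 3.0 is kept for u1 (mean 3.0), while a user who always gives the same rating keeps all interactions.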
FilterOutDuplicatedInteractions
Bases: Processor
Filters a dataset by removing duplicated (user, item) interactions based on a specified strategy.
Source code in datarec/processing/rating.py
__init__(keep='first', random_seed=42)
Initializes the FilterOutDuplicatedInteractions object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keep | str | Strategy to determine which interaction to keep when duplicates are found. Must be one of ['first', 'last', 'earliest', 'latest', 'random']. | 'first' |
random_seed | int | Random seed used for reproducibility when using the 'random' strategy. | 42 |
Raises:
Type | Description |
---|---|
ValueError | If the provided strategy ( |
Source code in datarec/processing/rating.py
run(datarec, verbose=True)
Filter out duplicated (user, item) interactions in the dataset using the specified strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | An object containing the dataset and metadata (user, item, timestamp columns, etc.). | required |
verbose | bool | Whether to print logging information during execution. | True |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with duplicated (user, item) interactions removed according to the selected strategy. |
Raises:
Type | Description |
---|---|
ValueError | If the date column is not provided for the 'earliest' or 'latest' strategies. |
ValueError | If the provided strategy ( |
Source code in datarec/processing/rating.py
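The five strategies can be compared with a small plain-Python sketch. It is an illustration under stated assumptions, not the library's implementation: interactions are `(user, item, rating, timestamp)` tuples, the `drop_duplicates` function is hypothetical, and the output keeps groups in order of first appearance.

```python
import random


def drop_duplicates(rows, keep="first", random_seed=42):
    """Keep one row per (user, item) pair according to `keep` (illustrative only)."""
    strategies = ("first", "last", "earliest", "latest", "random")
    if keep not in strategies:
        raise ValueError(f"keep must be one of {strategies}")
    # Group rows by (user, item); dicts preserve first-appearance order.
    groups = {}
    for row in rows:
        groups.setdefault((row[0], row[1]), []).append(row)
    rng = random.Random(random_seed)
    pick = {
        "first": lambda group: group[0],
        "last": lambda group: group[-1],
        "earliest": lambda group: min(group, key=lambda row: row[3]),
        "latest": lambda group: max(group, key=lambda row: row[3]),
        "random": rng.choice,
    }[keep]
    return [pick(group) for group in groups.values()]


events = [
    ("u1", "i1", 4.0, 30),
    ("u1", "i1", 2.0, 10),
    ("u2", "i1", 5.0, 20),
    ("u1", "i1", 3.0, 20),
]
first = drop_duplicates(events, keep="first")
earliest = drop_duplicates(events, keep="earliest")
```

'first' and 'last' depend on row order, whereas 'earliest' and 'latest' depend on the timestamp, which is why those two strategies need a date column.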
FilterByTime
Bases: Processor
Filters the dataset based on a time threshold and specified drop condition.
This class allows filtering a dataset by a time threshold, either dropping records before or after the specified time.
Source code in datarec/processing/temporal.py
__init__(time_threshold=0, drop='after')
Initializes the FilterByTime object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
time_threshold | float | The time threshold used for filtering. The dataset will be filtered based on this value. | 0 |
drop | str | Specifies whether to drop records 'before' or 'after' the time threshold. | 'after' |
Raises:
Type | Description |
---|---|
ValueError | If |
Source code in datarec/processing/temporal.py
run(datarec)
Filters the dataset of the given DataRec based on the specified time threshold and drop condition, returning a new DataRec object with the filtered data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The input dataset wrapped in a DataRec object. | required |
Returns:
Type | Description |
---|---|
DataRec | A new DataRec object with the processed dataset. |
Raises:
Type | Description |
---|---|
TypeError | If the DataRec does not contain temporal information. |
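A minimal sketch of the two drop modes follows, assuming `(user, item, timestamp)` tuples and a hypothetical `filter_by_time` function. Two details are assumptions rather than documented behavior: records exactly at the threshold are kept in both modes, and an unknown drop value raises ValueError.

```python
def filter_by_time(rows, time_threshold=0, drop="after"):
    """Drop records before or after `time_threshold` (illustrative only)."""
    if drop not in ("before", "after"):
        # Assumed validation; the documented error condition is not spelled out.
        raise ValueError("drop must be 'before' or 'after'")
    if drop == "after":
        # Assumption: a record exactly at the threshold is kept.
        return [row for row in rows if row[2] <= time_threshold]
    return [row for row in rows if row[2] >= time_threshold]


events = [("u1", "i1", 100), ("u1", "i2", 200), ("u2", "i1", 300)]
history = filter_by_time(events, time_threshold=200, drop="after")
recent = filter_by_time(events, time_threshold=200, drop="before")
```

Dropping 'after' keeps the older part of the data, and dropping 'before' keeps the newer part.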