Splitters Module Reference
This section provides a detailed API reference for all modules related to splitting datasets into training, validation, and test sets.
Core Splitting Utilities
These modules define the base class and common utilities used by all splitters.
Splitter
Base class for dataset splitters.
This class provides a common interface for splitting datasets into training, validation, and test sets. Subclasses should implement specific splitting strategies.
Source code in datarec/splitters/splitter.py
output(datarec, train, test, validation, step_info)
staticmethod
Creates a dictionary of DataRec objects for the train, test, and validation splits.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The original dataset wrapped in a | required |
train | DataFrame | The training split of the dataset. | required |
test | DataFrame | The test split of the dataset. | required |
validation | DataFrame | The validation split of the dataset. | required |
step_info | Dict[str, Dict] | Metadata of the transformation. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary containing the split datasets:
- 'train': The training dataset as a |
Source code in datarec/splitters/splitter.py
random_sample(dataframe, seed, n_samples=1)
Randomly selects a specified number of samples from a given DataFrame.
This function splits the input DataFrame into two subsets:
- One containing n_samples randomly selected rows.
- One containing the remaining rows after the selection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataframe | DataFrame | The input DataFrame from which to sample. | required |
seed | int | Random seed for reproducibility. | required |
n_samples | int | The number of samples to extract. Must be at least 1. Default is 1. | 1 |
Returns:
Type | Description |
---|---|
Tuple[DataFrame, DataFrame]
|
|
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in datarec/splitters/utils.py
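The behaviour described for random_sample can be illustrated with a small stand-alone sketch. This is not the code from datarec/splitters/utils.py: it uses plain Python lists instead of a DataFrame to stay dependency-free, and the name random_sample_sketch, the (sampled, remaining) return order, and the error raised when n_samples exceeds the number of rows are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, int]


def random_sample_sketch(rows: List[Row], seed: int, n_samples: int = 1) -> Tuple[List[Row], List[Row]]:
    """Return (sampled, remaining); the input list is left untouched."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if n_samples > len(rows):
        # Assumption: asking for more rows than exist is treated as an error.
        raise ValueError("n_samples exceeds the number of rows")
    rng = random.Random(seed)  # local generator, global random state is not touched
    picked = set(rng.sample(range(len(rows)), n_samples))
    sampled = [row for i, row in enumerate(rows) if i in picked]
    remaining = [row for i, row in enumerate(rows) if i not in picked]
    return sampled, remaining


rows = [{"user": 1, "item": item} for item in range(5)]
sampled, remaining = random_sample_sketch(rows, seed=42, n_samples=2)
print(len(sampled), len(remaining))  # 2 3
```

Calling the sketch twice with the same seed yields the same partition, which is the reproducibility property the seed parameter is meant to provide.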
max_by_col(dataframe, discriminative_column, seed)
Selects the row with the maximum value in the specified column from the given DataFrame. If multiple rows share the same maximum value, one is randomly selected.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataframe | DataFrame | The input DataFrame. | required |
discriminative_column | str | The column used to determine the maximum value. | required |
seed | int | Random seed for reproducibility. | required |
Returns:
Type | Description |
---|---|
Tuple[DataFrame, DataFrame]
|
|
DataFrame
|
|
Tuple[DataFrame, DataFrame]
|
|
Raises:
Type | Description |
---|---|
ValueError | If the specified column is not present in the DataFrame. |
ValueError | If no candidates are found (should not happen unless DataFrame is empty). |
ValueError | If the random selection fails to return exactly one row. |
Source code in datarec/splitters/utils.py
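A minimal sketch of the selection with random tie-breaking follows. It is not the library source. It assumes that the function, as its name suggests, picks the maximum of the column, and that the result is a (selected, remaining) pair. The name max_by_col_sketch and the list-of-dicts data model are hypothetical stand-ins for the DataFrame-based original.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, float]


def max_by_col_sketch(rows: List[Row], column: str, seed: int) -> Tuple[List[Row], List[Row]]:
    """Return ([row with the largest value in `column`], remaining rows)."""
    if not rows:
        raise ValueError("no candidates found: the input is empty")
    if any(column not in row for row in rows):
        raise ValueError(f"column {column!r} is not present")
    best = max(row[column] for row in rows)
    # Every row that ties for the maximum is a candidate.
    candidates = [i for i, row in enumerate(rows) if row[column] == best]
    chosen = random.Random(seed).choice(candidates)  # seeded tie-break
    selected = [rows[chosen]]
    remaining = [row for i, row in enumerate(rows) if i != chosen]
    return selected, remaining


rows = [{"t": 1.0}, {"t": 5.0}, {"t": 5.0}, {"t": 2.0}]
selected, remaining = max_by_col_sketch(rows, "t", seed=42)
print(selected[0]["t"], len(remaining))  # 5.0 3
```

Only one of the two tied rows is selected; the other stays in the remaining set.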
temporal_holdout(dataframe, test_ratio, val_ratio, temporal_col)
Splits a dataset into training, validation, and test sets based on temporal ordering.
The function sorts the dataset according to a specified timestamp column and assigns the oldest interactions to the training set, followed by the validation set (if applicable), and the most recent interactions to the test set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataframe | DataFrame | The input dataset containing interaction data. | required |
test_ratio | float | The proportion of the dataset to allocate to the test set. Must be between 0 and 1. | required |
val_ratio | float | The proportion of the dataset to allocate to the validation set. Must be between 0 and 1. | required |
temporal_col | str | The name of the column containing timestamp information. | required |
Returns:
Type | Description |
---|---|
Tuple[DataFrame, DataFrame, DataFrame] | A tuple containing the train, validation, and test sets. |
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in datarec/splitters/utils.py
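The ordering logic described for temporal_holdout can be sketched as below. This is an independent illustration, not the code in datarec/splitters/utils.py. It assumes that both ratios are fractions of the whole dataset, that split sizes are truncated toward zero, and that ratios outside [0, 1] or summing to more than 1 raise ValueError. The name temporal_holdout_sketch is hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def temporal_holdout_sketch(
    rows: List[Row], test_ratio: float, val_ratio: float, temporal_col: str
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Return (train, validation, test), ordered from oldest to newest."""
    for name, ratio in (("test_ratio", test_ratio), ("val_ratio", val_ratio)):
        if not 0 <= ratio <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    if test_ratio + val_ratio > 1:
        # Assumption: the two held-out parts cannot exceed the dataset.
        raise ValueError("test_ratio + val_ratio must not exceed 1")
    ordered = sorted(rows, key=lambda row: row[temporal_col])
    n_test = int(len(ordered) * test_ratio)  # truncation is an assumption
    n_val = int(len(ordered) * val_ratio)
    n_train = len(ordered) - n_test - n_val
    train = ordered[:n_train]                 # oldest interactions
    val = ordered[n_train:n_train + n_val]    # next oldest
    test = ordered[n_train + n_val:]          # most recent interactions
    return train, val, test


rows = [{"timestamp": t} for t in (5, 3, 9, 0, 7, 1, 8, 2, 6, 4)]
train, val, test = temporal_holdout_sketch(rows, 0.2, 0.1, "timestamp")
print([r["timestamp"] for r in test])  # [8, 9]
```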
Uniform Splitting Strategies
These splitters operate on the entire dataset globally.
RandomHoldOut
Bases: Splitter
Implements a random holdout split for recommendation datasets.
This splitter partitions the dataset into training, validation, and test sets using a random sampling approach. The proportions of the dataset allocated to the validation and test sets are controlled by val_ratio and test_ratio, respectively.
Source code in datarec/splitters/uniform/hold_out.py
test_ratio
property
writable
The proportion of the dataset for the test set.
val_ratio
property
writable
The proportion of the dataset for the validation set.
__init__(test_ratio=0, val_ratio=0, seed=42)
Initializes the RandomHoldOut object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | The proportion of the dataset to include in the test set. Must be between 0 and 1. Default is 0. | 0 |
val_ratio | float | The proportion of the training set to include in the validation set. Must be between 0 and 1. Default is 0. | 0 |
seed | int | The random seed for reproducibility. Defaults to 42. | 42 |
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in datarec/splitters/uniform/hold_out.py
run(datarec)
Splits the dataset into training, validation, and test sets according to the specified ratios. The val_ratio is applied to the data that remains after the test set has been split off.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset to be split. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with the following keys:
- "train": The training dataset ( |
Source code in datarec/splitters/uniform/hold_out.py
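Because val_ratio is applied after the test set has been removed, the effective validation share of the whole dataset is val_ratio * (1 - test_ratio). The sketch below illustrates that two-stage arithmetic. It is not the code in datarec/splitters/uniform/hold_out.py. The function name, the truncating size computation, and the "val" dictionary key are assumptions made for illustration.

```python
import random
from typing import Dict, List

Row = Dict[str, int]


def random_holdout_sketch(
    rows: List[Row], test_ratio: float = 0, val_ratio: float = 0, seed: int = 42
) -> Dict[str, List[Row]]:
    """Two-stage random split: test first, then validation from the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    # val_ratio is a fraction of what is left, not of the full dataset.
    n_val = int(len(rest) * val_ratio)
    val, train = rest[:n_val], rest[n_val:]
    return {"train": train, "val": val, "test": test}  # "val" key is hypothetical


rows = [{"item": i} for i in range(100)]
splits = random_holdout_sketch(rows, test_ratio=0.2, val_ratio=0.25)
print({name: len(part) for name, part in splits.items()})
# {'train': 60, 'val': 20, 'test': 20}
```

With 100 rows, test_ratio=0.2 removes 20 rows, and val_ratio=0.25 then takes a quarter of the remaining 80, so validation ends up with 20 rows (20% of the original data) rather than 25.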
TemporalHoldOut
Bases: Splitter
Implements a temporal hold-out splitting strategy for recommendation datasets.
This splitter partitions a dataset into training, validation, and test sets based on the timestamps associated with interactions. The training set contains the oldest interactions, while the test set contains the most recent ones.
Source code in datarec/splitters/uniform/temporal/hold_out.py
test_ratio
property
writable
The proportion of the dataset allocated to the test set.
val_ratio
property
writable
The proportion of the dataset allocated to the validation set.
__init__(test_ratio=0, val_ratio=0)
Initializes the TemporalHoldOut object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | The proportion of the dataset to allocate to the test set. Must be between 0 and 1. Default is 0. | 0 |
val_ratio | float | The proportion of the dataset to allocate to the validation set. Must be between 0 and 1. Default is 0. | 0 |
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in datarec/splitters/uniform/temporal/hold_out.py
run(datarec)
Splits the dataset using a temporal hold-out strategy.
This method partitions the dataset into training, validation, and test sets based on the timestamps present in the datarec object. The split is performed such that the training set contains older interactions, while the test set contains more recent ones.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | A DataRec object containing the dataset and a timestamp column. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with three keys:
- |
Raises:
Type | Description |
---|---|
TypeError
|
If the |
Source code in datarec/splitters/uniform/temporal/hold_out.py
TemporalThresholdSplit
Bases: Splitter
Splits a dataset into training, validation, and test sets based on two timestamp thresholds.
The dataset is divided such that:
- The training set contains interactions occurring strictly before val_threshold.
- The validation set contains interactions occurring between val_threshold (inclusive) and test_threshold (exclusive).
- The test set contains interactions occurring at or after test_threshold.
Source code in datarec/splitters/uniform/temporal/threshold.py
__init__(val_threshold, test_threshold)
Initializes the TemporalThresholdSplit object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
val_threshold | float | The timestamp value that defines the split between training and validation. | required |
test_threshold | float | The timestamp value that defines the split between validation and test. | required |
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in datarec/splitters/uniform/temporal/threshold.py
run(datarec)
Splits the dataset into training, validation, and test sets based on two thresholds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | A DataRec object containing the dataset with a timestamp column. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with:
- |
Raises:
Type | Description |
---|---|
TypeError
|
If the |
Source code in datarec/splitters/uniform/temporal/threshold.py
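The boundary rules of TemporalThresholdSplit (strictly before, inclusive lower bound, exclusive upper bound) are easy to get wrong by one, so a small sketch may help. It is not the library implementation. The function name, the "timestamp" column default, the "val" key, and the check that val_threshold does not exceed test_threshold are assumptions.

```python
from typing import Dict, List

Row = Dict[str, float]


def threshold_split_sketch(
    rows: List[Row], val_threshold: float, test_threshold: float,
    temporal_col: str = "timestamp",
) -> Dict[str, List[Row]]:
    """Assign each row to a split by comparing its timestamp to two cut points."""
    if val_threshold > test_threshold:
        # Assumption: the validation window must start before the test window.
        raise ValueError("val_threshold must not exceed test_threshold")
    train = [r for r in rows if r[temporal_col] < val_threshold]
    val = [r for r in rows if val_threshold <= r[temporal_col] < test_threshold]
    test = [r for r in rows if r[temporal_col] >= test_threshold]
    return {"train": train, "val": val, "test": test}


rows = [{"timestamp": float(t)} for t in (1, 2, 3, 4, 5, 6)]
splits = threshold_split_sketch(rows, val_threshold=3.0, test_threshold=5.0)
print([r["timestamp"] for r in splits["val"]])  # [3.0, 4.0]
```

A row whose timestamp equals val_threshold lands in validation, and one equal to test_threshold lands in test, matching the inclusive and exclusive bounds listed above.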
User-Stratified Splitting Strategies
These splitters operate on a per-user basis, ensuring that each user's interaction history is partitioned across the splits.
UserStratifiedHoldOut
Bases: Splitter
Implements a user-stratified holdout split for a recommendation dataset.
This splitter ensures that each user's interactions are split into training, validation, and test sets while maintaining the proportions specified by test_ratio and val_ratio.
Source code in datarec/splitters/user_stratified/hold_out.py
test_ratio
property
writable
The proportion of interactions per user for the test set.
val_ratio
property
writable
The proportion of interactions per user for the validation set.
__init__(test_ratio=0, val_ratio=0, seed=42)
Initializes the UserStratifiedHoldOut splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | The proportion of interactions per user to include in the test set. Must be between 0 and 1. Default is 0. | 0 |
val_ratio | float | The proportion of interactions per user to include in the validation set. Must be between 0 and 1. Default is 0. | 0 |
seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
Raises:
Type | Description |
---|---|
ValueError | If test_ratio or val_ratio is not in the range [0, 1]. |
Source code in datarec/splitters/user_stratified/hold_out.py
run(datarec)
Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.
Each user's interactions are split independently according to test_ratio and val_ratio, ensuring that the distribution is preserved per user. The function returns a dictionary containing the three resulting subsets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset to be split. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with the following keys:
- "train": DataRec containing the training set.
- "test": DataRec containing the test set, if |
Source code in datarec/splitters/user_stratified/hold_out.py
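The per-user idea can be sketched as follows: group interactions by user, then apply the ratios inside each group. This is an illustration only, not the code in datarec/splitters/user_stratified/hold_out.py. In particular, it assumes both ratios are fractions of each user's full history and that sizes are truncated; the library may compute them differently. The function name, the "user" column, and the "val" key are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, int]


def user_stratified_holdout_sketch(
    rows: List[Row], test_ratio: float = 0, val_ratio: float = 0,
    seed: int = 42, user_col: str = "user",
) -> Dict[str, List[Row]]:
    """Split every user's interactions independently with the same ratios."""
    for ratio in (test_ratio, val_ratio):
        if not 0 <= ratio <= 1:
            raise ValueError("ratios must be in the range [0, 1]")
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    splits: Dict[str, List[Row]] = {"train": [], "val": [], "test": []}
    for history in by_user.values():
        shuffled = list(history)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_ratio)  # per-user size, truncated
        n_val = int(len(shuffled) * val_ratio)    # assumption: fraction of full history
        splits["test"].extend(shuffled[:n_test])
        splits["val"].extend(shuffled[n_test:n_test + n_val])
        splits["train"].extend(shuffled[n_test + n_val:])
    return splits


rows = [{"user": u, "item": i} for u in (1, 2) for i in range(10)]
splits = user_stratified_holdout_sketch(rows, test_ratio=0.2, val_ratio=0.1)
print({name: len(part) for name, part in splits.items()})
# {'train': 14, 'val': 2, 'test': 4}
```

Unlike a global random hold-out, every user here contributes the same share of their history to each split, so no user ends up only in the test set.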
LeaveNOut
Bases: Splitter
Implements the Leave-N-Out splitting strategy for recommendation datasets.
This splitter ensures that for each user, a fixed number of interactions (test_n and validation_n) are randomly selected and moved to the test and validation sets, respectively. The remaining interactions are kept in the training set.
Source code in datarec/splitters/user_stratified/leave_out.py
test_n
property
writable
Number of interactions to move to the test set per user.
validation_n
property
writable
Number of interactions to move to the validation set per user.
__init__(test_n=0, validation_n=0, seed=42)
Initializes the LeaveNOut splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_n | int | Number of interactions to move to the test set per user. Default is 0. | 0 |
validation_n | int | Number of interactions to move to the validation set per user. Default is 0. | 0 |
seed | int | Random seed for reproducibility. Default is 42. | 42 |
Raises:
Type | Description |
---|---|
ValueError
|
If |
TypeError
|
If |
Source code in datarec/splitters/user_stratified/leave_out.py
run(datarec)
Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.
For each user, test_n interactions are randomly assigned to the test set, and validation_n interactions are assigned to the validation set. The remaining interactions are used for training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset to be split. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with the following keys:
- "train": DataRec containing the training set.
- "test": DataRec containing the test set, if |
Source code in datarec/splitters/user_stratified/leave_out.py
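A Leave-N-Out split can be sketched with a per-user shuffle followed by fixed-size slices. The code below is an independent illustration, not the source in datarec/splitters/user_stratified/leave_out.py. The function name, the "user" column, the "val" key, and the silent handling of users with fewer than test_n + validation_n interactions are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, int]


def leave_n_out_sketch(
    rows: List[Row], test_n: int = 0, validation_n: int = 0,
    seed: int = 42, user_col: str = "user",
) -> Dict[str, List[Row]]:
    """Move a fixed number of randomly chosen interactions per user out of train."""
    if test_n < 0 or validation_n < 0:
        raise ValueError("test_n and validation_n must be non-negative")
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    splits: Dict[str, List[Row]] = {"train": [], "val": [], "test": []}
    for history in by_user.values():
        shuffled = list(history)
        rng.shuffle(shuffled)
        # Slicing never raises, so short histories simply yield fewer rows
        # (assumption: the library may handle this case differently).
        splits["test"].extend(shuffled[:test_n])
        splits["val"].extend(shuffled[test_n:test_n + validation_n])
        splits["train"].extend(shuffled[test_n + validation_n:])
    return splits


rows = [{"user": u, "item": i} for u in (1, 2, 3) for i in range(5)]
splits = leave_n_out_sketch(rows, test_n=2, validation_n=1)
print({name: len(part) for name, part in splits.items()})
# {'train': 6, 'val': 3, 'test': 6}
```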
LeaveOneOut
Bases: LeaveNOut
Implements the Leave-One-Out splitting strategy for recommendation datasets.
This splitter ensures that for each user, at most one interaction is randomly selected and moved to the test and/or validation set, depending on the specified parameters. The remaining interactions are kept in the training set.
This is a special case of LeaveNOut in which test_n and validation_n are set to 1 when test and validation, respectively, are True.
Source code in datarec/splitters/user_stratified/leave_out.py
__init__(test=True, validation=True, seed=42)
Initializes the LeaveOneOut splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test | bool | Whether to include a test set. Defaults to True. | True |
validation | bool | Whether to include a validation set. Defaults to True. | True |
seed | int | Random seed for reproducibility. Default is 42. | 42 |
Raises:
Type | Description |
---|---|
TypeError
|
If |
Source code in datarec/splitters/user_stratified/leave_out.py
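The relationship between the two classes can be pictured as a thin subclass that translates booleans into counts. The classes below are hypothetical stand-ins written for this illustration, not the library's LeaveNOut and LeaveOneOut. The condition that triggers TypeError (a non-boolean flag) is an assumption.

```python
class LeaveNOutSketch:
    """Stand-in for the parent splitter: only stores its configuration."""

    def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
        self.test_n = test_n
        self.validation_n = validation_n
        self.seed = seed


class LeaveOneOutSketch(LeaveNOutSketch):
    """Special case: each enabled split receives exactly one interaction per user."""

    def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
        # Assumption: non-boolean flags are rejected with TypeError.
        if not isinstance(test, bool) or not isinstance(validation, bool):
            raise TypeError("test and validation must be booleans")
        # True maps to 1 interaction, False maps to 0.
        super().__init__(test_n=int(test), validation_n=int(validation), seed=seed)


splitter = LeaveOneOutSketch(test=True, validation=False)
print(splitter.test_n, splitter.validation_n)  # 1 0
```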
LeaveRatioOut
Bases: Splitter
Splits the dataset into training, test, and validation sets based on a ratio instead of a fixed number of samples.
This splitter selects a fraction of interactions for each user to be assigned to the test and validation sets, ensuring that the splits are proportional to the user's total number of interactions.
Source code in datarec/splitters/user_stratified/leave_out.py
__init__(test_ratio=0, val_ratio=0, seed=42)
Initializes the LeaveRatioOut splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Proportion of each user's interactions assigned to the test set. Default is 0. | 0 |
val_ratio | float | Proportion of each user's interactions assigned to the validation set. Default is 0. | 0 |
seed | int | Random seed for reproducibility. Default is 42. | 42 |
Raises:
Type | Description |
---|---|
ValueError
|
If |
ValueError
|
If the sum of |
Source code in datarec/splitters/user_stratified/leave_out.py
run(datarec)
Splits the dataset into train, test, and validation sets based on the specified ratios.
The interactions of each user are sampled proportionally to create the test and validation sets. The remaining interactions are used as the training set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset containing interactions and user-item relationships. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary containing the following keys:
- |
Raises:
Type | Description |
---|---|
ValueError | If an empty dataset is encountered after sampling. |
Source code in datarec/splitters/user_stratified/leave_out.py
LeaveNLast
Bases: Splitter
Splits the dataset by removing the last n interactions per user based on a timestamp column.
This splitter selects the last test_n interactions for the test set and the last validation_n interactions for the validation set while keeping the remaining interactions in the training set.
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
test_n
property
writable
The number of last interactions per user for the test set.
validation_n
property
writable
The number of last interactions per user for the validation set.
__init__(test_n=0, validation_n=0, seed=42)
Initializes the LeaveNLast splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_n | int | Number of last interactions for the test set. Defaults to 0. | 0 |
validation_n | int | Number of last interactions for the validation set. Defaults to 0. | 0 |
seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
run(datarec)
Splits the dataset into train, test, and validation sets based on the last n interactions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset containing the interactions and timestamp column. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary with the following keys:
- "train": The training dataset ( |
Raises:
Type | Description |
---|---|
TypeError | If the dataset does not contain a timestamp column. |
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
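The temporal variant replaces the random shuffle with a per-user sort. The sketch below is not the library source. It assumes that the test set takes the very last test_n interactions and that validation takes the validation_n interactions immediately before them. The function name, column names, and the "val" key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, int]


def leave_n_last_sketch(
    rows: List[Row], test_n: int = 0, validation_n: int = 0,
    user_col: str = "user", time_col: str = "timestamp",
) -> Dict[str, List[Row]]:
    """Hold out each user's most recent interactions."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    splits: Dict[str, List[Row]] = {"train": [], "val": [], "test": []}
    for history in by_user.values():
        ordered = sorted(history, key=lambda row: row[time_col])
        # Index arithmetic avoids the ordered[-0:] pitfall when test_n is 0.
        test_start = max(len(ordered) - test_n, 0)
        val_start = max(test_start - validation_n, 0)
        splits["train"].extend(ordered[:val_start])
        splits["val"].extend(ordered[val_start:test_start])
        splits["test"].extend(ordered[test_start:])
    return splits


rows = [{"user": u, "timestamp": t} for u in (1, 2) for t in (3, 1, 5, 2, 4)]
splits = leave_n_last_sketch(rows, test_n=1, validation_n=1)
print(sorted(r["timestamp"] for r in splits["test"]))  # [5, 5]
```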
LeaveOneLastItem
Bases: LeaveNLast
Special case of LeaveNLast that removes only the last interaction per user for test and validation.
This class sets test_n and validation_n to 1 if their corresponding boolean parameters are True.
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
__init__(test=True, validation=True, seed=42)
Initializes the LeaveOneLastItem splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test | bool | Whether to remove the last interaction for the test set. Defaults to True. | True |
validation | bool | Whether to remove the last interaction for the validation set. Defaults to True. | True |
seed | int | Random seed for reproducibility. Default is 42. | 42 |
Raises:
Type | Description |
---|---|
TypeError
|
If |
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
LeaveRatioLast
Bases: Splitter
Splits the dataset into training, test, and validation sets by selecting the most recent interactions for each user based on a specified ratio.
Unlike LeaveNLast, which selects a fixed number of interactions, this splitter chooses a fraction of the total interactions per user, preserving temporal order.
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
__init__(test_ratio=0, val_ratio=0, seed=42)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Proportion of each user's interactions assigned to the test set. Default is 0. | 0 |
val_ratio | float | Proportion of each user's interactions assigned to the validation set. Default is 0. | 0 |
seed | int | Random seed for reproducibility. Default is 42. | 42 |
Raises:
Type | Description |
---|---|
ValueError
|
If |
ValueError
|
If |
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
run(datarec)
Splits the dataset into train, test, and validation sets by selecting the last interactions (in chronological order) for each user.
The most recent interactions are removed first for the test set, then for the validation set, leaving the remaining interactions for training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datarec | DataRec | The dataset containing interactions with a timestamp column. | required |
Returns:
Type | Description |
---|---|
Dict[str, DataRec]
|
A dictionary containing the following keys:
- |
Raises:
Type | Description |
---|---|
TypeError | If the dataset does not contain a timestamp column. |
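To show how a ratio scales with history length, the sketch below applies the "test first, then validation" removal order to users with different numbers of interactions. It is an illustration under assumptions, not the library implementation: per-user sizes are truncated products of the ratio and the history length, ratios summing to more than 1 are rejected, and the function name, column names, and "val" key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, int]


def leave_ratio_last_sketch(
    rows: List[Row], test_ratio: float = 0, val_ratio: float = 0,
    user_col: str = "user", time_col: str = "timestamp",
) -> Dict[str, List[Row]]:
    """Hold out the most recent fraction of each user's history."""
    if not 0 <= test_ratio <= 1 or not 0 <= val_ratio <= 1:
        raise ValueError("ratios must be in the range [0, 1]")
    if test_ratio + val_ratio > 1:
        raise ValueError("test_ratio + val_ratio must not exceed 1")  # assumption
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)
    splits: Dict[str, List[Row]] = {"train": [], "val": [], "test": []}
    for history in by_user.values():
        ordered = sorted(history, key=lambda row: row[time_col])
        n_test = int(len(ordered) * test_ratio)  # truncation is an assumption
        n_val = int(len(ordered) * val_ratio)
        test_start = len(ordered) - n_test       # newest rows go to test
        val_start = test_start - n_val           # the rows just before go to validation
        splits["train"].extend(ordered[:val_start])
        splits["val"].extend(ordered[val_start:test_start])
        splits["test"].extend(ordered[test_start:])
    return splits


rows = [{"user": 1, "timestamp": t} for t in range(10)]
rows += [{"user": 2, "timestamp": t} for t in range(20)]
splits = leave_ratio_last_sketch(rows, test_ratio=0.2, val_ratio=0.1)
print({name: len(part) for name, part in splits.items()})
# {'train': 21, 'val': 3, 'test': 6}
```

The user with 10 interactions contributes 2 test rows and 1 validation row, while the user with 20 contributes 4 and 2, so the held-out amount grows with the length of each history.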