Splitters Module Reference

This section provides a detailed API reference for all modules related to splitting datasets into training, validation, and test sets.

Core Splitting Utilities

These modules define the base class and common utilities used by all splitters.

Splitter

Base class for dataset splitters.

This class provides a common interface for splitting datasets into training, validation, and test sets. Subclasses should implement specific splitting strategies.

Source code in datarec/splitters/splitter.py
class Splitter:
    """
    Base class for dataset splitters.

    This class provides a common interface for splitting datasets into training,
    validation, and test sets. Subclasses should implement specific splitting strategies.
    """

    @staticmethod
    def output(datarec: DataRec, train: pd.DataFrame, test: pd.DataFrame, validation: pd.DataFrame,
               step_info: Dict[str, Dict]) -> Dict[str, DataRec]:
        """
        Creates a dictionary of `DataRec` objects for train, test, and validation splits.

        Args:
            datarec (DataRec): The original dataset wrapped in a `DataRec` object.
            train (pd.DataFrame): The training split of the dataset.
            test (pd.DataFrame): The test split of the dataset.
            validation (pd.DataFrame): The validation split of the dataset.
            step_info (Dict[str, Dict]): Metadata of the transformation.

        Returns:
            (Dict[str, DataRec]): A dictionary containing the split datasets:
                - 'train': The training dataset as a `DataRec` object (if not empty).
                - 'test': The test dataset as a `DataRec` object (if not empty).
                - 'val': The validation dataset as a `DataRec` object (if not empty).
        """

        pipeline = datarec.pipeline.copy()
        pipeline.add_step(name='split', operation=step_info['operation'], params=step_info['params'])

        result = dict()
        for k, d in zip(['train', 'test', 'val'], [train, test, validation]):
            if len(d) > 0:
                new_datarec = DataRec(RawData(d,
                                              user=datarec.user_col,
                                              item=datarec.item_col,
                                              rating=datarec.rating_col,
                                              timestamp=datarec.timestamp_col),
                                      derives_from=datarec,
                                      dataset_name=datarec.dataset_name,
                                      pipeline=pipeline.copy())
                result[k] = new_datarec
        return result

output(datarec, train, test, validation, step_info) staticmethod

Creates a dictionary of DataRec objects for train, test, and validation splits.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | The original dataset wrapped in a `DataRec` object. | *required* |
| `train` | `DataFrame` | The training split of the dataset. | *required* |
| `test` | `DataFrame` | The test split of the dataset. | *required* |
| `validation` | `DataFrame` | The validation split of the dataset. | *required* |
| `step_info` | `Dict[str, Dict]` | Metadata of the transformation. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary containing the split datasets: `'train'`, `'test'`, and `'val'`, each wrapped in a `DataRec` object and included only if the corresponding split is not empty. |

Source code in datarec/splitters/splitter.py
@staticmethod
def output(datarec: DataRec, train: pd.DataFrame, test: pd.DataFrame, validation: pd.DataFrame,
           step_info: Dict[str, Dict]) -> Dict[str, DataRec]:
    """
    Creates a dictionary of `DataRec` objects for train, test, and validation splits.

    Args:
        datarec (DataRec): The original dataset wrapped in a `DataRec` object.
        train (pd.DataFrame): The training split of the dataset.
        test (pd.DataFrame): The test split of the dataset.
        validation (pd.DataFrame): The validation split of the dataset.
        step_info (Dict[str, Dict]): Metadata of the transformation.

    Returns:
        (Dict[str, DataRec]): A dictionary containing the split datasets:
            - 'train': The training dataset as a `DataRec` object (if not empty).
            - 'test': The test dataset as a `DataRec` object (if not empty).
            - 'val': The validation dataset as a `DataRec` object (if not empty).
    """

    pipeline = datarec.pipeline.copy()
    pipeline.add_step(name='split', operation=step_info['operation'], params=step_info['params'])

    result = dict()
    for k, d in zip(['train', 'test', 'val'], [train, test, validation]):
        if len(d) > 0:
            new_datarec = DataRec(RawData(d,
                                          user=datarec.user_col,
                                          item=datarec.item_col,
                                          rating=datarec.rating_col,
                                          timestamp=datarec.timestamp_col),
                                  derives_from=datarec,
                                  dataset_name=datarec.dataset_name,
                                  pipeline=pipeline.copy())
            result[k] = new_datarec
    return result
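Example usage of `output` from a subclass. This is a minimal illustrative sketch, not part of the library: the `HalfSplitter` class and its naive first-half/second-half logic are hypothetical, and the top-level import path for `DataRec` is assumed.

from typing import Dict

import pandas as pd

from datarec.splitters.splitter import Splitter  # path taken from the source location above
from datarec import DataRec                      # import path assumed


class HalfSplitter(Splitter):
    """Toy splitter: the first half of the rows goes to train, the rest to test."""

    def __init__(self):
        # The built-in splitters record their constructor parameters here.
        self.params = {}

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        data = datarec.data
        cut = len(data) // 2
        train, test = data.iloc[:cut], data.iloc[cut:]
        # `output` wraps each non-empty frame in a DataRec and records the
        # split step in the derived datasets' pipelines.
        return self.output(datarec=datarec, train=train, test=test,
                           validation=pd.DataFrame(),
                           step_info={'operation': self.__class__.__name__,
                                      'params': self.params})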

random_sample(dataframe, seed, n_samples=1)

Randomly selects a specified number of samples from a given DataFrame.

This function splits the input DataFrame into two subsets:

- One containing n_samples randomly selected rows.
- One containing the remaining rows after the selection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataframe` | `DataFrame` | The input DataFrame from which to sample. | *required* |
| `seed` | `int` | Random seed for reproducibility. | *required* |
| `n_samples` | `int` | The number of samples to extract. Must be at least 1. | `1` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[DataFrame, DataFrame]` | The first DataFrame contains the remaining data after sampling; the second contains the randomly selected samples. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `n_samples` is less than 1 or greater than the number of rows in the DataFrame. |

Source code in datarec/splitters/utils.py
def random_sample(dataframe: pd.DataFrame, seed: int, n_samples: int = 1) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Randomly selects a specified number of samples from a given DataFrame.

    This function splits the input DataFrame into two subsets:
    - One containing `n_samples` randomly selected rows.
    - One containing the remaining rows after the selection.

    Args:
        dataframe (pd.DataFrame): The input DataFrame from which to sample.
        seed (int): Random seed for reproducibility.
        n_samples (int, optional): The number of samples to extract. Must be at least 1. Default is 1.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame]):
            - The first DataFrame contains the remaining data after sampling.
            - The second DataFrame contains the randomly selected samples.

    Raises:
        ValueError: If `n_samples` is less than 1 or greater than the number of rows in the DataFrame.
    """

    if n_samples < 1:
        raise ValueError('number of samples must be at least 1.')

    if n_samples > len(dataframe):
        raise ValueError('number of samples greater than the number of rows in the DataFrame.')

    samples = dataframe.sample(n=n_samples, random_state=seed)

    if len(samples) != n_samples:
        raise ValueError('number of sampled rows does not match the requested number of samples.')
    else:
        return dataframe.drop(samples.index), samples
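Example, a small self-contained illustration of the two return values (the import path mirrors the source location above):

import pandas as pd

from datarec.splitters.utils import random_sample

df = pd.DataFrame({'user': [1, 1, 2, 2, 3],
                   'item': [10, 11, 10, 12, 13]})

# Draw 2 random rows; the remaining rows are returned first, the sample second.
remaining, sampled = random_sample(df, seed=42, n_samples=2)

print(len(remaining), len(sampled))  # 3 2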

max_by_col(dataframe, discriminative_column, seed)

Selects the row with the maximum value in the specified column from the given DataFrame. If multiple rows have the same maximum value, one is randomly selected.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataframe` | `DataFrame` | The input DataFrame. | *required* |
| `discriminative_column` | `str` | The column used to determine the maximum value. | *required* |
| `seed` | `int` | Random seed for reproducibility. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[DataFrame, DataFrame]` | The first DataFrame contains the remaining rows after removing the selected row; the second contains the selected row with the maximum value. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the specified column is not present in the DataFrame. |
| `ValueError` | If no candidates are found (should not happen unless the DataFrame is empty). |
| `ValueError` | If the random selection fails to return exactly one row. |

Source code in datarec/splitters/utils.py
def max_by_col(dataframe: pd.DataFrame, discriminative_column: str, seed: int) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Selects the row with the maximum value in the specified column from the given DataFrame.
    If multiple rows have the same maximum value, one is randomly selected.

    Args:
        dataframe (pd.DataFrame): The input DataFrame.
        discriminative_column (str): The column used to determine the maximum value.
        seed (int): Random seed for reproducibility.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame]):
        - The first DataFrame contains the remaining rows after removing the selected row.
        - The second DataFrame contains the selected row with the maximum value.

    Raises:
        ValueError: If the specified column is not present in the DataFrame.
        ValueError: If no candidates are found (should not happen unless DataFrame is empty).
        ValueError: If the random selection fails to return exactly one row.
    """

    if discriminative_column not in dataframe:
        raise ValueError(f'Column \'{discriminative_column}\' must be in the dataframe.')

    max_value = dataframe[discriminative_column].max()
    candidates = dataframe.loc[dataframe[discriminative_column] == max_value]
    n_candidates = len(candidates)

    if n_candidates == 0:
        raise ValueError('No candidate.')
    elif n_candidates == 1:
        return dataframe.drop(candidates.index), candidates
    else:
        candidates = candidates.sample(n=1, random_state=seed)
        if len(candidates) != 1:
            raise ValueError('Random selection did not return exactly one candidate.')
        return dataframe.drop(candidates.index), candidates
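Example, illustrating the tie-breaking behaviour on the maximum value (the import path mirrors the source location above):

import pandas as pd

from datarec.splitters.utils import max_by_col

df = pd.DataFrame({'item': [10, 11, 12],
                   'timestamp': [100, 300, 300]})

# Two rows tie on the maximum timestamp (300); the seed decides which one is picked.
remaining, picked = max_by_col(df, discriminative_column='timestamp', seed=42)

print(int(picked['timestamp'].item()))  # 300
print(len(remaining))                   # 2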

temporal_holdout(dataframe, test_ratio, val_ratio, temporal_col)

Splits a dataset into training, validation, and test sets based on temporal ordering.

The function sorts the dataset according to a specified timestamp column and assigns the oldest interactions to the training set, followed by the validation set (if applicable), and the most recent interactions to the test set.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataframe` | `DataFrame` | The input dataset containing interaction data. | *required* |
| `test_ratio` | `float` | The proportion of the dataset to allocate to the test set. Must be between 0 and 1. | *required* |
| `val_ratio` | `float` | The proportion of the dataset to allocate to the validation set. Must be between 0 and 1. | *required* |
| `temporal_col` | `str` | The name of the column containing timestamp information. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[DataFrame, DataFrame, DataFrame]` | A tuple containing the train, test, and validation sets, in that order. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `test_ratio` or `val_ratio` are not in the range [0, 1]. |

Source code in datarec/splitters/utils.py
def temporal_holdout(dataframe: pd.DataFrame, test_ratio: float, val_ratio: float, temporal_col: str) \
        -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Splits a dataset into training, validation, and test sets based on temporal ordering.

    The function sorts the dataset according to a specified timestamp column and assigns
    the oldest interactions to the training set, followed by the validation set (if applicable),
    and the most recent interactions to the test set.

    Args:
        dataframe (pd.DataFrame): The input dataset containing interaction data.
        test_ratio (float): The proportion of the dataset to allocate to the test set. Must be between 0 and 1.
        val_ratio (float): The proportion of the dataset to allocate to the validation set. Must be between 0 and 1.
        temporal_col (str): The name of the column containing timestamp information.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]): A tuple containing the train, test, and validation sets, in that order.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
    """

    if test_ratio < 0 or test_ratio > 1:
        raise ValueError('test ratio must be between 0 and 1.')

    if val_ratio < 0 or val_ratio > 1:
        raise ValueError('val ratio must be between 0 and 1.')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    total_samples = len(dataframe)

    test_samples = round(total_samples * test_ratio)
    train_samples = total_samples - test_samples
    val_samples = round(train_samples * val_ratio)

    train_samples = total_samples - test_samples - val_samples

    assert (train_samples + val_samples + test_samples) == total_samples

    ordered = dataframe.sort_values(by=temporal_col)

    train = ordered.iloc[:train_samples]
    if val_samples:
        val = ordered.iloc[train_samples:(train_samples + val_samples)]
    if test_samples:
        test = ordered.iloc[(train_samples + val_samples):]

    assert len(train) == train_samples
    assert len(val) == val_samples
    assert len(test) == test_samples

    return train, test, val
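Example, showing how the ratios translate into sample counts and how the temporal ordering determines which rows land where (the import path mirrors the source location above):

import pandas as pd

from datarec.splitters.utils import temporal_holdout

df = pd.DataFrame({'user': range(10),
                   'item': range(10),
                   'timestamp': [5, 3, 9, 1, 7, 2, 8, 0, 6, 4]})

# 10 rows: round(10 * 0.2) = 2 go to test, round(8 * 0.25) = 2 of the remaining
# rows go to validation, and the 6 oldest rows stay in train.
train, test, val = temporal_holdout(df, test_ratio=0.2, val_ratio=0.25, temporal_col='timestamp')

print(len(train), len(val), len(test))                      # 6 2 2
print(train['timestamp'].max() < test['timestamp'].min())   # True: oldest rows in train, newest in test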

Uniform Splitting Strategies

These splitters operate on the entire dataset globally.

RandomHoldOut

Bases: Splitter

Implements a random holdout split for recommendation datasets.

This splitter partitions the dataset into training, validation, and test sets using a random sampling approach. The proportions of the dataset allocated to the validation and test sets are controlled by val_ratio and test_ratio, respectively.

Source code in datarec/splitters/uniform/hold_out.py
class RandomHoldOut(Splitter):
    """
    Implements a random holdout split for recommendation datasets.

    This splitter partitions the dataset into training, validation, and test sets
    using a random sampling approach. The proportions of the dataset allocated to
    the validation and test sets are controlled by `val_ratio` and `test_ratio`, respectively.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """
        Initializes the RandomHoldOut object.

        Args:
            test_ratio (float, optional): The proportion of the dataset to include in the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of the training set to include in the validation set.
                Must be between 0 and 1. Default is 0.
            seed (int, optional): The random seed for reproducibility. Defaults to 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    @property
    def test_ratio(self) -> float:
        """
        The proportion of the dataset for the test set.
        """
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the proportion of the dataset for the test set.

        Args:
            value (float): The proportion to allocate to the test set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """
        The proportion of the dataset for the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the proportion of the dataset for the validation set.

        Args:
            value (float): The proportion to allocate to the validation set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into training, validation, and test sets according to the specified ratios, 
        with the val_ratio being applied to the dataset after the test set has been partitioned.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": The training dataset (`DataRec`).
                - "test": The test dataset (`DataRec`), if `test_ratio` > 0.
                - "val": The validation dataset (`DataRec`), if `val_ratio` > 0.
        """

        train, val, test = datarec.data, pd.DataFrame(), pd.DataFrame()

        if self.test_ratio:
            train, test = split(train, test_size=self._test_ratio, random_state=self.seed)

        if self.val_ratio:
            train, val = split(train, test_size=self._val_ratio, random_state=self.seed)

        return self.output(datarec=datarec, train=train, test=test, validation=val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

test_ratio property writable

The proportion of the dataset for the test set.

val_ratio property writable

The proportion of the dataset for the validation set.

__init__(test_ratio=0, val_ratio=0, seed=42)

Initializes the RandomHoldOut object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_ratio` | `float` | The proportion of the dataset to include in the test set. Must be between 0 and 1. | `0` |
| `val_ratio` | `float` | The proportion of the training set to include in the validation set. Must be between 0 and 1. | `0` |
| `seed` | `int` | The random seed for reproducibility. | `42` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `test_ratio` or `val_ratio` is not in the range [0, 1]. |

Source code in datarec/splitters/uniform/hold_out.py
def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """
    Initializes the RandomHoldOut object.

    Args:
        test_ratio (float, optional): The proportion of the dataset to include in the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of the training set to include in the validation set.
            Must be between 0 and 1. Default is 0.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

run(datarec)

Splits the dataset into training, validation, and test sets according to the specified ratios, with the val_ratio being applied to the dataset after the test set has been partitioned.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | The dataset to be split. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary with the following keys: `'train'` (the training `DataRec`), `'test'` (if `test_ratio` > 0), and `'val'` (if `val_ratio` > 0). |

Source code in datarec/splitters/uniform/hold_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into training, validation, and test sets according to the specified ratios, 
    with the val_ratio being applied to the dataset after the test set has been partitioned.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": The training dataset (`DataRec`).
            - "test": The test dataset (`DataRec`), if `test_ratio` > 0.
            - "val": The validation dataset (`DataRec`), if `val_ratio` > 0.
    """

    train, val, test = datarec.data, pd.DataFrame(), pd.DataFrame()

    if self.test_ratio:
        train, test = split(train, test_size=self._test_ratio, random_state=self.seed)

    if self.val_ratio:
        train, val = split(train, test_size=self._val_ratio, random_state=self.seed)

    return self.output(datarec=datarec, train=train, test=test, validation=val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
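Example usage (an illustrative sketch: it assumes `dr` is an existing DataRec, e.g. built with one of the dataset loaders; the import path mirrors the source location above):

from datarec.splitters.uniform.hold_out import RandomHoldOut

splitter = RandomHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
splits = splitter.run(dr)   # `dr` is an existing DataRec

train = splits['train']     # the remaining interactions
test = splits['test']       # present because test_ratio > 0
val = splits['val']         # present because val_ratio > 0; drawn from what is left after the test split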

TemporalHoldOut

Bases: Splitter

Implements a temporal hold-out splitting strategy for recommendation datasets.

This splitter partitions a dataset into training, validation, and test sets based on the timestamps associated with interactions. The training set contains the oldest interactions, while the test set contains the most recent ones.

Source code in datarec/splitters/uniform/temporal/hold_out.py
class TemporalHoldOut(Splitter):
    """
    Implements a temporal hold-out splitting strategy for recommendation datasets.

    This splitter partitions a dataset into training, validation, and test sets based on
    the timestamps associated with interactions. The training set contains the oldest interactions,
    while the test set contains the most recent ones.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0):

        """
        Initializes the TemporalHoldOut object.

        Args:
            test_ratio (float, optional): The proportion of the dataset to allocate to the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of the dataset to allocate to the validation set.
                Must be between 0 and 1. Default is 0.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio

    @property
    def test_ratio(self) -> float:
        "The proportion of the dataset allocated to the test set."
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the test ratio.

        Args:
            value (float): The proportion of the dataset to allocate to the test set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """
        The proportion of the dataset allocated to the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the validation ratio.

        Args:
            value (float): The proportion of the dataset to allocate to the validation set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset using a temporal hold-out strategy.

        This method partitions the dataset into training, validation, and test sets based on
        the timestamps present in the `datarec` object. The split is performed such that the
        training set contains older interactions, while the test set contains more recent ones.

        Args:
            datarec (DataRec): A DataRec object containing the dataset and a timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary with three keys:
                - `'train'`: A DataRec object containing the training set.
                - `'val'`: A DataRec object containing the validation set (if `val_ratio` > 0).
                - `'test'`: A DataRec object containing the test set (if `test_ratio` > 0).

        Raises:
            TypeError: If the `datarec` object does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = temporal_holdout(dataframe=datarec.data,
                                            test_ratio=self.test_ratio, val_ratio=self.val_ratio,
                                            temporal_col=datarec.timestamp_col)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

test_ratio property writable

The proportion of the dataset allocated to the test set.

val_ratio property writable

The proportion of the dataset allocated to the validation set.

__init__(test_ratio=0, val_ratio=0)

Initializes the TemporalHoldOut object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_ratio` | `float` | The proportion of the dataset to allocate to the test set. Must be between 0 and 1. | `0` |
| `val_ratio` | `float` | The proportion of the dataset to allocate to the validation set. Must be between 0 and 1. | `0` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `test_ratio` or `val_ratio` are not in the range [0, 1]. |

Source code in datarec/splitters/uniform/temporal/hold_out.py
def __init__(self, test_ratio: float = 0, val_ratio: float = 0):

    """
    Initializes the TemporalHoldOut object.

    Args:
        test_ratio (float, optional): The proportion of the dataset to allocate to the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of the dataset to allocate to the validation set.
            Must be between 0 and 1. Default is 0.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio

run(datarec)

Splits the dataset using a temporal hold-out strategy.

This method partitions the dataset into training, validation, and test sets based on the timestamps present in the datarec object. The split is performed such that the training set contains older interactions, while the test set contains more recent ones.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | A DataRec object containing the dataset and a timestamp column. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary with three keys: `'train'` (the training set), `'val'` (the validation set, if `val_ratio` > 0), and `'test'` (the test set, if `test_ratio` > 0). |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If the `datarec` object does not contain a timestamp column. |

Source code in datarec/splitters/uniform/temporal/hold_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset using a temporal hold-out strategy.

    This method partitions the dataset into training, validation, and test sets based on
    the timestamps present in the `datarec` object. The split is performed such that the
    training set contains older interactions, while the test set contains more recent ones.

    Args:
        datarec (DataRec): A DataRec object containing the dataset and a timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary with three keys:
            - `'train'`: A DataRec object containing the training set.
            - `'val'`: A DataRec object containing the validation set (if `val_ratio` > 0).
            - `'test'`: A DataRec object containing the test set (if `test_ratio` > 0).

    Raises:
        TypeError: If the `datarec` object does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = temporal_holdout(dataframe=datarec.data,
                                        test_ratio=self.test_ratio, val_ratio=self.val_ratio,
                                        temporal_col=datarec.timestamp_col)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
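Example usage (an illustrative sketch: it assumes `dr` is a DataRec with a timestamp column; the import path mirrors the source location above):

from datarec.splitters.uniform.temporal.hold_out import TemporalHoldOut

splitter = TemporalHoldOut(test_ratio=0.2, val_ratio=0.1)
splits = splitter.run(dr)   # raises TypeError if `dr` has no timestamp column

train, val, test = splits['train'], splits['val'], splits['test']
# train holds the oldest interactions, test the most recent ones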

TemporalThresholdSplit

Bases: Splitter

Splits a dataset into training, validation, and test sets based on two timestamp thresholds.

The dataset is divided such that:

- The training set contains interactions occurring strictly before val_threshold.
- The validation set contains interactions occurring between val_threshold (inclusive) and test_threshold (exclusive).
- The test set contains interactions occurring at or after test_threshold.

Source code in datarec/splitters/uniform/temporal/threshold.py
class TemporalThresholdSplit(Splitter):
    """
    Splits a dataset into training, validation, and test sets based on two timestamp thresholds.

    The dataset is divided such that:
    - The training set contains interactions occurring strictly before `val_threshold`.
    - The validation set contains interactions occurring between `val_threshold` (inclusive)
      and `test_threshold` (exclusive).
    - The test set contains interactions occurring at or after `test_threshold`.
    """

    def __init__(self, val_threshold: float, test_threshold: float):
        """Initializes the TemporalThresholdSplit object.

        Args:
            val_threshold (float): The timestamp value that defines the split between training and validation.
            test_threshold (float): The timestamp value that defines the split between validation and test.

        Raises:
            ValueError: If `val_threshold` is not strictly less than `test_threshold`.
        """

        if val_threshold >= test_threshold:
            raise ValueError('val_threshold must be strictly less than test_threshold')

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.val_threshold = val_threshold
        self.test_threshold = test_threshold

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into training, validation, and test sets based on two thresholds.

        Args:
            datarec (DataRec): A DataRec object containing the dataset with a timestamp column.

        Returns:
            Dict[str, DataRec]: A dictionary with:
                - `'train'`: Training set (timestamps < `val_threshold`).
                - `'val'`: Validation set (timestamps between `val_threshold` and `test_threshold`).
                - `'test'`: Test set (timestamps >= `test_threshold`).

        Raises:
            TypeError: If the `datarec` object does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        dataset = datarec.data

        train = dataset[dataset[datarec.timestamp_col] < self.val_threshold]

        val = dataset[(dataset[datarec.timestamp_col] >= self.val_threshold) &
                      (dataset[datarec.timestamp_col] < self.test_threshold)]

        test = dataset[dataset[datarec.timestamp_col] >= self.test_threshold]

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

__init__(val_threshold, test_threshold)

Initializes the TemporalThresholdSplit object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `val_threshold` | `float` | The timestamp value that defines the split between training and validation. | *required* |
| `test_threshold` | `float` | The timestamp value that defines the split between validation and test. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `val_threshold` is not strictly less than `test_threshold`. |

Source code in datarec/splitters/uniform/temporal/threshold.py
def __init__(self, val_threshold: float, test_threshold: float):
    """Initializes the TemporalThresholdSplit object.

    Args:
        val_threshold (float): The timestamp value that defines the split between training and validation.
        test_threshold (float): The timestamp value that defines the split between validation and test.

    Raises:
        ValueError: If `val_threshold` is not strictly less than `test_threshold`.
    """

    if val_threshold >= test_threshold:
        raise ValueError('val_threshold must be strictly less than test_threshold')

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.val_threshold = val_threshold
    self.test_threshold = test_threshold

run(datarec)

Splits the dataset into training, validation, and test sets based on two thresholds.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | A DataRec object containing the dataset with a timestamp column. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary with: `'train'` (timestamps < `val_threshold`), `'val'` (timestamps between `val_threshold` and `test_threshold`), and `'test'` (timestamps >= `test_threshold`). |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If the `datarec` object does not contain a timestamp column. |

Source code in datarec/splitters/uniform/temporal/threshold.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into training, validation, and test sets based on two thresholds.

    Args:
        datarec (DataRec): A DataRec object containing the dataset with a timestamp column.

    Returns:
        Dict[str, DataRec]: A dictionary with:
            - `'train'`: Training set (timestamps < `val_threshold`).
            - `'val'`: Validation set (timestamps between `val_threshold` and `test_threshold`).
            - `'test'`: Test set (timestamps >= `test_threshold`).

    Raises:
        TypeError: If the `datarec` object does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    dataset = datarec.data

    train = dataset[dataset[datarec.timestamp_col] < self.val_threshold]

    val = dataset[(dataset[datarec.timestamp_col] >= self.val_threshold) &
                  (dataset[datarec.timestamp_col] < self.test_threshold)]

    test = dataset[dataset[datarec.timestamp_col] >= self.test_threshold]

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
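Example usage (an illustrative sketch: it assumes `dr` is a DataRec with a timestamp column; deriving the thresholds from timestamp quantiles is just one option, explicit epoch values work equally well):

from datarec.splitters.uniform.temporal.threshold import TemporalThresholdSplit

timestamps = dr.data[dr.timestamp_col]      # `dr` is an existing DataRec
val_threshold = timestamps.quantile(0.8)    # roughly the oldest 80% of interactions -> train
test_threshold = timestamps.quantile(0.9)   # next ~10% -> val, newest ~10% -> test

splitter = TemporalThresholdSplit(val_threshold=val_threshold, test_threshold=test_threshold)
splits = splitter.run(dr)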

User-Stratified Splitting Strategies

These splitters operate on a per-user basis, ensuring that each user's interaction history is partitioned across the splits.

UserStratifiedHoldOut

Bases: Splitter

Implements a user-stratified holdout split for a recommendation dataset.

This splitter ensures that each user's interactions are split into training, validation, and test sets while maintaining the proportion specified by test_ratio and val_ratio.

Source code in datarec/splitters/user_stratified/hold_out.py
class UserStratifiedHoldOut(Splitter):
    """
    Implements a user-stratified holdout split for a recommendation dataset.

    This splitter ensures that each user's interactions are split into training, validation,
    and test sets while maintaining the proportion specified by `test_ratio` and `val_ratio`.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """Initializes the UserStratifiedHoldOut splitter.

        Args:
            test_ratio (float, optional): The proportion of interactions per user to include in the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of interactions per user to include in the validation set.
                Must be between 0 and 1. Default is 0.
            seed (int, optional): Random seed for reproducibility. Defaults to 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    @property
    def test_ratio(self) -> float:
        """The proportion of interactions per user for the test set."""
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the proportion of interactions per user for the test set.

        Args:
            value (float): Ratio for the test set. Must be between 0 and 1.

        Raises:
            ValueError: If the ratio is not between 0 and 1.
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """ 
        The proportion of interactions per user for the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the proportion of remaining interactions per user for the validation set.

        Args:
            value (float): Ratio for the validation set. Must be between 0 and 1.

        Raises:
            ValueError: If the ratio is not between 0 and 1.
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

        Each user's interactions are split independently according to `test_ratio` and `val_ratio`, ensuring
        that the distribution is preserved per user. The function returns a dictionary containing the three
        resulting subsets.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": DataRec containing the training set.
                - "test": DataRec containing the test set, if `test_ratio` > 0.
                - "val": DataRec containing the validation set, if `val_ratio` > 0.
        """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_ratio:
                u_train, u_test = split(u_train, test_size=self._test_ratio, random_state=self.seed)
            if self.val_ratio:
                u_train, u_val = split(u_train, test_size=self._val_ratio, random_state=self.seed)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec=datarec, train=train, test=test, validation=val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

test_ratio property writable

The proportion of interactions per user for the test set.

val_ratio property writable

The proportion of interactions per user for the validation set.

__init__(test_ratio=0, val_ratio=0, seed=42)

Initializes the UserStratifiedHoldOut splitter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_ratio` | `float` | The proportion of interactions per user to include in the test set. Must be between 0 and 1. | `0` |
| `val_ratio` | `float` | The proportion of interactions per user to include in the validation set. Must be between 0 and 1. | `0` |
| `seed` | `int` | Random seed for reproducibility. | `42` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `test_ratio` or `val_ratio` is not in the range [0, 1]. |

Source code in datarec/splitters/user_stratified/hold_out.py
def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """Initializes the UserStratifiedHoldOut splitter.

    Args:
        test_ratio (float, optional): The proportion of interactions per user to include in the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of interactions per user to include in the validation set.
            Must be between 0 and 1. Default is 0.
        seed (int, optional): Random seed for reproducibility. Defaults to 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

run(datarec)

Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

Each user's interactions are split independently according to test_ratio and val_ratio, ensuring that the distribution is preserved per user. The function returns a dictionary containing the three resulting subsets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | The dataset to be split. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary with the following keys: `'train'` (the training set), `'test'` (if `test_ratio` > 0), and `'val'` (if `val_ratio` > 0). |

Source code in datarec/splitters/user_stratified/hold_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

    Each user's interactions are split independently according to `test_ratio` and `val_ratio`, ensuring
    that the distribution is preserved per user. The function returns a dictionary containing the three
    resulting subsets.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": DataRec containing the training set.
            - "test": DataRec containing the test set, if `test_ratio` > 0.
            - "val": DataRec containing the validation set, if `val_ratio` > 0.
    """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_ratio:
            u_train, u_test = split(u_train, test_size=self._test_ratio, random_state=self.seed)
        if self.val_ratio:
            u_train, u_val = split(u_train, test_size=self._val_ratio, random_state=self.seed)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec=datarec, train=train, test=test, validation=val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
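Example usage (an illustrative sketch: it assumes `dr` is an existing DataRec; the import path mirrors the source location above):

from datarec.splitters.user_stratified.hold_out import UserStratifiedHoldOut

splitter = UserStratifiedHoldOut(test_ratio=0.2, val_ratio=0.1, seed=42)
splits = splitter.run(dr)   # `dr` is an existing DataRec

# Unlike RandomHoldOut, the ratios are applied within each user's interactions,
# so the split proportions are preserved per user rather than only globally.
train, val, test = splits['train'], splits['val'], splits['test']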

LeaveNOut

Bases: Splitter

Implements the Leave-N-Out splitting strategy for recommendation datasets.

This splitter ensures that for each user, a fixed number of interactions (test_n and validation_n) are randomly selected and moved to the test and validation sets, respectively. The remaining interactions are kept in the training set.

Source code in datarec/splitters/user_stratified/leave_out.py
class LeaveNOut(Splitter):
    """
    Implements the Leave-N-Out splitting strategy for recommendation datasets.

    This splitter ensures that for each user, a fixed number of interactions (`test_n` and `validation_n`)
    are randomly selected and moved to the test and validation sets, respectively. The remaining interactions
    are kept in the training set.
    """

    def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
        """Initializes the LeaveNOut splitter.

        Args:
            test_n (int, optional): Number of interactions to move to the test set per user. Default is 0.
            validation_n (int, optional): Number of interactions to move to the validation set per user. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_n` or `validation_n` are negative.
            TypeError: If `test_n` or `validation_n` are not integers.
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_n = test_n
        self.validation_n = validation_n
        self.seed = seed

    @property
    def test_n(self) -> int:
        """Number of interactions to move to the test set per user."""
        return self._test_n

    @test_n.setter
    def test_n(self, value: int) -> None:
        """
        Sets the number of interactions to move to the test set per user.

        Args:
            value (int): Number of interactions.

        Raises:
            ValueError: If `value` is negative.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("test_n must be greater or equal to 0.")
        if isinstance(value, float):
            raise TypeError("test_n must be an integer.")
        self._test_n = value

    @property
    def validation_n(self) -> int:
        """Number of interactions to move to the test set per user."""
        return self._validation_n

    @validation_n.setter
    def validation_n(self, value: int) -> None:
        """
        Sets the number of interactions to move to the validation set per user.

        Args:
            value (int): Number of interactions.

        Raises:
            ValueError: If `value` is negative.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("validation_n must be greater or equal to 0.")
        if isinstance(value, float):
            raise TypeError("validation_n must be an integer.")
        self._validation_n = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

        For each user, `test_n` interactions are randomly assigned to the test set, and `validation_n`
        interactions are assigned to the validation set. The remaining interactions are used for training.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": DataRec containing the training set.
                - "test": DataRec containing the test set, if `test_n` > 0.
                - "validation": DataRec containing the validation set, if `val_n` > 0.
        """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_n:
                u_train, sample = random_sample(dataframe=u_train, n_samples=self.test_n, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if self.validation_n:
                u_train, sample = random_sample(dataframe=u_train, n_samples=self.validation_n, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

test_n property writable

Number of interactions to move to the test set per user.

validation_n property writable

Number of interactions to move to the validation set per user.

__init__(test_n=0, validation_n=0, seed=42)

Initializes the LeaveNOut splitter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_n` | `int` | Number of interactions to move to the test set per user. | `0` |
| `validation_n` | `int` | Number of interactions to move to the validation set per user. | `0` |
| `seed` | `int` | Random seed for reproducibility. | `42` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `test_n` or `validation_n` are negative. |
| `TypeError` | If `test_n` or `validation_n` are not integers. |

Source code in datarec/splitters/user_stratified/leave_out.py
def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
    """Initializes the LeaveNOut splitter.

    Args:
        test_n (int, optional): Number of interactions to move to the test set per user. Default is 0.
        validation_n (int, optional): Number of interactions to move to the validation set per user. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_n` or `validation_n` are negative.
        TypeError: If `test_n` or `validation_n` are not integers.
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_n = test_n
    self.validation_n = validation_n
    self.seed = seed

run(datarec)

Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

For each user, test_n interactions are randomly assigned to the test set, and validation_n interactions are assigned to the validation set. The remaining interactions are used for training.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `datarec` | `DataRec` | The dataset to be split. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, DataRec]` | A dictionary with the following keys: `'train'` (the training set), `'test'` (if `test_n` > 0), and `'val'` (if `validation_n` > 0). |

Source code in datarec/splitters/user_stratified/leave_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

    For each user, `test_n` interactions are randomly assigned to the test set, and `validation_n`
    interactions are assigned to the validation set. The remaining interactions are used for training.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": DataRec containing the training set.
            - "test": DataRec containing the test set, if `test_n` > 0.
            - "validation": DataRec containing the validation set, if `val_n` > 0.
    """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_n:
            u_train, sample = random_sample(dataframe=u_train, n_samples=self.test_n, seed=self.seed)
            u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if self.validation_n:
            u_train, sample = random_sample(dataframe=u_train, n_samples=self.validation_n, seed=self.seed)
            u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
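Example usage (an illustrative sketch: it assumes `dr` is an existing DataRec; the import path mirrors the source location above):

from datarec.splitters.user_stratified.leave_out import LeaveNOut

# Hold out 2 interactions per user for testing and 1 per user for validation.
# Each user needs at least test_n + validation_n interactions, otherwise
# random_sample raises a ValueError.
splitter = LeaveNOut(test_n=2, validation_n=1, seed=42)
splits = splitter.run(dr)

train, val, test = splits['train'], splits['val'], splits['test']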

LeaveOneOut

Bases: LeaveNOut

Implements the Leave-One-Out splitting strategy for recommendation datasets.

This splitter ensures that for each user, at most one interaction is randomly selected and moved to the test and/or validation set, depending on the specified parameters. The remaining interactions are kept in the training set.

This is a special case of LeaveNOut where test_n=1 and/or validation_n=1 if test and validation are set to True, respectively.

Source code in datarec/splitters/user_stratified/leave_out.py
class LeaveOneOut(LeaveNOut):
    """
    Implements the Leave-One-Out splitting strategy for recommendation datasets.

    This splitter ensures that for each user, at most one interaction is randomly selected and moved
    to the test and/or validation set, depending on the specified parameters. The remaining interactions
    are kept in the training set.

    This is a special case of `LeaveNOut` where `test_n=1` and/or `validation_n=1` if `test` and `validation`
    are set to `True`, respectively.
    """

    def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
        """Initializes the LeaveOneOut splitter.

        Args:
            test (bool, optional): Whether to include a test set. Defaults to True.
            validation (bool, optional): Whether to include a validation set. Defaults to True.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            TypeError: If `test` or `validation` is not a boolean.
        """
        if not isinstance(test, bool):
            raise TypeError("test must be a boolean.")
        if not isinstance(validation, bool):
            raise TypeError("validation must be an boolean.")

        test = 1 if test else 0
        validation = 1 if validation else 0

        super().__init__(test_n=test, validation_n=validation, seed=seed)

__init__(test=True, validation=True, seed=42)

Initializes the LeaveOneOut splitter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test` | `bool` | Whether to include a test set. | `True` |
| `validation` | `bool` | Whether to include a validation set. | `True` |
| `seed` | `int` | Random seed for reproducibility. | `42` |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If `test` or `validation` is not a boolean. |

Source code in datarec/splitters/user_stratified/leave_out.py
def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
    """Initializes the LeaveOneOut splitter.

    Args:
        test (bool, optional): Whether to include a test set. Defaults to True.
        validation (bool, optional): Whether to include a validation set. Defaults to True.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        TypeError: If `test` or `validation` is not a boolean.
    """
    if not isinstance(test, bool):
        raise TypeError("test must be a boolean.")
    if not isinstance(validation, bool):
        raise TypeError("validation must be an boolean.")

    test = 1 if test else 0
    validation = 1 if validation else 0

    super().__init__(test_n=test, validation_n=validation, seed=seed)
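Example usage (an illustrative sketch: it assumes `dr` is an existing DataRec; the import path mirrors the source location above):

from datarec.splitters.user_stratified.leave_out import LeaveOneOut

# One interaction per user goes to test and one to validation;
# equivalent to LeaveNOut(test_n=1, validation_n=1, seed=42).
splitter = LeaveOneOut(test=True, validation=True, seed=42)
splits = splitter.run(dr)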

LeaveRatioOut

Bases: Splitter

Splits the dataset into training, test, and validation sets based on a ratio instead of a fixed number of samples.

This splitter selects a fraction of interactions for each user to be assigned to the test and validation sets, ensuring that the splits are proportional to the user's total number of interactions.

Source code in datarec/splitters/user_stratified/leave_out.py
class LeaveRatioOut(Splitter):
    """
    Splits the dataset into training, test, and validation sets based on a ratio instead of a fixed number of samples.

    This splitter selects a fraction of interactions for each user to be assigned to the test and validation sets,
    ensuring that the splits are proportional to the user's total number of interactions.
    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """Initializes the LeaveRatioOut splitter.

        Args:
            test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
            val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
            ValueError: If the sum of `test_ratio` and `val_ratio` exceeds 1.
        """
        if not (0 <= test_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if not (0 <= val_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if test_ratio + val_ratio > 1:
            raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
         Splits the dataset into train, test, and validation sets based on the specified ratios.

         The interactions of each user are sampled proportionally to create the test and validation sets.
         The remaining interactions are used as the training set.

         Args:
             datarec (DataRec): The dataset containing interactions and user-item relationships.

         Returns:
             (Dict[str, DataRec]): A dictionary containing the following keys:
                 - `"train"` (`DataRec`): The training dataset.
                 - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
                 - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

         Raises:
             ValueError: If an empty dataset is encountered after sampling.
         """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:
            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            user_total = len(u_train)

            test_n_samples = round(self.test_ratio * user_total)
            val_n_samples = round(self.val_ratio * user_total)

            if test_n_samples > 0:
                u_train, sample = random_sample(dataframe=u_train, n_samples=min(test_n_samples, len(u_train)),
                                                seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if val_n_samples > 0:
                u_train, sample = random_sample(dataframe=u_train, n_samples=min(val_n_samples, len(u_train)),
                                                seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

__init__(test_ratio=0, val_ratio=0, seed=42)

Initializes the LeaveRatioOut splitter.

Parameters:

Name Type Description Default
test_ratio float

Proportion of each user's interactions assigned to the test set. Default is 0.

0
val_ratio float

Proportion of each user's interactions assigned to the validation set. Default is 0.

0
seed int

Random seed for reproducibility. Default is 42.

42

Raises:

Type Description
ValueError

If test_ratio or val_ratio are not in the range [0, 1].

ValueError

If the sum of test_ratio and val_ratio exceeds 1.

Source code in datarec/splitters/user_stratified/leave_out.py
def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """Initializes the LeaveRatioOut splitter.

    Args:
        test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
        val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        ValueError: If the sum of `test_ratio` and `val_ratio` exceeds 1.
    """
    if not (0 <= test_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if not (0 <= val_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if test_ratio + val_ratio > 1:
        raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

run(datarec)

Splits the dataset into train, test, and validation sets based on the specified ratios.

The interactions of each user are sampled proportionally to create the test and validation sets. The remaining interactions are used as the training set.

Parameters:

Name Type Description Default
datarec DataRec

The dataset containing interactions and user-item relationships.

required

Returns:

Type Description
Dict[str, DataRec]

A dictionary containing the following keys: - "train" (DataRec): The training dataset. - "test" (DataRec): The test dataset, if test_ratio > 0. - "val" (DataRec): The validation dataset, if val_ratio > 0.

Raises:

Type Description
ValueError

If an empty dataset is encountered after sampling.

Source code in datarec/splitters/user_stratified/leave_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
     Splits the dataset into train, test, and validation sets based on the specified ratios.

     The interactions of each user are sampled proportionally to create the test and validation sets.
     The remaining interactions are used as the training set.

     Args:
         datarec (DataRec): The dataset containing interactions and user-item relationships.

     Returns:
         (Dict[str, DataRec]): A dictionary containing the following keys:
             - `"train"` (`DataRec`): The training dataset.
             - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
             - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

     Raises:
         ValueError: If an empty dataset is encountered after sampling.
     """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:
        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        user_total = len(u_train)

        test_n_samples = round(self.test_ratio * user_total)
        val_n_samples = round(self.val_ratio * user_total)

        if test_n_samples > 0:
            u_train, sample = random_sample(dataframe=u_train, n_samples=min(test_n_samples, len(u_train)),
                                            seed=self.seed)
            u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if val_n_samples > 0:
            u_train, sample = random_sample(dataframe=u_train, n_samples=min(val_n_samples, len(u_train)),
                                            seed=self.seed)
            u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
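
Example usage. A brief sketch, assuming dr is a DataRec built as in the LeaveOneOut example above and that the import path mirrors the source file shown. Per-user sample counts come from round(ratio * user_total), so users with very short histories may contribute no rows to a given split.

from datarec.splitters.user_stratified.leave_out import LeaveRatioOut

# Roughly 20% of each user's interactions go to the test set and 10% to the
# validation set; the remainder stays in train. For a user with 7 interactions,
# round(0.2 * 7) = 1 test row and round(0.1 * 7) = 1 validation row.
splits = LeaveRatioOut(test_ratio=0.2, val_ratio=0.1, seed=42).run(dr)
train = splits['train']
test = splits.get('test')   # keys are present only for non-empty splits
val = splits.get('val')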

LeaveNLast

Bases: Splitter

Splits the dataset by removing the last n interactions per user based on a timestamp column.

This splitter moves each user's test_n most recent interactions to the test set and the next validation_n most recent interactions to the validation set, keeping the remaining interactions in the training set.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
class LeaveNLast(Splitter):
    """
    Splits the dataset by removing the last `n` interactions per user based on a timestamp column.

    This splitter moves each user's `test_n` most recent interactions to the test set and the next
    `validation_n` most recent interactions to the validation set, keeping the remaining interactions
    in the training set.
    """
    def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
        """Initializes the LeaveNLast splitter.

        Args:
            test_n (int, optional): Number of last interactions for the test set. Defaults to 0.
            validation_n (int, optional): Number of last interactions for the validation set. Defaults to 0.
            seed (int, optional): Random seed for reproducibility. Defaults to 42.
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_n = test_n
        self.validation_n = validation_n
        self.seed = seed

    @property
    def test_n(self) -> int:
        """The number of last interactions per user for the test set."""
        return self._test_n

    @test_n.setter
    def test_n(self, value: int) -> None:
        """
        Sets the number of last interactions per user for the test set.

        Args:
            value (int): Number of interactions. Must be >= 0.

        Raises:
            ValueError: If `value` < 0.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("test_n must be greater or equal than 0.")
        if isinstance(value, float):
            raise TypeError("test_n must be an integer.")
        self._test_n = value

    @property
    def validation_n(self) -> int:
        """The number of last interactions per user for the validation set."""
        return self._validation_n

    @validation_n.setter
    def validation_n(self, value: int) -> None:
        """
        Sets the number of last interactions per user for the validation set.

        Args:
            value (int): Number of interactions. Must be >= 0.

        Raises:
            ValueError: If `value` < 0.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("validation_n must be greater or equal than 0.")
        if isinstance(value, float):
            raise TypeError("validation_n must be and integer.")
        self._validation_n = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, test, and validation sets based on the last `n` interactions.

        Args:
            datarec (DataRec): The dataset containing the interactions and timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": The training dataset (`DataRec`).
                - "test": The test dataset (`DataRec`), if `test_n` > 0.
                - "val": The validation dataset (`DataRec`), if `val_n` > 0.

        Raises:
            TypeError: If the dataset does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_n:
                for _ in range(self.test_n):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if self.validation_n:
                for _ in range(self.validation_n):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

test_n property writable

The number of last interactions per user for the test set.

validation_n property writable

The number of last interactions per user for the validation set.

__init__(test_n=0, validation_n=0, seed=42)

Initializes the LeaveNLast splitter.

Parameters:

Name Type Description Default
test_n int

Number of last interactions for the test set. Defaults to 0.

0
validation_n int

Number of last interactions for the validation set. Defaults to 0.

0
seed int

Random seed for reproducibility. Defaults to 42.

42
Source code in datarec/splitters/user_stratified/temporal/leave_out.py
def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
    """Initializes the LeaveNLast splitter.

    Args:
        test_n (int, optional): Number of last interactions for the test set. Defaults to 0.
        validation_n (int, optional): Number of last interactions for the validation set. Defaults to 0.
        seed (int, optional): Random seed for reproducibility. Defaults to 42.
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_n = test_n
    self.validation_n = validation_n
    self.seed = seed

run(datarec)

Splits the dataset into train, test, and validation sets based on the last n interactions.

Parameters:

Name Type Description Default
datarec DataRec

The dataset containing the interactions and timestamp column.

required

Returns:

Type Description
Dict[str, DataRec]

A dictionary with the following keys: - "train": The training dataset (DataRec). - "test": The test dataset (DataRec), if test_n > 0. - "val": The validation dataset (DataRec), if validation_n > 0.

Raises:

Type Description
TypeError

If the dataset does not contain a timestamp column.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, test, and validation sets based on the last `n` interactions.

    Args:
        datarec (DataRec): The dataset containing the interactions and timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": The training dataset (`DataRec`).
            - "test": The test dataset (`DataRec`), if `test_n` > 0.
            - "val": The validation dataset (`DataRec`), if `val_n` > 0.

    Raises:
        TypeError: If the dataset does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_n:
            for _ in range(self.test_n):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if self.validation_n:
            for _ in range(self.validation_n):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
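
Example usage. A sketch assuming dr is a DataRec whose timestamp column is set (otherwise run() raises TypeError); the import path mirrors the source file shown above.

from datarec.splitters.user_stratified.temporal.leave_out import LeaveNLast

# Per user: the 2 most recent interactions go to the test set, the next most
# recent interaction goes to the validation set, and the rest stay in train.
splits = LeaveNLast(test_n=2, validation_n=1, seed=42).run(dr)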

LeaveOneLastItem

Bases: LeaveNLast

Special case of LeaveNLast that removes only the last interaction per user for test and validation.

This class sets test_n and validation_n to 1 if their corresponding boolean parameters are True.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
class LeaveOneLastItem(LeaveNLast):
    """
    Special case of LeaveNLast that removes only the last interaction per user for test and validation.

    This class sets `test_n` and `validation_n` to 1 if their corresponding boolean parameters are True.

    """

    def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
        """
        Initializes the LeaveOneLastItem splitter.

        Args:
            test (bool, optional): Whether to remove the last interaction for the test set. Defaults to True.
            validation (bool, optional): Whether to remove the last interaction for the validation set. Defaults to True.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            TypeError: If `test` or `validation` is not a boolean.
        """
        if not isinstance(test, bool):
            raise TypeError("test must be a boolean.")
        if not isinstance(validation, bool):
            raise TypeError("validation must be an boolean.")

        test = 1 if test else 0
        validation = 1 if validation else 0

        super().__init__(test_n=test, validation_n=validation, seed=seed)

__init__(test=True, validation=True, seed=42)

Initializes the LeaveOneLastItem splitter.

Parameters:

Name Type Description Default
test bool

Whether to remove the last interaction for the test set. Defaults to True.

True
validation bool

Whether to remove the last interaction for the validation set. Defaults to True.

True
seed int

Random seed for reproducibility. Default is 42.

42

Raises:

Type Description
TypeError

If test or validation is not a boolean.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
    """
    Initializes the LeaveOneLastItem splitter.

    Args:
        test (bool, optional): Whether to remove the last interaction for the test set. Defaults to True.
        validation (bool, optional): Whether to remove the last interaction for the validation set. Defaults to True.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        TypeError: If `test` or `validation` is not a boolean.
    """
    if not isinstance(test, bool):
        raise TypeError("test must be a boolean.")
    if not isinstance(validation, bool):
        raise TypeError("validation must be an boolean.")

    test = 1 if test else 0
    validation = 1 if validation else 0

    super().__init__(test_n=test, validation_n=validation, seed=seed)
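
Relationship with LeaveNLast. As a sketch, the two calls below are expected to produce the same split, since a True flag maps to test_n=1 or validation_n=1; dr and the import path are assumed as in the earlier examples.

from datarec.splitters.user_stratified.temporal.leave_out import LeaveNLast, LeaveOneLastItem

splits_a = LeaveOneLastItem(test=True, validation=False, seed=42).run(dr)
splits_b = LeaveNLast(test_n=1, validation_n=0, seed=42).run(dr)   # equivalent split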

LeaveRatioLast

Bases: Splitter

Splits the dataset into training, test, and validation sets by selecting the most recent interactions for each user based on a specified ratio.

Unlike LeaveNLast, which selects a fixed number of interactions, this splitter chooses a fraction of the total interactions per user, preserving temporal order.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
class LeaveRatioLast(Splitter):
    """
    Splits the dataset into training, test, and validation sets by selecting the most recent interactions
    for each user based on a specified ratio.

    Unlike `LeaveNLast`, which selects a fixed number of interactions, this splitter chooses a fraction
    of the total interactions per user, preserving temporal order.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """
        Args:
            test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
            val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
            ValueError: If `test_ratio + val_ratio` > 1.
        """
        if not (0 <= test_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if not (0 <= val_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if test_ratio + val_ratio > 1:
            raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, test, and validation sets by selecting the last interactions
        (in chronological order) for each user.

        The most recent interactions are removed first for the test set, then for the validation set,
        leaving the remaining interactions for training.

        Args:
            datarec (DataRec): The dataset containing interactions with a timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary containing the following keys:
                - `"train"` (`DataRec`): The training dataset.
                - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
                - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

        Raises:
            TypeError: If the dataset does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:
            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            user_total = len(u_train)

            test_n_samples = round(self.test_ratio * user_total)
            val_n_samples = round(self.val_ratio * user_total)

            if test_n_samples > 0:
                for _ in range(min(test_n_samples, len(u_train))):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if val_n_samples > 0:
                for _ in range(min(val_n_samples, len(u_train))):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

__init__(test_ratio=0, val_ratio=0, seed=42)

Parameters:

Name Type Description Default
test_ratio float

Proportion of each user's interactions assigned to the test set. Default is 0.

0
val_ratio float

Proportion of each user's interactions assigned to the validation set. Default is 0.

0
seed int

Random seed for reproducibility. Default is 42.

42

Raises:

Type Description
ValueError

If test_ratio or val_ratio are not in the range [0, 1].

ValueError

If test_ratio + val_ratio > 1.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """
    Args:
        test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
        val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        ValueError: If `test_ratio + val_ratio` > 1.
    """
    if not (0 <= test_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if not (0 <= val_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if test_ratio + val_ratio > 1:
        raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

run(datarec)

Splits the dataset into train, test, and validation sets by selecting the last interactions (in chronological order) for each user.

The most recent interactions are removed first for the test set, then for the validation set, leaving the remaining interactions for training.

Parameters:

Name Type Description Default
datarec DataRec

The dataset containing interactions with a timestamp column.

required

Returns:

Type Description
Dict[str, DataRec]

A dictionary containing the following keys: - "train" (DataRec): The training dataset. - "test" (DataRec): The test dataset, if test_ratio > 0. - "val" (DataRec): The validation dataset, if val_ratio > 0.

Raises:

Type Description
TypeError

If the dataset does not contain a timestamp column.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py
def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, test, and validation sets by selecting the last interactions
    (in chronological order) for each user.

    The most recent interactions are removed first for the test set, then for the validation set,
    leaving the remaining interactions for training.

    Args:
        datarec (DataRec): The dataset containing interactions with a timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary containing the following keys:
            - `"train"` (`DataRec`): The training dataset.
            - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
            - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

    Raises:
        TypeError: If the dataset does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:
        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        user_total = len(u_train)

        test_n_samples = round(self.test_ratio * user_total)
        val_n_samples = round(self.val_ratio * user_total)

        if test_n_samples > 0:
            for _ in range(min(test_n_samples, len(u_train))):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if val_n_samples > 0:
            for _ in range(min(val_n_samples, len(u_train))):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
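
Example usage. A final sketch, again assuming a timestamped DataRec dr and an import path matching the source file shown above.

from datarec.splitters.user_stratified.temporal.leave_out import LeaveRatioLast

# The most recent ~20% of each user's history is held out for the test set and
# the next ~10% for the validation set, preserving temporal order; older
# interactions remain in train.
splits = LeaveRatioLast(test_ratio=0.2, val_ratio=0.1, seed=42).run(dr)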