Splitters Module Reference

This section provides a detailed API reference for all modules related to splitting datasets into training, validation, and test sets.

Core Splitting Utilities

These modules define the base class and common utilities used by all splitters.

`Splitter`

Base class for dataset splitters.

This class provides a common interface for splitting datasets into training, validation, and test sets. Subclasses should implement specific splitting strategies.

Source code in datarec/splitters/splitter.py

class Splitter:
    """
    Base class for dataset splitters.

    This class provides a common interface for splitting datasets into training,
    validation, and test sets. Subclasses should implement specific splitting strategies.
    """

    @staticmethod
    def output(datarec: DataRec, train: pd.DataFrame, test: pd.DataFrame, validation: pd.DataFrame,
               step_info: Dict[str, Dict]) -> Dict[str, DataRec]:
        """
        Creates a dictionary of `DataRec` objects for train, test, and validation splits.

        Args:
            datarec (DataRec): The original dataset wrapped in a `DataRec` object.
            train (pd.DataFrame): The training split of the dataset.
            test (pd.DataFrame): The test split of the dataset.
            validation (pd.DataFrame): The validation split of the dataset.
            step_info (Dict[str, Dict]): Metadata of the transformation.

        Returns:
            (Dict[str, DataRec]): A dictionary containing the split datasets:
                - 'train': The training dataset as a `DataRec` object (if not empty).
                - 'test': The test dataset as a `DataRec` object (if not empty).
                - 'val': The validation dataset as a `DataRec` object (if not empty).
        """

        pipeline = datarec.pipeline.copy()
        pipeline.add_step(name='split', operation=step_info['operation'], params=step_info['params'])

        result = dict()
        for k, d in zip(['train', 'test', 'val'], [train, test, validation]):
            if len(d) > 0:
                new_datarec = DataRec(RawData(d,
                                              user=datarec.user_col,
                                              item=datarec.item_col,
                                              rating=datarec.rating_col,
                                              timestamp=datarec.timestamp_col),
                                      derives_from=datarec,
                                      dataset_name=datarec.dataset_name,
                                      pipeline=pipeline.copy())
                result[k] = new_datarec
        return result

`output(datarec, train, test, validation, step_info)` `staticmethod`

Creates a dictionary of DataRec objects for train, test, and validation splits.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The original dataset wrapped in a `DataRec` object.	required
`train`	`DataFrame`	The training split of the dataset.	required
`test`	`DataFrame`	The test split of the dataset.	required
`validation`	`DataFrame`	The validation split of the dataset.	required
`step_info`	`Dict[str, Dict]`	Metadata of the transformation.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary containing the split datasets as `DataRec` objects, keyed by `'train'`, `'test'`, and `'val'`. A key is present only if the corresponding split is not empty.

Source code in datarec/splitters/splitter.py

@staticmethod
def output(datarec: DataRec, train: pd.DataFrame, test: pd.DataFrame, validation: pd.DataFrame,
           step_info: Dict[str, Dict]) -> Dict[str, DataRec]:
    """
    Creates a dictionary of `DataRec` objects for train, test, and validation splits.

    Args:
        datarec (DataRec): The original dataset wrapped in a `DataRec` object.
        train (pd.DataFrame): The training split of the dataset.
        test (pd.DataFrame): The test split of the dataset.
        validation (pd.DataFrame): The validation split of the dataset.
        step_info (Dict[str, Dict]): Metadata of the transformation.

    Returns:
        (Dict[str, DataRec]): A dictionary containing the split datasets:
            - 'train': The training dataset as a `DataRec` object (if not empty).
            - 'test': The test dataset as a `DataRec` object (if not empty).
            - 'val': The validation dataset as a `DataRec` object (if not empty).
    """

    pipeline = datarec.pipeline.copy()
    pipeline.add_step(name='split', operation=step_info['operation'], params=step_info['params'])

    result = dict()
    for k, d in zip(['train', 'test', 'val'], [train, test, validation]):
        if len(d) > 0:
            new_datarec = DataRec(RawData(d,
                                          user=datarec.user_col,
                                          item=datarec.item_col,
                                          rating=datarec.rating_col,
                                          timestamp=datarec.timestamp_col),
                                  derives_from=datarec,
                                  dataset_name=datarec.dataset_name,
                                  pipeline=pipeline.copy())
            result[k] = new_datarec
    return result
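The key-filtering behaviour of `output` can be illustrated without the library. The following is a minimal sketch on plain lists, not the library code: `collect_non_empty` is a hypothetical name, and lists stand in for the DataFrames and `DataRec` objects.

```python
from typing import Dict, List


def collect_non_empty(train: List, test: List, validation: List) -> Dict[str, List]:
    """Mimic the key-filtering loop of `Splitter.output` on plain lists (hypothetical helper)."""
    result = {}
    # Same pairing as in `output`: the validation split is stored under 'val'.
    for key, split in zip(['train', 'test', 'val'], [train, test, validation]):
        if len(split) > 0:  # empty splits are skipped, so their key is absent
            result[key] = split
    return result


splits = collect_non_empty(train=[1, 2, 3], test=[4], validation=[])
print(sorted(splits))  # 'val' is absent because the validation split is empty
```

Callers that expect a validation set should therefore check for the `'val'` key (for example with `dict.get`) rather than index it directly.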

`random_sample(dataframe, seed, n_samples=1)`

Randomly selects a specified number of samples from a given DataFrame.

This function splits the input DataFrame into two subsets:

- One containing `n_samples` randomly selected rows.
- One containing the remaining rows after the selection.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input DataFrame from which to sample.	required
`seed`	`int`	Random seed for reproducibility.	required
`n_samples`	`int`	The number of samples to extract. Must be at least 1. Default is 1.	`1`

Returns:

Type	Description
`Tuple[DataFrame, DataFrame]`	The first DataFrame contains the remaining data after sampling. The second DataFrame contains the randomly selected samples.

Raises:

Type	Description
`ValueError`	If `n_samples` is less than 1 or greater than the number of rows in the DataFrame.

Source code in datarec/splitters/utils.py

def random_sample(dataframe: pd.DataFrame, seed: int, n_samples: int = 1) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Randomly selects a specified number of samples from a given DataFrame.

    This function splits the input DataFrame into two subsets:
    - One containing `n_samples` randomly selected rows.
    - One containing the remaining rows after the selection.

    Args:
        dataframe (pd.DataFrame): The input DataFrame from which to sample.
        seed (int): Random seed for reproducibility.
        n_samples (int, optional): The number of samples to extract. Must be at least 1. Default is 1.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame]):
            - The first DataFrame contains the remaining data after sampling.
            - The second DataFrame contains the randomly selected samples.

    Raises:
        ValueError: If `n_samples` is less than 1 or greater than the number of rows in the DataFrame.
    """

    if n_samples < 1:
        raise ValueError('number of samples must be at least 1.')

    if n_samples > len(dataframe):
        raise ValueError('number of samples greater than the number of samples in the DataFrame.')

    samples = dataframe.sample(n=n_samples, random_state=seed)

    if len(samples) != n_samples:
        raise ValueError('number of samples lesser or greater than the number of rows in the DataFrame.')
    else:
        return dataframe.drop(samples.index), samples
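The contract of `random_sample` (remaining rows first, sampled rows second, reproducible for a fixed seed) can be sketched in plain Python. This is an illustrative analog under stated assumptions, not the pandas-based implementation: `random_sample_rows` is a hypothetical name and a list of tuples stands in for the DataFrame.

```python
import random
from typing import List, Tuple

Row = Tuple[str, str]  # hypothetical (user, item) record


def random_sample_rows(rows: List[Row], seed: int, n_samples: int = 1) -> Tuple[List[Row], List[Row]]:
    """Return (remaining_rows, sampled_rows), mirroring the return order of `random_sample`."""
    if n_samples < 1:
        raise ValueError('number of samples must be at least 1.')
    if n_samples > len(rows):
        raise ValueError('number of samples greater than the number of rows.')

    # A seeded generator makes the selection reproducible.
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(rows)), n_samples))

    remaining = [row for i, row in enumerate(rows) if i not in picked]
    samples = [row for i, row in enumerate(rows) if i in picked]
    return remaining, samples


rows = [('u1', 'a'), ('u1', 'b'), ('u1', 'c'), ('u1', 'd')]
remaining, samples = random_sample_rows(rows, seed=42, n_samples=2)
print(len(remaining), len(samples))
```

Note the return order: the remainder comes first and the sample second, which is easy to swap by accident when unpacking.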

`max_by_col(dataframe, discriminative_column, seed)`

Selects the row with the maximum value in the specified column from the given DataFrame. If multiple rows share the same maximum value, one is randomly selected.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input DataFrame.	required
`discriminative_column`	`str`	The column used to determine the maximum value.	required
`seed`	`int`	Random seed for reproducibility.	required

Returns:

Type	Description
`Tuple[DataFrame, DataFrame]`	The first DataFrame contains the remaining rows after removing the selected row. The second DataFrame contains the selected row with the maximum value.

Raises:

Type	Description
`ValueError`	If the specified column is not present in the DataFrame.
`ValueError`	If no candidates are found (should not happen unless DataFrame is empty).
`ValueError`	If the random selection fails to return exactly one row.

Source code in datarec/splitters/utils.py

def max_by_col(dataframe: pd.DataFrame, discriminative_column: str, seed: int) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Selects the row with the maximum value in the specified column from the given DataFrame.
    If multiple rows share the same maximum value, one is randomly selected.

    Args:
        dataframe (pd.DataFrame): The input DataFrame.
        discriminative_column (str): The column used to determine the maximum value.
        seed (int): Random seed for reproducibility.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame]):
        - The first DataFrame contains the remaining rows after removing the selected row.
        - The second DataFrame contains the selected row with the maximum value.

    Raises:
        ValueError: If the specified column is not present in the DataFrame.
        ValueError: If no candidates are found (should not happen unless DataFrame is empty).
        ValueError: If the random selection fails to return exactly one row.
    """

    if discriminative_column not in dataframe:
        raise ValueError(f'Column \'{discriminative_column}\' must be in the dataframe.')

    max_value = dataframe[discriminative_column].max()
    candidates = dataframe.loc[dataframe[discriminative_column] == max_value]
    n_candidates = len(candidates)

    if n_candidates == 0:
        raise ValueError('No candidate.')
    elif n_candidates == 1:
        return dataframe.drop(candidates.index), candidates
    else:
        candidates = candidates.sample(n=1, random_state=seed)
        if len(candidates) != 1:
            raise ValueError('Number of candidates lesser or greater than 1.')
        return dataframe.drop(candidates.index), candidates
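The selection logic of `max_by_col` (take the maximum, break ties at random with a fixed seed) can be sketched on plain tuples. This is a simplified analog, not the library code: `pop_max_row` and the `(item, timestamp)` layout are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Row = Tuple[str, int]  # hypothetical (item, timestamp) record


def pop_max_row(rows: List[Row], key_index: int, seed: int) -> Tuple[List[Row], List[Row]]:
    """Return (remaining_rows, [selected_row]) where the selected row holds the maximum value."""
    if not rows:
        raise ValueError('No candidate.')

    max_value = max(row[key_index] for row in rows)
    candidates = [i for i, row in enumerate(rows) if row[key_index] == max_value]

    if len(candidates) == 1:
        chosen = candidates[0]
    else:
        # Ties are broken at random; the seed keeps the choice reproducible.
        chosen = random.Random(seed).choice(candidates)

    remaining = [row for i, row in enumerate(rows) if i != chosen]
    return remaining, [rows[chosen]]


rows = [('a', 1), ('b', 5), ('c', 5), ('d', 2)]
remaining, selected = pop_max_row(rows, key_index=1, seed=0)
print(selected[0][1])  # → 5
```

With timestamps as the discriminative column, this picks the most recent interaction, which is the typical use of such a helper in leave-one-out style splits.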

`temporal_holdout(dataframe, test_ratio, val_ratio, temporal_col)`

Splits a dataset into training, validation, and test sets based on temporal ordering.

The function sorts the dataset according to a specified timestamp column and assigns the oldest interactions to the training set, followed by the validation set (if applicable), and the most recent interactions to the test set.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input dataset containing interaction data.	required
`test_ratio`	`float`	The proportion of the dataset to allocate to the test set. Must be between 0 and 1.	required
`val_ratio`	`float`	The proportion of the data remaining after the test split to allocate to the validation set. Must be between 0 and 1.	required
`temporal_col`	`str`	The name of the column containing timestamp information.	required

Returns:

Type	Description
`Tuple[DataFrame, DataFrame, DataFrame]`	A tuple containing the train, test, and validation sets, in that order.

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` are not in the range [0, 1].

Source code in datarec/splitters/utils.py

def temporal_holdout(dataframe: pd.DataFrame, test_ratio: float, val_ratio: float, temporal_col: str) \
        -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Splits a dataset into training, validation, and test sets based on temporal ordering.

    The function sorts the dataset according to a specified timestamp column and assigns
    the oldest interactions to the training set, followed by the validation set (if applicable),
    and the most recent interactions to the test set.

    Args:
        dataframe (pd.DataFrame): The input dataset containing interaction data.
        test_ratio (float): The proportion of the dataset to allocate to the test set. Must be between 0 and 1.
        val_ratio (float): The proportion of the data remaining after the test split to allocate to the validation set. Must be between 0 and 1.
        temporal_col (str): The name of the column containing timestamp information.

    Returns:
        (Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]): A tuple containing the train, test, and validation sets, in that order.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
    """

    if test_ratio < 0 or test_ratio > 1:
        raise ValueError('test ratio must be between 0 and 1.')

    if val_ratio < 0 or val_ratio > 1:
        raise ValueError('val ratio must be between 0 and 1.')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    total_samples = len(dataframe)

    test_samples = round(total_samples * test_ratio)
    train_samples = total_samples - test_samples
    val_samples = round(train_samples * val_ratio)

    train_samples = total_samples - test_samples - val_samples

    assert (train_samples + val_samples + test_samples) == total_samples

    ordered = dataframe.sort_values(by=temporal_col)

    train = ordered.iloc[:train_samples]
    if val_samples:
        val = ordered.iloc[train_samples:(train_samples + val_samples)]
    if test_samples:
        test = ordered.iloc[(train_samples + val_samples):]

    assert len(train) == train_samples
    assert len(val) == val_samples
    assert len(test) == test_samples

    return train, test, val
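The size arithmetic in `temporal_holdout` is worth a worked example, because `val_ratio` is applied to what is left after the test rows are set aside, not to the whole dataset. The sketch below reproduces only that arithmetic; `holdout_sizes` is a hypothetical helper name.

```python
from typing import Dict


def holdout_sizes(total_samples: int, test_ratio: float, val_ratio: float) -> Dict[str, int]:
    """Reproduce the split-size computation of `temporal_holdout` (hypothetical helper)."""
    test_samples = round(total_samples * test_ratio)
    remaining = total_samples - test_samples
    # The validation ratio is applied to the rows that are not in the test set.
    val_samples = round(remaining * val_ratio)
    train_samples = total_samples - test_samples - val_samples
    return {'train': train_samples, 'val': val_samples, 'test': test_samples}


print(holdout_sizes(10, test_ratio=0.2, val_ratio=0.25))
# → {'train': 6, 'val': 2, 'test': 2}
```

Here the validation set holds 2 of 10 rows (20% of the dataset), even though `val_ratio` is 0.25, because the ratio is taken over the 8 non-test rows.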

Uniform Splitting Strategies

These splitters operate on the entire dataset globally.

`RandomHoldOut`

Bases: Splitter

Implements a random holdout split for recommendation datasets.

This splitter partitions the dataset into training, validation, and test sets using a random sampling approach. The proportions of the dataset allocated to the validation and test sets are controlled by val_ratio and test_ratio, respectively.

Source code in datarec/splitters/uniform/hold_out.py

class RandomHoldOut(Splitter):
    """
    Implements a random holdout split for recommendation datasets.

    This splitter partitions the dataset into training, validation, and test sets
    using a random sampling approach. The proportions of the dataset allocated to
    the validation and test sets are controlled by `val_ratio` and `test_ratio`, respectively.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """
        Initializes the RandomHoldOut object.

        Args:
            test_ratio (float, optional): The proportion of the dataset to include in the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of the training set to include in the validation set.
                Must be between 0 and 1. Default is 0.
            seed (int, optional): The random seed for reproducibility. Defaults to 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    @property
    def test_ratio(self) -> float:
        """
        The proportion of the dataset for the test set.
        """
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the proportion of the dataset for the test set.

        Args:
            value (float): The proportion to allocate to the test set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """
        The proportion of the dataset for the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the proportion of the dataset for the validation set.

        Args:
            value (float): The proportion to allocate to the validation set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into training, validation, and test sets according to the specified ratios, 
        with the val_ratio being applied to the dataset after the test set has been partitioned.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": The training dataset (`DataRec`).
                - "test": The test dataset (`DataRec`), if `test_ratio` > 0.
                - "val": The validation dataset (`DataRec`), if `val_ratio` > 0.
        """

        train, val, test = datarec.data, pd.DataFrame(), pd.DataFrame()

        if self.test_ratio:
            train, test = split(train, test_size=self._test_ratio, random_state=self.seed)

        if self.val_ratio:
            train, val = split(train, test_size=self._val_ratio, random_state=self.seed)

        return self.output(datarec=datarec, train=train, test=test, validation=val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`test_ratio` `property` `writable`

The proportion of the dataset for the test set.

`val_ratio` `property` `writable`

The proportion of the data remaining after the test split that is allocated to the validation set.

`__init__(test_ratio=0, val_ratio=0, seed=42)`

Initializes the RandomHoldOut object.

Parameters:

Name	Type	Description	Default
`test_ratio`	`float`	The proportion of the dataset to include in the test set. Must be between 0 and 1. Default is 0.	`0`
`val_ratio`	`float`	The proportion of the training set to include in the validation set. Must be between 0 and 1. Default is 0.	`0`
`seed`	`int`	The random seed for reproducibility. Defaults to 42.	`42`

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` is not in the range [0, 1].

Source code in datarec/splitters/uniform/hold_out.py

def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """
    Initializes the RandomHoldOut object.

    Args:
        test_ratio (float, optional): The proportion of the dataset to include in the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of the training set to include in the validation set.
            Must be between 0 and 1. Default is 0.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

`run(datarec)`

Splits the dataset into training, validation, and test sets according to the specified ratios, with the val_ratio being applied to the dataset after the test set has been partitioned.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset to be split.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with up to three keys, each mapping to a `DataRec`: `"train"` (the training dataset), `"test"` (the test dataset, if `test_ratio` > 0), and `"val"` (the validation dataset, if `val_ratio` > 0).

Source code in datarec/splitters/uniform/hold_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into training, validation, and test sets according to the specified ratios, 
    with the val_ratio being applied to the dataset after the test set has been partitioned.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": The training dataset (`DataRec`).
            - "test": The test dataset (`DataRec`), if `test_ratio` > 0.
            - "val": The validation dataset (`DataRec`), if `val_ratio` > 0.
    """

    train, val, test = datarec.data, pd.DataFrame(), pd.DataFrame()

    if self.test_ratio:
        train, test = split(train, test_size=self._test_ratio, random_state=self.seed)

    if self.val_ratio:
        train, val = split(train, test_size=self._val_ratio, random_state=self.seed)

    return self.output(datarec=datarec, train=train, test=test, validation=val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
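Because `run` carves out the test set first and then applies `val_ratio` to what remains, the share of the original dataset that ends up in each split differs from the raw ratios. The following sketch computes those approximate shares; `effective_fractions` is a hypothetical helper, and the exact row counts additionally depend on how the `split` function (whose import is not shown here) rounds.

```python
from typing import Dict


def effective_fractions(test_ratio: float, val_ratio: float) -> Dict[str, float]:
    """Approximate share of the original dataset in each split (hypothetical helper)."""
    test_fraction = test_ratio
    # The validation ratio applies to the (1 - test_ratio) share left after the test split.
    val_fraction = (1 - test_ratio) * val_ratio
    train_fraction = 1 - test_fraction - val_fraction
    return {'train': train_fraction, 'val': val_fraction, 'test': test_fraction}


fractions = effective_fractions(test_ratio=0.2, val_ratio=0.1)
print({name: round(value, 2) for name, value in fractions.items()})
```

With `test_ratio=0.2` and `val_ratio=0.1`, roughly 8% of the original data goes to validation rather than 10%. To target a validation share `v` of the whole dataset, one would pass `val_ratio = v / (1 - test_ratio)`.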

`TemporalHoldOut`

Bases: Splitter

Implements a temporal hold-out splitting strategy for recommendation datasets.

This splitter partitions a dataset into training, validation, and test sets based on the timestamps associated with interactions. The training set contains the oldest interactions, while the test set contains the most recent ones.

Source code in datarec/splitters/uniform/temporal/hold_out.py

class TemporalHoldOut(Splitter):
    """
    Implements a temporal hold-out splitting strategy for recommendation datasets.

    This splitter partitions a dataset into training, validation, and test sets based on
    the timestamps associated with interactions. The training set contains the oldest interactions,
    while the test set contains the most recent ones.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0):

        """
        Initializes the TemporalHoldOut object.
        Args:
            test_ratio (float, optional): The proportion of the dataset to allocate to the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of the dataset to allocate to the validation set.
                Must be between 0 and 1. Default is 0.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio

    @property
    def test_ratio(self) -> float:
        "The proportion of the dataset allocated to the test set."
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the test ratio.

        Args:
            value (float): The proportion of the dataset to allocate to the test set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """
        The proportion of the dataset allocated to the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the validation ratio.

        Args:
            value (float): The proportion of the dataset to allocate to the validation set.
                Must be between 0 and 1.

        Raises:
            ValueError: If `value` is not in the range [0, 1].
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset using a temporal hold-out strategy.

        This method partitions the dataset into training, validation, and test sets based on
        the timestamps present in the `datarec` object. The split is performed such that the
        training set contains older interactions, while the test set contains more recent ones.

        Args:
            datarec (DataRec): A DataRec object containing the dataset and a timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary with three keys:
                - `'train'`: A DataRec object containing the training set.
                - `'val'`: A DataRec object containing the validation set (if `val_ratio` > 0).
                - `'test'`: A DataRec object containing the test set (if `test_ratio` > 0).

        Raises:
            TypeError: If the `datarec` object does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = temporal_holdout(dataframe=datarec.data,
                                            test_ratio=self.test_ratio, val_ratio=self.val_ratio,
                                            temporal_col=datarec.timestamp_col)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`test_ratio` `property` `writable`

The proportion of the dataset allocated to the test set.

`val_ratio` `property` `writable`

The proportion of the data remaining after the test split that is allocated to the validation set.

`__init__(test_ratio=0, val_ratio=0)`

Initializes the TemporalHoldOut object.

Parameters:

Name	Type	Description	Default
`test_ratio`	`float`	The proportion of the dataset to allocate to the test set. Must be between 0 and 1. Default is 0.	`0`
`val_ratio`	`float`	The proportion of the data remaining after the test split to allocate to the validation set. Must be between 0 and 1. Default is 0.	`0`

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` are not in the range [0, 1].

Source code in datarec/splitters/uniform/temporal/hold_out.py

def __init__(self, test_ratio: float = 0, val_ratio: float = 0):

    """
    Initializes the TemporalHoldOut object.
    Args:
        test_ratio (float, optional): The proportion of the dataset to allocate to the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of the dataset to allocate to the validation set.
            Must be between 0 and 1. Default is 0.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio

`run(datarec)`

Splits the dataset using a temporal hold-out strategy.

This method partitions the dataset into training, validation, and test sets based on the timestamps present in the datarec object. The split is performed such that the training set contains older interactions, while the test set contains more recent ones.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	A DataRec object containing the dataset and a timestamp column.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with up to three keys, each mapping to a `DataRec` object: `'train'` (the training set), `'val'` (the validation set, if `val_ratio` > 0), and `'test'` (the test set, if `test_ratio` > 0).

Raises:

Type	Description
`TypeError`	If the `datarec` object does not contain a timestamp column.

Source code in datarec/splitters/uniform/temporal/hold_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset using a temporal hold-out strategy.

    This method partitions the dataset into training, validation, and test sets based on
    the timestamps present in the `datarec` object. The split is performed such that the
    training set contains older interactions, while the test set contains more recent ones.

    Args:
        datarec (DataRec): A DataRec object containing the dataset and a timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary with three keys:
            - `'train'`: A DataRec object containing the training set.
            - `'val'`: A DataRec object containing the validation set (if `val_ratio` > 0).
            - `'test'`: A DataRec object containing the test set (if `test_ratio` > 0).

    Raises:
        TypeError: If the `datarec` object does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = temporal_holdout(dataframe=datarec.data,
                                        test_ratio=self.test_ratio, val_ratio=self.val_ratio,
                                        temporal_col=datarec.timestamp_col)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
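The ordering guarantee of the temporal hold-out (oldest interactions in training, most recent in test) can be sketched on plain tuples. This is an illustrative analog rather than the library implementation: `temporal_split` and the `(user, item, timestamp)` layout are assumptions.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, int]  # hypothetical (user, item, timestamp) record


def temporal_split(interactions: List[Interaction], test_ratio: float,
                   val_ratio: float) -> Dict[str, List[Interaction]]:
    """Sort by timestamp, then slice into train / val / test from oldest to newest."""
    total = len(interactions)
    n_test = round(total * test_ratio)
    n_val = round((total - n_test) * val_ratio)
    n_train = total - n_test - n_val

    ordered = sorted(interactions, key=lambda row: row[2])
    return {
        'train': ordered[:n_train],
        'val': ordered[n_train:n_train + n_val],
        'test': ordered[n_train + n_val:],
    }


# Ten interactions with timestamps 0..9 in a scrambled order.
data = [('u', f'i{t}', t) for t in (7, 2, 9, 0, 5, 3, 8, 1, 6, 4)]
parts = temporal_split(data, test_ratio=0.2, val_ratio=0.25)
print([row[2] for row in parts['test']])  # → [8, 9]
```

Since the cut is made by position in the sorted order, interactions that share a timestamp with the boundary row may land on either side of it.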

`TemporalThresholdSplit`

Bases: Splitter

Splits a dataset into training, validation, and test sets based on two timestamp thresholds.

The dataset is divided such that:

- The training set contains interactions occurring strictly before `val_threshold`.
- The validation set contains interactions occurring between `val_threshold` (inclusive) and `test_threshold` (exclusive).
- The test set contains interactions occurring at or after `test_threshold`.

Source code in datarec/splitters/uniform/temporal/threshold.py

class TemporalThresholdSplit(Splitter):
    """
    Splits a dataset into training, validation, and test sets based on two timestamp thresholds.

    The dataset is divided such that:
    - The training set contains interactions occurring strictly before `val_threshold`.
    - The validation set contains interactions occurring between `val_threshold` (inclusive)
      and `test_threshold` (exclusive).
    - The test set contains interactions occurring at or after `test_threshold`.
    """

    def __init__(self, val_threshold: float, test_threshold: float):
        """Initializes the TemporalThresholdSplit object.

        Args:
            val_threshold (float): The timestamp value that defines the split between training and validation.
            test_threshold (float): The timestamp value that defines the split between validation and test.

        Raises:
            ValueError: If `val_threshold` is not strictly less than `test_threshold`.
        """

        if val_threshold >= test_threshold:
            raise ValueError('val_threshold must be strictly less than test_threshold')

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.val_threshold = val_threshold
        self.test_threshold = test_threshold

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into training, validation, and test sets based on two thresholds.

        Args:
            datarec (DataRec): A DataRec object containing the dataset with a timestamp column.

        Returns:
            Dict[str, DataRec]: A dictionary with:
                - `'train'`: Training set (timestamps < `val_threshold`).
                - `'val'`: Validation set (timestamps between `val_threshold` and `test_threshold`).
                - `'test'`: Test set (timestamps >= `test_threshold`).

        Raises:
            TypeError: If the `datarec` object does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        dataset = datarec.data

        train = dataset[dataset[datarec.timestamp_col] < self.val_threshold]

        val = dataset[(dataset[datarec.timestamp_col] >= self.val_threshold) &
                      (dataset[datarec.timestamp_col] < self.test_threshold)]

        test = dataset[dataset[datarec.timestamp_col] >= self.test_threshold]

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`__init__(val_threshold, test_threshold)`

Initializes the TemporalThresholdSplit object.

Parameters:

Name	Type	Description	Default
`val_threshold`	`float`	The timestamp value that defines the split between training and validation.	required
`test_threshold`	`float`	The timestamp value that defines the split between validation and test.	required

Raises:

Type	Description
`ValueError`	If `val_threshold` is not strictly less than `test_threshold`.

Source code in datarec/splitters/uniform/temporal/threshold.py

def __init__(self, val_threshold: float, test_threshold: float):
    """Initializes the TemporalThresholdSplit object.

    Args:
        val_threshold (float): The timestamp value that defines the split between training and validation.
        test_threshold (float): The timestamp value that defines the split between validation and test.

    Raises:
        ValueError: If `val_threshold` is not strictly less than `test_threshold`.
    """

    if val_threshold >= test_threshold:
        raise ValueError('val_threshold must be strictly less than test_threshold')

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.val_threshold = val_threshold
    self.test_threshold = test_threshold
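
The constructor records its own arguments through `locals()` so that `run` can later pass them to the pipeline as `step_info`. A minimal sketch of that pattern in isolation, using a hypothetical `ThresholdConfig` class that is not part of the library:

```python
class ThresholdConfig:
    """Hypothetical stand-in class, used only to illustrate the pattern."""

    def __init__(self, val_threshold: float, test_threshold: float):
        if val_threshold >= test_threshold:
            raise ValueError('val_threshold must be strictly less than test_threshold')
        # No other local names exist yet, so locals() holds only 'self'
        # and the two constructor arguments at this point.
        self.params = {k: v for k, v in locals().items() if k != 'self'}
        self.val_threshold = val_threshold
        self.test_threshold = test_threshold


config = ThresholdConfig(val_threshold=100.0, test_threshold=200.0)
print(config.params)  # {'val_threshold': 100.0, 'test_threshold': 200.0}
```

Capturing `locals()` before any other local variable is assigned keeps the recorded parameters limited to the arguments themselves.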

`run(datarec)`

Splits the dataset into training, validation, and test sets based on two thresholds.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	A DataRec object containing the dataset with a timestamp column.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with: - `'train'`: Training set (timestamps < `val_threshold`). - `'val'`: Validation set (`val_threshold` <= timestamps < `test_threshold`). - `'test'`: Test set (timestamps >= `test_threshold`).

Raises:

Type	Description
`TypeError`	If the `datarec` object does not contain a timestamp column.

Source code in datarec/splitters/uniform/temporal/threshold.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into training, validation, and test sets based on two thresholds.

    Args:
        datarec (DataRec): A DataRec object containing the dataset with a timestamp column.

    Returns:
        Dict[str, DataRec]: A dictionary with:
            - `'train'`: Training set (timestamps < `val_threshold`).
            - `'val'`: Validation set (`val_threshold` <= timestamps < `test_threshold`).
            - `'test'`: Test set (timestamps >= `test_threshold`).

    Raises:
        TypeError: If the `datarec` object does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    dataset = datarec.data

    train = dataset[dataset[datarec.timestamp_col] < self.val_threshold]

    val = dataset[(dataset[datarec.timestamp_col] >= self.val_threshold) &
                  (dataset[datarec.timestamp_col] < self.test_threshold)]

    test = dataset[dataset[datarec.timestamp_col] >= self.test_threshold]

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
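
The three masks in `run` partition the timeline into half-open intervals, so an interaction whose timestamp equals a threshold always falls into the later split. A minimal plain-Python sketch of the same boundary logic, assuming a hypothetical `(user, item, timestamp)` tuple layout instead of a `DataRec`:

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, float]  # (user, item, timestamp); hypothetical layout


def threshold_split(rows: List[Interaction], val_threshold: float,
                    test_threshold: float) -> Dict[str, List[Interaction]]:
    """Illustrative re-statement of the masks used in run(), without pandas."""
    if val_threshold >= test_threshold:
        raise ValueError('val_threshold must be strictly less than test_threshold')
    # train: strictly before val_threshold
    train = [r for r in rows if r[2] < val_threshold]
    # val: from val_threshold (inclusive) up to test_threshold (exclusive)
    val = [r for r in rows if val_threshold <= r[2] < test_threshold]
    # test: at or after test_threshold
    test = [r for r in rows if r[2] >= test_threshold]
    return {'train': train, 'val': val, 'test': test}


rows = [('u1', 'i1', 5.0), ('u1', 'i2', 10.0), ('u2', 'i1', 15.0), ('u2', 'i3', 20.0)]
parts = threshold_split(rows, val_threshold=10.0, test_threshold=20.0)
print({name: len(part) for name, part in parts.items()})  # {'train': 1, 'val': 2, 'test': 1}
```

Because the intervals are disjoint and cover the whole timeline, every interaction lands in exactly one split.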

User-Stratified Splitting Strategies

These splitters operate on a per-user basis, ensuring that each user's interaction history is partitioned across the splits.

`UserStratifiedHoldOut`

Bases: Splitter

Implements a user-stratified holdout split for a recommendation dataset.

This splitter ensures that each user's interactions are split into training, validation, and test sets while maintaining the proportion specified by `test_ratio` and `val_ratio`.

Source code in datarec/splitters/user_stratified/hold_out.py

class UserStratifiedHoldOut(Splitter):
    """
    Implements a user-stratified holdout split for a recommendation dataset.

    This splitter ensures that each user's interactions are split into training, validation,
    and test sets while maintaining the proportion specified by `test_ratio` and `val_ratio`.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """Initializes the UserStratifiedHoldOut splitter.

        Args:
            test_ratio (float, optional): The proportion of interactions per user to include in the test set.
                Must be between 0 and 1. Default is 0.
            val_ratio (float, optional): The proportion of interactions per user to include in the validation set.
                Must be between 0 and 1. Default is 0.
            seed (int, optional): Random seed for reproducibility. Defaults to 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    @property
    def test_ratio(self) -> float:
        """The proportion of interactions per user for the test set."""
        return self._test_ratio

    @test_ratio.setter
    def test_ratio(self, value: float) -> None:
        """
        Sets the proportion of interactions per user for the test set.

        Args:
            value (float): Ratio for the test set. Must be between 0 and 1.

        Raises:
            ValueError: If the ratio is not between 0 and 1.
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._test_ratio = value

    @property
    def val_ratio(self) -> float:
        """
        The proportion of each user's remaining interactions (after the test split) for the validation set.
        """
        return self._val_ratio

    @val_ratio.setter
    def val_ratio(self, value: float) -> None:
        """
        Sets the proportion of remaining interactions per user for the validation set.

        Args:
            value (float): Ratio for the validation set. Must be between 0 and 1.

        Raises:
            ValueError: If the ratio is not between 0 and 1.
        """
        if value < 0 or value > 1:
            raise ValueError('ratio must be between 0 and 1')
        self._val_ratio = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

        Each user's interactions are split independently according to `test_ratio` and `val_ratio`, ensuring
        that the distribution is preserved per user. The function returns a dictionary containing the three
        resulting subsets.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": DataRec containing the training set.
                - "test": DataRec containing the test set, if `test_ratio` > 0.
                - "val": DataRec containing the validation set, if `val_ratio` > 0.
        """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_ratio:
                u_train, u_test = split(u_train, test_size=self._test_ratio, random_state=self.seed)
            if self.val_ratio:
                u_train, u_val = split(u_train, test_size=self._val_ratio, random_state=self.seed)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec=datarec, train=train, test=test, validation=val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`test_ratio` `property` `writable`

The proportion of interactions per user for the test set.

`val_ratio` `property` `writable`

The proportion of each user's remaining interactions (after the test split) for the validation set.

`__init__(test_ratio=0, val_ratio=0, seed=42)`

Initializes the UserStratifiedHoldOut splitter.

Parameters:

Name	Type	Description	Default
`test_ratio`	`float`	The proportion of interactions per user to include in the test set. Must be between 0 and 1. Default is 0.	`0`
`val_ratio`	`float`	The proportion of interactions per user to include in the validation set. Must be between 0 and 1. Default is 0.	`0`
`seed`	`int`	Random seed for reproducibility. Defaults to 42.	`42`

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` is not in the range [0, 1].

Source code in datarec/splitters/user_stratified/hold_out.py

def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """Initializes the UserStratifiedHoldOut splitter.

    Args:
        test_ratio (float, optional): The proportion of interactions per user to include in the test set.
            Must be between 0 and 1. Default is 0.
        val_ratio (float, optional): The proportion of interactions per user to include in the validation set.
            Must be between 0 and 1. Default is 0.
        seed (int, optional): Random seed for reproducibility. Defaults to 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` is not in the range [0, 1].
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

`run(datarec)`

Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

Each user's interactions are split independently according to `test_ratio` and `val_ratio`, ensuring that the distribution is preserved per user. The function returns a dictionary containing the three resulting subsets.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset to be split.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with the following keys: - "train": DataRec containing the training set. - "test": DataRec containing the test set, if `test_ratio` > 0. - "val": DataRec containing the validation set, if `val_ratio` > 0.

Source code in datarec/splitters/user_stratified/hold_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, validation, and test sets using a user-stratified holdout approach.

    Each user's interactions are split independently according to `test_ratio` and `val_ratio`, ensuring
    that the distribution is preserved per user. The function returns a dictionary containing the three
    resulting subsets.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": DataRec containing the training set.
            - "test": DataRec containing the test set, if `test_ratio` > 0.
            - "val": DataRec containing the validation set, if `val_ratio` > 0.
    """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_ratio:
            u_train, u_test = split(u_train, test_size=self._test_ratio, random_state=self.seed)
        if self.val_ratio:
            u_train, u_val = split(u_train, test_size=self._val_ratio, random_state=self.seed)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec=datarec, train=train, test=test, validation=val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
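
The loop in `run` applies `val_ratio` to what is left in `u_train` after the test split, so the validation share of a user's original history is smaller than `val_ratio` whenever `test_ratio` is non-zero. A minimal plain-Python sketch of that sequencing; `hold_out` is a hypothetical stand-in for the `split` helper, and its rounding (`ceil`) is an assumption, since the helper's exact behaviour is not shown in this reference:

```python
import math
import random
from typing import List, Tuple


def hold_out(items: List[str], ratio: float, seed: int) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: shuffle a copy and hold out ceil(ratio * n) items."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_held = math.ceil(ratio * len(shuffled))
    return shuffled[n_held:], shuffled[:n_held]


history = [f'item{i}' for i in range(10)]           # one user's 10 interactions
train, test = hold_out(history, ratio=0.2, seed=42)  # 2 of 10 go to test
train, val = hold_out(train, ratio=0.25, seed=42)    # 2 of the remaining 8 go to val
print(len(train), len(val), len(test))               # 6 2 2
```

Here a validation ratio of 0.25 yields 20% of the original history, because it is taken from the 8 interactions that remain after the test split.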

`LeaveNOut`

Bases: Splitter

Implements the Leave-N-Out splitting strategy for recommendation datasets.

This splitter ensures that for each user, a fixed number of interactions (`test_n` and `validation_n`) are randomly selected and moved to the test and validation sets, respectively. The remaining interactions are kept in the training set.

Source code in datarec/splitters/user_stratified/leave_out.py

class LeaveNOut(Splitter):
    """
    Implements the Leave-N-Out splitting strategy for recommendation datasets.

    This splitter ensures that for each user, a fixed number of interactions (`test_n` and `validation_n`)
    are randomly selected and moved to the test and validation sets, respectively. The remaining interactions
    are kept in the training set.
    """

    def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
        """Initializes the LeaveNOut splitter.

        Args:
            test_n (int, optional): Number of interactions to move to the test set per user. Default is 0.
            validation_n (int, optional): Number of interactions to move to the validation set per user. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_n` or `validation_n` are negative.
            TypeError: If `test_n` or `validation_n` are not integers.
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_n = test_n
        self.validation_n = validation_n
        self.seed = seed

    @property
    def test_n(self) -> int:
        """Number of interactions to move to the test set per user."""
        return self._test_n

    @test_n.setter
    def test_n(self, value: int) -> None:
        """
        Sets the number of interactions to move to the test set per user.

        Args:
            value (int): Number of interactions.

        Raises:
            ValueError: If `value` is negative.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("test_n must be greater than or equal to 0.")
        if isinstance(value, float):
            raise TypeError("test_n must be an integer.")
        self._test_n = value

    @property
    def validation_n(self) -> int:
        """Number of interactions to move to the validation set per user."""
        return self._validation_n

    @validation_n.setter
    def validation_n(self, value: int) -> None:
        """
        Sets the number of interactions to move to the validation set per user.

        Args:
            value (int): Number of interactions.

        Raises:
            ValueError: If `value` is negative.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("validation_n must be greater than or equal to 0.")
        if isinstance(value, float):
            raise TypeError("validation_n must be an integer.")
        self._validation_n = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

        For each user, `test_n` interactions are randomly assigned to the test set, and `validation_n`
        interactions are assigned to the validation set. The remaining interactions are used for training.

        Args:
            datarec (DataRec): The dataset to be split.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": DataRec containing the training set.
                - "test": DataRec containing the test set, if `test_n` > 0.
                - "val": DataRec containing the validation set, if `validation_n` > 0.
        """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_n:
                u_train, sample = random_sample(dataframe=u_train, n_samples=self.test_n, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if self.validation_n:
                u_train, sample = random_sample(dataframe=u_train, n_samples=self.validation_n, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`test_n` `property` `writable`

Number of interactions to move to the test set per user.

`validation_n` `property` `writable`

Number of interactions to move to the validation set per user.

`__init__(test_n=0, validation_n=0, seed=42)`

Initializes the LeaveNOut splitter.

Parameters:

Name	Type	Description	Default
`test_n`	`int`	Number of interactions to move to the test set per user. Default is 0.	`0`
`validation_n`	`int`	Number of interactions to move to the validation set per user. Default is 0.	`0`
`seed`	`int`	Random seed for reproducibility. Default is 42.	`42`

Raises:

Type	Description
`ValueError`	If `test_n` or `validation_n` are negative.
`TypeError`	If `test_n` or `validation_n` are not integers.

Source code in datarec/splitters/user_stratified/leave_out.py

def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
    """Initializes the LeaveNOut splitter.

    Args:
        test_n (int, optional): Number of interactions to move to the test set per user. Default is 0.
        validation_n (int, optional): Number of interactions to move to the validation set per user. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_n` or `validation_n` are negative.
        TypeError: If `test_n` or `validation_n` are not integers.
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_n = test_n
    self.validation_n = validation_n
    self.seed = seed

`run(datarec)`

Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

For each user, `test_n` interactions are randomly assigned to the test set, and `validation_n` interactions are assigned to the validation set. The remaining interactions are used for training.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset to be split.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with the following keys: - "train": DataRec containing the training set. - "test": DataRec containing the test set, if `test_n` > 0. - "val": DataRec containing the validation set, if `validation_n` > 0.

Source code in datarec/splitters/user_stratified/leave_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, validation, and test sets using a Leave-N-Out approach.

    For each user, `test_n` interactions are randomly assigned to the test set, and `validation_n`
    interactions are assigned to the validation set. The remaining interactions are used for training.

    Args:
        datarec (DataRec): The dataset to be split.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": DataRec containing the training set.
            - "test": DataRec containing the test set, if `test_n` > 0.
            - "val": DataRec containing the validation set, if `validation_n` > 0.
    """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_n:
            u_train, sample = random_sample(dataframe=u_train, n_samples=self.test_n, seed=self.seed)
            u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if self.validation_n:
            u_train, sample = random_sample(dataframe=u_train, n_samples=self.validation_n, seed=self.seed)
            u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})
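
The per-user loop above can be sketched without pandas as follows. This is an illustration under stated assumptions, not the library's implementation: `leave_n_out` and the dictionary-of-lists layout are hypothetical, and a user with fewer interactions than requested simply contributes fewer samples here, which may differ from what `random_sample` does.

```python
import random
from typing import Dict, List, Tuple

Histories = Dict[str, List[str]]  # user -> interacted items; hypothetical layout


def leave_n_out(histories: Histories, test_n: int = 0, validation_n: int = 0,
                seed: int = 42) -> Tuple[Histories, Histories, Histories]:
    """Illustrative Leave-N-Out: draw test first, then validation, per user."""
    train, test, val = {}, {}, {}
    for user, items in histories.items():
        remaining = list(items)
        random.Random(seed).shuffle(remaining)  # random selection, reproducible via seed
        # Draw the test samples from the full history.
        test[user], remaining = remaining[:test_n], remaining[test_n:]
        # Draw the validation samples from what is left.
        val[user], remaining = remaining[:validation_n], remaining[validation_n:]
        train[user] = remaining
    return train, test, val


histories = {'u1': ['a', 'b', 'c', 'd', 'e'], 'u2': ['a', 'c', 'f']}
train, test, val = leave_n_out(histories, test_n=1, validation_n=1)
print({u: len(v) for u, v in train.items()})  # {'u1': 3, 'u2': 1}
```

Drawing validation samples only from the interactions left after the test draw guarantees that the three splits never overlap for a given user.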

`LeaveOneOut`

Bases: LeaveNOut

Implements the Leave-One-Out splitting strategy for recommendation datasets.

This splitter ensures that for each user, at most one interaction is randomly selected and moved to the test and/or validation set, depending on the specified parameters. The remaining interactions are kept in the training set.

This is a special case of `LeaveNOut`: `test_n=1` when `test` is `True`, and `validation_n=1` when `validation` is `True`; otherwise the corresponding count is 0.

Source code in datarec/splitters/user_stratified/leave_out.py

class LeaveOneOut(LeaveNOut):
    """
    Implements the Leave-One-Out splitting strategy for recommendation datasets.

    This splitter ensures that for each user, at most one interaction is randomly selected and moved
    to the test and/or validation set, depending on the specified parameters. The remaining interactions
    are kept in the training set.

    This is a special case of `LeaveNOut`: `test_n=1` when `test` is `True`, and `validation_n=1` when
    `validation` is `True`; otherwise the corresponding count is 0.
    """

    def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
        """Initializes the LeaveOneOut splitter.

        Args:
            test (bool, optional): Whether to include a test set. Defaults to True.
            validation (bool, optional): Whether to include a validation set. Defaults to True.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            TypeError: If `test` or `validation` is not a boolean.
        """
        if not isinstance(test, bool):
            raise TypeError("test must be a boolean.")
        if not isinstance(validation, bool):
            raise TypeError("validation must be a boolean.")

        test = 1 if test else 0
        validation = 1 if validation else 0

        super().__init__(test_n=test, validation_n=validation, seed=seed)

`__init__(test=True, validation=True, seed=42)`

Initializes the LeaveOneOut splitter.

Parameters:

Name	Type	Description	Default
`test`	`bool`	Whether to include a test set. Defaults to True.	`True`
`validation`	`bool`	Whether to include a validation set. Defaults to True.	`True`
`seed`	`int`	Random seed for reproducibility. Default is 42.	`42`

Raises:

Type	Description
`TypeError`	If `test` or `validation` is not a boolean.

Source code in datarec/splitters/user_stratified/leave_out.py

def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
    """Initializes the LeaveOneOut splitter.

    Args:
        test (bool, optional): Whether to include a test set. Defaults to True.
        validation (bool, optional): Whether to include a validation set. Defaults to True.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        TypeError: If `test` or `validation` is not a boolean.
    """
    if not isinstance(test, bool):
        raise TypeError("test must be a boolean.")
    if not isinstance(validation, bool):
        raise TypeError("validation must be a boolean.")

    test = 1 if test else 0
    validation = 1 if validation else 0

    super().__init__(test_n=test, validation_n=validation, seed=seed)
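
The constructor turns each boolean flag into a per-user count of 0 or 1 before delegating to `LeaveNOut`. Because the check uses `isinstance(..., bool)`, integers such as `1` are rejected even though they are truthy. A small sketch of that mapping, where `flags_to_counts` is a hypothetical helper written only for illustration:

```python
from typing import Dict


def flags_to_counts(test: bool = True, validation: bool = True) -> Dict[str, int]:
    """Hypothetical helper mirroring the checks in LeaveOneOut.__init__."""
    if not isinstance(test, bool):
        raise TypeError("test must be a boolean.")
    if not isinstance(validation, bool):
        raise TypeError("validation must be a boolean.")
    # Each enabled flag becomes a per-user count of exactly one interaction.
    return {'test_n': 1 if test else 0, 'validation_n': 1 if validation else 0}


print(flags_to_counts(test=True, validation=False))  # {'test_n': 1, 'validation_n': 0}

try:
    flags_to_counts(test=1)  # an int, not a bool, so it is rejected
except TypeError as err:
    print(err)  # test must be a boolean.
```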

`LeaveRatioOut`

Bases: Splitter

Splits the dataset into training, test, and validation sets based on a ratio instead of a fixed number of samples.

This splitter selects a fraction of interactions for each user to be assigned to the test and validation sets, ensuring that the splits are proportional to the user's total number of interactions.

Source code in datarec/splitters/user_stratified/leave_out.py

class LeaveRatioOut(Splitter):
    """
    Splits the dataset into training, test, and validation sets based on a ratio instead of a fixed number of samples.

    This splitter selects a fraction of interactions for each user to be assigned to the test and validation sets,
    ensuring that the splits are proportional to the user's total number of interactions.
    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """Initializes the LeaveRatioOut splitter.

        Args:
            test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
            val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
            ValueError: If the sum of `test_ratio` and `val_ratio` exceeds 1.
        """
        if not (0 <= test_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if not (0 <= val_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if test_ratio + val_ratio > 1:
            raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, test, and validation sets based on the specified ratios.

        The interactions of each user are sampled proportionally to create the test and validation sets.
        The remaining interactions are used as the training set.

        Args:
            datarec (DataRec): The dataset containing interactions and user-item relationships.

        Returns:
            (Dict[str, DataRec]): A dictionary containing the following keys:
                - `"train"` (`DataRec`): The training dataset.
                - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
                - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

        Raises:
            ValueError: If an empty dataset is encountered after sampling.
        """

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:
            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            user_total = len(u_train)

            test_n_samples = round(self.test_ratio * user_total)
            val_n_samples = round(self.val_ratio * user_total)

            if test_n_samples > 0:
                u_train, sample = random_sample(dataframe=u_train, n_samples=min(test_n_samples, len(u_train)),
                                                seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if val_n_samples > 0:
                u_train, sample = random_sample(dataframe=u_train, n_samples=min(val_n_samples, len(u_train)),
                                                seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`__init__(test_ratio=0, val_ratio=0, seed=42)`

Initializes the LeaveRatioOut splitter.

Parameters:

Name	Type	Description	Default
`test_ratio`	`float`	Proportion of each user's interactions assigned to the test set. Default is 0.	`0`
`val_ratio`	`float`	Proportion of each user's interactions assigned to the validation set. Default is 0.	`0`
`seed`	`int`	Random seed for reproducibility. Default is 42.	`42`

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` are not in the range [0, 1].
`ValueError`	If the sum of `test_ratio` and `val_ratio` exceeds 1.

Source code in datarec/splitters/user_stratified/leave_out.py

def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """Initializes the LeaveRatioOut splitter.

    Args:
        test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
        val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        ValueError: If the sum of `test_ratio` and `val_ratio` exceeds 1.
    """
    if not (0 <= test_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if not (0 <= val_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if test_ratio + val_ratio > 1:
        raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

`run(datarec)`

Splits the dataset into train, test, and validation sets based on the specified ratios.

The interactions of each user are sampled proportionally to create the test and validation sets. The remaining interactions are used as the training set.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset containing interactions and user-item relationships.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary containing the following keys: - `"train"` (`DataRec`): The training dataset. - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0. - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

Raises:

Type	Description
`ValueError`	If an empty dataset is encountered after sampling.

Source code in datarec/splitters/user_stratified/leave_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, test, and validation sets based on the specified ratios.

    The interactions of each user are sampled proportionally to create the test and validation sets.
    The remaining interactions are used as the training set.

    Args:
        datarec (DataRec): The dataset containing interactions and user-item relationships.

    Returns:
        (Dict[str, DataRec]): A dictionary containing the following keys:
            - `"train"` (`DataRec`): The training dataset.
            - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
            - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

    Raises:
        ValueError: If an empty dataset is encountered after sampling.
    """

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:
        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        user_total = len(u_train)

        test_n_samples = round(self.test_ratio * user_total)
        val_n_samples = round(self.val_ratio * user_total)

        if test_n_samples > 0:
            u_train, sample = random_sample(dataframe=u_train, n_samples=min(test_n_samples, len(u_train)),
                                            seed=self.seed)
            u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if val_n_samples > 0:
            u_train, sample = random_sample(dataframe=u_train, n_samples=min(val_n_samples, len(u_train)),
                                            seed=self.seed)
            u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})

`LeaveNLast`

Bases: Splitter

Splits the dataset by removing the last n interactions per user based on a timestamp column.

This splitter moves each user's `test_n` most recent interactions into the test set, then the next `validation_n` most recent interactions into the validation set, and keeps the remaining interactions in the training set.
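
A minimal sketch of the same idea in plain pandas, assuming a DataFrame with the hypothetical column names `user`, `item`, and `ts`. It is not the library's implementation: `leave_n_last` is an illustrative helper, and ties in the timestamp are resolved by stable sort order rather than by `seed`.

```python
import pandas as pd


def leave_n_last(df: pd.DataFrame, user_col: str, time_col: str,
                 test_n: int = 0, validation_n: int = 0):
    """Illustrative per-user temporal leave-n-out; returns (train, test, val)."""
    # Sort so that, within each user, the last row is the most recent interaction.
    ordered = df.sort_values([user_col, time_col], kind="stable")
    # recency == 0 marks the most recent interaction of each user, 1 the one before, ...
    recency = ordered.groupby(user_col).cumcount(ascending=False)
    test = ordered[recency < test_n]
    val = ordered[(recency >= test_n) & (recency < test_n + validation_n)]
    train = ordered[recency >= test_n + validation_n]
    return train, test, val


df = pd.DataFrame({
    "user": [1, 1, 1, 1, 2, 2],
    "item": [10, 11, 12, 13, 20, 21],
    "ts":   [1, 2, 3, 4, 5, 6],
})
train, test, val = leave_n_last(df, "user", "ts", test_n=1, validation_n=1)
print(test["item"].tolist(), val["item"].tolist(), train["item"].tolist())
```

In this toy data, user 2 has only two interactions, so both are consumed by the test and validation sets and the user does not appear in the training set.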

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

class LeaveNLast(Splitter):
    """
    Splits the dataset by removing the last `n` interactions per user based on a timestamp column.

    This splitter selects the last `test_n` interactions for the test set and the last `validation_n`
    interactions for the validation set while keeping the remaining interactions in the training set.
    """
    def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
        """Initializes the LeaveNLast splitter.

        Args:
            test_n (int, optional): Number of last interactions for the test set. Defaults to 0.
            validation_n (int, optional): Number of last interactions for the validation set. Defaults to 0.
            seed (int, optional): Random seed for reproducibility. Defaults to 42.
        """

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_n = test_n
        self.validation_n = validation_n
        self.seed = seed

    @property
    def test_n(self) -> int:
        """The number of last interactions per user for the test set."""
        return self._test_n

    @test_n.setter
    def test_n(self, value: int) -> None:
        """
        Sets the number of last interactions per user for the test set.

        Args:
            value (int): Number of interactions. Must be >= 0.

        Raises:
            ValueError: If `value` < 0.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("test_n must be greater than or equal to 0.")
        if isinstance(value, float):
            raise TypeError("test_n must be an integer.")
        self._test_n = value

    @property
    def validation_n(self) -> int:
        """The number of last interactions per user for the validation set."""
        return self._validation_n

    @validation_n.setter
    def validation_n(self, value: int) -> None:
        """
        Sets the number of last interactions per user for the validation set.

        Args:
            value (int): Number of interactions. Must be >= 0.

        Raises:
            ValueError: If `value` < 0.
            TypeError: If `value` is not an integer.
        """
        if value < 0:
            raise ValueError("validation_n must be greater than or equal to 0.")
        if isinstance(value, float):
            raise TypeError("validation_n must be an integer.")
        self._validation_n = value

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, test, and validation sets based on the last `n` interactions.

        Args:
            datarec (DataRec): The dataset containing the interactions and timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary with the following keys:
                - "train": The training dataset (`DataRec`).
                - "test": The test dataset (`DataRec`), if `test_n` > 0.
                - "val": The validation dataset (`DataRec`), if `validation_n` > 0.

        Raises:
            TypeError: If the dataset does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:

            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            if self.test_n:
                for _ in range(self.test_n):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if self.validation_n:
                for _ in range(self.validation_n):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`test_n` `property` `writable`

The number of last interactions per user for the test set.

`validation_n` `property` `writable`

The number of last interactions per user for the validation set.
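
In the setters listed in the class source, the sign check runs before the type check and only `float` values are rejected, so a negative float such as `-1.5` raises `ValueError` rather than `TypeError`. The sketch below shows one possible stricter ordering; `check_count` is a hypothetical helper and not part of the library.

```python
def check_count(name: str, value) -> int:
    """Validate a non-negative integer count (illustrative helper, not library code)."""
    # bool is a subclass of int in Python, so it is rejected explicitly here.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer.")
    if value < 0:
        raise ValueError(f"{name} must be greater than or equal to 0.")
    return value


print(check_count("test_n", 3))
```

Checking the type first means every non-integer input fails with the same exception, regardless of its sign.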

`__init__(test_n=0, validation_n=0, seed=42)`

Initializes the LeaveNLast splitter.

Parameters:

Name	Type	Description	Default
`test_n`	`int`	Number of last interactions for the test set.	`0`
`validation_n`	`int`	Number of last interactions for the validation set.	`0`
`seed`	`int`	Random seed for reproducibility.	`42`

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

def __init__(self, test_n: int = 0, validation_n: int = 0, seed: int = 42):
    """Initializes the LeaveNLast splitter.

    Args:
        test_n (int, optional): Number of last interactions for the test set. Defaults to 0.
        validation_n (int, optional): Number of last interactions for the validation set. Defaults to 0.
        seed (int, optional): Random seed for reproducibility. Defaults to 42.
    """

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_n = test_n
    self.validation_n = validation_n
    self.seed = seed

`run(datarec)`

Splits the dataset into train, test, and validation sets based on the last n interactions.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset containing the interactions and timestamp column.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with the keys `"train"` (`DataRec`, the training dataset), `"test"` (`DataRec`, the test dataset, present if `test_n` > 0), and `"val"` (`DataRec`, the validation dataset, present if `validation_n` > 0).

Raises:

Type	Description
`TypeError`	If the dataset does not contain a timestamp column.
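
Because the `"test"` and `"val"` keys are only present when the corresponding split is non-empty, code that consumes the result may prefer `dict.get` over direct indexing. The sketch below uses a hypothetical stand-in, `collect_splits`, that mimics this behavior with plain DataFrames; it assumes, per the `Splitter.output` docstring, that empty splits are omitted from the dictionary.

```python
import pandas as pd


def collect_splits(train: pd.DataFrame, test: pd.DataFrame, validation: pd.DataFrame) -> dict:
    """Keep only the non-empty splits (illustrative stand-in, not library code)."""
    result = {}
    for key, frame in zip(["train", "test", "val"], [train, test, validation]):
        if not frame.empty:
            result[key] = frame
    return result


train = pd.DataFrame({"user": [1, 1], "item": [10, 11], "ts": [1, 2]})
test = pd.DataFrame({"user": [1], "item": [12], "ts": [3]})
# An empty validation frame plays the role of a split requested with a count of 0.
splits = collect_splits(train, test, pd.DataFrame())

val = splits.get("val")  # None when no validation split was produced
n_val = 0 if val is None else len(val)
print(sorted(splits), n_val)
```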

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, test, and validation sets based on the last `n` interactions.

    Args:
        datarec (DataRec): The dataset containing the interactions and timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary with the following keys:
            - "train": The training dataset (`DataRec`).
            - "test": The test dataset (`DataRec`), if `test_n` > 0.
            - "val": The validation dataset (`DataRec`), if `validation_n` > 0.

    Raises:
        TypeError: If the dataset does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:

        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        if self.test_n:
            for _ in range(self.test_n):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if self.validation_n:
            for _ in range(self.validation_n):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})

`LeaveOneLastItem`

Bases: LeaveNLast

Special case of LeaveNLast that removes only the last interaction per user for test and validation.

This class sets `test_n` and `validation_n` to 1 when the corresponding boolean parameter is True, and to 0 otherwise.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

class LeaveOneLastItem(LeaveNLast):
    """
    Special case of LeaveNLast that removes only the last interaction per user for test and validation.

    This class sets `test_n` and `validation_n` to 1 if their corresponding boolean parameters are True.

    """

    def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
        """
        Initializes the LeaveOneLastItem splitter.

        Args:
            test (bool, optional): Whether to remove the last interaction for the test set. Defaults to True.
            validation (bool, optional): Whether to remove the last interaction for the validation set. Defaults to True.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            TypeError: If `test` or `validation` are not boolean.
        """
        if not isinstance(test, bool):
            raise TypeError("test must be a boolean.")
        if not isinstance(validation, bool):
            raise TypeError("validation must be a boolean.")

        test = 1 if test else 0
        validation = 1 if validation else 0

        super().__init__(test_n=test, validation_n=validation, seed=seed)

`__init__(test=True, validation=True, seed=42)`

Initializes the LeaveOneLastItem splitter.

Parameters:

Name	Type	Description	Default
`test`	`bool`	Whether to remove the last interaction for the test set.	`True`
`validation`	`bool`	Whether to remove the last interaction for the validation set.	`True`
`seed`	`int`	Random seed for reproducibility.	`42`

Raises:

Type	Description
`TypeError`	If `test` or `validation` are not boolean.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

def __init__(self, test: bool = True, validation: bool = True, seed: int = 42):
    """
    Initializes the LeaveOneLastItem splitter.

    Args:
        test (bool, optional): Whether to remove the last interaction for the test set. Defaults to True.
        validation (bool, optional): Whether to remove the last interaction for the validation set. Defaults to True.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        TypeError: If `test` or `validation` are not boolean.
    """
    if not isinstance(test, bool):
        raise TypeError("test must be a boolean.")
    if not isinstance(validation, bool):
        raise TypeError("validation must be a boolean.")

    test = 1 if test else 0
    validation = 1 if validation else 0

    super().__init__(test_n=test, validation_n=validation, seed=seed)

`LeaveRatioLast`

Bases: Splitter

Splits the dataset into training, test, and validation sets by selecting the most recent interactions for each user based on a specified ratio.

Unlike LeaveNLast, which selects a fixed number of interactions, this splitter chooses a fraction of the total interactions per user, preserving temporal order.
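
The ratio-based variant can be sketched in plain pandas as follows, assuming a DataFrame with the hypothetical column names `user`, `item`, and `ts`. This is an illustration rather than the library's implementation: `leave_ratio_last` is a made-up helper, per-user counts are computed with Python's `round` as in the `run` source, and timestamp ties are resolved by stable sort order rather than by `seed`.

```python
import pandas as pd


def leave_ratio_last(df: pd.DataFrame, user_col: str, time_col: str,
                     test_ratio: float = 0.0, val_ratio: float = 0.0) -> dict:
    """Illustrative per-user temporal split by ratio; returns a dict of DataFrames."""
    ordered = df.sort_values([user_col, time_col], kind="stable")
    # recency == 0 marks the most recent interaction of each user.
    recency = ordered.groupby(user_col).cumcount(ascending=False)
    # Number of interactions of the user that owns each row.
    counts = ordered[user_col].map(ordered[user_col].value_counts())
    test_n = counts.map(lambda n: round(test_ratio * int(n)))
    val_n = counts.map(lambda n: round(val_ratio * int(n)))
    return {
        "train": ordered[recency >= test_n + val_n],
        "test": ordered[recency < test_n],
        "val": ordered[(recency >= test_n) & (recency < test_n + val_n)],
    }


df = pd.DataFrame({
    "user": [1] * 10 + [2] * 4,
    "item": list(range(14)),
    "ts": list(range(1, 11)) + [1, 2, 3, 4],
})
splits = leave_ratio_last(df, "user", "ts", test_ratio=0.2, val_ratio=0.1)
print({name: len(frame) for name, frame in splits.items()})
```

Here the user with 10 interactions contributes 2 test rows and 1 validation row, whereas the user with 4 interactions contributes 1 test row and no validation row, which shows how the held-out count scales with each user's history.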

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

class LeaveRatioLast(Splitter):
    """
    Splits the dataset into training, test, and validation sets by selecting the most recent interactions
    for each user based on a specified ratio.

    Unlike `LeaveNLast`, which selects a fixed number of interactions, this splitter chooses a fraction
    of the total interactions per user, preserving temporal order.

    """

    def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
        """
        Args:
            test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
            val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
            seed (int, optional): Random seed for reproducibility. Default is 42.

        Raises:
            ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
            ValueError: If `test_ratio + val_ratio` > 1.
        """
        if not (0 <= test_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if not (0 <= val_ratio <= 1):
            raise ValueError('ratio must be between 0 and 1')
        if test_ratio + val_ratio > 1:
            raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

        self.params = {k: v for k, v in locals().items() if k != 'self'}

        self.test_ratio = test_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    def run(self, datarec: DataRec) -> Dict[str, DataRec]:
        """
        Splits the dataset into train, test, and validation sets by selecting the last interactions
        (in chronological order) for each user.

        The most recent interactions are removed first for the test set, then for the validation set,
        leaving the remaining interactions for training.

        Args:
            datarec (DataRec): The dataset containing interactions with a timestamp column.

        Returns:
            (Dict[str, DataRec]): A dictionary containing the following keys:
                - `"train"` (`DataRec`): The training dataset.
                - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
                - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

        Raises:
            TypeError: If the dataset does not contain a timestamp column.
        """

        if datarec.timestamp_col is None:
            raise TypeError('This DataRec does not contain temporal information')

        train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

        data = datarec.data
        for u in datarec.users:
            u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

            user_total = len(u_train)

            test_n_samples = round(self.test_ratio * user_total)
            val_n_samples = round(self.val_ratio * user_total)

            if test_n_samples > 0:
                for _ in range(min(test_n_samples, len(u_train))):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

            if val_n_samples > 0:
                for _ in range(min(val_n_samples, len(u_train))):
                    u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                    u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

            train = pd.concat([train, u_train], axis=0, ignore_index=True)
            test = pd.concat([test, u_test], axis=0, ignore_index=True)
            val = pd.concat([val, u_val], axis=0, ignore_index=True)

        return self.output(datarec, train, test, val,
                           step_info={'operation': self.__class__.__name__, 'params': self.params})

`__init__(test_ratio=0, val_ratio=0, seed=42)`

Parameters:

Name	Type	Description	Default
`test_ratio`	`float`	Proportion of each user's interactions assigned to the test set.	`0`
`val_ratio`	`float`	Proportion of each user's interactions assigned to the validation set.	`0`
`seed`	`int`	Random seed for reproducibility.	`42`

Raises:

Type	Description
`ValueError`	If `test_ratio` or `val_ratio` are not in the range [0, 1].
`ValueError`	If `test_ratio + val_ratio` > 1.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

def __init__(self, test_ratio: float = 0, val_ratio: float = 0, seed: int = 42):
    """
    Args:
        test_ratio (float, optional): Proportion of each user's interactions assigned to the test set. Default is 0.
        val_ratio (float, optional): Proportion of each user's interactions assigned to the validation set. Default is 0.
        seed (int, optional): Random seed for reproducibility. Default is 42.

    Raises:
        ValueError: If `test_ratio` or `val_ratio` are not in the range [0, 1].
        ValueError: If `test_ratio + val_ratio` > 1.
    """
    if not (0 <= test_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if not (0 <= val_ratio <= 1):
        raise ValueError('ratio must be between 0 and 1')
    if test_ratio + val_ratio > 1:
        raise ValueError("sum of test_ratio and val_ratio must not exceed 1")

    self.params = {k: v for k, v in locals().items() if k != 'self'}

    self.test_ratio = test_ratio
    self.val_ratio = val_ratio
    self.seed = seed

`run(datarec)`

Splits the dataset into train, test, and validation sets by selecting the last interactions (in chronological order) for each user.

The most recent interactions are removed first for the test set, then for the validation set, leaving the remaining interactions for training.

Parameters:

Name	Type	Description	Default
`datarec`	`DataRec`	The dataset containing interactions with a timestamp column.	required

Returns:

Type	Description
`Dict[str, DataRec]`	A dictionary with the keys `"train"` (`DataRec`, the training dataset), `"test"` (`DataRec`, the test dataset, present if `test_ratio` > 0), and `"val"` (`DataRec`, the validation dataset, present if `val_ratio` > 0).

Raises:

Type	Description
`TypeError`	If the dataset does not contain a timestamp column.

Source code in datarec/splitters/user_stratified/temporal/leave_out.py

def run(self, datarec: DataRec) -> Dict[str, DataRec]:
    """
    Splits the dataset into train, test, and validation sets by selecting the last interactions
    (in chronological order) for each user.

    The most recent interactions are removed first for the test set, then for the validation set,
    leaving the remaining interactions for training.

    Args:
        datarec (DataRec): The dataset containing interactions with a timestamp column.

    Returns:
        (Dict[str, DataRec]): A dictionary containing the following keys:
            - `"train"` (`DataRec`): The training dataset.
            - `"test"` (`DataRec`): The test dataset, if `test_ratio` > 0.
            - `"val"` (`DataRec`): The validation dataset, if `val_ratio` > 0.

    Raises:
        TypeError: If the dataset does not contain a timestamp column.
    """

    if datarec.timestamp_col is None:
        raise TypeError('This DataRec does not contain temporal information')

    train, test, val = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    data = datarec.data
    for u in datarec.users:
        u_train, u_val, u_test = data[data.iloc[:, 0] == u], pd.DataFrame(), pd.DataFrame()

        user_total = len(u_train)

        test_n_samples = round(self.test_ratio * user_total)
        val_n_samples = round(self.val_ratio * user_total)

        if test_n_samples > 0:
            for _ in range(min(test_n_samples, len(u_train))):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_test = pd.concat([u_test, sample], axis=0, ignore_index=True)

        if val_n_samples > 0:
            for _ in range(min(val_n_samples, len(u_train))):
                u_train, sample = max_by_col(u_train, datarec.timestamp_col, seed=self.seed)
                u_val = pd.concat([u_val, sample], axis=0, ignore_index=True)

        train = pd.concat([train, u_train], axis=0, ignore_index=True)
        test = pd.concat([test, u_test], axis=0, ignore_index=True)
        val = pd.concat([val, u_val], axis=0, ignore_index=True)

    return self.output(datarec, train, test, val,
                       step_info={'operation': self.__class__.__name__, 'params': self.params})

Splitters Module Reference

Core Splitting Utilities

Splitter

output(datarec, train, test, validation, step_info) staticmethod

random_sample(dataframe, seed, n_samples=1)

max_by_col(dataframe, discriminative_column, seed)

temporal_holdout(dataframe, test_ratio, val_ratio, temporal_col)

Uniform Splitting Strategies

RandomHoldOut

test_ratio property writable

val_ratio property writable

__init__(test_ratio=0, val_ratio=0, seed=42)

run(datarec)

TemporalHoldOut

test_ratio property writable

val_ratio property writable

__init__(test_ratio=0, val_ratio=0)

run(datarec)

TemporalThresholdSplit

__init__(val_threshold, test_threshold)

run(datarec)

User-Stratified Splitting Strategies

UserStratifiedHoldOut

test_ratio property writable

val_ratio property writable

__init__(test_ratio=0, val_ratio=0, seed=42)

run(datarec)

LeaveNOut

test_n property writable

validation_n property writable

__init__(test_n=0, validation_n=0, seed=42)

run(datarec)

LeaveOneOut

__init__(test=True, validation=True, seed=42)

LeaveRatioOut

__init__(test_ratio=0, val_ratio=0, seed=42)

run(datarec)

LeaveNLast

test_n property writable

validation_n property writable

__init__(test_n=0, validation_n=0, seed=42)

run(datarec)

LeaveOneLastItem

__init__(test=True, validation=True, seed=42)

LeaveRatioLast

__init__(test_ratio=0, val_ratio=0, seed=42)

run(datarec)
