Datasets Reference

This section provides a detailed API reference for all modules related to built-in datasets in the datarec library.

Download Utilities

Provides utility functions for downloading and decompressing dataset files.

This module contains a set of helper functions used internally by the dataset builder classes (e.g., MovieLens1M, Yelp_v1) to handle the fetching of raw data from web sources and the extraction of various archive formats like .zip, .gz, .tar, and .7z.

These functions are not typically called directly by the end-user but are fundamental to the automatic data preparation process of the library.
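
For illustration, a builder typically chains a download helper with a decompression helper roughly as follows (a minimal sketch; the URL and paths are placeholders, not a real dataset source):

>>> from datarec.datasets.download import download_file, decompress_gz
>>> archive = download_file('https://example.org/ratings.csv.gz', '/tmp/ratings.csv.gz')
>>> if archive is not None:
...     decompress_gz(archive, '/tmp/ratings.csv')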

download_url(url, local_filepath)

Downloads a file from a URL and saves it to a local path.

Note: This is a basic downloader. For large files or more robust handling, download_file is generally preferred within this library.

Parameters:

Name Type Description Default
url str

The URL of the file to download.

required
local_filepath str

The local path where the file will be saved.

required

Raises:

Type Description
HTTPError

If the HTTP request returned an unsuccessful status code.

Source code in datarec/datasets/download.py
def download_url(url, local_filepath) -> None:
    """
    Downloads a file from a URL and saves it to a local path.

    Note: This is a basic downloader. For large files or more robust handling,
    `download_file` is generally preferred within this library.

    Args:
        url (str): The URL of the file to download.
        local_filepath (str): The local path where the file will be saved.

    Raises:
        requests.exceptions.HTTPError: If the HTTP request returned an
            unsuccessful status code.
    """
    r = requests.get(url)
    r.raise_for_status()
    with open(local_filepath, 'wb') as file:
        with tqdm(unit='byte', unit_scale=True) as progress_bar:
            for chunk in r.iter_content(chunk_size=1024):
                file.write(chunk)
                progress_bar.update(len(chunk))
                print(f"File downloaded successfully and saved as {local_filepath}")

download_file(url, local_filepath, size=None)

Downloads a file by streaming its content, with a progress bar.

This is the primary download function used for most datasets. It streams the response, making it suitable for large files. It attempts to infer the file size from response headers for the progress bar, but an expected size can also be provided.

Parameters:

Name Type Description Default
url str

The URL of the file to download.

required
local_filepath str

The local path where the file will be saved.

required
size int

The expected file size in bytes. Used for the progress bar if the 'Content-Length' header is not available. Defaults to None.

None

Returns:

Type Description
str

The local file path if the download was successful, otherwise None.
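
Example (a minimal sketch with a placeholder URL; the explicit size is only used as a fallback when the server does not report 'Content-Length'):

>>> path = download_file('https://example.org/data.csv.gz', '/tmp/data.csv.gz', size=10_000_000)
>>> if path is None:
...     print('download failed')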

Source code in datarec/datasets/download.py
def download_file(url, local_filepath, size=None):
    """
    Downloads a file by streaming its content, with a progress bar.

    This is the primary download function used for most datasets. It streams the
    response, making it suitable for large files. It attempts to infer the file
    size from response headers for the progress bar, but an expected size can also
    be provided.

    Args:
        url (str): The URL of the file to download.
        local_filepath (str): The local path where the file will be saved.
        size (int, optional): The expected file size in bytes. Used for the
            progress bar if the 'Content-Length' header is not available.
            Defaults to None.

    Returns:
        (str): The local file path if the download was successful, otherwise None.
    """
    # Make a GET request to the URL
    response = requests.get(url, stream=True)
    # Check if the request was successful
    if response.status_code == 200:
        # try to infer the total size from the response headers; if the
        # 'Content-Length' header is missing, keep the caller-supplied size
        content_length = response.headers.get('Content-Length')
        if content_length is not None:
            size = int(content_length)
        # Save the response content to a file
        with open(local_filepath, 'wb') as f:
            with tqdm(unit='byte', unit_scale=True, total=size) as progress_bar:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
                    progress_bar.update(len(chunk))
            print(f"File downloaded successfully and saved at \'{local_filepath}\'")
        return local_filepath
    else:
        print(f"Failed to download the file. Response status code: {response.status_code}")
        return None

download_browser(url, local_filepath, headers=None, chunk_size=8192)

Downloads a file by mimicking a web browser request.

This function is used for sources that may block simple scripted requests. It includes a default 'User-Agent' header to appear as a standard browser, which is necessary for some datasets (e.g., Yelp).

Parameters:

Name Type Description Default
url str

The URL of the file to download.

required
local_filepath str

The local path where the file will be saved.

required
headers dict

Custom headers to use for the request. If None, a default browser User-Agent is used. Defaults to None.

None
chunk_size int

The size of chunks to download in bytes. Defaults to 8192.

8192

Returns:

Type Description
str

The local file path if the download was successful, otherwise None.
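
Example (placeholder URL; pass headers only when the default browser User-Agent is not sufficient):

>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
>>> download_browser('https://example.org/archive.tar', '/tmp/archive.tar', headers=headers)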

Source code in datarec/datasets/download.py
def download_browser(url, local_filepath, headers=None, chunk_size=8192):
    """
    Downloads a file by mimicking a web browser request.

    This function is used for sources that may block simple scripted requests.
    It includes a default 'User-Agent' header to appear as a standard browser,
    which is necessary for some datasets (e.g., Yelp).

    Args:
        url (str): The URL of the file to download.
        local_filepath (str): The local path where the file will be saved.
        headers (dict, optional): Custom headers to use for the request. If None,
            a default browser User-Agent is used. Defaults to None.
        chunk_size (int, optional): The size of chunks to download in bytes.
            Defaults to 8192.

    Returns:
        (str): The local file path if the download was successful, otherwise None.
    """
    # Default headers, or a custom one if provided
    if headers is None:
        headers = {'User-Agent': 'Mozilla/5.0...'}
    response = requests.get(url, stream=True, headers=headers)
    if response.status_code == 200:
        total_size = int(response.headers.get('Content-Length', 0))
        with open(local_filepath, 'wb') as f:
            with tqdm(total=total_size, unit='iB', unit_scale=True) as progress_bar:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    if chunk:
                        f.write(chunk)
                        progress_bar.update(len(chunk))
        print(f"Downloaded to '{local_filepath}'")
        return local_filepath
    return None

decompress_gz(input_file, output_file)

Decompresses a .gz file.

Parameters:

Name Type Description Default
input_file str

The path to the input .gz file.

required
output_file str

The path where the decompressed file will be saved.

required

Returns:

Type Description
str

The path to the decompressed output file.

Source code in datarec/datasets/download.py
def decompress_gz(input_file, output_file):
    """
    Decompresses a .gz file.

    Args:
        input_file (str): The path to the input .gz file.
        output_file (str): The path where the decompressed file will be saved.

    Returns:
        (str): The path to the decompressed output file.
    """
    print(f'Decompress: \'{input_file}\'')
    with gzip.open(input_file, 'rb') as f_in:
        with open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    print(f'File decompressed: \'{output_file}\'')
    return output_file

decompress_tar_file(input_file, output_dir)

Decompresses a .tar archive.

Parameters:

Name Type Description Default
input_file str

The path to the input .tar file.

required
output_dir str

The directory where the contents will be extracted.

required

Returns:

Type Description
list

A list of the names of the extracted files and directories.

Source code in datarec/datasets/download.py
def decompress_tar_file(input_file, output_dir):
    """
    Decompresses a .tar archive.

    Args:
        input_file (str): The path to the input .tar file.
        output_dir (str): The directory where the contents will be extracted.

    Returns:
        (list): A list of the names of the extracted files and directories.
    """

    print(f'Decompress: \'{input_file}\'')
    with tarfile.open(input_file, 'r') as tar:
        tar.extractall(path=output_dir)

        print(f'File decompressed in \'{output_dir}\'')
    return os.listdir(output_dir)

decompress_zip_file(input_file, output_dir, allowZip64=False)

Decompresses a .zip archive.

Parameters:

Name Type Description Default
input_file str

The path to the input .zip file.

required
output_dir str

The directory where the contents will be extracted.

required
allowZip64 bool

Whether to allow the Zip64 extension (required for archives larger than 4 GB). Defaults to False, but should be True for large files.

False

Returns:

Type Description
list

A list of the names of the extracted files and directories.
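
Example (placeholder paths; enable allowZip64 when the archive may exceed the standard ZIP size limit):

>>> names = decompress_zip_file('/tmp/data.zip', '/tmp/extracted', allowZip64=True)
>>> names  # names of the extracted files and directories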

Source code in datarec/datasets/download.py
def decompress_zip_file(input_file, output_dir, allowZip64=False):
    """
    Decompresses a .zip archive.

    Args:
        input_file (str): The path to the input .zip file.
        output_dir (str): The directory where the contents will be extracted.
        allowZip64 (bool): Whether to allow the Zip64 extension (required for
            archives larger than 4 GB). Defaults to False, but should be True for large files.

    Returns:
        (list): A list of the names of the extracted files and directories.
    """
    with zipfile.ZipFile(input_file, 'r', allowZip64=allowZip64) as zip_ref:
        zip_ref.extractall(output_dir)
        print(f'File decompressed in \'{output_dir}\'')
    return os.listdir(output_dir)

decompress_7z_file(input_file, output_dir)

Decompresses a .7z archive.

This function is used for datasets distributed in the 7-Zip format, such as the Alibaba-iFashion dataset.

Parameters:

Name Type Description Default
input_file str

The path to the input .7z file.

required
output_dir str

The directory where the contents will be extracted.

required

Returns:

Type Description
str

The path to the output directory.
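
Example (placeholder paths; requires the py7zr package to be installed):

>>> out_dir = decompress_7z_file('/tmp/user_data.7z', '/tmp/extracted')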

Source code in datarec/datasets/download.py
def decompress_7z_file(input_file, output_dir):
    """
    Decompresses a .7z archive.

    This function is used for datasets distributed in the 7-Zip format, such as
    the Alibaba-iFashion dataset.

    Args:
        input_file (str): The path to the input .7z file.
        output_dir (str): The directory where the contents will be extracted.

    Returns:
        (str): The path to the output directory.
    """
    print(f"Decompressing: {input_file}")
    with py7zr.SevenZipFile(input_file, mode='r') as archive:
        archive.extractall(path=output_dir)
    print(f"File decompressed in '{output_dir}'")
    return output_dir

Alibaba-iFashion

Entry point for loading different versions of the Alibaba-iFashion dataset.

AlibabaIFashion

Entry point class to load various versions of the Alibaba-iFashion dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

The default version is 'latest', which currently corresponds to 'v1'.

Examples:

To load the latest version:

>>> data_loader = AlibabaIFashion()

To load a specific version:

>>> data_loader = AlibabaIFashion(version='v1')
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion.py
class AlibabaIFashion:
    """
    Entry point class to load various versions of the Alibaba-iFashion dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    The default version is 'latest', which currently corresponds to 'v1'.

    Examples:
        To load the latest version:
        >>> data_loader = AlibabaIFashion()

        To load a specific version:
        >>> data_loader = AlibabaIFashion(version='v1')
    """
    latest_version = 'v1'

    def __new__(cls, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the Alibaba-iFashion dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                'v1' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments.

        Returns:
            (AlibabaIFashion_V1): An instance of the dataset builder class, ready to be used.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'v1': AlibabaIFashion_V1}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Alibaba iFashion: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Alibaba-iFashion dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments.

{}

Returns:

Type Description
AlibabaIFashion_V1

An instance of the dataset builder class, ready to be used.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion.py
def __new__(cls, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the Alibaba-iFashion dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            'v1' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments.

    Returns:
        (AlibabaIFashion_V1): An instance of the dataset builder class, ready to be used.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'v1': AlibabaIFashion_V1}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Alibaba iFashion: Unsupported version")

Builder class for version 'v1' of the Alibaba-iFashion dataset.

AlibabaIFashion_V1

Bases: DataRec

Builder class for the Alibaba-iFashion dataset (KDD 2019 version).

This class handles the logic for downloading, preparing, and loading the Alibaba-iFashion dataset. It is not typically instantiated directly but is called by the AlibabaIFashion entry point class.

The dataset was released for the paper "POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion". It contains user-item interactions, item metadata, and outfit compositions. This loader focuses on processing the user-item interaction data from user_data.txt.

Attributes:

Name Type Description
item_data_url str

The URL for the item metadata file.

outfit_data_url str

The URL for the outfit composition file.

user_data_url str

The URL for the user-item interaction file.

CHECKSUM_ITEM str

MD5 checksum for the compressed item data archive.

CHECKSUM_USER str

MD5 checksum for the compressed user data archive.

CHECKSUM_OUTFIT str

MD5 checksum for the compressed outfit data archive.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
class AlibabaIFashion_V1(DataRec):
    """
    Builder class for the Alibaba-iFashion dataset (KDD 2019 version).

    This class handles the logic for downloading, preparing, and loading the
    Alibaba-iFashion dataset. It is not typically instantiated directly but is
    called by the `AlibabaIFashion` entry point class.

    The dataset was released for the paper "POG: Personalized Outfit Generation
    for Fashion Recommendation at Alibaba iFashion". It contains user-item interactions,
    item metadata, and outfit compositions. This loader focuses on processing the
    user-item interaction data from `user_data.txt`.

    Attributes:
        item_data_url (str): The URL for the item metadata file.
        outfit_data_url (str): The URL for the outfit composition file.
        user_data_url (str): The URL for the user-item interaction file.
        CHECKSUM_ITEM (str): MD5 checksum for the compressed item data archive.
        CHECKSUM_USER (str): MD5 checksum for the compressed user data archive.
        CHECKSUM_OUTFIT (str): MD5 checksum for the compressed outfit data archive.
    """
    item_data_url = 'https://drive.google.com/uc?id=17MAGl20_mf9V8j0-J6c7T3ayfZd-dIx8'
    outfit_data_url = 'https://drive.google.com/uc?id=1HFKUqBe5oMizU0lxy6sQE5Er1w9x-cC4'
    user_data_url = 'https://drive.google.com/uc?id=1G_1SV9H7fQMPPJOBmZpCnCkgifSsb9Ar'

    compressed_item_file_name = 'item_data.txt.zip'
    compressed_outfit_file_name = 'outfit_data.txt.zip'
    compressed_user_file_name = 'user_data.7z'

    data_file_name = 'alibaba_ifashion'

    uncompressed_item_file_name = 'item_data.txt'
    uncompressed_outfit_file_name = 'outfit_data.txt'
    uncompressed_user_file_name = 'user_data.txt'

    REQUIRED_FILES = [uncompressed_item_file_name, uncompressed_outfit_file_name, uncompressed_user_file_name]
    CHECKSUM_ITEM = 'f501244e784ae33defb71b3478d1125c'
    CHECKSUM_USER = '2ff9254d67fb13d04824621ca1387622'
    CHECKSUM_OUTFIT = 'f24078606235c122bd1d1c988766e83f'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """

        super().__init__(user=True, item=True, rating='implicit')

        self.dataset_name = 'alibaba_ifashion'
        self.version_name = 'v1'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        rq_found, rq_missing = self.required_files()
        # download and decompress the required files that are missing
        rq_found, rq_missing = self.download(found=rq_found, missing=rq_missing)
        assert len(rq_found) == len(self.REQUIRED_FILES) and len(rq_missing) == 0

        data_path = None
        for p, n in rq_found:
            if n == self.uncompressed_user_file_name:
                data_path = p
                break
        assert data_path is not None, 'User data file not found'

        self.path = self.process(data_path=data_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data files.

        Returns:
            (tuple[list, list]): A tuple where the first element is a list of
                found files and the second is a list of missing files. Each
                item in the lists is a tuple of (path, filename).
        """
        # check if the file is there
        req_files = [(os.path.join(self._raw_folder, f), f) for f in self.REQUIRED_FILES]
        found, missing = [], []
        # required file path, required file name
        for rfp, rfn in req_files:
            if os.path.isfile(rfp):
                found.append((rfp, rfn))
                print(f'Required file \'{rfn}\' found')
            else:
                missing.append((rfp, rfn))
        return found, missing

    def download_item_data(self):
        """
        Downloads, verifies, and decompresses the item data file.

        Returns:
            (str): The path to the decompressed item data file.
        """
        file_path = os.path.join(self._raw_folder, self.compressed_item_file_name)
        print(f'Downloading {self.dataset_name} item data...')
        gdown.download(self.item_data_url, file_path, quiet=False)
        print(f'{self.dataset_name} item data downloaded at \'{file_path}\'')
        verify_checksum(file_path, self.CHECKSUM_ITEM)
        print('Decompressing zip file...')
        decompress_zip_file(file_path, self._raw_folder)
        print('Deleting zip file...')
        os.remove(file_path)
        return os.path.join(self._raw_folder, self.uncompressed_item_file_name)

    def download_outfit_data(self):
        """
        Downloads, verifies, and decompresses the outfit data file.

        Returns:
            (str): The path to the decompressed outfit data file.
        """
        file_path = os.path.join(self._raw_folder, self.compressed_outfit_file_name)
        print(f'Downloading {self.dataset_name} outfit data...')
        gdown.download(self.outfit_data_url, file_path, quiet=False)
        print(f'{self.dataset_name} outfit data downloaded at \'{file_path}\'')
        verify_checksum(file_path, self.CHECKSUM_OUTFIT)
        print('Decompressing zip file...')
        decompress_zip_file(file_path, self._raw_folder)
        print('Deleting zip file...')
        os.remove(file_path)
        return os.path.join(self._raw_folder, self.uncompressed_outfit_file_name)

    def download_user_data(self):
        """
        Downloads, verifies, and decompresses the user interaction data file.

        Returns:
            (str): The path to the decompressed user data file.
        """
        file_path = os.path.join(self._raw_folder, self.compressed_user_file_name)
        print(f'Downloading {self.dataset_name} user data...')
        gdown.download(self.user_data_url, file_path, quiet=False)
        print(f'{self.dataset_name} user data downloaded at \'{file_path}\'')
        verify_checksum(file_path, self.CHECKSUM_USER)
        print('Decompressing 7z file...')
        decompress_7z_file(file_path, self._raw_folder)
        print('Deleting 7z file...')
        os.remove(file_path)
        return os.path.join(self._raw_folder, self.uncompressed_user_file_name)

    def download(self, found: list, missing: list) -> (str, str):
        """
        Downloads all missing files for the dataset.

        Iterates through the list of missing files and calls the appropriate download helper function for each one.

        Args:
            found (list): A list of file tuples that were already found locally.
            missing (list): A list of file tuples that need to be downloaded.

        Returns:
            (tuple[list, list]): The updated lists of found and missing files after
                the download and verification process.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created raw files folder at \'{}\''.format(self._raw_folder))

        downloaded = []
        for required_file in missing:
            file_path, file_name = required_file

            if file_name == 'item_data.txt':
                self.download_item_data()
                downloaded.append(required_file)
            elif file_name == 'outfit_data.txt':
                self.download_outfit_data()
                downloaded.append(required_file)
            elif file_name == 'user_data.txt':
                self.download_user_data()
                downloaded.append(required_file)
            else:
                warnings.warn(f'You are trying to download a file that is not required for {self.dataset_name}.'
                              f' \n The file will not be downloaded.', UserWarning)

        for required_file in downloaded:
            missing.remove(required_file)
            found.append(required_file)
        return found, missing

    def process(self, data_path) -> None:
        """
        Processes the raw user interaction data and loads it into the class.

        The user interaction data is in an 'inline' format, where each line
        contains a user followed by a semicolon-separated list of their item
        interactions. This method uses `read_inline` to parse this format into
        a standard user-item pair DataFrame.

        Args:
            data_path (str): The path to the raw `user_data.txt` file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        from datarec.io.readers import read_inline
        self.data = read_inline(data_path, cols=['user', 'item', 'outfit'],
                                user_col='user', item_col='item',
                                col_sep=',', history_sep=';')
        return None

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """

    super().__init__(user=True, item=True, rating='implicit')

    self.dataset_name = 'alibaba_ifashion'
    self.version_name = 'v1'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    rq_found, rq_missing = self.required_files()
    # download and decompress the required files that are missing
    rq_found, rq_missing = self.download(found=rq_found, missing=rq_missing)
    assert len(rq_found) == len(self.REQUIRED_FILES) and len(rq_missing) == 0

    data_path = None
    for p, n in rq_found:
        if n == self.uncompressed_user_file_name:
            data_path = p
            break
    assert data_path is not None, 'User data file not found'

    self.path = self.process(data_path=data_path)

required_files()

Checks for the presence of the required decompressed data files.

Returns:

Type Description
tuple[list, list]

A tuple where the first element is a list of found files and the second is a list of missing files. Each item in the lists is a tuple of (path, filename).

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data files.

    Returns:
        (tuple[list, list]): A tuple where the first element is a list of
            found files and the second is a list of missing files. Each
            item in the lists is a tuple of (path, filename).
    """
    # check if the file is there
    req_files = [(os.path.join(self._raw_folder, f), f) for f in self.REQUIRED_FILES]
    found, missing = [], []
    # required file path, required file name
    for rfp, rfn in req_files:
        if os.path.isfile(rfp):
            found.append((rfp, rfn))
            print(f'Required file \'{rfn}\' found')
        else:
            missing.append((rfp, rfn))
    return found, missing

download_item_data()

Downloads, verifies, and decompresses the item data file.

Returns:

Type Description
str

The path to the decompressed item data file.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def download_item_data(self):
    """
    Downloads, verifies, and decompresses the item data file.

    Returns:
        (str): The path to the decompressed item data file.
    """
    file_path = os.path.join(self._raw_folder, self.compressed_item_file_name)
    print(f'Downloading {self.dataset_name} item data...')
    gdown.download(self.item_data_url, file_path, quiet=False)
    print(f'{self.dataset_name} item data downloaded at \'{file_path}\'')
    verify_checksum(file_path, self.CHECKSUM_ITEM)
    print('Decompressing zip file...')
    decompress_zip_file(file_path, self._raw_folder)
    print('Deleting zip file...')
    os.remove(file_path)
    return os.path.join(self._raw_folder, self.uncompressed_item_file_name)

download_outfit_data()

Downloads, verifies, and decompresses the outfit data file.

Returns:

Type Description
str

The path to the decompressed outfit data file.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def download_outfit_data(self):
    """
    Downloads, verifies, and decompresses the outfit data file.

    Returns:
        (str): The path to the decompressed outfit data file.
    """
    file_path = os.path.join(self._raw_folder, self.compressed_outfit_file_name)
    print(f'Downloading {self.dataset_name} outfit data...')
    gdown.download(self.outfit_data_url, file_path, quiet=False)
    print(f'{self.dataset_name} outfit data downloaded at \'{file_path}\'')
    verify_checksum(file_path, self.CHECKSUM_OUTFIT)
    print('Decompressing zip file...')
    decompress_zip_file(file_path, self._raw_folder)
    print('Deleting zip file...')
    os.remove(file_path)
    return os.path.join(self._raw_folder, self.uncompressed_outfit_file_name)

download_user_data()

Downloads, verifies, and decompresses the user interaction data file.

Returns:

Type Description
str

The path to the decompressed user data file.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def download_user_data(self):
    """
    Downloads, verifies, and decompresses the user interaction data file.

    Returns:
        (str): The path to the decompressed user data file.
    """
    file_path = os.path.join(self._raw_folder, self.compressed_user_file_name)
    print(f'Downloading {self.dataset_name} user data...')
    gdown.download(self.user_data_url, file_path, quiet=False)
    print(f'{self.dataset_name} user data downloaded at \'{file_path}\'')
    verify_checksum(file_path, self.CHECKSUM_USER)
    print('Decompressing 7z file...')
    decompress_7z_file(file_path, self._raw_folder)
    print('Deleting 7z file...')
    os.remove(file_path)
    return os.path.join(self._raw_folder, self.uncompressed_user_file_name)

download(found, missing)

Downloads all missing files for the dataset.

Iterates through the list of missing files and calls the appropriate download helper function for each one.

Parameters:

Name Type Description Default
found list

A list of file tuples that were already found locally.

required
missing list

A list of file tuples that need to be downloaded.

required

Returns:

Type Description
tuple[list, list]

The updated lists of found and missing files after the download and verification process.

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def download(self, found: list, missing: list) -> (str, str):
    """
    Downloads all missing files for the dataset.

    Iterates through the list of missing files and calls the appropriate download helper function for each one.

    Args:
        found (list): A list of file tuples that were already found locally.
        missing (list): A list of file tuples that need to be downloaded.

    Returns:
        (tuple[list, list]): The updated lists of found and missing files after
            the download and verification process.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created raw files folder at \'{}\''.format(self._raw_folder))

    downloaded = []
    for required_file in missing:
        file_path, file_name = required_file

        if file_name == 'item_data.txt':
            self.download_item_data()
            downloaded.append(required_file)
        elif file_name == 'outfit_data.txt':
            self.download_outfit_data()
            downloaded.append(required_file)
        elif file_name == 'user_data.txt':
            self.download_user_data()
            downloaded.append(required_file)
        else:
            warnings.warn(f'You are trying to download a file that is not required for {self.dataset_name}.'
                          f' \n The file will not be downloaded.', UserWarning)

    for required_file in downloaded:
        missing.remove(required_file)
        found.append(required_file)
    return found, missing

process(data_path)

Processes the raw user interaction data and loads it into the class.

The user interaction data is in an 'inline' format, where each line contains a user followed by a semicolon-separated list of their item interactions. This method uses read_inline to parse this format into a standard user-item pair DataFrame.

Parameters:

Name Type Description Default
data_path str

The path to the raw user_data.txt file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.
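
For illustration, a raw line of user_data.txt is expected to look roughly like the following (fabricated IDs, using the separators passed to read_inline below):

u1,i1;i2;i3,o1

Each item history is then expanded into one user-item pair per entry, e.g. (u1, i1), (u1, i2), (u1, i3).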

Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
def process(self, data_path) -> None:
    """
    Processes the raw user interaction data and loads it into the class.

    The user interaction data is in an 'inline' format, where each line
    contains a user followed by a semicolon-separated list of their item
    interactions. This method uses `read_inline` to parse this format into
    a standard user-item pair DataFrame.

    Args:
        data_path (str): The path to the raw `user_data.txt` file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    from datarec.io.readers import read_inline
    self.data = read_inline(data_path, cols=['user', 'item', 'outfit'],
                            user_col='user', item_col='item',
                            col_sep=',', history_sep=';')
    return None

Amazon Beauty

Entry point for loading different versions of the Amazon Beauty dataset.

AmazonBeauty

Entry point class to load various versions of the Amazon Beauty dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

The Amazon Beauty dataset contains product reviews and metadata from Amazon, specialized for the "Beauty and Personal Care" category.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonBeauty()

To load a specific version:

>>> data_loader = AmazonBeauty(version='2023')
Source code in datarec/datasets/amazon_beauty/amz_beauty.py
class AmazonBeauty:
    """
    Entry point class to load various versions of the Amazon Beauty dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    The Amazon Beauty dataset contains product reviews and metadata from Amazon,
    specialized for the "Beauty and Personal Care" category.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonBeauty()

        To load a specific version:
        >>> data_loader = AmazonBeauty(version='2023')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the Amazon Beauty dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                '2023' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (AMZ_Beauty_2023): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'2023': AMZ_Beauty_2023}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Beauty {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Beauty and Personal Care 2023")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Amazon Beauty dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only '2023' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
AMZ_Beauty_2023

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_beauty/amz_beauty.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the Amazon Beauty dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            '2023' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (AMZ_Beauty_2023): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'2023': AMZ_Beauty_2023}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Beauty {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Beauty and Personal Care 2023")

Builder class for the 2023 version of the Amazon Beauty dataset.

AMZ_Beauty_2023

Bases: DataRec

Builder class for the Amazon Beauty dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the Amazon Beauty dataset. It is not typically instantiated directly but is called by the AmazonBeauty entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and contains user ratings for beauty products.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.
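
Example (a minimal sketch; the import path follows the entry-point source location datarec/datasets/amazon_beauty/amz_beauty.py documented above, and the data attribute is populated by process()):

>>> from datarec.datasets.amazon_beauty.amz_beauty import AmazonBeauty
>>> dr = AmazonBeauty(version='2023')
>>> dr.data  # user-item-rating-timestamp interactions loaded by process()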

Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
class AMZ_Beauty_2023(DataRec):
    """
    Builder class for the Amazon Beauty dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the Amazon Beauty dataset. It is not typically instantiated
    directly but is called by the `AmazonBeauty` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and contains user ratings for beauty products.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Beauty_and_Personal_Care.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = '2e7f69fa6d738f1ee7756d8a46ad7930'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_beauty'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded .gz archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed .gz archive.

        Returns:
            (str): The path to the decompressed CSV file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)

    def download(self) -> (str, str):
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded .gz archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Raw files folder missing. Folder created at \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        print('Downloading data file from {}'.format(self.url))
        download_file(self.url, file_path, size=559019689)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file into a pandas DataFrame and
        assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_beauty'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    self.process(file_path)

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded .gz archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed .gz archive.

required

Returns:

Type Description
str

The path to the decompressed CSV file.

Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded .gz archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed .gz archive.

    Returns:
        (str): The path to the decompressed CSV file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded .gz archive.

Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
def download(self) -> (str, str):
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded .gz archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Raw files folder missing. Folder created at \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    print('Downloading data file from {}'.format(self.url))
    download_file(self.url, file_path, size=559019689)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file into a pandas DataFrame and
    assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

Amazon Books

Entry point for loading different versions of the Amazon Books dataset.

AmazonBooks

Entry point class to load various versions of the Amazon Books dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.

The Amazon Books dataset contains product reviews and metadata from Amazon for the "Books" category.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonBooks()

To load a specific version:

>>> data_loader = AmazonBooks(version='2018')
Source code in datarec/datasets/amazon_books/amz_books.py
class AmazonBooks:
    """
    Entry point class to load various versions of the Amazon Books dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the 2018 or 2023 version.

    The Amazon Books dataset contains product reviews and metadata from Amazon
    for the "Books" category.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonBooks()

        To load a specific version:
        >>> data_loader = AmazonBooks(version='2018')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the Amazon Books dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '2023', '2018', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (DataRec): An instance of the appropriate dataset builder class (e.g.,
                `AMZ_Books_2023`), populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'2023': AMZ_Books_2023,
                    '2018': AMZ_Books_2018}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Books {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Books 2023"
                             f"\n \t 2018 \t Amazon Books 2018")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Amazon Books dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
DataRec

An instance of the appropriate dataset builder class (e.g., AMZ_Books_2023), populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_books/amz_books.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the Amazon Books dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '2023', '2018', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (DataRec): An instance of the appropriate dataset builder class (e.g.,
            `AMZ_Books_2023`), populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'2023': AMZ_Books_2023,
                '2018': AMZ_Books_2018}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Books {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Books 2023"
                         f"\n \t 2018 \t Amazon Books 2018")

Builder class for the 2018 version of the Amazon Books dataset.

AMZ_Books_2018

Bases: DataRec

Builder class for the Amazon Books dataset (2018 version).

This class handles the logic for downloading, preparing, and loading the 2018 version of the Amazon Books dataset from the Amazon Reviews V2 source. It is not typically instantiated directly but is called by the AmazonBooks entry point class.

The raw data is provided as a single, uncompressed CSV file.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_books/amz_books_2018.py
class AMZ_Books_2018(DataRec):
    """
    Builder class for the Amazon Books dataset (2018 version).

    This class handles the logic for downloading, preparing, and loading the
    2018 version of the Amazon Books dataset from the Amazon Reviews V2 source.
    It is not typically instantiated directly but is called by the `AmazonBooks`
    entry point class.

    The raw data is provided as a single, uncompressed CSV file.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/Books.csv'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = 'c6cb0fd6e4322d3523e9afd87d5ed9dc'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_books'
        self.version_name = '2018'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required data file.

        Returns:
            (str or None): The path to the required data file if it exists,
                otherwise returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        # uncompressed data file
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Handles the decompression step.

        For this 2018 version, the source file is already decompressed, so this
        method simply returns the path to the file.

        Args:
            path (str): The file path of the source data file.

        Returns:
            (str): The path to the data file.
        """
        # file already decompressed
        return path

    def download(self) -> str:
        """
        Downloads the raw dataset file.

        Returns:
            (str): The local file path to the downloaded file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path, size=2140933459)

        # decompress downloaded file
        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the raw file, which does not contain a header row.
        Columns are identified by their integer index. The data is then assigned
        to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        verify_checksum(file_path, self.CHECKSUM)

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col=0, item_col=1,
                               rating_col=2, timestamp_col=3,
                               header=None)
        self.data = dataset
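
Each builder stores an MD5 `CHECKSUM` and calls `verify_checksum` on the file before processing it. The following is a rough, standard-library sketch of what such a check amounts to; it is illustrative only and is not the datarec implementation:

import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the expected value, e.g. AMZ_Books_2018.CHECKSUM:
# assert md5_of('Books.csv') == 'c6cb0fd6e4322d3523e9afd87d5ed9dc', 'checksum mismatch'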

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_books/amz_books_2018.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_books'
    self.version_name = '2018'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required data file.

Returns:

Type Description

str or None

The path to the required data file if it exists, otherwise returns None.

Source code in datarec/datasets/amazon_books/amz_books_2018.py
def required_files(self):
    """
    Checks for the presence of the required data file.

    Returns:
        (str or None): The path to the required data file if it exists,
            otherwise returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    # uncompressed data file
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Handles the decompression step.

For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.

Parameters:

Name Type Description Default
path str

The file path of the source data file.

required

Returns:

Type Description
str

The path to the data file.

Source code in datarec/datasets/amazon_books/amz_books_2018.py
def decompress(self, path):
    """
    Handles the decompression step.

    For this 2018 version, the source file is already decompressed, so this
    method simply returns the path to the file.

    Args:
        path (str): The file path of the source data file.

    Returns:
        (str): The path to the data file.
    """
    # file already decompressed
    return path

download()

Downloads the raw dataset file.

Returns:

Type Description
str

The local file path to the downloaded file.

Source code in datarec/datasets/amazon_books/amz_books_2018.py
def download(self) -> str:
    """
    Downloads the raw dataset file.

    Returns:
        (str): The local file path to the downloaded file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path, size=2140933459)

    # decompress downloaded file
    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the raw file, which does not contain a header row. Columns are identified by their integer index. The data is then assigned to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_books/amz_books_2018.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the raw file, which does not contain a header row.
    Columns are identified by their integer index. The data is then assigned
    to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    verify_checksum(file_path, self.CHECKSUM)

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col=0, item_col=1,
                           rating_col=2, timestamp_col=3,
                           header=None)
    self.data = dataset

Builder class for the 2023 version of the Amazon Books dataset.

AMZ_Books_2023

Bases: DataRec

Builder class for the Amazon Books dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the Amazon Books dataset. It is not typically instantiated directly but is called by the AmazonBooks entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and contains user ratings for books.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_books/amz_books_2023.py
class AMZ_Books_2023(DataRec):
    """
    Builder class for the Amazon Books dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the Amazon Books dataset. It is not typically instantiated
    directly but is called by the `AmazonBooks` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and contains user ratings for books.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Books.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = 'abc9f379ac0a77860ea0792b69ad0d5d'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """

        super().__init__(None)

        self.dataset_name = 'amazon_books'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        # uncompressed data file
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (str): The path to the decompressed file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path)

        # decompress downloaded file
        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file, which includes a header,
        into a pandas DataFrame and assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset
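
The 2023 release is a headered CSV with `user_id`, `parent_asin`, `rating`, and `timestamp` columns, and the `read_tabular` call above maps those columns onto the library's interaction schema. As a rough orientation, a plain-pandas equivalent of that read would look like the sketch below (illustrative only; `read_tabular` and its return type belong to `datarec.io` and are not reproduced here):

import pandas as pd

# Read only the interaction columns from the decompressed Books.csv (2023 release);
# the local path is assumed for illustration.
df = pd.read_csv('Books.csv', usecols=['user_id', 'parent_asin', 'rating', 'timestamp'])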

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_books/amz_books_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """

    super().__init__(None)

    self.dataset_name = 'amazon_books'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_books/amz_books_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    # uncompressed data file
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
str

The path to the decompressed file.

Source code in datarec/datasets/amazon_books/amz_books_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (str): The path to the decompressed file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)
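
`decompress_gz` is a helper from the download utilities module. Its effect is, in spirit, a streamed gunzip of the archive to the target path; a standard-library sketch of that behaviour (an assumption about the helper, not the datarec implementation):

import gzip
import shutil

def gunzip(src_path, dst_path):
    """Stream-decompress a .gz archive to dst_path and return dst_path."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path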

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/amazon_books/amz_books_2023.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path)

    # decompress downloaded file
    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file, which includes a header, into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_books/amz_books_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file, which includes a header,
    into a pandas DataFrame and assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

Amazon Clothing

Entry point for loading different versions of the Amazon Clothing dataset.

AmazonClothing

Entry point class to load various versions of the Amazon Clothing dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.

The dataset contains product reviews and metadata for the category "Clothing, Shoes and Jewelry" from Amazon.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonClothing()

To load a specific version:

>>> data_loader = AmazonClothing(version='2018')
Source code in datarec/datasets/amazon_clothing/amz_clothing.py
class AmazonClothing:
    """
    Entry point class to load various versions of the Amazon Clothing dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the 2018 or 2023 version.

    The dataset contains product reviews and metadata for the category
    "Clothing, Shoes and Jewelry" from Amazon.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonClothing()

        To load a specific version:
        >>> data_loader = AmazonClothing(version='2018')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the Amazon Clothing dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '2023', '2018', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (DataRec): An instance of the appropriate dataset builder class
                (e.g., `AmazonClothing_2023`), populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'2023': AmazonClothing_2023,
                    '2018': AmazonClothing_2018}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Clothing {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Clothing 2023"
                             f"\n \t 2018 \t Amazon Clothing 2018")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Amazon Clothing dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
DataRec

An instance of the appropriate dataset builder class (e.g., AmazonClothing_2023), populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_clothing/amz_clothing.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the Amazon Clothing dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '2023', '2018', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (DataRec): An instance of the appropriate dataset builder class
            (e.g., `AmazonClothing_2023`), populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'2023': AmazonClothing_2023,
                '2018': AmazonClothing_2018}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Clothing {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Clothing 2023"
                         f"\n \t 2018 \t Amazon Clothing 2018")

Builder class for the 2018 version of the Amazon Clothing dataset.

AmazonClothing_2018

Bases: DataRec

Builder class for the Amazon Clothing dataset (2018 version).

This class handles the logic for downloading, preparing, and loading the 2018 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically instantiated directly but is called by the AmazonClothing entry point class.

The raw data is provided as a single, uncompressed CSV file without a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
class AmazonClothing_2018(DataRec):
    """
    Builder class for the Amazon Clothing dataset (2018 version).

    This class handles the logic for downloading, preparing, and loading the
    2018 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically
    instantiated directly but is called by the `AmazonClothing` entry point class.

    The raw data is provided as a single, uncompressed CSV file without a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/Clothing_Shoes_and_Jewelry.csv'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name
    CHECKSUM = '27b4184d3d4b5e443d31dc608badf927'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_clothing'
        self.version_name = '2018'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required data file.

        Returns:
            (str or None): The path to the required data file if it exists,
                otherwise returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Handles the decompression step.

        For this 2018 version, the source file is already decompressed, so this
        method simply returns the path to the file.

        Args:
            path (str): The file path of the source data file.

        Returns:
            (str): The path to the data file.
        """
        # already decompressed
        return path

    def download(self) -> str:
        """
        Downloads the raw dataset file.

        Returns:
            (str): The local file path to the downloaded file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=1395554400)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the raw file, which has no header. Columns are
        identified by their integer index. The data is then assigned to the
        `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        verify_checksum(file_path, self.CHECKSUM)

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col=0, item_col=1,
                               rating_col=2, timestamp_col=3,
                               header=None)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_clothing'
    self.version_name = '2018'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required data file.

Returns:

Type Description
str or None

The path to the required data file if it exists, otherwise returns None.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
def required_files(self):
    """
    Checks for the presence of the required data file.

    Returns:
        (str or None): The path to the required data file if it exists,
            otherwise returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Handles the decompression step.

For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.

Parameters:

Name Type Description Default
path str

The file path of the source data file.

required

Returns:

Type Description
str

The path to the data file.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
def decompress(self, path):
    """
    Handles the decompression step.

    For this 2018 version, the source file is already decompressed, so this
    method simply returns the path to the file.

    Args:
        path (str): The file path of the source data file.

    Returns:
        (str): The path to the data file.
    """
    # already decompressed
    return path

download()

Downloads the raw dataset file.

Returns:

Type Description
str

The local file path to the downloaded file.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
def download(self) -> str:
    """
    Downloads the raw dataset file.

    Returns:
        (str): The local file path to the downloaded file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=1395554400)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the raw file, which has no header. Columns are
    identified by their integer index. The data is then assigned to the
    `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    verify_checksum(file_path, self.CHECKSUM)

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col=0, item_col=1,
                           rating_col=2, timestamp_col=3,
                           header=None)
    self.data = dataset
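
For the 2018-era files there is no header row; the builders address the columns purely by position (user 0, item 1, rating 2, timestamp 3), as in the `read_tabular` call above. A plain-pandas sketch of the same positional read, under those column assignments (illustrative only; the local path is assumed):

import pandas as pd

# Positional read of a headerless ratings CSV such as Clothing_Shoes_and_Jewelry.csv.
df = pd.read_csv(
    'Clothing_Shoes_and_Jewelry.csv',
    header=None,
    names=['user', 'item', 'rating', 'timestamp'],  # columns 0-3, matching read_tabular above
)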

Builder class for the 2023 version of the Amazon Clothing dataset.

AmazonClothing_2023

Bases: DataRec

Builder class for the Amazon Clothing dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically instantiated directly but is called by the AmazonClothing entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
class AmazonClothing_2023(DataRec):
    """
    Builder class for the Amazon Clothing dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically
    instantiated directly but is called by the `AmazonClothing` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and is provided as a compressed CSV file.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Clothing_Shoes_and_Jewelry.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = 'cfb9400815ce8fb6430130b7d439c203'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_clothing'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (str): The path to the decompressed file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=1395554400)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file, which includes a header row,
        into a pandas DataFrame and assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_clothing'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)
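
The `folder` argument overrides the default per-user cache directory returned by `dataset_directory`, which is useful for shared or ephemeral environments. A short sketch, assuming the builder class is in scope:

# Store raw and processed files under an explicit project directory instead of the user cache.
# Per the constructor above, raw files then land under <folder>/2023/<RAW_DATA_FOLDER>.
dr = AmazonClothing_2023(folder='/data/recsys/amazon_clothing')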

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
str

The path to the decompressed file.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (str): The path to the decompressed file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=1395554400)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file, which includes a header row,
    into a pandas DataFrame and assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

Amazon Sports and Outdoors

Entry point for loading different versions of the Amazon Sports and Outdoors dataset.

AmazonSportsOutdoors

Entry point class to load various versions of the Amazon Sports and Outdoors dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.

The dataset contains product reviews and metadata for the category "Sports and Outdoors" from Amazon.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonSportsOutdoors()

To load a specific version:

>>> data_loader = AmazonSportsOutdoors(version='2018')
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports.py
class AmazonSportsOutdoors:
    """
    Entry point class to load various versions of the Amazon Sports and Outdoors dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the 2018 or 2023 version.

    The dataset contains product reviews and metadata for the category
    "Sports and Outdoors" from Amazon.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonSportsOutdoors()

        To load a specific version:
        >>> data_loader = AmazonSportsOutdoors(version='2018')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '2023', '2018', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (DataRec): An instance of the appropriate dataset builder class
                (e.g., `AMZ_SportsOutdoors_2023`), populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'2023': AMZ_SportsOutdoors_2023,
                    '2018': AMZ_SportsOutdoors_2018}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Sports and Outdoors {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Sports and Outdoors 2023"
                             f"\n \t 2018 \t Amazon Sports and Outdoors 2018")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
DataRec

An instance of the appropriate dataset builder class (e.g., AMZ_SportsOutdoors_2023), populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '2023', '2018', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (DataRec): An instance of the appropriate dataset builder class
            (e.g., `AMZ_SportsOutdoors_2023`), populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'2023': AMZ_SportsOutdoors_2023,
                '2018': AMZ_SportsOutdoors_2018}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Sports and Outdoors {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Sports and Outdoors 2023"
                         f"\n \t 2018 \t Amazon Sports and Outdoors 2018")

Builder class for the 2018 version of the Amazon Sports and Outdoors dataset.

AMZ_SportsOutdoors_2018

Bases: DataRec

Builder class for the Amazon Sports and Outdoors dataset (2018 version).

This class handles the logic for downloading, preparing, and loading the 2018 version of the "Sports and Outdoors" dataset. It is not typically instantiated directly but is called by the AmazonSportsOutdoors entry point class.

The raw data is provided as a single, uncompressed CSV file without a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
class AMZ_SportsOutdoors_2018(DataRec):
    """
    Builder class for the Amazon Sports and Outdoors dataset (2018 version).

    This class handles the logic for downloading, preparing, and loading the
    2018 version of the "Sports and Outdoors" dataset. It is not typically
    instantiated directly but is called by the `AmazonSportsOutdoors` entry point class.

    The raw data is provided as a single, uncompressed CSV file without a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/Sports_and_Outdoors.csv'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name
    CHECKSUM = '1ed3d6c7a89f3c78fa260b2419753785'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_sports_and_outdoors'
        self.version_name = '2018'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required data file.

        Returns:
            (str or None): The path to the required data file if it exists,
                otherwise returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        # uncompressed data file
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Handles the decompression step.

        For this 2018 version, the source file is already decompressed, so this
        method simply returns the path to the file.

        Args:
            path (str): The file path of the source data file.

        Returns:
            (str): The path to the data file.
        """
        # already decompressed
        return path


    def download(self) -> str:
        """
        Downloads the raw dataset file.

        Returns:
            (str): The local file path to the downloaded file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (uncompressed CSV)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the raw file, which has no header. Columns are
        identified by their integer index. The data is then assigned to the
        `self.data` attribute after checksum verification.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        verify_checksum(file_path, self.CHECKSUM)

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col=0, item_col=1,
                               rating_col=2, timestamp_col=3,
                               header=None)
        self.data = dataset
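
Although this builder is normally obtained through the AmazonSportsOutdoors entry point, it can be instantiated directly when the raw files should live in a specific directory. A short sketch, assuming the import path derived from the source file listed above and that RAW_DATA_FOLDER is the library's raw-data subdirectory name:

from datarec.datasets.amazon_sports_and_outdoors.amz_sports_2018 import AMZ_SportsOutdoors_2018

# Downloads the CSV if it is missing, verifies the MD5 checksum, and loads it.
builder = AMZ_SportsOutdoors_2018(folder='/tmp/datarec_cache')
# With a custom folder, the path construction in __init__ places the raw file under
#   /tmp/datarec_cache/2018/<RAW_DATA_FOLDER>/Sports_and_Outdoors.csv
print(builder.data)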

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_sports_and_outdoors'
    self.version_name = '2018'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required data file.

Returns:

Type Description
str or None

The path to the required data file if it exists, otherwise returns None.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
def required_files(self):
    """
    Checks for the presence of the required data file.

    Returns:
        (str or None): The path to the required data file if it exists,
            otherwise returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    # uncompressed data file
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Handles the decompression step.

For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.

Parameters:

Name Type Description Default
path str

The file path of the source data file.

required

Returns:

Type Description
str

The path to the data file.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
def decompress(self, path):
    """
    Handles the decompression step.

    For this 2018 version, the source file is already decompressed, so this
    method simply returns the path to the file.

    Args:
        path (str): The file path of the source data file.

    Returns:
        (str): The path to the data file.
    """
    # already decompressed
    return path

download()

Downloads the raw dataset file.

Returns:

Type Description
str

The local file path to the downloaded file.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
def download(self) -> str:
    """
    Downloads the raw dataset file.

    Returns:
        (str): The local file path to the downloaded file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (uncompressed CSV)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the self.data attribute after checksum verification.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the raw file, which has no header. Columns are
    identified by their integer index. The data is then assigned to the
    `self.data` attribute after checksum verification.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    verify_checksum(file_path, self.CHECKSUM)

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col=0, item_col=1,
                           rating_col=2, timestamp_col=3,
                           header=None)
    self.data = dataset
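
Because the 2018 export has no header, each line is simply user,item,rating,timestamp and the columns are addressed by position, as in the read_tabular call above. For illustration only, a roughly equivalent standalone read with pandas (the column names below are chosen for readability and are not part of the raw file):

import pandas as pd

# Headerless CSV: four positional columns per interaction.
df = pd.read_csv('Sports_and_Outdoors.csv', header=None,
                 names=['user', 'item', 'rating', 'timestamp'])
print(df.head())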

Builder class for the 2023 version of the Amazon Sports and Outdoors dataset.

AMZ_SportsOutdoors_2023

Bases: DataRec

Builder class for the Amazon Sports and Outdoors dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the "Sports and Outdoors" dataset. It is not typically instantiated directly but is called by the AmazonSportsOutdoors entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
class AMZ_SportsOutdoors_2023(DataRec):
    """
    Builder class for the Amazon Sports and Outdoors dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the "Sports and Outdoors" dataset. It is not typically
    instantiated directly but is called by the `AmazonSportsOutdoors` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and is provided as a compressed CSV file with a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Sports_and_Outdoors.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = '75e1dfbb3b3014fab914832b734922e6'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_sports_and_outdoors'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        # uncompressed data file
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (str): The path to the decompressed file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)


    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file, which includes a header row,
        into a pandas DataFrame and assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset
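
verify_checksum and decompress_gz are library helpers; conceptually, the decompress step amounts to an MD5 comparison followed by a streamed gzip extraction. The sketch below illustrates that idea and is not the datarec implementation:

import gzip
import hashlib
import shutil

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large archives never need to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

archive = 'Sports_and_Outdoors.csv.gz'
assert md5_of(archive) == '75e1dfbb3b3014fab914832b734922e6', 'checksum mismatch'

# Stream the decompressed bytes into the target CSV file.
with gzip.open(archive, 'rb') as src, open('Sports_and_Outdoors.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)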

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_sports_and_outdoors'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    # uncompressed data file
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
str

The path to the decompressed file.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (str): The path to the decompressed file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file, which includes a header row,
    into a pandas DataFrame and assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset
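
In contrast to the 2018 file, the 2023 CSV keeps its header row, so read_tabular selects columns by name (user_id, parent_asin, rating, timestamp). Again purely for illustration, an equivalent standalone read with pandas:

import pandas as pd

# The header row names the columns, so they are selected by label.
df = pd.read_csv('Sports_and_Outdoors.csv')
interactions = df[['user_id', 'parent_asin', 'rating', 'timestamp']]
print(interactions.head())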

Amazon Toys and Games

Entry point for loading different versions of the Amazon Toys and Games dataset.

AmazonToysGames

Entry point class to load various versions of the Amazon Toys and Games dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.

The dataset contains product reviews and metadata for the category "Toys and Games" from Amazon.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonToysGames()

To load a specific version:

>>> data_loader = AmazonToysGames(version='2018')
Source code in datarec/datasets/amazon_toys_and_games/amz_toys.py
class AmazonToysGames:
    """
    Entry point class to load various versions of the Amazon Toys and Games dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the 2018 or 2023 version.

    The dataset contains product reviews and metadata for the category
    "Toys and Games" from Amazon.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonToysGames()

        To load a specific version:
        >>> data_loader = AmazonToysGames(version='2018')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '2023', '2018', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (DataRec): An instance of the appropriate dataset builder class
                (e.g., `AMZ_ToysGames_2023`), populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'2023': AMZ_ToysGames_2023,
                    '2018': AMZ_ToysGames_2018}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Toys and Games {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Toys and Games 2023"
                             f"\n \t 2018 \t Amazon Toys and Games 2018")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
DataRec

An instance of the appropriate dataset builder class (e.g., AMZ_ToysGames_2023), populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '2023', '2018', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (DataRec): An instance of the appropriate dataset builder class
            (e.g., `AMZ_ToysGames_2023`), populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'2023': AMZ_ToysGames_2023,
                '2018': AMZ_ToysGames_2018}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Toys and Games {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Toys and Games 2023"
                         f"\n \t 2018 \t Amazon Toys and Games 2018")

Builder class for the 2018 version of the Amazon Toys and Games dataset.

AMZ_ToysGames_2018

Bases: DataRec

Builder class for the Amazon Toys and Games dataset (2018 version).

This class handles the logic for downloading, preparing, and loading the 2018 version of the "Toys and Games" dataset. It is not typically instantiated directly but is called by the AmazonToysGames entry point class.

The raw data is provided as a single, uncompressed CSV file without a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
class AMZ_ToysGames_2018(DataRec):
    """
    Builder class for the Amazon Toys and Games dataset (2018 version).

    This class handles the logic for downloading, preparing, and loading the
    2018 version of the "Toys and Games" dataset. It is not typically
    instantiated directly but is called by the `AmazonToysGames` entry point class.

    The raw data is provided as a single, uncompressed CSV file without a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/Toys_and_Games.csv'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name
    CHECKSUM = '3e3f0c05d880403de6601f22398ccd78'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_toys_and_games'
        self.version_name = '2018'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required data file.

        Returns:
            (str or None): The path to the required data file if it exists,
                otherwise returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Handles the decompression step.

        For this 2018 version, the source file is already decompressed, so this
        method simply returns the path to the file.

        Args:
            path (str): The file path of the source data file.

        Returns:
            (str): The path to the data file.
        """
        # already decompressed
        return path

    def download(self) -> str:
        """
        Downloads the raw dataset file.

        Returns:
            (str): The local file path to the downloaded file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (uncompressed CSV)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=388191962)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the raw file, which has no header. Columns are
        identified by their integer index. The data is then assigned to the
        `self.data` attribute after checksum verification.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        verify_checksum(file_path, self.CHECKSUM)

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col=0, item_col=1,
                               rating_col=2, timestamp_col=3,
                               header=None)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_toys_and_games'
    self.version_name = '2018'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required data file.

Returns:

Type Description
str or None

The path to the required data file if it exists, otherwise returns None.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
def required_files(self):
    """
    Checks for the presence of the required data file.

    Returns:
        (str or None): The path to the required data file if it exists,
            otherwise returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Handles the decompression step.

For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.

Parameters:

Name Type Description Default
path str

The file path of the source data file.

required

Returns:

Type Description
str

The path to the data file.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
def decompress(self, path):
    """
    Handles the decompression step.

    For this 2018 version, the source file is already decompressed, so this
    method simply returns the path to the file.

    Args:
        path (str): The file path of the source data file.

    Returns:
        (str): The path to the data file.
    """
    # already decompressed
    return path

download()

Downloads the raw dataset file.

Returns:

Type Description
str

The local file path to the downloaded file.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
def download(self) -> str:
    """
    Downloads the raw dataset file.

    Returns:
        (str): The local file path to the downloaded file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (uncompressed CSV)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=388191962)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the self.data attribute after checksum verification.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the raw file, which has no header. Columns are
    identified by their integer index. The data is then assigned to the
    `self.data` attribute after checksum verification.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    verify_checksum(file_path, self.CHECKSUM)

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col=0, item_col=1,
                           rating_col=2, timestamp_col=3,
                           header=None)
    self.data = dataset

Builder class for the 2023 version of the Amazon Toys and Games dataset.

AMZ_ToysGames_2023

Bases: DataRec

Builder class for the Amazon Toys and Games dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the "Toys and Games" dataset. It is not typically instantiated directly but is called by the AmazonToysGames entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
class AMZ_ToysGames_2023(DataRec):
    """
    Builder class for the Amazon Toys and Games dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the "Toys and Games" dataset. It is not typically
    instantiated directly but is called by the `AmazonToysGames` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and is provided as a compressed CSV file with a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Toys_and_Games.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = '542250672811854e9803d90b1f52cc14'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_toys_and_games'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (str): The path to the decompressed file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)


    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=388191962)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file, which includes a header row,
        into a pandas DataFrame and assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_toys_and_games'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
str

The path to the decompressed file.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (str): The path to the decompressed file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=388191962)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file, which includes a header row,
    into a pandas DataFrame and assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

Amazon Video Games

Entry point for loading different versions of the Amazon Video Games dataset.

AmazonVideoGames

Entry point class to load various versions of the Amazon Video Games dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.

The dataset contains product reviews and metadata for the "Video Games" category from Amazon.

The default version is 'latest', which currently corresponds to the '2023' version.

Examples:

To load the latest version:

>>> data_loader = AmazonVideoGames()

To load a specific version:

>>> data_loader = AmazonVideoGames(version='2018')
Source code in datarec/datasets/amazon_videogames/amz_videogames.py
class AmazonVideoGames:
    """
    Entry point class to load various versions of the Amazon Video Games dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the 2018 or 2023 version.

    The dataset contains product reviews and metadata for the "Video Games"
    category from Amazon.

    The default version is 'latest', which currently corresponds to the '2023' version.

    Examples:
        To load the latest version:
        >>> data_loader = AmazonVideoGames()

        To load a specific version:
        >>> data_loader = AmazonVideoGames(version='2018')
    """
    latest_version = '2023'

    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '2023', '2018', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (DataRec): An instance of the appropriate dataset builder class
                (e.g., `AMZ_VideoGames_2023`), populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'2023': AMZ_VideoGames_2023,
                    '2018': AMZ_VideoGames_2018}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"Amazon Video Games {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2023 \t Amazon Video Games 2023"
                             f"\n \t 2018 \t Amazon Video Games 2018")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
DataRec

An instance of the appropriate dataset builder class (e.g., AMZ_VideoGames_2023), populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/amazon_videogames/amz_videogames.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '2023', '2018', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (DataRec): An instance of the appropriate dataset builder class
            (e.g., `AMZ_VideoGames_2023`), populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'2023': AMZ_VideoGames_2023,
                '2018': AMZ_VideoGames_2018}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"Amazon Video Games {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2023 \t Amazon Video Games 2023"
                         f"\n \t 2018 \t Amazon Video Games 2018")

Builder class for the 2018 version of the Amazon Video Games dataset.

AMZ_VideoGames_2018

Bases: DataRec

Builder class for the Amazon Video Games dataset (2018 version).

This class handles the logic for downloading, preparing, and loading the 2018 version of the "Video Games" dataset. It is not typically instantiated directly but is called by the AmazonVideoGames entry point class.

The raw data is provided as a single, uncompressed CSV file without a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
class AMZ_VideoGames_2018(DataRec):
    """
    Builder class for the Amazon Video Games dataset (2018 version).

    This class handles the logic for downloading, preparing, and loading the
    2018 version of the "Video Games" dataset. It is not typically
    instantiated directly but is called by the `AmazonVideoGames` entry point class.

    The raw data is provided as a single, uncompressed CSV file without a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_v2/categoryFilesSmall/Video_Games.csv'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name
    CHECKSUM = 'feecdbf6bf247e54d2a572e2be503515'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_video_games'
        self.version_name = '2018'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required data file.

        Returns:
            (str or None): The path to the required data file if it exists,
                otherwise returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Handles the decompression step.

        For this 2018 version, the source file is already decompressed, so this
        method simply returns the path to the file.

        Args:
            path (str): The file path of the source data file.

        Returns:
            (str): The path to the data file.
        """
        return path

    def download(self) -> str:
        """
        Downloads the raw dataset file.

        Returns:
            (str): The local file path to the downloaded file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (plain CSV, no decompression needed)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=115388622)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the raw file, which has no header. Columns are
        identified by their integer index. The data is then assigned to the
        `self.data` attribute after checksum verification.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        verify_checksum(file_path, self.CHECKSUM)

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col=0, item_col=1,
                               rating_col=2, timestamp_col=3,
                               header=None)
        self.data = dataset
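
As a hedged illustration of the constructor documented below, the builder can be pointed at a custom cache directory (the path is purely illustrative); the download runs only when the file is not already cached, and the processed interactions end up on the data attribute:

>>> builder = AMZ_VideoGames_2018(folder='/tmp/datarec_cache')  # illustrative path
>>> interactions = builder.data  # populated by process() after checksum verification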

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_video_games'
    self.version_name = '2018'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required data file.

Returns:

Type Description
str or None

The path to the required data file if it exists, otherwise returns None.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
def required_files(self):
    """
    Checks for the presence of the required data file.

    Returns:
        (str or None): The path to the required data file if it exists,
            otherwise returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Handles the decompression step.

For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.

Parameters:

Name Type Description Default
path str

The file path of the source data file.

required

Returns:

Type Description
str

The path to the data file.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
def decompress(self, path):
    """
    Handles the decompression step.

    For this 2018 version, the source file is already decompressed, so this
    method simply returns the path to the file.

    Args:
        path (str): The file path of the source data file.

    Returns:
        (str): The path to the data file.
    """
    return path

download()

Downloads the raw dataset file.

Returns:

Type Description
str

The local file path to the downloaded file.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
def download(self) -> str:
    """
    Downloads the raw dataset file.

    Returns:
        (str): The local file path to the downloaded file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (plain CSV, no decompression needed)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=115388622)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the self.data attribute after checksum verification.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the raw file, which has no header. Columns are
    identified by their integer index. The data is then assigned to the
    `self.data` attribute after checksum verification.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    verify_checksum(file_path, self.CHECKSUM)

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col=0, item_col=1,
                           rating_col=2, timestamp_col=3,
                           header=None)
    self.data = dataset
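
Because the 2018 file ships without a header, the columns are purely positional: index 0 is the user, 1 the item, 2 the rating, and 3 the timestamp. The sketch below is a rough pandas equivalent of that mapping, useful only for inspecting the raw file; the library itself goes through datarec.io.read_tabular, and file_path stands for the downloaded Video_Games.csv:

import pandas as pd

# Positional column layout assumed by process(): user, item, rating, timestamp.
df = pd.read_csv(file_path, header=None,
                 names=['user', 'item', 'rating', 'timestamp'])
print(df.head())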

Builder class for the 2023 version of the Amazon Video Games dataset.

AMZ_VideoGames_2023

Bases: DataRec

Builder class for the Amazon Video Games dataset (2023 version).

This class handles the logic for downloading, preparing, and loading the 2023 version of the "Video Games" dataset. It is not typically instantiated directly but is called by the AmazonVideoGames entry point class.

The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
class AMZ_VideoGames_2023(DataRec):
    """
    Builder class for the Amazon Video Games dataset (2023 version).

    This class handles the logic for downloading, preparing, and loading the
    2023 version of the "Video Games" dataset. It is not typically
    instantiated directly but is called by the `AmazonVideoGames` entry point class.

    The dataset is from the "Bridging Language and Items for Retrieval and
    Recommendation" paper and is provided as a compressed CSV file with a header.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/benchmark/0core/rating_only/Video_Games.csv.gz'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.gz', '')
    CHECKSUM = '60fdc3e812de871c30d65722e9a91a0a'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'amazon_video_games'
        self.version_name = '2023'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (str or None): The path to the required data file if it exists or can be
                created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

        # check if the file is there
        if os.path.exists(uncompressed_file_path):
            return uncompressed_file_path
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (str): The path to the decompressed file.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
        return decompress_gz(path, decompressed_file_path)

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path, size=115388622)

        return file_path

    def process(self, file_path):
        """
        Processes the raw data and loads it into the class.

        This method reads the decompressed file, which includes a header row,
        into a pandas DataFrame and assigns it to the `self.data` attribute.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='user_id', item_col='parent_asin',
                               rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'amazon_video_games'
    self.version_name = '2023'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path)

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
str or None

The path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (str or None): The path to the required data file if it exists or can be
            created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    uncompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)

    # check if the file is there
    if os.path.exists(uncompressed_file_path):
        return uncompressed_file_path
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
str

The path to the decompressed file.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (str): The path to the decompressed file.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file_path = os.path.join(self._raw_folder, self.decompressed_data_file_name)
    return decompress_gz(path, decompressed_file_path)
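
verify_checksum and decompress_gz are internal helpers whose implementations are not reproduced here. The standard-library sketch below only illustrates the kind of work this step performs; it is an assumption about their behaviour, not the library's actual code, and archive_path / csv_path are placeholder names:

import gzip
import hashlib
import shutil

def md5_of(path, chunk_size=1 << 20):
    """Chunked MD5 digest, standing in for what verify_checksum is assumed to do."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

assert md5_of(archive_path) == AMZ_VideoGames_2023.CHECKSUM  # archive_path: the downloaded .csv.gz

# Roughly what decompress_gz is expected to do: stream the archive out to csv_path.
with gzip.open(archive_path, 'rb') as src, open(csv_path, 'wb') as dst:
    shutil.copyfileobj(src, dst)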

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path, size=115388622)

    return file_path

process(file_path)

Processes the raw data and loads it into the class.

This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the self.data attribute.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
def process(self, file_path):
    """
    Processes the raw data and loads it into the class.

    This method reads the decompressed file, which includes a header row,
    into a pandas DataFrame and assigns it to the `self.data` attribute.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='user_id', item_col='parent_asin',
                           rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

CiaoDVD

Entry point for loading different versions of the CiaoDVD dataset.

Ciao

Entry point class to load various versions of the CiaoDVD dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

CiaoDVD is a dataset for DVD recommendations, also containing social trust data. This loader focuses on the movie ratings.

The default version is 'latest', which currently corresponds to 'v1'.

Examples:

To load the latest version:

>>> data_loader = Ciao()

To load a specific version:

>>> data_loader = Ciao(version='v1')
Source code in datarec/datasets/ciao/ciao.py
class Ciao:
    """
    Entry point class to load various versions of the CiaoDVD dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    CiaoDVD is a dataset for DVD recommendations, also containing social trust data.
    This loader focuses on the movie ratings.

    The default version is 'latest', which currently corresponds to 'v1'.

    Examples:
        To load the latest version:
        >>> data_loader = Ciao()

        To load a specific version:
        >>> data_loader = Ciao(version='v1')
    """
    latest_version = 'v1'
    def __new__(cls, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the CiaoDVD dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                'v1' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (Ciao_V1): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'v1': Ciao_V1}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Ciao: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the CiaoDVD dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
Ciao_V1

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/ciao/ciao.py
def __new__(cls, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the CiaoDVD dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            'v1' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (Ciao_V1): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'v1': Ciao_V1}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Ciao: Unsupported version")

Builder class for the v1 version of the CiaoDVD dataset.

Ciao_V1

Bases: DataRec

Builder class for the CiaoDVD dataset.

This class handles the logic for downloading, preparing, and loading the CiaoDVD dataset from the LibRec repository. It is not typically instantiated directly but is called by the Ciao entry point class.

The dataset was introduced in the paper "ETAF: An Extended Trust Antecedents Framework for Trust Prediction". The archive contains multiple files; this loader specifically processes movie-ratings.txt.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of files expected within the decompressed archive.

Source code in datarec/datasets/ciao/ciao_v1.py
class Ciao_V1(DataRec):
    """
    Builder class for the CiaoDVD dataset.

    This class handles the logic for downloading, preparing, and loading the
    CiaoDVD dataset from the LibRec repository. It is not typically instantiated
    directly but is called by the `Ciao` entry point class.

    The dataset was introduced in the paper "ETAF: An Extended Trust Antecedents
    Framework for Trust Prediction". The archive contains multiple files; this
    loader specifically processes `movie-ratings.txt`.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of files expected within the decompressed archive.
    """
    url = 'https://guoguibing.github.io/librec/datasets/CiaoDVD.zip'
    data_file_name = os.path.basename(url)
    movie_file_name = 'movie-ratings.txt'
    review_file_name = 'review-ratings.txt'
    trusts_file_name = 'trusts.txt'
    REQUIRED_FILES = [movie_file_name, review_file_name, trusts_file_name]
    CHECKSUM = '43a39e068e3fc494a7f7f7581293e2c2'


    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'ciaoDVD'
        self.version_name = 'v1'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        rating_file_path = os.path.abspath(os.path.join(self._raw_folder, self.movie_file_name))
        self.process(rating_file_path)

    def required_files(self):
        """
        Checks for the presence of the required decompressed data files.

        It first looks for the final, uncompressed files. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (list or None): A list of paths to the required data files if they
                exist or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list of paths to the decompressed files if successful,
                otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompress_zip_file(path, self._raw_folder)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return files
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path, size=5814757)

        return file_path

    def process(self, path) -> None:
        """
        Processes the raw `movie-ratings.txt` data and loads it into the class.

        This method reads the file, which has no header. It also parses the
        date strings in 'YYYY-MM-DD' format and converts them to Unix timestamps.

        Args:
            path (str): The path to the raw `movie-ratings.txt` file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        user_col = 0
        item_col = 1
        rating_col = 4
        timestamp_col = 5
        dataset = read_tabular(path, sep=',', user_col=user_col, item_col=item_col, rating_col=rating_col, timestamp_col=timestamp_col, header=None)

        # Convert the date strings to datetime objects using the specified format
        dataset.data[timestamp_col] = pd.to_datetime(dataset.data[timestamp_col], format='%Y-%m-%d')

        # Now extract the Unix timestamps (in seconds)
        timestamps = pd.Series(dataset.data[timestamp_col].apply(lambda x: x.timestamp()).values,
                               index=dataset.data.index, dtype='float64')
        dataset.data[timestamp_col] = timestamps

        self.data = dataset
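
The archive also carries review ratings and trust edges, but only movie-ratings.txt is processed. A small standard-library sketch of peeking into the archive (archive_path and raw_folder are placeholder names, and the members are assumed to sit at the archive root, as the REQUIRED_FILES check implies):

import zipfile

with zipfile.ZipFile(archive_path) as zf:
    print(zf.namelist())                          # expected to list the three REQUIRED_FILES
    zf.extract('movie-ratings.txt', raw_folder)   # the only member process() actually reads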

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/ciao/ciao_v1.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'ciaoDVD'
    self.version_name = 'v1'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    rating_file_path = os.path.abspath(os.path.join(self._raw_folder, self.movie_file_name))
    self.process(rating_file_path)

required_files()

Checks for the presence of the required decompressed data files.

It first looks for the final, uncompressed files. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
list or None

A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/ciao/ciao_v1.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data files.

    It first looks for the final, uncompressed files. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (list or None): A list of paths to the required data files if they
            exist or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list of paths to the decompressed files if successful, otherwise None.

Source code in datarec/datasets/ciao/ciao_v1.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list of paths to the decompressed files if successful,
            otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompress_zip_file(path, self._raw_folder)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return files
    return None

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/ciao/ciao_v1.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path, size=5814757)

    return file_path

process(path)

Processes the raw movie-ratings.txt data and loads it into the class.

This method reads the file, which has no header. It also parses the date strings in 'YYYY-MM-DD' format and converts them to Unix timestamps.

Parameters:

Name Type Description Default
path str

The path to the raw movie-ratings.txt file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/ciao/ciao_v1.py
def process(self, path) -> None:
    """
    Processes the raw `movie-ratings.txt` data and loads it into the class.

    This method reads the file, which has no header. It also parses the
    date strings in 'YYYY-MM-DD' format and converts them to Unix timestamps.

    Args:
        path (str): The path to the raw `movie-ratings.txt` file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    user_col = 0
    item_col = 1
    rating_col = 4
    timestamp_col = 5
    dataset = read_tabular(path, sep=',', user_col=user_col, item_col=item_col, rating_col=rating_col, timestamp_col=timestamp_col, header=None)

    # Convert the date strings to datetime objects using the specified format
    dataset.data[timestamp_col] = pd.to_datetime(dataset.data[timestamp_col], format='%Y-%m-%d')

    # Now extract the Unix timestamps (in seconds)
    timestamps = pd.Series(dataset.data[timestamp_col].apply(lambda x: x.timestamp()).values,
                           index=dataset.data.index, dtype='float64')
    dataset.data[timestamp_col] = timestamps

    self.data = dataset
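
As a standalone illustration of the date handling above (the date strings are sample values chosen only to match the 'YYYY-MM-DD' layout of movie-ratings.txt):

import pandas as pd

dates = pd.Series(['2011-04-09', '2013-07-21'])        # illustrative date strings
parsed = pd.to_datetime(dates, format='%Y-%m-%d')      # datetime64 values
unix_seconds = parsed.apply(lambda x: x.timestamp())   # float seconds, mirroring process()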

Epinions

Entry point for loading different versions of the Epinions dataset.

Epinions

Entry point class to load various versions of the Epinions dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

Epinions is a who-trust-whom online social network from a general consumer review site. Members of the site can decide whether to "trust" each other.

The default version is 'latest', which currently corresponds to 'v1'.

Examples:

To load the latest version:

>>> data_loader = Epinions()

To load a specific version:

>>> data_loader = Epinions(version='v1')
Source code in datarec/datasets/epinions/epinions.py
class Epinions:
    """
    Entry point class to load various versions of the Epinions dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    Epinions is a who-trust-whom online social network from a general consumer
    review site. Members of the site can decide whether to "trust" each other.

    The default version is 'latest', which currently corresponds to 'v1'.

    Examples:
        To load the latest version:
        >>> data_loader = Epinions()

        To load a specific version:
        >>> data_loader = Epinions(version='v1')
    """
    latest_version = 'v1'
    def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
        """
        Initializes and returns the specified version of the Epinions dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                'v1' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (Epinions_V1): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'v1': Epinions_V1}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Epinions: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Epinions dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
Epinions_V1

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/epinions/epinions.py
def __new__(cls, version: str = 'latest', **kwargs) -> DataRec:
    """
    Initializes and returns the specified version of the Epinions dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            'v1' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (Epinions_V1): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'v1': Epinions_V1}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Epinions: Unsupported version")

Builder class for the v1 version of the Epinions dataset.

Epinions_V1

Bases: DataRec

Builder class for the Epinions dataset.

This class handles the logic for downloading, preparing, and loading the Epinions social network dataset from the Stanford SNAP repository. It is not typically instantiated directly but is called by the Epinions entry point class.

The dataset was introduced in the paper "Trust Management for the Semantic Web". It represents a directed graph of trust relationships.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of files expected after decompression.

Source code in datarec/datasets/epinions/epinions_v1.py
class Epinions_V1(DataRec):
    """
    Builder class for the Epinions dataset.

    This class handles the logic for downloading, preparing, and loading the
    Epinions social network dataset from the Stanford SNAP repository. It is not
    typically instantiated directly but is called by the `Epinions` entry point class.

    The dataset was introduced in the paper "Trust Management for the Semantic Web".
    It represents a directed graph of trust relationships.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of files expected after decompression.
    """
    url = 'https://snap.stanford.edu/data/soc-Epinions1.txt.gz'
    data_file_name = os.path.basename(url)
    decompressed_file_name = data_file_name.replace('.gz', '')
    REQUIRED_FILES = [decompressed_file_name]
    CHECKSUM = '8df7433d4486ba68eb25e623feacff04'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'epinions'
        self.version_name = 'v1'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        self.process(file_path[0])

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        It first looks for the final, uncompressed file. If not found, it
        looks for the compressed archive and decompresses it.

        Returns:
            (list or None): A list containing the path to the required data file if
                it exists or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list containing the path to the decompressed file
                if successful, otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
        decompress_gz(path, decompressed_file)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return files
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded .gz archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path)

        return file_path

    def process(self, file_path):
        """
        Processes the raw trust network data and loads it into the class.

        This method reads the file, skipping the header comment
        lines. The first column is treated as the 'user' (truster) and the
        second column as the 'item' (trustee).

        Args:
            file_path (str): The path to the raw `soc-Epinions1.txt` file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep='\t',
                               user_col=0, item_col=1,
                               header=None, skiprows=4)

        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/epinions/epinions_v1.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'epinions'
    self.version_name = 'v1'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    self.process(file_path[0])

required_files()

Checks for the presence of the required decompressed data file.

It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.

Returns:

Type Description
list or None

A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/epinions/epinions_v1.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    It first looks for the final, uncompressed file. If not found, it
    looks for the compressed archive and decompresses it.

    Returns:
        (list or None): A list containing the path to the required data file if
            it exists or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list containing the path to the decompressed file if successful, otherwise None.

Source code in datarec/datasets/epinions/epinions_v1.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list containing the path to the decompressed file
            if successful, otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
    decompress_gz(path, decompressed_file)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return files
    return None

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded .gz archive.

Source code in datarec/datasets/epinions/epinions_v1.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded .gz archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path)

    return file_path

process(file_path)

Processes the raw trust network data and loads it into the class.

This method reads the file, skipping the header comment lines. The first column is treated as the 'user' (truster) and the second column as the 'item' (trustee).

Parameters:

Name Type Description Default
file_path str

The path to the raw soc-Epinions1.txt file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/epinions/epinions_v1.py
def process(self, file_path):
    """
    Processes the raw trust network data and loads it into the class.

    This method reads the file, skipping the header comment
    lines. The first column is treated as the 'user' (truster) and the
    second column as the 'item' (trustee).

    Args:
        file_path (str): The path to the raw `soc-Epinions1.txt` file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep='\t',
                           user_col=0, item_col=1,
                           header=None, skiprows=4)

    self.data = dataset
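
A rough pandas equivalent of the read above, where file_path stands for the decompressed soc-Epinions1.txt and the column names are merely labels for the two node ids:

import pandas as pd

# skiprows=4 drops the descriptive comment lines at the top of the SNAP file;
# column 0 is the trusting user, column 1 the trusted user (mapped to 'item').
edges = pd.read_csv(file_path, sep='\t', header=None,
                    skiprows=4, names=['user', 'item'])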

Gowalla

Entry point for loading different versions of the Gowalla dataset.

Gowalla

Entry point class to load various types of data from the Gowalla dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder for either the user check-ins or the social friendships graph.

Gowalla was a location-based social network. Two types of data are available:

- 'checkins': User interactions with locations (suitable for recommendation).
- 'friendships': The user-user social network graph.

The default version is 'latest', which currently corresponds to 'checkins'.

Examples:

To load the user check-in data (default):

>>> data_loader = Gowalla()
# or explicitly
>>> data_loader = Gowalla(version='checkins')

To load the social friendship graph:

>>> data_loader = Gowalla(version='friendships')
Source code in datarec/datasets/gowalla/gowalla.py
class Gowalla:
    """
    Entry point class to load various types of data from the Gowalla dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder for either the user check-ins or the social friendships graph.

    Gowalla was a location-based social network. Two types of data are available:
    - 'checkins': User interactions with locations (suitable for recommendation).
    - 'friendships': The user-user social network graph.

    The default version is 'latest', which currently corresponds to 'checkins'.

    Examples:
        To load the user check-in data (default):
        >>> data_loader = Gowalla()
        # or explicitly
        >>> data_loader = Gowalla(version='checkins')

        To load the social friendship graph:
        >>> data_loader = Gowalla(version='friendships')
    """
    latest_version = 'checkins'

    def __new__(cls, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the Gowalla dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific data type.

        Args:
            version (str): The type of data to load. Supported versions
                include 'checkins', 'friendships', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used).

        Returns:
            (GowallaCheckins or GowallaFriendships): An instance of the appropriate dataset
                builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'friendships': GowallaFriendships,
                    'checkins': GowallaCheckins}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Gowalla: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Gowalla dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific data type.

Parameters:

Name Type Description Default
version str

The type of data to load. Supported versions include 'checkins', 'friendships', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used).

{}

Returns:

Type Description
GowallaCheckins or GowallaFriendships

An instance of the appropriate dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/gowalla/gowalla.py
def __new__(cls, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the Gowalla dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific data type.

    Args:
        version (str): The type of data to load. Supported versions
            include 'checkins', 'friendships', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used).

    Returns:
        (GowallaCheckins or GowallaFriendships): An instance of the appropriate dataset
            builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'friendships': GowallaFriendships,
                'checkins': GowallaCheckins}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Gowalla: Unsupported version")

Builder class for the Gowalla check-ins dataset.

GowallaCheckins

Bases: DataRec

Builder class for the Gowalla check-ins dataset.

This class handles the logic for downloading, preparing, and loading the user check-in data from the Stanford SNAP repository. It is not typically instantiated directly but is called by the Gowalla entry point class when version='checkins'.

The dataset contains user check-ins at various locations, representing user-item interactions suitable for recommendation tasks. It was introduced in the paper "Friendship and mobility: user movement in location-based social networks".

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/gowalla/gowalla_checkins.py
class GowallaCheckins(DataRec):
    """
    Builder class for the Gowalla check-ins dataset.

    This class handles the logic for downloading, preparing, and loading the
    user check-in data from the Stanford SNAP repository. It is not typically
    instantiated directly but is called by the `Gowalla` entry point class
    when `version='checkins'`.

    The dataset contains user check-ins at various locations, representing
    user-item interactions suitable for recommendation tasks. It was introduced
    in the paper "Friendship and mobility: user movement in location-based social networks".

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz'
    data_file_name = os.path.basename(url)
    decompressed_file_name = data_file_name.replace('.gz', '')
    REQUIRED_FILES = [decompressed_file_name]
    CHECKSUM = '8ebd5ed2dd376d8982987c49429cb9f9'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'gowalla'
        self.version_name = 'checkins'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        self.process(file_path[0])

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        Returns:
            (list or None): A list containing the path to the required data file if
                it exists or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list containing the path to the decompressed file
                if successful, otherwise None.
        """
        print(path)
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
        decompress_gz(path, decompressed_file)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        print('Downloading data from: \'{}\''.format(self.url))
        download_file(self.url, file_path, size=105470044)

        return file_path

    def process(self, file_path):
        """
        Processes the raw check-in data and loads it into the class.

        This method reads the tab-separated file, which has no header. It maps
        column 0 to 'user_id', column 4 to 'item_id' (location ID), and column 1
        to 'timestamp'.

        Args:
            file_path (str): The path to the raw check-in data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep='\t', user_col=0, item_col=4, timestamp_col=1, header=None)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/gowalla/gowalla_checkins.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'gowalla'
    self.version_name = 'checkins'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    self.process(file_path[0])

required_files()

Checks for the presence of the required decompressed data file.

Returns:

Type Description
list or None

A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/gowalla/gowalla_checkins.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    Returns:
        (list or None): A list containing the path to the required data file if
            it exists or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list containing the path to the decompressed file if successful, otherwise None.

Source code in datarec/datasets/gowalla/gowalla_checkins.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list containing the path to the decompressed file
            if successful, otherwise None.
    """
    print(path)
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
    decompress_gz(path, decompressed_file)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None
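
The verify_checksum call above guards against truncated or corrupted downloads by comparing the archive's MD5 digest with the CHECKSUM class attribute. The library's own implementation is not shown here; the snippet below is only a minimal, hypothetical equivalent based on hashlib, included to illustrate what such a check typically involves.

import hashlib

def md5_matches(path, expected, chunk_size=1 << 20):
    """Illustrative stand-in for verify_checksum(); not the library's implementation."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected

# e.g. md5_matches('loc-gowalla_totalCheckins.txt.gz', '8ebd5ed2dd376d8982987c49429cb9f9')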

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/gowalla/gowalla_checkins.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    print('Downloading data from: \'{}\''.format(self.url))
    download_file(self.url, file_path, size=105470044)

    return file_path

process(file_path)

Processes the raw check-in data and loads it into the class.

This method reads the tab-separated file, which has no header. It maps column 0 to 'user_id', column 4 to 'item_id' (location ID), and column 1 to 'timestamp'.

Parameters:

Name Type Description Default
file_path str

The path to the raw check-in data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/gowalla/gowalla_checkins.py
def process(self, file_path):
    """
    Processes the raw check-in data and loads it into the class.

    This method reads the tab-separated file, which has no header. It maps
    column 0 to 'user_id', column 4 to 'item_id' (location ID), and column 1
    to 'timestamp'.

    Args:
        file_path (str): The path to the raw check-in data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep='\t', user_col=0, item_col=4, timestamp_col=1, header=None)
    self.data = dataset
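
In the raw SNAP file, each tab-separated row has the form [user] [check-in time] [latitude] [longitude] [location id], which is why the call above maps column 0 to the user, column 4 to the item, and column 1 to the timestamp. As an illustration of that mapping only (the library itself goes through datarec.io.read_tabular), an equivalent plain-pandas read would look roughly like this:

import pandas as pd

# Columns of loc-gowalla_totalCheckins.txt (tab-separated, no header).
cols = ['user_id', 'timestamp', 'latitude', 'longitude', 'item_id']
checkins = pd.read_csv('loc-gowalla_totalCheckins.txt', sep='\t', header=None, names=cols)
interactions = checkins[['user_id', 'item_id', 'timestamp']]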

Builder class for the Gowalla friendships (social network) dataset.

GowallaFriendships

Bases: DataRec

Builder class for the Gowalla friendships dataset.

This class handles the logic for downloading, preparing, and loading the user-user social network graph from the Stanford SNAP repository. It is not typically instantiated directly but is called by the Gowalla entry point class when version='friendships'.

The dataset contains the social friendship network of Gowalla users. Each row represents a directed edge from one user to another.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

Source code in datarec/datasets/gowalla/gowalla_friendships.py
class GowallaFriendships(DataRec):
    """
    Builder class for the Gowalla friendships dataset.

    This class handles the logic for downloading, preparing, and loading the
    user-user social network graph from the Stanford SNAP repository. It is not
    typically instantiated directly but is called by the `Gowalla` entry point class
    when `version='friendships'`.

    The dataset contains the social friendship network of Gowalla users. Each row
    represents a directed edge from one user to another.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
    """
    url = 'https://snap.stanford.edu/data/loc-gowalla_edges.txt.gz'
    data_file_name = os.path.basename(url)
    decompressed_file_name = data_file_name.replace('.gz', '')
    REQUIRED_FILES = [decompressed_file_name]
    CHECKSUM = '68bce8dc51609fe32bbd95e668aaf65e'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'gowalla'
        self.version_name = 'friendships'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)  # only one file

        self.process(file_path[0])

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        Returns:
            (list or None): A list containing the path to the required data file if
                it exists or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list containing the path to the decompressed file
                if successful, otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
        decompress_gz(path, decompressed_file)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path, size=6351523)

        return file_path

    def process(self, file_path):
        """
        Processes the raw friendship data and loads it into the class.

        This method reads the file, which has no header. Each row
        represents a user-user link. To fit the DataRec structure, the first
        user column is mapped to 'user_id' and the second to 'item_id'.

        Args:
            file_path (str): The path to the raw friendship data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep='\t', user_col=0, item_col=1, header=None)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/gowalla/gowalla_friendships.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'gowalla'
    self.version_name = 'friendships'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)  # only one file

    self.process(file_path[0])

required_files()

Checks for the presence of the required decompressed data file.

Returns:

Type Description
list or None

A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/gowalla/gowalla_friendships.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    Returns:
        (list or None): A list containing the path to the required data file if
            it exists or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list containing the path to the decompressed file if successful, otherwise None.

Source code in datarec/datasets/gowalla/gowalla_friendships.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list containing the path to the decompressed file
            if successful, otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompressed_file = os.path.join(self._raw_folder, self.decompressed_file_name)
    decompress_gz(path, decompressed_file)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/gowalla/gowalla_friendships.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path, size=6351523)

    return file_path

process(file_path)

Processes the raw friendship data and loads it into the class.

This method reads the file, which has no header. Each row represents a user-user link. To fit the DataRec structure, the first user column is mapped to 'user_id' and the second to 'item_id'.

Parameters:

Name Type Description Default
file_path str

The path to the raw friendship data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/gowalla/gowalla_friendships.py
def process(self, file_path):
    """
    Processes the raw friendship data and loads it into the class.

    This method reads the file, which has no header. Each row
    represents a user-user link. To fit the DataRec structure, the first
    user column is mapped to 'user_id' and the second to 'item_id'.

    Args:
        file_path (str): The path to the raw friendship data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep='\t', user_col=0, item_col=1, header=None)
    self.data = dataset
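
Because the friendship edges are stored in the generic user/item columns, downstream code that needs a social graph rather than interactions may want an explicit edge list. The sketch below works directly on the raw loc-gowalla_edges.txt file as described above (two tab-separated user ids per row, no header) and is an illustration only, not part of the library.

import pandas as pd

# One directed edge per row: source user \t target user.
edges = pd.read_csv('loc-gowalla_edges.txt', sep='\t', header=None, names=['source', 'target'])

# If an undirected view of the friendship graph is needed, one option is to
# keep each unordered pair once.
undirected = edges[edges['source'] < edges['target']]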

Last.fm

Entry point for loading different versions of the Last.fm dataset.

LastFM

Entry point class to load various versions of the Last.fm dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

This dataset contains social networking, tagging, and music artist listening information from the Last.fm online music system. This loader focuses on the user-artist listening data. It was released during the HetRec 2011 workshop.

The default version is 'latest', which currently corresponds to '2011'.

Examples:

To load the latest version:

>>> data_loader = LastFM()

To load a specific version:

>>> data_loader = LastFM(version='2011')
Source code in datarec/datasets/lastfm/lastfm.py
class LastFM:
    """
    Entry point class to load various versions of the Last.fm dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    This dataset contains social networking, tagging, and music artist listening
    information from the Last.fm online music system. This loader focuses on the
    user-artist listening data. It was released during the HetRec 2011 workshop.

    The default version is 'latest', which currently corresponds to '2011'.

    Examples:
        To load the latest version:
        >>> data_loader = LastFM()

        To load a specific version:
        >>> data_loader = LastFM(version='2011')
    """
    VERSIONS = {'2011': LastFM2011}
    latest_version = '2011'

    def __new__(cls, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the Last.fm dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                '2011' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (LastFM2011): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        if version == 'latest':
            version = cls.latest_version
        if version in cls.VERSIONS:
            return cls.VERSIONS[version]()
        else:
            raise ValueError(f"HetRec LastFM {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 2011 \t LastFM (HetRec) 2011")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Last.fm dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only '2011' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
LastFM2011

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/lastfm/lastfm.py
def __new__(cls, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the Last.fm dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            '2011' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (LastFM2011): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    if version == 'latest':
        version = cls.latest_version
    if version in cls.VERSIONS:
        return cls.VERSIONS[version]()
    else:
        raise ValueError(f"HetRec LastFM {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 2011 \t LastFM (HetRec) 2011")

Builder class for the 2011 version of the Last.fm dataset (HetRec 2011).

LastFM2011

Bases: DataRec

Builder class for the Last.fm dataset (HetRec 2011 version).

This class handles the logic for downloading, preparing, and loading the Last.fm dataset provided for the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011).

The full archive contains multiple files (user-friends, tags, etc.), but this loader specifically processes the user_artists.dat file, which contains artists listened to by each user and a corresponding listening count (weight).

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of all files expected within the decompressed archive.

Source code in datarec/datasets/lastfm/lastfm_2011.py
class LastFM2011(DataRec):
    """
    Builder class for the Last.fm dataset (HetRec 2011 version).

    This class handles the logic for downloading, preparing, and loading the
    Last.fm dataset provided for the 2nd International Workshop on Information
    Heterogeneity and Fusion in Recommender Systems (HetRec 2011).

    The full archive contains multiple files (user-friends, tags, etc.), but this
    loader specifically processes the `user_artists.dat` file, which contains
    artists listened to by each user and a corresponding listening count (`weight`).

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of all files expected within the decompressed archive.
    """

    url = 'https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip'
    data_file_name = os.path.basename(url)
    decompressed_data_file_name = data_file_name.replace('.zip', '')

    user_artists_file_name = 'user_artists.dat'
    tags_file_name = 'tags.dat'
    artists_file_name = 'artists.dat'
    user_taggedartists_file_name = 'user_taggedartists.dat'
    user_taggedartists_timestamp_file_name = 'user_taggedartists-timestamps.dat'
    user_friends_file_name = 'user_friends.dat'

    REQUIRED_FILES = [user_friends_file_name, user_taggedartists_file_name,
                      user_taggedartists_timestamp_file_name, artists_file_name,
                      tags_file_name, user_artists_file_name]
    CHECKSUM = '296d61afe4e8632b173fc2dd3be20ce2'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'lastfm'
        self.version_name = '2011'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        user_friends_file_path, user_taggedartists_file_path, user_taggedartists_timestamp_file_path, artists_file_path, tags_file_path, user_artists_file_path = file_path
        self.process(user_artists_file_path)

    def required_files(self):
        """
        Checks for the presence of all required decompressed data files.

        Returns:
            (list or None): A list of paths to the required data files if they
                exist or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]

        if all([os.path.exists(p) for p in paths]):
            return paths
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list of paths to the decompressed files if successful,
                otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompress_zip_file(path, self._raw_folder)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        download_file(self.url, file_path)

        return file_path

    def process(self, file_path):
        """
        Processes the raw `user_artists.dat` file and loads it into the class.

        This method reads the tab-separated file, which includes a header. It maps
        the 'userID', 'artistID', and 'weight' columns to the standard user, item,
        and rating columns, respectively. Note that timestamp information is not
        available in this specific file.

        Args:
            file_path (str): The path to the raw `user_artists.dat` file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular
        dataset = read_tabular(file_path, sep='\t',
                               user_col='userID', item_col='artistID',
                               rating_col='weight',
                               header=0)
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/lastfm/lastfm_2011.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'lastfm'
    self.version_name = '2011'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    user_friends_file_path, user_taggedartists_file_path, user_taggedartists_timestamp_file_path, artists_file_path, tags_file_path, user_artists_file_path = file_path
    self.process(user_artists_file_path)

required_files()

Checks for the presence of all required decompressed data files.

Returns:

Type Description
list or None

A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/lastfm/lastfm_2011.py
def required_files(self):
    """
    Checks for the presence of all required decompressed data files.

    Returns:
        (list or None): A list of paths to the required data files if they
            exist or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]

    if all([os.path.exists(p) for p in paths]):
        return paths
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list of paths to the decompressed files if successful, otherwise None.

Source code in datarec/datasets/lastfm/lastfm_2011.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list of paths to the decompressed files if successful,
            otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompress_zip_file(path, self._raw_folder)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/lastfm/lastfm_2011.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    download_file(self.url, file_path)

    return file_path

process(file_path)

Processes the raw user_artists.dat file and loads it into the class.

This method reads the tab-separated file, which includes a header. It maps the 'userID', 'artistID', and 'weight' columns to the standard user, item, and rating columns, respectively. Note that timestamp information is not available in this specific file.

Parameters:

Name Type Description Default
file_path str

The path to the raw user_artists.dat file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/lastfm/lastfm_2011.py
def process(self, file_path):
    """
    Processes the raw `user_artists.dat` file and loads it into the class.

    This method reads the tab-separated file, which includes a header. It maps
    the 'userID', 'artistID', and 'weight' columns to the standard user, item,
    and rating columns, respectively. Note that timestamp information is not
    available in this specific file.

    Args:
        file_path (str): The path to the raw `user_artists.dat` file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular
    dataset = read_tabular(file_path, sep='\t',
                           user_col='userID', item_col='artistID',
                           rating_col='weight',
                           header=0)
    self.data = dataset
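
Note that the weight column holds listening counts, so the values loaded into the rating column are implicit-feedback counts with a heavy-tailed distribution rather than bounded explicit ratings. A common preprocessing choice is to log-compress or binarize them. The sketch below works on the raw user_artists.dat file as described above (tab-separated with a header row) and is an illustration only, not part of the library.

import numpy as np
import pandas as pd

df = pd.read_csv('user_artists.dat', sep='\t')      # columns: userID, artistID, weight
df['log_weight'] = np.log1p(df['weight'])           # compress the heavy tail
df['clicked'] = 1                                   # or binarize to plain implicit feedback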

MIND (Microsoft News Dataset)

Entry point for loading different versions of the MIND dataset.

Mind

Entry point class to load various versions of the MIND dataset.

This class provides a single, convenient interface for accessing the Microsoft News Dataset (MIND). Based on the version parameter, it selects and returns the appropriate dataset builder for either the 'small' or 'large' version.

MIND is a large-scale dataset for news recommendation research. It contains user click histories on a news website.

Note: This dataset requires manual download from the official source.

The default version is 'latest', which currently corresponds to the 'large' version.

Examples:

To load the training split of the large version (default):

>>> data_loader = Mind()

To load the validation split of the small version:

>>> data_loader = Mind(version='small', split='validation')
Source code in datarec/datasets/mind/mind.py
class Mind:
    """
    Entry point class to load various versions of the MIND dataset.

    This class provides a single, convenient interface for accessing the Microsoft
    News Dataset (MIND). Based on the `version` parameter, it selects and returns
    the appropriate dataset builder for either the 'small' or 'large' version.

    MIND is a large-scale dataset for news recommendation research. It contains user
    click histories on a news website.

    **Note:** This dataset requires manual download from the official source.

    The default version is 'latest', which currently corresponds to the 'large' version.

    Examples:
        To load the training split of the large version (default):
        >>> data_loader = Mind()

        To load the validation split of the small version:
        >>> data_loader = Mind(version='small', split='validation')
    """
    latest_version = 'large'
    versions = {'large': MindLarge,
                'small': MindSmall}


    def __new__(cls, version: str = 'latest', split: str = 'train', **kwargs):
        """
        Initializes and returns the specified version of the MIND dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the preparation and loading for a specific dataset version and split.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include 'large', 'small', and 'latest'. Defaults to 'latest'.
            split (str): The data split to load. For 'large', options are 'train',
                'validation', 'test'. For 'small', options are 'train', 'validation'.
                Defaults to 'train'.
            **kwargs: Additional keyword arguments (not currently used).

        Returns:
            (MindLarge or MindSmall): An instance of the dataset builder class,
                populated with data from the specified split.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        if version == 'latest':
            version = cls.latest_version
        if version in cls.versions:
            return cls.versions[version](split=split)
        else:
            raise ValueError("Mind dataset: Unsupported version")

__new__(version='latest', split='train', **kwargs)

Initializes and returns the specified version of the MIND dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the preparation and loading for a specific dataset version and split.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include 'large', 'small', and 'latest'. Defaults to 'latest'.

'latest'
split str

The data split to load. For 'large', options are 'train', 'validation', 'test'. For 'small', options are 'train', 'validation'. Defaults to 'train'.

'train'
**kwargs

Additional keyword arguments (not currently used).

{}

Returns:

Type Description
MindLarge or MindSmall

An instance of the dataset builder class, populated with data from the specified split.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/mind/mind.py
def __new__(cls, version: str = 'latest', split: str = 'train', **kwargs):
    """
    Initializes and returns the specified version of the MIND dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the preparation and loading for a specific dataset version and split.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include 'large', 'small', and 'latest'. Defaults to 'latest'.
        split (str): The data split to load. For 'large', options are 'train',
            'validation', 'test'. For 'small', options are 'train', 'validation'.
            Defaults to 'train'.
        **kwargs: Additional keyword arguments (not currently used).

    Returns:
        (MindLarge or MindSmall): An instance of the dataset builder class,
            populated with data from the specified split.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    if version == 'latest':
        version = cls.latest_version
    if version in cls.versions:
        return cls.versions[version](split=split)
    else:
        raise ValueError("Mind dataset: Unsupported version")

Builder class for the large version of the MIND dataset.

MindLarge

Bases: DataRec

Builder class for the large version of the MIND dataset.

This class handles the logic for preparing and loading the MINDlarge dataset. It is not typically instantiated directly but is called by the Mind entry point class.

Note on usage: The MIND dataset must be downloaded manually. This class will prompt the user to download the required zip files and place them in the correct cache directory before proceeding with decompression and processing.

The dataset is pre-split into train, validation, and test sets, which can be loaded individually.

Attributes:

Name Type Description
source str

The official website for the dataset.

REQUIRED dict

A dictionary detailing the filenames and checksums for each data split (train, validation, test).

Source code in datarec/datasets/mind/mindLarge.py
class MindLarge(DataRec):
    """
    Builder class for the large version of the MIND dataset.

    This class handles the logic for preparing and loading the MINDlarge dataset.
    It is not typically instantiated directly but is called by the `Mind` entry
    point class.

    **Note on usage:** The MIND dataset must be downloaded manually. This class will
    prompt the user to download the required zip files and place them in the
    correct cache directory before proceeding with decompression and processing.

    The dataset is pre-split into train, validation, and test sets, which can be
    loaded individually.

    Attributes:
        source (str): The official website for the dataset.
        REQUIRED (dict): A dictionary detailing the filenames and checksums for
            each data split (train, validation, test).
    """
    source = 'https://msnews.github.io/#Download'

    REQUIRED = {
        'train': {
            'compressed': 'MINDlarge_train.zip',
            'decompressed': ['behaviors.tsv', 'entity_embedding.vec', 'relation_embedding.vec', 'news.tsv'],
            'interactions': 'behaviors.tsv',
            'checksum': '5be1c8f9a6809092db5fc0ac23d60f72'
        },
        'validation': {
            'compressed': 'MINDlarge_dev.zip',
            'decompressed': ['behaviors.tsv', 'entity_embedding.vec', 'relation_embedding.vec', 'news.tsv'],
            'interactions': 'behaviors.tsv',
            'checksum': '8f3dd8923172048b0e5980e7ee40841b'
        },
        'test': {
            'compressed': 'MINDlarge_test.zip',
            'decompressed': ['behaviors.tsv', 'entity_embedding.vec', 'relation_embedding.vec', 'news.tsv'],
            'interactions': 'behaviors.tsv',
            'checksum': '50406027c032898d9eddf9c8a8ecbc17'
        }
    }

    SPLITS = ('train', 'validation', 'test')

    NAME = 'MIND'
    VERSION = 'large'

    def __init__(self, folder=None, split='train'):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up paths and checks for the required files. If the
        compressed archives are missing, it provides instructions for manual
        download. It then proceeds to decompress and process the specified split.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
            split (str, optional): The data split to load, one of 'train',
                'validation', or 'test'. Defaults to 'train'.
        """
        super().__init__(None)

        self.dataset_name = self.NAME
        self.version_name = self.VERSION

        # set data folder and raw folder
        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(
            self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_directory(self.dataset_name), self.version_name, RAW_DATA_FOLDER)

        if not os.path.exists(self._data_folder):
            os.makedirs(self._data_folder)
            print('Created data folder \'{}\''.format(self._data_folder))

        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created raw folder \'{}\''.format(self._raw_folder))

        # check if the required files have been already downloaded
        found, missing = self.required_files()

        for file_type in missing:
            file_path = self.download(file_type=file_type)
            self.decompress(file_type, file_path)

        self.process(split=split)

    def required_files(self):
        """
        Checks for the presence of the required compressed and decompressed files.

        Returns:
            (tuple[list, list]): A tuple where the first element is a list of
                found splits and the second is a list of missing splits.
        """
        found, missing = [], []

        # check each required file
        for rn, rf in self.REQUIRED.items():
            # check if decompressed file is there
            comp, dec = rf['compressed'], rf['decompressed']
            decompressed_path = [os.path.join(self._raw_folder, rn, dec_name) for dec_name in dec]
            # all the decompressed files should be there, otherwise it needs to decompress the compressed file again
            if all([os.path.exists(p) for p in decompressed_path]):
                # files found!
                found.append(rn)
            else:
                # check if compressed file is there
                compressed_path = os.path.join(self._raw_folder, comp)
                if os.path.exists(compressed_path):
                    # decompress compressed file
                    self.decompress(file_type=rn, path=compressed_path)
                    # files ready!
                    found.append(rn)
                else:
                    # add missing file to missing list
                    missing.append(rn)
        return found, missing

    def decompress(self, file_type, path):
        """
        Decompresses the specified zip archive after verifying its checksum.

        Args:
            file_type (str): The split type ('train', 'validation', 'test').
            path (str): The file path of the compressed .zip archive.

        Returns:
            (list): A list of paths to the decompressed files.

        Raises:
            FileNotFoundError: If decompression fails to produce the expected files.
        """
        assert file_type in ('train', 'validation', 'test'), 'Invalid required file type'

        verify_checksum(path, self.REQUIRED[file_type]['checksum'])

        # decompress downloaded file
        output_folder = os.path.join(self._raw_folder, file_type)
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
        decompress_zip_file(path, output_folder)

        # check that all the files in the compressed file exist
        files = [os.path.join(output_folder, f) for f in self.REQUIRED[file_type]['decompressed']]
        if all([os.path.exists(f) for f in files]):
            return files
        else:
            raise FileNotFoundError(f'Error decompressing file \'{path}\'')

    def download(self, file_type) -> str:
        """
        Guides the user to manually download the dataset archive.

        This method does not download automatically. Instead, it prints instructions
        and waits for the user to place the required file in the cache directory.

        Args:
            file_type (str): The split type ('train', 'validation', 'test') to download.

        Returns:
            (str): The local file path to the user-provided archive.

        Raises:
            FileNotFoundError: If the file is not found after the user confirms download.
        """
        print(f'\'{file_type}\' MIND file not found.')

        print('Microsoft News Dataset (MIND) requires manual download from the source.\n'
              'Please download the dataset from the following link:\n'
              'https://msnews.github.io/#Download\n'
              'Then, place the downloaded files in the following directory:\n'
              f'{self._raw_folder}')

        # press continue after downloading the dataset
        input('Press Enter to continue after downloading the dataset...')

        file_path = os.path.join(self._raw_folder, self.REQUIRED[file_type]['compressed'])
        if not os.path.exists(file_path):
            raise FileNotFoundError(f'MIND \'{file_type}\' file not found.')
        return file_path

    def process_split(self, split) -> RawData:
        """
        Processes a single split from the decompressed files.

        This method reads the `behaviors.tsv` file for a given split, which
        is in an 'inline' format, and parses it.

        Args:
            split (str): The data split to process ('train', 'validation', or 'test').

        Returns:
            (RawData): A RawData object containing the user-item interactions.

        Raises:
            ValueError: If an invalid split name is provided.
        """

        if split not in self.SPLITS:
            raise ValueError(f'Invalid split type: {split}')

        # read the dataset
        file_path = os.path.join(self._raw_folder, split, self.REQUIRED[split]['interactions'])
        print('Reading file:', file_path)
        return read_inline(file_path, cols=['impression_id', 'user', 'time', 'item', 'impressions'],
                           col_sep='\t', history_sep=' ')

    def process(self, split) -> None:
        """
        Loads the processed data for the specified split into the class.

        Args:
            split (str): The data split to load ('train', 'validation', or 'test').

        Returns:
            (None): This method assigns the processed data to `self.data` directly.

        Raises:
            ValueError: If an invalid split name is provided.
        """

        if split not in self.SPLITS and split != 'merge':
            raise ValueError(f'Invalid split type: {split}')

        if split == 'merge':
            # merge all the splits
            data = None
            for s in self.SPLITS:
                print('Processing split:', s)
                if data is None:
                    data = self.process_split(split=s)
                else:
                    data = data + self.process_split(split=s)
            print('Split merged successfully')
            print('Sorting data...')
            data.data = data.data.sort_values(by=[data.user, data.item])
            print('Reindexing data...')
            data.data = data.data.reset_index(drop=True)
            print(f'{split} split processed successfully')
            self.data = data
        else:
            self.data = self.process_split(split=split)
            print(f'{split} split processed successfully')

__init__(folder=None, split='train')

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up paths and checks for the required files. If the compressed archives are missing, it provides instructions for manual download. It then proceeds to decompress and process the specified split.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
split str

The data split to load, one of 'train', 'validation', or 'test'. Defaults to 'train'.

'train'
Source code in datarec/datasets/mind/mindLarge.py
def __init__(self, folder=None, split='train'):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up paths and checks for the required files. If the
    compressed archives are missing, it provides instructions for manual
    download. It then proceeds to decompress and process the specified split.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
        split (str, optional): The data split to load, one of 'train',
            'validation', or 'test'. Defaults to 'train'.
    """
    super().__init__(None)

    self.dataset_name = self.NAME
    self.version_name = self.VERSION

    # set data folder and raw folder
    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(
        self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_directory(self.dataset_name), self.version_name, RAW_DATA_FOLDER)

    if not os.path.exists(self._data_folder):
        os.makedirs(self._data_folder)
        print('Created data folder \'{}\''.format(self._data_folder))

    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created raw folder \'{}\''.format(self._raw_folder))

    # check if the required files have been already downloaded
    found, missing = self.required_files()

    for file_type in missing:
        file_path = self.download(file_type=file_type)
        self.decompress(file_type, file_path)

    self.process(split=split)
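
A minimal usage sketch, assuming the class is importable from the module path shown above (datarec/datasets/mind/mindLarge.py) and that the MIND archives have already been placed in the raw cache folder as instructed by download():

# Hypothetical usage sketch; the import path is inferred from the source path above.
from datarec.datasets.mind.mindLarge import MindLarge

# Constructing the builder checks the cache, decompresses any archives that are
# present, and loads the requested split into the .data attribute.
builder = MindLarge(split='validation')
print(builder.data)  # RawData with the validation-split interactions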

required_files()

Checks for the presence of the required compressed and decompressed files.

Returns:

Type Description
tuple[list, list]

A tuple where the first element is a list of found splits and the second is a list of missing splits.

Source code in datarec/datasets/mind/mindLarge.py
def required_files(self):
    """
    Checks for the presence of the required compressed and decompressed files.

    Returns:
        (tuple[list, list]): A tuple where the first element is a list of
            found splits and the second is a list of missing splits.
    """
    found, missing = [], []

    # check each required file
    for rn, rf in self.REQUIRED.items():
        # check if decompressed file is there
        comp, dec = rf['compressed'], rf['decompressed']
        decompressed_path = [os.path.join(self._raw_folder, rn, dec_name) for dec_name in dec]
        # all the decompressed files should be there, otherwise it needs to decompress the compressed file again
        if all([os.path.exists(p) for p in decompressed_path]):
            # files found!
            found.append(rn)
        else:
            # check if compressed file is there
            compressed_path = os.path.join(self._raw_folder, comp)
            if os.path.exists(compressed_path):
                # decompress compressed file
                self.decompress(file_type=rn, path=compressed_path)
                # files ready!
                found.append(rn)
            else:
                # add missing file to missing list
                missing.append(rn)
    return found, missing

decompress(file_type, path)

Decompresses the specified zip archive after verifying its checksum.

Parameters:

Name Type Description Default
file_type str

The split type ('train', 'validation', 'test').

required
path str

The file path of the compressed .zip archive.

required

Returns:

Type Description
list

A list of paths to the decompressed files.

Raises:

Type Description
FileNotFoundError

If decompression fails to produce the expected files.

Source code in datarec/datasets/mind/mindLarge.py
def decompress(self, file_type, path):
    """
    Decompresses the specified zip archive after verifying its checksum.

    Args:
        file_type (str): The split type ('train', 'validation', 'test').
        path (str): The file path of the compressed .zip archive.

    Returns:
        (list): A list of paths to the decompressed files.

    Raises:
        FileNotFoundError: If decompression fails to produce the expected files.
    """
    assert file_type in ('train', 'validation', 'test'), 'Invalid required file type'

    verify_checksum(path, self.REQUIRED[file_type]['checksum'])

    # decompress downloaded file
    output_folder = os.path.join(self._raw_folder, file_type)
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    decompress_zip_file(path, output_folder)

    # check that all the files in the compressed file exist
    files = [os.path.join(output_folder, f) for f in self.REQUIRED[file_type]['decompressed']]
    if all([os.path.exists(f) for f in files]):
        return files
    else:
        raise FileNotFoundError(f'Error decompressing file \'{path}\'')

download(file_type)

Guides the user to manually download the dataset archive.

This method does not download automatically. Instead, it prints instructions and waits for the user to place the required file in the cache directory.

Parameters:

Name Type Description Default
file_type str

The split type ('train', 'validation', 'test') to download.

required

Returns:

Type Description
str

The local file path to the user-provided archive.

Raises:

Type Description
FileNotFoundError

If the file is not found after the user confirms download.

Source code in datarec/datasets/mind/mindLarge.py
def download(self, file_type) -> str:
    """
    Guides the user to manually download the dataset archive.

    This method does not download automatically. Instead, it prints instructions
    and waits for the user to place the required file in the cache directory.

    Args:
        file_type (str): The split type ('train', 'validation', 'test') to download.

    Returns:
        (str): The local file path to the user-provided archive.

    Raises:
        FileNotFoundError: If the file is not found after the user confirms download.
    """
    print(f'\'{file_type}\' MIND file not found.')

    print('Microsoft News Dataset (MIND) requires manual download from the source.\n'
          'Please download the dataset from the following link:\n'
          'https://msnews.github.io/#Download\n'
          'Then, place the downloaded files in the following directory:\n'
          f'{self._raw_folder}')

    # press continue after downloading the dataset
    input('Press Enter to continue after downloading the dataset...')

    file_path = os.path.join(self._raw_folder, self.REQUIRED[file_type]['compressed'])
    if not os.path.exists(file_path):
        raise FileNotFoundError(f'MIND \'{file_type}\' file not found.')
    return file_path

process_split(split)

Processes a single split from the decompressed files.

This method reads the behaviors.tsv file for a given split, which is in an 'inline' format, and parses it.

Parameters:

Name Type Description Default
split str

The data split to process ('train', 'validation', or 'test').

required

Returns:

Type Description
RawData

A RawData object containing the user-item interactions.

Raises:

Type Description
ValueError

If an invalid split name is provided.

Source code in datarec/datasets/mind/mindLarge.py
def process_split(self, split) -> RawData:
    """
    Processes a single split from the decompressed files.

    This method reads the `behaviors.tsv` file for a given split, which
    is in an 'inline' format, and parses it.

    Args:
        split (str): The data split to process ('train', 'validation', or 'test').

    Returns:
        (RawData): A RawData object containing the user-item interactions.

    Raises:
        ValueError: If an invalid split name is provided.
    """

    if split not in self.SPLITS:
        raise ValueError(f'Invalid split type: {split}')

    # read the dataset
    file_path = os.path.join(self._raw_folder, split, self.REQUIRED[split]['interactions'])
    print('Reading file:', file_path)
    return read_inline(file_path, cols=['impression_id', 'user', 'time', 'item', 'impressions'],
                       col_sep='\t', history_sep=' ')
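
For orientation, each behaviors.tsv row is tab-separated and the click history is a space-separated list of news IDs, which is why read_inline is called with col_sep='\t' and history_sep=' '. A rough, made-up illustration of that layout:

# Illustrative only: a made-up behaviors.tsv-style line, split with the same
# separators that read_inline receives above.
line = "91\tU335175\t11/15/2019 8:55:22 AM\tN18870 N30924\tN53515-1 N23877-0"
impression_id, user, time, item, impressions = line.split('\t')
history = item.split(' ')  # the 'item' column carries the space-separated click history
print(user, history, impressions)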

process(split)

Loads the processed data for the specified split into the class.

Parameters:

Name Type Description Default
split str

The data split to load ('train', 'validation', or 'test').

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Raises:

Type Description
ValueError

If an invalid split name is provided.

Source code in datarec/datasets/mind/mindLarge.py
def process(self, split) -> None:
    """
    Loads the processed data for the specified split into the class.

    Args:
        split (str): The data split to load ('train', 'validation', or 'test').

    Returns:
        (None): This method assigns the processed data to `self.data` directly.

    Raises:
        ValueError: If an invalid split name is provided.
    """

    if split not in self.SPLITS and split != 'merge':
        raise ValueError(f'Invalid split type: {split}')

    if split == 'merge':
        # merge all the splits
        data = None
        for s in self.SPLITS:
            print('Processing split:', s)
            if data is None:
                data = self.process_split(split=s)
            else:
                data = data + self.process_split(split=s)
        print('Split merged successfully')
        print('Sorting data...')
        data.data = data.data.sort_values(by=[data.user, data.item])
        print('Reindexing data...')
        data.data = data.data.reset_index(drop=True)
        print(f'{split} split processed successfully')
        self.data = data
    else:
        self.data = self.process_split(split=split)
        print(f'{split} split processed successfully')
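
A short sketch of the merge option described above (hypothetical import path, inferred from the source path): passing split='merge' concatenates every split, then sorts by user and item and resets the index.

# Hypothetical sketch: build one merged RawData from all available splits.
from datarec.datasets.mind.mindLarge import MindLarge

merged = MindLarge(split='merge')
# merged.data is a RawData; its underlying DataFrame is sorted and reindexed.
print(merged.data.data.head())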

Builder class for the small version of the MIND dataset.

MindSmall

Bases: MindLarge

Builder class for the small version of the MIND dataset.

This class handles the logic for preparing and loading the MINDsmall dataset. It inherits most of its functionality from the MindLarge class but overrides the required file configurations for the smaller version.

MINDsmall is a smaller version of the MIND dataset, suitable for rapid prototyping. It contains only train and validation splits.

Note on usage: Like the large version, this dataset requires manual download.

Attributes:

Name Type Description
REQUIRED dict

A dictionary detailing the filenames and checksums for each data split (train, validation).

SPLITS tuple

The available splits for this version.

Source code in datarec/datasets/mind/mindSmall.py
class MindSmall(MindLarge):
    """
    Builder class for the small version of the MIND dataset.

    This class handles the logic for preparing and loading the MINDsmall dataset.
    It inherits most of its functionality from the `MindLarge` class but overrides
    the required file configurations for the smaller version.

    MINDsmall is a smaller version of the MIND dataset, suitable for rapid
    prototyping. It contains only `train` and `validation` splits.

    **Note on usage:** Like the large version, this dataset requires manual download.

    Attributes:
        REQUIRED (dict): A dictionary detailing the filenames and checksums for
            each data split (train, validation).
        SPLITS (tuple): The available splits for this version.
    """

    REQUIRED = {
        'train': {
            'compressed': 'MINDsmall_train.zip',
            'decompressed': ['behaviors.tsv', 'entity_embedding.vec', 'relation_embedding.vec', 'news.tsv'],
            'interactions': 'behaviors.tsv',
            'checksum': '8ab752c7d11564622d93132be05dcf6b'
        },
        'validation': {
            'compressed': 'MINDsmall_dev.zip',
            'decompressed': ['behaviors.tsv', 'entity_embedding.vec', 'relation_embedding.vec', 'news.tsv'],
            'interactions': 'behaviors.tsv',
            'checksum': 'e3bac5485be8fc7a9934e85e3b78615f'
        }
    }

    SPLITS = ('train', 'validation')

    VERSION = 'small'

    def __init__(self, folder=None, split='train'):
        """
        Initializes the builder for the MINDsmall dataset.

        This constructor calls the parent `MindLarge` constructor but will use the
        overridden `REQUIRED`, `SPLITS`, and `VERSION` attributes specific to
        the small version of the dataset.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
            split (str, optional): The data split to load, one of 'train' or
                'validation'. Defaults to 'train'.
        """

        super().__init__(folder=folder, split=split)

__init__(folder=None, split='train')

Initializes the builder for the MINDsmall dataset.

This constructor calls the parent MindLarge constructor but will use the overridden REQUIRED, SPLITS, and VERSION attributes specific to the small version of the dataset.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
split str

The data split to load, one of 'train' or 'validation'. Defaults to 'train'.

'train'
Source code in datarec/datasets/mind/mindSmall.py
def __init__(self, folder=None, split='train'):
    """
    Initializes the builder for the MINDsmall dataset.

    This constructor calls the parent `MindLarge` constructor but will use the
    overridden `REQUIRED`, `SPLITS`, and `VERSION` attributes specific to
    the small version of the dataset.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
        split (str, optional): The data split to load, one of 'train' or
            'validation'. Defaults to 'train'.
    """

    super().__init__(folder=folder, split=split)
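
A minimal sketch for the small version, under the same assumptions as for MindLarge (import path inferred from the source path above, archives downloaded manually beforehand):

# Hypothetical usage sketch for MINDsmall.
from datarec.datasets.mind.mindSmall import MindSmall

# Only 'train' and 'validation' splits exist for this version; the workflow is
# otherwise identical to MindLarge, just with the MINDsmall archives.
small = MindSmall(split='train')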

MovieLens

Entry point for loading different versions of the MovieLens dataset.

MovieLens

Entry point class to load various versions of the MovieLens dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

The MovieLens datasets are a collection of movie ratings data collected by the GroupLens Research project at the University of Minnesota.

The default version is 'latest', which currently corresponds to the '1m' version.

Examples:

To load the latest version (1M):

>>> ml_1m = MovieLens().prepare_and_load()

To load a specific version (e.g., 100k):

>>> ml_100k = MovieLens(version='100k').prepare_and_load()
Source code in datarec/datasets/movielens/movielens.py
class MovieLens:
    """Entry point class to load various versions of the MovieLens dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    The MovieLens datasets are a collection of movie ratings data collected by the
    GroupLens Research project at the University of Minnesota.

    The default version is 'latest', which currently corresponds to the '1m' version.

    Examples:
        To load the latest version (1M):
        >>> ml_1m = MovieLens().prepare_and_load()

        To load a specific version (e.g., 100k):
        >>> ml_100k = MovieLens(version='100k').prepare_and_load()
    """
    latest_version = '1m'

    def __new__(cls, version: str = 'latest', **kwargs) -> BaseDataRecBuilder:
        """
        Initializes and returns the specified version of the MovieLens dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Note: The returned object is a builder. You must call `.prepare_and_load()`
        on it to get a populated `DataRec` object.

        Args:
            version (str): The version of the dataset to load. Supported versions
                include '1m', '20m', '100k', and 'latest'. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used).

        Returns:
            (BaseDataRecBuilder): An instance of the appropriate dataset builder class
                (e.g., `MovieLens1M`), ready to prepare and load data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """
        versions = {'1m': MovieLens1M,
                    '20m': MovieLens20M,
                    '100k': MovieLens100k}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError(f"MovieLens {version}: Unsupported version \n Supported version:"
                             f"\n \t version \t name "
                             f"\n \t 1m \t Movielens 1 Million "
                             f"\n \t 20m \t Movielens 20 Million")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the MovieLens dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Note: The returned object is a builder. You must call .prepare_and_load() on it to get a populated DataRec object.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Supported versions include '1m', '20m', '100k', and 'latest'. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used).

{}

Returns:

Type Description
BaseDataRecBuilder

An instance of the appropriate dataset builder class (e.g., MovieLens1M), ready to prepare and load data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/movielens/movielens.py
def __new__(cls, version: str = 'latest', **kwargs) -> BaseDataRecBuilder:
    """
    Initializes and returns the specified version of the MovieLens dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Note: The returned object is a builder. You must call `.prepare_and_load()`
    on it to get a populated `DataRec` object.

    Args:
        version (str): The version of the dataset to load. Supported versions
            include '1m', '20m', '100k', and 'latest'. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used).

    Returns:
        (BaseDataRecBuilder): An instance of the appropriate dataset builder class
            (e.g., `MovieLens1M`), ready to prepare and load data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """
    versions = {'1m': MovieLens1M,
                '20m': MovieLens20M,
                '100k': MovieLens100k}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError(f"MovieLens {version}: Unsupported version \n Supported version:"
                         f"\n \t version \t name "
                         f"\n \t 1m \t Movielens 1 Million "
                         f"\n \t 20m \t Movielens 20 Million")

Builder class for the MovieLens 100k dataset.

MovieLens100k

Bases: BaseDataRecBuilder

Builder class for the MovieLens 100k dataset.

This dataset contains 100,000 ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.

The raw data is provided in a tab-separated file (u.data).

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of file paths expected after decompression.

Source code in datarec/datasets/movielens/movielens100k.py
class MovieLens100k(BaseDataRecBuilder):
    """
    Builder class for the MovieLens 100k dataset.

    This dataset contains 100,000 ratings. It is not typically instantiated
    directly but is called by the `MovieLens` entry point class.

    The raw data is provided in a tab-separated file (`u.data`).

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of file paths expected after decompression.
    """
    url = 'https://files.grouplens.org/datasets/movielens/ml-100k.zip'
    data_file_name = os.path.basename(url)
    ratings_file_name = 'u.data'
    REQUIRED_FILES = [os.path.join('ml-100k', p) for p in [ratings_file_name]]
    CHECKSUM = "0e33842e24a9c977be4e0107933c0723"

    def __init__(self, folder=None):
        """
        Initializes the builder.

        This constructor sets up the necessary paths for caching the dataset.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        self.dataset_name = 'MovieLens'
        self.version_name = '100k'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(
            os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    def prepare(self):
        """
        Ensures all required raw files are downloaded and decompressed.

        This method checks for the existence of the required files. If they are
        not found, it triggers the download and decompression process.
        """
        if self.required_files() is not None:
            # All required files are already available
            return

        file_path = self.download()
        verify_checksum(file_path, self.CHECKSUM)
        self.decompress(file_path)

    def load(self):
        """
        Loads the prepared `u.data` file into a DataRec object.

        Returns:
            (DataRec): A DataRec object containing the user-item interactions.
        """
        from datarec.io import read_tabular

        ratings_file_path = self.required_files()[0]
        dataset = read_tabular(ratings_file_path, sep='\t', user_col=0, item_col=1, rating_col=2, timestamp_col=3,
                               header=None)
        return DataRec(rawdata=dataset,
                       dataset_name=self.dataset_name,
                       version_name=self.version_name)

    def required_files(self):
        """
        Check whether the required dataset files exist.

        Returns:
            (list[str]): Paths to required files if they exist, or None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompress the downloaded zip file and verify required files.

        Args:
            path (str): Path to the zip file.

        Returns:
            (list[str]): Paths to the extracted files if successful, or None.
        """
        decompress_zip_file(path, self._raw_folder)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> str:
        """
        Download the raw dataset zip file to the raw folder.

        Returns:
            (str): Path to the downloaded zip file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path)

        return file_path

__init__(folder=None)

Initializes the builder.

This constructor sets up the necessary paths for caching the dataset.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/movielens/movielens100k.py
def __init__(self, folder=None):
    """
    Initializes the builder.

    This constructor sets up the necessary paths for caching the dataset.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    self.dataset_name = 'MovieLens'
    self.version_name = '100k'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(
        os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

prepare()

Ensures all required raw files are downloaded and decompressed.

This method checks for the existence of the required files. If they are not found, it triggers the download and decompression process.

Source code in datarec/datasets/movielens/movielens100k.py
def prepare(self):
    """
    Ensures all required raw files are downloaded and decompressed.

    This method checks for the existence of the required files. If they are
    not found, it triggers the download and decompression process.
    """
    if self.required_files() is not None:
        # All required files are already available
        return

    file_path = self.download()
    verify_checksum(file_path, self.CHECKSUM)
    self.decompress(file_path)

load()

Loads the prepared u.data file into a DataRec object.

Returns:

Type Description
DataRec

A DataRec object containing the user-item interactions.

Source code in datarec/datasets/movielens/movielens100k.py
def load(self):
    """
    Loads the prepared `u.data` file into a DataRec object.

    Returns:
        (DataRec): A DataRec object containing the user-item interactions.
    """
    from datarec.io import read_tabular

    ratings_file_path = self.required_files()[0]
    dataset = read_tabular(ratings_file_path, sep='\t', user_col=0, item_col=1, rating_col=2, timestamp_col=3,
                           header=None)
    return DataRec(rawdata=dataset,
                   dataset_name=self.dataset_name,
                   version_name=self.version_name)
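
For reference, u.data is tab-separated with no header row, and the four positional columns are user id, item id, rating, and timestamp; the call above maps them by index. A rough pandas equivalent of that mapping (a sketch only, not the library's internal implementation):

import pandas as pd

# Sketch of the positional column mapping used by the read_tabular call above.
df = pd.read_csv('ml-100k/u.data', sep='\t', header=None,
                 names=['user', 'item', 'rating', 'timestamp'])
print(df.head())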

required_files()

Check whether the required dataset files exist.

Returns:

Type Description
list[str]

Paths to required files if they exist, or None.

Source code in datarec/datasets/movielens/movielens100k.py
def required_files(self):
    """
    Check whether the required dataset files exist.

    Returns:
        (list[str]): Paths to required files if they exist, or None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompress the downloaded zip file and verify required files.

Parameters:

Name Type Description Default
path str

Path to the zip file.

required

Returns:

Type Description
list[str]

Paths to the extracted files if successful, or None.

Source code in datarec/datasets/movielens/movielens100k.py
def decompress(self, path):
    """
    Decompress the downloaded zip file and verify required files.

    Args:
        path (str): Path to the zip file.

    Returns:
        (list[str]): Paths to the extracted files if successful, or None.
    """
    decompress_zip_file(path, self._raw_folder)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None

download()

Download the raw dataset zip file to the raw folder.

Returns:

Type Description
str

Path to the downloaded zip file.

Source code in datarec/datasets/movielens/movielens100k.py
def download(self) -> str:
    """
    Download the raw dataset zip file to the raw folder.

    Returns:
        (str): Path to the downloaded zip file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path)

    return file_path

MovieLens1M

Bases: BaseDataRecBuilder

Builder class for the MovieLens 1M dataset.

This dataset contains 1 million ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.

The raw ratings data is provided in ratings.dat with a :: separator.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of file paths expected after decompression.

Source code in datarec/datasets/movielens/movielens1m.py
class MovieLens1M(BaseDataRecBuilder):
    """
    Builder class for the MovieLens 1M dataset.

    This dataset contains 1 million ratings. It is not typically instantiated directly but is
    called by the `MovieLens` entry point class.

    The raw ratings data is provided in `ratings.dat` with a `::` separator.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of file paths expected after decompression.
    """
    url = 'https://files.grouplens.org/datasets/movielens/ml-1m.zip'
    data_file_name = os.path.basename(url)
    movies_file_name = 'movies.dat'
    ratings_file_name = 'ratings.dat'
    users_file_name = 'users.dat'
    REQUIRED_FILES = [os.path.join('ml-1m', p) for p in [movies_file_name, ratings_file_name, users_file_name]]
    CHECKSUM = "c4d9eecfca2ab87c1945afe126590906"

    def __init__(self, folder=None):
        """
        Initializes the builder.

        This constructor sets up the necessary paths for caching the dataset.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        self.dataset_name = 'MovieLens'
        self.version_name = '1m'

        self._data_folder = folder if folder else dataset_directory(self.dataset_name)
        self._raw_folder = (
            os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER))
            if folder
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
        )

    def download(self) -> str:
        """
        Downloads the raw dataset archive file.

        Returns:
            (str): The local file path to the downloaded zip file.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print(f"Created folder '{self._raw_folder}'")

        file_path = os.path.join(self._raw_folder, self.data_file_name)
        if not os.path.exists(file_path):
            download_file(self.url, file_path)
        return file_path

    def prepare(self):
        """
        Ensures all required raw files are downloaded and decompressed.

        This method checks for the existence of the required files. If they are
        not found, it triggers the download and decompression process.
        """
        raw_paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all(os.path.exists(p) for p in raw_paths):
            return

        archive_path = os.path.join(self._raw_folder, self.data_file_name)
        if not os.path.exists(archive_path):
            archive_path = self.download()

        verify_checksum(archive_path, self.CHECKSUM)
        decompress_zip_file(archive_path, self._raw_folder)

    def load(self) -> DataRec:
        """
        Loads the prepared `ratings.dat` file into a DataRec object.

        Returns:
            (DataRec): A DataRec object containing the user-item interactions.
        """
        ratings_path = os.path.join(self._raw_folder, 'ml-1m', self.ratings_file_name)
        dataset = read_tabular(ratings_path, sep='::', user_col=0, item_col=1, rating_col=2, timestamp_col=3, header=None)

        dr = DataRec(dataset_name=self.dataset_name, version_name=self.version_name)
        dr.data = dataset
        return dr
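
A minimal sketch of using this builder directly (it is normally obtained via MovieLens(version='1m'); the import path is inferred from the source path above):

# Hypothetical direct use of the 1M builder.
from datarec.datasets.movielens.movielens1m import MovieLens1M

builder = MovieLens1M()
builder.prepare()    # downloads ml-1m.zip if missing, verifies the MD5 checksum, unzips it
dr = builder.load()  # parses ratings.dat (fields separated by '::') into a DataRec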

__init__(folder=None)

Initializes the builder.

This constructor sets up the necessary paths for caching the dataset.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/movielens/movielens1m.py
def __init__(self, folder=None):
    """
    Initializes the builder.

    This constructor sets up the necessary paths for caching the dataset.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    self.dataset_name = 'MovieLens'
    self.version_name = '1m'

    self._data_folder = folder if folder else dataset_directory(self.dataset_name)
    self._raw_folder = (
        os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER))
        if folder
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)
    )

download()

Downloads the raw dataset archive file.

Returns:

Type Description
str

The local file path to the downloaded zip file.

Source code in datarec/datasets/movielens/movielens1m.py
def download(self) -> str:
    """
    Downloads the raw dataset archive file.

    Returns:
        (str): The local file path to the downloaded zip file.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print(f"Created folder '{self._raw_folder}'")

    file_path = os.path.join(self._raw_folder, self.data_file_name)
    if not os.path.exists(file_path):
        download_file(self.url, file_path)
    return file_path

prepare()

Ensures all required raw files are downloaded and decompressed.

This method checks for the existence of the required files. If they are not found, it triggers the download and decompression process.

Source code in datarec/datasets/movielens/movielens1m.py
def prepare(self):
    """
    Ensures all required raw files are downloaded and decompressed.

    This method checks for the existence of the required files. If they are
    not found, it triggers the download and decompression process.
    """
    raw_paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all(os.path.exists(p) for p in raw_paths):
        return

    archive_path = os.path.join(self._raw_folder, self.data_file_name)
    if not os.path.exists(archive_path):
        archive_path = self.download()

    verify_checksum(archive_path, self.CHECKSUM)
    decompress_zip_file(archive_path, self._raw_folder)

load()

Loads the prepared ratings.dat file into a DataRec object.

Returns:

Type Description
DataRec

A DataRec object containing the user-item interactions.

Source code in datarec/datasets/movielens/movielens1m.py
def load(self) -> DataRec:
    """
    Loads the prepared `ratings.dat` file into a DataRec object.

    Returns:
        (DataRec): A DataRec object containing the user-item interactions.
    """
    ratings_path = os.path.join(self._raw_folder, 'ml-1m', self.ratings_file_name)
    dataset = read_tabular(ratings_path, sep='::', user_col=0, item_col=1, rating_col=2, timestamp_col=3, header=None)

    dr = DataRec(dataset_name=self.dataset_name, version_name=self.version_name)
    dr.data = dataset
    return dr

Builder class for the MovieLens 20M dataset.

MovieLens20M

Bases: DataRec

Builder class for the MovieLens 20M dataset.

This dataset contains 20 million ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.

This loader specifically processes the ratings.csv file from the archive.

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the downloaded file.

REQUIRED_FILES list

A list of all files expected after decompression.

Source code in datarec/datasets/movielens/movielens20m.py
class MovieLens20M(DataRec):
    """
    Builder class for the MovieLens 20M dataset.

    This dataset contains 20 million ratings. It is not typically instantiated directly
    but is called by the `MovieLens` entry point class.

    This loader specifically processes the `ratings.csv` file from the archive.

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the downloaded file.
        REQUIRED_FILES (list): A list of all files expected after decompression.
    """
    url = 'https://files.grouplens.org/datasets/movielens/ml-20m.zip'
    data_file_name = os.path.basename(url)
    genome_scores_file_name = 'genome-scores.csv'
    genome_tags_file_name = 'genome-tags.csv'
    links_file_name = 'links.csv'
    movies_file_name = 'movies.csv'
    ratings_file_name = 'ratings.csv'
    tags_file_name = 'tags.csv'
    REQUIRED_FILES = [os.path.join('ml-20m', p) for p in [genome_scores_file_name,
                                                          genome_tags_file_name,
                                                          links_file_name,
                                                          movies_file_name,
                                                          ratings_file_name,
                                                          tags_file_name]]
    CHECKSUM = "cd245b17a1ae2cc31bb14903e1204af3"

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'MovieLens'
        self.version_name = '20m'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
            else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'files: {file_path}')

        # the order of the paths is the same as REQUIRED_FILES
        genome_scores_file_path, genome_tags_file_path, links_file_path, movies_file_path, ratings_file_path,\
            tags_file_path = file_path

        self.process(ratings_file_path)


    def required_files(self):
        """
        Checks for the presence of all required decompressed data files.

        Returns:
            (list or None): A list of paths to the required data files if they
                exist or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list of paths to the decompressed files if successful,
                otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompress_zip_file(path, self._raw_folder)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> str:
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        # download dataset file (compressed file)
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        download_file(self.url, file_path)

        return file_path

    def process(self, file_path):
        """
        Processes the raw `ratings.csv` file and loads it into the class.

        This method reads the file, which includes a header row, and maps
        the columns to the standard user, item, rating, and timestamp fields.

        Args:
            file_path (str): The path to the raw `ratings.csv` file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',',
                               user_col='userId', item_col='movieId', rating_col='rating', timestamp_col='timestamp',
                               header=0)
        self.data = dataset
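
Note that, unlike the 100k and 1m builders, this class subclasses DataRec and runs the whole download-verify-decompress-parse pipeline inside its constructor. A minimal sketch (import path inferred from the source path above):

# Hypothetical sketch: constructing the class already prepares and loads the data.
from datarec.datasets.movielens.movielens20m import MovieLens20M

ml20m = MovieLens20M()   # downloads ml-20m.zip if needed, then parses ratings.csv
print(ml20m.data)        # the parsed interactions are available on .data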

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/movielens/movielens20m.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'MovieLens'
    self.version_name = '20m'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, self.version_name, RAW_DATA_FOLDER)) if folder \
        else os.path.join(dataset_raw_directory(self.dataset_name), self.version_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'files: {file_path}')

    # the order of the paths is the same as REQUIRED_FILES
    genome_scores_file_path, genome_tags_file_path, links_file_path, movies_file_path, ratings_file_path,\
        tags_file_path = file_path

    self.process(ratings_file_path)

required_files()

Checks for the presence of all required decompressed data files.

Returns:

Type Description
list or None

A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/movielens/movielens20m.py
def required_files(self):
    """
    Checks for the presence of all required decompressed data files.

    Returns:
        (list or None): A list of paths to the required data files if they
            exist or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list of paths to the decompressed files if successful, otherwise None.

Source code in datarec/datasets/movielens/movielens20m.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list of paths to the decompressed files if successful,
            otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompress_zip_file(path, self._raw_folder)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded archive.

Source code in datarec/datasets/movielens/movielens20m.py
def download(self) -> str:
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    # download dataset file (compressed file)
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    download_file(self.url, file_path)

    return file_path

process(file_path)

Processes the raw ratings.csv file and loads it into the class.

This method reads the file, which includes a header row, and maps the columns to the standard user, item, rating, and timestamp fields.

Parameters:

Name Type Description Default
file_path str

The path to the raw ratings.csv file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/movielens/movielens20m.py
def process(self, file_path):
    """
    Processes the raw `ratings.csv` file and loads it into the class.

    This method reads the file, which includes a header row, and maps
    the columns to the standard user, item, rating, and timestamp fields.

    Args:
        file_path (str): The path to the raw `ratings.csv` file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',',
                           user_col='userId', item_col='movieId', rating_col='rating', timestamp_col='timestamp',
                           header=0)
    self.data = dataset

Tmall

Entry point for loading different versions of the Tmall dataset.

Tmall

Entry point class to load various versions of the Tmall dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

This dataset was released for the IJCAI-16 Contest and contains user interactions from the Tmall.com platform for a nearby store recommendation task.

Note: This dataset requires manual download from the official source.

The default version is 'latest', which currently corresponds to 'v1'.

Examples:

To load the latest version:

>>> data_loader = Tmall()
Source code in datarec/datasets/tmall/tmall.py
class Tmall:
    """
    Entry point class to load various versions of the Tmall dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    This dataset was released for the IJCAI-16 Contest and contains user
    interactions from the Tmall.com platform for a nearby store recommendation task.

    **Note:** This dataset requires manual download from the official source.

    The default version is 'latest', which currently corresponds to 'v1'.

    Examples:
        To load the latest version:
        >>> data_loader = Tmall()
    """
    latest_version = 'v1'

    def __new__(cls, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the Tmall dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the preparation and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                'v1' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (Tmall_v1): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'v1': Tmall_v1}
        if version == 'latest':
            version = cls.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Tmall dataset: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Tmall dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the preparation and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
Tmall_v1

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/tmall/tmall.py
def __new__(cls, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the Tmall dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the preparation and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            'v1' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (Tmall_v1): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'v1': Tmall_v1}
    if version == 'latest':
        version = cls.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Tmall dataset: Unsupported version")

Builder class for the v1 version of the Tmall IJCAI-16 dataset.

Tmall_v1

Bases: DataRec

Builder class for the Tmall dataset (IJCAI-16 Contest version).

This class handles the logic for preparing and loading the Tmall dataset. It is not typically instantiated directly but is called by the Tmall entry point class.

Note on usage: The Tmall dataset must be downloaded manually after registering on the Tianchi Aliyun website. This class will prompt the user with instructions to download the required zip file and place it in the correct cache directory before proceeding.

This loader processes the ijcai2016_taobao.csv file from the archive.

Attributes:

Name Type Description
website_url str

The official website where the dataset can be downloaded.

CHECKSUM str

The MD5 checksum to verify the integrity of the user-provided file.

Source code in datarec/datasets/tmall/tmall_v1.py
class Tmall_v1(DataRec):
    """
    Builder class for the Tmall dataset (IJCAI-16 Contest version).

    This class handles the logic for preparing and loading the Tmall dataset. It is
    not typically instantiated directly but is called by the `Tmall` entry point class.

    **Note on usage:** The Tmall dataset must be downloaded manually after
    registering on the Tianchi Aliyun website. This class will prompt the user
    with instructions to download the required zip file and place it in the
    correct cache directory before proceeding.

    This loader processes the `ijcai2016_taobao.csv` file from the archive.

    Attributes:
        website_url (str): The official website where the dataset can be downloaded.
        CHECKSUM (str): The MD5 checksum to verify the integrity of the user-provided file.
    """
    website_url = 'https://tianchi.aliyun.com/dataset/dataDetail?dataId=53'
    data_file_name = 'IJCAI16_data.zip'
    uncompressed_user_item_file_name = 'ijcai2016_taobao.csv'
    REQUIRED_FILES = [uncompressed_user_item_file_name]
    CHECKSUM = 'c4f4f0b8860984723652d2e91bcddc01'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up paths and checks for the required files. If the
        compressed archive is missing, it provides instructions for manual
        download. It then proceeds to decompress and process the data.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'tmall'
        self.version_name = 'v1'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path)

        print(f'found {file_path}')
        self.process(file_path[0])

    def required_files(self):
        """
        Checks for the presence of the required decompressed data file.

        Returns:
            (list or None): A list containing the path to the required data file if
                it exists or can be created by decompression. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.data_file_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the downloaded archive after verifying its checksum.

        Args:
            path (str): The file path of the compressed archive.

        Returns:
            (list or None): A list of paths to the decompressed files if successful,
                otherwise None.
        """
        verify_checksum(path, self.CHECKSUM)

        # decompress downloaded file
        decompress_zip_file(input_file=path, output_dir=self._raw_folder)
        files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> (str, str):
        """
        Guides the user to manually download the dataset archive.

        This method does not download automatically. Instead, it prints instructions
        for the user to visit the Tianchi Aliyun website, register, download the
        `IJCAI16_data.zip` file, and place it in the correct cache directory.

        Returns:
            (str): The local file path to the user-provided archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        print(f'\nThis version of Tmall dataset requires the user to manually download it.\n'
              f'Please, go to {self.website_url} on your browser, register, and click on the download button.\n'
              f'Then, move or copy \'{self.data_file_name}\' in the following directory:\n'
              f'\'{self._raw_folder}\'\n'
              f'Please, do not change the original file name and try again.')
        file_path = os.path.join(self._raw_folder, self.data_file_name)
        return file_path

    def process(self, file_path):
        """
        Processes the raw file and loads it.

        This method reads the CSV file, which has a header, and maps the specific
        column names (`use_ID`, `ite_ID`, `act_ID`, `time`) to the standard
        user, item, rating, and timestamp fields.

        Args:
            file_path (str): The path to the raw data file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """

        from datarec.io import read_tabular

        dataset = read_tabular(file_path, sep=',', user_col='use_ID', item_col='ite_ID', rating_col='act_ID', timestamp_col='time', header=0)
        self.data = dataset
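
Because the archive cannot be fetched automatically, a first run is typically a two-step workflow. The sketch below is illustrative and assumes Tmall is importable from datarec.datasets; the target directory is the one printed by the loader itself.

# Step 1: with an empty cache, the loader prints the manual-download
# instructions (website, expected file name, target raw-data directory)
# and then stops with an error because IJCAI16_data.zip is not yet present.
from datarec.datasets import Tmall   # import path assumed

tmall = Tmall(version='v1')

# Step 2: after registering on the Tianchi website and copying
# IJCAI16_data.zip into the printed directory, the same call verifies the
# MD5 checksum, extracts ijcai2016_taobao.csv, and loads the interactions.
tmall = Tmall(version='v1')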

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up paths and checks for the required files. If the compressed archive is missing, it provides instructions for manual download. It then proceeds to decompress and process the data.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/tmall/tmall_v1.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up paths and checks for the required files. If the
    compressed archive is missing, it provides instructions for manual
    download. It then proceeds to decompress and process the data.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'tmall'
    self.version_name = 'v1'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path)

    print(f'found {file_path}')
    self.process(file_path[0])

required_files()

Checks for the presence of the required decompressed data file.

Returns:

Type Description
list or None

A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None.

Source code in datarec/datasets/tmall/tmall_v1.py
def required_files(self):
    """
    Checks for the presence of the required decompressed data file.

    Returns:
        (list or None): A list containing the path to the required data file if
            it exists or can be created by decompression. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.data_file_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the downloaded archive after verifying its checksum.

Parameters:

Name Type Description Default
path str

The file path of the compressed archive.

required

Returns:

Type Description
list or None

A list of paths to the decompressed files if successful, otherwise None.

Source code in datarec/datasets/tmall/tmall_v1.py
def decompress(self, path):
    """
    Decompresses the downloaded archive after verifying its checksum.

    Args:
        path (str): The file path of the compressed archive.

    Returns:
        (list or None): A list of paths to the decompressed files if successful,
            otherwise None.
    """
    verify_checksum(path, self.CHECKSUM)

    # decompress downloaded file
    decompress_zip_file(input_file=path, output_dir=self._raw_folder)
    files = [os.path.join(self._raw_folder, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None
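
verify_checksum is not reproduced on this page; the sketch below shows the kind of MD5 comparison it is assumed to perform (names and behaviour are illustrative, not the library's exact implementation):

import hashlib

def md5_matches(path, expected_md5, chunk_size=8192):
    """Return True if the file's MD5 digest equals the expected checksum."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5

# Illustrative check against the class constant:
# md5_matches('/path/to/IJCAI16_data.zip', Tmall_v1.CHECKSUM)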

download()

Guides the user to manually download the dataset archive.

This method does not download automatically. Instead, it prints instructions for the user to visit the Tianchi Aliyun website, register, download the IJCAI16_data.zip file, and place it in the correct cache directory.

Returns:

Type Description
str

The local file path to the user-provided archive.

Source code in datarec/datasets/tmall/tmall_v1.py
def download(self) -> (str, str):
    """
    Guides the user to manually download the dataset archive.

    This method does not download automatically. Instead, it prints instructions
    for the user to visit the Tianchi Aliyun website, register, download the
    `IJCAI16_data.zip` file, and place it in the correct cache directory.

    Returns:
        (str): The local file path to the user-provided archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    print(f'\nThis version of Tmall dataset requires the user to manually download it.\n'
          f'Please, go to {self.website_url} on your browser, register, and click on the download button.\n'
          f'Then, move or copy \'{self.data_file_name}\' in the following directory:\n'
          f'\'{self._raw_folder}\'\n'
          f'Please, do not change the original file name and try again.')
    file_path = os.path.join(self._raw_folder, self.data_file_name)
    return file_path
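
After following the printed instructions, a quick check (illustrative, not part of the library) confirms that the archive sits where the loader expects it before retrying:

import os

raw_folder = '...'   # copy the directory printed by the loader here
archive = os.path.join(raw_folder, 'IJCAI16_data.zip')
print('archive present:', os.path.exists(archive))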

process(file_path)

Processes the raw file and loads it.

This method reads the CSV file, which has a header, and maps the specific column names (use_ID, ite_ID, act_ID, time) to the standard user, item, rating, and timestamp fields.

Parameters:

Name Type Description Default
file_path str

The path to the raw data file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/tmall/tmall_v1.py
def process(self, file_path):
    """
    Processes the raw file and loads it.

    This method reads the CSV file, which has a header, and maps the specific
    column names (`use_ID`, `ite_ID`, `act_ID`, `time`) to the standard
    user, item, rating, and timestamp fields.

    Args:
        file_path (str): The path to the raw data file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """

    from datarec.io import read_tabular

    dataset = read_tabular(file_path, sep=',', user_col='use_ID', item_col='ite_ID', rating_col='act_ID', timestamp_col='time', header=0)
    self.data = dataset
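
For readers unfamiliar with the raw file, the column mapping performed by read_tabular above is roughly equivalent to the following pandas sketch (illustrative only: the canonical field names and the surrounding DataRec structures built by the library are not shown here):

import pandas as pd

# The extracted CSV has a header row; only four of its columns are used.
file_path = 'ijcai2016_taobao.csv'   # path to the extracted raw file
raw = pd.read_csv(file_path, sep=',', header=0)

interactions = raw[['use_ID', 'ite_ID', 'act_ID', 'time']].rename(columns={
    'use_ID': 'user',        # user identifier
    'ite_ID': 'item',        # item identifier
    'act_ID': 'rating',      # action type, used as the rating signal
    'time': 'timestamp',     # interaction time
})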

Yelp

Entry point for loading different versions of the Yelp dataset.

Yelp

Entry point class to load various versions of the Yelp Dataset.

This class provides a single, convenient interface for accessing the dataset. Based on the version parameter, it selects and returns the appropriate dataset builder.

The default version is 'latest', which currently corresponds to 'v1'.

Examples:

To load the latest version:

>>> data_loader = Yelp()

To load a specific version:

>>> data_loader = Yelp(version='v1')
Source code in datarec/datasets/yelp/yelp.py
class Yelp:
    """
    Entry point class to load various versions of the Yelp Dataset.

    This class provides a single, convenient interface for accessing the dataset.
    Based on the `version` parameter, it selects and returns the appropriate
    dataset builder.

    The default version is 'latest', which currently corresponds to 'v1'.

    Examples:
        To load the latest version:
        >>> data_loader = Yelp()

        To load a specific version:
        >>> data_loader = Yelp(version='v1')
    """
    latest_version = 'v1'

    def __new__(self, version: str = 'latest', **kwargs):
        """
        Initializes and returns the specified version of the Yelp dataset builder.

        This method acts as a dispatcher, instantiating the correct builder class
        that handles the downloading, caching, and loading for a specific dataset version.

        Args:
            version (str): The version of the dataset to load. Currently, only
                'v1' and 'latest' are supported. Defaults to 'latest'.
            **kwargs: Additional keyword arguments (not currently used for this dataset).

        Returns:
            (Yelp_v1): An instance of the dataset builder class, populated with data.

        Raises:
            ValueError: If an unsupported version string is provided.
        """

        versions = {'v1': Yelp_v1}
        if version == 'latest':
            version = self.latest_version
        if version in versions:
            return versions[version]()
        else:
            raise ValueError("Yelp dataset: Unsupported version")

__new__(version='latest', **kwargs)

Initializes and returns the specified version of the Yelp dataset builder.

This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.

Parameters:

Name Type Description Default
version str

The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'.

'latest'
**kwargs

Additional keyword arguments (not currently used for this dataset).

{}

Returns:

Type Description
Yelp_v1

An instance of the dataset builder class, populated with data.

Raises:

Type Description
ValueError

If an unsupported version string is provided.

Source code in datarec/datasets/yelp/yelp.py
def __new__(self, version: str = 'latest', **kwargs):
    """
    Initializes and returns the specified version of the Yelp dataset builder.

    This method acts as a dispatcher, instantiating the correct builder class
    that handles the downloading, caching, and loading for a specific dataset version.

    Args:
        version (str): The version of the dataset to load. Currently, only
            'v1' and 'latest' are supported. Defaults to 'latest'.
        **kwargs: Additional keyword arguments (not currently used for this dataset).

    Returns:
        (Yelp_v1): An instance of the dataset builder class, populated with data.

    Raises:
        ValueError: If an unsupported version string is provided.
    """

    versions = {'v1': Yelp_v1}
    if version == 'latest':
        version = self.latest_version
    if version in versions:
        return versions[version]()
    else:
        raise ValueError("Yelp dataset: Unsupported version")

Builder class for the v1 version of the Yelp Dataset.

Yelp_v1

Bases: DataRec

Builder class for the Yelp Dataset.

This class handles the logic for downloading, preparing, and loading the Yelp dataset. It is not typically instantiated directly but is called by the Yelp entry point class.

The download and preparation process is multi-step:

1. A .zip archive is downloaded.
2. The zip is extracted, revealing a .tar archive.
3. The tar is extracted, revealing several .json files.

This loader specifically processes yelp_academic_dataset_review.json to extract user-business interactions (ratings).

Attributes:

Name Type Description
url str

The URL from which the raw dataset is downloaded.

CHECKSUM_ZIP str

MD5 checksum for the initial downloaded .zip file.

CHECKSUM_TAR str

MD5 checksum for the intermediate .tar file.

Source code in datarec/datasets/yelp/yelp_v1.py
class Yelp_v1(DataRec):
    """
    Builder class for the Yelp Dataset.

    This class handles the logic for downloading, preparing, and loading the
    Yelp dataset. It is not typically instantiated directly but is called by
    the `Yelp` entry point class.

    The download and preparation process is multi-step:
    1. A `.zip` archive is downloaded.
    2. The zip is extracted, revealing a `.tar` archive.
    3. The tar is extracted, revealing several `.json` files.
    This loader specifically processes `yelp_academic_dataset_review.json` to extract
    user-business interactions (ratings).

    Attributes:
        url (str): The URL from which the raw dataset is downloaded.
        CHECKSUM_ZIP (str): MD5 checksum for the initial downloaded .zip file.
        CHECKSUM_TAR (str): MD5 checksum for the intermediate .tar file.
    """
    website_url = 'https://www.yelp.com/dataset'
    url = 'https://business.yelp.com/external-assets/files/Yelp-JSON.zip'
    data_file_name = 'Yelp-JSON.zip'
    data_tar_file_name = 'yelp_dataset.tar'
    subdirectory_name = 'Yelp JSON'  # once extracted the zip file
    uncompressed_business_file_name = 'yelp_academic_dataset_business.json'
    uncompressed_checkin_file_name = 'yelp_academic_dataset_checkin.json'
    uncompressed_review_file_name = 'yelp_academic_dataset_review.json'
    uncompressed_tip_file_name = 'yelp_academic_dataset_tip.json'
    uncompressed_user_file_name = 'yelp_academic_dataset_user.json'
    REQUIRED_FILES = [uncompressed_business_file_name,
                      uncompressed_checkin_file_name,
                      uncompressed_review_file_name,
                      uncompressed_tip_file_name,
                      uncompressed_user_file_name]
    CHECKSUM_ZIP = 'b0c36fe2d00a52d8de44fa3b2513c9d2'
    CHECKSUM_TAR = '0bc8cc1481ccbbd140d2aba2909a928a'

    def __init__(self, folder=None):
        """
        Initializes the builder and orchestrates the data preparation workflow.

        This constructor sets up the necessary paths and automatically triggers
        the download, verification, and processing steps if the data is not
        already present in the specified cache directory.

        Args:
            folder (str, optional): A custom directory to store the dataset files.
                If None, a default user cache directory is used. Defaults to None.
        """
        super().__init__(None)

        self.dataset_name = 'yelp'
        self.version_name = 'v1'

        self._data_folder = folder if folder \
            else dataset_directory(self.dataset_name)
        self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
            else dataset_raw_directory(self.dataset_name)

        self.return_type = None

        # check if the required files have been already downloaded
        file_path = self.required_files()

        if file_path is None:
            file_path = self.download()
            file_path = self.decompress(file_path) ## all files

        business_file_path, checkin_file_path, review_file_path, tip_file_path, user_file_path = file_path

        print(f'found {file_path}')
        self.process(review_file_path)

    def required_files(self):
        """
        Checks for the presence of the final required decompressed JSON files.

        Returns:
            (list or None): A list of paths to the required data files if they
                exist. Otherwise, returns None.
        """
        # compressed data file
        file_path = os.path.join(self._raw_folder, self.subdirectory_name)

        # check if the file is there
        paths = [os.path.join(self._raw_folder, self.subdirectory_name, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(p) for p in paths]):
            return paths
        # check if the compressed file is there
        elif os.path.exists(file_path):
            return self.decompress(file_path)
        else:
            return None

    def decompress(self, path):
        """
        Decompresses the dataset via a two-step process (zip then tar).

        Args:
            path (str): The file path of the initial downloaded .zip archive.

        Returns:
            (list or None): A list of paths to the final decompressed files if
                successful, otherwise None.
        """
        verify_checksum(path, self.CHECKSUM_ZIP)
        decompress_zip_file(path, self._raw_folder)


        tar_file_path = os.path.join(self._raw_folder, self.subdirectory_name, self.data_tar_file_name)
        verify_checksum(tar_file_path, self.CHECKSUM_TAR)

        decompress_tar_file(tar_file_path, os.path.join(self._raw_folder, self.subdirectory_name))
        files = [os.path.join(self._raw_folder, self.subdirectory_name, f) for f in self.REQUIRED_FILES]
        if all([os.path.exists(f) for f in files]):
            return [os.path.join(self._raw_folder, f) for f in files]
        return None

    def download(self) -> (str, str):
        """
        Downloads the raw dataset compressed archive.

        Returns:
            (str): The local file path to the downloaded .zip archive.
        """
        if not os.path.exists(self._raw_folder):
            os.makedirs(self._raw_folder)
            print('Created folder \'{}\''.format(self._raw_folder))

        file_name = os.path.basename(self.url)
        file_path = os.path.join(self._raw_folder, file_name)
        if not os.path.exists(file_path):
            download_browser(self.url, file_path)
        return file_path

    def process(self, path):
        """
        Processes the raw file and loads it into the class.

        This method reads the JSON file line by line. It extracts the user ID,
        business ID (as the item), star rating, and date. The date strings are
        then converted to Unix timestamps.

        Args:
            path (str): The path to the raw file.

        Returns:
            (None): This method assigns the processed data to `self.data` directly.
        """
        from datarec.io import read_json

        user_field = 'user_id'
        item_field = 'business_id'
        rating_field = 'stars'
        date_field = 'date'  # format: YYYY-MM-DD , e.g.: 2016-03-09
        dataset = read_json(path, user_field=user_field, item_field=item_field, rating_field=rating_field, timestamp_field=date_field)
        timestamps = pd.Series(dataset.data[date_field].apply(lambda x: x.timestamp()).values,
                               index=dataset.data.index, dtype='float64')
        dataset.data.loc[:, date_field] = timestamps
        self.data = dataset

__init__(folder=None)

Initializes the builder and orchestrates the data preparation workflow.

This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.

Parameters:

Name Type Description Default
folder str

A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None.

None
Source code in datarec/datasets/yelp/yelp_v1.py
def __init__(self, folder=None):
    """
    Initializes the builder and orchestrates the data preparation workflow.

    This constructor sets up the necessary paths and automatically triggers
    the download, verification, and processing steps if the data is not
    already present in the specified cache directory.

    Args:
        folder (str, optional): A custom directory to store the dataset files.
            If None, a default user cache directory is used. Defaults to None.
    """
    super().__init__(None)

    self.dataset_name = 'yelp'
    self.version_name = 'v1'

    self._data_folder = folder if folder \
        else dataset_directory(self.dataset_name)
    self._raw_folder = os.path.abspath(os.path.join(self._data_folder, RAW_DATA_FOLDER)) if folder \
        else dataset_raw_directory(self.dataset_name)

    self.return_type = None

    # check if the required files have been already downloaded
    file_path = self.required_files()

    if file_path is None:
        file_path = self.download()
        file_path = self.decompress(file_path) ## all files

    business_file_path, checkin_file_path, review_file_path, tip_file_path, user_file_path = file_path

    print(f'found {file_path}')
    self.process(review_file_path)

required_files()

Checks for the presence of the final required decompressed JSON files.

Returns:

Type Description
list or None

A list of paths to the required data files if they exist. Otherwise, returns None.

Source code in datarec/datasets/yelp/yelp_v1.py
def required_files(self):
    """
    Checks for the presence of the final required decompressed JSON files.

    Returns:
        (list or None): A list of paths to the required data files if they
            exist. Otherwise, returns None.
    """
    # compressed data file
    file_path = os.path.join(self._raw_folder, self.subdirectory_name)

    # check if the file is there
    paths = [os.path.join(self._raw_folder, self.subdirectory_name, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(p) for p in paths]):
        return paths
    # check if the compressed file is there
    elif os.path.exists(file_path):
        return self.decompress(file_path)
    else:
        return None

decompress(path)

Decompresses the dataset via a two-step process (zip then tar).

Parameters:

Name Type Description Default
path str

The file path of the initial downloaded .zip archive.

required

Returns:

Type Description
list or None

A list of paths to the final decompressed files if successful, otherwise None.

Source code in datarec/datasets/yelp/yelp_v1.py
def decompress(self, path):
    """
    Decompresses the dataset via a two-step process (zip then tar).

    Args:
        path (str): The file path of the initial downloaded .zip archive.

    Returns:
        (list or None): A list of paths to the final decompressed files if
            successful, otherwise None.
    """
    verify_checksum(path, self.CHECKSUM_ZIP)
    decompress_zip_file(path, self._raw_folder)


    tar_file_path = os.path.join(self._raw_folder, self.subdirectory_name, self.data_tar_file_name)
    verify_checksum(tar_file_path, self.CHECKSUM_TAR)

    decompress_tar_file(tar_file_path, os.path.join(self._raw_folder, self.subdirectory_name))
    files = [os.path.join(self._raw_folder, self.subdirectory_name, f) for f in self.REQUIRED_FILES]
    if all([os.path.exists(f) for f in files]):
        return [os.path.join(self._raw_folder, f) for f in files]
    return None
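
For context, the two-step extraction performed by decompress_zip_file and decompress_tar_file can be approximated with the standard library as follows (illustrative sketch, not the library's implementation; the paths mirror the class attributes listed above):

import os
import tarfile
import zipfile

raw_folder = 'raw'                                   # stand-in for self._raw_folder
zip_path = os.path.join(raw_folder, 'Yelp-JSON.zip')

# Step 1: extract the downloaded zip; it contains a 'Yelp JSON' subdirectory
# holding the intermediate tar archive.
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(raw_folder)

# Step 2: extract the tar archive in place to obtain the JSON files.
tar_path = os.path.join(raw_folder, 'Yelp JSON', 'yelp_dataset.tar')
with tarfile.open(tar_path) as archive:
    archive.extractall(os.path.join(raw_folder, 'Yelp JSON'))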

download()

Downloads the raw dataset compressed archive.

Returns:

Type Description
str

The local file path to the downloaded .zip archive.

Source code in datarec/datasets/yelp/yelp_v1.py
def download(self) -> (str, str):
    """
    Downloads the raw dataset compressed archive.

    Returns:
        (str): The local file path to the downloaded .zip archive.
    """
    if not os.path.exists(self._raw_folder):
        os.makedirs(self._raw_folder)
        print('Created folder \'{}\''.format(self._raw_folder))

    file_name = os.path.basename(self.url)
    file_path = os.path.join(self._raw_folder, file_name)
    if not os.path.exists(file_path):
        download_browser(self.url, file_path)
    return file_path

process(path)

Processes the raw file and loads it into the class.

This method reads the JSON file line by line. It extracts the user ID, business ID (as the item), star rating, and date. The date strings are then converted to Unix timestamps.

Parameters:

Name Type Description Default
path str

The path to the raw file.

required

Returns:

Type Description
None

This method assigns the processed data to self.data directly.

Source code in datarec/datasets/yelp/yelp_v1.py
def process(self, path):
    """
    Processes the raw file and loads it into the class.

    This method reads the JSON file line by line. It extracts the user ID,
    business ID (as the item), star rating, and date. The date strings are
    then converted to Unix timestamps.

    Args:
        path (str): The path to the raw file.

    Returns:
        (None): This method assigns the processed data to `self.data` directly.
    """
    from datarec.io import read_json

    user_field = 'user_id'
    item_field = 'business_id'
    rating_field = 'stars'
    date_field = 'date'  # format: YYYY-MM-DD , e.g.: 2016-03-09
    dataset = read_json(path, user_field=user_field, item_field=item_field, rating_field=rating_field, timestamp_field=date_field)
    timestamps = pd.Series(dataset.data[date_field].apply(lambda x: x.timestamp()).values,
                           index=dataset.data.index, dtype='float64')
    dataset.data.loc[:, date_field] = timestamps
    self.data = dataset
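
In isolation, the date-to-timestamp conversion at the end of process is a small pandas operation; a self-contained sketch with made-up rows (the field names mirror the ones above):

import pandas as pd

reviews = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'business_id': ['b1', 'b2'],
    'stars': [5.0, 3.0],
    'date': pd.to_datetime(['2016-03-09', '2018-07-21']),
})

# Convert datetimes to Unix timestamps (seconds since the epoch) as float64,
# mirroring the assignment on dataset.data above.
reviews['date'] = reviews['date'].apply(lambda x: x.timestamp()).astype('float64')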