Datasets Reference
This section provides a detailed API reference for all modules related to built-in datasets in the datarec library.
Download Utilities
Provides utility functions for downloading and decompressing dataset files.
This module contains a set of helper functions used internally by the dataset builder classes (e.g., `MovieLens1M`, `Yelp_v1`) to handle the fetching of raw data from web sources and the extraction of various archive formats such as .zip, .gz, .tar, and .7z.
These functions are not typically called directly by the end-user but are fundamental to the automatic data preparation process of the library.
download_url(url, local_filepath)
Downloads a file from a URL and saves it to a local path.
Note: This is a basic downloader. For large files or more robust handling, `download_file` is generally preferred within this library.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL of the file to download. | required |
| `local_filepath` | `str` | The local path where the file will be saved. | required |

Raises:

| Type | Description |
|---|---|
| `HTTPError` | If the HTTP request returned an unsuccessful status code. |
Source code in datarec/datasets/download.py
download_file(url, local_filepath, size=None)
Downloads a file by streaming its content, with a progress bar.
This is the primary download function used for most datasets. It streams the response, making it suitable for large files. It attempts to infer the file size from response headers for the progress bar, but an expected size can also be provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL of the file to download. | required |
| `local_filepath` | `str` | The local path where the file will be saved. | required |
| `size` | `int` | The expected file size in bytes. Used for the progress bar if the 'Content-Length' header is not available. Defaults to None. | `None` |

Returns:

| Type | Description |
|---|---|
| `str` | The local file path if the download was successful, otherwise None. |
Source code in datarec/datasets/download.py
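The streaming technique can be sketched with the standard library alone. The snippet below is a simplified, hypothetical stand-in for `download_file` (it omits the progress bar and size inference), not the library's actual implementation:

```python
import shutil
import urllib.request

def stream_download(url, local_filepath, chunk_size=8192):
    """Stream a URL to disk in fixed-size chunks, returning the local path."""
    with urllib.request.urlopen(url) as response, open(local_filepath, "wb") as out:
        # copyfileobj reads chunk_size bytes at a time, so arbitrarily
        # large files never need to fit in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return local_filepath
```

Streaming is the key design choice here: the whole response body is never materialized in memory, which is what makes the function suitable for multi-gigabyte dataset archives.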
download_browser(url, local_filepath, headers=None, chunk_size=8192)
Downloads a file by mimicking a web browser request.
This function is used for sources that may block simple scripted requests. It includes a default 'User-Agent' header to appear as a standard browser, which is necessary for some datasets (e.g., Yelp).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL of the file to download. | required |
| `local_filepath` | `str` | The local path where the file will be saved. | required |
| `headers` | `dict` | Custom headers to use for the request. If None, a default browser User-Agent is used. Defaults to None. | `None` |
| `chunk_size` | `int` | The size of chunks to download in bytes. Defaults to 8192. | `8192` |

Returns:

| Type | Description |
|---|---|
| `str` | The local file path if the download was successful, otherwise None. |
Source code in datarec/datasets/download.py
decompress_gz(input_file, output_file)
Decompresses a .gz file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `str` | The path to the input .gz file. | required |
| `output_file` | `str` | The path where the decompressed file will be saved. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed output file. |
Source code in datarec/datasets/download.py
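A minimal equivalent can be built on the standard library's gzip module. This is a sketch of the technique, not necessarily identical to the library's implementation:

```python
import gzip
import shutil

def gunzip(input_file, output_file):
    """Decompress a .gz file to output_file and return its path."""
    with gzip.open(input_file, "rb") as src, open(output_file, "wb") as dst:
        # Stream the decompressed bytes so large archives stay memory-safe.
        shutil.copyfileobj(src, dst)
    return output_file
```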
decompress_tar_file(input_file, output_dir)
Decompresses a .tar archive.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `str` | The path to the input .tar file. | required |
| `output_dir` | `str` | The directory where the contents will be extracted. | required |

Returns:

| Type | Description |
|---|---|
| `list` | A list of the names of the extracted files and directories. |
Source code in datarec/datasets/download.py
decompress_zip_file(input_file, output_dir, allowZip64=False)
Decompresses a .zip archive.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `str` | The path to the input .zip file. | required |
| `output_dir` | `str` | The directory where the contents will be extracted. | required |
| `allowZip64` | `bool` | Whether to allow the Zip64 extension (for archives larger than 2 GB). Defaults to False, but should be True for large files. | `False` |

Returns:

| Type | Description |
|---|---|
| `list` | A list of the names of the extracted files and directories. |
Source code in datarec/datasets/download.py
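A minimal sketch with the standard zipfile module, mirroring the `allowZip64` flag (the library's own helper may differ in details):

```python
import zipfile

def unzip(input_file, output_dir, allowZip64=False):
    """Extract a .zip archive and return the names of its members."""
    # allowZip64=True enables the Zip64 extension, which is required
    # for archives (or members) larger than 2 GB.
    with zipfile.ZipFile(input_file, "r", allowZip64=allowZip64) as zf:
        zf.extractall(output_dir)
        return zf.namelist()
```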
decompress_7z_file(input_file, output_dir)
Decompresses a .7z archive.
This function is used for datasets distributed in the 7-Zip format, such as the Alibaba-iFashion dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `str` | The path to the input .7z file. | required |
| `output_dir` | `str` | The directory where the contents will be extracted. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the output directory. |
Source code in datarec/datasets/download.py
Alibaba-iFashion
Entry point for loading different versions of the Alibaba-iFashion dataset.
AlibabaIFashion
Entry point class to load various versions of the Alibaba-iFashion dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
The default version is 'latest', which currently corresponds to 'v1'.
Examples:
To load the latest version:
To load a specific version:
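The two calls above can be sketched as follows. The import path is an assumption, and instantiation triggers the download-and-prepare workflow on first use:

```python
from datarec.datasets import AlibabaIFashion  # assumed import path

data = AlibabaIFashion()                 # latest version (currently 'v1')
data_v1 = AlibabaIFashion(version='v1')  # pin a specific version
```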
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Alibaba-iFashion dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `AlibabaIFashion_V1` | An instance of the dataset builder class, ready to be used. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion.py
Builder class for version 'v1' of the Alibaba-iFashion dataset.
AlibabaIFashion_V1
Bases: DataRec
Builder class for the Alibaba-iFashion dataset (KDD 2019 version).
This class handles the logic for downloading, preparing, and loading the Alibaba-iFashion dataset. It is not typically instantiated directly but is called by the `AlibabaIFashion` entry point class.
The dataset was released for the paper "POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion". It contains user-item interactions, item metadata, and outfit compositions. This loader focuses on processing the user-item interaction data from `user_data.txt`.
Attributes:

| Name | Type | Description |
|---|---|---|
| `item_data_url` | `str` | The URL for the item metadata file. |
| `outfit_data_url` | `str` | The URL for the outfit composition file. |
| `user_data_url` | `str` | The URL for the user-item interaction file. |
| `CHECKSUM_ITEM` | `str` | MD5 checksum for the compressed item data archive. |
| `CHECKSUM_USER` | `str` | MD5 checksum for the compressed user data archive. |
| `CHECKSUM_OUTFIT` | `str` | MD5 checksum for the compressed outfit data archive. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
required_files()
Checks for the presence of the required decompressed data files.
Returns:

| Type | Description |
|---|---|
| `tuple[list, list]` | A tuple where the first element is a list of found files and the second is a list of missing files. Each item in the lists is a tuple of (path, filename). |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
download_item_data()
Downloads, verifies, and decompresses the item data file.
Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed item data file. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
download_outfit_data()
Downloads, verifies, and decompresses the outfit data file.
Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed outfit data file. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
download_user_data()
Downloads, verifies, and decompresses the user interaction data file.
Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed user data file. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
download(found, missing)
Downloads all missing files for the dataset.
Iterates through the list of missing files and calls the appropriate download helper function for each one.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `found` | `list` | A list of file tuples that were already found locally. | required |
| `missing` | `list` | A list of file tuples that need to be downloaded. | required |

Returns:

| Type | Description |
|---|---|
| `tuple[list, list]` | The updated lists of found and missing files after the download and verification process. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
process(data_path)
Processes the raw user interaction data and loads it into the class.
The user interaction data is in an 'inline' format, where each line contains a user followed by a semicolon-separated list of their item interactions. This method uses `read_inline` to parse this format into a standard user-item pair DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_path` | `str` | The path to the raw `user_data.txt` file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/alibaba_ifashion/alibaba_ifashion_v1.py
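As an illustration of the 'inline' layout, a toy parser might look like the following. The exact delimiters are assumptions (here `user,item1;item2;...`); the real `read_inline` helper may use a different layout:

```python
def parse_inline(lines):
    """Flatten 'user,item1;item2;...' lines into (user, item) pairs.

    The comma/semicolon delimiters are an assumption for illustration.
    """
    pairs = []
    for line in lines:
        user, _, items = line.strip().partition(',')
        # One output row per (user, item) interaction.
        pairs.extend((user, item) for item in items.split(';') if item)
    return pairs
```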
Amazon Beauty
Entry point for loading different versions of the Amazon Beauty dataset.
AmazonBeauty
Entry point class to load various versions of the Amazon Beauty dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
The Amazon Beauty dataset contains product reviews and metadata from Amazon, specialized for the "Beauty and Personal Care" category.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
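The two calls above can be sketched as follows (the import path is an assumption; loading triggers a download on first use):

```python
from datarec.datasets import AmazonBeauty  # assumed import path

data = AmazonBeauty()                      # latest (currently the '2023' version)
data_2023 = AmazonBeauty(version='2023')   # pin a specific version
```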
Source code in datarec/datasets/amazon_beauty/amz_beauty.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Amazon Beauty dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Currently, only '2023' and 'latest' are supported. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `AMZ_Beauty_2023` | An instance of the dataset builder class, populated with data. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_beauty/amz_beauty.py
Builder class for the 2023 version of the Amazon Beauty dataset.
AMZ_Beauty_2023
Bases: DataRec
Builder class for the Amazon Beauty dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the Amazon Beauty dataset. It is not typically instantiated directly but is called by the `AmazonBeauty` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and contains user ratings for beauty products.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:

| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
decompress(path)
Decompresses the downloaded .gz archive after verifying its checksum.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed .gz archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed CSV file. |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:

| Type | Description |
|---|---|
| `str` | The local file path to the downloaded .gz archive. |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_beauty/amz_beauty_2023.py
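For illustration, loading a header-bearing ratings CSV into a DataFrame looks roughly like this. The column names shown are an assumption, not the dataset's documented schema:

```python
import io
import pandas as pd

# Stand-in for the decompressed 2023 ratings file, which ships with a header.
raw = io.StringIO(
    "user_id,item_id,rating,timestamp\n"
    "u1,i1,5.0,1609459200\n"
    "u2,i1,3.0,1609459300\n"
)
df = pd.read_csv(raw)  # the header row is detected automatically
```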
Amazon Books
Entry point for loading different versions of the Amazon Books dataset.
AmazonBooks
Entry point class to load various versions of the Amazon Books dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.
The Amazon Books dataset contains product reviews and metadata from Amazon for the "Books" category.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
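The two calls above can be sketched as follows (the import path is an assumption; loading triggers a download on first use):

```python
from datarec.datasets import AmazonBooks  # assumed import path

data = AmazonBooks()                      # latest (currently the '2023' version)
data_2018 = AmazonBooks(version='2018')   # pin a specific version
```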
Source code in datarec/datasets/amazon_books/amz_books.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Amazon Books dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataRec` | An instance of the appropriate dataset builder class (e.g., `AMZ_Books_2018` or `AMZ_Books_2023`). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_books/amz_books.py
Builder class for the 2018 version of the Amazon Books dataset.
AMZ_Books_2018
Bases: DataRec
Builder class for the Amazon Books dataset (2018 version).
This class handles the logic for downloading, preparing, and loading the 2018 version of the Amazon Books dataset from the Amazon Reviews V2 source. It is not typically instantiated directly but is called by the `AmazonBooks` entry point class.
The raw data is provided as a single, uncompressed CSV file.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
required_files()
Checks for the presence of the required data file.
Returns:

| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists, otherwise returns None. |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
decompress(path)
Handles the decompression step.
For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the source data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the data file. |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
download()
Downloads the raw dataset file.
Returns:

| Type | Description |
|---|---|
| `str` | The local file path to the downloaded file. |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the raw file, which does not contain a header row. Columns are identified by their integer index. The data is then assigned to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the processed data file. |
Source code in datarec/datasets/amazon_books/amz_books_2018.py
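Reading a headerless file with integer-indexed columns can be sketched as follows. The column order shown is an assumption for illustration, not the dataset's documented layout:

```python
import io
import pandas as pd

# 2018-style export: no header row, so pandas assigns integer column labels.
raw = io.StringIO("i1,u1,5.0,1406073600\ni2,u1,4.0,1406073601\n")
df = pd.read_csv(raw, header=None)
# Columns are addressed by position (0, 1, 2, 3) until given names explicitly.
df = df.rename(columns={0: "item", 1: "user", 2: "rating", 3: "timestamp"})
```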
Builder class for the 2023 version of the Amazon Books dataset.
AMZ_Books_2023
Bases: DataRec
Builder class for the Amazon Books dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the Amazon Books dataset. It is not typically instantiated directly but is called by the `AmazonBooks` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and contains user ratings for books.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:

| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed file. |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:

| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file, which includes a header, into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_books/amz_books_2023.py
Amazon Clothing
Entry point for loading different versions of the Amazon Clothing dataset.
AmazonClothing
Entry point class to load various versions of the Amazon Clothing dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.
The dataset contains product reviews and metadata for the category "Clothing, Shoes and Jewelry" from Amazon.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
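The two calls above can be sketched as follows (the import path is an assumption; loading triggers a download on first use):

```python
from datarec.datasets import AmazonClothing  # assumed import path

data = AmazonClothing()                      # latest (currently the '2023' version)
data_2018 = AmazonClothing(version='2018')   # pin a specific version
```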
Source code in datarec/datasets/amazon_clothing/amz_clothing.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Amazon Clothing dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataRec` | An instance of the appropriate dataset builder class (e.g., `AmazonClothing_2018` or `AmazonClothing_2023`). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_clothing/amz_clothing.py
Builder class for the 2018 version of the Amazon Clothing dataset.
AmazonClothing_2018
Bases: DataRec
Builder class for the Amazon Clothing dataset (2018 version).
This class handles the logic for downloading, preparing, and loading the 2018 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically instantiated directly but is called by the `AmazonClothing` entry point class.
The raw data is provided as a single, uncompressed CSV file without a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
required_files()
Checks for the presence of the required data file.
Returns:

| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists, otherwise returns None. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
decompress(path)
Handles the decompression step.
For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the source data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the data file. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
download()
Downloads the raw dataset file.
Returns:

| Type | Description |
|---|---|
| `str` | The local file path to the downloaded file. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2018.py
Builder class for the 2023 version of the Amazon Clothing dataset.
AmazonClothing_2023
Bases: DataRec
Builder class for the Amazon Clothing dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the "Clothing, Shoes and Jewelry" dataset. It is not typically instantiated directly but is called by the `AmazonClothing` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | `None` |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:

| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed file. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:

| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_clothing/amz_clothing_2023.py
Amazon Sports and Outdoors
Entry point for loading different versions of the Amazon Sports and Outdoors dataset.
AmazonSportsOutdoors
Entry point class to load various versions of the Amazon Sports and Outdoors dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.
The dataset contains product reviews and metadata for the category "Sports and Outdoors" from Amazon.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
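A hedged sketch of both calls. The class name comes from this page; the `datarec.datasets` import path is inferred from the source layout shown here and may differ in practice.

```python
# Hypothetical usage sketch; the import path is an assumption inferred
# from the source paths shown on this page.
def load_sports_outdoors(version: str = "latest"):
    from datarec.datasets import AmazonSportsOutdoors  # deferred import
    return AmazonSportsOutdoors(version=version)

# data = load_sports_outdoors()            # 'latest' resolves to the 2023 version
# data_2018 = load_sports_outdoors("2018") # a specific version
```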
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataRec` | An instance of the appropriate dataset builder class (e.g., `AMZ_SportsOutdoors_2023`). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports.py
Builder class for the 2018 version of the Amazon Sports and Outdoors dataset.
AMZ_SportsOutdoors_2018
Bases: DataRec
Builder class for the Amazon Sports and Outdoors dataset (2018 version).
This class handles the logic for downloading, preparing, and loading the 2018 version of the "Sports and Outdoors" dataset. It is not typically instantiated directly but is called by the `AmazonSportsOutdoors` entry point class.
The raw data is provided as a single, uncompressed CSV file without a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
required_files()
Checks for the presence of the required data file.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists, otherwise returns None. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
decompress(path)
Handles the decompression step.
For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the source data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the data file. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
download()
Downloads the raw dataset file.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded file. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the `self.data` attribute after checksum verification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2018.py
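The headerless read described above can be sketched with pandas. The column names and their order below are assumptions for illustration, not the dataset's documented schema.

```python
import io

import pandas as pd

# Sketch: read a headerless ratings file, then name the integer-indexed
# columns. The column order below is an assumption for illustration.
raw = io.StringIO("u1,B001,5.0,1514764800\nu2,B002,3.0,1514851200\n")
df = pd.read_csv(raw, header=None)
df.columns = ["user", "item", "rating", "timestamp"]
```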
Builder class for the 2023 version of the Amazon Sports and Outdoors dataset.
AMZ_SportsOutdoors_2023
Bases: DataRec
Builder class for the Amazon Sports and Outdoors dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the "Sports and Outdoors" dataset. It is not typically instantiated directly but is called by the `AmazonSportsOutdoors` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed file. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_sports_and_outdoors/amz_sports_2023.py
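A minimal sketch of reading such a compressed, headered CSV with pandas. The column names in the sample payload are illustrative, not the dataset's documented schema.

```python
import gzip
import io

import pandas as pd

# Sketch: the 2023 files are compressed CSVs that include a header row.
# The column names in this sample payload are illustrative.
payload = gzip.compress(b"user_id,item_id,rating,timestamp\nu1,B001,4.0,1700000000\n")
df = pd.read_csv(io.BytesIO(payload), compression="gzip")
```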
Amazon Toys and Games
Entry point for loading different versions of the Amazon Toys and Games dataset.
AmazonToysGames
Entry point class to load various versions of the Amazon Toys and Games dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.
The dataset contains product reviews and metadata for the category "Toys and Games" from Amazon.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
Source code in datarec/datasets/amazon_toys_and_games/amz_toys.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataRec` | An instance of the appropriate dataset builder class (e.g., `AMZ_ToysGames_2023`). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys.py
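The dispatch logic can be illustrated with a stripped-down `__new__` sketch; the stand-in builder classes below are not datarec's real ones.

```python
# Stripped-down sketch of the version-dispatch pattern used by the entry
# point classes; Builder2018 and Builder2023 stand in for the real builders.
class Builder2018:
    pass

class Builder2023:
    pass

class ToysGames:
    _VERSIONS = {"2018": Builder2018, "2023": Builder2023, "latest": Builder2023}

    def __new__(cls, version="latest", **kwargs):
        try:
            builder_cls = cls._VERSIONS[version]
        except KeyError:
            raise ValueError(f"Unsupported version: {version!r}") from None
        # Returning an instance of a different class means the dispatcher's
        # own __init__ is never run; the caller gets the builder directly.
        return builder_cls()
```

Because `__new__` returns an object that is not an instance of the entry point class itself, Python skips the entry point's `__init__` and the caller receives the builder directly.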
Builder class for the 2018 version of the Amazon Toys and Games dataset.
AMZ_ToysGames_2018
Bases: DataRec
Builder class for the Amazon Toys and Games dataset (2018 version).
This class handles the logic for downloading, preparing, and loading the 2018 version of the "Toys and Games" dataset. It is not typically instantiated directly but is called by the `AmazonToysGames` entry point class.
The raw data is provided as a single, uncompressed CSV file without a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
required_files()
Checks for the presence of the required data file.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists, otherwise returns None. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
decompress(path)
Handles the decompression step.
For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the source data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the data file. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
download()
Downloads the raw dataset file.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded file. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the `self.data` attribute after checksum verification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2018.py
Builder class for the 2023 version of the Amazon Toys and Games dataset.
AMZ_ToysGames_2023
Bases: DataRec
Builder class for the Amazon Toys and Games dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the "Toys and Games" dataset. It is not typically instantiated directly but is called by the `AmazonToysGames` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed file. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_toys_and_games/amz_toys_2023.py
Amazon Video Games
Entry point for loading different versions of the Amazon Video Games dataset.
AmazonVideoGames
Entry point class to load various versions of the Amazon Video Games dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder for either the 2018 or 2023 version.
The dataset contains product reviews and metadata for the "Video Games" category from Amazon.
The default version is 'latest', which currently corresponds to the '2023' version.
Examples:
To load the latest version:
To load a specific version:
Source code in datarec/datasets/amazon_videogames/amz_videogames.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Supported versions include '2023', '2018', and 'latest'. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataRec` | An instance of the appropriate dataset builder class (e.g., `AMZ_VideoGames_2023`). |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/amazon_videogames/amz_videogames.py
Builder class for the 2018 version of the Amazon Video Games dataset.
AMZ_VideoGames_2018
Bases: DataRec
Builder class for the Amazon Video Games dataset (2018 version).
This class handles the logic for downloading, preparing, and loading the 2018 version of the "Video Games" dataset. It is not typically instantiated directly but is called by the `AmazonVideoGames` entry point class.
The raw data is provided as a single, uncompressed CSV file without a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
required_files()
Checks for the presence of the required data file.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists, otherwise returns None. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
decompress(path)
Handles the decompression step.
For this 2018 version, the source file is already decompressed, so this method simply returns the path to the file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the source data file. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the data file. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
download()
Downloads the raw dataset file.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded file. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the raw file, which has no header. Columns are identified by their integer index. The data is then assigned to the `self.data` attribute after checksum verification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2018.py
Builder class for the 2023 version of the Amazon Video Games dataset.
AMZ_VideoGames_2023
Bases: DataRec
Builder class for the Amazon Video Games dataset (2023 version).
This class handles the logic for downloading, preparing, and loading the 2023 version of the "Video Games" dataset. It is not typically instantiated directly but is called by the `AmazonVideoGames` entry point class.
The dataset is from the "Bridging Language and Items for Retrieval and Recommendation" paper and is provided as a compressed CSV file with a header.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:
| Type | Description |
|---|---|
| `str` or `None` | The path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `str` | The path to the decompressed file. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
download()
Downloads the raw dataset compressed archive.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
process(file_path)
Processes the raw data and loads it into the class.
This method reads the decompressed file, which includes a header row, into a pandas DataFrame and assigns it to the `self.data` attribute.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | The path to the raw data file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/amazon_videogames/amz_videogames_2023.py
CiaoDVD
Entry point for loading different versions of the CiaoDVD dataset.
Ciao
Entry point class to load various versions of the CiaoDVD dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
CiaoDVD is a dataset for DVD recommendations, also containing social trust data. This loader focuses on the movie ratings.
The default version is 'latest', which currently corresponds to 'v1'.
Examples:
To load the latest version:
To load a specific version:
Source code in datarec/datasets/ciao/ciao.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the CiaoDVD dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `Ciao_V1` | An instance of the dataset builder class, populated with data. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/ciao/ciao.py
Builder class for the v1 version of the CiaoDVD dataset.
Ciao_V1
Bases: DataRec
Builder class for the CiaoDVD dataset.
This class handles the logic for downloading, preparing, and loading the CiaoDVD dataset from the LibRec repository. It is not typically instantiated directly but is called by the `Ciao` entry point class.
The dataset was introduced in the paper "ETAF: An Extended Trust Antecedents Framework for Trust Prediction". The archive contains multiple files; this loader specifically processes `movie-ratings.txt`.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
| `REQUIRED_FILES` | `list` | A list of files expected within the decompressed archive. |
Source code in datarec/datasets/ciao/ciao_v1.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `folder` | `str` | A custom directory to store the dataset files. If None, a default user cache directory is used. | `None` |
Source code in datarec/datasets/ciao/ciao_v1.py
required_files()
Checks for the presence of the required decompressed data files.
It first looks for the final, uncompressed files. If not found, it looks for the compressed archive and decompresses it.
Returns:
| Type | Description |
|---|---|
| `list` or `None` | A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/ciao/ciao_v1.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The file path of the compressed archive. | required |

Returns:

| Type | Description |
|---|---|
| `list` or `None` | A list of paths to the decompressed files if successful, otherwise None. |
Source code in datarec/datasets/ciao/ciao_v1.py
download()
Downloads the raw dataset compressed archive.
Returns:
| Type | Description |
|---|---|
| `str` | The local file path to the downloaded archive. |
Source code in datarec/datasets/ciao/ciao_v1.py
process(path)
Processes the raw `movie-ratings.txt` data and loads it into the class.
This method reads the file, which has no header. It also parses the date strings in 'YYYY-MM-DD' format and converts them to Unix timestamps.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the raw `movie-ratings.txt` file. | required |

Returns:

| Type | Description |
|---|---|
| `None` | This method assigns the processed data to the `self.data` attribute. |
Source code in datarec/datasets/ciao/ciao_v1.py
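The date conversion described above amounts to the following sketch; treating each date as midnight UTC is an assumption for illustration.

```python
from datetime import datetime, timezone

# Convert a 'YYYY-MM-DD' date string to a Unix timestamp. Interpreting
# the date as midnight UTC is an assumption for illustration.
def to_unix_timestamp(date_str: str) -> int:
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())
```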
Epinions
Entry point for loading different versions of the Epinions dataset.
Epinions
Entry point class to load various versions of the Epinions dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
Epinions is a who-trust-whom online social network from a general consumer review site. Members of the site can decide whether to "trust" each other.
The default version is 'latest', which currently corresponds to 'v1'.
Examples:
To load the latest version:
To load a specific version:
Source code in datarec/datasets/epinions/epinions.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Epinions dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str` | The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'. | `'latest'` |
| `**kwargs` | | Additional keyword arguments (not currently used for this dataset). | `{}` |

Returns:

| Type | Description |
|---|---|
| `Epinions_V1` | An instance of the dataset builder class, populated with data. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unsupported version string is provided. |
Source code in datarec/datasets/epinions/epinions.py
Builder class for the v1 version of the Epinions dataset.
Epinions_V1
Bases: DataRec
Builder class for the Epinions dataset.
This class handles the logic for downloading, preparing, and loading the Epinions social network dataset from the Stanford SNAP repository. It is not typically instantiated directly but is called by the `Epinions` entry point class.
The dataset was introduced in the paper "Trust Management for the Semantic Web". It represents a directed graph of trust relationships.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | `str` | The URL from which the raw dataset is downloaded. |
| `CHECKSUM` | `str` | The MD5 checksum to verify the integrity of the downloaded file. |
| `REQUIRED_FILES` | `list` | A list of files expected after decompression. |
Source code in datarec/datasets/epinions/epinions_v1.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/epinions/epinions_v1.py
required_files()
Checks for the presence of the required decompressed data file.
It first looks for the final, uncompressed file. If not found, it looks for the compressed archive and decompresses it.
Returns:

Type | Description |
---|---|
list or None | A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/epinions/epinions_v1.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |

Returns:

Type | Description |
---|---|
list or None | A list containing the path to the decompressed file if successful, otherwise None. |
Source code in datarec/datasets/epinions/epinions_v1.py
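The verify-then-decompress pattern used by these decompress methods can be sketched with the standard library (a hedged illustration, not the library's implementation; verify_and_decompress is a hypothetical helper):

```python
import gzip
import hashlib
import shutil

def verify_and_decompress(path, expected_md5, out_path):
    """Illustrative sketch: compute the MD5 of a downloaded .gz archive
    and extract it only when the digest matches the expected checksum."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # Hash in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        return None  # corrupted or incomplete download
    with gzip.open(path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return [out_path]
```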
download()
Downloads the raw dataset compressed archive.
Returns:

Type | Description |
---|---|
str | The local file path to the downloaded .gz archive. |
Source code in datarec/datasets/epinions/epinions_v1.py
process(file_path)
Processes the raw trust network data and loads it into the class.
This method reads the file, skipping the header comment lines. The first column is treated as the 'user' (truster) and the second column as the 'item' (trustee).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw data file. | required |

Returns:

Type | Description |
---|---|
None | This method assigns the processed data to the class instance in place. |
Source code in datarec/datasets/epinions/epinions_v1.py
Gowalla
Entry point for loading different versions of the Gowalla dataset.
Gowalla
Entry point class to load various types of data from the Gowalla dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the version parameter, it selects and returns the appropriate dataset builder for either the user check-ins or the social friendships graph.
Gowalla was a location-based social network. Two types of data are available: - 'checkins': User interactions with locations (suitable for recommendation). - 'friendships': The user-user social network graph.
The default version is 'latest', which currently corresponds to 'checkins'.
Examples:
To load the user check-in data (default):
To load the social friendship graph:
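Both calls might look like the following (a sketch: the import path is assumed from the source file locations shown in this reference, and the data is downloaded to the cache on first use):

```python
# Sketch (assumed import path): load either Gowalla data type.
from datarec.datasets import Gowalla

checkins = Gowalla()                          # 'latest' maps to 'checkins'
friendships = Gowalla(version='friendships')  # user-user social graph
```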
Source code in datarec/datasets/gowalla/gowalla.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Gowalla dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific data type.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
version | str | The type of data to load. Supported versions include 'checkins', 'friendships', and 'latest'. Defaults to 'latest'. | 'latest' |
**kwargs | | Additional keyword arguments (not currently used). | {} |

Returns:

Type | Description |
---|---|
GowallaCheckins or GowallaFriendships | An instance of the appropriate dataset builder class, populated with data. |

Raises:

Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/gowalla/gowalla.py
Builder class for the Gowalla check-ins dataset.
GowallaCheckins
Bases: DataRec
Builder class for the Gowalla check-ins dataset.
This class handles the logic for downloading, preparing, and loading the user check-in data from the Stanford SNAP repository. It is not typically instantiated directly but is called by the Gowalla entry point class when version='checkins'.
The dataset contains user check-ins at various locations, representing user-item interactions suitable for recommendation tasks. It was introduced in the paper "Friendship and mobility: user movement in location-based social networks".
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
required_files()
Checks for the presence of the required decompressed data file.
Returns:

Type | Description |
---|---|
list or None | A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |

Returns:

Type | Description |
---|---|
list or None | A list containing the path to the decompressed file if successful, otherwise None. |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
download()
Downloads the raw dataset compressed archive.
Returns:

Type | Description |
---|---|
str | The local file path to the downloaded archive. |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
process(file_path)
Processes the raw check-in data and loads it into the class.
This method reads the tab-separated file, which has no header. It maps column 0 to 'user_id', column 4 to 'item_id' (location ID), and column 1 to 'timestamp'.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw check-in data file. | required |

Returns:

Type | Description |
---|---|
None | This method assigns the processed data to the class instance in place. |
Source code in datarec/datasets/gowalla/gowalla_checkins.py
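The column mapping performed by process can be sketched with the standard csv module (the rows below are illustrative; the column meanings follow the description above, and the latitude/longitude values are made up):

```python
import csv
import io

# Two illustrative rows in the documented layout (tab-separated, no header):
# column 0 = user, column 1 = check-in time, columns 2-3 = latitude/longitude,
# column 4 = location id.
sample = (
    "0\t2010-10-19T23:55:27Z\t30.23\t-97.79\t22847\n"
    "0\t2010-10-18T22:17:43Z\t30.26\t-97.74\t420315\n"
)

rows = []
for cols in csv.reader(io.StringIO(sample), delimiter="\t"):
    rows.append({
        "user_id": cols[0],    # column 0 -> user
        "item_id": cols[4],    # column 4 -> location (item)
        "timestamp": cols[1],  # column 1 -> check-in time
    })
```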
Builder class for the Gowalla friendships (social network) dataset.
GowallaFriendships
Bases: DataRec
Builder class for the Gowalla friendships dataset.
This class handles the logic for downloading, preparing, and loading the user-user social network graph from the Stanford SNAP repository. It is not typically instantiated directly but is called by the Gowalla entry point class when version='friendships'.
The dataset contains the social friendship network of Gowalla users. Each row represents a directed edge from one user to another.
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
required_files()
Checks for the presence of the required decompressed data file.
Returns:

Type | Description |
---|---|
list or None | A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |

Returns:

Type | Description |
---|---|
list or None | A list containing the path to the decompressed file if successful, otherwise None. |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
download()
Downloads the raw dataset compressed archive.
Returns:

Type | Description |
---|---|
str | The local file path to the downloaded archive. |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
process(file_path)
Processes the raw friendship data and loads it into the class.
This method reads the file, which has no header. Each row represents a user-user link. To fit the DataRec structure, the first user column is mapped to 'user_id' and the second to 'item_id'.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw friendship data file. | required |

Returns:

Type | Description |
---|---|
None | This method assigns the processed data to the class instance in place. |
Source code in datarec/datasets/gowalla/gowalla_friendships.py
Last.fm
Entry point for loading different versions of the Last.fm dataset.
LastFM
Entry point class to load various versions of the Last.fm dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the version parameter, it selects and returns the appropriate dataset builder.
This dataset contains social networking, tagging, and music artist listening information from the Last.fm online music system. This loader focuses on the user-artist listening data. It was released during the HetRec 2011 workshop.
The default version is 'latest', which currently corresponds to '2011'.
Examples:
To load the latest version:
To load a specific version:
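These calls might look like the following (a sketch: the import path is assumed from the source file locations shown in this reference, and the data is downloaded to the cache on first use):

```python
# Sketch (assumed import path): load the HetRec 2011 listening data.
from datarec.datasets import LastFM

data = LastFM()                # 'latest' maps to '2011'
data = LastFM(version='2011')  # equivalent, with the version made explicit
```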
Source code in datarec/datasets/lastfm/lastfm.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Last.fm dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
version | str | The version of the dataset to load. Currently, only '2011' and 'latest' are supported. Defaults to 'latest'. | 'latest' |
**kwargs | | Additional keyword arguments (not currently used for this dataset). | {} |

Returns:

Type | Description |
---|---|
LastFM2011 | An instance of the dataset builder class, populated with data. |

Raises:

Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/lastfm/lastfm.py
Builder class for the 2011 version of the Last.fm dataset (HetRec 2011).
LastFM2011
Bases: DataRec
Builder class for the Last.fm dataset (HetRec 2011 version).
This class handles the logic for downloading, preparing, and loading the Last.fm dataset provided for the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011).
The full archive contains multiple files (user-friends, tags, etc.), but this loader specifically processes the user_artists.dat file, which contains the artists listened to by each user and a corresponding listening count (weight).
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
REQUIRED_FILES | list | A list of all files expected within the decompressed archive. |
Source code in datarec/datasets/lastfm/lastfm_2011.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/lastfm/lastfm_2011.py
required_files()
Checks for the presence of all required decompressed data files.
Returns:

Type | Description |
---|---|
list or None | A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/lastfm/lastfm_2011.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |

Returns:

Type | Description |
---|---|
list or None | A list of paths to the decompressed files if successful, otherwise None. |
Source code in datarec/datasets/lastfm/lastfm_2011.py
download()
Downloads the raw dataset compressed archive.
Returns:

Type | Description |
---|---|
str | The local file path to the downloaded archive. |
Source code in datarec/datasets/lastfm/lastfm_2011.py
process(file_path)
Processes the raw user_artists.dat file and loads it into the class.
This method reads the tab-separated file, which includes a header. It maps the 'userID', 'artistID', and 'weight' columns to the standard user, item, and rating columns, respectively. Note that timestamp information is not available in this specific file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw user_artists.dat file. | required |

Returns:

Type | Description |
---|---|
None | This method assigns the processed data to the class instance in place. |
Source code in datarec/datasets/lastfm/lastfm_2011.py
MIND (Microsoft News Dataset)
Entry point for loading different versions of the MIND dataset.
Mind
Entry point class to load various versions of the MIND dataset.
This class provides a single, convenient interface for accessing the Microsoft News Dataset (MIND). Based on the version parameter, it selects and returns the appropriate dataset builder for either the 'small' or 'large' version.
MIND is a large-scale dataset for news recommendation research. It contains user click histories on a news website.
Note: This dataset requires manual download from the official source.
The default version is 'latest', which currently corresponds to the 'large' version.
Examples:
To load the training split of the large version (default):
To load the validation split of the small version:
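These calls might look like the following (a sketch: the import path is assumed from the source file locations shown in this reference; MIND must be downloaded manually before either call can complete):

```python
# Sketch (assumed import path): MIND requires a manual download first.
from datarec.datasets import Mind

large_train = Mind(split='train')                      # 'latest' maps to 'large'
small_val = Mind(version='small', split='validation')
```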
Source code in datarec/datasets/mind/mind.py
__new__(version='latest', split='train', **kwargs)
Initializes and returns the specified version of the MIND dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the preparation and loading for a specific dataset version and split.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
version | str | The version of the dataset to load. Supported versions include 'large', 'small', and 'latest'. Defaults to 'latest'. | 'latest' |
split | str | The data split to load. For 'large', options are 'train', 'validation', 'test'. For 'small', options are 'train', 'validation'. Defaults to 'train'. | 'train' |
**kwargs | | Additional keyword arguments (not currently used). | {} |

Returns:

Type | Description |
---|---|
MindLarge or MindSmall | An instance of the dataset builder class, populated with data from the specified split. |

Raises:

Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/mind/mind.py
Builder class for the large version of the MIND dataset.
MindLarge
Bases: DataRec
Builder class for the large version of the MIND dataset.
This class handles the logic for preparing and loading the MINDlarge dataset. It is not typically instantiated directly but is called by the Mind entry point class.
Note on usage: The MIND dataset must be downloaded manually. This class will prompt the user to download the required zip files and place them in the correct cache directory before proceeding with decompression and processing.
The dataset is pre-split into train, validation, and test sets, which can be loaded individually.
Attributes:

Name | Type | Description |
---|---|---|
source | str | The official website for the dataset. |
REQUIRED | dict | A dictionary detailing the filenames and checksums for each data split (train, validation, test). |
Source code in datarec/datasets/mind/mindLarge.py
__init__(folder=None, split='train')
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up paths and checks for the required files. If the compressed archives are missing, it provides instructions for manual download. It then proceeds to decompress and process the specified split.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
split | str | The data split to load, one of 'train', 'validation', or 'test'. Defaults to 'train'. | 'train' |
Source code in datarec/datasets/mind/mindLarge.py
required_files()
Checks for the presence of the required compressed and decompressed files.
Returns:
Type | Description |
---|---|
tuple[list, list]
|
A tuple where the first element is a list of found splits and the second is a list of missing splits. |
Source code in datarec/datasets/mind/mindLarge.py
decompress(file_type, path)
Decompresses the specified zip archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_type | str | The split type ('train', 'validation', 'test'). | required |
path | str | The file path of the compressed .zip archive. | required |

Returns:

Type | Description |
---|---|
list | A list of paths to the decompressed files. |

Raises:

Type | Description |
---|---|
FileNotFoundError | If decompression fails to produce the expected files. |
Source code in datarec/datasets/mind/mindLarge.py
download(file_type)
Guides the user to manually download the dataset archive.
This method does not download automatically. Instead, it prints instructions and waits for the user to place the required file in the cache directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
file_type | str | The split type ('train', 'validation', 'test') to download. | required |

Returns:

Type | Description |
---|---|
str | The local file path to the user-provided archive. |

Raises:

Type | Description |
---|---|
FileNotFoundError | If the file is not found after the user confirms download. |
Source code in datarec/datasets/mind/mindLarge.py
process_split(split)
Processes a single split from the decompressed files.
This method reads the behaviors.tsv file for a given split, which is in an 'inline' format, and parses it.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
split | str | The data split to process ('train', 'validation', or 'test'). | required |

Returns:

Type | Description |
---|---|
RawData | A RawData object containing the user-item interactions. |

Raises:

Type | Description |
---|---|
ValueError | If an invalid split name is provided. |
Source code in datarec/datasets/mind/mindLarge.py
process(split)
Loads the processed data for the specified split into the class.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
split | str | The data split to load ('train', 'validation', or 'test'). | required |

Returns:

Type | Description |
---|---|
None | This method assigns the processed data to the class instance in place. |

Raises:

Type | Description |
---|---|
ValueError | If an invalid split name is provided. |
Source code in datarec/datasets/mind/mindLarge.py
Builder class for the small version of the MIND dataset.
MindSmall
Bases: MindLarge
Builder class for the small version of the MIND dataset.
This class handles the logic for preparing and loading the MINDsmall dataset. It inherits most of its functionality from the MindLarge class but overrides the required file configurations for the smaller version.
MINDsmall is a smaller version of the MIND dataset, suitable for rapid prototyping. It contains only the train and validation splits.
Note on usage: Like the large version, this dataset requires manual download.
Attributes:

Name | Type | Description |
---|---|---|
REQUIRED | dict | A dictionary detailing the filenames and checksums for each data split (train, validation). |
SPLITS | tuple | The available splits for this version. |
Source code in datarec/datasets/mind/mindSmall.py
__init__(folder=None, split='train')
Initializes the builder for the MINDsmall dataset.
This constructor calls the parent MindLarge constructor but will use the overridden REQUIRED, SPLITS, and VERSION attributes specific to the small version of the dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
split | str | The data split to load, one of 'train' or 'validation'. Defaults to 'train'. | 'train' |
Source code in datarec/datasets/mind/mindSmall.py
MovieLens
Entry point for loading different versions of the MovieLens dataset.
MovieLens
Entry point class to load various versions of the MovieLens dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the version parameter, it selects and returns the appropriate dataset builder.
The MovieLens datasets are a collection of movie ratings data collected by the GroupLens Research project at the University of Minnesota.
The default version is 'latest', which currently corresponds to the '1m' version.
Examples:
To load the latest version (1M):
To load a specific version (e.g., 100k):
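These calls might look like the following (a sketch: the import path is assumed from the source file locations shown in this reference; note that, unlike the other entry points, MovieLens returns a builder that must be loaded explicitly):

```python
# Sketch (assumed import path): MovieLens returns a builder, not a DataRec.
from datarec.datasets import MovieLens

builder = MovieLens()              # 'latest' maps to '1m'
data = builder.prepare_and_load()  # download, decompress, and load

data_100k = MovieLens(version='100k').prepare_and_load()
```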
Source code in datarec/datasets/movielens/movielens.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the MovieLens dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Note: The returned object is a builder. You must call .prepare_and_load() on it to get a populated DataRec object.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
version | str | The version of the dataset to load. Supported versions include '1m', '20m', '100k', and 'latest'. Defaults to 'latest'. | 'latest' |
**kwargs | | Additional keyword arguments (not currently used). | {} |

Returns:

Type | Description |
---|---|
BaseDataRecBuilder | An instance of the appropriate dataset builder class (e.g., MovieLens1M). |

Raises:

Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/movielens/movielens.py
Builder class for the MovieLens 100k dataset.
MovieLens100k
Bases: BaseDataRecBuilder
Builder class for the MovieLens 100k dataset.
This dataset contains 100,000 ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.
The raw data is provided in a tab-separated file (u.data).
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
REQUIRED_FILES | list | A list of file paths expected after decompression. |
Source code in datarec/datasets/movielens/movielens100k.py
__init__(folder=None)
Initializes the builder.
This constructor sets up the necessary paths for caching the dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/movielens/movielens100k.py
prepare()
Ensures all required raw files are downloaded and decompressed.
This method checks for the existence of the required files. If they are not found, it triggers the download and decompression process.
Source code in datarec/datasets/movielens/movielens100k.py
load()
Loads the prepared u.data file into a DataRec object.
Returns:

Type | Description |
---|---|
DataRec | A DataRec object containing the user-item interactions. |
Source code in datarec/datasets/movielens/movielens100k.py
required_files()
Check whether the required dataset files exist.
Returns:

Type | Description |
---|---|
list[str] | Paths to required files if they exist, or None. |
Source code in datarec/datasets/movielens/movielens100k.py
decompress(path)
Decompress the downloaded zip file and verify required files.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the zip file. | required |

Returns:

Type | Description |
---|---|
list[str] | Paths to the extracted files if successful, or None. |
Source code in datarec/datasets/movielens/movielens100k.py
download()
Download the raw dataset zip file to the raw folder.
Returns:

Type | Description |
---|---|
str | Path to the downloaded zip file. |
Source code in datarec/datasets/movielens/movielens100k.py
MovieLens1M
Bases: BaseDataRecBuilder
Builder class for the MovieLens 1M dataset.
This dataset contains 1 million ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.
The raw ratings data is provided in ratings.dat with a :: separator.
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
REQUIRED_FILES | list | A list of file paths expected after decompression. |
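Given the :: separator described above, a single line of ratings.dat can be parsed as follows (a minimal sketch with a hypothetical parse_rating helper; the sample line is illustrative):

```python
# Each line of ratings.dat has the form UserID::MovieID::Rating::Timestamp.
def parse_rating(line):
    user, item, rating, timestamp = line.strip().split("::")
    return {
        "user_id": user,
        "item_id": item,
        "rating": float(rating),
        "timestamp": int(timestamp),
    }

record = parse_rating("1::1193::5::978300760")
```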
Source code in datarec/datasets/movielens/movielens1m.py
__init__(folder=None)
Initializes the builder.
This constructor sets up the necessary paths for caching the dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/movielens/movielens1m.py
download()
Downloads the raw dataset archive file.
Returns:

Type | Description |
---|---|
str | The local file path to the downloaded zip file. |
Source code in datarec/datasets/movielens/movielens1m.py
prepare()
Ensures all required raw files are downloaded and decompressed.
This method checks for the existence of the required files. If they are not found, it triggers the download and decompression process.
Source code in datarec/datasets/movielens/movielens1m.py
load()
Loads the prepared ratings.dat file into a DataRec object.
Returns:

Type | Description |
---|---|
DataRec | A DataRec object containing the user-item interactions. |
Source code in datarec/datasets/movielens/movielens1m.py
Builder class for the MovieLens 20M dataset.
MovieLens20M
Bases: DataRec
Builder class for the MovieLens 20M dataset.
This dataset contains 20 million ratings. It is not typically instantiated directly but is called by the MovieLens entry point class.
This loader specifically processes the ratings.csv file from the archive.
Attributes:

Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the downloaded file. |
REQUIRED_FILES | list | A list of all files expected after decompression. |
Source code in datarec/datasets/movielens/movielens20m.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/movielens/movielens20m.py
required_files()
Checks for the presence of all required decompressed data files.
Returns:

Type | Description |
---|---|
list or None | A list of paths to the required data files if they exist or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/movielens/movielens20m.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |

Returns:

Type | Description |
---|---|
list or None | A list of paths to the decompressed files if successful, otherwise None. |
Source code in datarec/datasets/movielens/movielens20m.py
download()
Downloads the raw dataset compressed archive.
Returns:
Type | Description |
---|---|
str | The local file path to the downloaded archive. |
Source code in datarec/datasets/movielens/movielens20m.py
process(file_path)
Processes the raw file and loads it into the class.
This method reads the file, which includes a header row, and maps the columns to the standard user, item, rating, and timestamp fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw file. | required |
Returns:
Type | Description |
---|---|
None | This method assigns the processed data to |
Source code in datarec/datasets/movielens/movielens20m.py
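The header-to-standard-field mapping described above can be sketched with pandas. The raw header names (`userId`, `movieId`, `rating`, `timestamp`) are those of the public ML-20M `ratings.csv`; the standardized target names and the helper itself are illustrative, not datarec's actual constants.

```python
import pandas as pd

# Raw ML-20M header names mapped to standard field names
# (the target names below are illustrative).
COLUMN_MAP = {"userId": "user", "movieId": "item",
              "rating": "rating", "timestamp": "timestamp"}


def process_ratings(file_path_or_buffer):
    """Read the raw ratings file (with header) and rename its columns."""
    df = pd.read_csv(file_path_or_buffer)
    return df.rename(columns=COLUMN_MAP)[list(COLUMN_MAP.values())]
```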
Tmall
Entry point for loading different versions of the Tmall dataset.
Tmall
Entry point class to load various versions of the Tmall dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
This dataset was released for the IJCAI-16 Contest and contains user interactions from the Tmall.com platform for a nearby store recommendation task.
Note: This dataset requires manual download from the official source.
The default version is 'latest', which currently corresponds to 'v1'.
Examples:
To load the latest version:
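A minimal sketch of what such a call might look like (the import path is an assumption; the archive itself must still be downloaded manually, as noted above):

```python
from datarec.datasets import Tmall  # import path assumed

# Dispatches to the Tmall_v1 builder; 'latest' currently maps to 'v1'.
data = Tmall(version="latest")
```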
Source code in datarec/datasets/tmall/tmall.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Tmall dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the preparation and loading for a specific dataset version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version | str | The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'. | 'latest' |
**kwargs | | Additional keyword arguments (not currently used for this dataset). | {} |
Returns:
Type | Description |
---|---|
Tmall_v1 | An instance of the dataset builder class, populated with data. |
Raises:
Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/tmall/tmall.py
Builder class for the v1 version of the Tmall IJCAI-16 dataset.
Tmall_v1
Bases: DataRec
Builder class for the Tmall dataset (IJCAI-16 Contest version).
This class handles the logic for preparing and loading the Tmall dataset. It is not typically instantiated directly but is called by the `Tmall` entry point class.
Note on usage: The Tmall dataset must be downloaded manually after registering on the Tianchi Aliyun website. This class will prompt the user with instructions to download the required zip file and place it in the correct cache directory before proceeding.
This loader processes the `ijcai2016_taobao.csv` file from the archive.
Attributes:
Name | Type | Description |
---|---|---|
website_url | str | The official website where the dataset can be downloaded. |
CHECKSUM | str | The MD5 checksum to verify the integrity of the user-provided file. |
Source code in datarec/datasets/tmall/tmall_v1.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up paths and checks for the required files. If the compressed archive is missing, it provides instructions for manual download. It then proceeds to decompress and process the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/tmall/tmall_v1.py
required_files()
Checks for the presence of the required decompressed data file.
Returns:
Type | Description |
---|---|
list or None | A list containing the path to the required data file if it exists or can be created by decompression. Otherwise, returns None. |
Source code in datarec/datasets/tmall/tmall_v1.py
decompress(path)
Decompresses the downloaded archive after verifying its checksum.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the compressed archive. | required |
Returns:
Type | Description |
---|---|
list or None | A list of paths to the decompressed files if successful, otherwise None. |
Source code in datarec/datasets/tmall/tmall_v1.py
download()
Guides the user to manually download the dataset archive.
This method does not download automatically. Instead, it prints instructions for the user to visit the Tianchi Aliyun website, register, download the `IJCAI16_data.zip` file, and place it in the correct cache directory.
Returns:
Type | Description |
---|---|
str | The local file path to the user-provided archive. |
Source code in datarec/datasets/tmall/tmall_v1.py
process(file_path)
Processes the raw file and loads it.
This method reads the CSV file, which has a header, and maps the specific column names (`use_ID`, `ite_ID`, `act_ID`, `time`) to the standard user, item, rating, and timestamp fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the raw data file. | required |
Returns:
Type | Description |
---|---|
None | This method assigns the processed data to |
Source code in datarec/datasets/tmall/tmall_v1.py
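The column mapping for this file can be sketched with the standard library's `csv` module. The raw names (`use_ID`, `ite_ID`, `act_ID`, `time`) come from the docstring above; the standardized target names and the helper are illustrative, not datarec's actual implementation.

```python
import csv

# Raw Tmall column names (per the docstring) mapped to standard fields
# (the target names below are illustrative).
FIELD_MAP = {"use_ID": "user", "ite_ID": "item",
             "act_ID": "rating", "time": "timestamp"}


def process_tmall(fileobj):
    """Yield interaction records with standardized field names."""
    for row in csv.DictReader(fileobj):
        yield {std: row[raw] for raw, std in FIELD_MAP.items()}
```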
Yelp
Entry point for loading different versions of the Yelp dataset.
Yelp
Entry point class to load various versions of the Yelp Dataset.
This class provides a single, convenient interface for accessing the dataset.
Based on the `version` parameter, it selects and returns the appropriate dataset builder.
The default version is 'latest', which currently corresponds to 'v1'.
Examples:
To load the latest version:
To load a specific version:
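Minimal sketches of both calls (the import path is an assumption):

```python
from datarec.datasets import Yelp  # import path assumed

data = Yelp(version="latest")  # 'latest' currently maps to 'v1'
data_v1 = Yelp(version="v1")   # pin a specific version explicitly
```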
Source code in datarec/datasets/yelp/yelp.py
__new__(version='latest', **kwargs)
Initializes and returns the specified version of the Yelp dataset builder.
This method acts as a dispatcher, instantiating the correct builder class that handles the downloading, caching, and loading for a specific dataset version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version | str | The version of the dataset to load. Currently, only 'v1' and 'latest' are supported. Defaults to 'latest'. | 'latest' |
**kwargs | | Additional keyword arguments (not currently used for this dataset). | {} |
Returns:
Type | Description |
---|---|
Yelp_v1 | An instance of the dataset builder class, populated with data. |
Raises:
Type | Description |
---|---|
ValueError | If an unsupported version string is provided. |
Source code in datarec/datasets/yelp/yelp.py
Builder class for the v1 version of the Yelp Dataset.
Yelp_v1
Bases: DataRec
Builder class for the Yelp Dataset.
This class handles the logic for downloading, preparing, and loading the Yelp dataset. It is not typically instantiated directly but is called by the `Yelp` entry point class.
The download and preparation process is multi-step:
1. A `.zip` archive is downloaded.
2. The zip is extracted, revealing a `.tar` archive.
3. The tar is extracted, revealing several `.json` files.

This loader specifically processes `yelp_academic_dataset_review.json` to extract user-business interactions (ratings).
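The two-step extraction can be sketched with the standard library's `zipfile` and `tarfile` modules. The helper name and return convention are illustrative, not datarec's actual internals.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_zip_then_tar(zip_path, out_dir):
    """Extract a .zip archive, then extract any .tar found inside it.

    Returns the paths of the regular files extracted from the inner tar.
    """
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    extracted = []
    for tar_path in out_dir.glob("*.tar"):
        with tarfile.open(tar_path) as tf:
            tf.extractall(out_dir)
            extracted += [out_dir / m.name for m in tf.getmembers() if m.isfile()]
    return extracted
```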
Attributes:
Name | Type | Description |
---|---|---|
url | str | The URL from which the raw dataset is downloaded. |
CHECKSUM_ZIP | str | MD5 checksum for the initial downloaded .zip file. |
CHECKSUM_TAR | str | MD5 checksum for the intermediate .tar file. |
Source code in datarec/datasets/yelp/yelp_v1.py
__init__(folder=None)
Initializes the builder and orchestrates the data preparation workflow.
This constructor sets up the necessary paths and automatically triggers the download, verification, and processing steps if the data is not already present in the specified cache directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder | str | A custom directory to store the dataset files. If None, a default user cache directory is used. Defaults to None. | None |
Source code in datarec/datasets/yelp/yelp_v1.py
required_files()
Checks for the presence of the final required decompressed JSON files.
Returns:
Type | Description |
---|---|
list or None | A list of paths to the required data files if they exist. Otherwise, returns None. |
Source code in datarec/datasets/yelp/yelp_v1.py
decompress(path)
Decompresses the dataset via a two-step process (zip then tar).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The file path of the initial downloaded .zip archive. | required |
Returns:
Type | Description |
---|---|
list or None | A list of paths to the final decompressed files if successful, otherwise None. |
Source code in datarec/datasets/yelp/yelp_v1.py
download()
Downloads the raw dataset compressed archive.
Returns:
Type | Description |
---|---|
str | The local file path to the downloaded .zip archive. |
Source code in datarec/datasets/yelp/yelp_v1.py
process(path)
Processes the raw file and loads it into the class.
This method reads the JSON file line by line. It extracts the user ID, business ID (as the item), star rating, and date. The date strings are then converted to Unix timestamps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the raw file. | required |
Returns:
Type | Description |
---|---|
None | This method assigns the processed data to |
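The line-by-line parsing and date conversion described above can be sketched as follows. The field names (`user_id`, `business_id`, `stars`, `date`) follow the docstring's description of the review file; the helper itself, and the assumption that dates are UTC `YYYY-MM-DD HH:MM:SS` strings, are illustrative rather than datarec's actual implementation.

```python
import json
from datetime import datetime, timezone


def process_reviews(fileobj):
    """Parse a JSON-lines review file into (user, item, rating, timestamp) rows.

    Dates are assumed to be UTC 'YYYY-MM-DD HH:MM:SS' strings and are
    converted to Unix timestamps.
    """
    rows = []
    for line in fileobj:
        r = json.loads(line)
        dt = datetime.strptime(r["date"], "%Y-%m-%d %H:%M:%S")
        ts = dt.replace(tzinfo=timezone.utc).timestamp()
        rows.append((r["user_id"], r["business_id"], r["stars"], ts))
    return rows
```

Reading one line at a time matters here: the full review file is several gigabytes, so loading it as a single JSON document would be impractical.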