module muben.dataset
This module includes base classes for dataset creation and batch processing.
function pack_instances
pack_instances(**kwargs) → list[dict]
Converts lists of attributes into a list of data instances.
Each data instance is represented as a dictionary with attribute names as keys and the corresponding data point values as values.
Args:
**kwargs
: Variable length keyword arguments, where each key is an attribute name and its value is a list of data points.
Returns:
List[Dict]
: A list of dictionaries, each representing a data instance.
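The column-to-row conversion described above can be sketched in a few lines. This is a minimal re-implementation for illustration, not the library's own code: the name `pack_instances_sketch` and the sample attribute names are hypothetical, and the sketch assumes all lists have the same length (`zip` would otherwise silently truncate to the shortest one).

```python
def pack_instances_sketch(**kwargs) -> list[dict]:
    """Turn per-attribute lists into one dictionary per data instance."""
    names = list(kwargs)
    # zip(*columns) walks the attribute lists in lockstep, one row per instance.
    return [dict(zip(names, row)) for row in zip(*kwargs.values())]


# Hypothetical attribute names, chosen to echo the Dataset attributes.
instances = pack_instances_sketch(smiles=["CCO", "CC"], lbs=[1, 0])
print(instances)
```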
function unpack_instances
unpack_instances(instance_list: list[dict], attr_names: list[str] = None)
Converts a list of dictionaries (data instances) back into lists of attribute values.
This function is essentially the inverse of pack_instances.
Args:
instance_list
(List[Dict]): A list of data instances, where each instance is a dictionary with attribute names as keys.
attr_names
(List[str], optional): A list of attribute names to extract. If not provided, all attributes found in the first instance are used.
Returns:
List[List]
: A list of lists, where each sublist contains all values for a particular attribute across all instances.
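The row-to-column direction can be sketched the same way. This is an illustrative re-implementation under the behavior documented above, not the library's code: `unpack_instances_sketch` is a hypothetical name, and the fallback to the first instance's keys follows the description of `attr_names`.

```python
def unpack_instances_sketch(instance_list, attr_names=None):
    """Turn a list of instance dictionaries into one list per attribute."""
    if attr_names is None:
        # Documented fallback: use the attributes found in the first instance.
        attr_names = list(instance_list[0].keys())
    return [[inst[name] for inst in instance_list] for name in attr_names]


instances = [{"smiles": "CCO", "lbs": 1}, {"smiles": "CC", "lbs": 0}]
smiles, lbs = unpack_instances_sketch(instances)
only_lbs = unpack_instances_sketch(instances, attr_names=["lbs"])
```

Because the result is ordered by `attr_names`, it can be destructured directly into one variable per attribute, as in the first call.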
class Batch
Represents a batch of data instances, where each instance is initialized with attributes provided as keyword arguments.
Each attribute name acts as a key to its corresponding value, allowing for flexible data handling within a batched context.
Attributes:
size
(int): The size of the batch. Defaults to 0.
_tensor_members
(dict): A dictionary to keep track of tensor attributes for device transfer operations.
function __init__
__init__(**kwargs)
Initializes a Batch object with dynamic attributes based on the provided keyword arguments.
Args:
**kwargs
: Arbitrary keyword arguments representing attributes of data instances within the batch. A special keyword 'batch_size' can be used to explicitly set the batch size.
function to
to(device)
Moves all tensor attributes to the specified device (e.g., CPU or CUDA).
Args:
device
: The target device to move the tensor attributes to.
Returns:
self
: The batch instance with its tensor attributes moved to the specified device.
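The pattern of dynamic attributes plus tracked tensor members might look roughly like the sketch below. It is not the library's implementation: `BatchSketch` and `FakeTensor` are hypothetical names, `FakeTensor` stands in for a real tensor so the example runs without a deep learning framework, and treating any value with a `.to()` method as a tensor is an assumption.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor: it only remembers its device."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


class BatchSketch:
    def __init__(self, **kwargs):
        # 'batch_size' is the special keyword; the default size is 0.
        self.size = kwargs.pop("batch_size", 0)
        self._tensor_members = {}
        for name, value in kwargs.items():
            setattr(self, name, value)
            # Assumption: anything exposing .to() is tracked as a tensor.
            if hasattr(value, "to"):
                self._tensor_members[name] = value

    def to(self, device):
        # Move only the tracked tensor attributes; other attributes are untouched.
        for name, value in list(self._tensor_members.items()):
            moved = value.to(device)
            self._tensor_members[name] = moved
            setattr(self, name, moved)
        return self


batch = BatchSketch(features=FakeTensor(), smiles=["CCO", "CC"], batch_size=2)
batch = batch.to("cuda")
```

Returning `self` from `to` lets the call be chained or reassigned, in the same style as calling `.to(device)` on a tensor.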
class Dataset
Custom Dataset class to handle data storage, manipulation, and preprocessing operations.
Attributes:
_smiles
(Union[list[str], None]): Chemical structures represented as strings.
_lbs
(Union[np.ndarray, None]): Data labels.
_masks
(Union[np.ndarray, None]): Data masks.
_ori_ids
(Union[np.ndarray, None]): Original IDs of the datapoints, specifically used for randomly split datasets.
data_instances
: Packed instances of data.
property data_instances
Returns the current data instances, considering whether the full dataset or a selection is being used.
property lbs
Returns the label data, considering whether standardized labels are being used.
property smiles
Returns the chemical structures represented as strings.
function add_sample_by_ids
add_sample_by_ids(ids: list[int] = None)
Appends a subset of data instances to the selected data instances.
Args:
ids
(list[int], optional): Indices of the selected instances.
Raises:
ValueError
: If ids is not specified.
Returns:
self
(Dataset): The dataset with the added data instances.
function create_features
create_features(config)
Creates data features. This method should be implemented by subclasses to generate data features according to different descriptors or fingerprints.
Raises:
NotImplementedError
: This method should be implemented by subclasses.
function downsample_by
downsample_by(file_path: str = None, ids: list[int] = None)
Downsamples the dataset to a subset with the specified indices.
Args:
file_path
(str, optional): Path to the file containing the indices of the selected instances.
ids
(list[int], optional): Indices of the selected instances.
Raises:
ValueError
: If neither ids nor file_path is specified.
Returns:
self
(Dataset): The downsampled dataset.
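How downsample_by, add_sample_by_ids, and the data_instances property could fit together is sketched below. This is an illustration under stated assumptions rather than the library's code: `SelectionSketch` and `_selected` are hypothetical names, starting from an empty selection when nothing has been selected yet is a guess, and the `file_path` branch is left unimplemented because the index file format is not stated here.

```python
class SelectionSketch:
    def __init__(self, instances):
        self._instances = list(instances)
        self._selected = None  # None means "use the full dataset"

    @property
    def data_instances(self):
        # Return the selection if one exists, otherwise the full dataset.
        return self._instances if self._selected is None else self._selected

    def downsample_by(self, file_path=None, ids=None):
        if ids is None and file_path is None:
            raise ValueError("Either `ids` or `file_path` must be specified.")
        if ids is None:
            # The index file format is not documented here, so it is not guessed.
            raise NotImplementedError("Reading indices from a file is omitted.")
        self._selected = [self._instances[i] for i in ids]
        return self

    def add_sample_by_ids(self, ids=None):
        if ids is None:
            raise ValueError("`ids` must be specified.")
        if self._selected is None:
            self._selected = []  # assumption: start from an empty selection
        self._selected.extend(self._instances[i] for i in ids)
        return self


ds = SelectionSketch(["a", "b", "c", "d"])
ds.downsample_by(ids=[0, 2]).add_sample_by_ids(ids=[3])
print(ds.data_instances)
```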
function get_instances
get_instances()
Gets the instances of the dataset. This method should be implemented by subclasses to pack data, labels, and masks into data instances.
Raises:
NotImplementedError
: This method should be implemented by subclasses.
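Both create_features and get_instances are described as hooks for subclasses. A minimal sketch of that contract follows; the class names, the `_features` attribute, the toy "string length" descriptor, and the return values are all hypothetical and only illustrate the override pattern.

```python
class BaseDatasetSketch:
    def __init__(self, smiles, lbs):
        self._smiles = smiles
        self._lbs = lbs
        self._features = None

    def create_features(self, config):
        raise NotImplementedError("Subclasses generate their own features.")

    def get_instances(self):
        raise NotImplementedError("Subclasses pack their own instances.")


class LengthFeatureDataset(BaseDatasetSketch):
    def create_features(self, config):
        # Toy descriptor: the length of each SMILES string.
        self._features = [len(s) for s in self._smiles]
        return self

    def get_instances(self):
        # Pack features and labels into one dictionary per instance.
        return [
            {"features": f, "lbs": lb}
            for f, lb in zip(self._features, self._lbs)
        ]


dataset = LengthFeatureDataset(["CCO", "CC"], [1, 0]).create_features(config=None)
packed = dataset.get_instances()
```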
function load
load(file_path: str)
Loads the entire dataset from disk.
Args:
file_path
(str): Path to the saved file.
Returns: self (Dataset)
function prepare
prepare(config, partition, **kwargs)
Prepares the dataset for training and testing.
Args:
config
: Configuration parameters.
partition
(str): The dataset partition; should be one of 'train', 'valid', 'test'.
Raises:
ValueError
: If partition is not one of 'train', 'valid', 'test'.
Returns:
self
(Dataset): The prepared dataset.
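The partition check described above amounts to a membership test. The helper below is a hypothetical sketch of that validation only (`check_partition` and `VALID_PARTITIONS` are not names from the library), and it says nothing about what prepare does after the check.

```python
VALID_PARTITIONS = ("train", "valid", "test")


def check_partition(partition: str) -> str:
    """Return the partition name unchanged, or raise if it is not recognized."""
    if partition not in VALID_PARTITIONS:
        raise ValueError(
            f"partition must be one of {VALID_PARTITIONS}, got {partition!r}"
        )
    return partition


checked = check_partition("valid")
```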
function read_csv
read_csv(data_dir: str, partition: str)
Reads data from CSV files.
Args:
data_dir
(str): The directory where data files are stored.
partition
(str): The dataset partition ('train', 'valid', 'test').
Raises:
FileNotFoundError
: If the specified file does not exist.
Returns: self (Dataset)
function save
save(file_path: str)
Saves the entire dataset for future use.
Args:
file_path
(str): Path to the save file.
Returns: self (Dataset)
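Since save and load both return the dataset itself, a round trip can be written as two chained calls. The sketch below assumes `pickle` as the on-disk format purely for illustration; the actual serialization format is not stated here, and `PersistableSketch` is a hypothetical class.

```python
import os
import pickle
import tempfile


class PersistableSketch:
    def __init__(self, smiles=None, lbs=None):
        self._smiles = smiles
        self._lbs = lbs

    def save(self, file_path):
        # Assumption: the state is stored as a pickled dictionary.
        with open(file_path, "wb") as f:
            pickle.dump({"smiles": self._smiles, "lbs": self._lbs}, f)
        return self

    def load(self, file_path):
        with open(file_path, "rb") as f:
            state = pickle.load(f)
        self._smiles = state["smiles"]
        self._lbs = state["lbs"]
        return self


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.pkl")
    PersistableSketch(["CCO"], [1]).save(path)
    restored = PersistableSketch().load(path)
```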
function set_standardized_lbs
set_standardized_lbs(lbs)
Sets standardized labels and updates the instance list accordingly.
Args:
lbs
: The standardized label data.
Returns:
self
(Dataset): The dataset with the standardized labels set.
function toggle_standardized_lbs
toggle_standardized_lbs(use_standardized_lbs: bool = None)
Toggles between using standardized and unstandardized labels.
Args:
use_standardized_lbs
(bool, optional): Whether to use standardized labels. Defaults to None.
Returns:
self
(Dataset): The dataset with the standardized labels toggled.
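The interaction between set_standardized_lbs, toggle_standardized_lbs, and the lbs property can be sketched as a flag that selects which label array is returned. This is an assumption-laden illustration, not the library's code: `LabelSketch` and its private attributes are hypothetical, plain lists stand in for NumPy arrays, and treating `None` as "flip the current setting" is a guess about the default.

```python
import statistics


class LabelSketch:
    def __init__(self, lbs):
        self._lbs = lbs
        self._lbs_standardized = None
        self._use_standardized = False

    @property
    def lbs(self):
        # Return whichever label set is currently active.
        return self._lbs_standardized if self._use_standardized else self._lbs

    def set_standardized_lbs(self, lbs):
        self._lbs_standardized = lbs
        return self

    def toggle_standardized_lbs(self, use_standardized_lbs=None):
        if use_standardized_lbs is None:
            # Assumption: None flips the current setting.
            self._use_standardized = not self._use_standardized
        else:
            self._use_standardized = use_standardized_lbs
        return self


raw = [1.0, 2.0, 3.0]
mean, std = statistics.mean(raw), statistics.stdev(raw)
standardized = [(x - mean) / std for x in raw]

ds = LabelSketch(raw).set_standardized_lbs(standardized)
before = ds.lbs
after = ds.toggle_standardized_lbs().lbs
back = ds.toggle_standardized_lbs(False).lbs
```

Keeping both label sets and switching with a flag means the original labels are never overwritten, so metrics can still be reported on the unstandardized scale.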
function update_lbs
update_lbs(lbs)
Updates the dataset labels and the instance list accordingly.
Args:
lbs
: The new labels.
Returns:
self
(Dataset): The dataset with the updated labels.