module muben.dataset

This module includes base classes for dataset creation and batch processing.

function pack_instances

pack_instances(**kwargs) → list[dict]

Converts lists of attributes into a list of data instances.

Each data instance is represented as a dictionary with attribute names as keys and the corresponding data point values as values.


Args:
  • **kwargs: Keyword arguments, where each key is an attribute name and each value is a list of data points.


Returns:
  • List[Dict]: A list of dictionaries, each representing a data instance.
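
The packing described above can be illustrated with a minimal sketch. `pack_instances_sketch` is a hypothetical stand-in written for this example, not the library's source; in particular, truncating to the shortest list (the behavior of `zip`) is an assumption.

```python
def pack_instances_sketch(**kwargs) -> list[dict]:
    """Hypothetical re-implementation for illustration; not the library's source."""
    names = list(kwargs.keys())
    # zip(*values) walks the attribute lists in lockstep, one tuple per instance.
    # Assumption: lists of unequal length are truncated to the shortest one.
    return [dict(zip(names, values)) for values in zip(*kwargs.values())]


instances = pack_instances_sketch(smiles=["CCO", "c1ccccc1"], lbs=[0, 1])
print(instances)
```

Each resulting dictionary holds one value per attribute, so the i-th dictionary collects the i-th element of every input list.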

function unpack_instances

unpack_instances(instance_list: list[dict], attr_names: list[str] = None)

Converts a list of dictionaries (data instances) back into lists of attribute values.

This function is essentially the inverse of pack_instances.


Args:
  • instance_list (List[Dict]): A list of data instances, where each instance is a dictionary with attribute names as keys.
  • attr_names (List[str], optional): A list of attribute names to extract. If not provided, all attributes found in the first instance are used.


Returns:
  • List[List]: A list of lists, where each sublist contains all values for a particular attribute across all instances.
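
A minimal sketch of the unpacking step follows. `unpack_instances_sketch` is a hypothetical stand-in for illustration only; it assumes every instance contains all requested attribute names.

```python
def unpack_instances_sketch(instance_list, attr_names=None):
    """Hypothetical re-implementation for illustration; not the library's source."""
    if attr_names is None:
        # Fall back to the keys of the first instance, as the description states.
        attr_names = list(instance_list[0].keys())
    # One output list per attribute, each holding that attribute for every instance.
    return [[inst[name] for inst in instance_list] for name in attr_names]


instances = [{"smiles": "CCO", "lbs": 0}, {"smiles": "c1ccccc1", "lbs": 1}]
smiles, lbs = unpack_instances_sketch(instances)
only_lbs = unpack_instances_sketch(instances, attr_names=["lbs"])
```

Passing `attr_names` restricts and orders the output; the result is still a list of lists, even when only one attribute is requested.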

class Batch

Represents a batch of data instances, where each instance is initialized with attributes provided as keyword arguments.

Each attribute name acts as a key to its corresponding value, allowing for flexible data handling within a batched context.


Attributes:
  • size (int): The size of the batch. Defaults to 0.
  • _tensor_members (dict): A dictionary to keep track of tensor attributes for device transfer operations.

function __init__


Initializes a Batch object with dynamic attributes based on the provided keyword arguments.


Args:
  • **kwargs: Arbitrary keyword arguments representing attributes of data instances within the batch. The special keyword 'batch_size' can be used to set the batch size explicitly.

function to


Moves all tensor attributes to the specified device (e.g., CPU or CUDA).


Args:
  • device: The target device to move the tensor attributes to.


Returns:
  • self: The batch instance with its tensor attributes moved to the specified device.
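
The interplay between dynamic attributes, `_tensor_members`, and `to` might look roughly like the sketch below. Both classes are hypothetical: `FakeTensor` is a stand-in that only mimics a `.to(device)` method so the example runs without PyTorch, and the rule "anything exposing `.to()` counts as a tensor" is an assumption, not the library's documented check.

```python
class FakeTensor:
    """Stand-in for a tensor (hypothetical); only mimics a .to(device) method."""

    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        return FakeTensor(self.data, device)


class BatchSketch:
    """Hypothetical sketch of the described behavior; not the library's source."""

    def __init__(self, **kwargs):
        # 'batch_size' is consumed here and does not become a regular attribute.
        self.size = kwargs.pop("batch_size", 0)
        self._tensor_members = {}
        for name, value in kwargs.items():
            setattr(self, name, value)
            # Assumption: anything exposing .to() is treated as a tensor.
            if hasattr(value, "to"):
                self._tensor_members[name] = value

    def to(self, device):
        # Move only the tracked tensor attributes; other attributes stay untouched.
        for name, value in list(self._tensor_members.items()):
            moved = value.to(device)
            self._tensor_members[name] = moved
            setattr(self, name, moved)
        return self


batch = BatchSketch(lbs=FakeTensor([0, 1]), smiles=["CCO", "c1ccccc1"], batch_size=2)
batch = batch.to("cuda")  # "cuda" is only a label here; no GPU is required
```

Returning `self` from `to` is what allows the call to be chained or reassigned in one line.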

class Dataset

Custom Dataset class to handle data storage, manipulation, and preprocessing operations.


Attributes:
  • _smiles (Union[list[str], None]): Chemical structures represented as strings.
  • _lbs (Union[np.ndarray, None]): Data labels.
  • _masks (Union[np.ndarray, None]): Data masks.
  • _ori_ids (Union[np.ndarray, None]): Original IDs of the data points, specifically used for randomly split datasets.
  • data_instances: Packed instances of data.

property data_instances

Returns the data instances currently in use: either the full dataset or the selected subset.

property lbs

Returns the label data, either standardized or original, depending on which is currently in use.

property smiles

Returns the chemical structures represented as strings.

function add_sample_by_ids

add_sample_by_ids(ids: list[int] = None)

Appends a subset of data instances to the selected data instances.


Args:
  • ids (list[int], optional): Indices of the selected instances.


Raises:
  • ValueError: If ids is not specified.


Returns:
  • self (Dataset): The dataset with the added data instances.

function create_features


Creates data features. This method should be implemented by subclasses to generate data features according to different descriptors or fingerprints.


Raises:
  • NotImplementedError: This method should be implemented by subclasses.

function downsample_by

downsample_by(file_path: str = None, ids: list[int] = None)

Downsamples the dataset to a subset with the specified indices.


Args:
  • file_path (str, optional): Path to the file containing the indices of the selected instances.
  • ids (list[int], optional): Indices of the selected instances.


Raises:
  • ValueError: If neither ids nor file_path is specified.


Returns:
  • self (Dataset): The downsampled dataset.
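
Index-based selection, as used by downsample_by and add_sample_by_ids, can be sketched as follows. `SelectableDataset` is a hypothetical class written for this example. It omits the `file_path` option because the index file format is not described here, and its handling of add_sample_by_ids on a dataset with no prior selection is an assumption.

```python
class SelectableDataset:
    """Hypothetical sketch of index-based selection; not the library's source."""

    def __init__(self, instances):
        self._instances = instances
        self._selected = None  # None means "use the full dataset"

    @property
    def data_instances(self):
        # Return the selection if one exists, otherwise the full dataset.
        return self._instances if self._selected is None else self._selected

    def downsample_by(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        self._selected = [self._instances[i] for i in ids]
        return self

    def add_sample_by_ids(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        if self._selected is None:
            # Assumption: appending to a dataset with no selection starts a new one.
            self._selected = []
        self._selected.extend(self._instances[i] for i in ids)
        return self


dataset = SelectableDataset(["a", "b", "c", "d"])
dataset.downsample_by(ids=[0, 2]).add_sample_by_ids(ids=[3])
print(dataset.data_instances)
```

Because both methods return `self`, a downsample followed by an append can be written as a single chained expression.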

function get_instances


Gets the instances of the dataset. This method should be implemented by subclasses to pack data, labels, and masks into data instances.


Raises:
  • NotImplementedError: This method should be implemented by subclasses.
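
The subclassing contract for create_features and get_instances can be illustrated with a toy example. Everything below is hypothetical: `DatasetSketch` stands in for the base class, and `LengthDataset` uses the SMILES string length as a deliberately trivial "feature" rather than a real descriptor or fingerprint.

```python
class DatasetSketch:
    """Hypothetical stand-in for the base class; not the library's source."""

    def __init__(self, smiles, lbs):
        self._smiles = smiles
        self._lbs = lbs
        self._features = None

    def create_features(self):
        raise NotImplementedError

    def get_instances(self):
        raise NotImplementedError


class LengthDataset(DatasetSketch):
    """Toy subclass: uses the SMILES string length as a one-number feature."""

    def create_features(self):
        self._features = [len(s) for s in self._smiles]
        return self

    def get_instances(self):
        # Pack features and labels into one dictionary per data point.
        return [
            {"features": f, "lbs": lb}
            for f, lb in zip(self._features, self._lbs)
        ]


dataset = LengthDataset(["CCO", "c1ccccc1"], [0, 1]).create_features()
instances = dataset.get_instances()
```

Calling either method on the base class directly raises NotImplementedError, which is how a missing override surfaces at run time.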

function load

load(file_path: str)

Loads the entire dataset from disk.


Args:
  • file_path (str): Path to the saved file.

Returns: self (Dataset)

function prepare

prepare(config, partition, **kwargs)

Prepares the dataset for training and testing.


Args:
  • config: Configuration parameters.
  • partition (str): The dataset partition; should be one of 'train', 'valid', 'test'.


Raises:
  • ValueError: If partition is not one of 'train', 'valid', 'test'.


Returns:
  • self (Dataset): The prepared dataset.
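
The partition check described above amounts to a membership test. The helper below is a hypothetical sketch (`check_partition` and `VALID_PARTITIONS` are names invented for this example), not the validation code used by prepare.

```python
VALID_PARTITIONS = ("train", "valid", "test")  # the three names listed above


def check_partition(partition: str) -> str:
    """Hypothetical helper: reject anything outside the documented partitions."""
    if partition not in VALID_PARTITIONS:
        raise ValueError(
            f"partition must be one of {VALID_PARTITIONS}, got {partition!r}"
        )
    return partition


checked = check_partition("train")
```

Note that a common alternative spelling such as 'dev' or 'val' would be rejected under this rule.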

function read_csv

read_csv(data_dir: str, partition: str)

Reads data from CSV files.


Args:
  • data_dir (str): The directory where data files are stored.
  • partition (str): The dataset partition ('train', 'valid', 'test').


Raises:
  • FileNotFoundError: If the specified file does not exist.

Returns: self (Dataset)

function save

save(file_path: str)

Saves the entire dataset for future use.


Args:
  • file_path (str): Path to the save file.

Returns: self (Dataset)

function set_standardized_lbs


Sets standardized labels and updates the instance list accordingly.


Args:
  • lbs: The standardized label data.


Returns:
  • self (Dataset): The dataset with the standardized labels set.

function toggle_standardized_lbs

toggle_standardized_lbs(use_standardized_lbs: bool = None)

Toggles between using standardized and unstandardized labels.


Args:
  • use_standardized_lbs (bool, optional): Whether to use standardized labels. Defaults to None.


Returns:
  • self (Dataset): The dataset with the standardized labels toggled.
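
How the lbs property, set_standardized_lbs, and toggle_standardized_lbs might fit together is sketched below. `LabelStore` is a hypothetical class, plain lists replace NumPy arrays to keep the example dependency-free, and treating `None` as "flip the current state" is an assumption about the default.

```python
import statistics


class LabelStore:
    """Hypothetical sketch of label switching; not the library's source."""

    def __init__(self, lbs):
        self._lbs = lbs
        self._lbs_standardized = None
        self._use_standardized = False

    @property
    def lbs(self):
        # Serve standardized labels only when they exist and are switched on.
        if self._use_standardized and self._lbs_standardized is not None:
            return self._lbs_standardized
        return self._lbs

    def set_standardized_lbs(self, lbs):
        self._lbs_standardized = lbs
        return self

    def toggle_standardized_lbs(self, use_standardized_lbs=None):
        if use_standardized_lbs is None:
            # Assumption: None flips the current state.
            self._use_standardized = not self._use_standardized
        else:
            self._use_standardized = use_standardized_lbs
        return self


raw = [1.0, 2.0, 3.0]
mean, std = statistics.mean(raw), statistics.pstdev(raw)
standardized = [(x - mean) / std for x in raw]

store = LabelStore(raw).set_standardized_lbs(standardized)
labels_before = store.lbs
labels_after = store.toggle_standardized_lbs(True).lbs
labels_flipped = store.toggle_standardized_lbs().lbs
```

In this sketch, setting the standardized labels does not switch to them automatically; the toggle call decides which set the property returns.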

function update_lbs


Updates the dataset labels and the instance list accordingly.


Args:
  • lbs: The new labels.


Returns:
  • self (Dataset): The dataset with the updated labels.