module muben.dataset

This module includes base classes for dataset creation and batch processing.


function pack_instances

pack_instances(**kwargs) → list[dict]

Converts lists of attributes into a list of data instances.

Each data instance is represented as a dictionary with attribute names as keys and the corresponding data point values as values.

Args:

  • **kwargs: Variable length keyword arguments, where each key is an attribute name and its value is a list of data points.

Returns:

  • List[Dict]: A list of dictionaries, each representing a data instance.
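
The column-to-row conversion described above can be illustrated with a minimal sketch. `pack_instances_sketch` is a hypothetical stand-in written for this example, not the library's implementation; in particular, this sketch truncates to the shortest list, and how the real function treats lists of unequal length is not stated here.

```python
def pack_instances_sketch(**kwargs) -> list[dict]:
    """Illustrative stand-in for pack_instances (assumed behavior, not the library's code)."""
    if not kwargs:
        return []
    names = list(kwargs.keys())
    # zip(*columns) walks the attribute lists in lockstep, yielding one tuple per instance.
    return [dict(zip(names, values)) for values in zip(*kwargs.values())]


# The attribute names "smiles" and "lbs" are chosen only for illustration.
instances = pack_instances_sketch(smiles=["CCO", "c1ccccc1"], lbs=[0, 1])
print(instances)
```

Each resulting dictionary holds one value per attribute, so the i-th instance collects the i-th element of every input list.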

function unpack_instances

unpack_instances(instance_list: list[dict], attr_names: list[str] = None)

Converts a list of dictionaries (data instances) back into lists of attribute values.

This function is essentially the inverse of pack_instances.

Args:

  • instance_list (List[Dict]): A list of data instances, where each instance is a dictionary with attribute names as keys.
  • attr_names (List[str], optional): A list of attribute names to extract. If not provided, all attributes found in the first instance are used.

Returns:

  • List[List]: A list of lists, where each sublist contains all values for a particular attribute across all instances.
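
As a rough illustration of the row-to-column conversion, the sketch below mirrors the documented behavior, including the fallback to the keys of the first instance. `unpack_instances_sketch` is a hypothetical name used only here, and the handling of an empty `instance_list` is not covered by this sketch.

```python
def unpack_instances_sketch(instance_list, attr_names=None):
    """Illustrative stand-in for unpack_instances (assumed behavior, not the library's code)."""
    if attr_names is None:
        # Fall back to the attributes found in the first instance, as documented.
        attr_names = list(instance_list[0].keys())
    # Build one list per attribute, collecting that attribute from every instance.
    return [[inst[name] for inst in instance_list] for name in attr_names]


instances = [{"smiles": "CCO", "lbs": 0}, {"smiles": "c1ccccc1", "lbs": 1}]

# Without attr_names, one list is returned per key of the first instance.
smiles, lbs = unpack_instances_sketch(instances)

# With attr_names, only the requested attributes are extracted.
(only_lbs,) = unpack_instances_sketch(instances, attr_names=["lbs"])
```

The order of the returned lists follows the order of `attr_names`, which is why the result can be unpacked directly into named variables.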

class Batch

Represents a batch of data instances, where each instance is initialized with attributes provided as keyword arguments.

Each attribute name acts as a key to its corresponding value, allowing for flexible data handling within a batched context.

Attributes:

  • size (int): The size of the batch. Defaults to 0.
  • _tensor_members (dict): A dictionary to keep track of tensor attributes for device transfer operations.

function __init__

__init__(**kwargs)

Initializes a Batch object with dynamic attributes based on the provided keyword arguments.

Args:

  • **kwargs: Arbitrary keyword arguments representing attributes of data instances within the batch. A special keyword 'batch_size' can be used to explicitly set the batch size.

function to

to(device)

Moves all tensor attributes to the specified device (for example, CPU or CUDA).

Args:

  • device: The target device to move the tensor attributes to.

Returns:

  • self: The batch instance with its tensor attributes moved to the specified device.
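
The interplay between dynamic attributes, `_tensor_members`, and `to` might look roughly like the sketch below. Everything in it is an assumption made for illustration: `FakeTensor` replaces a real tensor so the example runs without PyTorch, `BatchSketch` is not the library's `Batch` class, and the attribute name `atom_ids` is hypothetical.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor, so the sketch runs without PyTorch."""

    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        # Like a real tensor, return a new object that lives on the target device.
        return FakeTensor(self.data, device)


class BatchSketch:
    """Illustrative stand-in for Batch (assumed behavior, not the library's code)."""

    def __init__(self, **kwargs):
        # 'batch_size' is treated as a special keyword; size defaults to 0.
        self.size = kwargs.pop("batch_size", 0)
        self._tensor_members = {}
        for name, value in kwargs.items():
            if isinstance(value, FakeTensor):
                # Remember tensor attributes so that to() knows what to move.
                self._tensor_members[name] = value
            setattr(self, name, value)

    def to(self, device):
        for name, value in list(self._tensor_members.items()):
            moved = value.to(device)
            self._tensor_members[name] = moved
            setattr(self, name, moved)
        return self


batch = BatchSketch(atom_ids=FakeTensor([1, 2, 3]), smiles=["CCO"], batch_size=1)
batch = batch.to("cuda")  # only the tensor attribute is moved
```

Returning `self` from `to` allows the call to be chained, while non-tensor attributes such as the list of strings are left untouched.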

class Dataset

Custom Dataset class to handle data storage, manipulation, and preprocessing operations.

Attributes:

  • _smiles (Union[list[str], None]): Chemical structures represented as strings.
  • _lbs (Union[np.ndarray, None]): Data labels.
  • _masks (Union[np.ndarray, None]): Data masks.
  • _ori_ids (Union[np.ndarray, None]): Original IDs of the datapoints, specifically used for randomly split datasets.
  • data_instances: Packed instances of data.

property data_instances

Returns the current data instances, considering whether the full dataset or a selection is being used.


property lbs

Returns the label data, considering whether standardized labels are being used.


property smiles

Returns the chemical structures represented as strings.


function add_sample_by_ids

add_sample_by_ids(ids: list[int] = None)

Appends a subset of data instances to the selected data instances.

Args:

  • ids (list[int], optional): Indices of the selected instances.

Raises:

  • ValueError: If ids is not specified.

Returns:

  • self (Dataset): The dataset with the added data instances.

function create_features

create_features(config)

Creates data features. This method should be implemented by subclasses to generate data features according to different descriptors or fingerprints.

Raises:

  • NotImplementedError: This method should be implemented by subclasses.

function downsample_by

downsample_by(file_path: str = None, ids: list[int] = None)

Downsamples the dataset to a subset with the specified indices.

Args:

  • file_path (str, optional): Path to the file containing the indices of the selected instances.
  • ids (list[int], optional): Indices of the selected instances.

Raises:

  • ValueError: If neither ids nor file_path is specified.

Returns:

  • self (Dataset): The downsampled dataset.
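
One way to picture how downsample_by, add_sample_by_ids, and the data_instances property fit together is sketched below. `SelectionSketch` and its private fields are hypothetical and written only for this example. The `file_path` option is left out because the format of the index file is not described here, so the sketch raises `ValueError` whenever `ids` is missing.

```python
class SelectionSketch:
    """Illustrative model of index-based selection (assumed, not the library's code)."""

    def __init__(self, instances):
        self._all_instances = list(instances)
        self._selected_ids = None  # None means "use the full dataset"

    @property
    def data_instances(self):
        # Return either the full dataset or the current selection.
        if self._selected_ids is None:
            return self._all_instances
        return [self._all_instances[i] for i in self._selected_ids]

    def downsample_by(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        # Replace the current selection with the given indices.
        self._selected_ids = list(ids)
        return self

    def add_sample_by_ids(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        # Append the given indices to the current selection.
        current = self._selected_ids if self._selected_ids is not None else []
        self._selected_ids = current + list(ids)
        return self


ds = SelectionSketch(["a", "b", "c", "d"])
ds.downsample_by(ids=[0, 2]).add_sample_by_ids(ids=[3])
selected = ds.data_instances
```

Because both methods return `self`, a downsampling step can be followed directly by adding further samples, as in the chained call above.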

function get_instances

get_instances()

Gets the instances of the dataset. This method should be implemented by subclasses to pack data, labels, and masks into data instances.

Raises:

  • NotImplementedError: This method should be implemented by subclasses.

function load

load(file_path: str)

Loads the entire dataset from disk.

Args:

  • file_path (str): Path to the saved file.

Returns:

  • self (Dataset): The loaded dataset.


function prepare

prepare(config, partition, **kwargs)

Prepares the dataset for training and testing.

Args:

  • config: Configuration parameters.
  • partition (str): The dataset partition; should be one of 'train', 'valid', 'test'.

Raises:

  • ValueError: If partition is not one of 'train', 'valid', 'test'.

Returns:

  • self (Dataset): The prepared dataset.

function read_csv

read_csv(data_dir: str, partition: str)

Reads data from CSV files.

Args:

  • data_dir (str): The directory where data files are stored.
  • partition (str): The dataset partition ('train', 'valid', 'test').

Raises:

  • FileNotFoundError: If the specified file does not exist.

Returns:

  • self (Dataset): The dataset with the CSV data loaded.


function save

save(file_path: str)

Saves the entire dataset for future use.

Args:

  • file_path (str): Path to the save file.

Returns:

  • self (Dataset): The saved dataset.


function set_standardized_lbs

set_standardized_lbs(lbs)

Sets standardized labels and updates the instance list accordingly.

Args:

  • lbs: The standardized label data.

Returns:

  • self (Dataset): The dataset with the standardized labels set.

function toggle_standardized_lbs

toggle_standardized_lbs(use_standardized_lbs: bool = None)

Toggles between using standardized and unstandardized labels.

Args:

  • use_standardized_lbs (bool, optional): Whether to use standardized labels. Defaults to None.

Returns:

  • self (Dataset): The dataset with the standardized labels toggled.
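
The relationship between set_standardized_lbs, toggle_standardized_lbs, and the lbs property can be sketched as follows. `LabelSketch` is a hypothetical class, and two of its behaviors are assumptions rather than documented facts: passing `None` flips the current setting, and the raw labels are returned when no standardized labels have been set. The sketch uses plain lists instead of NumPy arrays to stay dependency-free.

```python
import statistics


class LabelSketch:
    """Illustrative model of label switching (assumed, not the library's code)."""

    def __init__(self, lbs):
        self._lbs = lbs
        self._lbs_standardized = None
        self._use_standardized = False

    @property
    def lbs(self):
        # Serve standardized labels only when they exist and are switched on.
        if self._use_standardized and self._lbs_standardized is not None:
            return self._lbs_standardized
        return self._lbs

    def set_standardized_lbs(self, lbs):
        self._lbs_standardized = lbs
        return self

    def toggle_standardized_lbs(self, use_standardized_lbs=None):
        if use_standardized_lbs is None:
            # Assumption: None flips the current setting.
            self._use_standardized = not self._use_standardized
        else:
            self._use_standardized = use_standardized_lbs
        return self


raw = [1.0, 2.0, 3.0]
mean, std = statistics.mean(raw), statistics.pstdev(raw)
standardized = [(x - mean) / std for x in raw]

ds = LabelSketch(raw).set_standardized_lbs(standardized)
with_std = ds.toggle_standardized_lbs(True).lbs   # standardized labels
without_std = ds.toggle_standardized_lbs().lbs    # flipped back to raw labels
```

Keeping both label sets and switching with a flag lets training use standardized targets while evaluation reads the original values from the same object.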

function update_lbs

update_lbs(lbs)

Updates the dataset labels and the instance list accordingly.

Args:

  • lbs: The new labels.

Returns:

  • self (Dataset): The dataset with the updated labels.