`module` `muben.dataset`

This module includes base classes for dataset creation and batch processing.

`function` `pack_instances`

pack_instances(**kwargs) → list[dict]

Converts lists of attributes into a list of data instances.

Each data instance is represented as a dictionary with attribute names as keys and the corresponding data point values as values.

Args:

**kwargs: Variable length keyword arguments, where each key is an attribute name and its value is a list of data points.

Returns:

List[Dict]: A list of dictionaries, each representing a data instance.
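The column-to-row conversion described above can be sketched as follows. This is a minimal stand-in written for illustration, not muben's actual source: the name `pack_instances_sketch` is hypothetical, and the real function may handle lists of unequal length differently (this sketch silently truncates to the shortest list).

```python
def pack_instances_sketch(**kwargs):
    """Illustrative stand-in for the documented behavior (hypothetical name)."""
    attr_names = list(kwargs.keys())
    # zip(*columns) walks the attribute lists in lockstep, one row per instance.
    return [dict(zip(attr_names, row)) for row in zip(*kwargs.values())]


instances = pack_instances_sketch(
    smiles=["CCO", "c1ccccc1"],
    lbs=[1, 0],
)
print(instances)
# [{'smiles': 'CCO', 'lbs': 1}, {'smiles': 'c1ccccc1', 'lbs': 0}]
```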

`function` `unpack_instances`

unpack_instances(instance_list: list[dict], attr_names: list[str] = None) → list[list]

Converts a list of dictionaries (data instances) back into lists of attribute values.

This function is essentially the inverse of pack_instances.

Args:

instance_list (List[Dict]): A list of data instances, where each instance is a dictionary with attribute names as keys.
attr_names (List[str], optional): A list of attribute names to extract. If not provided, all attributes found in the first instance are used.

Returns:

List[List]: A list of lists, where each sublist contains all values for a particular attribute across all instances.
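The inverse conversion might look like the sketch below. It is a stand-in under stated assumptions rather than the library's implementation: `unpack_instances_sketch` is a hypothetical name, and the sketch assumes a non-empty `instance_list` when `attr_names` is omitted, since it reads the keys of the first instance.

```python
def unpack_instances_sketch(instance_list, attr_names=None):
    """Illustrative stand-in for the documented behavior (hypothetical name)."""
    if attr_names is None:
        # Fall back to the keys of the first instance, as the docs describe.
        attr_names = list(instance_list[0].keys())
    # One output list per attribute, each spanning all instances.
    return [[inst[name] for inst in instance_list] for name in attr_names]


instances = [
    {"smiles": "CCO", "lbs": 1},
    {"smiles": "c1ccccc1", "lbs": 0},
]
smiles, lbs = unpack_instances_sketch(instances)
(only_lbs,) = unpack_instances_sketch(instances, attr_names=["lbs"])
```

Because the result is ordered by `attr_names`, it can be unpacked directly into one variable per attribute, as in the last two lines.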

`class` `Batch`

Represents a batch of data instances, where each instance is initialized with attributes provided as keyword arguments.

Each attribute name acts as a key to its corresponding value, allowing for flexible data handling within a batched context.

Attributes:

size (int): The size of the batch. Defaults to 0.
_tensor_members (dict): A dictionary to keep track of tensor attributes for device transfer operations.

`function` `__init__`

__init__(**kwargs)

Initializes a Batch object with dynamic attributes based on the provided keyword arguments.

Args:

**kwargs: Arbitrary keyword arguments representing attributes of data instances within the batch. A special keyword 'batch_size' can be used to explicitly set the batch size.

`function` `to`

to(device)

Moves all tensor attributes to the specified device (for example, CPU or CUDA).

Args:

device: The target device to move the tensor attributes to.

Returns:

self: The batch instance with its tensor attributes moved to the specified device.
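To make the attribute handling and device transfer concrete, here is a small sketch of how such a class could work. Everything in it is an assumption made for illustration: `BatchSketch` and `FakeTensor` are hypothetical names, the sketch treats any value exposing a `.to()` method as a tensor, and `FakeTensor` stands in for a real tensor so the example runs without PyTorch. The actual `Batch` class may track tensors differently.

```python
class BatchSketch:
    """Illustrative stand-in for the documented Batch behavior."""

    def __init__(self, **kwargs):
        # 'batch_size' is the special keyword described above; size defaults to 0.
        self.size = kwargs.pop("batch_size", 0)
        self._tensor_members = {}
        for name, value in kwargs.items():
            setattr(self, name, value)
            # Assumption: anything with a callable .to() counts as tensor-like.
            if callable(getattr(value, "to", None)):
                self._tensor_members[name] = value

    def to(self, device):
        # Move only the tracked tensor attributes; other attributes are untouched.
        for name, value in list(self._tensor_members.items()):
            moved = value.to(device)
            self._tensor_members[name] = moved
            setattr(self, name, moved)
        return self


class FakeTensor:
    """Hypothetical tensor substitute so the sketch runs without PyTorch."""

    def __init__(self, data, device="cpu"):
        self.data, self.device = data, device

    def to(self, device):
        return FakeTensor(self.data, device)


batch = BatchSketch(feats=FakeTensor([1.0, 2.0]), names=["a", "b"], batch_size=2)
batch = batch.to("cuda")  # returns self, so the call can be chained or reassigned
```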

`class` `Dataset`

Custom Dataset class to handle data storage, manipulation, and preprocessing operations.

Attributes:

_smiles (Union[list[str], None]): Chemical structures represented as strings.
_lbs (Union[np.ndarray, None]): Data labels.
_masks (Union[np.ndarray, None]): Data masks.
_ori_ids (Union[np.ndarray, None]): Original IDs of the datapoints, specifically used for randomly split datasets.
data_instances: Packed instances of data.

`property` `data_instances`

Returns the current data instances, considering whether the full dataset or a selection is being used.

`property` `lbs`

Returns the label data, considering whether standardized labels are being used.

`property` `smiles`

Returns the chemical structures represented as strings.

`function` `add_sample_by_ids`

add_sample_by_ids(ids: list[int] = None)

Appends a subset of data instances to the selected data instances.

Args:

ids (list[int], optional): Indices of the selected instances.

Raises:

ValueError: If ids is not specified.

Returns:

self (Dataset): The dataset with the added data instances.

`function` `create_features`

create_features(config)

Creates data features. This method should be implemented by subclasses to generate data features according to different descriptors or fingerprints.

Raises:

NotImplementedError: This method should be implemented by subclasses.

`function` `downsample_by`

downsample_by(file_path: str = None, ids: list[int] = None)

Downsamples the dataset to a subset with the specified indices.

Args:

file_path (str, optional): Path to the file containing the indices of the selected instances.
ids (list[int], optional): Indices of the selected instances.

Raises:

ValueError: If neither ids nor file_path is specified.

Returns:

self (Dataset): The downsampled dataset.
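The interplay between `data_instances`, `downsample_by`, and `add_sample_by_ids` can be pictured with the sketch below. It is a simplified stand-in, not the library's code: `SelectableSketch` and its private fields are hypothetical, the `file_path` option is left out because the index file format is not described here, and starting from an empty selection in `add_sample_by_ids` is an assumption.

```python
class SelectableSketch:
    """Illustrative stand-in for selection-aware data instances."""

    def __init__(self, instances):
        self._all = list(instances)
        self._selected = None  # None means "use the full dataset"

    @property
    def data_instances(self):
        # Return the selection if one exists, otherwise the full dataset.
        return self._all if self._selected is None else self._selected

    def downsample_by(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        # Replace the current view with the chosen subset.
        self._selected = [self._all[i] for i in ids]
        return self

    def add_sample_by_ids(self, ids=None):
        if ids is None:
            raise ValueError("ids must be specified")
        if self._selected is None:
            self._selected = []  # assumption: start from an empty selection
        # Append to the existing selection instead of replacing it.
        self._selected.extend(self._all[i] for i in ids)
        return self


ds = SelectableSketch(["a", "b", "c", "d"])
ds.downsample_by(ids=[0, 2]).add_sample_by_ids(ids=[3])
print(ds.data_instances)  # ['a', 'c', 'd']
```

Returning `self` from both methods is what allows the chained call in the usage lines.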

`function` `get_instances`

get_instances()

Gets the instances of the dataset. This method should be implemented by subclasses to pack data, labels, and masks into data instances.

Raises:

NotImplementedError: This method should be implemented by subclasses.
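The subclassing contract for `create_features` and `get_instances` could be exercised along these lines. The sketch uses a hypothetical minimal base class (`DatasetSketch`) rather than the real `Dataset`, and the "feature" is a toy value (the SMILES string length) chosen only to keep the example dependency-free. The `_features` attribute and the instance keys are assumptions.

```python
class DatasetSketch:
    """Hypothetical minimal base; the real Dataset carries much more state."""

    def __init__(self, smiles, lbs):
        self._smiles = smiles
        self._lbs = lbs

    def create_features(self, config):
        raise NotImplementedError

    def get_instances(self):
        raise NotImplementedError


class LengthDataset(DatasetSketch):
    """Toy subclass: the 'feature' is just the SMILES string length."""

    def create_features(self, config):
        # A real subclass would compute descriptors or fingerprints here.
        self._features = [len(s) for s in self._smiles]
        return self

    def get_instances(self):
        # Pack features and labels into one dictionary per data point.
        return [
            {"features": f, "lbs": lb}
            for f, lb in zip(self._features, self._lbs)
        ]


dataset = LengthDataset(["CCO", "c1ccccc1"], [1, 0]).create_features(config=None)
instances = dataset.get_instances()
```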

`function` `load`

load(file_path: str)

Loads the entire dataset from disk.

Args:

file_path (str): Path to the saved file.

Returns:

self (Dataset): The loaded dataset.

`function` `prepare`

prepare(config, partition, **kwargs)

Prepares the dataset for training and testing.

Args:

config: Configuration parameters.
partition (str): The dataset partition; should be one of 'train', 'valid', 'test'.

Raises:

ValueError: If partition is not one of 'train', 'valid', 'test'.

Returns:

self (Dataset): The prepared dataset.

`function` `read_csv`

read_csv(data_dir: str, partition: str)

Reads data from CSV files.

Args:

data_dir (str): The directory where data files are stored.
partition (str): The dataset partition ('train', 'valid', 'test').

Raises:

FileNotFoundError: If the specified file does not exist.

Returns:

self (Dataset): The dataset populated with the data read from the CSV file.

`function` `save`

save(file_path: str)

Saves the entire dataset for future use.

Args:

file_path (str): Path to the save file.

Returns:

self (Dataset): The saved dataset.

`function` `set_standardized_lbs`

set_standardized_lbs(lbs)

Sets standardized labels and updates the instance list accordingly.

Args:

lbs: The standardized label data.

Returns:

self (Dataset): The dataset with the standardized labels set.

`function` `toggle_standardized_lbs`

toggle_standardized_lbs(use_standardized_lbs: bool = None)

Toggle between using standardized and unstandardized labels.

Args:

use_standardized_lbs (bool, optional): Whether to use standardized labels. Defaults to None.

Returns:

self (Dataset): The dataset with the standardized labels toggled.
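How the `lbs` property, `set_standardized_lbs`, and `toggle_standardized_lbs` might fit together is sketched below. This is an illustration under assumptions, not the library's implementation: `LabelSketch` and its private fields are hypothetical, plain lists replace NumPy arrays, and the sketch assumes that passing `None` flips the current state, which the description above does not confirm.

```python
import statistics


class LabelSketch:
    """Illustrative stand-in for switching between raw and standardized labels."""

    def __init__(self, lbs):
        self._lbs = lbs
        self._lbs_standardized = None
        self._use_standardized_lbs = False

    @property
    def lbs(self):
        # Serve standardized labels only when they exist and are switched on.
        if self._use_standardized_lbs and self._lbs_standardized is not None:
            return self._lbs_standardized
        return self._lbs

    def set_standardized_lbs(self, lbs):
        self._lbs_standardized = lbs
        return self

    def toggle_standardized_lbs(self, use_standardized_lbs=None):
        if use_standardized_lbs is None:
            # Assumption: None flips the current state.
            self._use_standardized_lbs = not self._use_standardized_lbs
        else:
            self._use_standardized_lbs = use_standardized_lbs
        return self


raw = [1.0, 2.0, 3.0]
mean, std = statistics.mean(raw), statistics.pstdev(raw)
standardized = [(x - mean) / std for x in raw]  # zero mean, unit variance

ds = LabelSketch(raw).set_standardized_lbs(standardized)
ds.toggle_standardized_lbs(True)   # ds.lbs now returns the standardized values
ds.toggle_standardized_lbs(False)  # ds.lbs returns the raw values again
```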

`function` `update_lbs`

update_lbs(lbs)

Updates the dataset labels and the instance list accordingly.

Args:

lbs: The new labels.

Returns:

self (Dataset): The dataset with the updated labels.
