Customization
We have shown in the previous sections how to reproduce our results through a Python (Jupyter) script or command line interface. In this section, we focus on extending our benchmark to more datasets and backbone models.
Note
We have also tried to modularize the UQ methods. However, most of their implementations are deeply entangled with the model architecture and training process. Therefore, we only provide a limited selection of UQ methods in MUBen and leave further extension to the community.
Customize dataset
Prepare the dataset
Training the backbone/UQ methods with a customized dataset is quite straightforward.
If you want to test the UQ methods on your own dataset, you can organize your data as a `pandas.DataFrame` with three keys: `["smiles", "labels", "masks"]`.
Their types are shown below.
```
{
    "smiles": list of str,
    "labels": list of list of int/float,
    "masks": list of list of int/float (with values within {0, 1})
}
```
Here, `mask=1` indicates an informative label at that position, and `mask=0` indicates a missing label.
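For illustration, the sketch below builds such a DataFrame for a toy two-task classification dataset; the SMILES strings and labels are made up and only the structure matters.

```python
import pandas as pd

# Toy two-task classification data (hypothetical SMILES and labels).
# A missing label is filled with a placeholder value and masked out with 0.
data = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "labels": [[1, 0], [0, 0], [1, 1]],
    "masks": [[1, 1], [1, 0], [1, 1]],  # the second label of the second molecule is missing
})
```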
Note
You should store your molecules as SMILES strings even if you use other descriptors such as 2D or 3D graphs. The graphs or RDKit features can be constructed during data pre-processing within the training process.
The training, validation, and test partitions should be stored as `train.csv`, `valid.csv`, and `test.csv` files, respectively, in the same folder.
The `.csv` files should be accompanied by a `meta.json` file within the same directory.
It stores some constant dataset properties, e.g., `task_type` (classification or regression), `n_tasks`, or `classes` (`[0, 1]` for all our classification datasets).
For a customized dataset, one required property is the `eval_metric` used for validation and testing (e.g., roc-auc, rmse, etc.).
You can check the prepared datasets included in our program for reference.
You are recommended to put the dataset files in the `./data/files/<dataset name>` directory, but you can of course choose your favorite location and specify the `--data_folder` argument accordingly.
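As a rough sketch, the snippet below writes the three partitions and a `meta.json` file into such a directory. It assumes you have already split your data into `train_df`, `valid_df`, and `test_df` DataFrames organized as described above, and it only writes the properties mentioned in this section; check the prepared datasets shipped with MUBen for the full set of keys.

```python
import json
import os

# Hypothetical dataset name; adjust the folder to match your --data_folder setting.
dataset_dir = "./data/files/my_dataset"
os.makedirs(dataset_dir, exist_ok=True)

# train_df, valid_df, and test_df are DataFrames with the smiles/labels/masks columns.
train_df.to_csv(os.path.join(dataset_dir, "train.csv"), index=False)
valid_df.to_csv(os.path.join(dataset_dir, "valid.csv"), index=False)
test_df.to_csv(os.path.join(dataset_dir, "test.csv"), index=False)

# Only the properties mentioned above are shown here.
meta = {
    "task_type": "classification",
    "n_tasks": 2,
    "classes": [0, 1],
    "eval_metric": "roc-auc",
}
with open(os.path.join(dataset_dir, "meta.json"), "w") as f:
    json.dump(meta, f, indent=2)
```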
Train the model
To conduct training and evaluation on the customized dataset, we only need to modify the `dataset_name` argument (defined in `muben.args`) to the name of the customized dataset.
This can be done either through the CLI (`--dataset_name <the name of your dataset>`) or within the Python script (`config.dataset_name = "<the name of your dataset>"`).
Note
Notice that `dataset_name` contains only the name of the specific dataset folder, not the entire path to a specific file.
The full path to the training partition, for example, is constructed as `<data_folder>/<dataset_name>/train.csv`.
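For instance, with a hypothetical dataset stored under `./data/files/my_dataset/`, the Python-side configuration could look like the following; the CLI equivalent is given in the comment.

```python
from muben.args import Config

config = Config()
config.data_folder = "./data/files/"  # the folder that contains all dataset directories
config.dataset_name = "my_dataset"    # hypothetical dataset name, not a full path
# CLI equivalent: --data_folder ./data/files/ --dataset_name my_dataset
# The training partition is then read from ./data/files/my_dataset/train.csv
```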
Customize backbone model
It is also easy to define a customized backbone model and integrate it into the training & evaluation pipeline, as long as it follows the standard input & output format. In the following example, we manually construct a DNN model that uses RDKit features as input.
Define the model
The following code defines a conventional DNN model with customizable input/output/hidden dimensionalities and dropout probabilities.
For the output layer, we use the `OutputLayer` class defined in the `muben.layers` module to enable easy initialization and integration of Bayes-by-Backprop (BBP).
```python
import torch.nn as nn

from muben.layers import OutputLayer


class DNN(nn.Module):
    def __init__(
        self,
        d_feature: int,
        n_lbs: int,
        n_tasks: int,
        hidden_dims: list[int] = None,
        n_hidden_layers: int = 4,  # fallback values, used only when hidden_dims is not given (illustrative defaults)
        d_hidden: int = 128,
        p_dropout: float = 0.1,
        uncertainty_method: str = "none",
        **kwargs,
    ):
        super().__init__()
        # Alternatively, these values could be read from the configuration, e.g.:
        # d_feature = config.d_feature
        # n_lbs = config.n_lbs
        # n_tasks = config.n_tasks
        # n_hidden_layers = config.n_dnn_hidden_layers
        # d_hidden = config.d_dnn_hidden
        # p_dropout = config.dropout
        # uncertainty_method = config.uncertainty_method

        if hidden_dims is None:
            hidden_dims = [d_hidden] * (n_hidden_layers + 1)
        else:
            # hidden_dims lists the output dimensionality of each hidden representation,
            # so the number of hidden blocks is one fewer than its length
            n_hidden_layers = len(hidden_dims) - 1

        self.input_layer = nn.Sequential(
            nn.Linear(d_feature, hidden_dims[0]),
            nn.ReLU(),
            nn.Dropout(p_dropout),
        )

        hidden_layers = [
            nn.Sequential(
                nn.Linear(hidden_dims[i], hidden_dims[i + 1]),
                nn.ReLU(),
                nn.Dropout(p_dropout),
            )
            for i in range(n_hidden_layers)
        ]
        self.hidden_layers = nn.Sequential(*hidden_layers)

        # OutputLayer builds the uncertainty-specific output head (e.g., for BBP)
        self.output_layer = OutputLayer(
            hidden_dims[-1], n_lbs * n_tasks, uncertainty_method, **kwargs
        )

        self.initialize()

    def initialize(self):
        def init_weights(m):
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                m.bias.data.fill_(0.01)

        self.apply(init_weights)
        self.output_layer.initialize()
        return self

    def forward(self, batch):
        features = batch.features

        x = self.input_layer(features)
        x = self.hidden_layers(x)

        logits = self.output_layer(x)
        return logits
```
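As a quick sanity check outside the MUBen pipeline, the backbone can be exercised with a minimal stand-in batch object that only carries a `features` tensor; MUBen's actual batch class is richer, but this model only reads `batch.features`.

```python
import torch
from types import SimpleNamespace

# Minimal sketch: a fake batch of 8 molecules with 200-dimensional RDKit-style features.
model = DNN(d_feature=200, n_lbs=1, n_tasks=1, hidden_dims=[512, 512, 512])
dummy_batch = SimpleNamespace(features=torch.randn(8, 200))

logits = model(dummy_batch)
print(logits.shape)  # expected: torch.Size([8, 1]), i.e., n_lbs * n_tasks outputs per molecule
```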
Initialize trainer with customized model
Once the model is defined, we can pass it as an argument to the `Trainer` class to set it as the backbone model.
Notice that the model class is only referenced here; it is not instantiated together with the `Trainer`.
For example, we can use the same code as in the simple example for `Trainer` initialization.
```python
from muben.args import Config
from muben.utils.selectors import dataset_selector, model_selector
from muben.train import Trainer

descriptor_type = "RDKit"  # as mentioned above, we use RDKit features here

config = Config()  # we'll use the default configuration for customized backbone models

# io configurations
config.feature_type = "rdkit"
config.data_folder = "./data/files/"
config.dataset_name = "bbbp"
config.result_folder = "./output-demo/"
config.uncertainty_method = "none"  # here "none" refers to "Deterministic"

# training configurations
config.retrain_model = True
config.n_epochs = 50
config.lr = 0.0001

# we'll leave the model configurations to `trainer.initialize` (shown below)
config.__post_init__()
config.get_meta().validate()

# Load and process the training, validation, and test datasets
dataset_class, collator_class = dataset_selector(descriptor_type)

training_dataset = dataset_class().prepare(config=config, partition="train")
valid_dataset = dataset_class().prepare(config=config, partition="valid")
test_dataset = dataset_class().prepare(config=config, partition="test")

# Initialize the trainer with the configuration and datasets
trainer = Trainer(
    config=config,
    model_class=DNN,  # pass the customized model class to the Trainer
    training_dataset=training_dataset,
    valid_dataset=valid_dataset,
    test_dataset=test_dataset,
    collate_fn=collator_class(config),
)
```
With the above code, we have initialized the trainer with the `DNN` backbone class and the RDKit datasets.
However, the model itself is not yet initialized.
To initialize the model, we can use `trainer.initialize`.
```python
trainer.initialize(
    d_feature=200,  # the feature dimensionality must match the output of your feature generator
    n_lbs=1,  # bbbp has 2 label classes (0, 1), but binary classification needs only 1 output head
    n_tasks=1,  # the bbbp dataset has 1 task
    hidden_dims=[512, 512, 512],  # three 512-dimensional hidden layers
    uncertainty_method="MCDropout",  # set the uncertainty method to MC Dropout
)
```
The keyword arguments passed to `trainer.initialize` should match the parameters defined in `DNN.__init__`, since they are forwarded to the model through `**kwargs`.
If the keywords differ, the model may not be initialized properly.
Once the trainer has been initialized, we can start the training and evaluation process as we demonstrated before.
```python
trainer.run()
```