Frequently Asked Questions¶
This document comprises a few tips and tricks that illustrate common concepts in trident. Be sure to read the walkthrough first.
Experiments¶
How do I add my own variables?¶
Experiment configurations (e.g., ./configs/nli.yaml) have a dedicated run key that is best suited to store your variables, as all variables in run are automatically logged to, for instance, wandb.
Note
You must then link your user variables to the corresponding configuration (like batch sizes in the dataloader).
# @package _global_
defaults:
  - default
  - /dataspecs@datamodule.train:
      - mnli_train
  - /dataspecs@datamodule.val:
      - xnli_val_test
  - override /module: text_classification

run:
  task: nli
  # must be linked to your training dataloader!
  my_train_batch_size: 32
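For instance, the linked dataloader configuration can reference the variable via interpolation. The nesting below is a sketch, as the exact path to your dataloader's batch_size depends on your dataspec:

datamodule:
  train:
    mnli_train:
      dataloader:
        batch_size: ${run.my_train_batch_size}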
How do I cleanly store artifacts of runs (e.g., checkpoints)?¶
The hydra runtime directory controls where the outputs of your run (like checkpoints or logs) are stored. You can modify the runtime directory of hydra from the commandline.
Note
You can only set hydra.run.dir on the commandline, so that hydra knows where to set the runtime directory before start-up!
Consider the following NLI experiment configuration.
# @package _global_
# defaults:
#   - ...

run:
  task: nli

module:
  model:
    pretrained_model_name_or_path: "xlm-roberta-base"
We can then set the runtime directory for hydra either directly on the commandline or wrapped in a bash script.
#!/bin/bash
#SBATCH --gres=gpu:1
source $HOME/.bashrc
conda activate tx
python -m trident.run \
    experiment=nli \
    'hydra.run.dir="logs/${run.task}/${module.model.pretrained_model_name_or_path}/"'
Note
hydra overrides are best enclosed in single quotation marks. The configuration then becomes accessible via resolution in strings embedded in double quotation marks.
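With the experiment above, hydra.run.dir thus resolves to logs/nli/xlm-roberta-base/ at start-up.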
In practice, keep in mind that you have to link against the runtime directory in hydra! For instance, a callback for storing checkpoints in trident may look as follows.
model_checkpoint_on_epoch:
  _target_: lightning.pytorch.callbacks.ModelCheckpoint
  monitor: null
  every_n_epochs: 1
  verbose: false
  save_top_k: -1
  filename: "epoch={epoch}"
  save_last: false
  dirpath: "${hydra:runtime.output_dir}/checkpoints/"
  save_weights_only: true
  auto_insert_metric_name: false
Here, dirpath is linked against the runtime directory of hydra.
Module and Models¶
How to use your own model?¶
Using your own model typically follows one of the following patterns.
The existing model already defines a training step.¶
HuggingFace models merge the PyTorch Lightning forward and training_step functions into a single forward function that also accepts labels as a kwarg. The trident.TridentModule seamlessly passes the batch through to self.model(**batch) in forward.
In these cases, the below pattern suffices:
module:
  model:
    _target_: transformers.AutoModelForSequenceClassification.from_pretrained
    num_labels: ???
    pretrained_model_name_or_path: ???
which is constructed hierarchically, top to bottom, by interleaving your experiment configuration
defaults:
  # ...
  - override /module: text_classification.yaml
  # ...
sourced from {text, token}_classification
defaults:
  - trident
  - /evaluation: text_classification

model:
  _target_: transformers.AutoModelForSequenceClassification.from_pretrained
  num_labels: ???
  pretrained_model_name_or_path: ???
which inherits the trident (i.e., optimizer, scheduler) defaults.
# _target_ is hydra-lingo to point to the object (class, function) to instantiate
_target_: trident.TridentModule
# _recursive_: true would mean all keyword arguments are /already/ instantiated
# when passed to `TridentDataModule`
_recursive_: false

defaults:
  # interleaved with setup so instantiated later (recursive false)
  - /optimizer: ${optimizer} # see config/optimizer/adamw.yaml for default
  - /scheduler: ${scheduler} # see config/scheduler/linear_warm_up.yaml for default

evaluation: ???
model: ???
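The ??? values are OmegaConf's markers for mandatory missing values: instantiation fails unless a downstream configuration (like text_classification above) fills them in.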
The existing model does not define a training step but a forward step.¶
In this scenario, the user implements a trident.TridentModule:
from typing import Any

import torch
from trident import TridentModule


class MyModule(TridentModule):
    def __init__(
        self,
        my_variable: Any,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.my_variable = my_variable

    # INFO: this is not strictly required and shows the default implementation
    def forward(self, batch: dict) -> dict:
        # override IF AND ONLY IF custom glue between model and module is required
        return self.model(**batch)

    # the default training_step implementation inherited in MyModule(TridentModule)
    def training_step(self, batch: dict, batch_idx: int) -> dict:
        # custom logic here -- don't forget to add logging!
        outputs = self(batch)  # calls the above forward(self, batch)
        self.log("train/loss", outputs["loss"])
        return outputs
and in turn links the module in their own module configuration:
module:
  # src.projects is an exemplary path in the trident_xtreme folder
  _target_: src.projects.my_project.my_module.MyModule
defaults:
  - trident
  - /evaluation: ??? # required, task-dependent

model:
  _target_: my_package.my_existing_model
  model_kwarg_1: ???
  model_kwarg_2: ???
The architecture does not exist yet.¶
Two variants are most common:
1. Write a lightning.pytorch.LightningModule for the barebones architecture (i.e., defining the forward pass and model setup) and a separate TridentModule embedding the former to enclose the training logic (training_step)
2. Write a stand-alone TridentModule that implements both forward and training_step
The idiomatic approach is (1) as it reflects a more common research-oriented workflow.
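A minimal sketch of pattern (1) follows; MyArchitecture, its keyword arguments, and the batch keys (features, labels) are hypothetical placeholders, and the sketch relies on the default TridentModule.forward passing the batch to self.model as shown earlier.

import torch
from lightning.pytorch import LightningModule
from trident import TridentModule


class MyArchitecture(LightningModule):
    # barebones architecture: model setup and forward pass only
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> dict:
        logits = self.classifier(features)
        loss = torch.nn.functional.cross_entropy(logits, labels)
        return {"logits": logits, "loss": loss}


class MyTridentModule(TridentModule):
    # encloses the training logic; self.model is the instantiated architecture,
    # e.g., configured via model._target_ pointing at MyArchitecture
    def training_step(self, batch: dict, batch_idx: int) -> torch.Tensor:
        outputs = self(batch)  # the default forward calls self.model(**batch)
        self.log("train/loss", outputs["loss"])
        return outputs["loss"]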
How to opt out of default model instantiation?¶
You can opt out of automatic model instantiation by passing initialize_model=False to the super().__init__() method.
Beware that you then have to instantiate self.model yourself!
Furthermore, you may need to override TridentModule.forward, for instance, if the model is no longer defined in self.model.
import hydra
from trident import TridentModule


class MyModule(TridentModule):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, initialize_model=False, **kwargs)
        self.model = hydra.utils.instantiate(self.hparams.model)
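As the snippet implies, the model configuration remains available under self.hparams.model, so you can instantiate it manually whenever appropriate.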
How to load a checkpoint for a TridentModule?¶
The run.ckpt_path in the experiment configuration can point to a LightningModule checkpoint of your TridentModule. The run.ckpt_path is then passed to trainer.fit of the Lightning Trainer.
# ...
run:
  seed: 42
  ckpt_path: $PATH_TO_YOUR_CKPT
Note
Absolute paths to checkpoints are generally recommended, though relative paths like ./logs/.../your_ckpt.pt should work.
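Like any other configuration value, run.ckpt_path can also be overridden on the commandline, where /path/to/your.ckpt below is a placeholder:

python -m trident.run experiment=nli run.ckpt_path=/path/to/your.ckpt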
Multi-GPU training¶
Multi-GPU training with trident involves some stepping stones that should be handled carefully. We first discuss validation, which is comparatively straightforward next to training.
Validation¶
Lightning recommends disabling trainer.use_distributed_sampler for research (see the note in LightningModule.validation_loop). Consequently, trident disables the flag by default.
Nevertheless, distributed sampling is typically still desired for training dataloaders. The example in the trainer.use_distributed_sampler documentation demonstrates how to set it up manually:
# in your LightningModule or LightningDataModule
def train_dataloader(self):
    dataset = ...
    # default used by the Trainer
    sampler = torch.utils.data.DistributedSampler(dataset, shuffle=True)
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
    return dataloader
Training¶
You should use trainer.strategy="ddp" or, better yet, DeepSpeed. Since we set trainer.use_distributed_sampler to False, we need to ensure that each process (one per GPU) runs on a different subset of the data.
For a conventional Dataset (i.e., not an IterableDataset), you can use the preprocessing key in TridentDataspec.preprocessing as follows:
preprocessing:
  # ... other preprocessing here
  # THIS MUST BE AT THE BOTTOM OF YOUR PREPROCESSING
  apply:
    wrap_sampler:
      _target_: torch.utils.data.DistributedSampler
      shuffle: true
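Here, shuffle: true delegates shuffling to the DistributedSampler, which partitions the indices across processes so that each GPU draws batches from a distinct subset.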
For an IterableDataset, you need to ensure that the datasets library appropriately splits the data over the processes. Typically, an IterableDataset comprises many files (i.e., shards), which can be evenly split over the GPUs as follows.
preprocessing:
  apply:
    split_dataset_by_node:
      _target_: datasets.distributed.split_dataset_by_node
      rank:
        _target_: builtins.int
        _args_:
          - _target_: os.environ.get
            key: NODE_RANK
      world_size:
        _target_: builtins.int
        _args_:
          - _target_: os.environ.get
            key: WORLD_SIZE
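NODE_RANK and WORLD_SIZE are environment variables typically provided by the distributed launcher or cluster environment; the coercion via builtins.int is needed because environment variables are strings.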
DataModule¶
How to train and evaluate on multiple datasets?¶
The below example illustrates training and evaluating NLI jointly on English and a ${lang} of AmericasNLI.
defaults:
  - /dataspec@datamodule.train:
      - mnli_train
      - amnli_train
During training, batch turns from a dict[str, torch.Tensor] with, for instance, a structure common for HuggingFace
batch = {
    "input_ids": torch.LongTensor,
    "attention_mask": torch.LongTensor,
    "labels": torch.LongTensor,
}
to a dict[str, dict[str, torch.Tensor]] that now embeds the original batches by dataset.
batch = {
    "source": {
        "input_ids": torch.LongTensor,
        "attention_mask": torch.LongTensor,
        "labels": torch.LongTensor,
    },
    "target": {
        "input_ids": torch.LongTensor,
        "attention_mask": torch.LongTensor,
        "labels": torch.LongTensor,
    },
}
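Should you need custom joint-training logic over such a batch, a training_step override could, for instance, average the per-dataset losses. A sketch, assuming the sub-batch keys follow the structure above:

# in your TridentModule subclass
def training_step(self, batch: dict, batch_idx: int) -> torch.Tensor:
    # one forward pass per dataset ("source" and "target" above)
    losses = [self(sub_batch)["loss"] for sub_batch in batch.values()]
    loss = torch.stack(losses).mean()
    self.log("train/loss", loss)
    return loss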
Important: this is not applicable to evaluation, as a dict[str, DataLoader] would up- or downsample to the largest or smallest dataset in the dictionary. During evaluation, the dataloaders for multiple validation or test datasets are consequently a list[DataLoader], in order of declaration in the yaml configuration.
How do I subsample a dataset?¶
# typically declared in your ./configs/datamodule/$YOUR_DATAMODULE.yaml
train:
  my_dataset:
    preprocessing:
      method:
        shuffle:
          seed: ${run.seed}
        select:
          indices:
            _target_: builtins.range
            _args_:
              - 0
              # must be set by user
              - ${run.num_shots}
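As ${run.num_shots} is left unset, you can provide it per invocation, for instance with run.num_shots=32 on the commandline.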
How do I only run testing?¶
Bypassing training is implemented with the corresponding Lightning Trainer flag. You can write the following in your experiment.yaml:
trainer:
  limit_train_batches: 0.0
or pass
python -m trident.run ... trainer.limit_train_batches=0.0
to the CLI.
hydra¶
How do I set a default for a variable in yaml?¶
The yaml configuration of hydra is based on OmegaConf. OmegaConf supports built-in and custom resolvers, which, among other things, let you define a default for a variable that otherwise cannot be resolved.
# absolute_path_to_node is an absolute path to another node, e.g., a key like `with_default`
with_default: "${oc.select:absolute_path_to_node,default_value}"
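If absolute_path_to_node cannot be resolved in the final configuration, with_default falls back to default_value; otherwise it takes the value of the referenced node.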