.. _walkthrough:

#######################
|project| in 20 minutes
#######################

This walkthrough first introduces the core concepts of hydra_ and then goes through an example text-classification pipeline for sequence-pair classification (NLI). The `example NLI project `_ is embedded in the repository.

hydra primer
============

.. include:: hydra.rst

Project Structure
=================

An example structure for a user project is shown below:

- `configs `_ holds the entire hydra_ yaml configuration
- `src `_ comprises the required code, typically for processing and evaluation, as referred to in the config

.. code-block:: bash

    # yaml configuration
    your-project
    ├── configs
    │   ├── config.yaml # inherits all `default.yaml`
    │   ├── experiment # typical entry point, 2nd-level `config.yaml` for your experiment
    │   │   ├── default.yaml
    │   │   └── nli.yaml
    │   ├── module
    │   │   ├── optimizer # torch.optim
    │   │   │   ├── adam.yaml
    │   │   │   └── adamw.yaml
    │   │   ├── scheduler # learning-rate scheduler
    │   │   │   └── linear_warm_up.yaml
    │   │   ├── default.yaml
    │   │   └── text_classification.yaml
    │   ├── datamodule
    │   │   ├── default.yaml
    │   │   └── mnli_train.yaml
    │   ├── dataspec # defines [dataset, preprocessing, dataloader, evaluation]
    │   │   ├── dataloader
    │   │   │   └── default.yaml
    │   │   ├── evaluation
    │   │   │   └── text_classification.yaml
    │   │   │ # inherits dataloader/default.yaml
    │   │   ├── default.yaml
    │   │   │ # task-specific dataspecs
    │   │   │ # inherits default.yaml and evaluation/text_classification.yaml
    │   │   ├── text_classification.yaml
    │   │   │ # dataset-group specific dataspecs
    │   │   │ # inherits text_classification.yaml
    │   │   ├── nli.yaml
    │   │   │ # dataset-specific dataspecs
    │   ├── dataspecs # defines groups of dataspec
    │   │   ├── mnli_train.yaml
    │   │   ├── xnli_val_test.yaml
    │   │   ├── amnli_val_test.yaml
    │   │   └── indicxnli_val_test.yaml
    │   ├── hydra
    │   │   └── default.yaml
    │   ├── logger
    │   │   ├── csv.yaml
    │   │   └── wandb.yaml
    │   ├── callbacks # defines callbacks like lightning.pytorch.ModelCheckpoint
    │   │   └── default.yaml
    │   └── trainer # defines lightning.pytorch.Trainer
    │       ├── debug.yaml
    │       └── default.yaml
    └── src # typical code folder structure
        └── tasks
            └── text_classification
                ├── evaluation.py
                └── processing.py

Components
==========

TridentModule
-------------

:class:`~trident.core.module.TridentModule` extends the LightningModule_. The configuration defines all required components for a :class:`~trident.core.module.TridentModule`:

1. ``model``: the ``_target_`` of your model constructor, which is used to initialize ``TridentModule.model``
2. ``optimizer``: the optimizer for all :class:`~trident.core.module.TridentModule` parameters
3. ``scheduler``: the learning-rate scheduler for the ``optimizer``

By default, ``default.yaml`` sets up an AdamW optimizer and a linear learning-rate scheduler.

.. code-block:: yaml

    # _target_ is hydra-lingo to point to the object (class, function) to instantiate
    _target_: trident.TridentModule
    # _recursive_: true would mean all keyword arguments are /already/ instantiated
    # when passed to `TridentModule`
    _recursive_: false

    defaults:
    # interleaved with setup so instantiated later (recursive false)
    - optimizer: adamw.yaml  # see configs/module/optimizer/adamw.yaml for default
    - scheduler: linear_warm_up  # see configs/module/scheduler/linear_warm_up.yaml for default

    # required to be defined by user
    model: ???

A common pattern is for users to create a ``configs/module/task.yaml`` that predefines the shared ``model`` and ``evaluation`` logic for a particular task.

.. code-block:: yaml

    defaults:
      - default
      - evaluation: text_classification

    model:
      _target_: transformers.AutoModelForSequenceClassification.from_pretrained
      num_labels: ???
      pretrained_model_name_or_path: ???

- The ``model`` constructor points to ``transformers.AutoModelForSequenceClassification.from_pretrained``. The actual model and the number of labels are left as mandatory values (cf. ``???``) and are defined either in the experiment configuration or on the CLI.

TridentDataspec
---------------

.. include:: dataspec_intro.rst

dataset
^^^^^^^

.. include:: dataspec_dataset.rst

preprocessing
^^^^^^^^^^^^^

.. include:: dataspec_preprocessing.rst

dataloader
^^^^^^^^^^

.. include:: dataspec_dataloader.rst

.. _evaluation:

evaluation
^^^^^^^^^^

.. include:: dataspec_evaluation.rst

TridentDataModule
-----------------

.. include:: datamodule_intro.rst

Config Composition
^^^^^^^^^^^^^^^^^^

.. note:: Hierarchical config composition relies heavily on `default lists `_.

The file tree below shows a common structure for a hierarchical :class:`~trident.core.datamodule.TridentDataModule` configuration in our NLI example. We will hierarchically:

1. Compose a general ``dataspec``
2. Compose a task-specific text classification ``dataspec``
3. Compose an NLI ``dataspec``
4. Compose a train, val, or test split via ``dataspecs``
5. Compose a datamodule

.. code-block:: bash

    configs
    ├── config.yaml
    ├── datamodule
    │   └── default.yaml
    ├── dataspec
    │   ├── dataloader
    │   │   └── default.yaml
    │   ├── evaluation
    │   │   └── text_classification.yaml
    │   ├── default.yaml
    │   ├── nli.yaml
    │   └── text_classification.yaml
    └── dataspecs
        ├── mnli_train.yaml
        ├── xnli_val_test.yaml
        └── amnli_val_test.yaml

Default
"""""""

The general ``dataspec`` simply defines the default configuration (``./configs/dataspec/default.yaml``).

.. code-block:: yaml

    defaults:
      - dataset: null
      # pull in the default dataloader
      - dataloader: default

    dataset:
      # _target_: datasets.load.load_dataset

Text Classification
"""""""""""""""""""

.. code-block:: yaml

    defaults:
      - default
      - evaluation: text_classification  # see TridentDataspec evaluation

    # task specific preprocessing
    preprocessing: ...  # see TridentDataspec preprocessing

.. seealso:: :ref:`TridentDataspec.preprocessing `

NLI
"""

The ``configs/dataspec/nli.yaml`` extends the task-specific ``text_classification.yaml`` by specifying the columns that are passed to the tokenizer during preprocessing.

.. code-block:: yaml

    defaults:
      - text_classification

    preprocessing:
      map:
        function:
          # column_names denotes input to the tokenizer during preprocessing
          column_names:
            text: premise
            text_pair: hypothesis

.. seealso:: :ref:`TridentDataspec.preprocessing `, :ref:`TridentDataspec.preprocessing `

Dataspecs
"""""""""

We can now compose ``dataspecs``, which group :class:`~trident.core.dataspec.TridentDataspec` configurations for entire datasets. The ``configs/dataspecs/xnli_val_test.yaml`` leverages ``hydra`` `package directives `_ to place the ``nli`` configuration under the corresponding dataspec keys.

.. code-block:: yaml

    defaults:
      # package `nli` of configs/dataspec into @{...}
      - /dataspec@validation_xnli_en: nli
      - /dataspec@validation_xnli_es: nli
      # ... can extend this to the entire XNLI benchmark for val and test splits

    validation_xnli_en:
      dataset:
        path: xnli
        name: en
        split: validation
    validation_xnli_es:
      dataset:
        path: xnli
        name: es
        split: validation
    # ... can extend this to the entire XNLI benchmark for val and test splits

NLI Datamodules
"""""""""""""""

Datamodule Configurations
~~~~~~~~~~~~~~~~~~~~~~~~~

We can now use `package directives `_ to include the configuration from the ``configs/dataspecs/xnli_val_test.yaml`` file under the ``val`` and ``test`` keys of the :class:`~trident.core.datamodule.TridentDataModule`.

.. warning:: When using packaging, make sure to provide a list of ``dataspecs`` configurations to allow for the merging of multiple ``datamodule`` configurations in the ``experiment`` configuration.

**Important**:

- A single :class:`~trident.core.dataspec.TridentDataspec` in ``train`` of the :class:`~trident.core.datamodule.TridentDataModule` will return a ``batch`` of ``dict[str, Any]`` at runtime
- Multiple :class:`~trident.core.dataspec.TridentDataspec` in ``train`` of the :class:`~trident.core.datamodule.TridentDataModule` will return a ``batch`` of ``dict[str, dict[str, Any]]`` for multi-dataset training at runtime

**Example Configuration**

We now `package `_ the ``configs/dataspecs/xnli_val_test.yaml`` into a list configuration in ``datamodule.val`` of our experiment. This makes it easy to include or exclude datasets for training, validation, or testing.

.. code-block:: yaml

    # variant A: training on a single dataset
    defaults:
      - /dataspecs@datamodule.train: mnli_train
      - /dataspecs@datamodule.val:
        - xnli_val_test
        - amnli_val_test
        - indicxnli_val_test
      - /dataspecs@datamodule.test:
        - xnli_val_test
        - amnli_val_test
        - indicxnli_val_test

    # variant B: training on multiple datasets
    defaults:
      - /dataspecs@datamodule.train:
        - mnli_train
        - xnli_train
      # ...

Experiment
----------

The experiment configuration is likewise split into a general ``default.yaml`` and a task-specific ``nli.yaml``.

Next to ``module``, ``datamodule``, and ``trainer``, the ``run`` key is a special key reserved for user configuration. The configuration under this key is also saved by your ``logger`` (e.g., ``wandb``).

.. code-block:: yaml

    defaults:
      - override /trainer: default
      - override /callbacks: default
      - override /logger: wandb

    # `run` namespace should hold your individual configuration
    run:
      seed: 42
      task: ???

    trainer:
      max_epochs: 10
      devices: 1
      precision: "16-mixed"
      deterministic: true
      inference_mode: false

    # log vars infers first training dataset
    # for logging batch size
    _log_vars:
      # needed because hydra cannot index list in interpolation
      train_datasets: ${oc.dict.keys:datamodule.train}
      train_dataset: ${_log_vars.train_datasets[0]}
      train_batch_size: ${datamodule.train.${_log_vars.train_dataset}.dataloader.batch_size}

    logger:
      wandb:
        name: "model=${module.model.pretrained_model_name_or_path}_epochs=${trainer.max_epochs}_bs=${_log_vars.train_batch_size}_lr=${module.optimizer.lr}_scheduler=${module.scheduler.num_warmup_steps}_seed=${run.seed}"
        tags:
          - "${module.model.pretrained_model_name_or_path}"
          - "bs=${_log_vars.train_batch_size}"
          - "lr=${module.optimizer.lr}"
          - "scheduler=${module.scheduler.num_warmup_steps}"
        project: ${run.task}

.. code-block:: yaml

    # @package _global_
    # The above line is important! It sets the namespace of the config
    defaults:
      - default
      # We can now combine `dataspecs` for training, validation, and testing
      - /dataspecs@datamodule.train:
        - mnli_train
      - /dataspecs@datamodule.val:
        - xnli_val_test
        - indicxnli_val_test
        - amnli_val_test
      - override /module: text_classification

    run:
      task: nli

    module:
      model:
        pretrained_model_name_or_path: "xlm-roberta-base"
        num_labels: 3

Commandline Interface
=====================

hydra_ lets you set configuration items directly on the command line. See the hydra_ documentation for more information.

.. code-block:: bash

    # change the learning rate
    python -m trident.run experiment=nli module.optimizer.lr=0.0001
    # set a different optimizer
    python -m trident.run experiment=nli module.optimizer=adam
    # no lr scheduler
    python -m trident.run experiment=nli module.scheduler=null

.. warning:: The command-line interface only supports absolute paths to configuration items. For instance, overriding defaults at runtime from the CLI is not possible.
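
To illustrate what addressing a configuration item by its absolute path means, the following is a minimal, standard-library-only sketch of applying ``key.path=value`` overrides to a nested configuration. It is a simplified illustration and not the override parser that hydra_ actually uses; ``parse_value``, ``apply_override``, and the sample configuration values are hypothetical and chosen for this example only.

.. code-block:: python

    from copy import deepcopy
    from typing import Any


    def parse_value(raw: str) -> Any:
        """Convert a raw override string into None, int, float, or str."""
        if raw == "null":
            return None
        for cast in (int, float):
            try:
                return cast(raw)
            except ValueError:
                continue
        return raw


    def apply_override(config: dict[str, Any], override: str) -> dict[str, Any]:
        """Return a copy of ``config`` with one ``key.path=value`` override applied.

        The key path is always resolved from the root of the configuration,
        i.e. it is an absolute path.
        """
        path, _, raw_value = override.partition("=")
        keys = path.split(".")
        updated = deepcopy(config)
        node = updated
        # Walk down to the parent of the item addressed by the path
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = parse_value(raw_value)
        return updated


    # Hypothetical composed configuration (values are placeholders)
    config = {
        "module": {
            "optimizer": {"lr": 2e-5},
            "scheduler": {"num_warmup_steps": 0.1},
        }
    }

    updated = apply_override(config, "module.optimizer.lr=0.0001")
    updated = apply_override(updated, "module.scheduler=null")
    print(updated)

Each override spells out the full key path from the configuration root (e.g., ``module.optimizer.lr``), mirroring the value overrides in the CLI examples. The sketch only covers setting values; selecting a different config group option, as in ``module.optimizer=adam``, is resolved by hydra_ during config composition and is not modeled here.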