Jobs

Primary jobs

Depending on the workflow mode selected, different primary jobs will be executed. The jobs are what Autosubmit submits to the remote platforms. They are a combination of templates (bash scripts) and the configuration selected by the user.

The templates are located in /workflow/templates.

local_setup: performs basic checks and compresses the workflow project so that it can be sent over the network.
synchronize: syncs the workflow project with the remote platform.
remote_setup:
- Loads the necessary environment and then compiles the different models/applications.
- Performs checks in the remote platform.
- Installs the running applications and the GSV interface.
ini: prepares any necessary initial data for the climate model runs. Runs in the login node of the HPC.
sim: runs one chunk of climate simulation. Runs in the HPC.
dqc: performs basic checks on the data produced by the simulation. Runs in the HPC. It has two modes: BASIC and FULL.
dn: notifies when the requested data has been produced by the model. Runs in the login node of the HPC.
opa: creates the statistics required by the data consumers (Apps). Runs in the HPC.
applications: creates usable output using the applications from the different use cases. Runs in the HPC.

Platform on which each primary job runs, and the relation of the jobs to each workflow mode.

Login nodes for LUMI, interactive partition for MareNostrum5

Additional jobs

Since v5.2.0, the applications are configured as additional jobs: they are enabled or disabled by setting them to True or False in conf/main.yml. They are mandatory in APPS and END-TO-END mode.
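As a rough illustration of these on/off switches, the following Python sketch lists which additional jobs are enabled in an already-parsed configuration. The nested layout mirrors the CONFIGURATION.ADDITIONAL_JOBS.* keys; the job keys used here (TRANSFER, BACKUP, CLEAN) and the helper enabled_additional_jobs are assumptions made for illustration, not the workflow's actual loader.

```python
def enabled_additional_jobs(config):
    """Return the sorted names of additional jobs whose flag is True."""
    # Assumed layout: config["CONFIGURATION"]["ADDITIONAL_JOBS"][<JOB>] = bool
    flags = config.get("CONFIGURATION", {}).get("ADDITIONAL_JOBS", {})
    return sorted(name for name, enabled in flags.items() if enabled is True)


# Hypothetical parsed content of conf/main.yml (job keys are illustrative).
example_config = {
    "CONFIGURATION": {
        "ADDITIONAL_JOBS": {"TRANSFER": True, "BACKUP": False, "CLEAN": True}
    }
}

print(enabled_additional_jobs(example_config))
```

A job missing from the mapping is treated as disabled in this sketch; the real defaults come from the configuration files loaded by the workflow.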

The following additional jobs are optional:

transfer: transfers the data produced in the simulation to the Data Bridge. Runs in the HPC or the client machine in MN5.
backup: copies the rundir and certain restarts to another partition. Runs in the HPC.
check_mem: monitors the memory consumption of the SIM jobs. Runs in the login node of the HPC.
wipe: contains two jobs:
- wipe-check: checks which data has already been transferred to the HPC-FDB. Runs in the HPC or the client machine in MN5.
- wipe: wipes already transferred data from the HPC-FDB. Runs in the HPC.
clean: compresses the rundir and the logs from the HPC. Purges the data of the FDB (deletes repeated entries). Runs in the HPC.
clean_restarts: deletes certain restart directories on the remote platform. Runs in the HPC.
- KEEP_EVERY: this variable determines which restart directories are kept. The first one is always kept, followed by restart directories at the frequency given by this variable.
scaling: performs a scaling test of the model. Runs in the HPC.
aqua: contains 3 jobs. Use the variable AQUA.EXPERIMENT_NAME to clearly name the AQUA plots, files, and catalog branch.
- LRA_GENERATOR: generates the LRA (Low Resolution Archive) files.
- AQUA_ANALYSIS: performs the analysis of the AQUA files.
- AQUA_PUSH: pushes the AQUA plots to LUMI-O.
sync_lra: contains 3 jobs:
- SYNC_LRA: syncs the LRA files to the common path.
- UPDATE_CATALOG: updates the catalog urlpath with the new location of the LRA.
- PUSH_UPDATED_CATALOG: pushes the updated catalog to the repository.
postprocessing for application: one job per application (POSTPROCESS_${APPNAME}) that allows postprocessing scripts to run outside the core streaming. It runs at the end of the chunk and is intended for app or end-to-end mode.
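The KEEP_EVERY rule of the clean_restarts job can be sketched in Python as follows. This is only one possible reading of the description: it assumes restart directories are numbered from 1 and that "frequency" means keeping the first directory and then one every KEEP_EVERY directories counting from it. The function name restarts_to_keep is hypothetical, and the actual template may count differently.

```python
def restarts_to_keep(n_restarts, keep_every):
    """Return the 1-based indices of the restart directories that are kept."""
    if keep_every < 1:
        raise ValueError("keep_every must be a positive integer")
    # Assumption: keep directory 1, then 1 + k, 1 + 2k, ... up to n_restarts.
    return list(range(1, n_restarts + 1, keep_every))


def restarts_to_delete(n_restarts, keep_every):
    """Return the 1-based indices that this reading of the rule would delete."""
    kept = set(restarts_to_keep(n_restarts, keep_every))
    return [i for i in range(1, n_restarts + 1) if i not in kept]


print(restarts_to_keep(10, 3))
print(restarts_to_delete(10, 3))
```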

Performance Jobs

The workflow includes dedicated performance monitoring and analysis jobs to help diagnose and quantify resource usage during simulations.

MONITOR_RESOURCES

Function: Continuously collects per-node and per-step resource metrics while the SIM job is running.
Data collected:
- Per-node pidstat samples at configurable intervals (default: 10 seconds).
- SLURM sstat per-step and aggregated snapshots at configurable intervals (default: 60 seconds).
- Threads-per-core (TPC) detection from the first allocated node, used to normalise CPU metrics to physical cores rather than logical cores (hardware threads).
Output: Compressed JSON files under $HPCROOTDIR/performance/monitor/ (e.g. metadata/start.json.gz, pidstat/nodes/{node}/{ts}.json.gz, sstat/steps/{step}_{ts}.json.gz).
Dependencies: Runs while SIM has STATUS: RUNNING.
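The compressed JSON output can be inspected with the Python standard library. The sketch below assumes only the directory layout given above (pidstat/nodes/{node}/{ts}.json.gz); the content of each file is not specified here, so the samples are returned as decoded without interpretation. The helper names are hypothetical.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Read one gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def iter_pidstat_samples(monitor_dir):
    """Yield (node, sample) pairs from pidstat/nodes/{node}/{ts}.json.gz."""
    nodes_dir = Path(monitor_dir) / "pidstat" / "nodes"
    # Sorting the paths groups the samples by node, then by file name.
    for sample_file in sorted(nodes_dir.glob("*/*.json.gz")):
        yield sample_file.parent.name, load_json_gz(sample_file)
```

For example, iter_pidstat_samples would be called with the local path corresponding to $HPCROOTDIR/performance/monitor.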

PERFORMANCE_METRICS

Function: After monitoring completes, computes end-to-end performance KPIs (CMIPS-style) from SLURM accounting and monitoring data.
Metrics computed:
- SYPD (Simulated Years Per Day) and QSYPD (including queue time).
- Core-hours: Total physical core-hours consumed (normalized by threads-per-core).
- CHSY (Core-Hours per Simulated Year): Core-hour efficiency metric.
- Energy: Energy consumption in Joules, Joules per simulated year, and carbon footprint (gCO2).
- Memory Bloat: Ratio of actual memory usage (MaxRSS) to theoretical restart size.
- Storage: Total FDB output data size (bytes) and data intensity (bytes per core-hour).
- Data Output Cost: Time and resource fractions dedicated to I/O operations (from model timing files).
- Grid Points: Total number of grid points for the simulation.
Output: A compressed JSON summary in $HPCROOTDIR/performance/performance_metrics/{jobname}/CMIPS.json.gz containing metadata, raw SLURM info and computed performance metrics.
Dependencies: Depends on MONITOR_RESOURCES with STATUS: COMPLETED.
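Several of these metrics follow from simple ratios. The Python sketch below uses the commonly cited definitions (simulated years per wall-clock day, physical cores times run hours, and so on) and is not necessarily the exact formula implemented by the job. The function name, argument names and example values are assumptions for illustration; memory bloat, data output cost and carbon footprint are left out because their inputs are not described here.

```python
SECONDS_PER_DAY = 86400.0
SECONDS_PER_HOUR = 3600.0


def performance_metrics(simulated_years, run_seconds, queue_seconds,
                        logical_cpus, threads_per_core,
                        energy_joules, output_bytes):
    """Compute a subset of the KPIs from already-extracted raw values."""
    run_days = run_seconds / SECONDS_PER_DAY
    total_days = (run_seconds + queue_seconds) / SECONDS_PER_DAY
    # Normalise logical CPUs (hardware threads) to physical cores.
    physical_cores = logical_cpus / threads_per_core
    core_hours = physical_cores * run_seconds / SECONDS_PER_HOUR
    return {
        "SYPD": simulated_years / run_days,
        "QSYPD": simulated_years / total_days,  # queue time included
        "core_hours": core_hours,
        "CHSY": core_hours / simulated_years,
        "joules_per_sy": energy_joules / simulated_years,
        "bytes_per_core_hour": output_bytes / core_hours,
    }


# Hypothetical chunk: 3 simulated months, 6 h run, 6 h queue, 512 threads, TPC 2.
kpis = performance_metrics(
    simulated_years=0.25, run_seconds=21600, queue_seconds=21600,
    logical_cpus=512, threads_per_core=2,
    energy_joules=3_600_000_000, output_bytes=1_536_000_000_000,
)
print(kpis)
```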

Selectable configuration

The workflow now runs with the new version of CUSTOM_CONFIG and the minimal configuration features of Autosubmit. This configuration scheme allows for a distributed, hierarchical parametrization of the workflow, making it more customizable, modular, and user-friendly. The structure, domain and use of this scheme will likely evolve as it adapts to the needs of other work packages.

In main.yml, the user sets the parameters of the simulation. Depending on the values selected, a different set of configuration files is loaded.

The following parameters will be used to load the configuration files:

RUN.WORKFLOW
MODEL.NAME
MODEL.SIMULATION
MODEL.GRID_ATM
CONFIGURATION.ADDITIONAL_JOBS.*
APP

The user can override any parameter by defining it in the main.yml file. Values defined there take priority over the default configuration files loaded previously.
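The priority rule can be pictured as a recursive merge in which user values win over defaults. The Python sketch below only illustrates that idea and is not Autosubmit's actual loading code; the function deep_merge and all values (model-a, grid-x, grid-y, mode-a) are hypothetical placeholders.

```python
def deep_merge(defaults, overrides):
    """Return a dict in which values from `overrides` win over `defaults`."""
    merged = dict(defaults)  # top-level copy; the inputs are not modified
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or new keys: the override replaces the default.
            merged[key] = value
    return merged


# Hypothetical defaults loaded from the selected configuration files.
defaults = {
    "RUN": {"WORKFLOW": "mode-a"},
    "MODEL": {"NAME": "model-a", "GRID_ATM": "grid-x"},
}
# Hypothetical user override taken from main.yml.
user_main = {"MODEL": {"GRID_ATM": "grid-y"}}

effective = deep_merge(defaults, user_main)
print(effective)
```

Only MODEL.GRID_ATM changes in this example; every key not mentioned in the override keeps its default value.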

Note

For a comprehensive list of the allowed values, see Configuration keys.

The minimal.yml file defines the basic information of the experiment. It is the last file loaded in the configuration process. For more information, see the Autosubmit documentation on minimal experiments.