File and Directory Structure

This section provides an overview of the file and directory structure of the Climate DT Workflow project. Below is a tree view of the main files and directories, followed by a detailed explanation of their purpose.

Project Structure

.
├── catalog/                 # Catalog of datasets or configurations
├── CHANGELOG.md             # Log of changes made to the project
├── conf/                    # Configuration files for the workflow
│   ├── additional_jobs/     # Definitions for additional jobs
│   │   ├── aqua-True.yml    # AQUA job configuration
│   │   ├── backup-True.yml  # Backup job configuration
│   │   └── ...              # Other additional job configurations
│   ├── applications/        # Application-specific configurations
│   │   ├── container_versions.yml  # Container version mappings
│   │   ├── energy_indicators/  # Directory for energy indicators application configuration
│   │   ├── request/         # Directory to set data request details for all apps
│   │   └── ...              # Other application configurations
│   ├── bootstrap/           # Internal bootstrap configurations
│   │   └── include.yml      # Entry point for loading workflow configuration
│   ├── data_gov/            # Data governance-related configurations
│   │   ├── production.yml   # Production data governance rules
│   │   └── ...              # Other data governance configurations
│   ├── defaults/            # Default configuration files
│   │   ├── defaults_model.yml  # Default model configuration
│   │   └── ...              # Other default configurations
│   ├── model/               # Model-specific configurations
│   │   ├── icon/            # ICON model configurations
│   │   ├── ifs-fesom/       # IFS-FESOM model configurations
│   │   └── ifs-nemo/        # IFS-NEMO model configurations
│   ├── simulation/          # Simulation-specific configurations
│   │   ├── control-ifs-nemo.yml  # Control simulation for IFS-NEMO
│   │   └── ...              # Other simulation configurations
│   └── platforms.yml        # Platform-specific configurations
├── data-portfolio/          # Submodule: Data portfolio for model output
├── docs/                    # Documentation source files
│   ├── source/              # Sphinx documentation source
│   │   ├── developers_guide/  # Developer guide documentation
│   │   ├── schemas/         # Schema documentation
│   │   └── users_guide/     # User guide documentation
├── dvc-cache-de340/         # Submodule: DVC cache for managing IFS inputdata
├── ifs-nemo/                # Submodule: IFS-NEMO model sources
├── lib/                     # Shared libraries and utilities
│   ├── common/              # Common utility scripts
│   │   ├── checkers.sh      # Script for validation checks
│   │   └── util.sh          # General utility functions
│   ├── LUMI/                # LUMI-specific configurations
│   │   └── config.sh        # Configuration script for LUMI
│   └── ...                  # Other platform-specific configurations
├── mains/                   # Main configuration examples
│   ├── main_example_app.yml  # Example configuration for applications workflow
│   └── ...                  # Other main configuration examples
├── Makefile                 # Build automation file
├── nemo/                    # Submodule: NEMO model sources
├── pyproject.toml           # Python project configuration
├── pytest.ini               # Pytest configuration
├── README.md                # Project overview and getting started guide
├── runscripts/              # Additional scripts for APPS, ICON, FDB
│   ├── dn/                  # Data Notifier scripts
│   │   └── run_dn.py        # Script to run the Data Notifier
│   ├── icon/                # ICON model runscripts
│   │   ├── control/         # Control runscripts for ICON
│   │   ├── historical/      # Historical runscripts for ICON
│   │   └── ...              # Other ICON runscripts
│   ├── hydroland/           # Hydroland application scripts
│   │   ├── run_hydroland.sh  # Script to run Hydroland
│   │   └── ...              # Other Hydroland scripts
│   └── ...                  # Other application-specific scripts
├── setup.py                 # Python package setup script
├── templates/               # Main workflow jobs bash templates
│   ├── aqua/                # AQUA-specific templates
│   │   ├── aqua_analysis.sh  # AQUA analysis script template
│   │   └── ...              # Other AQUA templates
│   └── ...                  # Other templates
├── tests/                   # Unit and integration tests
│   ├── bats_tests/          # BATS tests for job shell scripts
│   ├── schemas/             # JSON schema validation tests
│   │   ├── run.schema.json  # Schema for the RUN section
│   │   └── ...              # Other schema files
│   └── workflow_mock/       # Mock workflow tests
└── utils/                   # Utility scripts and helper functions
    ├── logger.py            # Logging utilities
    ├── update_changelog.sh  # Script to update the changelog
    └── ...                  # Other utility scripts

Detailed Descriptions

Configuration System (`conf/`)

The conf/ directory contains all configuration files used by the workflow system to define behavior across different environments, models, and applications.

Model Configurations (`conf/model/`)

This directory contains settings specific to each climate model:

icon/: Configuration files for the ICON (ICOsahedral Nonhydrostatic) atmospheric model, including:
- Model resolution settings depending on the processing unit
- Wallclock and timestepping configurations
- Physical parameterization options
- Default request resolutions
ifs-fesom/: Configuration for the IFS-FESOM coupled model:
- IFS model parameters depending on the processing unit
- Wallclock, IO tasks, and additional setup configuration
- ICMCL configurations
- Default request resolutions
ifs-nemo/: Configuration for the IFS-NEMO coupled model:
- IFS model parameters depending on the processing unit
- Wallclock, IO tasks, and additional setup configuration
- Default request resolutions

Application Configurations (`conf/applications/`)

Contains settings specific to downstream applications:

container_versions.yml: Maps application names to container versions.
default_gsv_request.yml: Configuration for the GSV, including:
- Grid resolutions
- Area to process
- Method (e.g., nn)
opa/opa.yml: Configuration for the OPA, including:
- Apps output directory
- Retries
- Platform specific settings
energy_indicators/: Directory for energy indicators application configurable physical parameters.
energy_offshore/: Directory for energy offshore application configurable physical parameters.
hydroland/: Directory for hydroland application configurable physical parameters.
hydromet/: Directory for hydromet application configurable physical parameters.
wildfires_fwi/: Directory for wildfires FWI application configurable physical parameters.
wildfires_wise/: Directory for wildfires WISE application configurable physical parameters.
obsall/: Directory for OBSALL application configurable physical parameters.
data/: Directory to set details of the data workflow.
request/: Directory to set data request details for all apps. A file per app.
- Contains detailed specifications for GSV and OPA requests
- Defines parameters like activity, resolution, generation, and realization
- Includes hardcoded settings for datasets, grids, and methods

Workflow Job Management (`conf/additional_jobs/`)

Defines auxiliary workflows that can be attached to the main simulation workflow:

aqua-True.yml: Configuration for the AQUA jobs, which:
- Adds AQUA to the job definitions
- Sets additional parameters for the analysis
dqc-True.yml: Configuration for the DQC jobs, which:
- Adds the DQC to the job definitions
- Sets additional DQC parameters (e.g., profiles)
backup-True.yml: Configuration for backup jobs.
cleanup-True.yml: Configuration for cleanup jobs.
memory_checker-True.yml: Configuration for the memory checking job.
transfer-True.yml: Configuration for transfer jobs that move FDB data between systems.
wipe-True.yml: Configuration for cleaning up transferred data.
size_checker-True.yml: Configuration for checking the size of generated data.
scaling-True.yml: Configuration for performance scaling job.
energy_indicators-True.yml: Configuration for energy indicators jobs.
energy_offshore-True.yml: Configuration for jobs of Energy Offshore.
hydroland-True.yml: Configuration for jobs of Hydroland.
hydromet-True.yml: Configuration for jobs of Hydromet.
wildfires_fwi-True.yml: Configuration for jobs of Wildfires FWI.
wildfires_wise-True.yml: Configuration for jobs of Wildfires WISE.
obsall-True.yml: Configuration for jobs of OBSALL.
data-True.yml: Configuration for data retrieval workflow.
postproc_hydroland-True.yml: Configuration for post-processing jobs of Hydroland.
postproc_energy_indicators-True.yml: Configuration for post-processing jobs of energy_indicators.
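
The `-True` suffix shared by these file names suggests that each file is picked up when the corresponding option is switched on. The sketch below illustrates that reading only. The function name `additional_job_file` and the assumption that the suffix mirrors a boolean switch are hypothetical and are not confirmed by the workflow itself.

```python
from pathlib import PurePosixPath


def additional_job_file(job: str, enabled: bool) -> PurePosixPath:
    """Build the path of the configuration file for an additional job.

    Assumption: the text after the dash mirrors the value of a boolean
    switch, so 'aqua' with enabled=True maps to 'aqua-True.yml'.
    """
    return PurePosixPath("conf/additional_jobs") / f"{job}-{enabled}.yml"


# Example: the AQUA job switched on.
print(additional_job_file("aqua", True))
```

Under this reading, switching a job off would simply point at a file name that does not exist, so nothing extra is loaded.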

Bootstrap System (`conf/bootstrap/`)

include.yml: Bootstrap entry point that:
- Establishes configuration loading order
- Sets up environment-specific overrides
- Initializes the workflow context

Data Governance (`conf/run_types/`)

Defines data management policies and rules for each workflow mode:

production.yml: Production environment data governance rules for:
- FDB keys (EXPVER: 0001, CLASS: d1, FDB Path, etc.)
- Application specifics
research.yml: Research environment data governance rules for:
- GSV Requests (expid, d1, FDB Path, etc.)
- Application specifics

Additional files for other types of runs:

pre-production.yml
test.yml
operational-read

Default Settings (`conf/defaults/`)

Base configurations that can be overridden by more specific files:

defaults_model.yml: Default parameters for all models, including:
- Model paths
- Request and AQUA configurations
- I/O settings

Additional files for other modes of the workflow:

defaults_end-to-end.yml
defaults_simless.yml
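
The idea of base settings being overridden by more specific files can be pictured as a recursive dictionary merge. The following is a minimal sketch of that idea, not the workflow's actual loader. The function `deep_merge` and the keys `MODEL`, `NAME`, `IO`, and `TASKS` are hypothetical and stand in for whatever the real files contain.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values from 'override' win over 'base'.

    Nested dictionaries are merged key by key; any other value in
    'override' replaces the one in 'base'. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical content of a defaults file and of a more specific file.
defaults = {"MODEL": {"NAME": "ifs-nemo", "IO": {"TASKS": 4}}}
simulation = {"MODEL": {"IO": {"TASKS": 8}}}

config = deep_merge(defaults, simulation)
print(config)
```

Only the overridden leaf changes; sibling keys from the defaults are kept.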

Simulation Configurations (`conf/simulation/`)

Defines standard simulation types and scenarios:

control-ifs-nemo.yml: Configuration for IFS-NEMO control simulations, including:
- Chunking information, calendar, etc.
- IFS-NEMO Multi-IO Plans, RAPS conf, etc.
- GSV Request definitions
control-r2b9-icon-1990.yml: Configuration for ICON 5km control simulations, including:
- Chunking information, calendar, etc.
- Simulation timestepping definitions
- GSV Request definitions
ifs-fesom-control-tco79.yml: Configuration for IFS-FESOM tco79 control simulations, including:
- Chunking information, calendar, etc.
- RAPS, Portfolio, DQC settings
- GSV Request definitions
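
Chunking information splits a simulation period into consecutive pieces that are run one after another. The sketch below shows one way such boundaries could be computed for month-based chunks. The helper names `add_months` and `chunk_bounds`, and the parameters `chunk_months` and `num_chunks`, are assumptions for illustration and do not correspond to keys in the simulation files.

```python
from datetime import date


def add_months(day: date, months: int) -> date:
    """Shift a date forward by whole months, returning the first of that month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def chunk_bounds(start: date, chunk_months: int, num_chunks: int) -> list:
    """Return (chunk_start, chunk_end) pairs; chunk_end is exclusive."""
    return [
        (add_months(start, i * chunk_months), add_months(start, (i + 1) * chunk_months))
        for i in range(num_chunks)
    ]


# Two three-month chunks starting in January 1990.
for begin, end in chunk_bounds(date(1990, 1, 1), 3, 2):
    print(begin.isoformat(), "to", end.isoformat())
```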

Platform Configurations (`conf/platforms.yml`)

platforms.yml: Settings for different computing platforms:
- Defines platform-specific parameters (e.g., LUMI, MN5)
- Includes queue names
- Wallclock limits
- SBATCH options
- Component paths and versions

General Configuration (`conf/general.yml`)

general.yml: Contains global configuration settings for the workflow:
- Defines paths for containers, scratch directories, and libraries
- Sets up default directories for HPC projects and FDB (Field Database)
- Includes general tools and ensemble versioning information

Job Templates (`conf/jobs_<mode>.yml`)

jobs_model.yml: Defines job configurations for model workflows:
- Specifies dependencies, wallclock limits, and platform settings
jobs_simless.yml: Defines job configurations for simulation-less workflows:
- Specifies dependencies, wallclock limits, and platform settings; does not include the SIM job
jobs_end-to-end.yml: Defines basic job configurations for end-to-end workflows:
- Specifies dependencies, wallclock limits, and platform settings; includes jobs up to DN
jobs_apps.yml: Defines job configurations for application workflows:
- Specifies dependencies, wallclock limits, and platform settings; includes jobs up to DN
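
Job dependencies determine the order in which jobs may run. As a rough illustration, the sketch below derives a run order from a dependency map using the standard-library `graphlib` module (Python 3.9+). The job names and the dependencies between them are illustrative assumptions, loosely based on the template names in this document, and are not taken from the actual job files.

```python
from graphlib import TopologicalSorter


def run_order(dependencies: dict) -> list:
    """Return job names ordered so that every job follows its dependencies.

    'dependencies' maps each job to the set of jobs it has to wait for.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency chain: each job waits for the one listed in its set.
jobs = {
    "REMOTE_SETUP": {"LOCAL_SETUP"},
    "INI": {"REMOTE_SETUP"},
    "SIM": {"INI"},
    "DN": {"SIM"},
}

print(run_order(jobs))
```

A cyclic dependency map makes `static_order()` raise `graphlib.CycleError`, which is a convenient early check for misconfigured jobs.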

GSV Configuration (`conf/gsv.yml`)

gsv.yml: Configuration file for the GSV:
- Maps paths for grid definitions, weights, and test files
- Defines GSV version
- Defines model grid definitions paths

Libraries and Utilities (`lib/`)

Collection of shared scripts and libraries used across the workflow:

Common Utilities (`lib/common/`)

checkers.sh: Contains validation functions for:
- Verifying configuration file integrity
- Checking environment setup for required dependencies
- Validating input data for workflows
util.sh: General utility functions for:
- Logging and error handling
- File system operations, such as creating directories or managing files
- String manipulation and other helper functions

Platform-Specific Configurations (`lib/LUMI/` and others)

LUMI/config.sh: Configuration settings specific to the LUMI supercomputing platform:
- Defines queue configurations and module loading for LUMI
- Sets up environment variables and I/O path mappings for workflows
- Includes functions for loading Singularity containers and managing dependencies
MARENOSTRUM5/config.sh: Configuration for MareNostrum5:
- Defines HPC-specific settings, such as module loading and queue names
- Manages paths for input/output data and temporary directories

Runtime Components

Runscripts (`runscripts/`)

Contains scripts for ICON, applications, and other components:

icon/
- control/: ICON model execution scripts for control simulations:
  - Manages namelists, input data, and output paths
  - Handles processing of restart files and log files
  - Configures process mappings
  - Handles job submission for ICON control runs
- historical/: ICON model execution scripts for historical simulations:
  - Similar to control scripts but tailored for historical data processing
  - Manages specific configurations for historical runs
  - Handles job submission for ICON historical runs
- test/: ICON scripts for testing
dn/run_dn.py: Script for the Data Notifier, which:
- Monitors for new data availability in the system
- Triggers downstream workflows based on data readiness
- Sends notifications to ensure workflow synchronization
hydroland/run_hydroland.sh: Script for the Hydroland application, which:
- Processes meteorological inputs and runs hydrological models
- Generates outputs such as river discharge and soil moisture
- Manages restart files and log files to optimize memory usage
opa/run_opa.py: Scripts for the OPA (One Pass algorithm)
wildfires_fwi/: Scripts for running the Wildfires FWI application
wildfires_wise/: Scripts for running the Wildfires WISE application
ensembles/: Scripts for ensemble simulations:
- perturb_nemo_restart.py: Perturbs NEMO restart files for ensemble simulations
- perturb_var.py: Perturbs specific variables for ensemble runs
energy_onshore/run_energy_onshore.py: Scripts for the Energy Onshore application:
- Processes data and runs simulations for onshore energy applications
energy_offshore/run_energy_offshore.py: Scripts for the Energy Offshore application:
- Processes data and runs simulations for offshore energy applications
FDB/: Scripts for managing the Field Database (FDB):
- count_expected_messages.py: Counts expected messages in the FDB
- yaml_to_mars.py: Converts YAML configurations to MARS requests
- update_fdb_info.py: Updates FDB metadata and configurations
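
To give a feel for what converting a YAML-style mapping into a MARS request involves, here is a minimal sketch. It is not the code of `yaml_to_mars.py`. The function name `to_mars_request` and the `param` values are hypothetical, and the input is a plain dictionary so that no YAML parser is needed. A MARS request is a verb followed by comma-separated `key=value` pairs, with list values joined by `/`.

```python
def to_mars_request(verb: str, keys: dict) -> str:
    """Render a verb and a mapping of keys as MARS request text."""
    lines = []
    for name, value in keys.items():
        if isinstance(value, (list, tuple)):
            # MARS separates the items of a list with slashes.
            value = "/".join(str(item) for item in value)
        lines.append(f"    {name}={value}")
    return verb + ",\n" + ",\n".join(lines)


# Hypothetical request; 'param' values are placeholders.
request = {"class": "d1", "expver": "0001", "param": [165, 166]}
print(to_mars_request("retrieve", request))
```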

Templates (`templates/`)

The templates/ directory contains the bash scripts that define the behavior of the workflow jobs and components. It also includes configuration files for the FDB and AQUA.

aqua/: Templates for AQUA-related workflows:
- aqua_analysis.sh: Template for AQUA analysis jobs, which:
  - Performs quality analysis on model outputs, such as consistency checks
  - Generates diagnostic metrics and visualizations
  - Supports containerized execution for portability
- aqua_push.sh: Template for pushing AQUA outputs to external storage or databases
- lra_generator.sh: Template for generating LRA (Low Resolution Archive) configurations for AQUA
sim_ifs-nemo.sh: Template for IFS-NEMO simulation jobs, which:
- Configures model-specific parameters, such as grid resolution and timestepping
- Manages input/output paths and restart files
- Executes the simulation in chunks to optimize resource usage
sim_icon.sh: Template for ICON simulation jobs, which:
- Configures ICON-specific parameters, such as grid identifiers and refinement levels
- Manages timestepping and coupling configurations
- Handles restart files and output paths
sim_nemo.sh: Template for standalone NEMO simulation jobs, which:
- Configures ocean model parameters, such as grid resolution and timestepping
- Manages input data and restart files
- Executes the simulation and handles output generation
application.sh: General template for running application-specific workflows, including:
- Energy Onshore: Processes data and runs simulations for onshore energy applications
- Energy Offshore: Processes data and runs simulations for offshore energy applications
- Hydroland: Runs hydrological models and processes meteorological inputs
- Wildfires FWI: Calculates fire weather indices for wildfire risk assessment
- Wildfires WISE: Simulates wildfire spread and behavior
dqc.sh: Template for running Data Quality Checker (DQC) jobs, which:
- Validates data compliance with predefined standards
- Checks spatial completeness, consistency, and physical plausibility
dn.sh: Template for the Data Notifier (DN) service, which:
- Monitors for new data availability
- Triggers downstream workflows based on data readiness
- Sends notifications to ensure workflow synchronization
transfer.sh: Template for transferring data between systems, which:
- Manages data movement to and from HPC environments
- Ensures data integrity during transfers
- Supports integration with the Field Database (FDB)
remote_setup.sh: Template for setting up remote environments, which:
- Loads HPC-specific configurations and modules
- Prepares directories and dependencies for job execution
- Handles compilation and installation of models
local_setup.sh: Template for setting up local environments, which:
- Prepares the local directory structure for workflows
- Validates configuration files and dependencies
- Compresses and transfers project files to remote systems
ini.sh: Template for initializing simulations, which:
- Configures initial conditions for models
- Prepares namelists and other input files
- Handles dependencies for simulation startup
synchronize.sh: Template for synchronizing data across systems, which:
- Ensures consistency between local and remote environments
- Manages file transfers and updates
wipe.sh: Template for cleaning up data that has already been transferred to the bridge, which:
- Removes intermediate files generated during workflows
- Frees up storage space on HPC systems

Testing Framework (`tests/`)

Comprehensive test suite for validating workflow components:

bats_tests/: BATS (Bash Automated Testing System) tests for shell scripts that:
- Verify the functionality of job templates and utility scripts
- Test error handling and edge cases in shell scripts
- Ensure compatibility with different HPC environments
schemas/: JSON schema validation tests:
- run.schema.json: Schema definition for validating the RUN section of the workflow configuration
- Other schemas for validating model, application, and platform configurations
workflow_mock/: Mock tests for workflow execution that:
- Simulate workflow scenarios without requiring full execution
- Test job dependencies and sequencing logic
- Validate the correctness of workflow configurations and outputs
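
Schema validation of a configuration section boils down to checking that required keys are present and that values have the expected types. The sketch below is a deliberately simplified, standard-library stand-in for a real JSON Schema validator and covers only the `required` and `type` keywords. The function `check_section`, the example schema, and the `WORKFLOW` key are hypothetical and do not reflect the contents of `run.schema.json`.

```python
def check_section(section: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the section passed."""
    type_map = {"string": str, "boolean": bool, "object": dict}
    problems = []
    # Report every required key that is absent.
    for key in schema.get("required", []):
        if key not in section:
            problems.append(f"missing required key: {key}")
    # Check the type of each declared property that is present.
    for key, rule in schema.get("properties", {}).items():
        expected = type_map.get(rule.get("type"))
        if key in section and expected and not isinstance(section[key], expected):
            problems.append(f"{key}: expected {rule['type']}")
    return problems


# Hypothetical schema for a RUN-like section.
example_schema = {
    "required": ["WORKFLOW"],
    "properties": {"WORKFLOW": {"type": "string"}},
}

print(check_section({"WORKFLOW": "model"}, example_schema))
print(check_section({}, example_schema))
```

Returning a list of problems instead of raising on the first one lets a test report every issue in a configuration at once.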

Documentation (`docs/`)

Comprehensive documentation for the project:

source/developers_guide/: Documentation for developers, including:
- Details on the architecture and structure of the workflow
- Guidelines for contributing to the project
- API references for internal tools and libraries
source/users_guide/: Documentation for end users, including:
- Tutorials for setting up and running experiments
- Examples of workflow configurations for different use cases
- Troubleshooting common issues
source/schemas/: Documentation of JSON schemas, including:
- Schema definitions for workflow configuration files
- Validation rules for ensuring configuration correctness
- Example configurations for reference

Submodules

Data Management

catalog/: Catalog of available datasets and configurations for AQUA:
- Contains metadata and configuration files for datasets
- Defines paths and parameters for accessing and processing data
data-portfolio/: Submodule containing the data portfolio for model output:
- Manages metadata and configurations for model-generated data
- Ensures consistency and traceability of data across workflows
dvc-cache-de340/: DVC (Data Version Control) cache for managing IFS input data:
- Stores versioned input data for reproducibility
- Tracks changes to input datasets over time

Model Source Code

The project integrates multiple model codebases as Git submodules:

ifs-nemo/: Source code for the IFS-NEMO coupled Earth system model:
- Combines the IFS atmospheric model with the NEMO ocean model
- Includes configuration files, scripts, and source code for running coupled simulations
nemo/: Standalone NEMO ocean model source code:
- Includes configuration files, scripts, and source code for running ocean-only simulations
- Supports various resolutions and configurations for ocean modeling
