General topics
--------------

What is an *experiment*?
************************

An experiment is a directory that contains all the information about a workflow:
the configuration files, the scripts, the logs, and the results. The command
``autosubmit expid`` creates an experiment and assigns it a unique *expid*
(experiment identifier).

An experiment contains multiple jobs. Each job is a task that is executed in the
workflow. The jobs are defined in the configuration files and are executed in
the order defined by the workflow.

Experiments come in different types and shapes. They all have calendar
information (in the ``EXPERIMENT`` section). For example, an experiment can be:

- A simple "model" workflow with only ``SIM`` (simulation) jobs, or a more
  complex workflow with post-processing tasks such as ``DQC``, ``TRANSFER``,
  and ``AQUA`` jobs.
- An "apps" workflow, where the jobs are Data Notifiers, OPAs, and applications
  that consume data and produce results.
- An "end-to-end" workflow, where the jobs combine different types of tasks,
  such as ``SIM``, ``DQC``, ``TRANSFER``, ``AQUA``, ``DNs``, ``OPAs``, and
  applications, reading directly from the model that is producing the data.
- A "simless" workflow, where the jobs are only ``DQC``, ``TRANSFER``, and
  ``AQUA``, without any ``SIM`` jobs. This is used to transfer or check the
  data "offline".

Where are my logs? And my scripts?
**********************************

Your experiment in the Virtual Machine contains different directories. One of
them is ``tmp``, which in turn has two subdirectories. There, you can find three
types of files:

- ``.cmd``: scripts generated by Autosubmit by applying the configuration to
  the templates, with the placeholders substituted. These are the scripts that
  get submitted. Don't edit them, it won't work: they are regenerated every
  time Autosubmit runs.
- ``.err``: standard error logs of each job.
- ``.out``: standard output logs of each job.

In ``ASLOGS`` you can see the output and logs of each Autosubmit command that
was run on an experiment.

My workflow failed! Now what?
*****************************

A job of the workflow may fail. The first thing to check is the job's logs,
which are located in ``$expid/tmp/LOG_$expid``. Every job has two log files: a
``.err`` and a ``.out``. Check both of them. Once you have found and fixed the
error, run ``autosubmit create $expid`` and ``autosubmit run $expid`` to re-run
the **whole** workflow.

If the job failed **but you don't want to rerun the whole workflow, just
restart the last job that failed,** you can run, for example:

.. code-block:: bash

   autosubmit setstatus $expid -fs FAILED -t READY -s

.. note::

   You can also select a status in the Graph view of the Autosubmit GUI. The
   GUI provides the command above, ready to paste into the terminal.

And then:

.. code-block:: bash

   autosubmit run $expid

You **don't** have to run ``create`` again.

If you don't understand why your job failed, don't hesitate to contact us!

My workflow is slow. What can I do?
***********************************

Try the debug queue and get rid of the long LUMI queues!

1. Modify ``$expid/conf/minimal.yml`` to

   .. code-block:: yaml

      CONFIG:
        AUTOSUBMIT_VERSION: "4.0.89"
        TOTALJOBS: 2 # <----
        MAXWAITINGJOBS: 2 # <----

2. Modify ``$expid/proj/workflow/conf/platforms.yml`` to

   .. code-block:: yaml

      PARTITION: debug # <---- (line 98)

3. Run ``autosubmit create $expid``
4. Run ``autosubmit run $expid``

Which is the best wrapper configuration?
****************************************

There's no universal answer. However, unless something prevents it (for
example, the model crashing frequently and Autosubmit being unable to keep the
allocations open and apply retrials within the wrappers), we should make the
wrappers as long as possible.
For example, the queue waiting time is probably no longer for a 72-hour wrapper
than for a 24-hour one: if the machine is full, it is doubtful that the 24-hour
job would fit into the gaps through backfilling. So once you have waited in the
queue, try to run as much as possible before the next queuing time.

Can I receive email notifications when jobs fail?
*************************************************

Yes! Simply add the following block to your ``main.yml`` and replace the
addresses with your own email address. You can list multiple email addresses
(separated by a space) if you would like several people to receive the
notifications.

.. code-block:: yaml

   MAIL:
     # Enable mail notifications for remote_failures
     # Default: True
     NOTIFY_ON_REMOTE_FAIL: True
     # Enable mail notifications
     # Default: False
     NOTIFICATIONS: True
     # Mail address where notifications will be received
     TO: damien.mcclain@bsc.es aina.gayayavila@bsc.es
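
A failure notification tells you that a job failed, not why; for that you still
need the ``.err`` and ``.out`` logs described earlier. As a minimal sketch,
assuming the job logs live in ``$expid/tmp/LOG_$expid`` and that you run it
from the directory that contains the experiment, the helper below prints the
tail of the most recently modified ``.err`` files. ``show_recent_errors`` is a
hypothetical function written for this example, not an Autosubmit command.

.. code-block:: bash

   # Hypothetical helper, not part of Autosubmit.
   # $1: directory holding the job logs, e.g. "$expid/tmp/LOG_$expid"
   # $2: how many logs to show (default: 3)
   show_recent_errors() {
       log_dir="$1"
       count="${2:-3}"
       # ls -t sorts newest first; stderr is silenced if no .err file exists
       ls -t "$log_dir"/*.err 2>/dev/null | head -n "$count" |
       while IFS= read -r log_file; do
           printf '==> %s <==\n' "$log_file"
           tail -n 20 "$log_file"
       done
   }

Called as ``show_recent_errors "$expid/tmp/LOG_$expid" 2``, it shows the last
20 lines of the two newest error logs, which is usually enough to spot the
failing step before deciding whether to reset the job status or re-create the
experiment.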