First Steps
A launcher is a script that is executed on the Katea HPC and launches the model with the desired parameters. It involves several steps that must be fulfilled, such as allocating resources, loading modules, activating an environment, etc. If you are already familiar with resources, modules and environments, you can skip ahead to Create a launcher. Otherwise, here is a short explanation of these concepts.
Resources
In order to launch a job, you need to specify the resources it is going to use. Resources are requested in the launcher, and the HPC uses them to allocate what the job needs. The main options are:
- --mem: maximum memory to be used per node, in MB or GB. You can set 0GB to use the maximum amount available, although it is recommended to set a fixed value.
  - Example: --mem=16GB
- --time: maximum time the job is allowed to run.
  - Example: --time=12:00:00
- --nodes: number of nodes to be used.
  - Example: --nodes=1
- --cpus-per-task: number of CPUs to be used per task.
  - Example: --cpus-per-task=4
- --gres: number of GPUs to be used per node.
  - Example: --gres=gpu:1
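Inside the launcher these options are written as #SBATCH directives at the top of the script. A minimal sketch with illustrative values (adapt them to your job):
# illustrative resource request: 16 GB per node, 12 h walltime, 1 node, 4 CPUs, 1 GPU
#SBATCH --mem=16GB
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1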
For a more detailed explanation, you can read Requesting Resources.
Modules
Modules are used to load general software packages, such as CUDA, torch, etc. There are several commands available to work with modules, such as:
- module list: shows all the loaded modules
- module avail: shows all the modules the user is able to load
- module purge: removes all the loaded modules
- module load module_name: loads the necessary environment variables for the selected modulefile (PATH, MANPATH, LD_LIBRARY_PATH...)
- module unload module_name: removes all environment changes made by the module load command
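For example, a typical sequence before submitting a job could look like this (the torch/2.3.1 module name is taken from the launcher example below; check module avail for the versions installed on Katea):
module purge                # start from a clean set of modules
module avail                # see which modules can be loaded
module load torch/2.3.1     # load PyTorch and the environment variables it needs
module list                 # confirm what is currently loaded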
For more information, you can read Software Environment.
Environment
For dependencies that are not available as modules, you need to create and load a Python environment. For venv environments, you can create one with the following command:
python -m venv [env_path]
Once the environment is created, you can load it with the following command:
source [env_path]/bin/activate
and install the dependencies with pip:
pip install -r requirements.txt
You can find more detailed explanations online.
If the environment has already been created and the dependencies installed, you only need to activate it using
source [env_path]/bin/activate
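Putting it all together, a first-time setup could look like this (the envs/my_project subfolder is just a hypothetical location; place the environment wherever suits your project):
# create the environment once, inside the project folder
python -m venv /global/projects/[NAVISION_ID]/envs/my_project
# activate it and install the project's dependencies
source /global/projects/[NAVISION_ID]/envs/my_project/bin/activate
pip install -r requirements.txt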
Create a launcher
Once everything is set up, you can create a launcher and store it in the project's folder. A launcher is a *.sh script that is executed on the HPC and launches the model with the desired parameters. In the launcher, you specify
- the name of the job
- the modules you are going to load
- the environment you are going to load
- the resources you are going to use
- the command to launch the model
- the partition where the script is launched
Here you have an example of a launcher:
#!/bin/bash
#SBATCH --job-name=example_job
# ask for resources
#SBATCH --partition=gpu
#SBATCH --mem=1GB
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
# Load module for PyTorch
module load torch/2.3.1
# Load environment
source /global/projects/[NAVISION_ID]/[ENV_PATH]/bin/activate
# Launch the model
python main.py
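If your model does not need a GPU, the same structure works for a CPU-only job. A sketch that assumes the cluster also offers a cpu partition (check sinfo for the actual partition names):
#!/bin/bash
#SBATCH --job-name=cpu_example_job
# ask for resources on the cpu partition, without --gres
#SBATCH --partition=cpu
#SBATCH --mem=16GB
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
# Load environment
source /global/projects/[NAVISION_ID]/[ENV_PATH]/bin/activate
# Launch the model
python main.py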
For a more detailed explanation, you can read Running Jobs.
Running a Job
The HPC uses SLURM, a job scheduling and management system for Linux clusters. To run a job, you need to submit the launcher created in the previous step so that it is queued. You can do this with the following command:
sbatch [launcher.sh]
This is going to submit the job and queue it on the HPC. You can check the status of the queue using:
squeue
If you need to cancel the job, you can do it with the following command:
scancel [job_id]
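The [job_id] is the numeric identifier that sbatch prints at submission time and the one shown in the first column of squeue. A short end-to-end sketch (the job id is illustrative):
sbatch launcher.sh        # prints something like: Submitted batch job 123456
squeue                    # the job appears in the queue under that id
scancel 123456            # cancel it if needed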
There are more commands available, such as srun, sinfo ... You can find more information in Commands and SLURM.
Monitoring Jobs
Once the job is queued, you can monitor it with the following command:
squeue -u [tecnalia_username]
which is going to show the JobID, the partition used (cpu/gpu), the name of the job, the user, the status (R: running, PD: pending), the elapsed time and the nodes used. You can also just call squeue to see all the jobs in the queue.
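To keep an eye on your own jobs without retyping the command, you can refresh the listing periodically; a small convenience sketch that assumes the standard watch utility is available on the login node:
# refresh the list of your jobs every 10 seconds (Ctrl+C to stop)
watch -n 10 squeue -u [tecnalia_username]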
You can also see the status of the resources with the following command:
sinfo
This is going to show each partition, its time limit, the number of available nodes, the state of each node, and the corresponding node list. For a more detailed explanation, you can read Monitoring Jobs.