First Steps
A launcher is a script that is executed on the Katea HPC and launches the model with the desired parameters. It involves several steps that must be fulfilled, such as allocating resources, loading modules, activating an environment, etc. If you are already familiar with resources, modules and environments, you can skip ahead to Create a launcher. Otherwise, here is a short explanation of these concepts.
Resources
In order to launch a job, you need to specify the resources it is going to use. Resources are requested in the launcher, and the HPC uses them to allocate what the job needs. The main options are:
- --mem: maximum memory to be used per node, in MB or GB. You can set 0GB to use the maximum amount available, although it is recommended to set a fixed value.
  - Example: --mem=16GB
- --time: maximum time the job is allowed to run.
  - Example: --time=12:00:00
- --nodes: number of nodes to be used.
  - Example: --nodes=1
- --cpus-per-task: number of CPUs to be used per task.
  - Example: --cpus-per-task=4
- --gres: number of GPUs to be used per node.
  - Example: --gres=gpu:1
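Inside the launcher these options are written as #SBATCH directives at the top of the script. A minimal sketch with illustrative values (adapt them to your job):
# illustrative resource request: 16 GB per node, 12 h walltime, 1 node, 4 CPUs, 1 GPU
#SBATCH --mem=16GB
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1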
For a more detailed explanation, you can read Requesting Resources.
Modules
Modules are used to load general software packages, such as CUDA, torch, etc. There are several commands available to work with modules, such as:
- module list: shows all the loaded modules
- module avail: shows all the modules the user is able to load
- module purge: removes all the loaded modules
- module load module_name: loads the necessary environment variables for the selected modulefile (PATH, MANPATH, LD_LIBRARY_PATH...)
- module unload module_name: removes all environment changes made by the module load command
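For example, a typical sequence before submitting a job could look like this (the torch/2.3.1 module name is taken from the launcher example below; check module avail for the versions installed on Katea):
module purge                # start from a clean set of modules
module avail                # see which modules can be loaded
module load torch/2.3.1     # load PyTorch and the environment variables it needs
module list                 # confirm what is currently loaded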
For more information, you can read Software Environment.
Environment
For dependencies that are not available as modules, you need to create and load a Python environment. For venv environments, you can create one with the following command:
python -m venv [env_path]
Once the environment is created, you can load it with the following command:
source [env_path]/bin/activate
and install the dependencies with pip:
pip install -r requirements.txt
You can find more detailed explanations online.
If the environment has already been created and the dependencies installed, you only need to activate it using
source [env_path]/bin/activate
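Putting it all together, a first-time setup could look like this (the envs/my_project subfolder is just a hypothetical location; place the environment wherever suits your project):
# create the environment once, inside the project folder
python -m venv /global/projects/[NAVISION_ID]/envs/my_project
# activate it and install the project's dependencies
source /global/projects/[NAVISION_ID]/envs/my_project/bin/activate
pip install -r requirements.txt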
Create a launcher
Once everything is set up, you can create a launcher and store it in the project's folder. A launcher is a *.sh script that is executed on the HPC and launches the model with the desired parameters. In the launcher, you specify
- the name of the job
- the modules you are going to load
- the environment you are going to load
- the resources you are going to use
- the command to launch the model
- the partition where the script is launched
Here you have an example of a launcher:
#!/bin/bash
#SBATCH --job-name=example_job
# ask for resources
#SBATCH --partition=gpu
#SBATCH --mem=1GB
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
# Load module for PyTorch
module load torch/2.3.1
# Load environment
source /global/projects/[NAVISION_ID]/[ENV_PATH]/bin/activate
# Launch the model
python main.py
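If your model does not need a GPU, the same structure works for a CPU-only job. A sketch that assumes the cluster also offers a cpu partition (check sinfo for the actual partition names):
#!/bin/bash
#SBATCH --job-name=cpu_example_job
# ask for resources on the cpu partition, without --gres
#SBATCH --partition=cpu
#SBATCH --mem=16GB
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
# Load environment
source /global/projects/[NAVISION_ID]/[ENV_PATH]/bin/activate
# Launch the model
python main.py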
For a more detailed explanation, you can read Running Jobs.
Running a Job
The HPC uses SLURM, a job scheduling and management system for Linux clusters. To run a job, you need to submit the launcher created in the previous step so that it is queued. You can do this with the following command:
sbatch [launcher.sh]
This is going to submit the job and queue it on the HPC. You can check the status of the queue using:
squeue
If you need to cancel the job, you can do it with the following command:
scancel [job_id]
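The [job_id] is the numeric identifier that sbatch prints at submission time and the one shown in the first column of squeue. A short end-to-end sketch (the job id is illustrative):
sbatch launcher.sh        # prints something like: Submitted batch job 123456
squeue                    # the job appears in the queue under that id
scancel 123456            # cancel it if needed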
There are more commands available, such as srun, sinfo ... You can find more information in Commands and SLURM.
Monitoring Jobs
Once the job is queued, you can monitor it with the following command:
squeue -u [tecnalia_username]
which is going to show the JobID, the partition used (cpu/gpu), the name of the job, the user, the status (R: running, PD: pending), the elapsed time and the nodes used. You can also just call squeue to see all the jobs in the queue.
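To keep an eye on your own jobs without retyping the command, you can refresh the listing periodically; a small convenience sketch that assumes the standard watch utility is available on the login node:
# refresh the list of your jobs every 10 seconds (Ctrl+C to stop)
watch -n 10 squeue -u [tecnalia_username]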
You can also see the status of the resources with the following command:
sinfo
This is going to show each partition, its time limit, the number of available nodes, the state of each node, and the corresponding node list. For a more detailed explanation, you can read Monitoring Jobs.