Running Jobs
The Commands section below describes each of the available commands in a brief and schematic way. Here are some common ways of working with SLURM.
Submit to queues
A job is the execution unit for Slurm. It is defined by a text file containing a set of directives describing the job's requirements and the commands to execute.
Jobs are submitted with the Slurm sbatch command, using #SBATCH directives inside the job script.
For more information: SLURM Docs
man sbatch
man srun
man salloc
Launching jobs
To submit and manage jobs in the queues, Slurm's commands are used, for example:
Submit a job:
[ user@hpc ~]$ sbatch {job_script} 
Show all submitted jobs:
[ user@hpc ~]$ squeue
Cancel the execution of a job:
[ user@hpc ~]$ scancel {job_id}
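By default, squeue lists the jobs of all users. A common variant, shown here as a usage sketch, is to filter by your own username:
[ user@hpc ~]$ squeue -u $USER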
Load Modules
The Python (pip) packages are installed within an Anaconda environment.
To enable them, just load the corresponding module (modulename below is a placeholder for the actual module name):
[ user@hpc ~]$ module load modulename
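To see which modules are available on the cluster and which ones are currently loaded, the standard module subcommands can be used (shown here as a usage sketch; the available module names depend on your site):
[ user@hpc ~]$ module avail
[ user@hpc ~]$ module list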
Loading is integrated into the shell:
- In interactive use, the load must be carried out every time a new session is started
- In batch use, the module load command can be placed after the #SBATCH lines, so that loading is carried out as one more step when the job is launched (see the sketch below)
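A minimal sketch of that batch pattern, assuming a hypothetical script my_script.py and reusing the torch/2.3.1 module that appears later on this page:
#!/bin/bash
#SBATCH --time=00:10:00
# Load the module after the #SBATCH lines, before any step
module load torch/2.3.1
# my_script.py is a hypothetical example script
srun -n 1 python my_script.py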
SLURM
Types of jobs
Interactivity
- Interactive use
- Batch process
Parallelism
- Sequential / serial
- Parallel
It is recommended to start with the simplest case (sequential) and add complexity based on successful tests.
Commands
| command | function | 
|---|---|
| srun | run a job step (or an interactive session) |
| sbatch | submit a job |
| squeue | show the queue status |
| scancel | cancel a job |
Interactive sessions
By default, if you log out of the HPC platform or the connection drops, the interactive session is interrupted (just like a plain ssh session). It is recommended to use a terminal multiplexer such as screen or tmux (see the sketch after the examples below).
In each case, adjust the resources you reserve (CPUs/cores, GPUs, ...):
[ user@hpc ~]$ srun --cpus-per-task=2 --pty bash
[ user@hpc ~]$ srun --cpus-per-task=4 --gres=gpu:2 --pty -p gpu bash
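A quick sketch of the tmux workflow mentioned above (the session name mysession is arbitrary): start a tmux session, launch the interactive step inside it, and reattach later if the connection drops.
[ user@hpc ~]$ tmux new -s mysession
[ user@hpc ~]$ srun --cpus-per-task=2 --pty bash
# detach with Ctrl-b d, then later reattach:
[ user@hpc ~]$ tmux attach -t mysession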
Examples of jobs (BATCH)
One step after another
Each step runs a single task (-n 1).
#!/bin/bash
#SBATCH --time=01:00:00
srun -n 1 command_step_1
srun -n 1 command_step_2
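To try this example, save it under any name (for instance steps.sh, a hypothetical name) and submit it with sbatch; by default Slurm writes the job's output to a file named slurm-<jobid>.out in the submission directory.
[ user@hpc ~]$ sbatch steps.sh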
GPU
At the sbatch level, reserve resources for the whole job (e.g. 2 GPUs). Then, for each specific step, request the resources that step needs (for example, 1 GPU and 4 CPUs):
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=2
# If the application does not need multiple CPUs, set this to 1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2
#SBATCH --mem-per-cpu=8G
srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS
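Since the job reserves 2 tasks and 2 GPUs while the step above only uses 1 of each, a common pattern is to launch two steps in the background and wait for both, so that each one gets its own GPU. This is only a sketch, with PROGRAM_GPU and PARAMETERS being the same placeholders as above:
srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS &
srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS &
wait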
Another example, loading a module and launching a script:
#!/bin/bash
#SBATCH --job-name=<your_job_name>
#SBATCH --ntasks=2
# If the application does not need multiple CPUs, set this to 1
#SBATCH --cpus-per-task=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=output.log
#SBATCH --error=error.log
# Load module for PyTorch
module load torch/2.3.1
# Add your desired commands here
srun --partition=gpu --gres=gpu:1 -n 1 ./example.sh
The srun command will then execute the file example.sh (which must be executable, e.g. chmod +x example.sh) and which contains, for example:
#!/bin/bash
sleep 2
nvidia-smi
python -c "import torch; print(torch.tensor([42.0]).to('cuda'))"
To run it in the HPC Slurm queue, you only need to type in the terminal:
[ user@hpc ~]$ sbatch my-sbatch-code.sh
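After submitting, a quick way to follow the job is to check the queue and the log file defined in the script above (a usage sketch):
[ user@hpc ~]$ squeue -u $USER
[ user@hpc ~]$ tail -f output.log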