Running Jobs

The Commands section below describes each available command in a brief, schematic way. Here are some common ways of working with SLURM.

Submit to queues

A job is the execution unit for Slurm. A job is defined by a text file containing a set of directives describing the job's requirements and the commands to execute.

Jobs are submitted by using the Slurm sbatch directives directly.
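As a minimal sketch, a job script could look like this (the job name, time limit, and final command are illustrative placeholders to adapt):

job.sh
#!/bin/bash
#SBATCH --job-name=test          # job name (placeholder)
#SBATCH --time=00:10:00          # wall-clock time limit
#SBATCH --ntasks=1               # a single task
srun -n 1 hostname               # the command to execute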

INFO

For more information: SLURM Docs

man sbatch
man srun
man salloc

Basic queue commands

To submit jobs to the queues, Slurm's sbatch command must be used, for example:

Submit a job:

[ user@hpc ~]$ sbatch {job_script} 

Show all submitted jobs:

[ user@hpc ~]$ squeue

Cancel the execution of a job:

[ user@hpc ~]$ scancel {job_id}
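By default, squeue lists every job in the queue; to show only your own jobs you can use the standard -u filter (here with the $USER shell variable):

[ user@hpc ~]$ squeue -u $USER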

Load Modules

Python (pip) packages are installed within an Anaconda environment.

To enable them, just load the corresponding module stack:

[ user@hpc ~]$ module load modulename

Module loading is integrated into the shell:

  • In interactive use, the module must be loaded every time a new session is started
  • In batch use, the command can be placed after the #SBATCH lines, so that the module is loaded as one more step when the job is launched (see the sketch after this list)
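A minimal sketch of the batch case (modulename and my_program are placeholders):

#!/bin/bash
#SBATCH --time=00:30:00          # adapt the resource requests as needed
module load modulename           # loaded after the #SBATCH lines
srun -n 1 my_program             # my_program is a placeholder command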

SLURM

Types of jobs

Interactivity

  • Interactive use
  • Batch process

Parallelism

  • Sequential / serial
  • Parallel

It is recommended to start with the simplest case (sequential) and add complexity as tests succeed.

Commands

command    function
srun       execute a step (interactive)
sbatch     submit a job
squeue     query the queue
scancel    cancel a job

Interactive sessions

By default, if you exit the HPC platform or the connection drops, the interactive session is interrupted (just like a plain ssh session). It is recommended to use a terminal multiplexer such as screen or tmux, as in the example below.
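For instance, with tmux (assuming it is available on the login node):

[ user@hpc ~]$ tmux new -s mysession
[ user@hpc ~]$ srun --cpus-per-task=2 --pty bash

If the connection drops, you can reattach later with tmux attach -t mysession.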

In each case, adjust the resources you reserve (CPUs/cores, GPUs, ...)

Example 1: reserve 2 CPUs
[ user@hpc ~]$ srun --cpus-per-task=2 --pty bash
Example 2: request 2 GPUs and 4 CPUs (cores)
[ user@hpc ~]$ srun --cpus-per-task=4 --gres=gpu:2 --pty -p gpu bash

Examples of jobs (BATCH)

One step after another

Each step runs a single task (-n 1).

BASIC: SEQUENTIAL JOB (WITHOUT GPU)
#!/bin/bash
#SBATCH --time=01:00:00          # wall-clock time limit
srun -n 1 command_step_1         # steps run one after another
srun -n 1 command_step_2
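By contrast, a parallel variant (a sketch using the standard background-and-wait shell pattern; the command names are placeholders) launches both steps at once:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=2               # reserve two tasks so the steps can overlap
srun -n 1 command_step_1 &       # launch in the background
srun -n 1 command_step_2 &
wait                             # wait for both steps to finish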

GPU

At the sbatch level, reserve resources for the whole job (e.g. 2 GPUs). Then, for each specific step, request its share of those resources (for example, 1 GPU and 4 CPUs).

GPU Example
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=2               # two steps/tasks in the job
# if the app does not need multiple CPUs, leave this at 1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2             # 2 GPUs reserved for the whole job
#SBATCH --mem-per-cpu=8G
srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS
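Following the same pattern as in the parallel example above, both reserved GPUs can be used by running two steps concurrently (a sketch; PROGRAM_GPU, PARAMETERS_A, and PARAMETERS_B are placeholders):

srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS_A &
srun --gres=gpu:1 -n 1 --cpus-per-task=4 PROGRAM_GPU PARAMETERS_B &
wait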

Another example, loading a module and launching a script:

my-sbatch-code.sh
#!/bin/bash
#SBATCH --job-name=<your_job_name>
#SBATCH --ntasks=2
# If the app does not need multiple CPUs, leave this at 1
#SBATCH --cpus-per-task=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=output.log
#SBATCH --error=error.log

# Load module for PyTorch
module load torch/2.3.1

# Add your desired commands here (example.sh must be executable: chmod +x example.sh)
srun --partition=gpu --gres=gpu:1 -n 1 example.sh

The srun command will execute the file example.sh, which contains, for example:

example.sh
#!/bin/bash
sleep 2                                                             # brief pause
nvidia-smi                                                          # show the GPU visible to the step
python -c "import torch; print(torch.tensor([42.0]).to('cuda'))"   # quick check that CUDA works

To run it in the HPC Slurm queue, you only have to type in the terminal:

[ user@hpc ~]$ sbatch my-sbatch-code.sh
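Once the job has finished (its state can be checked with squeue), the output and errors appear in the files declared with --output and --error:

[ user@hpc ~]$ cat output.log
[ user@hpc ~]$ cat error.log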