Monitoring jobs
There are several tools available to help you monitor jobs on the cluster. We will discuss some of them here. You can find more information on some convenient SLURM commands here.
squeue
The most basic way to check the status of the batch system is with the programs squeue and sinfo. These are not graphical programs, but we mention them here for comparison. We can check which jobs are active with squeue:
[ user@hpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
1389 cpu fmMle_no userA PD 0:00 32 normal (Resources)
1381 cpu fmMle_no userB R 15:52 1 normal NODE01
Notice that the first job is in state PD (pending), and is waiting for 32 nodes to become available. The second job is in state R (running), and is executing on node NODE01.
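On a busy cluster the full listing can get long. To see only your own jobs, you can restrict squeue to your user name (recent SLURM versions also accept the --me shortcut):
[ user@hpc ~]$ squeue -u <username>
[ user@hpc ~]$ squeue --me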
sacct
We can retrieve statistics for a completed job (no longer in the queue) using the sacct command.
[ user@hpc ~]$ sacct -j 151111 --format=JobID,JobName,Partition,QOS,Elapsed,Start,NodeList,State,ExitCode
JobID JobName Partition QOS Elapsed Start NodeList State ExitCode
------------ ---------- ---------- ---------- ---------- ------------------- --------------- ---------- --------
151111 fq32_ch14 batch normal 00:05:56 2024-03-10T14:14:45 NODE09 COMPLETED 0:0
151111.batch batch 00:05:56 2024-03-10T14:14:45 NODE09 COMPLETED 0:0
151111.0 run_scrip+ 00:05:55 2024-03-10T14:14:46 NODE09 COMPLETED 0:0
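If job accounting is configured to record them, sacct can also report resource usage for finished jobs. As a sketch, the following requests elapsed time, peak memory, and consumed CPU time for the same job and its steps:
[ user@hpc ~]$ sacct -j 151111 --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State,ExitCode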
Suspend/Resume all jobs
There are cases in which a user may want to suspend all of their currently running jobs (including job arrays). This can be done with the command:
squeue -ho %A -t R -u <username> | xargs -n 1 scontrol suspend
To resume the suspended jobs, simply use the command:
squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume
sinfo
We can see what’s going on with the batch system from the perspective of the queues, using sinfo.
[ user@hpc ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up 21-00:00:0 1 drain NODE01
cpu* up 21-00:00:0 7 mix NODE[02-06,09-10]
cpu* up 21-00:00:0 3 idle NODE[07-08,11]
gpu up 21-00:00:0 1 comp TCPGPUSRVGAF09
gpu up 21-00:00:0 8 mix TCPGPUSRVGAF[02-08,10]
jisap up 21-00:00:0 1 idle TCPGPUSRVGAF01
We can see that in the cpu partition one node (NODE01) is drained, seven nodes (NODE[02-06,09-10]) are in mixed use, and three (NODE[07-08,11]) are idle. In the gpu partition one node is completing a job and eight are in mixed use, while the single node in the jisap partition is idle. By combining this with the Linux watch command, we can make a simple display that refreshes periodically. Try
[ user@hpc ~]$ watch sinfo
and you will get the following display
Every 2.0s: sinfo nodeges02.tri.lan: Mon Nov 11 12:23:48 2024
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up 21-00:00:0 1 drain NODE01
cpu* up 21-00:00:0 7 mix NODE[02-06,09-10]
cpu* up 21-00:00:0 3 idle NODE[07-08,11]
gpu up 21-00:00:0 1 comp TCPGPUSRVGAF09
gpu up 21-00:00:0 8 mix TCPGPUSRVGAF[02-08,10]
jisap up 21-00:00:0 1 idle TCPGPUSRVGAF01
Use ctrl-c to exit back to the prompt.
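watch also accepts a custom refresh interval and an arbitrary command, so you can combine it with a filtered squeue, for example (replace <username> with your account):
[ user@hpc ~]$ watch -n 10 'squeue -u <username>'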
You can also customize the output of the squeue and sinfo commands. Many fields are available that aren’t shown in the default output format. For example, we can add a MIN_MEMORY field, which shows how much memory the job requested, and a TIME_LEFT field, which says how much time is left before the job’s walltime limit is reached.
squeue --format '%.7i %.4P %.12j %.12u %.8T %.5C %.5m %.11l %.11M %.11L %.6D %R'
JOBID PART NAME USER STATE CPUS MIN_M TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
8817 cpu launch_exper john.doe RUNNING 32 4G 7-00:00:00 2-18:19:12 4-05:40:48 1 NODE09
8856 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:41:52 6-00:18:08 1 NODE09
8858 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:41:13 6-00:18:47 1 NODE02
8859 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:57 6-00:19:03 1 NODE03
8860 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:37 6-00:19:23 1 NODE04
8861 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:15 6-00:19:45 1 NODE05
We’ve specified some fields to obtain this output. For all available fields and other output options, see the squeue and sinfo man pages.
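If you find yourself reusing a particular layout, squeue and sinfo also read the SQUEUE_FORMAT and SINFO_FORMAT environment variables, so you can set your preferred format once in your shell startup file; for example (adjust the field list to taste):
[ user@hpc ~]$ export SQUEUE_FORMAT='%.7i %.4P %.12j %.12u %.8T %.11M %.11L %.6D %R'
[ user@hpc ~]$ squeue   # now uses the format above by default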
scontrol
SLURM maintains more information about the system than is available through squeue and sinfo. The scontrol command allows you to see this. First, let’s see how to get very detailed information about all jobs currently in the batch system (this includes running, recently completed, pending jobs, etc.).
[ user@hpc ~]$ scontrol show jobs
JobId=8817 JobName=launch_experiment.sh
UserId=john.doe(94152869) GroupId=domain users MCS_label=N/A
Priority=4294895151 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=2-18:31:37 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2024-11-08T18:05:21 EligibleTime=2024-11-08T18:05:21
AccrueTime=2024-11-08T18:05:21
StartTime=2024-11-08T18:05:21 EndTime=2024-11-15T18:05:21 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-08T18:05:21 Scheduler=Main
Partition=cpu AllocNode:Sid=nodeges01:2198320
ReqNodeList=(null) ExcNodeList=(null)
NodeList=NODE09
BatchHost=NODE09
NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=128G,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=32 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/whereever/launch_experiment.sh
WorkDir=/global/projects/whereever
StdErr=/global/projects/whereever/slurm-8817.out
StdIn=/dev/null
StdOut=/global/projects/whereever/slurm-8817.out
Power=
JobId=8870 JobName=dinov2:train
UserId=name.surname GroupId=domain users MCS_label=N/A
Priority=4294895098 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=04:05:22 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2024-11-11T08:31:35 EligibleTime=2024-11-11T08:31:35
AccrueTime=2024-11-11T08:31:35
StartTime=2024-11-11T08:31:36 EndTime=2024-11-14T08:31:36 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-11T08:31:36 Scheduler=Backfill
Partition=gpu AllocNode:Sid=nodeges01:3271764
ReqNodeList=(null) ExcNodeList=(null)
NodeList=TCPGPUSRVGAF[05-07]
BatchHost=TCPGPUSRVGAF05
NumNodes=3 NumCPUs=30 NumTasks=3 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
TRES=cpu=30,mem=1542000M,node=3,billing=30,gres/gpu=3
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/whatever/outputs/train/.submission_file_018c992215bf4340ab8f4eccca46d324.sh
WorkDir=/global/projects/whatever
StdErr=/global/projects/whatever/outputs/train/8870_0_log.err
StdIn=/dev/null
StdOut=/global/projects/whatever/outputs/train/8870_0_log.out
Power=
TresPerNode=gres:gpu:1
...
From this output, we can see for example that job 8870 was submitted at 2024-11-11T08:31:35, runs 3 tasks on 30 CPUs (NumTasks and NumCPUs) on nodes TCPGPUSRVGAF[05-07], and its working directory is /global/projects/whatever. One thing that’s missing is how the allocated resources are laid out on each node. Fortunately, we can get this by specifying the “--details” option.
[ user@hpc ~]$ scontrol show --details JobId=8870
JobId=8870 JobName=dinov2:train
UserId=name.surname GroupId=domain users MCS_label=N/A
Priority=4294895098 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=04:25:49 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2024-11-11T08:31:35 EligibleTime=2024-11-11T08:31:35
AccrueTime=2024-11-11T08:31:35
StartTime=2024-11-11T08:31:36 EndTime=2024-11-14T08:31:36 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-11T08:31:36 Scheduler=Backfill
Partition=gpu AllocNode:Sid=nodeges01:3271764
ReqNodeList=(null) ExcNodeList=(null)
NodeList=TCPGPUSRVGAF[05-07]
BatchHost=TCPGPUSRVGAF05
NumNodes=3 NumCPUs=30 NumTasks=3 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
TRES=cpu=30,mem=1542000M,node=3,billing=30,gres/gpu=3
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
JOB_GRES=gpu:hopper:3
Nodes=TCPGPUSRVGAF[05-07] CPU_IDs=0-9 Mem=514000 GRES=gpu:hopper:1(IDX:0)
MinCPUsNode=10 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/nowhere/outputs/train/.submission_file_018c992215bf4340ab8f4eccca46d324.sh
WorkDir=/global/projects/nowhere
StdErr=/global/projects/nowhere/outputs/train/8870_0_log.err
StdIn=/dev/null
StdOut=/global/projects/nowhere/outputs/train/8870_0_log.out
Power=
TresPerNode=gres:gpu:1
There is a lot of output, but notice the following line:
- Nodes=TCPGPUSRVGAF[05-07] CPU_IDs=0-9 Mem=514000 GRES=gpu:hopper:1(IDX:0)
This tells us that on each of the nodes TCPGPUSRVGAF[05-07], the job has been allocated CPU cores 0-9, 514000 MB of memory, and one hopper GPU (device index 0).
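Because scontrol show job prints so many fields, it is often convenient to filter for just the lines you care about with standard shell tools, for example:
[ user@hpc ~]$ scontrol show job 8870 | grep -E 'JobState|RunTime|NodeList|TRES'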
scontrol is a very versatile command, and we can also use it to get detailed information about the available nodes and queues (called “partitions” in SLURM).
[ user@hpc ~]$ scontrol show partitions
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=NODE[01-11]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=704 TotalNodes=11 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=704,mem=5407000M,node=11,billing=704
PartitionName=gpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=TCPGPUSRVGAF[02-10]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=352 TotalNodes=9 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=352,mem=4626000M,node=9,billing=352,gres/gpu=15
...
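You can also ask for a single partition instead of all of them by naming it explicitly:
[ user@hpc ~]$ scontrol show partition gpu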
[ user@hpc ~]$ scontrol show nodes | head -n 18
NodeName=NODE01 Arch=x86_64 CoresPerSocket=32
CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=NODE01 NodeHostName=NODE01 Version=22.05.8
OS=Linux 5.14.0-362.13.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 24 01:57:57 EST 2023
RealMemory=1547000 AllocMem=0 FreeMem=1538696 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cpu
BootTime=2024-10-14T08:10:56 SlurmdStartTime=2024-10-14T08:11:16
LastBusyTime=2024-11-09T06:46:38
CfgTRES=cpu=64,mem=1547000M,billing=64
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=newreason [root@2024-11-11T12:02:59]
...
See the man page for scontrol (“man scontrol”) for more details about the command, especially for help interpreting the many fields that are reported. Also note that some features of the scontrol command, such as modifying job information, can only be accessed by system administrators.
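One operation that regular users can typically apply to their own pending jobs is holding and releasing them, which keeps a job in the queue but prevents it from starting until it is released (using the pending job from the earlier squeue example):
[ user@hpc ~]$ scontrol hold 1389
[ user@hpc ~]$ scontrol release 1389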