Monitoring jobs
There are several tools available to help you monitor jobs on the cluster. We will discuss some of them here. You can find more information on some convenient SLURM commands here.
squeue
The most basic way to check the status of the batch system is with the programs squeue and sinfo. These are not graphical programs, but we mention them here for comparison. We can check which jobs are active with squeue:
[ user@hpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES QOS NODELIST(REASON)
1389 cpu fmMle_no userA PD 0:00 32 normal (Resources)
1381 cpu fmMle_no userB R 15:52 1 normal NODE01
Notice that the first job is in state PD (pending), and is waiting for 32 nodes to become available. The second job is in state R (running), and is executing on node NODE01.
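On a busy cluster the full listing can get long. To see only your own jobs, you can restrict squeue to your user name (recent SLURM versions also accept the --me shortcut):
[ user@hpc ~]$ squeue -u <username>
[ user@hpc ~]$ squeue --me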
sacct
We can retrieve statistics for a completed job (no longer in the queue) using the sacct command.
[ user@hpc ~]$ sacct -j 151111 --format=JobID,JobName,Partition,QOS,Elapsed,Start,NodeList,State,ExitCode
JobID JobName Partition QOS Elapsed Start NodeList State ExitCode
------------ ---------- ---------- ---------- ---------- ------------------- --------------- ---------- --------
151111 fq32_ch14 batch normal 00:05:56 2024-03-10T14:14:45 NODE09 COMPLETED 0:0
151111.batch batch 00:05:56 2024-03-10T14:14:45 NODE09 COMPLETED 0:0
151111.0 run_scrip+ 00:05:55 2024-03-10T14:14:46 NODE09 COMPLETED 0:0
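If job accounting is configured to record them, sacct can also report resource usage for finished jobs. As a sketch, the following requests elapsed time, peak memory, and consumed CPU time for the same job and its steps:
[ user@hpc ~]$ sacct -j 151111 --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State,ExitCode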
Suspend/Resume all jobs
There are cases in which a user may want to suspend all of their currently running jobs (including job arrays). This can be done with the command:
squeue -ho %A -t R -u <username> | xargs -n 1 scontrol suspend
To resume the suspended jobs, simply use the command:
squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume
sinfo
We can see what’s going on with the batch system from the perspective of the queues, using sinfo.
[ user@hpc ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up 21-00:00:0 1 drain NODE01
cpu* up 21-00:00:0 7 mix NODE[02-06,09-10]
cpu* up 21-00:00:0 3 idle NODE[07-08,11]
gpu up 21-00:00:0 1 comp TCPGPUSRVGAF09
gpu up 21-00:00:0 8 mix TCPGPUSRVGAF[02-08,10]
jisap up 21-00:00:0 1 idle TCPGPUSRVGAF01
We can see that in the cpu partition one node (NODE01) is drained, seven nodes (NODE[02-06,09-10]) are in mixed use, and three (NODE[07-08,11]) are idle. In the gpu partition one node is completing a job and eight are in mixed use, while the single node in the jisap partition is idle. By combining this with the Linux watch command, we can make a simple display that refreshes periodically. Try
[ user@hpc ~]$ watch sinfo
and you will get the following display
Every 2.0s: sinfo nodeges02.tri.lan: Mon Nov 11 12:23:48 2024
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up 21-00:00:0 1 drain NODE01
cpu* up 21-00:00:0 7 mix NODE[02-06,09-10]
cpu* up 21-00:00:0 3 idle NODE[07-08,11]
gpu up 21-00:00:0 1 comp TCPGPUSRVGAF09
gpu up 21-00:00:0 8 mix TCPGPUSRVGAF[02-08,10]
jisap up 21-00:00:0 1 idle TCPGPUSRVGAF01
Use ctrl-c to exit back to the prompt.
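watch also accepts a custom refresh interval and an arbitrary command, so you can combine it with a filtered squeue, for example (replace <username> with your account):
[ user@hpc ~]$ watch -n 10 'squeue -u <username>'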
You can also customize the output of the squeue and sinfo commands. Many fields are available that aren’t shown in the default output format. For example, we can add a MIN_MEMORY field, which shows how much memory the job requested, and a TIME_LEFT field, which says how much time is left before the job’s walltime limit is reached.
squeue --format '%.7i %.4P %.12j %.12u %.8T %.5C %.5m %.11l %.11M %.11L %.6D %R'
JOBID PART NAME USER STATE CPUS MIN_M TIME_LIMIT TIME TIME_LEFT NODES NODELIST(REASON)
8817 cpu launch_exper john.doe RUNNING 32 4G 7-00:00:00 2-18:19:12 4-05:40:48 1 NODE09
8856 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:41:52 6-00:18:08 1 NODE09
8858 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:41:13 6-00:18:47 1 NODE02
8859 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:57 6-00:19:03 1 NODE03
8860 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:37 6-00:19:23 1 NODE04
8861 cpu run_experime mary.doe RUNNING 25 8G 7-00:00:00 23:40:15 6-00:19:45 1 NODE05
We’ve specified some fields to obtain this output. For all available fields and other output options, see the squeue and sinfo man pages.
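If you find yourself reusing a particular layout, squeue and sinfo also read the SQUEUE_FORMAT and SINFO_FORMAT environment variables, so you can set your preferred format once in your shell startup file; for example (adjust the field list to taste):
[ user@hpc ~]$ export SQUEUE_FORMAT='%.7i %.4P %.12j %.12u %.8T %.11M %.11L %.6D %R'
[ user@hpc ~]$ squeue   # now uses the format above by default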
scontrol
SLURM maintains more information about the system than is available through squeue and sinfo. The scontrol command allows you to see this. First, let’s see how to get very detailed information about all jobs currently in the batch system (this includes running, recently completed, pending jobs, etc.).
[ user@hpc ~]$ scontrol show jobs
JobId=8817 JobName=launch_experiment.sh
UserId=john.doe(94152869) GroupId=domain users MCS_label=N/A
Priority=4294895151 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=2-18:31:37 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2024-11-08T18:05:21 EligibleTime=2024-11-08T18:05:21
AccrueTime=2024-11-08T18:05:21
StartTime=2024-11-08T18:05:21 EndTime=2024-11-15T18:05:21 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-08T18:05:21 Scheduler=Main
Partition=cpu AllocNode:Sid=nodeges01:2198320
ReqNodeList=(null) ExcNodeList=(null)
NodeList=NODE09
BatchHost=NODE09
NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=128G,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=32 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/whereever/launch_experiment.sh
WorkDir=/global/projects/whereever
StdErr=/global/projects/whereever/slurm-8817.out
StdIn=/dev/null
StdOut=/global/projects/whereever/slurm-8817.out
Power=
JobId=8870 JobName=dinov2:train
UserId=name.surname GroupId=domain users MCS_label=N/A
Priority=4294895098 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=04:05:22 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2024-11-11T08:31:35 EligibleTime=2024-11-11T08:31:35
AccrueTime=2024-11-11T08:31:35
StartTime=2024-11-11T08:31:36 EndTime=2024-11-14T08:31:36 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-11T08:31:36 Scheduler=Backfill
Partition=gpu AllocNode:Sid=nodeges01:3271764
ReqNodeList=(null) ExcNodeList=(null)
NodeList=TCPGPUSRVGAF[05-07]
BatchHost=TCPGPUSRVGAF05
NumNodes=3 NumCPUs=30 NumTasks=3 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
TRES=cpu=30,mem=1542000M,node=3,billing=30,gres/gpu=3
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/whatever/outputs/train/.submission_file_018c992215bf4340ab8f4eccca46d324.sh
WorkDir=/global/projects/whatever
StdErr=/global/projects/whatever/outputs/train/8870_0_log.err
StdIn=/dev/null
StdOut=/global/projects/whatever/outputs/train/8870_0_log.out
Power=
TresPerNode=gres:gpu:1
...
From this output, we can see for example that job 8870 was submitted at 2024-11-11T08:31:35, runs 3 tasks on 30 CPUs (NumTasks and NumCPUs) on nodes TCPGPUSRVGAF[05-07], and its working directory is /global/projects/whatever. One thing that’s missing is how the allocated resources are laid out on each node. Fortunately, we can get this by specifying the “--details” option.
[ user@hpc ~]$ scontrol show --details JobId=8870
JobId=8870 JobName=dinov2:train
UserId=name.surname GroupId=domain users MCS_label=N/A
Priority=4294895098 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=04:25:49 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2024-11-11T08:31:35 EligibleTime=2024-11-11T08:31:35
AccrueTime=2024-11-11T08:31:35
StartTime=2024-11-11T08:31:36 EndTime=2024-11-14T08:31:36 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-11T08:31:36 Scheduler=Backfill
Partition=gpu AllocNode:Sid=nodeges01:3271764
ReqNodeList=(null) ExcNodeList=(null)
NodeList=TCPGPUSRVGAF[05-07]
BatchHost=TCPGPUSRVGAF05
NumNodes=3 NumCPUs=30 NumTasks=3 CPUs/Task=10 ReqB:S:C:T=0:0:*:*
TRES=cpu=30,mem=1542000M,node=3,billing=30,gres/gpu=3
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
JOB_GRES=gpu:hopper:3
Nodes=TCPGPUSRVGAF[05-07] CPU_IDs=0-9 Mem=514000 GRES=gpu:hopper:1(IDX:0)
MinCPUsNode=10 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/global/projects/nowhere/outputs/train/.submission_file_018c992215bf4340ab8f4eccca46d324.sh
WorkDir=/global/projects/nowhere
StdErr=/global/projects/nowhere/outputs/train/8870_0_log.err
StdIn=/dev/null
StdOut=/global/projects/nowhere/outputs/train/8870_0_log.out
Power=
TresPerNode=gres:gpu:1
There is a lot of output, but notice the following line:
- Nodes=TCPGPUSRVGAF[05-07] CPU_IDs=0-9 Mem=514000 GRES=gpu:hopper:1(IDX:0)
This tells us that on each of the nodes TCPGPUSRVGAF[05-07], the job has been allocated CPU cores 0-9, 514000 MB of memory, and one hopper GPU (device index 0).
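Because scontrol show job prints so many fields, it is often convenient to filter for just the lines you care about with standard shell tools, for example:
[ user@hpc ~]$ scontrol show job 8870 | grep -E 'JobState|RunTime|NodeList|TRES'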
scontrol is a very versatile command, and we can also use it to get detailed information about the available nodes and queues (called “partitions” in SLURM).
[ user@hpc ~]$ scontrol show partitions
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=NODE[01-11]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=704 TotalNodes=11 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=704,mem=5407000M,node=11,billing=704
PartitionName=gpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=21-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=TCPGPUSRVGAF[02-10]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=352 TotalNodes=9 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=352,mem=4626000M,node=9,billing=352,gres/gpu=15
...
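You can also ask for a single partition instead of all of them by naming it explicitly:
[ user@hpc ~]$ scontrol show partition gpu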
[ user@hpc ~]$ scontrol show nodes | head -n 18
NodeName=NODE01 Arch=x86_64 CoresPerSocket=32
CPUAlloc=0 CPUEfctv=64 CPUTot=64 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=NODE01 NodeHostName=NODE01 Version=22.05.8
OS=Linux 5.14.0-362.13.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 24 01:57:57 EST 2023
RealMemory=1547000 AllocMem=0 FreeMem=1538696 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=cpu
BootTime=2024-10-14T08:10:56 SlurmdStartTime=2024-10-14T08:11:16
LastBusyTime=2024-11-09T06:46:38
CfgTRES=cpu=64,mem=1547000M,billing=64
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=newreason [root@2024-11-11T12:02:59]
...
See the man page for scontrol (“man scontrol”) for more details about the command, especially for help interpreting the many fields that are reported. Also note that some features of the scontrol command, such as modifying job information, can only be accessed by system administrators.
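One operation that regular users can typically apply to their own pending jobs is holding and releasing them, which keeps a job in the queue but prevents it from starting until it is released (using the pending job from the earlier squeue example):
[ user@hpc ~]$ scontrol hold 1389
[ user@hpc ~]$ scontrol release 1389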