Monitoring CPU and Memory on HPC Clusters

Optimizing the CPU and memory usage of your jobs lets you, and everyone else on the clusters, use resources more efficiently and get results faster. This guide demonstrates how to monitor CPU and memory usage for future, running, and completed jobs.

General Note

To request resources appropriately, consult the cluster's hardware page (especially the partitions and hardware section) before beginning any work. It describes the resources and queues available on each cluster.

Monitoring Future Jobs

Running a short test of your program under /usr/bin/time with the -v (verbose) flag reports detailed statistics about the resources it used, which helps you size the request for future jobs. For instance:

[username@c051 ~]$ /usr/bin/time -v stress-ng --cpu 8 --timeout 10s
stress-ng: info:  [62127] dispatching hogs: 8 cpu
stress-ng: info:  [62127] successful run completed in 10.06s
    Command being timed: "stress-ng --cpu 8 --timeout 10s"
    User time (seconds): 79.92
    System time (seconds): 0.04
    Percent of CPU this job got: 792%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.08
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4780
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 32
    Minor (reclaiming a frame) page faults: 11061
    Voluntary context switches: 76
    Involuntary context switches: 4212
    Swaps: 0
    File system inputs: 5400
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

The "Maximum resident set size" line reports the peak RAM your program used, in kilobytes; use it to choose a memory request for future submissions of the same workload.
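You can extract that line directly from the report. A minimal sketch (note that /usr/bin/time writes its report to stderr, "sleep 1" stands in for your real command, and time_report.txt is an arbitrary filename):

```shell
# /usr/bin/time writes its report to stderr; redirect it to a file so it
# doesn't mix with your program's own output.
/usr/bin/time -v sleep 1 2> time_report.txt
grep "Maximum resident set size" time_report.txt
rm time_report.txt
```

The same wrapper can be placed in front of the command line inside a batch script, so the report lands in the job's error file.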

Monitoring Running Jobs

To monitor a job that's already running, you can check its usage in real-time. First, use squeue to identify the compute node your job is running on:

[username@login1 ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            785510  standard job_clus username  R 8-15:45:40      5 c[051-055]

Then, SSH into one of the nodes listed under NODELIST:

[username@login1 ~]$ ssh c051
[username@c051 ~]$

Once connected to the compute node, use ps or top to monitor your job's resource usage.

Using ps

The ps command gives a snapshot of usage each time you run it. For example:

[username@c051 ~]$ ps -u$USER -o %cpu,rss,args

Using top

top displays live usage statistics interactively. To filter processes by user, press u, enter your username, and press Enter. Check the "RES" column for each process's resident memory usage. To exit, press q.
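If you want a snapshot of top's output without the interactive interface, for example to log usage from a script, batch mode may help:

```shell
# Take a single non-interactive snapshot of your own processes
# (-b: batch mode, -n 1: one iteration), keeping the summary and
# the first few process lines.
top -b -n 1 -u "$USER" | head -n 20
```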

Monitoring Multi-node Jobs with ClusterShell

For multi-node jobs, use clush from ClusterShell. Refer to our ClusterShell setup guide for setup and usage instructions.

Monitoring Completed Jobs

Slurm records statistics for every job, including memory and CPU usage.

Using sacct

The sacct command queries Slurm's accounting database and supports flexible, detailed job queries. We recommend setting the SACCT_FORMAT environment variable to customize the output:

[username@login1 ~]$ export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32"
[username@login1 ~]$ sacct -j <job_id>

Examine the "MaxRSS" column (reported per job step, in kilobytes by default) to see your job's peak memory usage.
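Once you have a MaxRSS figure, a rough rule of thumb is to request slightly more memory than the observed peak. A minimal sketch of the arithmetic (the MaxRSS value below is made up; sacct prints it with a K suffix, e.g. "3941288K", which must be stripped first):

```shell
# Hypothetical MaxRSS value copied from sacct output, in kilobytes.
maxrss_kb=3941288
# Round up to whole gigabytes (1 GB = 1048576 KB)...
mem_gb=$(( (maxrss_kb + 1048575) / 1048576 ))
# ...and add roughly 20% headroom for the next submission.
echo "Peak usage ~${mem_gb} GB; consider --mem=$(( (mem_gb * 12 + 9) / 10 ))G"
# → Peak usage ~4 GB; consider --mem=5G
```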


By monitoring your jobs' resource usage, you can right-size the requests in future submissions, making more efficient use of the clusters for everyone.