Job Submission

From NU HPC Wiki

Revision as of 16:23, 13 November 2024

NU HPC clusters use the SLURM workload manager to schedule, distribute, and execute user jobs. SLURM (the name comes from Simple Linux Utility for Resource Management) is free and open-source software used by many, if not most, large HPC facilities throughout the world. Thus, if you have run research calculations at some HPC facility elsewhere, it should be rather easy for you to migrate your jobs and start using NU HPC clusters. On a login node, users arrange their data, write batch scripts, and submit their jobs to the execution queue. Submitted jobs are placed in a pending state until the requested system resources are available and allocated. SLURM schedules each job in the queue according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of the cluster resources.

Absolutely all computational tasks on Shabyt (apart from compiling and very short test runs that use one or just a few cores) are supposed to be executed through the SLURM workload manager, which distributes them across the system in an optimal way. It is extremely important that users do not abuse the management/login node (ln01 on Shabyt) where they log in, and do not run long, heavy calculations on it interactively or in the background. The function of the management node is to let users compile binaries, transfer data, manage files, prepare input for calculations, and submit jobs. The management node is not a workhorse for heavy calculations.

For a comprehensive guide on SLURM you can refer to its website. A short printable cheat sheet of some useful SLURM commands and parameters is available in this summary. Below we will dive into explaining how jobs should be submitted for execution in NU HPC systems and provide some basic examples.

Partitions, QoS, and job limits

Each job’s position and progress in the queue is determined by the fairshare algorithm, which depends on a number of factors (e.g. size of the job, time requirement, job queuing time, partition, QoS, etc.). For information on the available partitions, Quality of Service (QoS), maximum job durations, and the maximum number of simultaneously running jobs and CPU cores used, please refer to the corresponding sections on page Policies. Please note that the limits are subject to change.

It is important to keep in mind that your jobs cannot request durations that exceed the time limit set by our policies. Jobs that request execution times exceeding the limit may still be sent to the queue, but they will stay queued (i.e. remain in a pending state) forever until you change the time requested. Likewise, requesting more RAM for your job than what is physically available will leave your submitted job in a pending state forever. One also cannot simultaneously use more CPU cores than what is allowed for a single user belonging to a specific QoS category. For example, if the total CPU core limit is 256 cores per user and a user currently has four 64-core jobs running, then any newly submitted jobs will be placed in the queue and stay in a pending state. They will not start running even if there are resources available to execute them. They will remain pending until one or more of the four running jobs finish.

Job submission and monitoring

Jobs can be submitted for execution using a “batch” file. The batch file is essentially a Unix shell script (typically a bash script, but using other shells, e.g. tcsh, is possible as well) that, in addition to the actual user commands, contains a preamble (or header) written in a special format. This header, every line of which begins with the keyword #SBATCH, contains batch directives - information about the resources requested for the job, user information, job name, etc. While bash and other Linux shells treat lines beginning with #SBATCH as comments, they are not comments for SLURM. SLURM reads them when you submit a job for execution, interprets them, and acts accordingly. Note that if you change the format of those lines even slightly, e.g. so that they begin with # SBATCH or ##SBATCH, then SLURM no longer reads them and treats them as comments. This is convenient for making SLURM skip some lines without actually deleting them from your script.
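The dual nature of #SBATCH lines is easy to check directly: bash treats them as ordinary comments, so a batch script behaves the same whether it is run with bash or handed to sbatch. A minimal sketch (the job name and time value below are illustrative):

```shell
#!/bin/bash
# SLURM parses the two directives below at submission time; to bash they
# are plain comments and have no effect.
#SBATCH --job-name=DemoJob
#SBATCH --time=0-1:00:00
## SBATCH --mem=4G   <- the doubled "#" hides this directive from SLURM as well

echo "bash ignored every #SBATCH line above"
```

Running this file with bash prints only the echoed message, confirming that the directives are inert outside SLURM.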

Most common and useful SLURM commands

A list of some of the most useful SLURM commands is provided below. Please be aware that many SLURM commands accept flags/arguments that further extend their functionality. Those can be explored by invoking help using the --help flag (e.g. sbatch --help) in the terminal window.

List of useful SLURM Commands
Command Description
sbatch <script_file_name> Submit a job using the specified script <script_file_name> (e.g. myscript.slurm).
sbatch --test-only <script_file_name> Test the SLURM script for errors without submitting a job.
scancel <job_id> Cancel a running or queued job by specifying its Job ID (a four or five digit number).
squeue View the status of all user jobs in the queue, including their state (running, pending, etc.).
squeue -u <user_name> View the status of jobs by user <user_name> (e.g. john.smith).
squeue -l A longer output format for the squeue command.
squeue -o "%.6i %.6P %.12j %.19u %.2t %.11M %.8Q %.5D %.4C %.11R" This is an example of an even more detailed, user-defined output format for squeue. To avoid typing such a long command every time, you can define an alias in your shell profile file .bashrc by adding a line such as alias mysqueu='( squeue -o "%.6i %.6P %.12j %.19u %.2t %.11M %.8Q %.5D %.4C %.11R" )'
sinfo Display information about the cluster, including nodes, partitions, and their status.
scontrol show job <job_id> Show detailed information about a specific job.
squeue --start Estimate the start time for queued jobs.
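As the table notes, a long squeue format string is best wrapped in a shell alias. A sketch for ~/.bashrc (the alias name mysqueue and the restriction to your own jobs via -u $USER are illustrative choices, not requirements):

```shell
# ~/.bashrc fragment: detailed queue listing limited to your own jobs.
# The format string matches the squeue example in the table above.
alias mysqueue='squeue -u $USER -o "%.6i %.6P %.12j %.19u %.2t %.11M %.8Q %.5D %.4C %.11R"'
```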

Sample SLURM batch scripts

Running a simple serial (single-threaded) job

In this example we will prepare and submit a simple job that executes a serial Python program. First we will create a directory called test_serial inside our home directory, with the full path /shared/home/<your.name>/test_serial. It is always advisable to run each job in a separate directory to avoid confusion and mixing up files.

Now let us enter directory test_serial and create a short Python script called my_python_script.py. When executed, this script prints a "Hello, world!" message once a second for one minute before it completes.

import time

nsec=60

for i in range(1, nsec+1):
    print(f"{i}: Hello, world!")
    time.sleep(1)

Now let us create a SLURM script (we will use bash for this) called batch_serial.slurm with the following content.

#!/bin/bash

# ---------------------------------------------------
# Directives section that requests specific resources
# ---------------------------------------------------

#SBATCH --job-name=MySerialJob           # The name of your job
#SBATCH --time=0-2:00:00                 # Time requested: 0 days + 2 hours + 0 minutes + 0 seconds
#SBATCH --mem=1G                         # Memory requested: 1 Gigabyte
#SBATCH --ntasks=1                       # Request room for a single thread only 
#SBATCH --partition=CPU                  # Specify the partition name
#SBATCH --output=stdout%j.out            # Specify the file name for standard screen output
#SBATCH --error=stderr%j.out             # Specify the file name for error messages
#SBATCH --mail-type=END,FAIL             # Specify when automatic email notification should be sent
#SBATCH --mail-user=your.name@nu.edu.kz  # Specify your email address where notifications are sent

# ---------------------------------------------------
# Your code section
# ---------------------------------------------------

# Load a module with a specific Python version. Technically, this is not required for our example
# because the operating system default Python is already available in all compute nodes of Shabyt 
# without loading anything. However, we will do it here for illustrative purposes. 
  
module load Python/3.11.5-GCCcore-13.2.0

# Below we will execute our Python program. By default, SLURM executes the commands
# in this batch script in the same directory from which the script was submitted
# with sbatch. However, if you need to, you can change to a specific directory in
# this script, or use a full path to your Python program. Moreover, if you wish, you
# could define your own Unix shell variables, copy or move files, use loops or
# conditional statements, execute shell commands, etc. -- i.e. do anything that one
# can do in a Unix shell script in order to pre-process and post-process your files
# and data. In this example we simply execute our Python program using Python 3 in
# the current directory.

python3 ./my_python_script.py

We can now submit our job from the terminal by typing sbatch batch_serial.slurm. If everything goes well and the task is accepted by SLURM, you should see a confirmation message, e.g. Submitted batch job XXXXX, where XXXXX is the ID assigned to your job. You can check the queue to see whether your job is executed immediately or is placed in a queue (pending status). For this you can type, e.g., squeue --me in the terminal. If your job started execution, it should finish within just one minute, because this is how we wrote our Python program. After the job execution is complete, it will no longer appear in the list of jobs that you see with the squeue command. In the end you will find two files in directory test_serial. These are stderrXXXXX.out and stdoutXXXXX.out. The first one should normally be empty, unless there were errors or exceptions during the execution. The second one contains the standard screen output, i.e. all sixty "Hello, world!" messages are printed there.

Note that the maximum time we requested for our job in the above SLURM script was 2 hours. If the batch script finishes sooner than 2 hours after it begins execution, the job ends and SLURM will go ahead and use the freed resources for some other job in the queue. If the execution of the batch script does not finish within the requested time, the job is forced to terminate. If you perform long calculations and do not save intermediate data, you must set the maximum time long enough to complete the task. On the other hand, setting the maximum time to be much longer than you actually need might sometimes increase the wait time of your job in the queue. Thus users must use reasonable judgement when they set the maximum time for their jobs. The same applies to other parameters/resources, such as the amount of memory. Requesting too little memory may cause your job to be killed prematurely, while requesting too much (e.g. requesting 100 GB while your program actually needs only 100 kB) may increase the wait time in the queue. Requesting too much memory may also prevent SLURM from starting jobs submitted by other users, due to the reduced amount of memory available to them. Again, users need to use reasonable judgement when they request resources (time, memory, CPU cores, etc.) for their jobs.
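One way to calibrate these requests is to look at what a finished job actually consumed. Assuming SLURM accounting is enabled on the cluster, sacct can report the elapsed walltime and peak memory of a past job (the job ID 12345 below is a placeholder):

```shell
# Peak memory (MaxRSS) and elapsed walltime of a completed job;
# use these figures to choose sensible --time and --mem values
# for similar future jobs.
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State
```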

In the above example we set the default output to go to the file called stdoutXXXXX.out (%j in the line #SBATCH --output=stdout%j.out is substituted with the unique Job ID assigned to your job by SLURM - a four or five digit number). If different parts of your batch script need to output to different files, you can always use redirection, e.g. you could have something like this in your batch script:

python3 ./mypythonscript1.py > output.txt
python3 ./mypythonscript2.py > otheroutput.txt

The above batch script example contains the directive #SBATCH --mail-type=END,FAIL. It tells SLURM to send the user an automatic email in the event that the job ends normally or fails during execution. This is handy if you do not want to sit by a computer and constantly check the status of your calculations. If you would also like to be notified about the job start (useful if your job waits in the queue for a long time before SLURM executes it), you can use the directive #SBATCH --mail-type=BEGIN,END,FAIL instead. The email address where the messages are sent is given in the next line of the batch script, #SBATCH --mail-user=your.name@nu.edu.kz (do not forget to replace it with your actual email).

Distributed Memory Parallelism (MPI) Job

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to allow execution of programs on CPUs across multiple nodes, where processes on different nodes communicate over the network. The MPI standard defines the syntax and semantics of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. Intel MPI and OpenMPI are available on the Shabyt system, and SLURM jobs may use either MPI implementation.

Requesting multiple nodes and/or loading MPI modules will not necessarily make your code faster: your code must be MPI-aware to use MPI. Even though running a non-MPI code with mpirun might possibly succeed, you will most likely end up with every core assigned to your job running the exact same computation, duplicating each other's work and wasting resources.

The version of the MPI commands you run must match the version of the MPI library used to compile your code, or your job is likely to fail. The version of the MPI daemons started on all the nodes for your job must also match. For example, an MPI program compiled with the Intel MPI compilers should be executed using the Intel MPI runtime rather than the Open MPI runtime.
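A simple way to keep the toolchains consistent is to load the same module for both the compile step and the run step. A sketch using the iimpi/2022b module from the sample script below (the source file name and the task count are illustrative):

```shell
# Load the Intel MPI toolchain once, then compile and run within it.
module load iimpi/2022b

# mpiicc is Intel MPI's wrapper around the Intel C compiler; a binary
# built with it must be launched with the matching Intel MPI mpirun.
mpiicc my_mpi_program.c -o my_mpi_program
mpirun -np 4 ./my_mpi_program
```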

#!/bin/bash
#SBATCH --job-name=Test_MPI
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --time=0-0:30:00
#SBATCH --mem=32G
#SBATCH --partition=CPU

pwd; hostname; date
NP=${SLURM_NTASKS}
module load iimpi/2022b
mpirun -np ${NP} ./my_mpi_program <options>

GPU Job

#!/bin/bash
#SBATCH --job-name=gputest
#SBATCH --output=gpu.test.out
#SBATCH --error=gpu.test.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@nu.edu.kz
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --distribution=cyclic:cyclic
#SBATCH --mem-per-cpu=7000mb
#SBATCH --partition=NVIDIA
#SBATCH --gpus=a100:4
#SBATCH --time=00:30:00

module purge
module load cuda/11.4.1 intel/2023b OpenMPI/4.0.5-GCC-9.3.0

# Launch the GPU program (my_gpu_program is a placeholder; replace it
# with your actual CUDA-enabled executable)
srun ./my_gpu_program

SLURM Job Options

A SLURM script includes a list of SLURM job options at the top of the file, where each line starts with #SBATCH followed by an option-value pair, telling the job scheduler what resources the job requests.

Long Option Short Option Default value Description
--job-name -J file name of job script User defined name to identify a job
--time -t 48:00:00 Specify a limit on the maximum execution time (walltime) for the job (D-HH:MM:SS) .

For example, -t 1- is one day, -t 6:00:00 is 6 hours

--nodes -N Total number of node(s)
--ntasks -n 1 Number of tasks (MPI workers)
--ntasks-per-node Number of tasks per node
--cpus-per-task -c 1 Number of CPUs required per task
--mem Amount of memory allocated per node. Different units can be specified using the suffix [K|M|G|T]
--mem-per-cpu Amount of memory allocated per CPU core (for multicore jobs). Different units can be specified using the suffix [K|M|G|T]
--constraint -C Nodes with requested features. Multiple constraints may be specified with AND, OR, Matching OR. For example, --constraint="CPU_MNF:AMD", --constraint="CPU_MNF:INTEL&CPU_GEN:CLX"
--exclude -x Explicitly exclude certain nodes from the resources granted to the job.  For example, --exclude=cn[1-3]
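To tie the options together, a batch-script header combining several of them might look as follows (all values are illustrative, not recommendations for any particular workload):

```shell
#!/bin/bash
# Sketch of a header built from the options in the table above.
#SBATCH --job-name=demo_job          # -J: name shown in squeue
#SBATCH --time=1-00:00:00            # -t: 1 day of walltime
#SBATCH --nodes=1                    # -N: one node
#SBATCH --ntasks=4                   # -n: four tasks (MPI workers)
#SBATCH --cpus-per-task=1            # -c: one CPU per task
#SBATCH --mem=8G                     # 8 GB of memory per node
#SBATCH --constraint="CPU_MNF:AMD"   # only nodes with the AMD feature tag
```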