Job Submission

Shabyt uses the SLURM workload manager to schedule, distribute, and execute user jobs. SLURM (the name comes from Simple Linux Utility for Resource Management) is free and open-source software used by many, if not most, large HPC facilities throughout the world. Thus, if you have run research calculations at another HPC facility, it should be rather easy for you to migrate your jobs and start using the NU HPC clusters. On a login node, users arrange their data and files, write batch scripts, and submit their jobs to the execution queue. A submitted job is then put in a pending state until the requested system resources are available and allocated. SLURM schedules each job in the queue according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of the cluster resources.

Absolutely ''all computations'' on Shabyt (apart from compiling and quick test runs that use one or a few cores) are supposed to be executed via the workload manager, which distributes them across the system in an optimal way. It is extremely important that users do not abuse the management/login node (ln01): do not run long, heavy calculations on it, whether interactively or in the background. The function of the management node is to let users compile binaries, copy data, prepare input files, and submit jobs. The management node is NOT a workhorse for heavy calculations.

A cheat sheet for the SLURM job scheduler is available at https://slurm.schedmd.com/pdfs/summary.pdf.

== Partitions, QoS, and job limits ==

Each job’s position and progress in the queue is determined through the fairshare algorithm, which depends on a number of factors (e.g. size of the job, time requirement, job queuing time, partition, QoS). For information on the available partitions, Quality of Service (QoS) levels, maximum job durations, and the maximum number of simultaneously running jobs and CPU cores, please refer to the corresponding sections on page [[Policies]]. Please note that the limits are subject to change.
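If you want to inspect the current partition and QoS settings directly from the command line, SLURM itself can report them; a minimal sketch (the exact columns shown depend on the site configuration):

sinfo -s              # summary of partitions, their time limits, and node counts
sacctmgr show qos     # QoS definitions, including per-user limits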

It is important to keep in mind that your jobs cannot request a duration that exceeds the time limit. Jobs that request execution times exceeding the limit may still be sent to the queue, but they will stay queued (i.e. in a pending state) forever until you change the requested time. Likewise, requesting more RAM for your job than what is physically available will result in your submitted job staying in a pending state. One also cannot simultaneously use more CPU cores than what is allowed for a single user in a given QoS category. For example, if the total CPU core limit for a user is 256 cores and this user currently has four 64-core jobs running, then any newly submitted jobs will be placed in the queue and stay in a pending state. They will not start running even if there are resources available to execute them; they will be pending until one or more of the four running jobs finish.
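To see whether a submitted job is pending and why, you can query its state; for example (the job ID 12345 is hypothetical, and the exact reason strings vary by SLURM version):

squeue -u $USER           # your jobs; pending ones show state PD plus a reason code
scontrol show job 12345   # full details of one job, including the pending Reason field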

== Job submission ==

Jobs can be submitted to the cluster using a “batch” file. The top of the file consists of #SBATCH options that communicate the needs and parameters of the job. Although these lines start with a #, they are not mere comments: SLURM reads them as essential options for the job.

After the #SBATCH options, the submit file should contain the commands needed to run your job, including loading any needed software modules.
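Once the batch file is ready, it is submitted with the sbatch command; for example, assuming the file is saved as myjob.slurm (the file name is arbitrary):

sbatch myjob.slurm
# SLURM replies with the assigned job ID, e.g.: Submitted batch job 12345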

=== Running Serial / Single-Threaded Jobs ===

First, we are going to create a basic Python script called myscript.py:

a = 10

for i in range(a):
    print('Hello World')

Serial or single-CPU-core jobs are those jobs that can only make use of one CPU core on a node.

#!/bin/bash
## The shebang above must be the very first line of the script

## Resource request
#SBATCH --job-name=Test_Serial
#SBATCH --ntasks=1
#SBATCH --output=stdout%j.out
#SBATCH --error=stderr%j.out

## Bash command
python3 ./myscript.py
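After submitting the script above with sbatch, the job can be monitored and its output inspected, for example as follows (the job ID 12345 is hypothetical):

squeue -u $USER        # check the job state (R = running, PD = pending)
cat stdout12345.out    # the output file produced by --output=stdout%j.out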

=== Distributed Memory Parallelism (MPI) Job ===

Message Passing Interface (MPI) is a standardized and portable message-passing standard that allows programs to execute using CPUs on multiple nodes, with CPUs across nodes communicating over the network. The MPI standard defines the syntax and semantics of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. Intel MPI and Open MPI are available on the Shabyt system, and SLURM jobs may make use of either MPI implementation.

Requesting multiple nodes and/or loading MPI modules will not necessarily make your code faster: your code must be MPI-aware to use MPI. Even though running a non-MPI code with mpirun might appear to succeed, you will most likely have every core assigned to your job running the exact same computation, duplicating work and wasting resources.

The version of the MPI commands you run must match the version of the MPI library used to compile your code, or your job is likely to fail. The version of the MPI daemons started on all the nodes for your job must also match. For example, an MPI program compiled with the Intel MPI compilers should be executed using the Intel MPI runtime, not the Open MPI runtime.
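For example, to compile a C program against Intel MPI so that it matches the Intel MPI runtime used in the batch script below (the source and program names are illustrative):

module load iimpi/2022b                    # Intel MPI toolchain, as in the batch script below
mpicc -o my_mpi_program my_mpi_program.c   # compile with the MPI wrapper compiler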

#!/bin/bash

## Resource request: 2 nodes x 128 ranks per node = 256 MPI tasks;
## --time is the walltime limit (D-HH:MM:SS), --mem is memory per node
#SBATCH --job-name=Test_MPI
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --time=0-0:30:00
#SBATCH --mem=32G
#SBATCH --partition=CPU

## Bash commands
pwd; hostname; date
NP=${SLURM_NTASKS}               # number of tasks requested above
module load iimpi/2022b          # Intel MPI toolchain
mpirun -np ${NP} ./my_mpi_program <options>
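As an alternative to mpirun, SLURM's own launcher srun can start the MPI ranks and automatically inherits the job's allocation; whether this works out of the box depends on how the MPI library was built, so treat this as a sketch:

srun ./my_mpi_program <options>    # one rank per allocated task, no -np needed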

=== GPU Job ===

The example batch script below requests four A100 GPUs on the NVIDIA partition, together with eight single-CPU tasks and 7000 MB of memory per CPU:

#!/bin/bash
#SBATCH --job-name=gputest
#SBATCH --output=gpu.test.out
#SBATCH --error=gpu.test.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@nu.edu.kz
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --distribution=cyclic:cyclic
#SBATCH --mem-per-cpu=7000mb
#SBATCH --partition=NVIDIA
#SBATCH --gpus=a100:4
#SBATCH --time=00:30:00

module purge
module load cuda/11.4.1 intel/2023b OpenMPI/4.0.5-GCC-9.3.0
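Note that the script above only loads modules; the actual application launch must follow. A minimal continuation, where my_gpu_program is a hypothetical placeholder:

nvidia-smi               # optional: confirm the allocated GPUs are visible to the job
srun ./my_gpu_program    # launch the GPU-enabled application (hypothetical name)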

== SLURM Job Options ==

A SLURM script includes a list of SLURM job options at the top of the file, where each line starts with #SBATCH followed by an option and its value, telling the job scheduler what resources the job requests. The most common options are summarized in the table below.

{| class="wikitable"
! Long Option !! Short Option !! Default value !! Description
|-
| --job-name || -J || file name of the job script || User-defined name to identify a job
|-
| --time || -t || 48:00:00 || Limit on the maximum execution time (walltime) for the job, in D-HH:MM:SS format. For example, -t 1- is one day, -t 6:00:00 is 6 hours
|-
| --nodes || -N || || Total number of nodes
|-
| --ntasks || -n || 1 || Number of tasks (MPI workers)
|-
| --ntasks-per-node || || || Number of tasks per node
|-
| --cpus-per-task || -c || 1 || Number of CPUs required per task
|-
| --mem || || || Amount of memory allocated per node. Different units can be specified using the suffix K, M, G, or T
|-
| --mem-per-cpu || || || Amount of memory allocated per CPU core (for multicore jobs). Different units can be specified using the suffix K, M, G, or T
|-
| --constraint || -C || || Nodes with requested features. Multiple constraints may be specified with AND, OR, matching OR. For example, --constraint="CPU_MNF:AMD", --constraint="CPU_MNF:INTEL&CPU_GEN:CLX"
|-
| --exclude || -x || || Explicitly exclude certain nodes from the resources granted to the job. For example, --exclude=cn[1-3]
|}