Job Submission
Shabyt uses the SLURM workload manager to schedule, distribute, and execute user jobs. SLURM (the name comes from Simple Linux Utility for Resource Management) is free and open-source software used by many, if not most, large HPC facilities throughout the world. Thus, it should be easy for NU users who have used computational resources elsewhere to migrate their jobs from other facilities. On a login node, a user writes a batch script and submits it to the queue manager, which schedules it for execution on the compute nodes. The submitted job then waits in the queue until the requested system resources are allocated. The queue manager schedules jobs according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of cluster resources.
Each job's position in the queue is determined by the fairshare algorithm, which depends on a number of factors (e.g. job size, time requirement, job queuing time, etc.). The HPC system is set up to support large computational jobs. Maximum CPU and processing time limits are summarized in the tables below. Please note that the limits are subject to change without notice.
A cheat sheet for the SLURM job scheduler is available at https://slurm.schedmd.com/pdfs/summary.pdf.
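As a quick sketch of this workflow (the script name and job ID below are placeholders), a job is submitted, monitored, and if necessary cancelled with the standard SLURM commands:

sbatch my_job.sh        # submit the batch script; SLURM replies with "Submitted batch job <jobid>"
squeue -u $USER         # list your jobs; the ST column shows PD (pending) or R (running)
scancel 123456          # cancel a job using the job ID reported by sbatch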
- Partition and QoS
- Job submission
- Job Management
- Removing or Holding jobs
Partition & QoS
Partitions
Currently, there are two available partitions on Shabyt:
1. CPU: This partition includes 20 nodes equipped with CPUs only.
2. NVIDIA: This partition consists of 4 GPU nodes. All jobs requiring GPU computations must be submitted to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system.
| Partition | Max Walltime | Number of Nodes | Cores per Node | RAM (GB) per Node |
|---|---|---|---|---|
| CPU | 14 days | 20 | 128 | 256 |
| GPU | 2 days | 4 | 128 | 256 |
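To check the partitions and the current state of their nodes before submitting, the standard SLURM sinfo command can be used (the exact output depends on the system configuration):

sinfo                   # list all partitions with their time limits, node counts, and node states
sinfo -p CPU --long     # detailed view of a single partition, here the CPU partition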
Quality of Service (QoS)
Each QoS is assigned a set of limits applied to the job, dictating the resources and partitions that a job is entitled to request. The table below shows the available QoS on Shabyt and their allowed partitions and resource limits.

| QoS | Supported Partitions | Max Jobs Per User | Max CPUs |
|---|---|---|---|
| hpcnc * | CPU, GPU | 40 | 2560 |
| nu | CPU, GPU | 12 | 512 |

* Requires special approval
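A QoS is requested with SLURM's standard --qos directive near the top of the batch script. A minimal sketch, assuming the nu QoS from the table above:

# request the "nu" quality of service; the partition must be one allowed for that QoS
#SBATCH --qos=nu
#SBATCH --partition=CPU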
Job Submission
Jobs can be submitted to the cluster using a “batch” file. The top half of the file consists of #SBATCH options which communicate the needs or parameters of the job – these lines are not comments, but essential options for the job. The values for #SBATCH options should reflect the node sizes and run time limits described above. After the #SBATCH options, the submit file should contain the commands needed to run your job, including loading any needed software modules.
Example submit files for serial, shared-memory (SMP), and distributed-memory (MPI) jobs are given below. For instance, the serial example requests 1 node with a single task and 5 GB of memory on the CPU partition, and specifies a run time limit of 3 days.
Serial Job
#!/bin/bash
#SBATCH --job-name=Test_Serial
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=3-0:00:00
#SBATCH --mem=5G
#SBATCH --partition=CPU
#SBATCH --output=stdout%j.out
#SBATCH --error=stderr%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=my.email@nu.edu.kz
#SBATCH --get-user-env
#SBATCH --no-requeue
pwd; hostname; date
cp myfile1.dat myfile2.dat
./my_program myfile2.dat
SMP Job
#!/bin/bash
#SBATCH --job-name=Test_SMP
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=3-0:00:00
#SBATCH --mem=20G
#SBATCH --partition=CPU
#SBATCH --output=stdout%j.out
#SBATCH --error=stderr%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=my.email@nu.edu.kz
#SBATCH --get-user-env
#SBATCH --no-requeue
pwd; hostname; date
export OMP_NUM_THREADS=8
./my_smp_program myinput.inp > myoutput.out
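Instead of hard-coding the thread count to 8, it can be taken from the environment variable SLURM exports inside the job, so it always matches the --cpus-per-task request. A sketch of the last two lines:

# take the thread count from the SLURM allocation rather than a fixed number
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_smp_program myinput.inp > myoutput.out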
Distributed Memory Parallelism (MPI) Job
#!/bin/bash
#SBATCH --job-name=Test_MPI
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --time=3-0:00:00
#SBATCH --mem=250G
#SBATCH --partition=CPU
#SBATCH --exclusive
#SBATCH --output=stdout%j.out
#SBATCH --error=stderr%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=my.email@nu.edu.kz
#SBATCH --get-user-env
#SBATCH --no-requeue
pwd; hostname; date
NP=${SLURM_NTASKS}
module load gcc/9.5.0
module load openmpi/gcc9/4.1.5
mpirun -np ${NP} ./my_mpi_program myinput.inp > myoutput.out
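Depending on how the MPI library was built, the ranks can also be launched with SLURM's own srun, which reads the task count and placement directly from the allocation; this is shown only as an alternative sketch, not a required change:

# srun inherits --ntasks and --ntasks-per-node from the #SBATCH directives above
srun ./my_mpi_program myinput.inp > myoutput.out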
All computations on Shabyt (apart from quick test runs) are supposed to be executed via the workload manager, which distributes them across the system in an optimal way. It is extremely important that users do not abuse the management node (mgmt01) where they log in: do not run long, heavy calculations on it, either interactively or in the background. The function of the management node is to let users compile binaries, copy data, prepare input files, and submit jobs. The management node is NOT a workhorse for heavy calculations.
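For the quick test runs mentioned above, a short interactive session on a compute node can be requested through SLURM rather than running on the management node. A minimal sketch (the core count, time limit, and partition are only examples):

# allocate 1 core on the CPU partition for 30 minutes and open a shell on the assigned node
srun --partition=CPU --ntasks=1 --time=00:30:00 --pty bash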
SLURM Job Directives
A SLURM script includes a list of SLURM job directives at the top of the file, where each line starts with #SBATCH followed by an option and its value, telling the job scheduler what resources the job requests.
| Long Option | Short Option | Default value | Description |
|---|---|---|---|
| --job-name | -J | file name of job script | User-defined name to identify a job |
| --time | -t | 48:00:00 | Limit on the maximum execution time (walltime) of the job, in D-HH:MM:SS format. For example, -t 1- is one day and -t 6:00:00 is 6 hours |
| --nodes | -N | | Total number of nodes |
| --ntasks | -n | 1 | Number of tasks (MPI workers) |
| --ntasks-per-node | | | Number of tasks per node |
| --cpus-per-task | -c | 1 | Number of CPUs required per task |
| --mem | | | Amount of memory allocated per node. Different units can be specified using the suffix K, M, G, or T |
| --mem-per-cpu | | | Amount of memory allocated per CPU (for multicore jobs). Different units can be specified using the suffix K, M, G, or T |
| --constraint | -C | | Request nodes with the specified features. Multiple constraints may be specified with AND, OR, or matching OR. For example, --constraint="CPU_MNF:AMD", --constraint="CPU_MNF:INTEL&CPU_GEN:CLX" |
| --exclude | -x | | Explicitly exclude certain nodes from the resources granted to the job. For example, --exclude=SPG-2-[1-3] or --exclude=SPG-2-1,SPG-2-2,SPG-2-3 |
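As an illustration of how these directives combine, the header below requests memory per CPU rather than per node, restricts the job to nodes with the AMD feature, and excludes one node. The feature and node names are reused from the examples in the table and are site-specific, so treat this only as a sketch:

#!/bin/bash
#SBATCH --job-name=Directives_Example
#SBATCH --nodes=1
#SBATCH --ntasks=16
# 2 GB per allocated CPU instead of a per-node total
#SBATCH --mem-per-cpu=2G
# only nodes tagged with the AMD manufacturer feature
#SBATCH --constraint="CPU_MNF:AMD"
# skip this node even if it is idle
#SBATCH --exclude=SPG-2-1
#SBATCH --time=12:00:00
#SBATCH --partition=CPU
./my_program myinput.inp > myoutput.out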