Job Submission 1

SLURM Job Scheduler

All computations on Shabyt (apart from quick test runs) must be executed via the workload manager software, which distributes them across the system in an optimal way. It is extremely important that users do not abuse the management node (mgmt01), where they log in, by running long, heavy calculations on it either interactively or in the background. The function of the management node is to let users compile binaries, copy data, prepare input files, and submit jobs. The management node is NOT a workhorse for heavy calculations.

Shabyt uses the SLURM workload manager to schedule, distribute, and execute user jobs. SLURM (the name comes from Simple Linux Utility for Resource Management) is free and open-source software used by many, if not most, large HPC facilities throughout the world. Thus, it should be easy for NU users to migrate their jobs from other facilities if they have been using computational resources elsewhere.

A complete guide and manual for SLURM is available on the SchedMD website. Shabyt users are strongly encouraged to read these documents to familiarize themselves with the software.

In most instances, user interaction with SLURM amounts to executing just a few basic terminal commands. These are summarized below in a table:

Command Description
sbatch <scriptfile> Submits a job (defined by the script file <scriptfile>) to the queue
squeue Shows the status of all currently running and queued jobs
squeue --job <JobID> Shows the status of a specific job (integer <JobID> is the ID of the job)
scancel <JobID> Deletes a specific job. Note that each user can delete only their own jobs
scancel -u <username> Deletes all jobs of the user <username>
sinfo Shows information about nodes and partitions
scontrol show job <JobID> Shows the details of a specific job
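A typical interaction with the queue might therefore look like the short session sketched below. The script name my_job.sh and the job ID 12345 are placeholders used for illustration only.

  sbatch my_job.sh          # submit the job; SLURM replies with "Submitted batch job 12345"
  squeue -u $USER           # list all of your running and queued jobs
  scontrol show job 12345   # inspect the details of the submitted job
  scancel 12345             # remove the job from the queue if it is no longer needed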

Even though it is technically possible to use SLURM without scripts, users are advised to write a script file for each job they will be executing. While the exact content of such scripts and their complexity may vary depending on the nature of each user's work, we provide a few basic examples that illustrate typical scenarios.

Partitions

Currently, there are two available partitions on Shabyt:

1. CPU: This partition includes 20 nodes equipped with CPUs only.
2. NVIDIA: This partition consists of 4 GPU nodes. All jobs requiring GPU computations must be queued to this partition. While it is possible to run jobs that need CPUs only in this partition, users are discouraged from doing so to ensure efficient utilization of the system.
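For illustration, a job header targeting the NVIDIA partition might look like the minimal sketch below. It assumes that GPUs are requested through SLURM's generic resource mechanism (--gres=gpu:N), which is a common configuration but should be confirmed for Shabyt; the program name and resource amounts are placeholders to be adapted to the actual job.

  #!/bin/bash
  #SBATCH --job-name=Test_GPU
  #SBATCH --partition=NVIDIA
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --gres=gpu:1      # request one GPU (assumes GPUs are defined as generic resources)
  #SBATCH --time=1-0:00:00
  #SBATCH --mem=16G

  ./my_gpu_program          # placeholder for the actual GPU-enabled executable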

Examples of Batch Scripts

Below is an example of a batch script for a serial single-thread job. It requests 1 core and 5 GB of RAM and executes a program called my_program. The standard screen output will be written to a file called stdoutXXXXX.out and the standard error to stderrXXXXX.out, where XXXXX is the ID of the job.

  #!/bin/bash
  #SBATCH --job-name=Test_Serial
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --time=3-0:00:00
  #SBATCH --mem=5G
  #SBATCH --partition=CPU
  #SBATCH --output=stdout%j.out
  #SBATCH --error=stderr%j.out
  #SBATCH --mail-type=END,FAIL
  #SBATCH --mail-user=my.email@nu.edu.kz
  #SBATCH --get-user-env
  #SBATCH --no-requeue

  pwd; hostname; date
  cp myfile1.dat myfile2.dat
  ./my_program myfile2.dat
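Assuming the script above is saved under a name such as serial_job.sh (the file name is arbitrary), the job is submitted and monitored with the commands from the table above:

  sbatch serial_job.sh      # submit the job; SLURM prints the assigned <JobID>
  squeue --job <JobID>      # check the status of this particular job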