Job Submission

NU HPC clusters use the SLURM workload manager to schedule, distribute, and execute user jobs. SLURM (the name comes from Simple Linux Utility for Resource Management) is free and open-source software used by many, if not most, large HPC facilities throughout the world. Thus, if you happen to have run research calculations at some HPC facility elsewhere, it should be rather easy for you to migrate your jobs and start using NU HPC clusters. On a login node, users arrange their data, write batch scripts, and submit their jobs to the execution queue. Submitted jobs are put in a pending state until the requested system resources are available and allocated. SLURM schedules each job in the queue according to a predetermined site policy designed to balance competing user needs and to maximize efficient use of the cluster resources.

Absolutely all computational tasks on Shabyt (apart from compiling and very short test runs that use one or just a few cores) are supposed to be executed through the SLURM workload manager, which distributes them across the system in an optimal way. It is extremely important that users do not abuse the management/login node (ln01 in Shabyt) where they log in: do not run long, heavy calculations on it, either interactively or in the background. The function of the management node is to let users compile binaries, transfer data, manage files, prepare input for calculations, and submit jobs. The management node is not a workhorse for heavy calculations.

For a comprehensive guide on SLURM you can refer to its website. A short printable cheat sheet of some useful SLURM commands and parameters is available in this summary. Below we explain how jobs should be submitted for execution on NU HPC systems and provide some basic examples.

Partitions, QoS, and job limits

Each job’s position and progress in the queue is determined through the fairshare algorithm, which depends on a number of factors (e.g. size of the job, time requirement, job queuing time, partition, QoS, etc). For the information on the available partitions, Quality of Service (QoS), maximum job durations, and maximum number of simultaneously running jobs and CPU cores used, please refer to the corresponding sections on page Policies. Please note that the limits are subject to change.

It is important to keep in mind that your jobs cannot request a duration that exceeds the time limit set by our policies. Jobs that request execution times exceeding the limit may still be sent to the queue, but they will stay queued (i.e. remain in a pending state) forever until you change the requested time. Likewise, requesting more RAM for your job than what is physically available will result in your submitted job staying in a pending state forever. One cannot simultaneously use more CPU cores than what is allowed for a single user belonging to a specific QoS category. For example, if the total CPU core limit for a user is 256 cores and this user currently has four 64-core jobs running, then any newly submitted jobs will be placed in the queue and stay in a pending state. They will not start running even if there are resources available to execute them. They will remain pending until one or all of the four running jobs finish.
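
If a job seems to be stuck, you can ask SLURM why it is pending. A minimal sketch (the job ID 12345 below is just a placeholder):

squeue --me -t PENDING -o "%.10i %.12j %r"   # %r prints the reason a job is pending
scontrol show job 12345 | grep -i reason     # the same information for one specific job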

Job submission and monitoring

Jobs can be submitted for execution using a “batch” file. The batch file is essentially a Unix shell script (typically a bash script, but using other shells, e.g. tcsh, is possible as well) that, in addition to the actual user commands, contains a preamble (or header) written in a special format. This header, all lines of which begin with the keyword #SBATCH, contains batch directives: information about the resources requested for the job, user information, job name, etc. While bash and other Linux shells treat these lines beginning with #SBATCH as comments, they are not comments for SLURM. SLURM reads them when you submit a job for execution, interprets them, and acts accordingly. Note that if you change the format of those lines even slightly, e.g. so that they instead begin with # SBATCH or ##SBATCH, then SLURM no longer reads them and treats them as ordinary comments. This is convenient for making SLURM skip some lines without actually deleting them from your script.
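
For illustration, here is a minimal sketch of such a header (the values are arbitrary). Note how the second memory line is disabled with a double # without being deleted:

#!/bin/bash
#SBATCH --job-name=Example    # read and interpreted by SLURM
#SBATCH --mem=2G              # read and interpreted by SLURM
##SBATCH --mem=8G             # ignored by SLURM (treated as an ordinary comment)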

Most common and useful SLURM commands

A list of some of the most useful SLURM commands is provided below. Please be aware that many SLURM commands accept flags/arguments that further extend their functionality. Those can be explored by invoking help using the --help flag (e.g. sbatch --help) in the terminal window.

List of useful SLURM commands

sbatch <script_file_name>
    Submit a job using the specified script <script_file_name> (e.g. myscript.slurm).

sbatch --test-only <script_file_name>
    Test the SLURM script for errors without actually submitting a job.

scancel <job_id>
    Cancel a running or queued job by specifying its job ID (a four or five digit number).

squeue
    View the status of all user jobs in the queue, including their state (running, pending, etc.).

squeue -u <user_name>
    View the status of jobs belonging to user <user_name> (e.g. john.smith).

squeue -l
    A longer output format for the squeue command.

squeue -o "%.6i %.6P %.12j %.19u %.2t %.11M %.8Q %.5D %.4C %.11R"
    An example of an even more detailed, user-defined output format for squeue. To avoid typing such a long command every time, you can define an alias in your shell profile file .bashrc by adding a line such as alias mysqueue='squeue -o "%.6i %.6P %.12j %.19u %.2t %.11M %.8Q %.5D %.4C %.11R"'

sinfo
    Display information about the cluster, including nodes, partitions, and their status.

scontrol show job <job_id>
    Show detailed information about a specific job.

squeue --start
    Show the estimated start times of pending jobs.

Sample SLURM batch scripts

Running a simple serial (single-threaded) job

In this example we will prepare and submit a simple job that executes a serial Python program. For that we will first create a directory called test_serial inside our home directory, with the full path /shared/home/<your.name>/test_serial. It is always advisable to run each job in a separate directory to avoid confusion and mixed-up files.

Now let us enter the directory test_serial and create a short Python script called my_python_script.py. When executed, this script prints a "Hello, world!" message once a second for one minute before it completes.

import time

nsec = 60  # print one message per second for 60 seconds

for i in range(1, nsec + 1):
    print(f"{i}: Hello, world!")
    time.sleep(1)

Now let us create a SLURM script (we will use bash for this) called batch_serial.slurm with the following content.

#!/bin/bash

# ---------------------------------------------------
# Directives section that requests specific resources
# ---------------------------------------------------

#SBATCH --job-name=MySerialJob           # Your job's name
#SBATCH --time=0-2:00:00                 # Time requested: 0 days + 2 hours + 0 minutes + 0 seconds
#SBATCH --mem=1G                         # Memory requested: 1 Gigabyte
#SBATCH --ntasks=1                       # Run a single task
#SBATCH --partition=CPU                  # Specify partition name
#SBATCH --output=stdout%j.out            # Specify file name for standard screen output
#SBATCH --error=stderr%j.out             # Specify file name for error messages
#SBATCH --mail-type=END,FAIL             # Specify when automatic email notification should be sent
#SBATCH --mail-user=your.name@nu.edu.kz  # Specify your email address where notifications are sent

# ---------------------------------------------------
# Your code section
# ---------------------------------------------------

# Load a module with a specific Python version. Technically, this is not required for our example
# because the operating system's default Python is already available on all compute nodes of Shabyt
# without loading anything. However, we will do it here for illustrative purposes.
  
module load Python/3.11.5-GCCcore-13.2.0

# Below we execute our Python program. By default, SLURM executes the commands in this
# batch script in the same directory from which the script was submitted with sbatch.
# However, if you need to, you can change to a specific directory in this script, or use
# a full path to your Python program. Moreover, if you wish, you can define your own Unix
# shell variables, copy or move files, use loops or conditional statements, execute shell
# commands, etc. -- i.e. do anything that one can do in a Unix shell script in order to
# pre-process and post-process your files and data. In this example we simply execute
# our Python program using Python 3 in the current directory.

python3 ./my_python_script.py

We can now submit our job from the terminal by typing sbatch batch_serial.slurm. If everything goes well and the job is accepted by SLURM, you should see a confirmation message, e.g. Submitted batch job XXXXX, where XXXXX is the ID assigned to your job. You can check the queue to see whether your job is executed immediately or is placed in the queue (pending status); for this you can type, e.g., squeue --me in the terminal. Once your job starts executing, it should finish within about one minute, because this is how we wrote our Python program. After the job execution is complete, it will no longer appear in the list of jobs that you see with the squeue command. In the end you will find two files in the directory test_serial. These are stderrXXXXX.out and stdoutXXXXX.out. The first one should normally be empty, unless there were errors or exceptions during the execution. The second one contains the standard screen output, i.e. all sixty "Hello, world!" messages are printed there.
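
The whole sequence might look as follows in the terminal (XXXXX stands for the job ID that SLURM assigns):

sbatch batch_serial.slurm    # prints: Submitted batch job XXXXX
squeue --me                  # check whether the job is running (R) or pending (PD)
ls                           # after completion: stderrXXXXX.out and stdoutXXXXX.out appear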

Note that the maximum time we requested for our job in the above SLURM script was 2 hours. If the batch script finishes sooner than 2 hours after it begins execution, the job terminates automatically and SLURM goes ahead and uses the freed resources for some other job in the queue. If the execution of the batch script does not finish within the requested time, the job is forced to terminate. If you perform long calculations and do not save the intermediate data they produce, then you must set the maximum time long enough to complete the task. On the other hand, setting the maximum time to be much longer than you actually need might sometimes increase the wait time of your job in the queue. Thus users must use reasonable judgement when they set the maximum time for their jobs. The same applies to other parameters/resources, such as the amount of memory. Requesting too little memory may cause your job to be killed prematurely, while requesting too much (e.g. requesting 100 GB while your program actually needs only 100 kB) may increase the wait time in the queue. Requesting too much memory may also prevent SLURM from starting jobs submitted by other users due to the reduced amount of memory available to them. Again, users need to use reasonable judgement when they request resources (time, memory, CPU cores, etc.) for their jobs.

In the above example the standard output goes to the file called stdoutXXXXX.out (the %j in the line #SBATCH --output=stdout%j.out is substituted with the unique job ID assigned to your job by SLURM, a four or five digit number). If different parts of your batch script need to output to different files, you can always use redirection; e.g. you could have something like this in your batch script:

python3 ./my_python_script_1.py > output_file.txt
python3 ./my_python_script_2.py > other_output_file.txt

The above batch script example contains the directive #SBATCH --mail-type=END,FAIL. It tells SLURM that the user should be sent an automatic email in the event that the job either ends normally or fails during execution. This is handy if you do not want to sit by a computer and constantly check the status of your calculations. If you would also like to be notified when the job starts (useful if your job waits in the queue for a long time before SLURM executes it), you can use the directive #SBATCH --mail-type=BEGIN,END,FAIL instead. The email address to which the messages are sent is given in the next line of the batch script, #SBATCH --mail-user=your.name@nu.edu.kz (do not forget to replace it with your actual email).

Running an SMP parallel job

SMP (Symmetric Multiprocessing) parallelism is a form of parallel computing where multiple processors (or CPU cores) share a single, common memory space and work together to perform tasks. For example, in the Shabyt cluster, each node (i.e. server) has two 32-core CPUs. Each core, in turn, can simultaneously run up to two threads. So overall, one can run an application on such a node using up to 2 x 32 x 2 = 128 parallel threads. Note that SMP parallelism does not extend to multiple nodes: different nodes do not share memory. Therefore 128 parallel threads is a hard limit for SMP jobs executed on the Shabyt cluster.
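
If you would like to verify this topology yourself, one way (a sketch, assuming you are on a Shabyt compute node, e.g. inside an interactive job) is:

lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
# Given the hardware described above, this should report, roughly:
#   CPU(s):               128
#   Thread(s) per core:   2
#   Core(s) per socket:   32
#   Socket(s):            2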

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory parallel programming in C, C++, and Fortran. It allows developers to write parallel code easily by adding compiler directives (pragmas) that tell the compiler which parts of the code should run in parallel. OpenMP is widely used in scientific and engineering applications to facilitate SMP parallelism on multi-core CPUs by dividing tasks among threads.

In the following example we will create a C program that will be compiled and then executed using an appropriate batch script. Let us create a directory called test_smp. Inside that directory create a file called my_smp_program.c containing the following C code:

#include <stdio.h>
#include <omp.h>

/* This program uses multiple threads to compute the sum of squares of the
numbers ranging from 1 to n. The result is printed on the screen. The
program also prints the number of parallel threads that were used. */

int main() {
    int n = 1000000;      // n - number of elements (small enough that the sum still fits in long long)
    long long sum = 0;    // sum - result; long long is used because the result exceeds the int range

    // Parallel region with OpenMP to print the number of threads
    #pragma omp parallel
    {
        #pragma omp single  // Ensure only one thread prints the message
        {
            int num_threads = omp_get_num_threads();
            printf("Number of threads used: %d\n", num_threads);
        }
    }

    // Parallel region to calculate the sum of squares
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        sum += (long long)i * i;   // cast to long long to avoid int overflow in i * i
    }

    printf("The sum of squares from 1 to %d is: %lld\n", n, sum);
    return 0;
}

First, we need to compile this program and generate an executable binary file. This can be done using any modern C compiler that supports OpenMP (such as gcc). While it would be perfectly OK to use the system default gcc compiler for this, for illustrative purposes let us invoke a newer version of the GCC toolchain, GCC 13.2, which is available as a module on NU HPC clusters. The following terminal commands load the corresponding GCC module and then build the binary:

module load GCC/13.2.0
gcc -O3 -fopenmp my_smp_program.c -o my_smp_program

If everything goes well, an executable file called my_smp_program should appear in the same directory after the compilation.
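
Before submitting it through SLURM, you can optionally do a very short sanity run directly on the login node (such short test runs are permitted there), e.g. with only two threads:

OMP_NUM_THREADS=2 ./my_smp_program

Let us then create a SLURM batch script called batch_smp.slurm that has the following content: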

#!/bin/bash
#SBATCH --job-name=MySMPJob              # Your job's name
#SBATCH --time=0-0:30:00                 # Time requested: 0 days + 0 hours + 30 minutes + 0 seconds
#SBATCH --mem=1G                         # Memory requested: 1 Gigabyte
#SBATCH --nodes=1                        # Run on a single node
#SBATCH --ntasks=1                       # Run a single task
#SBATCH --cpus-per-task=128              # Request all 128 threads on the node
#SBATCH --partition=CPU                  # Specify partition name
#SBATCH --output=stdout%j.out            # Specify file name for standard screen output
#SBATCH --error=stderr%j.out             # Specify file name for error messages
#SBATCH --mail-type=END,FAIL             # Specify when automatic email notification should be sent
#SBATCH --mail-user=your.name@nu.edu.kz  # Specify your email address where notifications are sent

# Load module GCC 13.2
module load GCC/13.2.0

# Set the number of threads for OpenMP to match the cpus-per-task directive
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run the program
./my_smp_program
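
Submit the job with sbatch batch_smp.slurm. If SLURM grants all 128 logical CPUs, the standard output file should contain something like the lines below (the exact sum is n(n+1)(2n+1)/6 for n = 1000000):

Number of threads used: 128
The sum of squares from 1 to 1000000 is: 333333833333500000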

Distributed Memory Parallelism (MPI) Job

Message Passing Interface (MPI) is a standardized and portable communication protocol designed for parallel computing. In the MPI paradigm, multiple parallel processes can run on different hosts and communicate over the network by passing messages containing data. Each process has its own memory that is not accessible to other processes. MPI is designed to work across various platforms and architectures, making it widely used in high-performance computing. One of the most popular implementations of MPI is Open MPI (not to be confused with OpenMP).

In the following example we will use a simple Fortran program that makes calls to MPI library functions and execute it in parallel on multiple nodes (servers).

Let us create a directory called test_mpi. Inside that directory create a file called my_mpi_program.f90 containing the following Fortran code:

program my_mpi_program
    use mpi
    implicit none

    integer :: rank, num_procs, ierr, N, local_start, local_end
    integer :: local_sum, global_sum, i

    ! Predefined value of N (can be adjusted as needed)
    N = 5555

    ! Initialize MPI
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

    ! Print the number of parallel MPI processes being executed
    if (rank == 0) then
        print *, "Running with", num_procs, "parallel MPI processes"
    end if

    ! Divide the range among processes
    local_start = rank * (N / num_procs) + 1
    local_end = min((rank + 1) * (N / num_procs), N)

    if (rank == num_procs - 1) then
        local_end = N  ! Last process takes any leftover elements
    end if

    ! Each process computes its local sum
    local_sum = 0
    do i = local_start, local_end
        local_sum = local_sum + i
    end do

    ! Reduce the local sums to a global sum in process 0
    call MPI_Reduce(local_sum, global_sum, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

    ! Process 0 prints the result
    if (rank == 0) then
        print *, "The sum of integers from 1 to", N, "is", global_sum
    end if

    call MPI_Finalize(ierr)
end program my_mpi_program

This simple code computes the sum of the numbers from 1 to 5555 and prints the result on the screen. It also prints the number of parallel processes being used. To compile this program we will use a toolchain called foss (the name comes from Free and Open Source Software). Among many other things, the foss/2023b toolchain (available as a module on NU HPC systems) includes the gcc and gfortran compilers version 13.2 and the Open MPI library version 4.1.6. We will use the standard mpif90 wrapper to compile the code as follows:

module load foss/2023b
mpif90 -O3 my_mpi_program.f90 -o my_mpi_program

If everything goes well, the second command should generate a binary file called my_mpi_program.
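
As with the OpenMP example, an optional very short sanity check with just a couple of processes is possible on the login node before submitting the real job (assuming the foss/2023b module is still loaded):

mpirun -np 2 ./my_mpi_program

We can then create the following SLURM batch script to execute the program in parallel using mpirun.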

#!/bin/bash
#SBATCH --job-name=MyMPIJob              # Your job's name
#SBATCH --time=0-0:30:00                 # Time requested: 0 days + 0 hours + 30 minutes + 0 seconds
#SBATCH --mem-per-cpu=1G                 # Specify amount of memory per each MPI process
#SBATCH --nodes=3                        # Number of nodes requested
#SBATCH --ntasks=192                     # Total number of MPI processes (3 nodes × 64 cores per node)
#SBATCH --exclusive                      # Requested nodes can be used by your job only 
#SBATCH --partition=CPU                  # Specify partition name
#SBATCH --output=stdout%j.out            # Specify file name for standard screen output
#SBATCH --error=stderr%j.out             # Specify file name for error messages
#SBATCH --mail-type=END,FAIL             # Specify when automatic email notification should be sent
#SBATCH --mail-user=your.name@nu.edu.kz  # Specify your email address where notifications are sent

# Load module foss/2023b that was used to build my_mpi_program binary
module load foss/2023b

# Set the number of parallel processes to match the ntasks directive
export NPROCS=$SLURM_NTASKS

# Run the program
mpirun -np $NPROCS ./my_mpi_program

The above script requests that each node hosts 64 processes, so that the total number of parallel MPI processes across all three nodes is 192.
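
If you prefer to state the per-node distribution explicitly rather than rely on SLURM's placement, the --ntasks-per-node directive can be used; the following pair of lines is an equivalent way to request the same layout:

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=64    # 3 nodes × 64 tasks each = 192 MPI processes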

GPU Job

Below is an example of a job that uses an Nvidia V100 GPU, available in the partition called NVIDIA in the Shabyt cluster. It runs a CUDA program that computes a matrix-matrix multiplication. First, let us create a directory called test_gpu. In that directory, create a file called my_cuda_program.cu with the following content:

#include <cuda_runtime.h>
#include <iostream>

// CUDA kernel for matrix multiplication
__global__ void matMulKernel(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0;
        for (int i = 0; i < N; i++) {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

void matMul(float *A, float *B, float *C, int N) {
    float *d_A, *d_B, *d_C;
    size_t size = N * N * sizeof(float);

    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    dim3 blockDim(16, 16);
    dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

    matMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

int main() {
    const int N = 512;
    float *A = new float[N * N];
    float *B = new float[N * N];
    float *C = new float[N * N];

    for (int i = 0; i < N * N; i++) {
        A[i] = 1.0f;
        B[i] = 2.0f;
    }

    matMul(A, B, C, N);

    std::cout << "Result C[0]: " << C[0] << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

To compile the above program we will use the nvcc compiler, which is part of the CUDA 12.1.1 toolkit available as a module on the Shabyt cluster. This can be done as follows:

module load CUDA/12.1.1
nvcc -O3 my_cuda_program.cu -o my_cuda_program
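
Note that nvcc only becomes available after the CUDA module is loaded; if in doubt, you can check which toolkit version is active with:

nvcc --version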

If everything goes well, you should have a binary file called my_cuda_program generated in the current directory. Then let us create a SLURM batch script batch_gpu.slurm with the following content:

#!/bin/bash
#SBATCH --job-name=MyGPUJob              # Your job's name
#SBATCH --ntasks=1                       # Run a single task      
#SBATCH --gres=gpu:1                     # Allocate one GPU for the job
#SBATCH --cpus-per-task=4                # Allocate four CPU threads
#SBATCH --mem=16G                        # Memory requested: 16 Gigabytes
#SBATCH --time=0-00:10:00                # Time requested: 0 days + 0 hours + 10 minutes + 0 seconds
#SBATCH --partition=NVIDIA               # Specify partition name 
#SBATCH --output=stdout%j.out            # Specify file name for standard screen output
#SBATCH --error=stderr%j.out             # Specify file name for error messages
#SBATCH --mail-type=END,FAIL             # Specify when automatic email notification should be sent
#SBATCH --mail-user=your.name@nu.edu.kz  # Specify your email address where notifications are sent

# Load CUDA module that was used to build binary file
module load CUDA/12.1.1

# Run the program
srun ./my_cuda_program

After you submit this SLURM batch script with the terminal command sbatch batch_gpu.slurm, it goes to the queue. When the job starts, it should complete almost immediately (the calculation is rather simple and quick) and print the result to the file stdoutXXXXX.out.
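
Since every element of A is 1.0 and every element of B is 2.0, each element of C is the dot product of N ones with N twos, i.e. 512 × 1 × 2 = 1024, so the output file should contain:

Result C[0]: 1024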