Quick Start

This Quick Start Tutorial is meant to provide a very short introduction for those who are new to High Performance Computing, or who simply wish to get a refresher on the basics. It covers some concepts that are general to HPC, explains its basic philosophy, and should let you decide whether and how you can deploy it in your research.

Overview

What is HPC?

HPC stands for High Performance Computing and is synonymous with the more colloquial term Supercomputer. In turn, a Supercomputer is a somewhat loosely defined umbrella term for a computer that can perform computations and other information processing tasks much more quickly than a typical computing device we use in everyday life (e.g. a laptop or mobile phone). Typically, supercomputers are assembled as clusters, i.e. collections of powerful computer servers interconnected with fast network connections. Each server in a cluster is often referred to as a Compute Node. Each of the servers or nodes is essentially a workstation, though typically a much more capable one. For example, a standard laptop these days might have a CPU with 4-8 cores and 8-16 GB of RAM. Compare this with a standard compute node on the Shabyt cluster, which has a whopping 64 CPU cores and 256 GB of RAM. In addition, some of the compute nodes on Shabyt feature powerful GPU accelerators (Nvidia V100), which in certain tasks can perform number crunching at speeds that exceed those of CPUs by a factor of 5-10 or even more.

HPC cluster is a shared resource

Another major difference between a supercomputer and a personal laptop or desktop is that the supercomputer is a Shared Resource. This means there may be tens or even hundreds of users who access the supercomputer simultaneously. Each of them can connect to the HPC cluster from their own personal computer and run (or schedule) jobs on one or more of the cluster's compute nodes. You can probably guess that this shared resource model requires some form of coordination; otherwise, chaotic execution of computational tasks may lead to serious inefficiency and logistical disasters. This is why practically all supercomputers use Job Schedulers - software that controls the execution of tasks and makes sure the system is not overcommitted at any given time. Job Schedulers may also handle different users and tasks according to a predefined priority policy, thereby preventing unintended or unfair use of precious compute resources.
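On a SLURM-based cluster you can get a quick picture of this shared pool of resources with two standard, read-only commands (the exact output format varies between systems):

 sinfo     # list the cluster's partitions and the state of their compute nodes
 squeue    # list the jobs that you and other users currently have running or waiting

It is safe to run these at any time to see how busy the cluster is.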

Role of the job scheduler

A job scheduler (also known as a workload manager) is software used to manage the execution of user jobs. On all our HPC facilities at NU, we have deployed a scheduler called SLURM - a free and open-source job scheduler for Linux and Unix-like systems, used in many, if not most, supercomputers and computer clusters found in universities, research institutions, and commercial companies across the world. Users invoke SLURM by writing a Batch Script that requests a certain amount of compute resources (e.g., CPUs, RAM, GPUs, compute time) and includes the instructions for running their code. Users submit their scripts to the job scheduler, which then finds available resources on the supercomputer for each user's job. When the resources needed for a specific job become available, the scheduler initiates the commands included in the batch script and writes the results to a text file (which is essentially the equivalent of the screen output).
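As an illustration, a minimal SLURM batch script might look like the sketch below. The job name, partition name, resource amounts, and the Python script are placeholders rather than recommendations; check which partitions, limits, and software modules exist on your cluster before using them:

 #!/bin/bash
 #SBATCH --job-name=my_first_job        # a name for the job (placeholder)
 #SBATCH --partition=compute            # partition/queue to use (hypothetical name)
 #SBATCH --nodes=1                      # request one compute node
 #SBATCH --ntasks=1                     # run a single task (process)
 #SBATCH --cpus-per-task=8              # give that task 8 CPU cores
 #SBATCH --mem=32G                      # request 32 GB of RAM
 #SBATCH --time=02:00:00                # wall-clock time limit of 2 hours
 #SBATCH --output=my_first_job_%j.out   # file for the screen output (%j is the job ID)

 # Instructions for running your code go below, for example:
 module load python                     # load software via environment modules, if your cluster uses them
 python my_analysis.py                  # hypothetical analysis script

Saving this as my_first_job.sh and running sbatch my_first_job.sh hands the job over to SLURM, which queues it until the requested resources are free.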

Benefits of HPC - scaling up and automation

Supercomputers provide opportunities for data storage and parallel processing that far surpass what is possible on a standard workstation. These systems give researchers the ability to scale up or scale out their work.

Increasing the data throughput of a single job is known as scaling up. This may mean moving from a 500 GB database on a workstation to a 5 TB database on the HPC, or raising the resolution of your simulation by a factor of 10 or 100.

Other types of analyses may benefit from an increased number of jobs, such as performing parameter sweeps, running Monte Carlo simulations, or performing molecular dynamics simulations. Local machines are limited by the number of cores accessible to them, which restricts how many computations can run simultaneously compared to an HPC system. Increasing the number of CPUs used during an analysis is known as scaling out your work.
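For instance, a parameter sweep can often be expressed as a SLURM job array, in which the same script is launched many times with a different index. The script name, option, and array size below are only a sketch:

 #!/bin/bash
 #SBATCH --job-name=param_sweep         # hypothetical job name
 #SBATCH --array=0-99                   # launch 100 array tasks with indices 0..99
 #SBATCH --cpus-per-task=1              # each array task uses a single CPU core
 #SBATCH --time=01:00:00                # 1 hour limit per array task
 #SBATCH --output=sweep_%A_%a.out       # %A is the array job ID, %a is the task index

 # SLURM gives each array task its own index in SLURM_ARRAY_TASK_ID,
 # which the script can use to select a different parameter value.
 python sweep.py --param-index ${SLURM_ARRAY_TASK_ID}   # hypothetical script and option

Each array task is scheduled independently, so as many of them run at once as the cluster can accommodate.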

Automation is another feature of HPC systems that allows users to schedule jobs ahead of time and have those jobs run without supervision. Managing a workstation or keeping an SSH terminal active while scripts are running can lead to major complications during extended analyses. Batch scripts allow a prewritten set of instructions to be executed when the scheduler determines that sufficient resources are available. This lets long-running jobs execute for up to 10 days (a limit imposed by the scheduler). Real-time output is saved to a text file, allowing you to check the progress of the job. Checkpointing is recommended for jobs that require longer than 10 days.
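For example, a typical unattended workflow might look like the following. The job ID 12345 and the script name are placeholders; by default SLURM writes output to slurm-<jobid>.out unless the batch script overrides it with --output:

 sbatch my_job.sh        # submit the batch script; SLURM prints the assigned job ID
 squeue -u $USER         # check whether your jobs are pending or running
 tail -f slurm-12345.out # follow the real-time output of job 12345
 scancel 12345           # cancel the job if something has gone wrong

You can log out after sbatch returns; the job keeps running (or waiting in the queue) without an active SSH session.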

Common misconceptions
