Quick Start
This Quick Start Tutorial provides a short introduction for those who are new to High Performance Computing, or who simply want a refresher on the basics. It covers some concepts that are general to HPC, explains its basic philosophy, and should help you decide whether and how you can use it in your research.
What is HPC?
HPC stands for High Performance Computing and is synonymous with the more colloquial term Supercomputer. A supercomputer, in turn, is a somewhat loosely defined umbrella term for a computer that is capable of performing computations and other information processing tasks much more quickly than the typical computing devices we use in everyday life (e.g., a laptop or mobile phone). Supercomputers are typically assembled as clusters: collections of powerful computer servers interconnected with fast network connections. Each server in a cluster is often referred to as a Compute Node. Each of these servers, or nodes, is essentially a workstation, though typically a much more capable one. For example, a standard laptop these days might have a CPU with 4-8 cores and 8-16 GB of RAM. Compare this with a standard compute node on the Shabyt cluster, which has a whopping 64 CPU cores and 256 GB of RAM. In addition, some of the compute nodes on Shabyt feature powerful GPU accelerators (Nvidia V100), which for certain tasks can perform number crunching at speeds that exceed those of CPUs by a factor of 5-10x or even more.
Another main difference between a supercomputer and a personal laptop or desktop is that a supercomputer is a Shared Resource. This means there may be tens or even hundreds of users accessing the supercomputer simultaneously. Each of them can connect to the HPC cluster from their own personal computer and run (or schedule) jobs on one or more of the cluster's compute nodes. As you can probably guess, this shared resource model requires some form of coordination; otherwise, chaotic execution of computational tasks would lead to serious inefficiency and logistical disasters. This is why practically all supercomputers use Job Schedulers: software that controls the execution of tasks and makes sure the system is not overcommitted at any given time. Job schedulers may also treat different users and tasks according to predefined priority policies, thereby preventing unintended or unfair use of precious computing resources.
Role of the job scheduler
A job scheduler (also known as a workload manager) is software used to manage the execution of user jobs. On all our HPC facilities at NU, we have deployed a scheduler called SLURM, a free and open-source job scheduler for Linux and Unix-like systems that is used in many, if not most, supercomputers and computer clusters at universities, research institutions, and commercial companies across the world. Users interact with SLURM by writing a Batch Script that requests a certain amount of compute resources (e.g., CPUs, RAM, GPUs, compute time) and includes the instructions for running their code. Users submit their scripts to the job scheduler, which then finds available resources on the supercomputer for each user's job. When the resources needed for a specific job become available, the scheduler executes the commands included in the batch script and writes the results to a text file (roughly the equivalent of the screen output you would see when running the program interactively).
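To make this more concrete, below is a minimal sketch of what a SLURM batch script might look like. The job name, resource amounts, and the script name my_script.py are placeholders for illustration only, and any partition, account, or module settings required on a specific cluster such as Shabyt are omitted here; consult the cluster documentation for the exact options to use.

    #!/bin/bash
    #SBATCH --job-name=my_first_job        # name shown in the queue
    #SBATCH --ntasks=1                     # run a single task (process)
    #SBATCH --cpus-per-task=4              # 4 CPU cores for that task
    #SBATCH --mem=8G                       # 8 GB of RAM
    #SBATCH --time=01:00:00                # wall-clock limit of 1 hour
    #SBATCH --output=my_first_job_%j.out   # %j expands to the job ID

    # Commands executed once the requested resources have been allocated
    echo "Running on host: $(hostname)"
    python3 my_script.py

Assuming this script is saved as my_job.slurm (a hypothetical file name), it would be submitted with "sbatch my_job.slurm", and the state of the job could then be checked with "squeue -u $USER". Once the job finishes, its text output appears in the file specified by the --output option.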