Quick Start


This Quick Start tutorial provides a very short introduction for those who are new to High Performance Computing or would simply like a refresher on the basics. It covers some concepts that are general to HPC, explains its basic philosophy, and should help you decide whether and how you can use it in your research.

Overview

What is HPC?

HPC stands for High Performance Computing and is synonymous with the more colloquial term Supercomputer. A supercomputer, in turn, is a somewhat loosely defined umbrella term for a computer that can perform computations and other information processing tasks much more quickly than a typical everyday computing device (e.g. a laptop or mobile phone). Supercomputers are typically assembled as clusters: collections of powerful computer servers interconnected with fast network connections. Each server in a cluster is often referred to as a Compute Node. Each node is essentially a workstation, though typically much more capable. For example, a standard laptop these days might have a CPU with 4-8 cores and 8-16 GB of RAM. Compare this with a standard compute node on the Shabyt cluster, which has a whopping 64 CPU cores and 256 GB of RAM. In addition, some of the compute nodes on Shabyt feature powerful GPU accelerators (Nvidia V100), which for certain tasks can perform number crunching at speeds that exceed those of CPUs by a factor of 5-10x or even more.

An HPC cluster is a shared resource

Another major difference between a supercomputer and a personal laptop or desktop is that the supercomputer is a Shared Resource. This means there may be tens or even hundreds of users accessing the supercomputer simultaneously. Each of them can connect to the HPC cluster from their own personal computer and run (or schedule) jobs on one or more of the cluster's compute nodes. You can probably guess that this shared resource model requires some form of coordination; otherwise, chaotic execution of computational tasks may lead to serious inefficiency and logistical disasters. This is why practically all supercomputers use Job Schedulers: software that controls the execution of tasks and makes sure the system is not overcommitted at any given time. Job schedulers may also handle different users and tasks according to a predefined priority policy, thereby preventing unintended or unfair use of precious computing resources.

Role of the job scheduler

A job scheduler (also known as a workload manager) is software used to manage the execution of user jobs. On all our HPC facilities at NU we have deployed SLURM, a free and open-source job scheduler for Linux and Unix-like systems used in many, if not most, supercomputers and computer clusters found in universities, research institutions, and commercial companies across the world. Users interact with SLURM by writing a Batch Script that requests a certain amount of compute resources (e.g., CPU cores, RAM, GPUs, compute time) and includes the instructions for running their code. Users submit their scripts to the job scheduler, which then finds available resources on the supercomputer for each job. When the resources needed for a specific job become available, the scheduler executes the commands included in the batch script and writes the results to a text file (roughly the equivalent of screen output).
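For illustration, here is a minimal sketch of what such a batch script might look like. The partition name, module name, and file names are placeholders; consult the Systems page and your software's documentation for the actual values.

 #!/bin/bash
 #SBATCH --job-name=my_first_job        # a name for the job
 #SBATCH --partition=cpu                # placeholder partition name
 #SBATCH --nodes=1                      # run on a single compute node
 #SBATCH --ntasks=1                     # a single task (process)
 #SBATCH --cpus-per-task=4              # number of CPU cores for that task
 #SBATCH --mem=8G                       # memory for the job
 #SBATCH --time=01:00:00                # wall-clock time limit (hh:mm:ss)
 #SBATCH --output=my_first_job_%j.out   # %j is replaced by the job ID
 
 # load any environment modules your software needs (module name is a placeholder)
 module load python
 
 # run the actual computation
 python my_analysis.py

Such a script is saved in a file (e.g. my_first_job.sh, a placeholder name) and submitted to the scheduler with the sbatch command.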

Benefits of HPC - scaling up and automation

Supercomputers provide opportunities for parallel processing and data storage that greatly surpass what is possible on a standard laptop or desktop computer. This gives you the ability to scale up simulations (e.g. use higher resolution or increase the size/complexity of the model). Other types of analyses benefit not from increased model complexity but simply from the fact that you can execute more jobs at the same time. Common laptop/desktop machines are limited by the relatively small number of CPU cores available to them, which restricts the number of simultaneous computations compared to an HPC system.

Another benefit of using HPC is automation. Automation allows users to schedule jobs ahead of time; these jobs are then run without supervision. Managing a workstation or keeping an SSH session alive while scripts are running can lead to many inconveniences and complications during extended analyses. In contrast, a batch script is a prewritten set of instructions that is executed as soon as the scheduler determines that sufficient resources are available. This allows jobs with long completion times to run for many days (the actual time limit is set by the administrator's policy). Meanwhile, the real-time output is saved to a file, allowing the user to check the progress of the job. Lastly, the user can set up Checkpointing if the job requires execution longer than the allowed limit (e.g. 10 days).
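As a sketch of this workflow, the commands below show how a batch script might be submitted and monitored; the script name, output file name, and job ID are placeholders.

 # submit the batch script to the scheduler; SLURM prints the assigned job ID
 sbatch my_first_job.sh
 
 # list your own queued and running jobs
 squeue -u $USER
 
 # follow the job's output file in real time (file name is a placeholder)
 tail -f my_first_job_12345.out
 
 # cancel a job if needed, using its job ID
 scancel 12345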

Common misconceptions

If I move my code/software from a desktop computer to HPC cluster, it will automatically run faster

You might naively assume that if you simply move your code from a laptop/desktop computer to a supercomputer, it will automatically run faster. That is not always the case. In fact, you might be surprised to learn that sometimes it can run even slower, particularly for serial jobs. This is because the power of a supercomputer comes not from the clock speed of a single CPU core (which is typically not very high) but from the sheer volume of resources available: many more CPU cores in each node, multiple nodes that can be used, larger amounts of memory, the availability of GPUs, etc. Most often, performance gains come from rebuilding or optimizing your code to take advantage of parallelism, i.e. the additional CPU cores available on HPC. Parallelization allows a job to divide and conquer independent tasks by executing multiple threads or parallel processes. However, on the HPC, parallelization must almost always be explicitly coded or configured and invoked by your job; it is not automatic. This process is highly software-dependent, so you will want to research the proper method for running your program of choice in parallel.
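For example, a program built with MPI (one common parallelization framework) can be launched as many cooperating processes with srun. This is only a sketch under the assumption that ./my_mpi_program (a placeholder) was actually written and compiled to use MPI; the partition and module names are also placeholders.

 #!/bin/bash
 #SBATCH --job-name=mpi_example
 #SBATCH --partition=cpu          # placeholder partition name
 #SBATCH --nodes=2                # request two compute nodes
 #SBATCH --ntasks-per-node=64     # one MPI process per CPU core (64 cores per Shabyt node)
 #SBATCH --time=02:00:00
 
 # load an MPI module (module name is a placeholder)
 module load openmpi
 
 # srun launches the requested number of MPI processes across the allocated nodes;
 # this only helps if the program was built to use MPI
 srun ./my_mpi_program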

If I allocate more CPU cores to my job, my software will use them and the performance will scale up accordingly

Running a job with a large number of CPU cores when the software has not been configured to use them is a waste of your allocation, your time, and precious HPC resources. Software must be designed to use multiple CPU cores as part of its execution, so you need to make sure your software actually has that capability. The job scheduler only allocates the resources you requested; it is your responsibility to ensure that the code itself can use them as intended and take advantage of parallelism.
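For a multithreaded program (e.g. one built with OpenMP), configuring the software to use the allocated cores might look like the following sketch; the program name is a placeholder, and the key point is that the requested core count is passed to the program explicitly.

 #!/bin/bash
 #SBATCH --job-name=threads_example
 #SBATCH --partition=cpu          # placeholder partition name
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=16       # cores requested from the scheduler
 #SBATCH --time=01:00:00
 
 # tell an OpenMP-based program to actually start that many threads;
 # without this (or an equivalent program option) the extra cores may sit idle
 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
 
 ./my_threaded_program

After a job finishes, a tool such as seff <jobid> (if installed on the cluster) can report how much of the allocated CPU time was actually used, which quickly reveals requests that were larger than the software could exploit.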

If I run my job on a node that has GPU(s), it will automatically use them and run faster

The power of GPU computing, just like the power of using multiple CPU cores, comes from parallelism. GPUs typically have many thousands of specialized cores. To take advantage of them, the code must be capable of using them (e.g. through a software stack for a specific GPU architecture, such as Nvidia CUDA). It is not automatic.
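For reference, GPUs are also requested explicitly from the scheduler, and the application itself must be GPU-enabled to benefit. A minimal sketch, with placeholder partition, module, and program names:

 #!/bin/bash
 #SBATCH --job-name=gpu_example
 #SBATCH --partition=gpu          # placeholder name of the GPU partition
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=8
 #SBATCH --gres=gpu:1             # request one GPU on the node
 #SBATCH --time=01:00:00
 
 # load a CUDA toolkit module (module name is a placeholder)
 module load cuda
 
 # optional sanity check: show the GPU(s) visible to the job
 nvidia-smi
 
 # the program itself must be built with GPU support (e.g. CUDA) to benefit
 ./my_gpu_program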

All nodes on a supercomputer are the same

NU HPC facilities are equipped with different types of nodes. For example, the login node is available to all users by default upon login. It is designed for managing and editing files, compiling code, and interacting with the job scheduler; it is not designed to run production computations. In fact, running computationally intensive jobs on the login node can severely impact performance for other users and system processes, and it is prohibited by our policies. Instead, all heavy computations must be submitted to a job queue, and jobs are automatically distributed among the compute nodes by the job scheduler.

The compute nodes available on NU HPC facilities are also not all the same. For example, on Shabyt there are two types of compute nodes: those equipped with CPUs only and those that also have GPUs. Moreover, the CPU models in these two types of nodes differ: while the number of CPU cores is the same in all compute nodes of Shabyt, the GPU nodes feature CPUs with a somewhat lower clock speed. To separate jobs intended for CPU and GPU nodes, the scheduler is configured with two different partitions. It is the responsibility of users to submit their jobs to the correct SLURM partition. For example, if a job does not make any use of GPUs, it should be submitted to a CPU partition so that the expensive GPUs do not sit idle and can be used by those whose software is capable of taking advantage of them. Information about the hardware configuration of the nodes is available on the Systems page.
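To check which partitions exist and to direct a job to a specific one, commands along these lines can be used (the partition name shown is a placeholder; use the actual names reported by sinfo):

 # list the available partitions, their time limits, and node states
 sinfo
 
 # submit a job to a specific partition (partition name is a placeholder);
 # the same can be set inside the batch script with an #SBATCH --partition line
 sbatch --partition=cpu my_first_job.sh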

I can run my tasks interactively on a compute node (e.g. play with my Jupyter notebook)

In principle, interactive access and interactive execution can be realized on an HPC cluster. On sufficiently large systems it is sometimes enabled by either dedicating a certain number of nodes to interactive work or by preempting currently running jobs. However, this approach is not quite consistent with the general philosophy of HPC, which aims to achieve highly efficient utilization of expensive computational resources. Indeed, if a user steps through sections of a Jupyter notebook or types Matlab commands in a terminal one by one, the allocated CPUs will sit idle for a significant fraction of the time. A better approach is to do all the code development, interactive exploration, debugging, etc. on a workstation, make sure everything works as intended, and then execute the heavy production calculations as a batch job on the supercomputer.
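As one possible way to apply this workflow to a notebook-based analysis (a sketch that assumes Jupyter is available in your environment; file names are placeholders):

 # convert the notebook into a plain Python script on your workstation or the login node
 jupyter nbconvert --to script my_analysis.ipynb
 
 # then run the resulting script as a normal batch job on a compute node
 sbatch my_first_job.sh    # with "python my_analysis.py" as the command inside the script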

I cannot install my own software

Well, it depends. Users can build and install software in their own home directories, but not system-wide. They can also create custom environments and install packages for languages like Python and R using the built-in package managers, and they can use EasyBuild to build packages of their choice. Keep in mind that any software installed inside your home directory counts against your disk quota.
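For instance, a Python environment can be created inside a home directory along these lines (a sketch; the module name, environment path, and package are placeholders):

 # load a Python module provided on the cluster (module name is a placeholder)
 module load python
 
 # create and activate a virtual environment inside your home directory
 python -m venv $HOME/envs/myproject
 source $HOME/envs/myproject/bin/activate
 
 # install packages into the environment; this storage counts against your disk quota
 pip install numpy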

System-wide installation of software and the creation of modules are generally taken care of by the HPC team. If you would like a piece of software to be installed and made available to all HPC users, you can make a request through the ticketing system.

I can use an HPC cluster for immediate real-time processing of data supplied continuously from external sources

If you need to do heavy (or even not so heavy) real-time processing of data that arrives continuously from external sources, for example AI image recognition on 24/7 astronomical observations in order to trigger an action on your telescope within a second; collecting, processing in real time, and storing important medical patient data coming from hospitals across the country; or processing a large volume of bank transactions, then you must deploy a dedicated server or use a suitable cloud service provider. Practically all compute jobs on an HPC cluster are carried out in the background, when the resources for the job become available. Because jobs are put in a queue, it is generally impossible to predict exactly when the computations will start and finish. Queueing jobs and executing them in the background is precisely what enables maximum utilization of the expensive HPC equipment (up to 100%, if users submit a sufficient number of job requests). In contrast, dedicated mission-critical servers must always be available for one specific task and must overprovision resources for it, which leads to their underutilization over extended periods of time.