Policies


Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.

Acceptable use

The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.

Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, as well as administrative or even legal action.

Storage quotas

Home directory

Users’ home directories are physically stored on fast SSD arrays that offer very high bandwidth and enterprise-class flash endurance.

On the Irgetas and Shabyt clusters, the main storage servers are connected to the system via InfiniBand, as are all compute nodes. This provides very high bandwidth both when users access their data from the login node and when their jobs run on compute nodes under SLURM.

In the Muon cluster, the main SSD storage resides in the login node, with all SSDs connected via fast U.2 interfaces. However, Muon's compute nodes are connected to the login node over a comparatively slow link (1 Gbit/s Ethernet), so batch jobs cannot read or write data faster than this network allows.

Default quota for users’ home directories on NU HPC systems
System    Path                        Default storage limit
Irgetas   /home/<username>            400 GB
Shabyt    /shared/home/<username>     100 GB
Muon      /shared/home/<username>     250 GB

In exceptional cases, users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk's ticketing system; such requests are reviewed on an individual basis and approved only when well justified.

Checking your storage quota

On Shabyt and Muon, the following terminal commands can be used to check your own or a group member's home-directory storage quota, as well as how much of it is currently in use:

beegfs-ctl --getquota --uid $(id -u)

beegfs-ctl --getquota --uid $(id -u <username>)
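
The first command reports your own quota and current usage; the second, given a username, reports the same for a specific group member. If you need a quick overview for a whole research group, the same tool can be wrapped in a short loop. The snippet below is a minimal sketch: <researchgroupname> is a placeholder for your group's POSIX group name, and it only covers accounts listed as explicit members of that group.

GROUP=<researchgroupname>     # placeholder: replace with your actual group name

for member in $(getent group "$GROUP" | cut -d: -f4 | tr ',' ' '); do
    echo "=== $member ==="
    beegfs-ctl --getquota --uid $(id -u "$member")
done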

Additional storage - zdisk, datahub

In the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array is not connected via InfiniBand; data access and transfer speeds from both the login node and the compute nodes are therefore limited to standard 1 Gbit/s Ethernet speeds. In /zdisk, each research group has a shared allocation, which is particularly handy when data needs to be transferred, exchanged, or shared within a research group. Similarly, in the Irgetas cluster there is a shared directory, /datahub, where each research group has an allocation for additional storage on an external HDD array.
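
For example, results can be staged from a home directory into the group's shared allocation with an ordinary copy or rsync on the login node; <researchgroupname> below is a placeholder for your group's directory name.

rsync -av ~/results/ /zdisk/<researchgroupname>/results/       # on Shabyt or Muon
rsync -av ~/results/ /datahub/<researchgroupname>/results/     # on Irgetas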

Default storage quota for group directories (/zdisk and /datahub) on NU HPC systems
System    Path                           Default storage limit
Irgetas   /datahub/<researchgroupname>   1 TB
Shabyt    /zdisk/<researchgroupname>     1 TB
Muon      /zdisk/<researchgroupname>     1 TB

Again, in exceptional cases individual users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receipt of a ticket submitted by the PI via the NU Helpdesk.


Data Integrity and Backup

Users are fully responsible for the integrity and safety of their data stored on NU HPC facilities. Although our clusters employ enterprise-grade hardware, failures remain possible. Home directories (/shared/home) are automatically backed up several times per week. Please note that this policy does not cover group storage allocations in /zdisk and /datahub. In the event of a major hardware failure, access to your data may be unavailable for an extended period while the system is under repair. In some cases, full recovery may take days or even weeks. Furthermore, no storage system is 100% reliable. For this reason, we strongly recommend that you maintain your own backups of important or irreplaceable data on your personal computer or other secure storage solutions. Regular personal backups will help ensure data safety and minimize disruption in case of unexpected system issues.
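
One straightforward way to keep such a personal copy is to pull important directories from the cluster to your own machine with rsync over SSH. The command below is a minimal sketch to be run on your local computer; <login-node-address> and the directory names are placeholders that depend on the cluster you use and on how your data is organized.

rsync -avz <username>@<login-node-address>:/shared/home/<username>/my_project/ ~/hpc_backups/my_project/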

Partitions

A partition in SLURM essentially means a queue: a logical grouping of compute nodes that share the same access rules and limits. Users submit jobs to a partition, and SLURM schedules them on nodes belonging to that partition. On NU HPC systems partitions group compute nodes that have identical hardware.
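
The partitions visible to your account, together with their time limits, node counts, and per-node hardware, can be listed with the standard SLURM query tool, for example:

sinfo -o "%P %l %D %c %m %G"     # partition, time limit, node count, CPUs per node, memory per node, GPUs (GRES)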

Irgetas

The Irgetas cluster has two available partitions for user jobs.

  • ZEN4 : This partition includes 10 CPU-only nodes. Each node has two 96-core AMD EPYC 9684X CPUs.
  • H100 : This partition consists of 6 GPU nodes. Each node has two 96-core AMD EPYC 9454 CPUs and four NVIDIA H100 GPUs. All Irgetas jobs requiring GPU computations must be submitted to this partition. While it is possible to run CPU-only jobs in this partition, users are highly discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the H100 partition is justified only if this partition sits idle for a very long time while the ZEN4 partition is heavily crowded with many jobs waiting in the queue.
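
As an illustration, a minimal batch script requesting a single GPU in the H100 partition might look like the sketch below. The module and executable names are placeholders, and GPUs are assumed to be requested with the generic --gres=gpu:N syntax; consult the cluster documentation for the exact GRES names configured on Irgetas.

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=H100
#SBATCH --gres=gpu:1              # one of the four H100 GPUs on a node
#SBATCH --cpus-per-task=8
#SBATCH --time=1-00:00:00         # must stay within the 4-day limit of this partition
#SBATCH --output=%x-%j.out

# module load <your-software-environment>    # placeholder: module names are site-specific

srun ./my_gpu_program                         # placeholder executable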

Shabyt

The Shabyt cluster has two available partitions for user jobs.

  • CPU : This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC 7502 CPUs.
  • NVIDIA : This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC 7452 CPUs and two NVIDIA V100 GPUs. All Shabyt jobs requiring GPU computations must be submitted to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition is justified only if this partition sits idle for a very long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
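
For comparison, a minimal CPU-only script for the Shabyt CPU partition could look as follows; the executable name is a placeholder.

#!/bin/bash
#SBATCH --job-name=cpu-test
#SBATCH --partition=CPU
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64      # one MPI rank per physical core of the two 32-core CPUs
#SBATCH --time=2-00:00:00         # well within the 14-day limit of this partition
#SBATCH --output=%x-%j.out

srun ./my_mpi_program             # placeholder executable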

Muon

The Muon cluster has a single partition.

  • HPE : This partition includes all ten compute nodes, each with a single 14-core Intel Xeon CPU.

Quality of Service (QoS)

Users belonging to different university units and research groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category in SLURM.
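
Which QoS your account maps to, and which limits each QoS enforces, can be inspected with SLURM's accounting tools, for example with the commands below. If your account is entitled to a non-default QoS, it can be requested in a job script with, e.g., #SBATCH --qos=hpcnc.

sacctmgr show assoc user=$USER format=cluster,account,qos       # which QoS categories your account can use
sacctmgr show qos format=name,maxjobspu,maxtrespu%40,priority   # per-QoS limits and priorities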

Irgetas

The Irgetas cluster has four active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)
  • issai : Members of the Institute of Smart Systems and Artificial Intelligence
  • stud : Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)

Shabyt

The Shabyt cluster has three active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)
  • stud : Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)

Muon

The Muon cluster has two active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)

Job time limits

The following table lists maximum allowed job durations (wall time) in different partitions of NU HPC systems, as well as key characteristics (RAM, number of cores, number of GPUs) for compute nodes in each partition.

Time limits for jobs in different partitions of NU HPC systems
System    Partition   Max job duration      Nodes       Max CPU cores   Max threads   RAM        GPUs
                                            available   per node        per node      per node   per node
Irgetas   ZEN4        7 days (168 hours)    10          192             384           384 GB     n/a
Irgetas   H100        4 days (96 hours)     6           192             384           768 GB     4
Shabyt    CPU         14 days (336 hours)   20          64              128           256 GB     n/a
Shabyt    NVIDIA      2 days (48 hours)     4           64              128           256 GB     2
Muon      HPE         14 days (336 hours)   10          14              28            64 GB      n/a
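
The wall time a job requests is set with the --time option in days-hours:minutes:seconds format. If the request exceeds the partition's maximum, SLURM will not start the job; it typically remains pending with a reason such as PartitionTimeLimit. For example, either of the following lines could appear in a job script:

#SBATCH --time=7-00:00:00     # 7 days, the maximum allowed in the Irgetas ZEN4 partition
#SBATCH --time=36:00:00       # 36 hours, within the 48-hour limit of the Shabyt NVIDIA partition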

Limits on the number of jobs, cores, threads, and GPUs

All limits on the number of simultaneously running jobs, CPU cores used, GPUs used, and job priorities are listed below for all clusters and QoS categories.

Maximum number of simultaneously running jobs, CPU cores, and threads for NU HPC systems
System    QoS         Partition     Max running jobs   Max CPU cores   Max threads   Max GPUs   Job launch
                                    per user           per user        per user      per user   priority
Irgetas   hpcnc       ZEN4          12                 576             1152          n/a        10
Irgetas   nu          ZEN4          12                 576             1152          n/a        10
Irgetas   issai       ZEN4          12                 576             1152          n/a        10
Irgetas   hpcnc       H100          12                 576             1152          12         10
Irgetas   nu          H100          12                 576             1152          12         10
Irgetas   issai       H100          12                 1152            2304          24         30
Shabyt    hpcnc       CPU, NVIDIA   40                 1280            2560          8          10
Shabyt    nu          CPU, NVIDIA   12                 256             512           8          5
Shabyt    stud        CPU, NVIDIA   4                  128             256           4          5
Muon      hpcnc, nu   HPE           40                 140             280           n/a        10

The CPU core, thread, and GPU maxima are totals across all of a user's simultaneously running jobs. A higher job launch priority value means a waiting job moves up the queue faster.
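
To see how close you are to these limits, you can list your own running and pending jobs together with their partition, QoS, state, and allocated CPUs, for example:

squeue --me -o "%i %P %q %T %C %l %R"     # job ID, partition, QoS, state, CPUs, time limit, reason/nodes

On older SLURM installations that lack the --me option, use squeue -u $USER instead.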


Acknowledgments in publications

If computational resources provided by Nazarbayev University Research Computing (NU RC) were essential to research reported in a publication, please include an acknowledgment, typically in the same section where funding sources are acknowledged. Example wordings are given below; feel free to adapt them, but make sure the exact phrase Nazarbayev University Research Computing appears:

  • The authors acknowledge the use of computational resources provided by Nazarbayev University Research Computing.
  • A.B. and C.D. acknowledge the use of the Irgetas HPC cluster at Nazarbayev University Research Computing.