Policies
Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.
Acceptable use
The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.
Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, and to administrative or even legal action.
Storage quotas
Home directory
Users’ home directories are physically stored on fast SSD arrays with very high bandwidth and enterprise-class flash endurance.
On the Irgetas and Shabyt clusters, the main storage servers are connected to the system via InfiniBand, as are all compute nodes. This provides very high bandwidth both when users access their data from the login node and when their jobs run on compute nodes via SLURM.
On the Muon cluster, the main SSD storage resides in the login node, with all SSDs connected via fast U.2 interfaces. However, Muon's compute nodes have limited bandwidth to the login node (1 Mbit/s Ethernet), so batch jobs cannot read or write data faster than this network allows.
System | Path | Default storage limit |
---|---|---|
Irgetas | `/home/<username>` | 400 GB |
Shabyt | `/shared/home/<username>` | 100 GB |
Muon | `/shared/home/<username>` | 250 GB |
In some cases users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk's ticketing system; such requests are reviewed on an individual basis and approved only in exceptional cases.
Checking your storage quota
On Shabyt and Muon, the following terminal commands can be used to check your own or a group member's storage quota in the home directory, as well as how much of it is actually being used.
beegfs-ctl --getquota --uid $(id -u)
beegfs-ctl --getquota --uid $(id -u <username>)
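The first command reports your own quota and usage; the second reports those of another user whose username you specify. As a filesystem-agnostic alternative (a generic sketch, not an NU-specific tool), the standard du utility shows how much space your files actually occupy:

```bash
# Total size of your home directory (may take a while on large directory trees)
du -sh "$HOME"

# Break usage down by top-level entry to find the largest consumers
du -sh "$HOME"/* | sort -h
```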
Additional storage - zdisk, datahub
On the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array is not connected via InfiniBand, so data access and transfer speeds from both the login node and compute nodes are limited to standard 1 Mbit/s Ethernet speeds. In `/zdisk`, each research group has a shared allocation, which is particularly handy when data needs to be transferred, exchanged, or shared within a research group (see the copy example after the table below). Similarly, on the Irgetas cluster, there is a shared directory called `/datahub` where each research group has its allocation for additional storage on an external HDD array.
System | Path | Default storage limit |
---|---|---|
Irgetas | `/datahub/<researchgroupname>` | 1 TB |
Shabyt | `/zdisk/<researchgroupname>` | 1 TB |
Muon | `/zdisk/<researchgroupname>` | 1 TB |
Again, in exceptional cases individual users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receiving a ticket from the PI via the NU Helpdesk.
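As a sketch of typical usage (the group name is a placeholder and the destination path follows the pattern in the table above), data can be copied into a group allocation with standard tools and made readable for other group members:

```bash
# Copy a results directory from your home directory into your group's shared
# allocation on Shabyt; replace <researchgroupname> with your group's directory name
rsync -av --progress "$HOME/results/" /zdisk/<researchgroupname>/results/

# Allow other members of your group to read the copied files
chmod -R g+rX /zdisk/<researchgroupname>/results/
```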
Data Integrity and Backup
Users are fully responsible for the integrity and safety of their data stored on NU HPC facilities. Although our clusters employ enterprise-grade hardware, failures remain possible. Home directories (`/shared/home`) are automatically backed up several times per week. Please note that this policy does not cover group storage allocations in `/zdisk` and `/datahub`.
In the event of a major hardware failure, your data may be inaccessible for an extended period while the system is under repair. In some cases, full recovery may take days or even weeks. Furthermore, no storage system is 100% reliable.
For this reason, we strongly recommend that you maintain your own backups of important or irreplaceable data on your personal computer or other secure storage solutions. Regular personal backups will help ensure data safety and minimize disruption in case of unexpected system issues.
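One simple way to do this (a sketch; the username, login node address, and paths are placeholders, not actual NU endpoints) is to pull periodic copies to your own machine with rsync over SSH:

```bash
# Run on your personal computer, not on the cluster.
# <username> and <login-node-address> are placeholders for your credentials
# and the address you normally use to SSH into the cluster.
rsync -avz --progress \
    <username>@<login-node-address>:/shared/home/<username>/project/ \
    ~/hpc-backups/project/
```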
Partitions
A partition in SLURM essentially means a queue: a logical grouping of compute nodes that share the same access rules and limits. Users submit jobs to a partition, and SLURM schedules them on nodes belonging to that partition. On NU HPC systems partitions group compute nodes that have identical hardware.
Irgetas
The Irgetas cluster has two available partitions for user jobs.
- ZEN4: This partition includes 10 CPU-only nodes. Each node has two 96-core AMD EPYC 9684X CPUs.
- H100: This partition consists of 6 GPU nodes. Each node has two 96-core AMD EPYC 9454 CPUs and four NVIDIA H100 GPUs. All Irgetas jobs requiring GPU computations must be queued to this partition (an example batch script is sketched after this list). While it is possible to run CPU-only jobs in this partition, users are highly discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the H100 partition can only be justified if this partition sits idle for a very long time while the ZEN4 partition is heavily crowded with many jobs waiting in the queue.
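A minimal sketch of a batch script targeting the H100 partition is shown below. The job name, resource counts, and GPU request syntax (--gres) are illustrative assumptions, so check the cluster's own submission examples before relying on them.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test       # illustrative job name
#SBATCH --partition=H100          # GPU partition on Irgetas
#SBATCH --gres=gpu:1              # request one GPU (exact gres name is an assumption)
#SBATCH --cpus-per-task=8         # modest CPU allocation to accompany the GPU
#SBATCH --time=04:00:00           # well within the partition's 96-hour limit

srun ./my_gpu_program             # placeholder executable
```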
Shabyt
The Shabyt cluster has two available partitions for user jobs.
- CPU: This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC 7502 CPUs.
- NVIDIA: This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC 7452 CPUs and two NVIDIA V100 GPUs. All Shabyt jobs requiring GPU computations must be queued to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition can only be justified if this partition sits idle for a very long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
Muon
The Muon cluster has a single partition.
- HPE: This partition includes all ten compute nodes, each with a single 14-core Intel Xeon CPU.
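To see the partitions available on the cluster you are logged into, along with their node counts and time limits, the standard SLURM sinfo command can be used; the format string below is just one reasonable choice:

```bash
# Partition name, node count, CPUs per node, memory per node (MB), and time limit
sinfo -o "%P %D %c %m %l"
```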
Quality of Service (QoS)
Users belonging to different university units and research groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category in SLURM.
Irgetas
The Irgetas cluster has four active QoS categories:
- hpcnc: Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
- nu: All other NU researchers (default category)
- issai: Members of the Institute of Smart Systems and Artificial Intelligence
- stud: Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)
Shabyt
The Shabyt cluster has three active QoS categories:
- hpcnc: Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
- nu: All other NU researchers (default category)
- stud: Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)
Muon
The Muon cluster has two active QoS categories:
- hpcnc: Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
- nu: All other NU researchers (default category)
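To check which account and QoS your user is associated with, you can query the SLURM accounting database as sketched below; a specific QoS is then requested in a batch script with #SBATCH --qos=<name> (use only a QoS you are actually entitled to).

```bash
# Show the account and QoS associations for your user
sacctmgr show assoc user=$USER format=Account,User,QOS
```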
Job time limits
The following table lists maximum allowed job durations (wall time) in different partitions of NU HPC systems, as well as key characteristics (RAM, number of cores, number of GPUs) for compute nodes in each partition.
System | Partition | Max job duration | Number of nodes available | Max CPU cores per node | Max threads per node | RAM per node | GPUs per node |
---|---|---|---|---|---|---|---|
Irgetas | ZEN4 | 7 days (168 hours) | 10 | 192 | 384 | 384 GB | n/a |
Irgetas | H100 | 4 days (96 hours) | 6 | 192 | 384 | 768 GB | 4 |
Shabyt | CPU | 14 days (336 hours) | 20 | 64 | 128 | 256 GB | n/a |
Shabyt | NVIDIA | 2 days (48 hours) | 4 | 64 | 128 | 256 GB | 2 |
Muon | HPE | 14 days (336 hours) | 10 | 14 | 28 | 64 GB | n/a |
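The requested wall time must not exceed the partition maximum, otherwise SLURM will reject the job. A minimal sketch, with the partition and limit taken from the table above and a placeholder executable:

```bash
#!/bin/bash
#SBATCH --partition=CPU        # Shabyt CPU partition (max 14 days)
#SBATCH --time=7-00:00:00      # request 7 days, within the 336-hour limit
#SBATCH --ntasks=64            # illustrative core count

srun ./my_cpu_program          # placeholder executable
```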
Limits on the number of jobs, cores, threads, and GPUs
All limits on the number of simultaneously running jobs, CPU cores used, GPUs used, and job priorities are listed below for all clusters and QoS categories.
System | QoS | Partition | Max simultaneously running jobs per user | Max CPU cores per user (total for all running jobs) | Max threads per user (total for all running jobs) | Max GPUs per user (total for all running jobs) | Job launch priority (higher relative value means it moves up faster in the list of waiting jobs) |
---|---|---|---|---|---|---|---|
Irgetas | hpcnc | ZEN4 | 12 | 576 | 1152 | n/a | 10 |
Irgetas | nu | ZEN4 | 12 | 576 | 1152 | n/a | 10 |
Irgetas | issai | ZEN4 | 12 | 576 | 1152 | n/a | 10 |
Irgetas | hpcnc | H100 | 12 | 576 | 1152 | 12 | 10 |
Irgetas | nu | H100 | 12 | 576 | 1152 | 12 | 10 |
Irgetas | issai | H100 | 12 | 1152 | 2304 | 24 | 30 |
Shabyt | hpcnc | CPU, NVIDIA | 40 | 1280 | 2560 | 8 | 10 |
Shabyt | nu | CPU, NVIDIA | 12 | 256 | 512 | 8 | 5 |
Shabyt | stud | CPU, NVIDIA | 4 | 128 | 256 | 4 | 5 |
Muon | hpcnc, nu | HPE | 40 | 140 | 280 | n/a | 10 |
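To see how many jobs you currently have running or pending, and thus how close you are to these limits, the standard SLURM queue commands can be used:

```bash
# List all of your own jobs, running and pending
squeue -u $USER

# Compact count of your jobs by state (e.g. RUNNING, PENDING)
squeue -u $USER -h -o "%T" | sort | uniq -c
```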
Acknowledgments in publications
If computational resources provided by Nazarbayev University Research Computing (NU RC) were essential to research reported in a publication, please include an acknowledgment, typically in the same section where funding sources are acknowledged. Example wordings are given below (feel free to adapt them), but make sure the exact phrase Nazarbayev University Research Computing appears:
- The authors acknowledge the use of computational resources provided by Nazarbayev University Research Computing.
- A.B. and C.D. acknowledge the use of the Irgetas HPC cluster at Nazarbayev University Research Computing.