Policies


Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.

Acceptable use

The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.

Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, as well as administrative or even legal action.

Storage quotas

Home directory

Users’ home directories are physically stored on fast SSD arrays that offer very high bandwidth and enterprise-class flash endurance.

On the Irgetas and Shabyt clusters, the main storage servers are connected to the system via InfiniBand, as are all compute nodes. This provides very high bandwidth both when users access their data from the login node and when their jobs run on compute nodes under SLURM.

In the Muon cluster, the main SSD storage resides in the login node, with all SSDs connected via fast U.2 interfaces. However, Muon's compute nodes are connected to the login node over a comparatively slow link (1 Gbit/s Ethernet), so batch jobs cannot read or write data faster than this network allows.

Default quota for users’ home directories on NU HPC systems
System    Path                        Default storage limit
Irgetas   /home/<username>            400 GB
Shabyt    /shared/home/<username>     100 GB
Muon      /shared/home/<username>     250 GB

In exceptional cases, users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk's ticketing system; such requests are reviewed on an individual basis and approved only when well justified.

Checking your storage quota

On Shabyt and Muon, the following terminal commands can be used to check your own or a group member's home-directory storage quota, as well as how much of it is currently in use:

beegfs-ctl --getquota --uid $(id -u)

beegfs-ctl --getquota --uid $(id -u <username>)
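
The first command reports your own quota and current usage; the second, given a username, reports the same for a specific group member. If you need a quick overview for a whole research group, the same tool can be wrapped in a short loop. The snippet below is a minimal sketch: <researchgroupname> is a placeholder for your group's POSIX group name, and it only covers accounts listed as explicit members of that group.

GROUP=<researchgroupname>     # placeholder: replace with your actual group name

for member in $(getent group "$GROUP" | cut -d: -f4 | tr ',' ' '); do
    echo "=== $member ==="
    beegfs-ctl --getquota --uid $(id -u "$member")
done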

Additional storage - zdisk, datahub

In the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array is not connected via InfiniBand; data access and transfer speeds from both the login node and the compute nodes are therefore limited to standard 1 Gbit/s Ethernet speeds. In /zdisk, each research group has a shared allocation, which is particularly handy when data needs to be transferred, exchanged, or shared within a research group. Similarly, in the Irgetas cluster there is a shared directory, /datahub, where each research group has an allocation for additional storage on an external HDD array.
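
For example, results can be staged from a home directory into the group's shared allocation with an ordinary copy or rsync on the login node; <researchgroupname> below is a placeholder for your group's directory name.

rsync -av ~/results/ /zdisk/<researchgroupname>/results/       # on Shabyt or Muon
rsync -av ~/results/ /datahub/<researchgroupname>/results/     # on Irgetas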

Default storage quota for group directories (/zdisk and /datahub) on NU HPC systems
System    Path                           Default storage limit
Irgetas   /datahub/<researchgroupname>   1 TB
Shabyt    /zdisk/<researchgroupname>     1 TB
Muon      /zdisk/<researchgroupname>     1 TB

Again, in exceptional cases individual users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receipt of a ticket submitted by the PI via the NU Helpdesk.


Data Integrity and Backup

Users are fully responsible for the integrity and safety of their data stored on NU HPC facilities. Although our clusters employ enterprise-grade hardware, failures remain possible. Home directories (/shared/home) are automatically backed up several times per week. Please note that this policy does not cover group storage allocations in /zdisk and /datahub. In the event of a major hardware failure, access to your data may be unavailable for an extended period while the system is under repair. In some cases, full recovery may take days or even weeks. Furthermore, no storage system is 100% reliable. For this reason, we strongly recommend that you maintain your own backups of important or irreplaceable data on your personal computer or other secure storage solutions. Regular personal backups will help ensure data safety and minimize disruption in case of unexpected system issues.
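
One straightforward way to keep such a personal copy is to pull important directories from the cluster to your own machine with rsync over SSH. The command below is a minimal sketch to be run on your local computer; <login-node-address> and the directory names are placeholders that depend on the cluster you use and on how your data is organized.

rsync -avz <username>@<login-node-address>:/shared/home/<username>/my_project/ ~/hpc_backups/my_project/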

Partitions

A partition in SLURM essentially means a queue: a logical grouping of compute nodes that share the same access rules and limits. Users submit jobs to a partition, and SLURM schedules them on nodes belonging to that partition. On NU HPC systems partitions group compute nodes that have identical hardware.
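
The partitions visible to your account, together with their time limits, node counts, and per-node hardware, can be listed with the standard SLURM query tool, for example:

sinfo -o "%P %l %D %c %m %G"     # partition, time limit, node count, CPUs per node, memory per node, GPUs (GRES)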

Irgetas

The Irgetas cluster has two available partitions for user jobs.

  • ZEN4 : This partition includes 10 CPU-only nodes. Each node has two 96-core AMD EPYC 9684X CPUs.
  • H100 : This partition consists of 6 GPU nodes. Each node has two 96-core AMD EPYC 9454 CPUs and four NVIDIA H100 GPUs. All Irgetas jobs requiring GPU computations must be submitted to this partition. While it is possible to run CPU-only jobs in this partition, users are highly discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the H100 partition is justified only if this partition sits idle for a very long time while the ZEN4 partition is heavily crowded with many jobs waiting in the queue.
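
As an illustration, a minimal batch script requesting a single GPU in the H100 partition might look like the sketch below. The module and executable names are placeholders, and GPUs are assumed to be requested with the generic --gres=gpu:N syntax; consult the cluster documentation for the exact GRES names configured on Irgetas.

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=H100
#SBATCH --gres=gpu:1              # one of the four H100 GPUs on a node
#SBATCH --cpus-per-task=8
#SBATCH --time=1-00:00:00         # must stay within the 4-day limit of this partition
#SBATCH --output=%x-%j.out

# module load <your-software-environment>    # placeholder: module names are site-specific

srun ./my_gpu_program                         # placeholder executable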

Shabyt

The Shabyt cluster has two available partitions for user jobs.

  • CPU : This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC 7502 CPUs.
  • NVIDIA : This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC 7452 CPUs and two NVIDIA V100 GPUs. All Shabyt jobs requiring GPU computations must be submitted to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition is justified only if this partition sits idle for a very long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
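
For comparison, a minimal CPU-only script for the Shabyt CPU partition could look as follows; the executable name is a placeholder.

#!/bin/bash
#SBATCH --job-name=cpu-test
#SBATCH --partition=CPU
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64      # one MPI rank per physical core of the two 32-core CPUs
#SBATCH --time=2-00:00:00         # well within the 14-day limit of this partition
#SBATCH --output=%x-%j.out

srun ./my_mpi_program             # placeholder executable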

Muon

The Muon cluster has a single partition.

  • HPE : This partition includes all ten compute nodes, each with a single 14-core Intel Xeon CPU.

Quality of Service (QoS)

Users belonging to different university units and research groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category in SLURM.
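
Which QoS your account maps to, and which limits each QoS enforces, can be inspected with SLURM's accounting tools, for example with the commands below. If your account is entitled to a non-default QoS, it can be requested in a job script with, e.g., #SBATCH --qos=hpcnc.

sacctmgr show assoc user=$USER format=cluster,account,qos       # which QoS categories your account can use
sacctmgr show qos format=name,maxjobspu,maxtrespu%40,priority   # per-QoS limits and priorities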

Irgetas

The Irgetas cluster has four active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)
  • issai : Members of the Institute of Smart Systems and Artificial Intelligence
  • stud : Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)

Shabyt

The Shabyt cluster has three active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)
  • stud : Students with temporary accounts who take courses related to HPC (e.g. PHYS 421/521/721)

Muon

The Muon cluster has two active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which procured Shabyt
  • nu : All other NU researchers (default category)

Job time limits

The following table lists maximum allowed job durations (wall time) in different partitions of NU HPC systems, as well as key characteristics (RAM, number of cores, number of GPUs) for compute nodes in each partition.

Time limits for jobs in different partitions of NU HPC systems
System    Partition   Max job duration      Nodes       Max CPU cores   Max threads   RAM        GPUs
                                            available   per node        per node      per node   per node
Irgetas   ZEN4        7 days (168 hours)    10          192             384           384 GB     n/a
Irgetas   H100        4 days (96 hours)     6           192             384           768 GB     4
Shabyt    CPU         14 days (336 hours)   20          64              128           256 GB     n/a
Shabyt    NVIDIA      2 days (48 hours)     4           64              128           256 GB     2
Muon      HPE         14 days (336 hours)   10          14              28            64 GB      n/a
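
The wall time a job requests is set with the --time option in days-hours:minutes:seconds format. If the request exceeds the partition's maximum, SLURM will not start the job; it typically remains pending with a reason such as PartitionTimeLimit. For example, either of the following lines could appear in a job script:

#SBATCH --time=7-00:00:00     # 7 days, the maximum allowed in the Irgetas ZEN4 partition
#SBATCH --time=36:00:00       # 36 hours, within the 48-hour limit of the Shabyt NVIDIA partition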

Limits on the number of jobs, cores, threads, and GPUs

All limits on the number of simultaneously running jobs, CPU cores used, GPUs used, and job priorities are listed below for all clusters and QoS categories.

Maximum number of simultaneously running jobs, CPU cores, and threads for NU HPC systems
System    QoS         Partition     Max running jobs   Max CPU cores   Max threads   Max GPUs   Job launch
                                    per user           per user        per user      per user   priority
Irgetas   hpcnc       ZEN4          12                 576             1152          n/a        10
Irgetas   nu          ZEN4          12                 576             1152          n/a        10
Irgetas   issai       ZEN4          12                 576             1152          n/a        10
Irgetas   hpcnc       H100          12                 576             1152          12         10
Irgetas   nu          H100          12                 576             1152          12         10
Irgetas   issai       H100          12                 1152            2304          24         30
Shabyt    hpcnc       CPU, NVIDIA   40                 1280            2560          8          10
Shabyt    nu          CPU, NVIDIA   12                 256             512           8          5
Shabyt    stud        CPU, NVIDIA   4                  128             256           4          5
Muon      hpcnc, nu   HPE           40                 140             280           n/a        10

The CPU core, thread, and GPU maxima are totals across all of a user's simultaneously running jobs. A higher job launch priority value means a waiting job moves up the queue faster.
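
To see how close you are to these limits, you can list your own running and pending jobs together with their partition, QoS, state, and allocated CPUs, for example:

squeue --me -o "%i %P %q %T %C %l %R"     # job ID, partition, QoS, state, CPUs, time limit, reason/nodes

On older SLURM installations that lack the --me option, use squeue -u $USER instead.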


Acknowledgments in publications

If computational resources provided by Nazarbayev University Research Computing (NU RC) were essential to research reported in a publication, please include an acknowledgment, typically in the same section where funding sources are acknowledged. Example wordings are given below; feel free to adapt them, but make sure the exact phrase Nazarbayev University Research Computing appears:

  • The authors acknowledge the use of computational resources provided by Nazarbayev University Research Computing.
  • A.B. and C.D. acknowledge the use of the Irgetas HPC cluster at Nazarbayev University Research Computing.