Policies
Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.
Acceptable use
The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.
Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, and to administrative or even legal action.
Storage quotas
Home directory
Users’ home directories are physically stored on fast SSD arrays with very high bandwidth and enterprise-class flash endurance.
In the Shabyt cluster, the main storage servers are connected to the system via InfiniBand interfaces (100 Gbit/s). All compute nodes are also connected via InfiniBand. This provides very high bandwidth both when users access their data from the login node and when their jobs run on compute nodes under SLURM.
In the Muon cluster, the main SSD storage is located in the login node, with all SSDs connected via fast U.2 interfaces. However, please keep in mind that Muon's compute nodes have limited bandwidth to the login node (1 Gbit/s Ethernet), so batch jobs cannot read or write data faster than this.
System | Path | Default storage limit |
---|---|---|
Shabyt cluster | /shared/home/<username> | 100 GB |
Muon cluster | /shared/home/<username> | 250 GB |
In some cases users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk's ticketing system. Such requests are reviewed on an individual basis and approved only in exceptional cases.
Checking your storage quota
You can use the following terminal commands to check your own or a group member's home directory storage quota, as well as to see how much of it is actually being used.
beegfs-ctl --getquota --uid $(id -u)
beegfs-ctl --getquota --uid $(id -u <username>)
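If you are approaching your quota, it can also help to see which directories occupy the most space. A minimal sketch using standard Linux tools (not a BeeGFS-specific command); adjust the depth and path as needed:

```bash
# Summarize the size of each top-level directory in your home, sorted smallest to largest
du -h --max-depth=1 ~ | sort -h
```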
Additional storage - zdisk
In the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array does not have an InfiniBand connection, so data access and transfer speeds from both the login node and the compute nodes are limited to standard 1 Gbit/s Ethernet speeds. In zdisk, each research group has a shared allocation, which is particularly handy when data needs to be transferred or shared within a single research group (see the example after the table below).
System | Path | Default storage limit |
---|---|---|
Shabyt cluster | /zdisk/<researchgroupname> | 1 TB |
Muon cluster | /zdisk/<researchgroupname> | 1 TB |
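For example, a group member might copy results from their home directory into the shared zdisk allocation and make them readable by the rest of the group. This is only a sketch; the directory names are placeholders taken from the table above, and your group's permission setup may differ:

```bash
# Copy a results directory from home into the group's shared zdisk space
cp -r ~/results /zdisk/<researchgroupname>/

# Make the copy readable (and directories traversable) for other group members
chmod -R g+rX /zdisk/<researchgroupname>/results
```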
Again, in exceptional cases users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receiving a ticket submitted by the PI via the NU Helpdesk.
Data integrity and backup
Please be advised that users take full responsibility for the integrity and safety of their data stored on NU HPC facilities. While our clusters feature enterprise-level hardware, failures are still a possibility. We back up data in user home directories automatically several times a week (note that this applies only to your home directory in /shared/home, not to the group storage allocations in /zdisk). However, if a major hardware failure takes place, even if your data is eventually restored, you may not have access to it for a prolonged period of time while the system is offline being repaired. In some unfortunate situations it may take many days or even weeks to get everything back. Moreover, no system or storage solution is 100% reliable. We therefore highly recommend that you back up your data (at least the important and precious part of it) to your personal computer from time to time.
Limits on the jobs and their execution time
Shabyt: partitions and time limits
Currently, the Shabyt cluster has two partitions available for user jobs (an example batch script is given after the table below):
- CPU: This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC CPUs.
- NVIDIA: This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC CPUs and two NVIDIA V100 GPUs. All jobs requiring GPU computations must be queued to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition is only justified if this partition sits idle for a long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
Partition | Max job duration | Number of nodes available | Max CPU cores per node | Max threads per node | RAM (GB) per node |
---|---|---|---|---|---|
CPU | 14 days (336 hours) | 20 | 64 | 128 | 256 |
NVIDIA | 2 days (48 hours) | 4 | 64 | 128 | 256 |
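As an illustration, a batch script for a GPU job on the NVIDIA partition might look like the sketch below. The resource requests respect the limits in the table above, but the job name and executable are placeholders rather than an official template:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_example     # placeholder job name
#SBATCH --partition=NVIDIA         # GPU partition on Shabyt
#SBATCH --gres=gpu:1               # request one of the two V100 GPUs in a node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # well within 64 cores / 128 threads per node
#SBATCH --time=48:00:00            # must not exceed the 2-day limit of this partition
#SBATCH --mem=64G                  # within the 256 GB available per node

srun ./my_gpu_application          # placeholder executable
```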
Shabyt: Quality of Service (QoS)
In Shabyt, users belonging to different groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category. Currently we have only two active QoS categories:
- hpcnc: Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which originally procured Shabyt.
- nu: All other NU researchers.
Shabyt: limits on the number of jobs and cores/threads
QoS | Supported partition | Max jobs per user | Max CPU cores per user (total for all running jobs) | Max threads per user (total for all running jobs) | Job launch priority (higher means it moves up faster in the list of waiting jobs) |
---|---|---|---|---|---|
hpcnc | CPU, NVIDIA | 40 | 1280 | 2560 | 10 |
nu | CPU, NVIDIA | 12 | 256 | 512 | 5 |
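If you are not sure which QoS your account uses, SLURM's accounting tools can show it (this sketch assumes that sacctmgr queries are permitted for regular users on the login node):

```bash
# Show the account, partition, and QoS associated with your user
sacctmgr show associations user=$USER format=Account,Partition,QOS

# Inspect the limits attached to a particular QoS, e.g. "nu"
sacctmgr show qos nu format=Name,MaxJobsPU,MaxTRESPU,Priority
```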
Muon: partitions and time limits
Currently, there is only a single partition in the Muon cluster, called HPE. It includes all ten compute nodes, each with a 14-core Intel Xeon CPU. There are no limits on the number of simultaneously running jobs or CPU cores used in Muon (an example batch script is given after the table below).
Partition | Max job duration | Number of nodes available | Max CPU cores per node | Max threads per node | RAM (GB) per node |
---|---|---|---|---|---|
HPE | 14 days (336 hours) | 10 | 14 | 28 | 64 |
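For completeness, a CPU-only batch script for Muon's HPE partition might look like the following sketch, keeping the request within the per-node limits above (the job name and executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=cpu_example     # placeholder job name
#SBATCH --partition=HPE            # the only partition on Muon
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14         # at most 14 cores (28 threads) per node
#SBATCH --time=7-00:00:00          # within the 14-day maximum
#SBATCH --mem=32G                  # within the 64 GB available per node

srun ./my_cpu_application          # placeholder executable
```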
Muon: Quality of Service (QoS)
In Muon, there is just one active Quality of Service (QoS) category:
- hpcnc: Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), and members of the NU Physics Department.
Muon: limits on the number of jobs and cores/threads
QoS | Supported partition | Max jobs per user | Max CPU cores per user (total for all running jobs) | Max threads per user (total for all running jobs) | Job launch priority (higher means it moves up faster in the list of waiting jobs) |
---|---|---|---|---|---|
hpcnc | HPE | 10 | 140 | 280 | 10 |
Acknowledgments in publications
If the computational resources provided by NU HPC facilities were an essential tool in your research that resulted in a publication, we ask that you include an acknowledgment in it. A natural place for it is the same section where you would typically acknowledge funding sources. Two of many possible formats of this acknowledgment are as follows:
- The authors acknowledge the use of computational facilities provided by the Nazarbayev University Research Computing.
- A.B. and C.D. (author initials) acknowledge the use of the Shabyt HPC cluster at Nazarbayev University Research Computing.
Regardless of the format you adopt, please include the phrase "Nazarbayev University Research Computing".