Policies


Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.

Acceptable Use

The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.

Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, and to administrative or even legal action.

Storage quotas

Home directory

Users’ home directories are physically stored on fast SSD arrays with very high bandwidth and enterprise-class flash endurance.

In the case of the Shabyt cluster, the main storage servers are connected to the system via InfiniBand interfaces (100 Gbit/s). All compute nodes are also connected via InfiniBand. This provides very high bandwidth for users both when they access their data from the login node and when running jobs on compute nodes through SLURM.

In the Muon cluster, the main SSD storage is located in the login node, with all SSDs connected via fast U.2 interfaces. However, please keep in mind that Muon's compute nodes have limited bandwidth to the login node (1 Gbit/s Ethernet), so batch jobs cannot read or write data faster than this.

Default quota for users’ home directories on NU HPC systems
System           Path                       Default storage limit
Shabyt cluster   /shared/home/<username>    100 GB
Muon cluster     /shared/home/<username>    100 GB

In some cases users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk's ticketing system. Such requests are reviewed on an individual basis and approved only in exceptional cases.

Checking your storage quota

You can use the following terminal commands to check your own or a group member's storage quota in the home directory, as well as to see how much of it is actually being used.

beegfs-ctl --getquota --uid $(id -u)               # your own home-directory quota and current usage

beegfs-ctl --getquota --uid $(id -u <username>)    # quota and usage of another user, e.g. a group member

Additional storage - zdisk

In the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array does not have an InfiniBand connection, so data access and transfer speeds from both the login node and compute nodes are limited to standard 1 Gbit/s Ethernet speeds. In zdisk, each research group has a shared allocation, which can be particularly handy when data needs to be transferred or shared within a single research group.

Default storage quota for zdisk on NU HPC systems
System           Path                         Default storage limit
Shabyt cluster   /zdisk/<researchgroupname>   1 TB
Muon cluster     /zdisk/<researchgroupname>   1 TB
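
To keep track of how much of the shared group allocation is already occupied, the standard du utility can be used. The sketch below assumes your group directory follows the /zdisk/<researchgroupname> path from the table above; replace the placeholder with your actual group directory name.

# Total space used by your group's zdisk allocation
du -sh /zdisk/<researchgroupname>

# Usage broken down by top-level subdirectory, sorted smallest to largest
du -h --max-depth=1 /zdisk/<researchgroupname> | sort -h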

Again, in exceptional cases users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receiving a ticket submitted by the PI via the NU Helpdesk.

Data integrity and backup

Please be advised that users take full responsibility for the integrity and safety of their data stored on NU HPC facilities. While our clusters feature enterprise-level hardware, failures are still possible. We do back up data in user home directories automatically several times a week (note that this applies only to your home directory in /shared/home, not to the group storage allocations in /zdisk). However, if a major hardware failure takes place, even if your data is eventually restored, you may not have access to it for a prolonged period of time while the system is offline being repaired. In some unfortunate situations it might take many days or even weeks to get everything back. Moreover, no system or storage solution is 100% reliable. We therefore highly recommend that you back up your data (at least the important and precious part of it) on your personal computer from time to time.
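
For example, a simple one-way copy of the most valuable part of your home directory to your personal computer can be made with rsync. The sketch below is meant to be run from your own machine; <username> and <login-node-address> are placeholders for your cluster username and the address of the login node, and the source and destination paths are only examples.

# Run on your personal computer, not on the cluster.
# -a preserves permissions and timestamps, -v lists transferred files, -z compresses data in transit.
rsync -avz <username>@<login-node-address>:/shared/home/<username>/important_results/ ~/hpc_backup/important_results/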

Limits on the jobs and their execution time

Shabyt partitions and time limits

Currently, the Shabyt cluster has two partitions available for user jobs.

  • CPU : This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC CPUs.
  • NVIDIA : This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC CPUs and two NVIDIA V100 GPUs. All jobs requiring GPU computations must be queued to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition can only be justified if this partition sits idle for a long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
Time limits for jobs in Shabyt partitions

Partition   Max job duration      Number of nodes available   Max CPU cores per node   Max threads per node   RAM (GB) per node
CPU         14 days (336 hours)   20                          64                       128                    256
NVIDIA      2 days (48 hours)     4                           64                       128                    256
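
As an illustration, a minimal SLURM batch script that stays within the limits above might look as follows. The job name, program name, and resource amounts are placeholders to be adapted to your workload, and the GPU request syntax assumes a standard generic-resource (gres) configuration.

#!/bin/bash
# Placeholder job name; adjust to your workload.
#SBATCH --job-name=my_job
# Choose the partition: CPU for CPU-only jobs, NVIDIA for GPU jobs.
#SBATCH --partition=CPU
# Wall-time request; must not exceed 14 days (336 hours) in the CPU partition
# or 2 days (48 hours) in the NVIDIA partition.
#SBATCH --time=14-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Up to 64 physical cores (128 threads) are available per node.
#SBATCH --cpus-per-task=64
# For a GPU job in the NVIDIA partition, also request a GPU, for example:
# #SBATCH --gres=gpu:1

# Placeholder executable; replace with your own program.
srun ./my_program

Submit the script with sbatch <scriptname>.sh and check the state of your queued and running jobs with squeue -u $USER.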

Quality of Service (QoS)

In Shabyt, users belonging to different groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category. Currently, we have only two active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which originally procured Shabyt
  • nu : All other NU researchers

Shabyt limits on the number of jobs and cores/threads

Maximum number of simultaneously running jobs, CPU cores, and threads for Shabyt

QoS     Supported partition   Max jobs per user   Max CPU cores per user   Max threads per user
hpcnc   CPU, NVIDIA           40                  1280                     2560
nu      CPU, NVIDIA           12                  256                      512
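
If you are unsure which QoS applies to your account, the standard SLURM accounting commands can show both your association and the per-user limits attached to each QoS; the field selections below are just one possible choice.

# QoS associated with your SLURM account(s)
sacctmgr show assoc where user=$USER format=User,Account,Partition,QOS

# Per-user limits attached to each QoS
# (MaxJobsPU = maximum running jobs per user, MaxTRESPU = maximum trackable resources per user)
sacctmgr show qos format=Name,MaxJobsPU,MaxTRESPU

In a batch script, a QoS can be requested explicitly with #SBATCH --qos=nu (or --qos=hpcnc) if your account is associated with it.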

Muon partitions and time limits

Currently, there is only a single partition in the Muon cluster, called HPE. It includes all ten compute nodes, each equipped with a 14-core Intel Xeon CPU. There are no limits on the number of simultaneously running jobs or CPU cores used in Muon.

Time limits for jobs in Muon partitions

Partition   Max job duration      Number of nodes available   Max CPU cores per node   Max threads per node   RAM (GB) per node
HPE         14 days (336 hours)   10                          14                       28                     64
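
A batch script for Muon is analogous to the Shabyt example above, using the HPE partition and the per-node limits from this table; all names and amounts below are placeholders.

#!/bin/bash
#SBATCH --job-name=my_job
# HPE is the only partition available on Muon.
#SBATCH --partition=HPE
# Wall-time request; must not exceed 14 days (336 hours).
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Each Muon node provides up to 14 physical cores (28 threads) and 64 GB of RAM.
#SBATCH --cpus-per-task=14

# Placeholder executable; replace with your own program.
srun ./my_program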

Acknowledgments

If the computational resources provided by NU HPC facilities were an essential tool in your research that resulted in a publication, we ask that you include an acknowledgment in it. A natural place for it is the same section where you would typically acknowledge funding sources. Two of many possible formats for this acknowledgment are as follows:

  • The authors acknowledge the use of computational facilities provided by Nazarbayev University Research Computing.
  • A.B. and C.D. (author initials) acknowledge the use of the Shabyt HPC cluster at Nazarbayev University Research Computing.