Policies

Important Note: Software configurations on NU HPC facilities are updated on a continuous basis. Minor policy changes also occur regularly. Some of these changes might not be immediately reflected on this website. The limits on job execution and maximum storage allocations are subject to change based on decisions made by the NU HPC Committee and actual system utilization.

Acceptable use

The HPC system is a unique resource for NU researchers and the community. It has special characteristics, such as a large amount of RAM and the capability for massive parallelism. Due to its uniqueness and expense, its use is supervised by the HPC team to ensure efficient and fair utilization.

Users are accountable for their actions. It is the responsibility of PIs to ensure that their group members have the necessary expertise to use NU HPC facilities properly and use them for research purposes only. Intentional misuse of NU HPC resources or noncompliance with our Acceptable Use Policy can lead to temporary or permanent disabling of accounts, and to administrative or even legal action.

Storage quotas

Home directory

Users’ home directories are physically stored on fast SSD arrays that provide very high bandwidth and enterprise-class flash endurance.

In the Shabyt cluster, the main storage servers are connected to the system via Infiniband interfaces (100 Gbit/s). All compute nodes are also connected via Infiniband. This provides very high bandwidth both when users access their data from the login node and when their jobs run on the compute nodes through SLURM.

In the Muon cluster, the main SSD storage is located in the login node, with all SSDs connected via fast U.2 interfaces. However, please keep in mind that the bandwidth between Muon's compute nodes and the login node is limited (1 Gbit/s Ethernet). Therefore, batch jobs cannot read and write data faster than this network bandwidth.
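For a quick look at these storage areas and how full they are, the standard df utility can be used from the login node. This is only a sketch: the mount points below are the paths listed in the tables on this page, and the exact output depends on how the filesystems are configured on the cluster you are logged into.

# Show total size, used space, and free space of the home and group storage areas
df -h /shared/home /zdisk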

Default quota for users’ home directories on NU HPC systems:

  System           Path                        Default storage limit
  Shabyt cluster   /shared/home/<username>     100 GB
  Muon cluster     /shared/home/<username>     250 GB

In some exceptional cases users may be granted a higher storage quota in their home directories. An increased limit must be requested via the Helpdesk ticketing system; such requests are reviewed on an individual basis and approved only in exceptional circumstances.

Checking your storage quota

You can use the following terminal commands to check your own or a group member's home directory storage quota, as well as to see how much of it is actually being used.

beegfs-ctl --getquota --uid $(id -u)

beegfs-ctl --getquota --uid $(id -u <username>)
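The first command reports your own quota and usage; the second reports those of another user, given their username. If group quotas are also enabled on the BeeGFS storage (an assumption — whether they are configured depends on the system setup), a similar command can report the allocation of your whole group:

# Group quota and usage for your primary group (only meaningful if group quotas are enabled)
beegfs-ctl --getquota --gid $(id -g)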

Additional storage - zdisk

In the Shabyt cluster, users can store larger amounts of data in their group directory on a slower HDD array. Keep in mind that this array is not connected via Infiniband, so data access and transfer speeds from both the login node and the compute nodes are limited to standard 1 Gbit/s Ethernet speeds. In /zdisk, each research group has a shared allocation. This can be particularly handy when data needs to be transferred, exchanged, or shared within a research group.

Default storage quota for zdisk on NU HPC systems:

  System           Path                           Default storage limit
  Shabyt cluster   /zdisk/<researchgroupname>     1 TB
  Muon cluster     /zdisk/<researchgroupname>     1 TB

Again, in exceptional cases individual users or groups may be granted an increased quota. Such requests are reviewed on an individual basis upon receiving a ticket submitted by the PI via NU Helpdesk (https://helpdesk.nu.edu.kz/support/catalog/items/272).
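Because the /zdisk allocation is shared by the whole group, it is often useful to check how much of it your group's files currently occupy. A simple way to do this (which may be slow for directories with many files) is with du; replace <researchgroupname> with the actual name of your group directory:

# Total size of your group's zdisk directory
du -sh /zdisk/<researchgroupname>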

Data integrity and backup

Please be advised that users take full responsibility for the integrity and safety of their data stored on NU HPC facilities. While our clusters feature enterprise-level hardware, failures are still a possibility. We do back up data in user home directories automatically several times a week (note that this applies only to your home directory in /shared/home, not to the group storage allocations in /zdisk). However, if a major hardware failure takes place, even if your data is eventually restored, you may not have access to it for a prolonged period of time while the system is offline for repairs. In some unfortunate situations it might take many days or even weeks to get everything back. Moreover, no system or storage solution is 100% reliable. Therefore, we highly recommend that you back up your data (at least the important and precious part of it) to your personal computer from time to time.
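One straightforward way to keep such a personal copy is to pull the data from the cluster to your own computer with rsync over SSH. The sketch below is run on your personal computer; <username> is your cluster account, <login-node-address> stands for the actual address of the login node you use, and my_project is a placeholder for whichever directory you want to back up.

# Copy (and later incrementally update) a directory from your cluster home to a local backup folder
rsync -avz <username>@<login-node-address>:/shared/home/<username>/my_project ./hpc_backup/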

Limits on the jobs and their execution time

Shabyt: partitions and time limits

Currently, the Shabyt cluster has two partitions available for user jobs.

  • CPU : This partition includes 20 CPU-only nodes. Each node has two 32-core AMD EPYC CPUs.
  • NVIDIA : This partition consists of 4 GPU nodes. Each node has two 32-core AMD EPYC CPUs and two NVIDIA V100 GPUs. All jobs requiring GPU computations must be queued to this partition. While it is possible to run CPU-only jobs in this partition, users are discouraged from doing so to ensure efficient utilization of the system. Submitting CPU jobs to the NVIDIA partition can only be justified if this partition sits idle for a long time while the CPU partition is heavily crowded with many jobs waiting in the queue.
Time limits for jobs in Shabyt partitions:

  System           Partition   Max job duration      Nodes available   Max CPU cores per node   Max threads per node   RAM per node (GB)
  Shabyt cluster   CPU         14 days (336 hours)   20                64                       128                    256
  Shabyt cluster   NVIDIA      2 days (48 hours)     4                 64                       128                    256
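As an illustration of how the partition names and time limits above are used in practice, here is a minimal SLURM batch script for a GPU job on Shabyt. The partition name NVIDIA and the 48-hour limit come from the tables above; the job name, the CPU count, and my_gpu_app are placeholders, and the exact way GPUs are requested (the GRES name) may differ depending on the local SLURM configuration.

#!/bin/bash
#SBATCH --job-name=gpu_test        # placeholder job name
#SBATCH --partition=NVIDIA         # GPU partition (see table above)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # example CPU allocation
#SBATCH --gres=gpu:1               # request one GPU (GRES name may differ locally)
#SBATCH --time=48:00:00            # must not exceed the 2-day limit of the NVIDIA partition

srun ./my_gpu_app                  # placeholder for your actual program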

Shabyt: Quality of Service (QoS)

In Shabyt, users belonging to different groups have different limits on how many jobs they can run simultaneously. This is controlled by the Quality of Service (QoS) category. Currently we have only two active QoS categories:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC), which originally procured Shabyt
  • nu : All other NU researchers

Shabyt: limits on the number of jobs and cores/threads

Maximum number of simultaneously running jobs, CPU cores, and threads for Shabyt:

  System           QoS     Partitions    Max running jobs per user   Max CPU cores per user   Max threads per user   Job launch priority
  Shabyt cluster   hpcnc   CPU, NVIDIA   40                          1280                     2560                   10
  Shabyt cluster   nu      CPU, NVIDIA   12                          256                      512                    5

The CPU core and thread limits are totals over all of a user's running jobs. A higher job launch priority means the job moves up the list of waiting jobs faster.
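To see which QoS your account is associated with, and how many of your jobs are currently running or waiting, the standard SLURM accounting and queue tools can be used (the exact output format depends on the local configuration):

# Show the QoS associated with your SLURM account
sacctmgr show assoc where user=$USER format=User,Account,QOS

# List your currently running and pending jobs
squeue -u $USER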

Muon: partitions and time limits

Currently, there is only a single partition in the Muon cluster, called HPE. It includes all ten compute nodes, with 14 Intel Xeon CPU cores in each.

Time limits for jobs in Muon partitions:

  System         Partition   Max job duration      Nodes available   Max CPU cores per node   Max threads per node   RAM per node (GB)
  Muon cluster   HPE         14 days (336 hours)   10                14                       28                     64
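For completeness, a corresponding minimal batch script for Muon might look as follows. The partition name HPE, the 14 cores per node, and the 14-day limit come from the table above; the two-node request and my_mpi_app are placeholders to be replaced with your actual job setup.

#!/bin/bash
#SBATCH --job-name=mpi_test        # placeholder job name
#SBATCH --partition=HPE            # the single Muon partition (see table above)
#SBATCH --nodes=2                  # example: two of the ten compute nodes
#SBATCH --ntasks-per-node=14       # one task per physical core
#SBATCH --time=7-00:00:00          # 7 days, within the 14-day limit

srun ./my_mpi_app                  # placeholder for your actual MPI program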

Muon: Quality of Service (QoS)

In Muon, there is just one active Quality of Service (QoS) category:

  • hpcnc : Members of research groups that are part of the research cluster called High Performance Computing, Networking, and Cybersecurity (HPCNC) and members of NU Physics Department

Muon: limits on the number of jobs and cores/threads

Maximum number of simultaneously running jobs, CPU cores, and threads for Muon:

  System         QoS     Partition   Max running jobs per user   Max CPU cores per user   Max threads per user   Job launch priority
  Muon cluster   hpcnc   HPE         40                          140                      280                    10

As for Shabyt, the core and thread limits are totals over all of a user's running jobs, and a higher launch priority means the job moves up the queue of waiting jobs faster.

Acknowledgments in publications

If the computational resources provided by NU HPC facilities were an essential tool in your research that resulted in a publication, we ask that you include an acknowledgment in it. A natural place for it is the same section where you would typically acknowledge funding sources. Two of many possible formats for this acknowledgment are as follows:

  • The authors acknowledge the use of computational facilities provided by the Nazarbayev University Research Computing.
  • A.B. and C.D. (author initials) acknowledge the use of Shabyt HPC cluster at Nazarbayev University Research Computing.

Regardless of the format you adopt, please include the exact phrase "Nazarbayev University Research Computing" in it.