Merge pull request #62 from hpc-unibe-ch/61-streamline-documentation-prior-to-ubelix9

Streamlined documentation (outdated/obsolete)
grvlbit authored Jan 12, 2024
2 parents 0b4697b + 5c78e21 commit b13a282
Showing 26 changed files with 191 additions and 870 deletions.
33 changes: 0 additions & 33 deletions docs/file-system/quota.md
@@ -59,36 +59,3 @@ Furthermore, the SCRATCH quota is presented starting with `SCR_`, where `SCR_usr
- HOME and SCRATCH: values presented are actual values directly gathered from the file system

Note: the coloring of the relative values is green (<70%), yellow (70% < x < 90%), red (>90%).

## Advanced quota method

The following `mmlsquota` commands present actual values from the file system.
For `$HOME`:

```Bash
$ mmlsquota -u $USER rs_gpfs:svc_homefs
Block Limits | File Limits
Filesystem Fileset type KB quota limit in_doubt grace | files quota limit in_doubt grace Remarks
rs_gpfs svc_homefs USR 444181792 1073741824 1073741824 6072144 none | 815985 1000000 1000000 2462 none
```

The `--block-size` option specifies the unit {K, M, G, T} in which the number of blocks is displayed:

```Bash
mmlsquota --block-size=G -j workspace1 rs_gpfs
Block Limits | File Limits
Filesystem type GB quota limit in_doubt grace | files quota limit in_doubt grace Remarks
rs_gpfs FILESET 57 10240 11264 0 none | 5 10000000 11000000 0 none
```

The output shows the quotas for a workspace called `workspace1`. The quotas are set to a soft limit of 10240 GB and a hard limit of 11264 GB. 57 GB is currently allocated to the workspace. An in_doubt value greater than zero means that the quota system has not yet been updated as to whether the space that is in doubt is still available or not.

If the user/workspace exceeds the soft limit, the grace period is set to one week. If usage is not reduced below the soft limit during that time, the quota system interprets the soft limit as the hard limit and no further allocation is allowed. The user(s) can reset this condition by reducing usage enough to fall below the soft limit. The maximum amount of disk space the workspace/user can accumulate during the grace period is defined by the hard limit. The same information is also displayed for the file limits (number of files).
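
As an illustration, the block usage and soft limit can be pulled out of this output with a short pipeline. This is a minimal sketch that assumes the two header lines and column order shown above; `workspace1` is only an example fileset name:

```Bash
# Print current block usage and soft limit (in GB) for an example workspace fileset
mmlsquota --block-size=G -j workspace1 rs_gpfs \
  | awk 'NR > 2 && NF { print "used:", $3, "GB  soft limit:", $4, "GB" }'
```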

## Request Higher Quota Limits

!!! types info "Clean up before asking"
Make sure to clean up your directories before requesting additional storage space.

There will be no quota increase for HOME directories. Additional storage for workspaces can be requested by the workspace owner or the deputy, see [Workspace Management](../hpc-workspaces/management.md#additional-storage).

File renamed without changes.
26 changes: 3 additions & 23 deletions docs/general/costs_investments.md
@@ -17,39 +17,19 @@ used to buy additional resources for the cluster/storage will be purchased with
the additional budget.

!!! tip "Get in touch with us!"
If you are interested in any of the below investment opportunities, get in
If you are interested in our investment opportunities, get in
touch with us by starting a service request at the [Service
Portal](https://serviceportal.unibe.ch/sp).

## CPUs

For CPUs we do not have formal investment opportunities as of yet. We are
working on a business model that allows fair investment in this area. Please
get in touch with us if you are interested in getting higher/broader privileges
on CPU partitions.

## GPUs

We provide well over 100 GPUs to our users. In contrast to CPUs, we work with
preemption on the GPU partition. For your chosen investment you gain the
privilege to preempt other users on the number of GPUs you invested in.
That means, whenever there are no free GPUs and you start your job, other
users' jobs are terminated so that yours can start almost immediately.

As the number of GPUs is limited, there may be none available; in that case,
new cards will be ordered for you as soon as possible. Nevertheless, please
plan ahead as much as possible, as GPU availability is scarce and it may take
months until we can get new cards.

## Disk Storage
## Disk Storage Costs

### Workspaces

Every **research group** has a **10TB** free-of-charge quota. This can be used
within one or more Workspaces. The amount used per Workspace is set at
application time and can be changed later within this limit.

Additional storage can be purchased for CHF 50 per TB and year. On the
Additional storage can be purchased. On the
application or modification form, an upper quota limit can be set.
Only the actual usage is accounted for. Therefore, the actual usage is monitored
twice a day. The average value of all data points is used for accounting and
86 changes: 21 additions & 65 deletions docs/general/faq.md
@@ -1,7 +1,5 @@
# FAQ

## Description

This page provides a collection of frequently asked questions.

## File system
@@ -11,10 +9,9 @@ If you reached your quota, you will get a strange warning about not being able to

1. Decluttering: Check for unnecessary data. This could be:

- unused application packages, e.g. Python(2) packages in `$HOME/.local/lib/python*/site-packages/*`
- temporary computational data, like already post processed output files
- duplicated data
- ...
- unused application packages, e.g. Python(2) packages in `$HOME/.local/lib/python*/site-packages/*`
- temporary computational data, like already post processed output files
- duplicated data

2. Pack and archive: The HPC storage is a high-performance parallel storage and not meant to be an archive. Data not used in the short to mid-term should be packed and moved to an archive storage, as sketched below.
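
For example, a minimal sketch for spotting large directories and packing a finished project before moving it off the HPC storage; the paths and archive name are placeholders:

```Bash
# Show the largest directories directly below $HOME, sorted by size
du -h --max-depth=1 "$HOME" | sort -h | tail -n 20

# Pack a finished project into a compressed archive before moving it
# to an archive system (project_2023 is a placeholder)
tar -czf project_2023.tar.gz project_2023/
```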

@@ -31,64 +28,7 @@ HPC Workspaces are managed by the group manager/leader and if applicable a deput
### I need to share data with my colleagues. What can I do?
HPC Workspaces are meant to host shared data. See [HPC Workspaces](../hpc-workspaces/workspaces.md)

<!-- ## Where should I put my data?
A coarse classification may be:
| data type | suggested target |
| :--- | :--- |
| private configuration data, e.g. SSH keys | HOME |
| temporary (weeks to month) application input/output data | SCRATCH |
| persistent application input/results, meant to be shared (some-when) | Workspace |
| applications, meant to be shared (some-when) | Workspace | -->

### Where can I get a Workspace?
A research group manager needs to **create** the Workspace, since extensions may incur charges.

If you want to **join an existing** Workspace, ask the Workspace manager or its deputy to add you.
See [HPC Workspaces](../hpc-workspaces/workspaces.md)

### How much does a Workspace cost?
Workspaces themselves are free of charge. Every research group has 10TB of disk space free of charge, which can be used in multiple Workspaces.
If necessary, additional storage can be purchased per Workspace, where only the actual usage will be charged, see [Workspace Management](../hpc-workspaces/management.md#additional-storage).


### What if our 10TB free of charge research group quota is full?
Your research group manager or a registered deputy can apply for additional quota. The actual quota used will be charged.

### Why can I not submit jobs anymore?
After joining an HPC Workspace, the private SLURM account gets deactivated and a Workspace account needs to be specified.
This can be done by loading the Workspace module, see [Workspace environment](../hpc-workspaces/environment.md):

```Bash
module load Workspace
```

Otherwise Slurm will present the following error message:
```Bash
sbatch: error: AssocGrpSubmitJobsLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
With this method we aim to distribute our resources in a fairer manner. HPC resources, including compute power, should be distributed among registered research groups. We can only relate users to research groups by utilizing Workspace information.

## Software issues
### Why is my private conda installation broken after migration?
Unfortunately, Anaconda hard-codes absolute paths into almost all files (including scripts and binary files).
A proper migration process may have included `conda pack`.
There is a way to access your old environments and create new ones with the same specification:
```Bash
export CONDA_ENVS_PATH=${HOME}/anaconda3/envs ## or where you had your old envs
module load Anaconda3
eval "$(conda shell.bash hook)"
conda info --envs
conda activate oldEnvName ## choose your old environment name
conda list --explicit > spec-list.txt
unset CONDA_ENVS_PATH
conda create --name myEnvName --file spec-list.txt # select a name
```
Please also note that there is a system-wide Anaconda installation, so there is no need for your own separate one.
Finally, after recreating your environments, please delete all old Anaconda installations and environments. These are not only large but also consist of a huge number of files.

### Why is the system complaining about not finding an existing module?

@@ -122,8 +62,6 @@ When loading `foss/2021a`, the `zlib/.1.2.11-GCCcore-10.3.0` should get loaded,
Please take this as an indication that you accidentally mixed different toolchains; rethink your procedure and stay within the same toolchain and toolchain version.

## Environment issues
### I am using zsh, but some commands and tools fail, what can I do?
There are known caveats with Lmod (the module system) and Bash scripts in zsh environments: Bash scripts do not source any system or user files by default. To initialize the (module) environment properly, you need to set `export BASH_ENV=/etc/bashrc` in your zsh profile (`.zshrc`), for example as shown below.
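
A minimal sketch of the workaround, assuming your zsh profile lives at `~/.zshrc`:

```Bash
# Make non-interactive Bash scripts source the system bashrc,
# so the module environment is initialized correctly under zsh
echo 'export BASH_ENV=/etc/bashrc' >> ~/.zshrc
```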

### I modified my bashrc, but it's not doing what I expect. How can I debug that bash script?
The bashrc can be debugged like any other Bash script, e.g. using Bash's trace mode as sketched below:
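
One possible sketch; the exact command may differ from the full answer in the documentation:

```Bash
# Trace every command while sourcing the file in a fresh shell
bash -x -c 'source ~/.bashrc'

# Alternatively, enable tracing inside the file itself while debugging
# (put near the top of ~/.bashrc and remove when done)
# set -x
```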
@@ -163,6 +101,24 @@ The job is not allowed to start because you have reached the maximum of allowed
**(ReqNodeNotAvail, UnavailableNodes:...)**
Some node required by the job is currently not available. The node may currently be in use, reserved for another job, in an advanced reservation, `DOWN`, `DRAINED`, or not responding. **Most probably there is an active reservation for all nodes due to an upcoming maintenance downtime (see the output of** `scontrol show reservation`) **and your job is not able to finish before the start of the downtime. This is another reason why you should specify the duration of a job (`--time`) as accurately as possible. Your job will start after the downtime has finished.** You can list all active reservations using `scontrol show reservation`.
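
As an illustration, a short sketch for checking upcoming reservations and requesting a tighter walltime; the 4-hour limit and `job.sh` are placeholders:

```Bash
# Show active and upcoming reservations (e.g. maintenance downtimes)
scontrol show reservation

# Request only the walltime the job really needs, so it can still be
# scheduled before the downtime starts
sbatch --time=04:00:00 job.sh
```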

### Why can I not submit jobs anymore?
After joining an HPC Workspace, the private SLURM account gets deactivated and a Workspace account needs to be specified.
This can be done by loading the Workspace module, see [Workspace environment](../hpc-workspaces/environment.md):

```Bash
module load Workspace
```

Otherwise Slurm will present the following error message:
```Bash
sbatch: error: AssocGrpSubmitJobsLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
With this method we aim to distribute our resources in a fairer manner. HPC resources, including compute power, should be distributed among registered research groups. We can only relate users to research groups by utilizing Workspace information.

### Why can't I submit further jobs?

!!! types note ""
14 changes: 4 additions & 10 deletions docs/halloffame.md → docs/general/halloffame.md
Expand Up @@ -3,7 +3,7 @@
If you previously used UBELIX to do your computational work, acknowledged this
in your publication, and want your publication listed here, please drop us a note via [https://serviceportal.unibe.ch/hpc](https://serviceportal.unibe.ch/hpc).
If you are wondering how you can acknowledge the usage of UBELIX in your
publication, have a look at the [homepage](index.md) of this documentation, where
publication, have a look at the [homepage](../index.md) of this documentation, where
you will find a text recommendation acknowledging the use of our cluster.

## Papers and Articles
@@ -79,9 +79,9 @@ Leichtle A, Fiedler G et al. | Pancreatic carcinoma, pancreatitis, and healthy c

## Posters

![Poster J. T. Casanova et al, 2023](images/casanova_2023_iap.png "J. T. Casanova et al., Computational approach to anti-Kasha photochemistry of Pt-dithiolene complexes, 2023"){: style="max-width: 100%"}
![Poster J. T. Casanova et al, 2023](../images/casanova_2023_iap.png "J. T. Casanova et al., Computational approach to anti-Kasha photochemistry of Pt-dithiolene complexes, 2023"){: style="max-width: 100%"}

![Poster Schwab et al, 2016](images/hof_schwab_2016_ncsml.png "Schwab et al., Computational neuroscience: Validation and reliability of directed dynamic networks of the brain, 2016"){: style="max-width: 100%"}
![Poster Schwab et al, 2016](../images/hof_schwab_2016_ncsml.png "Schwab et al., Computational neuroscience: Validation and reliability of directed dynamic networks of the brain, 2016"){: style="max-width: 100%"}

## Newspapers

@@ -92,10 +92,4 @@ Berner Forscher entdecken neue Klimazustände, in denen Leben möglich ist. | De
## Create an Entry

If you used UBELIX for your publication, please have your entry added to the list.
Open a ticket or create a pull request, see [Documentation Update](general/support.md).
The format of the entry is markdown:

```
<first author>, <last author> | <title> | [Details](<Boris link>) | [Direct Link](<DOI link>)
```
where authors are given as last name and first initial. Subscripts can be created using `<sub>2</sub>`.
Please open a ticket with the details of your publication.
40 changes: 28 additions & 12 deletions docs/general/news.md
@@ -1,17 +1,33 @@
# News

11-08-2022:
12.01.2024:

- added two additional login nodes to the cluster
- The user documentation has been streamlined and updated with recent information
- The UBELIX9 testing system, previewing the next-generation OS, is now available for all users.

In our ongoing commitment to providing a secure and efficient computing environment, we are migrating the HPC system from CentOS 7 to Rocky Linux 9.

We are pleased to inform you that a part of our infrastructure has been migrated and is ready to be tested by you. To get you started, please take the time to read this information thoroughly.

As part of the migration, we have implemented general software and security updates to ensure a secure and optimized computing environment. Please consult the manual pages (i.e., `man <command>`) to review the latest command syntax.

The list of software modules managed by the UBELIX team and accessible via the module commands has been updated. Please note that old software versions may have been discontinued in favor of more recent versions. Additionally, the Vital-IT and UBELIX software stacks have been merged. Explore the enhanced range of modules to benefit from the latest tools and applications available on UBELIX using the `module spider` command.
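
For instance, a quick way to browse the merged software stack; `Python` is only an example package name:

```Bash
# List all available modules, then search for a specific package
module spider
module spider Python
```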

While we have taken measures to minimize user impact, it is crucial to be aware of potential adjustments needed on your end. Most importantly, please verify that your workflows, scripts, and applications are compatible with the new environment.

It is important to note that there may be a need to recompile your executables for compatibility with the new system. Existing Python environments are expected to remain functional unless special libraries such as TensorFlow with GPU support are used. These may require a fresh installation.

Additionally, older software modules that are no longer managed by the UBELIX team may need to be installed by users if required. Instructions for custom software module installations can be found in the documentation section on EasyBuild.

The testing system is kept simple; therefore, only the default Quality of Service (QOS) is available for now. Investor resources have not been migrated yet and are still fully accessible on the old system. Existing job scripts that use the debug, long, gpu_preempt, and invest QOS need to be updated. Investors are encouraged to reach out to us if they wish to proceed with the migration of their resources.

To access the new system, please log in to submit02: `ssh <username>@submit02.unibe.ch`

Note that the graphical monitoring ([https://ubelix.unibe.ch/](https://ubelix.unibe.ch/)) does not cover the new testing environment yet. Please use the `squeue --me` command to query your job status on the new system, as sketched below. More details on the monitoring of the new system will follow.
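
A minimal sketch of the workflow described above; `<username>` is a placeholder for your Campus Account:

```Bash
# Log in to the testing system and check the status of your own jobs
ssh <username>@submit02.unibe.ch
squeue --me
```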

If you encounter any issues, we are ready to assist you. Feel free to reach out via [https://serviceportal.unibe.ch/hpc](https://serviceportal.unibe.ch/hpc). Please make sure to specify that your problem is related to the UBELIX testing environment and provide as much information as possible.

We appreciate your attention to these details and your cooperation as we work together to ensure a smooth transition to Rocky Linux 9.

04-05-2021:
Happy computing!

- major SLURM partition restructure, see [Slurm partitions](../slurm/partitions.md). Job scripts may need to be adapted.
- HPC Workspace officially in production [HPC Workspace Overview](../hpc-workspaces/workspaces.md)
- Kernel, CUDA driver, SLURM, and Spectrum Scale update
- in June: HOME quota fixed to 1TB, removal of GPFS institute shared directories

17-02-2021:

- Home migration: User HOMEs started to get migrated to the newer Spectrum Scale System storage
- HPC Workspaces: Beta Phase of custom group shared file spaces with tools and Slurm accounting
