Policy


Introduction

In the context of their activities, GIGA members need to work with documents and scientific data, and to run scientific applications that sometimes require high-performance computers. This document covers personal computers, data storage for single users and for groups, backup solutions, data exchange with non-GIGA scientists, high-performance computing options, and code versioning in a collaborative environment.


Purpose

The purpose of this policy is to describe the data storage and processing infrastructure available at GIGA and to define the acceptable usage of these resources. Following these rules is important to enable effective usage and management of the available resources for the benefit of the entire GIGA community. Note that user documentation explaining how to use this infrastructure is available in the form of a wiki (ADD LINK) and will be communicated to the users in the near future, directly through the XENIA platform.


Personal computer

GIGA members either use their own personal computer or are provided with a ULiège-owned computer by their PI. In both cases, the management and maintenance of personal computers at GIGA is the role of a specific IT service of ULiège called UDIMED-GIGA. Supported systems include Windows PCs, Apple Macs and some GNU/Linux distributions.


Acquisition and setup

Every computer (personal or ULiège-owned) must be properly configured before joining the GIGA IT network.
This includes:

  • getting a personal login, password and credentials
  • configuring Intranet and Internet access
  • configuring shared resources (storage, printers...)
  • installing some non-scientific tools (DoX, anti-virus, office software)
  • installing some scientific tools

When a new ULiège-owned computer has to be bought, the GIGA member must contact the IT specialists. They will provide advice and help select the most appropriate computer or device based on the needs of the GIGA member. They will also select the supplier among the official ULiège suppliers, take care of the purchase order, receive the delivery and prepare the hardware, installing the requested software so as to provide a ready-to-use system.
One advantage of this procedure is that it enables UDIMED-GIGA to provide more efficient technical follow-up. Moreover, hardware inventoried by UDIMED-GIGA is insured against theft and against fire and water damage.

More information (in French) at http://udimed.ulg.ac.be


Cloud storage for versioning and backup with dox

For other data types (e.g. Word/Excel/PowerPoint files), UDIMED-GIGA provides GIGA members with a tool called DoX.
DoX is a ULiège cloud storage system which, like its alternatives (e.g. Google Drive, Dropbox), lets users synchronize and version their data across their electronic devices (Windows, Mac and Linux computers, but also iPhone, iPad and Android devices).
The quota is 100 GB and its content is backed up every night by default.

Cloud storage, as opposed to our physical clustered storage and tapes, is a model of data storage in which digital data are stored in logical pools while the physical storage spans multiple servers (and often multiple locations).
It provides high availability of the data on all electronic devices through web or mobile authentication.

This system provides:

  • automated synchronization of the data
  • storage of the different versions of the data
  • an internal editor
  • a way to share data with non-ULiège users (i.e. users without ULiège credentials)


Storage

[Diagram: simplified representation of the GIGA storage structure]


Mass Storage

Upon joining GIGA, members are given access to a group data storage space (group share) for working on (storing, updating, deleting) data and files that need to be shared with and accessible to everyone in their respective group/platform. The members of each specific share are responsible for managing their data according to the rules defined in this document.

GIGA provides its members with a professional infrastructure (called "mass storage") which offers a very large disk capacity for storing big amounts of data, as well as its equivalent in tapes for backup (a tape library). The mass storage is organized to enable data sharing through specific group working spaces.

The disk space is organized as clustered storage, i.e. an aggregation of many hard drives on top of which sit several layers called the "file system", split into filesets that allow GIGA users/groups to work in compartmented spaces. This makes it easier to define quotas, numbers of inodes and specific rules, and allows better support to the users.

On the other hand, a tape library is a storage device which contains one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method for loading them (a robot).

Because this whole infrastructure is Linux-based, certain constraints apply:

  • a strict naming convention for both directories and files: no spaces (use underscores "_" instead) and no non-English characters (accents, foreign symbols); the allowed characters are a-z, A-Z, 0-9, "_" and "-" (see the sketch after this list)
  • the use of the backup and nobackup directories, described further in this document (see example)
  • each member must respect their home space and group quotas; users generating numerous and big datasets must strictly avoid duplicating their files (see example)
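As an illustration, here is a minimal shell sketch of renaming a non-compliant file so that it satisfies these rules (the file names are hypothetical, and the sanitize helper is our own illustration, not a GIGA-provided tool):

    # Rename by hand: spaces become underscores, accents are dropped.
    mv "résultats essai 1.txt" resultats_essai_1.txt

    # Or with a small helper (assumes glibc iconv for transliteration).
    sanitize() {
        echo "$1" | iconv -f utf8 -t ascii//TRANSLIT | tr ' ' '_' | tr -cd 'A-Za-z0-9._-'
    }
    mv "résultats essai 1.txt" "$(sanitize 'résultats essai 1.txt')"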


User home

Each GIGA member is given a personal space of 50 GB on the mass storage infrastructure, called the "Home", which by default is NOT backed up.

The main purpose of this Home is to serve as the user's entry point and as a buffer space for storing temporary data before moving it into the shared spaces. The Home is also the preferred space for storing intermediate results and draft workflows before running them.

The user's Home is organized with sub-directories generated upon the user's assignment to specific projects within his/her group/lab.

See use case 6.2.
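A hedged sketch of this intended workflow (all paths below are hypothetical, not the actual GIGA mount points):

    # Work on intermediate data in the Home...
    cd ~/projectA
    ./preprocess.sh raw/ > intermediate/counts.tsv

    # ...then move only the final results to the project's group share.
    mv intermediate/counts.tsv /mass_storage/my_lab/projectA/Share_GEN/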


Group home

Within a thematic unit (called URT or Research) or a platform (called PTF or Platforms), each lab has its own "Group Home" divided into sub-directories dedicated to the different projects. The head of the lab manages the membership of his/her own Group Home through XENIA.

Each GIGA member is by definition part of a group or lab and works on one or multiple projects. Upon assignment to a project, the member is granted access to a group space shared with the other members assigned to the same project.
Data generated on the GIGA platforms or downloaded from an external source must be stored in a "Share" directory of 2 TB available in every project-specific sub-directory.
The quotas can be expanded if justified. The Group Homes are backed up every night by default.

A "Group Home" is composed of project specific sub-directories that have:

  • One directory called nobackup which is, like indicated, not backuped
  • Two others called Share_IMG and Share_GEN. These are specific to URT where data related to the project are being stored. IMG being imaging data and GEN NGS sequencing data
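A sketch of the resulting layout (the lab and project names are hypothetical):

    Group_Home_my_lab/
        projectA/
            nobackup/      temporary or regenerable data, never backed up
            Share_IMG/     imaging data
            Share_GEN/     NGS sequencing data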


Sharing outside GIGA

Another data sharing system, available upon request, has been created to allow GIGA members to work collaboratively, in a way similar to the group shares, with "non-GIGA" partners.
Possibly a second DoX instance (TBD, in testing phase).


Backup

All data stored on the Mass Storage (except data stored in nobackup directories) are backed up every night to tapes at another location, using a synchronization system called "TSM".


Tapes

The tape library is a robot with 4 drives currently operating 620 tapes.
It has a retention policy of 28 days for average use of the spaces, which can shrink to a shorter period when users' activity on the file systems is high.
Because this process generates quite a heavy load, it is scheduled to run at night, within a work window between 00:00 and 07:00.

Since roughly 2 TB to 2.5 TB can be handled during that window, users are asked to:

  • avoid daily changes amounting to more than 2 TB
  • regroup small files as much as they can (through, for example, zip/rar/tar archives; see the sketch after this list)
  • ask GIGA's IT before moving more than 1 TB
  • create and use nobackup directories as much as possible, in order to avoid fast, long-lasting or massive operations on the filesets; users working with a tool that generates a large number or volume of files can configure their script to write temporary data to one of those directories and keep only the final output in the backed-up spaces
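Two hedged shell sketches of these guidelines (paths, tool and file names are hypothetical):

    # Regroup many small files into one archive before the nightly backup runs.
    tar -czf run42_logs.tar.gz run42/logs/ && rm -r run42/logs/

    # Point a tool's temporary output to a nobackup directory and keep only
    # the final result in the backed-up space.
    export TMPDIR=/path/to/projectA/nobackup/tmp
    mkdir -p "$TMPDIR"
    my_tool --tmp-dir "$TMPDIR" --output projectA/results/final.vcf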


Offline Archiving

Offline archiving is a long-term solution for older datasets that still need to be kept. When the archive is created, all the corresponding data on the mass storage are erased.
An expiration date is still required; it can be 10 years or more if justified, but once the chosen period has passed, all data are automatically erased.
As with the nobackup directories, users can create archive directories containing the datasets to be archived offline. (TBC)
(Should a rule be created to allow users to buy tapes if they want more flexibility on those conditions?)


Data Processing

For high-throughput data processing (scientific computation), GIGA provides its members with different types of computing clusters, each coming with its pros and cons.

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.

To be allowed to use the clusters, users need basic knowledge of:

  • LINUX: terminal navigation, SSH (secure connections), software handling (different from Mac and Windows). Presentations and workshops can be given on demand. See example
  • SLURM: users can NEVER run computations on the head (login) node and must use SLURM to submit jobs to a cluster; a minimal submission script is sketched below. See example
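For illustration, a minimal SLURM submission script (the partition, module and file names are hypothetical; the directives and the sbatch/squeue commands are standard SLURM):

    #!/bin/bash
    #SBATCH --job-name=align_sample01
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --time=02:00:00
    #SBATCH --output=%x_%j.log

    # The computation runs on a compute node, never on the login node.
    module load bwa
    bwa mem -t 4 reference.fa sample01.fastq > sample01.sam

Submit it with "sbatch align.sh" and follow its state with "squeue -u $USER".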

The Primary and Secondary clusters also give users transparent access to their private and shared spaces.
Whenever users connect to them, they start in their home, from which they can reach their respective group shares.
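A hedged connection sketch (the host name is hypothetical; the actual address is communicated when credentials are issued):

    # Log in with your ULiège credentials; you land in your home directory.
    ssh u123456@primary-cluster.giga.ulg.ac.be
    cd /path/to/my_group_share       # group shares are reachable from there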


Primary cluster

Also called "genetic cluster", it is hosted and managed at SeGI (Service General d'Informatique de l'ULiège).
Each of it's nodes are bought by PI's from specific platform/research groups. Therefore, the users might (or not) have access to it though the system.
It is made in a collaborative way, meaning anyone willing to contribute by buying one or more node(s) will be added to the whole cluster.
It is as well for public as for production uses. Users' private and shared storage spaces are linked to it and so seamlessly (everyone's starting point).
See example


Scratch

Within the cluster, special spaces are defined to serve as temporary storage for the data users manipulate while executing their workflows.

They come in two different forms:

  • each node has a /local directory; these are relatively small but extremely fast because they sit on the processing machine itself. The retention policy strictly requires users to purge their own datasets at the end of each run (a cleanup sketch is given at the end of this section)
  • the second one, called /gallia, is bigger and still extremely fast because it is connected to the network through high-speed hardware. Users must send a request to GIGA's IT to obtain credentials for it. The retention policy gives them 15 days before an automatic purge

See example
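A hedged sketch of the expected cleanup discipline on /local inside a job script (the tool and file names are hypothetical; the SLURM variables are standard):

    # Stage the work in the node-local scratch for speed...
    WORKDIR=/local/$USER/$SLURM_JOB_ID
    mkdir -p "$WORKDIR"
    cp input.fastq "$WORKDIR"/
    run_analysis "$WORKDIR"/input.fastq > "$WORKDIR"/output.txt

    # ...copy the result back, then purge your data as the policy requires.
    cp "$WORKDIR"/output.txt "$SLURM_SUBMIT_DIR"/
    rm -rf "$WORKDIR"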


Secondary cluster

Hosted at GIGA, this cluster works like the primary one, but a separate request for credentials is required to use it with users' private and shared storage spaces.
To do so, users must send a request to GIGA's IT with their ULiège username and the platform/group to which they belong.
See example


Inter Belgian French-universities cluster

The Consortium des Équipements de Calcul Intensif, or CÉCI, is a means for the Belgian French-speaking universities to share high-performance scientific computing facilities and resources, and to pool know-how and expertise across the universities' clusters. GIGA's system and software infrastructures are kept similar to CÉCI's in order to make things as easy as possible for the users.

It allows GIGA members to use its infrastructure under certain conditions:

  • users must request credentials on the CÉCI website
  • CÉCI is solely for public use and no production is allowed on it, meaning users may perform punctual runs but not automated continuous workflows
  • GIGA users do not have direct access to their private and group storage; they must therefore handle the data flow between our storage and CÉCI's themselves, as sketched below

See example
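Since there is no direct mount, transfers have to be done by hand, for example with rsync (the user and host names are hypothetical):

    # Push a dataset to CÉCI before a run...
    rsync -av projectA/inputs/ myceciuser@ceci-login.example.be:projectA/inputs/

    # ...and pull the results back to GIGA storage afterwards.
    rsync -av myceciuser@ceci-login.example.be:projectA/results/ projectA/results/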


GitLab

When working on scientific datasets, GIGA members might need to use and write programming code.
GIGA provides members with a dedicated environment called GitLab (space quotas still to be discussed) that allows collaborative work on programming code and/or non-formatted text (which does not include Microsoft/Libre/Open Office documents). GitLab is based on a powerful and very useful versioning system (Git). The content is backed up every night by default.

This system provides:

  • versioning of its content at an atomic level (file by file, down to any single character)
  • a user-friendly interface through web access (easy access)
  • the ability to work in structured teams with granular credential control (up to the PIs and/or their own computing specialists)
  • the ability to create separate projects called repositories, which can be private, shared with a group, or even public (e.g. to store the annexes of published papers)
  • a more robust and powerful, yet harder to master, way of handling projects through the Linux terminal (the Git command line). Presentations and workshops can be given on demand

In order to maintain a manageable environment, users are asked to:

  • define groups and projects named in a readable way (see example and the session sketch below)
  • keep their repositories well organized
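As a closing illustration, a typical Git session against the GIGA GitLab (the server URL and project names are hypothetical; the git commands are standard):

    # Clone a repository, record a change, and publish it.
    git clone https://gitlab.giga.example.be/my_lab/rnaseq_pipeline.git
    cd rnaseq_pipeline
    git add normalize_counts.R
    git commit -m "Add count normalization step"
    git push origin master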