Policy
Introduction
The GIGA provides its members with data storage space and high-performance computing resources.
Upon joining the GIGA, members are given access to a group data storage space (group share) for working on (storing, updating, deleting) data and files that need to be shared with and accessible to everyone in their respective group/platform.
Another data sharing system has been created to allow GIGA members to collaborate, in a way similar to group shares, with "non-GIGA" clients.
Members of each specific share are responsible for managing their data according to the rules defined in this document.
Purpose
The purpose of this policy is to inform the community about the available data storage and processing infrastructure, and to define the acceptable usage of these resources.
Effective management and use of individual, group and "outside" data storage will enable system administrators to manage GIGA's computing resources more efficiently.
Storage
Simple representation of our storage structure here
Personal computer
Management and maintenance of personal computers at the GIGA is the responsibility of a dedicated ULiège IT service called UDIMED-GIGA.
For new GIGA members, there are two possible scenarios: they use their own personal computer, or their laboratory provides them with a ULiège-owned one.
Supported systems include Windows PCs, Apple Macs and some GNU/Linux distributions.
For any new system, IT specialists advise GIGA members and help them select the most appropriate computer or device for their needs.
They select the supplier (among official ULiège suppliers), take care of the purchase order, receive the hardware and install the requested software to provide a ready-to-use system.
Knowing the system, UDIMED-GIGA can then provide a more efficient technical follow-up.
Moreover, hardware registered with UDIMED-GIGA is insured against theft, fire and water damage.
Any new computer (property of the GIGA member or of ULiège) must be properly configured before joining the GIGA IT network.
This includes:
- getting a personal login, password and credentials
- configuring Intranet and Internet access
- configuring shared resources (storage, printers...)
- installing some non-scientific tools (DoX, anti-virus, office software)
- installing some scientific tools
More info (in French) at http://udimed.ulg.ac.be
Mass Storage
The GIGA provides its members with a professional infrastructure (called "mass storage") which includes a very large disk capacity to store big amounts of data, as well as its equivalent on tapes for backups (a tape library). The mass storage is organized to enable data sharing through specific group working spaces.
The disk space is organized as clustered storage, i.e. an aggregation of many hard drives topped by several layers called "file systems", split into filesets. This lets GIGA users/groups work in compartmented spaces, which makes it easier to define quotas, inode counts and specific rules, and brings better support to all of them.
A tape library, on the other hand, is a storage device containing one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method (a robot) for loading them.
Because this whole infrastructure is Linux-based, it implies certain constraints:
- A strict naming convention for both directories and files:
  - No spaces (use underscores "_" instead)
  - No non-English characters (accents, foreign symbols)
  - Allowed characters: a-z, A-Z, 0-9, "_" and "-"
- The use of the backup and nobackup directories - described further in this document - must carefully respect the following rule XXXXXXXXX ??? (that? or force them to ask the IT service first?). See example
- Each member must respect his/her home space and group quotas. Users generating numerous large datasets must strictly avoid duplicating their files. See example
User home
Each GIGA member is given a personal space of 100 GB on the mass storage infrastructure, called the "User Home", which is backed up by default.
The User Home is organized with sub-directories:
- One called nobackup which, as its name indicates, is not backed up
- The others are generated upon the user's assignment to specific projects within his/her group/lab
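As an illustration, a hypothetical User Home could look like this (the username and project names are invented for the example; only nobackup is an actual convention):

/home/u123456/        # hypothetical User Home, 100 GB quota, backed up
    nobackup/         # NOT backed up: temporary and intermediate files
    ProjectA/         # created upon assignment to project "ProjectA"
    ProjectB/         # created upon assignment to project "ProjectB"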
Group home
Within a thematic unit (called URT, for Research) or a platform (called PTF, for Platforms), each lab has its own "Group Home" divided into sub-directories dedicated to different projects. The head of the lab manages the membership of his/her own Group Home.
Each GIGA member is by definition part of a group or a lab and works on one or multiple projects. Upon assignment to a project, the member is granted access to a group space shared with the other members assigned to the same project.
Data generated on the GIGA platforms or downloaded from an external source must be stored in a "Share" directory of 2 TB present in every project-specific sub-directory.
The quotas, if justified, can be expanded. The Group Homes are backed up every night by default.
A "Group Home" is composed of project specific sub-directories that have: - One directory called nobackup which is, like indicated, not backuped - In the case of thematic units, two others called Share_IMG and Share_GEN. These are special spaces where data generated for the project by a platform are being stored. IMG being imaging data and GEN NGS sequencing data
DoX
For the other data types (e.g. Word/Excel/Powerpoint files, ...), UDIMED-GIGA provides GIGA members with a tool called DoX.
DoX is a ULiège cloud storage system which, like its alternatives (e.g. Google Drive, Dropbox), synchronizes the user's data across his/her electronic devices (Windows, Mac and Linux computers, but also iPhone, iPad and Android devices) and keeps their successive versions.
The quota is 100 GB and its content is backed up every night by default.
Cloud storage, as opposed to our physical clustered storage and tapes, is a model of data storage in which the digital data is stored in logical pools, while the physical storage spans multiple servers (and often locations).
It is made for high availability across all electronic devices, with web or mobile user identification.
This system provides:
- An automated synchronization of the data
- The storage of the different versions of the data
- An internal editor
- A way to share data with non-ULiège users (meaning users without credentials)
Sharing outside GIGA
A second DoX? (TBD, in testing phase)
GitLab
When working on scientific datasets, GIGA members might need to use, and possibly write, programming code.
The GIGA provides members with a specialized environment called GitLab (need to talk about space quotas) that allows collaborative work on programming code and/or unformatted text (which does not include Microsoft/Libre/Open Office documents), using a powerful versioning system. The content is backed up every night by default.
This system provides:
- The versioning of its content at an atomic level (file by file, down to any single character)
- A user-friendly interface through web access (easy access)
- The ability to work in structured teams with granular credential control (up to the PIs and/or their own computing specialist(s))
- The ability to create separate projects called repositories:
  - Either private to the user or his/her groups, or public in order to, for example, store the annexes of published papers
- A more robust and powerful, yet harder to handle, access through Linux terminals. Presentations and workshops can be given on demand
In order to maintain a manageable environment, users are asked to:
- Name groups and projects in a readable way. See example
- Keep their repositories well organized (a minimal workflow sketch is given below)
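As an illustration, a typical session on a repository could look like this (the GitLab server name, repository path and file name are invented for the example):

# Get a local copy of a repository hosted on GIGA's GitLab (hypothetical URL)
git clone https://gitlab.example.ulg.ac.be/MyLab/my_analysis.git
cd my_analysis
# Modify a script, then record the change with a readable message
git add align_reads.sh
git commit -m "Fix the read-trimming parameters"
# Send the new version back to the server so that coworkers can retrieve it
git push origin master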
Backup
All the data stored on the mass storage (except for the data stored in nobackup directories) are backed up every night on tapes, at another location, using a synchronization system called "TSM".
Tapes
The tape library is a robot with 4 drives, currently working with 620 tapes.
It has a retention policy of 28 days for an average use of the spaces, which can shift to a shorter period if users' activity on the file systems is high.
Because this process carries quite a heavy load, it is made to run during the night, within a work window between 00:00 and 07:00.
Since it can handle about 2~2.5 TB during that time, users are asked to:
- Avoid daily changes amounting to more than 2 TB
- Regroup small files as much as they can (through, for example, zip/rar/tar archives; see the sketch after this list)
- Ask GIGA's IT before moving over 1 TB
- Create and use nobackup directories as much as they can, in order to prevent fast, long-lasting or massive operations on filesets. Users working with a tool that generates a large number (or size) of files can configure their scripts to write temporary data in one of those directories and keep only the final output in the backed-up spaces
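As an illustration, a workflow respecting these rules could look like this ("my_pipeline" is a hypothetical tool; adapt the paths to your own spaces):

# Write the many intermediate files of a run into a nobackup directory
my_pipeline --input ~/Platform/GEN/MyLab/Share_GEN/sample1.fastq.gz \
    --tmpdir ~/nobackup/sample1_tmp/ \
    --output ~/MyLab/MyProject/sample1_result.bam
# Regroup thousands of small log files into a single archive
# before storing them in a backed-up space
tar -czf ~/MyLab/MyProject/sample1_logs.tar.gz -C ~/nobackup/sample1_tmp/ logs/
# Purge the temporary files once the final output is safe
rm -r ~/nobackup/sample1_tmp/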
Offline Archiving
Offline archiving is a long-term solution for older datasets that still need to be saved/kept. When the archive is created, all the corresponding data on the mass storage is erased.
An expiration date is still requested; it can be 10 years or more if justified, but once the chosen time period has passed, all data is automatically erased.
Like the nobackup directories, users can create archive directories containing the datasets that have to be backed up offline. (TBC)
(Create a rule to allow users to buy tapes IF they want more flexibility on those conditions?)
Data Processing
For high-throughput data processing (scientific computation), GIGA provides its members with different types of computing clusters, each coming with its pros and cons.
A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.
The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.
To be allowed to use the clusters, users need basic knowledge of:
- LINUX: terminal navigation, SSH (secured connections), software handling (different from Mac and Windows). Presentations and workshops can be given on demand. See example
- SLURM: a user must NEVER run computations on the main frame directly, and has to use Slurm to submit jobs to a cluster. See example and the sketch after this list
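As an illustration, connecting to a cluster and submitting a job could look like this (the hostname, username and program are invented for the example):

# Connect to the cluster front-end over SSH (hypothetical hostname)
ssh u123456@cluster.example.ulg.ac.be

# Content of job.sh, a minimal Slurm batch script
#!/bin/bash
#SBATCH --job-name=my_test
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
my_program --threads 4 --input data.txt --output result.txt

# Never run my_program directly on the front-end:
# submit the script to the scheduler instead
sbatch job.sh
# Follow the state of the job in the queue
squeue -u u123456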
The Primary and Secondary clusters also allow users to access their private and shared spaces in a transparent way.
Whenever they connect to one of them, they start from their home, from which they can reach their respective group shares.
Primary cluster
Also called "genetic cluster", it is hosted and managed at SeGI (Service General d'Informatique de l'ULiège).
Each of its nodes is bought by PIs from specific platforms/research groups. Therefore, a given user may or may not have access to them through the system.
It is built in a collaborative way, meaning anyone willing to contribute by buying one or more node(s) will have them added to the whole cluster.
It is meant for public as well as for production use. Users' private and shared storage spaces are linked to it seamlessly (it is everyone's starting point).
See example
Scratch
Within the cluster, special spaces are defined that work as temporary ones for the data users manipulate while executing their workflows.
They come in two different forms:
- Each node has a directory called /local. These are relatively small but extremely fast, because they sit on the processing machine itself. The retention policy strictly requires users to purge their own datasets from them at the end of each run (see the sketch after this list)
- The second one, called /gallia, is bigger and still extremely fast, because it is connected to the network through high-speed hardware. Using it requires sending a request to GIGA's IT to get credentials. Its retention policy gives users 15 days before an automatic purge
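As an illustration, a job using the node-local /local scratch could look like this ("my_tool" is a hypothetical program; the paths are examples):

#!/bin/bash
#SBATCH --job-name=scratch_demo
# Create a private working directory on the fast node-local disk
mkdir -p /local/$USER/run_$SLURM_JOB_ID
cd /local/$USER/run_$SLURM_JOB_ID
# All heavy intermediate I/O happens here, on the processing machine itself
my_tool --input ~/MyLab/MyProject/input.dat --workdir .
# Copy only the final result back to the backed-up space
cp result.out ~/MyLab/MyProject/
# Purge the scratch at the end of the run, as the retention policy requires
rm -rf /local/$USER/run_$SLURM_JOB_ID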
Secondary cluster
Hosted at the GIGA, this cluster works like the primary one but requires a separate credentials request to allow its use with users' private and shared storage spaces.
To do so, users must make a request to GIGA's IT and provide their ULiège username and the platform/group they are related to.
See example
Inter Belgian French-universities cluster
The Consortium des Équipements de Calcul Intensif, or CÉCI, is the means the Belgian French-speaking universities have created to share high-performance scientific computing facilities and resources, and to mutualize know-how and expertise among the universities' clusters. Concerning the system and software infrastructures, GIGA's and CÉCI's are alike, in order to make things as easy as possible for the users.
It allows GIGA members to use its infrastructure under some conditions. Those being:
- The user must make a credentials request on the CÉCI website
- CÉCI is solely for public use; no production is allowed on it. This means punctual runs are allowed, but not automated continuous workflows
- GIGA users won't have direct access to their private and group storages. Hence, they will have to take care of the data flow between our storage and CÉCI's themselves (see the sketch after this list)
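As an illustration, handling the data flow by hand could look like this (the hostname, username and paths are invented for the example):

# Push input data from GIGA's storage to a CÉCI cluster (hypothetical hostname)
rsync -av ~/MyLab/MyProject/inputs/ u123456@nic4.example.be:inputs/
# ...submit and run the jobs on the CÉCI side...
# Pull the results back to GIGA's backed-up storage
rsync -av u123456@nic4.example.be:results/ ~/MyLab/MyProject/results/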
Use cases
Backup Nobackup
TBA
Avoid duplicates
Because users think "this is easier/faster", they tend to simply copy input datasets to other destinations, be it their private home or their group's.
That is not true at all: despite being fast, copying still takes time (and storage space).
The good practice is to have one spot centralizing input datasets, with a good, readable structure (directories), and to simply reference them as inputs when using the needed software.
From there, what goes in the project's directory would be:
- The output from workflows
- Some notes
- Some plots
E.g.:
A user is in his/her home. From there, there are also links to his/her groups, where, among other things, the data generated by one of GIGA's platforms are stored.
That user wants to use a software tool called STAR (an RNA-Seq aligner); the command would look like this:
STAR --genomeDir Platform/GEN/References/ --readFilesCommand zcat \
    --readFilesIn Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R1.fastq.gz Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R2.fastq.gz \
    --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
    --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
    --runThreadN 20 --outFileNamePrefix "Platform/GEN/MyLab/MyProject/star_GM12878_rep1/"
To explain this example a bit more: the user would always have his/her reference genomes (or whatever type of reference to work with) and the raw data generated by one of GIGA's platforms at the same spot... and so those would become static values in his/her scripts. Those being:
(myhome/)Platform/GEN/References/ and
(myhome/)Platform/GEN/MyLab/Share_GEN/
The only thing he/she would have to define is therefore where the data coming OUT of this workflow goes, that being:
(myhome/)Platform/GEN/MyLab/MyProject/star_GM12878_rep1/
Working this way results in a cleaner, faster and easier way to work as an individual and as a team; it also allows sharing among coworkers AND, more importantly, saves storage space (hence more projects can be created, backups can be kept longer, etc.).
Right nomenclature
GIGA's computing infrastructure being Linux-based, a few good practices apply when it comes to naming.
Nowadays, Linux accepts more foreign characters, but still has issues with them.
Because of that, users are asked to use exclusively alphanumeric characters, underscores "_" and dashes "-".
So, if one wanted to organize his/her directory structure like, for example:
/myhome/mygroupshare/Hervé/
/myhome/mygroupshare/ゲノム/
/myhome/mygroupshare/الجينوم/
/myhome/mygroupshare/some other things/
The problem is that GIGA's backup system, being quite sensitive, would run into errors or, even worse, simply skip those directories and their whole content.
The solution is to always use English characters and to replace spaces with underscores, like this:
/myhome/mygroupshare/Herve/
/myhome/mygroupshare/some_other_things/
/myhome/mygroupshare/someOtherThings/
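As an illustration, existing directories can be renamed to conform to this rule (the paths are those of the example above; the quotes protect the special characters):

# Rename a directory containing an accented character
mv "/myhome/mygroupshare/Hervé" /myhome/mygroupshare/Herve
# Replace the spaces in a directory name with underscores
mv "/myhome/mygroupshare/some other things" /myhome/mygroupshare/some_other_things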
Linux basics
TBA
Slurm basics
TBA
Primary cluster
TBA
Scratches
TBA
Secondary cluster
TBA
CECI cluster
TBA
Hardware specs
Technical explanations:
- http://www.ceci-hpc.be/clusters.html
- https://en.wikipedia.org/wiki/InfiniBand#Performance
Primary cluster
- Partitions: 11
- CPU type: Intel and AMD
- Nodes: 26
- Cores: 688
- Memory: 5.53 TB total, 128~512 GB per node
- Network: Gigabit Ethernet
Secondary cluster
- Partitions: 1
- CPU type: AMD
- Nodes: 17
- Cores: 389 (logical)
- Memory: 32~2585 GB per node
- Network: Gigabit Ethernet
CECI cluster
NIC4
- Partitions: 1
- CPU type: Intel
- Nodes: 128
- CPUs: 2048 (physical)
- Memory: 64 GB per node
- Network: InfiniBand Quad Data Rate
Vega
- Partitions: 1
- CPU type: AMD
- Nodes: 43
- CPUs: 2752 (physical)
- Memory: 256 GB per node
- Network: InfiniBand Quad Data Rate
Hercules
- Partitions: 1
- CPU type: Intel
- Nodes: 64
- CPUs: 896 (physical)
- Memory: 36~128 GB per node
- Network: Gigabit Ethernet
Dragon1
- Partitions: 1
- CPU type: Intel
- Nodes: 26
- CPUs: 416 (physical)
- Memory: 128 GB per node
- Network: Gigabit Ethernet
Lemaitre2
- Partitions: 1
- CPU type: Intel
- Nodes: 115
- CPUs: 1380 (physical)
- Memory: 48 GB per node
- Network: InfiniBand Quad Data Rate
Hmem
- Partitions: 1
- CPU type: AMD
- Nodes: 17
- CPUs: 816 (physical)
- Memory: 128~512 GB per node
- Network: InfiniBand Quad Data Rate
Zenobe
- Partitions: 1
- CPU type: Intel
- Nodes: 582
- CPUs: 13968 (physical)
- Memory: 64~256 GB per node
- Network: InfiniBand 2x Quad Data Rate + 1x Fourteen Data Rate