Policy

1. Introduction

The GIGA provides data storage spaces and high-performance processing computers to its members.
Upon joining the GIGA, members are assigned access to a group data storage (group share) for working on (storing, updating, deleting) data and files that need to be shared with and accessible to everyone in their respective group/platform.
Another data sharing system has been created to allow GIGA members to collaborate, in a way similar to group shares, with "non-GIGA" clients.
Members of each specific share are responsible for managing their data according to the rules defined in this document.

2. Purpose

The purpose of this policy is to inform the community about the available data storage and processing infrastructure, and to define acceptable usage of these resources.
Effective management and use of individual, group and "outside" data storage will enable system administrators to manage GIGA's computing resources more efficiently.

3. Storage

Simple representation of our storage structure here
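
In the meantime, here is a hedged sketch of the layout described in sections 3.2.1 and 3.2.2 below (directory names are illustrative, not the actual paths):

User_Home/<member>/                  100Gb per member, backed up by default
    nobackup/                        not backed up
    <project>/                       created upon assignment to a project
Group_Home/<URT-or-PTF>/<lab>/<project>/
    nobackup/                        not backed up
    Share_IMG/ and Share_GEN/        2Tb shares for platform-generated data (thematic units)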

3.1 Personal computer

Management and maintenance of GIGA's computer park is handled by a dedicated ULiège computing unit called UDIMED-GIGA.

For a new GIGA member, there are two possible scenarios: they use their own personal computer (whether a Windows-based PC or a Mac), or their laboratory provides them with a ULiège-owned one.

To buy hardware and computer peripherals, UDIMED-GIGA helps and advises the GIGA member in building the purchase order. UDIMED-GIGA is in contact with the official suppliers of the University of Liège to provide you with a price offer. Upon receipt of the computer, UDIMED-GIGA can prepare the material directly in their office and provide you with a ready-to-use machine loaded with the requested software. Follow-up can then be more effective in the event of a breakdown or a technical intervention.

In the case of a member using their own personal computer, it must be configured before being connected to the GIGA network.

The configuration of the computer (own PC or ULiège-owned) includes:
- Definition of credentials and configuration of your Internet access
- Mounting the shared storage
- Installing some non-scientific tools (DoX, anti-virus, office software)
- Installing some scientific tools
- Setting up the printers

3.2 Mass Storage

The GIGA provides members with a professional infrastructure (called "mass storage") which includes a very large disk capacity for storing large amounts of data, as well as its equivalent in tapes for backups (a tape library). The mass storage is organized to enable data sharing through specific group working spaces.

The disk space is organized as clustered storage, i.e. an aggregation of many hard drives, topped by several software layers (the "file system") split into filesets. This allows GIGA users and groups to work in compartmented spaces, which makes it easier to define quotas, inode limits and specific rules, and to bring better support to all of them.

On the other hand, a tape library is a storage device which contains one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method for loading them (a robot).

This whole infrastructure being "Linux-based", it implies certain constraints:
1. Strict naming rules for both directories and files:
- No spaces (use underscores "_" instead)
- No non-English characters (accents, foreign symbols)
- Allowed characters: a-z, A-Z, 0-9, "_" and "-"
2. The use of the backup and nobackup directories, described further in this document, has to carefully respect the following rule XXXXXXXXX ??? (that? or force them to ask IT service first ?). See example
3. Each member must respect their home space and group quotas. Users generating numerous and large datasets must strictly avoid duplicating their files. See example

3.2.1 User home

Each GIGA member is given a personal space of 100Gb on the mass storage infrastructure, called the "User Home", which is backed up by default.

The User Home is organized with sub-directories:
- One called nobackup which, as the name indicates, is not backed up
- Others generated upon the user's assignment to specific projects within their group/lab

3.2.2 Group home

Within a thematic unit (called URT or Research) or a platform (called PTF or Platforms), each lab has its own "Group Home", divided into sub-directories dedicated to different projects. The head of the lab manages the membership of its own Group Home.
Each GIGA member is by definition part of a group or a lab and works on one or multiple projects. Upon assignment to a project, the member is granted access to a group space shared with the other members assigned to the same project.
Data generated on the GIGA platforms or downloaded from an external source must be stored in a "Share" directory of 2Tb present in every project-specific sub-directory.
Quotas can be expanded if justified. Group Homes are backed up every night by default.

A "Group Home" is composed of project specific sub-directories that have: - One directory called nobackup which is, like indicated, not backuped - In the case of thematic units, two others called Share_IMG and Share_GEN. These are special spaces where data generated for the project by a platform are being stored. IMG being imaging data and GEN NGS sequencing data

3.3 DoX

For other data types (e.g. Word/Excel/PowerPoint files) UDIMED-GIGA provides GIGA members with a tool called DoX.
DoX is a ULiège cloud storage system which, like its alternatives (e.g. Google Drive, Dropbox), synchronizes the user's data, with versioning, across their electronic devices (Windows, Mac and Linux computers, but also iPhone, iPad and Android devices).
The quota is 100Gb and the content is backed up every night by default.

Cloud storage, as opposed to our physical clustered storage and tapes, is a model of data storage in which digital data is stored in logical pools while the physical storage spans multiple servers (and often locations).
It is designed for high availability across all electronic devices, with web or mobile user authentication.

This system provides:
- Automated synchronization of the data
- Storage of the different versions of the data
- An internal editor
- A way to share data with non-ULiège users (i.e. without ULiège credentials)

3.4 Sharing outside GIGA

2nd DoX ? (TBD, in testing phase)

3.5 GitLab

When working on scientific datasets, GIGA members may need to use, and eventually write, programming code.
The GIGA provides members with a specialized environment called GitLab (Need to talk about space quotas) that allows collaborative work on programming code and/or unformatted text (this does not include Microsoft/Libre/OpenOffice documents) and offers a powerful versioning system. The content is backed up every night by default.

This system provides:
- Versioning of its content at an atomic level (file by file, down to any single character)
- A user-friendly web interface (easy access)
- The ability to work in structured teams with granular credentials control (up to the PIs and/or their own computing specialist(s))
- The ability to create separate projects called repositories, either private to the user or their groups, or public, for example to store the annexes of published papers
- A more robust and powerful, yet harder to master, way of handling projects through Linux terminals (a minimal workflow is sketched below). Presentations and workshops can be given on demand

In order to maintain a manageable environment, users are asked to:
- Define group and project names in a readable way. See example
- Keep their repositories well organized
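
To illustrate the terminal workflow mentioned above, a minimal sketch; the repository URL, file name and commit message are hypothetical placeholders, not GIGA's actual server:

# Clone a repository from GitLab, record a change and push it back
git clone https://gitlab.example.org/mylab/myproject.git
cd myproject
git add analysis_script.sh                      # stage a new or modified file
git commit -m "Add read-trimming step to the analysis script"
git push origin master                          # publish the commit to the server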

4. Backup

All the data stored on the Mass Storage (except data stored in nobackup directories) is backed up every night onto tapes at another location, using a synchronization system called "TSM".

4.1 Tapes

The tape library is a robot with 4 drives currently operating on 620 tapes.
It has a retention policy of 28 days for average use of the spaces, which can shift to a shorter period if users' activity on the file systems is high.
Because this process carries quite a heavy load, it runs during the night, within a work window between midnight and 07:00.

Since the system can handle 2Tb~2.5Tb during that window, users are asked to:
- Avoid daily changes amounting to more than 2Tb
- Group small files together as much as they can (through, for example, zip/rar/tar archives; see the sketch below)
- Ask GIGA's IT before moving more than 1Tb
- Create and use nobackup directories as much as they can, in order to keep fast, long-running or massive work off the backed-up filesets. Users working with a tool that generates a large number or volume of files can configure their script to write temporary data in one of those directories and keep only the final output in the backed-up spaces
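
As an illustration of grouping small files before backup, a minimal sketch (the directory name is hypothetical):

# Bundle a directory of many small files into one compressed archive
tar -czf qc_reports_2018.tar.gz qc_reports/
tar -tzf qc_reports_2018.tar.gz                 # verify the archive content
rm -r qc_reports/                               # remove the originals once verified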

4.2 Offline Archiving

Offline archiving is a long-term solution for older datasets that still need to be kept. When the archive is created, the corresponding data on the mass storage is erased.
An expiration date is still required; it can be 10 years or more if justified, but once the chosen period has passed, all the data is automatically erased.
Like the nobackup directories, users can create archive directories containing the datasets that have to be backed up offline. (TBC)
(Create a rule to allow users to buy tapes IF they want more flexibility on those conditions?)
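
A hypothetical sketch of that workflow, assuming the archive-directory mechanism above is confirmed (names are illustrative):

# Gather a finished dataset into an archive directory, then ask GIGA's IT
# to write it to tape with the requested expiration date
mkdir -p archive
mv old_sequencing_run_2015/ archive/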

5. Data Processing

For high-throughput data processing (scientific calculation), GIGA provides its members with different types of calculation clusters. Each comes with its own pros and cons.

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.

To be allowed to use the clusters, users need basic knowledge of:
- LINUX: terminal navigation, SSH (secured connections), software handling (different from Mac and Windows). Presentations and workshops can be given on demand. See example
- SLURM: a user can NEVER run computations on the main frame and has to use Slurm to submit jobs to a cluster. See example

The Primary and Secondary clusters also give users transparent access to their private and shared spaces.
Whenever users connect, they start from their home directory, from which they can reach their respective group shares.

5.1 Primary cluster

Also called "genetic cluster", it is hosted and managed at SeGI (Service General d'Informatique de l'ULiège).
Each of its nodes is bought by PIs of specific platforms/research groups; therefore users may or may not have access to it.
It is built in a collaborative way, meaning that anyone willing to contribute by buying one or more nodes is added to the whole cluster.
It serves public as well as production uses. Users' private and shared storage spaces are linked to it seamlessly (everyone's starting point).
See example

5.1.1 Scratch

Within the cluster, special spaces are defined as temporary storage for the data users manipulate while executing their workflows.

These come in two different forms:
- Each node has a /local directory. These are relatively small but extremely fast, because they sit on the processing machine itself. The retention policy is strict: users must purge their own datasets at the end of each run
- The second one, called /gallia, is bigger and still extremely fast, because it is connected to the network through high-speed hardware. Users must request credentials from GIGA's IT to use it. The retention policy gives them 15 days before an automatic purge
See example

5.2 Secondary cluster

Hosted at GIGA, this cluster works like the primary one, but requires a separate credentials request to allow its use with users' private and shared storage spaces.
To do so, users must send a request to GIGA's IT stating their ULiège username and the platform/group they belong to.
See example

5.3 Inter Belgian French-speaking universities cluster

The Consortium des Équipements de Calcul Intensif, or CÉCI, is the French-speaking Belgian universities' means of sharing high-performance scientific computing facilities and resources, and of pooling know-how and expertise across the universities' clusters. Their system and software infrastructures are kept similar to GIGA's, to make things as easy as possible for users.

It allows GIGA members to use its infrastructure under some conditions:
- The user must request credentials on the CÉCI website
- CÉCI is solely for public use; no production is allowed on it. Users may run occasional jobs, but not automated continuous workflows
- GIGA users won't have direct access to their private and group storage; they will therefore have to manage the data flow between our storage and CÉCI's themselves
See example

6. Use cases

TBA

6.1 Backup Nobackup

TBA
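
Pending a full write-up, a minimal sketch of the intended pattern (the tool name, its flags and the paths are hypothetical): intermediate files go to a nobackup directory and only the final output lands in backed-up space.

mkdir -p ~/nobackup/tmp_run
my_tool --input data.fastq.gz --tmp-dir ~/nobackup/tmp_run --output result.bam
mv result.bam ~/Platform/GEN/MyLab/MyProject/   # final output: backed-up space
rm -r ~/nobackup/tmp_run                        # temporary data: never backed up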

6.2 Avoid duplicates

Because users think "this is easier/faster", they tend to simply copy input datasets to other destinations, be it their private home or their group's. This is a false economy: however fast, the copy still takes time, and it wastes storage.

The good practice is to have one spot centralizing the input datasets, with a readable directory structure, and to simply reference them as inputs when running the needed software.

From there, what goes into the project's directory is:
- The output of workflows
- Notes
- Plots

E.g.:
A user is in their home directory. From there, links lead to their group shares, where, among other things, the data generated by one of GIGA's platforms is stored.
Suppose that user wants to run a software called STAR (an RNA-Seq tool); the command would look like:

STAR --genomeDir Platform/GEN/References/ --readFilesCommand zcat \
     --readFilesIn Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R1.fastq.gz Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R2.fastq.gz \
     --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
     --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
     --runThreadN 20 --outFileNamePrefix "Platform/GEN/MyLab/MyProject/star_GM12878_rep1/"

To explain this example a bit more: the user always keeps the reference genomes (or whatever type of reference they work with) and the raw data generated by a GIGA platform in the same spots, so these become static values in their scripts, namely:
(myhome/)Platform/GEN/References/ and
(myhome/)Platform/GEN/MyLab/Share_GEN/

The only path the user then has to define is where the data coming OUT of the workflow goes, namely:
(myhome/)Platform/GEN/MyLab/MyProject/star_GM12878_rep1/

Working this way is cleaner, faster and easier for individuals and teams alike; it also makes sharing among coworkers straightforward AND, more importantly, saves storage space (hence more projects can be created, backups can be kept longer, etc.).

6.3 Right nomenclature

GIGA's computing infrastructure being "Linux-based", a few good practices apply to naming.
Nowadays it accepts more foreign characters, but still has issues with them.

Because of that, users are asked to use exclusively alphanumeric characters, underscores "_" and dashes "-".

So if one were to organize their directory structure like, for example:
/myhome/mygroupshare/Hervé/
/myhome/mygroupshare/ゲノム/
/myhome/mygroupshare/الجينوم/
/myhome/mygroupshare/some other things/

The problem is that GIGA's backup system, being quite sensitive, would run into errors or, even worse, silently skip those directories and their whole content.

The solution is to always use English characters and to replace spaces with underscores, like this:
/myhome/mygroupshare/Herve/
/myhome/mygroupshare/some_other_things/
/myhome/mygroupshare/someOtherThings/
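
As a one-line illustration, an existing directory can be brought in line with the convention (paths hypothetical):

mv "/myhome/mygroupshare/some other things" /myhome/mygroupshare/some_other_things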

6.4 Linux basics

TBA
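
Pending a full write-up, a hedged starter (the hostname is a placeholder; ask GIGA's IT for the real one):

ssh myuliegeid@cluster.example.org   # open a secured connection to a cluster
pwd                                  # print the current directory
ls -lh                               # list files with human-readable sizes
cd Platform/GEN/MyLab/               # move into a directory
less run.log                         # read a file page by page (q to quit)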

6.5 Slurm basics

TBA
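
Pending a full write-up, a minimal sketch of a Slurm batch script (job name, resources, file names and command are illustrative):

#!/bin/bash
#SBATCH --job-name=star_GM12878
#SBATCH --cpus-per-task=20
#SBATCH --mem=32G
#SBATCH --time=12:00:00

STAR --runThreadN 20 ...   # the actual command; never run it directly on the main frame

Submit it with "sbatch my_job.sh" and follow its progress with "squeue -u $USER".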

6.6 Primary cluster

TBA
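
Pending a full write-up, connecting boils down to (hypothetical hostname):

ssh myuliegeid@primary-cluster.example.org
ls ~   # the user home, with its links to the group shares, is the starting point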

6.7 Scratches

TBA
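
Pending a full write-up, a sketch of using the node-local scratch described in section 5.1.1 (the sub-path and tool are hypothetical):

TMPDIR=/local/$USER/my_run                       # fast scratch on the processing node
mkdir -p "$TMPDIR"
my_tool --tmp-dir "$TMPDIR" --output ~/Platform/GEN/MyLab/MyProject/out.bam
rm -rf "$TMPDIR"                                 # purge your datasets at the end of each run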

6.8 Secondary cluster

TBA

6.9 CECI cluster

TBA

7. Hardware specs

Technical explanations:
- http://www.ceci-hpc.be/clusters.html
- https://en.wikipedia.org/wiki/InfiniBand#Performance

7.1 Primary cluster

- Partitions: 11
- CPUs: Intel and AMD
- Nodes: 26
- Cores: 688
- Memory: 5.53 TB total (128~512Gb / node)
- Network: Gigabit Ethernet

7.2 Secondary cluster

- Partitions: 1
- CPUs: AMD
- Nodes: 17
- Cores: 389 (logical)
- Memory: 32~2585Gb / node
- Network: Gigabit Ethernet

7.3 CECI cluster

7.3.1 NIC4

- Partitions: 1
- CPUs: Intel
- Nodes: 128
- Cores: 2048 (physical)
- Memory: 64Gb / node
- Network: InfiniBand Quad Data Rate

7.3.2 Vega

- Partitions: 1
- CPUs: AMD
- Nodes: 43
- Cores: 2752 (physical)
- Memory: 256Gb / node
- Network: InfiniBand Quad Data Rate

7.3.3 Hercules

- Partitions: 1
- CPUs: Intel
- Nodes: 64
- Cores: 896 (physical)
- Memory: 36~128Gb / node
- Network: Gigabit Ethernet

7.3.4 Dragon1

- Partitions: 1
- CPUs: Intel
- Nodes: 26
- Cores: 416 (physical)
- Memory: 128Gb / node
- Network: Gigabit Ethernet

7.3.5 Lemaitre2

- Partitions: 1
- CPUs: Intel
- Nodes: 115
- Cores: 1380 (physical)
- Memory: 48Gb / node
- Network: InfiniBand Quad Data Rate

7.3.6 Hmem

- Partitions: 1
- CPUs: AMD
- Nodes: 17
- Cores: 816 (physical)
- Memory: 128~512Gb / node
- Network: InfiniBand Quad Data Rate

7.3.7 Zenobe

- Partitions: 1
- CPUs: Intel
- Nodes: 582
- Cores: 13968 (physical)
- Memory: 64~256Gb / node
- Network: InfiniBand 2x Quad Data Rate + 1x Fourteen Data Rate