Difference between revisions of "Policy"

m (Laurent.fournier moved page Data policy to Policy)
Line 1: Line 1:
= 1. Introduction =
+
= Introduction =
 
 
 
The GIGA provides data storage spaces and high-performance computers to its members.<br />
 
Upon joining the GIGA, members will be assigned access to a group data storage (group share) for working (storing, updating, deleting) on data and files that need to be shared and accessible to everyone in their respective group/platform.<br />
 
Line 6: Line 5:
 
Members of each specific share are responsible for managing their data according to the rules defined in this document.
 
  
= 2. Purpose =
+
= Purpose =
 
 
 
The purpose of this policy is to inform the community about the available data storage and processing infrastructure, and to define acceptable usage of these resources.<br />
 
Effective management and use of individual, group and “outside” data storage will enable system administrators to manage GIGA’s computing resources more efficiently.
  
= 3. Storage =
+
= Storage =
 
 
 
''Simple representation of our storage structure here''
 
  
== 3.1 Personal computer ==
+
== Personal computer ==
 
+
Management and maintenance of personal computers at the GIGA is the role of a specific IT service of ULiège called UDIMED-GIGA.<br /><br />
Management and maintenance of the computer park of the GIGA is the role of a specific computing unit of ULiège called UDIMED-GIGA.
 
  
For a new GIGA member, there are two possible scenarios : they use their own personal computer (PC) (whether Windows based PC or Mac) or are provided with one (ULiège-owned) by their laboratory.
+
For new GIGA members, there are two possible scenarios: they either use their own personal computer or are provided with one (ULiège-owned) by their laboratory. <br />
 +
Supported systems include Windows PCs, Apple Macs and some GNU/Linux distributions.<br /><br />
  
To buy a hardware and computer peripherals, the UDIMED-GIGA helps and advises the GIGA member to build the purchase order. UDI is in contact with the official suppliers of the University of Liège to provide you with a price offer. Upon the reciept of the computer, UDI can prepare the material directly in their office and provides you with a ready-to-use hardware with the requested software. The follow-up can then be more effective in the event of breakdown or for a technical intervention.
+
For any new system, IT specialists advise GIGA members and help select the most appropriate computer or device for their needs.<br />
 +
They select the supplier (among official ULiège suppliers), take care of the purchase order, receive the order, prepare the hardware and install the requested software to provide a ready-to-use system.<br />
 +
Knowing the system, UDIMED-GIGA can provide a more efficient technical follow up. <br />
 +
Moreover, hardware listed by UDIMED-GIGA is insured against theft, fire and water damage.<br /><br />
  
In the case of a member using his own personnal computer, it must be configured before being connected to the GIGA network.
+
Any new computer (property of the GIGA member or the ULiège) must be properly configured before joining the GIGA IT network.<br /><br />
  
The configuration of the computer (own PC or ULiège-owned) is including : - Definition of credentials and configuration of your access to the Internet - Mounting the shared storage - Installing some non-scientific tools (DoX, anti-virus, office softwares) - Installing some scientific tools - Setting up the printers
+
This includes:
 +
* getting a personal login, password and credentials
 +
* configuring Intranet and Internet access
 +
* configuring shared resources (storage, printers...)
 +
* installing some non-scientific tools (DoX, anti-virus, office software)
 +
* installing some scientific tools<br /><br />
  
== 3.2 Mass Storage ==
+
More information (in French) at http://udimed.ulg.ac.be <br />
  
The GIGA provides the members with a professional infrastructure (called &quot;mass storage&quot;) which includes a very large disk capacity to store big amounts of data, as well as its equivalent in tapes for the backuping (tape library). The mass storage is organized to enable the sharing of data using specific group working spaces.
+
== Mass Storage ==
 +
The GIGA provides its members with a professional infrastructure (called &quot;mass storage&quot;) which includes a very large disk capacity to store large amounts of data, as well as an equivalent capacity on tapes for backups (tape library). The mass storage is organized to enable data sharing through group-specific working spaces.<br /><br />
  
The disk space is organized as a clustered storage which is an addition of many hard drives, with on top of it several layers called &quot;file system&quot; split in fileset which allows GIGA users/groups to work in compartmented spaces and so makes it easier to define quotas, number of inodes, specific rules, as well as bringing better support to all of them.
+
The disk space is organized as clustered storage, an aggregation of many hard drives. On top of it sit several layers called &quot;file systems&quot;, each split into filesets, which allow GIGA users and groups to work in compartmentalized spaces. This makes it easier to define quotas, inode limits and specific rules, and to provide better support to all of them.<br /><br />
  
On the other hand, a tape library is a storage device which contains one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method for loading them (a robot).
+
On the other hand, a tape library is a storage device which contains one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method for loading them (a robot).<br /><br />
  
This whole infrastructure being &quot;Linux-based&quot;, it implies certain constraints : 1. A strict naming sense for both directories and files : - No spaces (use underscores &quot;_&quot; instead) - No no-English characters (accents, foreign symbols) - Are allowed : '''az AZ 09 _''' and '''-''' 2. The use of the '''backup''' and '''nobackup''' directories - described further in this document - has to respect carefully the following rule XXXXXXXXX ??? (''that? or force them to ask IT service first ?''). ''[[#61-backup-nobackup|See example]]'' 3. Each member must respect his home space and group's quotas. Users generating numerous and big datasets must strictly avoid to duplicate their files. ''[[#62-avoid-duplicates|See example]]''
+
This whole infrastructure being &quot;Linux-based&quot;, it implies certain constraints:
 +
# A strict naming convention for both directories and files:
 +
#* No spaces (use underscores &quot;_&quot; instead)
 +
#* No non-English characters (accents, foreign symbols)
 +
#* Allowed characters: '''a-z A-Z 0-9 _''' and '''-'''
 +
# The use of the '''backup''' and '''nobackup''' directories - described further in this document - has to respect carefully the following rule XXXXXXXXX ??? (''that? or force them to ask IT service first ?''). ''[[#61-backup-nobackup|See example]]''  
 +
# Each member must respect the quotas of their home space and of their group. Users generating numerous and big datasets must strictly avoid duplicating their files. ''[[#62-avoid-duplicates|See example]]''
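
The naming constraint above can be checked before anything is copied to the mass storage. The following is a minimal sketch, not an official GIGA tool: the function name <code>is_valid_name</code> is hypothetical, and it also accepts a dot, on the assumption that file extensions such as <code>.gz</code> are acceptable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) when a single file or
# directory name contains only a-z, A-Z, 0-9, "_", "-" and "."
# (accepting the dot is an assumption, not a rule stated in the policy).
is_valid_name() {
    local name="$1" leftover
    # Delete every allowed character; anything left over is forbidden.
    # LC_ALL=C makes tr work on bytes, so accented letters are not
    # mistaken for plain ASCII letters.
    leftover=$(printf '%s' "$name" | LC_ALL=C tr -d 'A-Za-z0-9._-')
    [ -n "$name" ] && [ -z "$leftover" ]
}
```

It can be used in a condition, for example <code>is_valid_name "my results" || echo "please rename"</code>.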
  
=== 3.2.1 User home ===
+
=== User home ===
 
+
Each GIGA member is given a personal space of '''100Gb''' on the mass storage infrastructure, called &quot;User Home&quot;, which is backed up by default.<br /><br />
Each GIGA member is given a personal space of '''100Gb''' on the mass storage infrastructure called &quot;User Home&quot; that is by default backuped.
 
  
 
The User Home is organized into sub-directories:
* One called '''nobackup''' which, as its name indicates, is not backed up
* The others are created when the user is assigned to specific projects within their group/lab
  
=== 3.2.2 Group home ===
+
=== Group home ===
 
 
 
Within a thematic unit (called '''URT''' or '''Research''') or a platform (called '''PTF''' or '''Plateforms'''), each lab will have its own &quot;Group Home&quot; divided into sub-directories dedicated to different projects. The head of the lab manages the membership of its Group Home.<br />
 
Each GIGA member is by definition part of a group or a lab and works on one or multiple projects. Upon assignment to a project, the member is granted access to a group space shared with the other members assigned to the same project.<br />
 
Data generated on the GIGA-Platforms or downloaded from an external source must be stored in a &quot;Share&quot; directory of '''2Tb''' present in every project-specific sub-directory.<br />
The quotas, if justified, can be expanded. The Group Homes are by default backuped every night.
+
Quotas can be expanded if justified. Group Homes are backed up every night by default.<br /><br />
  
 
A &quot;Group Home&quot; is composed of project-specific sub-directories that contain:
* One directory called '''nobackup''' which, as its name indicates, is not backed up
* In the case of thematic units, two others called '''Share_IMG''' and '''Share_GEN'''. These are special spaces where data generated for the project by a platform is stored, '''IMG''' being imaging data and '''GEN''' being NGS sequencing data
  
== 3.3 DoX ==
+
== DoX ==
 
 
 
For other data types (e.g. Word/Excel/PowerPoint files), UDIMED-GIGA provides GIGA members with a tool called '''DoX'''.<br />
 
DoX is a ULiège cloud storage system which, like its alternatives (''e.g. Google Drive, Dropbox''), synchronizes the user's data across their electronic devices (Windows, Mac and Linux computers, as well as iPhone, iPad and Android devices) and keeps a version history of it.<br />
The quota is of '''100Gb''' and its content is by default backuped every night.
+
The quota is '''100Gb''' and its content is backed up every night by default.<br /><br />
  
 
Cloud storage, as opposed to our physical clustered storage and tapes, is a model of data storage in which digital data is stored in logical pools while the physical storage spans multiple servers (and often locations).<br />
It is made for high availability through all electronic devices with web or mobile user identifications.
+
It is designed for high availability from any electronic device, with user identification through web or mobile clients.<br /><br />
  
This system provides : - An automated synchronization of the data - The storage of the different versions of the data - An internal editor - A way to share data with non-ULiege users (''Meaning without credentials'')
+
This system provides :  
 
+
* An automated synchronization of the data  
== 3.4 Sharing outside GIGA ==
+
* The storage of the different versions of the data  
 +
* An internal editor  
 +
* A way to share data with non-ULiege users (''Meaning without credentials'')
  
 +
== Sharing outside GIGA ==
 
''2nd DoX ? (TBD, in testing phase)''
 
  
== 3.5 GitLab ==
+
== GitLab ==
 
 
 
When working on scientific datasets, GIGA members might need to use, and possibly write, programming code.<br />
The GIGA provides members with a specialized environment called '''GitLab''' (''Need to talk about space quotas'') that allows collaborative work on programming code and/or non-formated text (''That do not includes Microsoft/Libre/Open Offices documents'') but also use a powerful versioning system. The content is by default backuped every night.
+
The GIGA provides members with a specialized environment called '''GitLab''' (''Need to talk about space quotas'') that allows collaborative work on programming code and/or unformatted text (''this does not include Microsoft/Libre/Open Office documents'') and also offers a powerful versioning system. The content is backed up every night by default.<br /><br />
  
This system provides : - The versioning of its content on an atomic level (''File by file, through to any single character'') - A user friendly interface through web access ('''easy access''') - The ability to work in '''structured teams''' with granular credentials control (''Up to the PI's and/or their own computing specialist(s)'') - The ability to create separated projects called '''repositories''': - Either private to the user or its groups, or public to, as an example, store the annex of published papers - A more robust, powerful yet harder to handle projects through Linux terminals. ''Presentations and workshops can be given on demand''
+
This system provides:
* The versioning of its content on an atomic level (''File by file, down to any single character'')
* A user-friendly interface through web access ('''easy access''')
 +
* The ability to work in '''structured teams''' with granular credentials control (''Up to the PI's and/or their own computing specialist(s)'')  
 +
* The ability to create separated projects called '''repositories''':  
 +
** Either private to the user or their groups, or public, for example to store the annexes of published papers
 +
* A more robust and powerful, yet harder to handle, way of working on projects through Linux terminals. ''Presentations and workshops can be given on demand''
  
In order to maintain a manageable environment, users are asked to : - Define groups and projects naming in a readable way. ''[[#63-right-nomenclature|See example]]'' - Keep their repositories well organized
+
In order to maintain a manageable environment, users are asked to :  
 
+
* Name groups and projects in a readable way. ''[[#63-right-nomenclature|See example]]''
= 4. Backup =
+
* Keep their repositories well organized
  
 +
= Backup =
 
All the data stored on the Mass Storage (except for the data stored in nobackup directories) is backed up every night on tapes in another location using a synchronizing system called &quot;TSM&quot;.
  
== 4.1 Tapes ==
+
== Tapes ==
 
 
 
The '''tape library''' is a robot with '''4 drives''' currently working on '''620 tapes'''.<br />
 
It has a retention policy of 28 days under average use of the spaces, which can shift to a shorter period if user activity on the file systems is high.<br />
Because this process has quite a heavy load, it is made to work during the night with a window of work between 00 and 07am.
+
Because this process generates quite a heavy load, it runs during the night, in a window between 00:00 and 07:00.<br /><br />
  
 
Since it can handle roughly 2Tb to 2.5Tb during that window, users are asked to:
* Avoid daily changes amounting to more than 2Tb
* Group as many small files together as they can (for example in zip/rar/tar archives)
* Ask GIGA's IT before moving more than 1Tb
* Create and use '''nobackup''' directories as much as they can, to keep fast, long-running or massive operations off the backed-up filesets. If they work with a tool that generates a large number or size of files, they can configure their script to keep temporary data in one of those directories and write only the final output to the backed-up spaces
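
As an illustration of grouping small files, the sketch below packs a whole directory into a single compressed tar archive, so that the nightly backup has one file to handle instead of many. It is only an example under stated assumptions: the function name <code>archive_dir</code> and the paths passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pack a directory full of small files into a single
# compressed archive.
# Usage: archive_dir <source_directory> <archive.tar.gz>
archive_dir() {
    local src="$1" archive="$2"
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    # -C switches to the parent directory first, so the archive stores
    # paths relative to it rather than absolute paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}
```

The original files are left in place; they can be removed once the archive has been checked, for example with <code>tar -tzf archive.tar.gz</code>.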
  
== 4.2 Offline Archiving ==
+
== Offline Archiving ==
 
 
 
Offline archiving is a '''long term''' solution for older datasets that still need to be kept. When the archive is created, the archived data is erased from the mass storage.<br />
 
An expiration date is still required; it can be 10 years or more if justified, but once the chosen period has passed, all data is automatically erased.<br />
Line 95: Line 110:
 
(''Create a rule to allow users to buy tapes IF they want more flexibility on those conditions ?'')
  
= 5. Data Processing =
+
= Data Processing =
 +
For high-throughput data processing (scientific calculation), GIGA provides its members with different types of calculation '''clusters''', each with its own pros and cons.<br /><br />
  
For high throughput data processing (scientific calculation), GIGA provides its members with different types of calculation '''clusters'''. Those coming each with their pros and cons.
+
A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.<br /><br />
  
A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.
+
The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.<br /><br />
  
The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.
+
To be allowed to use the clusters, users need to have basic knowledge of :  
 
+
* LINUX: Terminal navigation, SSH (secured connections), software handling (different from Mac and Windows). ''Presentations and workshops can be given on demand''. ''[[#64-linux-basics|See example]]''  
To be allowed to use the clusters, users need to have basic knowledge of : - LINUX: Terminal navigation, SSH (secured connections), software handling (different from Mac and Windows). ''Presentations and workshops can be given on demand''. ''[[#64-linux-basics|See example]]'' - SLURM: a user can NEVER process on the main frame and have to use '''Slurm''' to submit jobs to a cluster. ''[[#65-slurm-basics|See example]]''
+
* SLURM: a user must NEVER run processing on the main frame and has to use '''Slurm''' to submit jobs to a cluster. ''[[#65-slurm-basics|See example]]''<br /><br />
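
To give an idea of what submitting a job through Slurm involves, here is a minimal sketch that writes a small job script to a file. The job name, resource values and time limit are placeholders chosen for illustration, not GIGA defaults; partitions, accounts and limits specific to the GIGA clusters are not covered here.

```shell
#!/usr/bin/env bash
# Illustrative only: write a minimal Slurm job script to job.sh.
# All values below are placeholders, not settings prescribed by the GIGA.
cat > job.sh <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00
#SBATCH --output=example_%j.log

# The actual work goes here; this placeholder only prints the node name.
hostname
EOF

# Submit the script with:   sbatch job.sh
# Follow your jobs with:    squeue -u "$USER"
```

The <code>#SBATCH</code> lines are read by <code>sbatch</code> as options, and <code>%j</code> in the output file name is replaced by the job ID.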
  
 
The Primary and Secondary clusters also give users transparent access to their private and shared spaces.<br />
 
Whenever users connect to one of these clusters, they start in their home, from which they can reach their respective group shares.
  
== 5.1 Primary cluster ==
+
== Primary cluster ==
 
 
 
Also called &quot;'''genetic cluster'''&quot;, it is hosted and managed at SeGI (''Service Général d'Informatique de l'ULiège'').<br />
 
Each of its nodes is bought by PIs from specific platforms/research groups. Therefore, users may or may not have access to a given node through the system.<br />
Line 116: Line 131:
 
''[[#66-primary-cluster|See example]]''
 
  
=== 5.1.1 Scratch ===
+
=== Scratch ===
 +
Within the cluster, special temporary spaces are defined for the data that users manipulate while executing their workflows.<br /><br />
  
Within the cluster are defined special spaces, working as temporary ones, for the users' manipulated data (while executing their workflows).
+
They come in two different forms:
 
+
* Each node has a '''/local''' directory. These are relatively small but extremely fast because they sit on the processing machine. The retention policy strictly requires users to purge their own datasets at the end of each run
Those coming in two different forms : - Each node has a directory '''/local'''. They are relatively smalls but extremely fast because they are on the processing machine. The retainment policy is strict about the users purging them-self from their datasets at the end of each run - The second one, which is called '''/gallia''', is bigger, and still extremely fast because connected to the network through high-speed hardware. This one needs from the users' to enter a request to GIGA's IT to get credentials. The retainment policy give them 15 days before an automatic purge<br />
+
* The second one, called '''/gallia''', is bigger and still extremely fast because it is connected to the network through high-speed hardware. Users must submit a request to GIGA's IT to get credentials. The retention policy gives them 15 days before an automatic purge
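
The purge expected at the end of each run on '''/local''' can be built into the job itself. The sketch below is one possible way to do it, under stated assumptions: the function name <code>run_in_scratch</code> and the <code>SCRATCH_ROOT</code> variable are hypothetical, and the layout of '''/local''' (for example whether users may create directories directly under it) should be confirmed with GIGA's IT.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command inside a private scratch directory
# and purge that directory when the command ends, whether it succeeded
# or failed. SCRATCH_ROOT defaults to /local and can be overridden.
run_in_scratch() {
    local root="${SCRATCH_ROOT:-/local}"
    local workdir status
    workdir=$(mktemp -d "$root/${USER:-user}_XXXXXX") || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    # Purge the temporary data, then report the command's own exit status.
    rm -rf "$workdir"
    return "$status"
}
```

A call such as <code>run_in_scratch ./my_workflow.sh</code> (a hypothetical script) would then leave nothing behind on the node; final results must be written to a backed-up space before the command returns.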
 
''[[#67-scratches|See example]]''
 
  
== 5.2 Secondary cluster ==
+
== Secondary cluster ==
 
 
 
Hosted at GIGA, this cluster works like the primary one but requires a separate credentials request to allow its use with users' private and shared storage spaces.<br />
 
To do so, users must make a request to GIGA's IT and provide their ULiège username and the platform/group they belong to.<br />
 
''[[#68-secondary-cluster|See example]]''
 
  
== 5.3 Inter Belgian French-universities cluster ==
+
== Inter Belgian French-universities cluster ==
 
+
The ''Consortium des Équipements de Calcul Intensif'', or '''CÉCI''', has its own means of sharing high-performance scientific computing facilities and resources, and of pooling know-how and expertise among the universities' clusters. Its system and software infrastructure is similar to GIGA's, to make things as easy as possible for users.<br /><br />
The ''Consortium des Équipements de Calcul Intensif'', or '''CÉCI''' as a means of it's own to share high-performance scientific computing facilities and resources, and mutualize know-how and expertise among the universities' clusters. Concerning their system and software infrastructures, GIGA's and theirs are alike in order to facilitate as much as possible the users.
 
  
It allows GIGA members to use it's infrastructure given some conditions. Those being : - The user must make a request of credential on CÉCI website - CÉCI purpose is solely for public use, no production is allowed on it. Meaning they are allowed to do punctual runs but not automated continuous workflows - GIGA users won't have direct access to their private and group storages. Hence they will have to take care by themself of their data flow from our storage to CÉCI's<br />
+
It allows GIGA members to use its infrastructure under some conditions:
 +
* The user must request credentials on the CÉCI website
 +
* CÉCI is intended solely for public use; no production is allowed on it. This means users may do occasional runs but not automated continuous workflows
 +
* GIGA users won't have direct access to their private and group storage. Hence they will have to handle the transfer of their data from our storage to CÉCI's themselves
 
''[[#69-ceci-cluster|See example]]''
 
  
= 6. Use cases =
+
= Use cases =
 
+
== Backup Nobackup ==
 
TBA
 
  
== 6.1 Backup Nobackup ==
+
== Avoid duplicates ==
 
 
TBA
 
 
 
== 6.2 Avoid duplicates ==
 
 
 
 
Because users think &quot;this is easier/faster&quot;, they tend to simply copy '''input''' datasets to other destinations, such as their private home or their group's.<br />
That is not true at all, in fact despite being fast it still needs time to copy.
+
That is not true: even though copying is fast, it still takes time.<br /><br />
  
The good practice would be to have one spot centralizing input datasets with a good/readable structure ('''directories''') and simply calling them &quot;as inputs&quot; while using the needed software.
+
Good practice is to keep input datasets in one central location with a clear, readable structure ('''directories''') and simply reference them &quot;as inputs&quot; when running the needed software.<br /><br />
 
 
From there, what is to place in ''the project's directory'' would be : - The output from workflows - Some notes - Some plotting
 
  
 +
From there, what belongs in ''the project's directory'' is:
 +
* The output from workflows
 +
* Some notes
 +
* Some plotting
 +
<br />
 
E.g.:<br />
 
A user is in their home. From there, links lead to their groups, where, among other things, the data generated by one of GIGA's platforms is stored.<br />
That user want to use a software called STAR (RNA-Seq tool), that would look like :
+
That user wants to use a software tool called STAR (an RNA-Seq tool); that would look like:<br /><br />
  
''STAR --genomeDir'' '''Platform/GEN/References/''' ''--readFilesCommand zcat _<br />
+
''STAR --genomeDir'' '''Platform/GEN/References/''' ''--readFilesCommand zcat <br />
''--readFilesIn_ '''Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R1.fastq.gz Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R2.fastq.gz''' ''_<br />
+
''--readFilesIn'' '''Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R1.fastq.gz Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R2.fastq.gz'''<br />
''--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within _<br />
+
''--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within <br />
''--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM _<br />
+
''--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM <br />
''--runThreadN 20 --outFileNamePrefix_ '''&quot;Platform/GEN/MyLab/MyProject/star_GM12878_rep1/&quot;'''
+
''--runThreadN 20 --outFileNamePrefix'' '''&quot;Platform/GEN/MyLab/MyProject/star_GM12878_rep1/&quot;'''<br /><br />
  
 
To explain this example a bit more: the user would always keep their reference genomes (or whatever type of reference they work with) and the raw data generated by one of the GIGA platforms in the same place, so those paths become static values in their scripts. These are:<br />
 
''(myhome/)Platform/GEN/References/'' '''and'''<br />
 
''(myhome/)Platform/GEN/MyLab/Share_GEN/''
+
''(myhome/)Platform/GEN/MyLab/Share_GEN/''<br /><br />
  
 
The only thing they would therefore have to define is the location of the data coming '''OUT''' of this workflow, namely:<br />
''(myhome/)Platform/GEN/MyLab/MyProject/star_GM12878_rep1/''
+
''(myhome/)Platform/GEN/MyLab/MyProject/star_GM12878_rep1/''<br /><br />
  
 
Thanks to that way to work, would result a more clean/fast/easy way to work as an individual, as teams but also allows to share among coworkers '''AND''' more important, to save storage space (hence, more projects possible to create, more time to keep backups, etc).
 
Thanks to that way to work, would result a more clean/fast/easy way to work as an individual, as teams but also allows to share among coworkers '''AND''' more important, to save storage space (hence, more projects possible to create, more time to keep backups, etc).
  
== 6.3 Right nomenclature ==
+
== Right nomenclature ==
 
 
 
GIGA's computing infrastructure being &quot;Linux based&quot;, it implies a few good-practices when it comes to naming.<br />
 
GIGA's computing infrastructure being &quot;Linux based&quot;, it implies a few good-practices when it comes to naming.<br />
Nowaday, it accepts more forein characters but still has issues with those.
+
Nowaday, it accepts more forein characters but still has issues with those.<br /><br />
  
Because of that, is asked to the users to use exclusively alpha-numeric characters, underscores &quot;_&quot; and dashes &quot;-&quot;.
+
Because of that, is asked to the users to use exclusively alpha-numeric characters, underscores &quot;_&quot; and dashes &quot;-&quot;.<br /><br />
  
 
So in the case one would want to sort his directories structure like for example :<br />
 
So in the case one would want to sort his directories structure like for example :<br />
Line 183: Line 197:
 
''/myhome/mygroupshare/ゲノム/''<br />
 
''/myhome/mygroupshare/ゲノム/''<br />
 
''/myhome/mygroupshare/الجينوم/''<br />
 
''/myhome/mygroupshare/الجينوم/''<br />
''/myhome/mygroupshare/some other things/''
+
''/myhome/mygroupshare/some other things/''<br /><br />
  
The problem is that GIGA's backuping system being quite sensitive it would run into errors of even worst : just pass those directories and their whole content.
+
The problem is that GIGA's backuping system being quite sensitive it would run into errors of even worst : just pass those directories and their whole content.<br /><br />
  
 
The solution would be to '''always''' use english characters and replace spaces by underscores like that :<br />
 
The solution would be to '''always''' use english characters and replace spaces by underscores like that :<br />
Line 192: Line 206:
 
''/myhome/mygroupshare/someOtherThings/''
 
''/myhome/mygroupshare/someOtherThings/''
  
== 6.4 Linux basics ==
+
== Linux basics ==
 
 
 
TBA
 
TBA
  
== 6.5 Slurm basics ==
+
== Slurm basics ==
 
 
 
TBA
 
TBA
  
== 6.6 Primary cluster ==
+
== Primary cluster ==
 
 
 
TBA
 
TBA
  
== 6.7 Scratches ==
+
== Scratches ==
 
 
 
TBA
 
TBA
  
== 6.8 Secondary cluster ==
+
== Secondary cluster ==
 
 
 
TBA
 
TBA
  
== 6.9 CECI cluster ==
+
== CECI cluster ==
 
 
 
TBA
 
TBA
  
= 7. Hardware specs =
+
= Hardware specs =
 
+
Technical explanations. :  
Technical explanations. : - http://www.ceci-hpc.be/clusters.html - https://en.wikipedia.org/wiki/InfiniBand#Performance
+
- http://www.ceci-hpc.be/clusters.html  
 
+
- https://en.wikipedia.org/wiki/InfiniBand#Performance
== 7.1 Primary cluster ==
 
  
 +
== Primary cluster ==
 
11 partitions<br />
 
11 partitions<br />
 
Intel and AMD<br />
 
Intel and AMD<br />
Line 229: Line 237:
 
Gigabit ethernet
 
Gigabit ethernet
  
== 7.2 Secondary cluster ==
+
== Secondary cluster ==
 
 
 
1 partition<br />
 
1 partition<br />
 
AMD<br />
 
AMD<br />
Line 238: Line 245:
 
Gigabit ethernet
 
Gigabit ethernet
  
== 7.3 CECI cluster ==
+
== CECI cluster ==
 
+
=== NIC4 ===
=== 7.3.1 NIC4 ===
 
 
 
 
1 partition<br />
 
1 partition<br />
 
Intel<br />
 
Intel<br />
Line 249: Line 254:
 
InfiniBand Quad DataRate
 
InfiniBand Quad DataRate
  
=== 7.3.2 Vega ===
+
=== Vega ===
 
 
 
1 partition<br />
 
1 partition<br />
 
AMD<br />
 
AMD<br />
Line 258: Line 262:
 
InfiniBand Quad DataRate
 
InfiniBand Quad DataRate
  
=== 7.3.3 Hercules ===
+
=== Hercules ===
 
 
 
1 partition<br />
 
1 partition<br />
 
Intel<br />
 
Intel<br />
Line 267: Line 270:
 
Gigabit ethernet
 
Gigabit ethernet
  
=== 7.3.4 Dragon1 ===
+
=== Dragon1 ===
 
 
 
1 partition<br />
 
1 partition<br />
 
Intel<br />
 
Intel<br />
Line 276: Line 278:
 
Gigabit ethernet
 
Gigabit ethernet
  
=== 7.3.5 Lemaitre2 ===
+
=== Lemaitre2 ===
 
 
 
1 partition<br />
 
1 partition<br />
 
Intel<br />
 
Intel<br />
Line 285: Line 286:
 
InfiniBand Quad DataRate
 
InfiniBand Quad DataRate
  
=== 7.3.6 Hmem ===
+
=== Hmem ===
 
 
 
1 partition<br />
 
1 partition<br />
 
AMD<br />
 
AMD<br />
Line 294: Line 294:
 
InfiniBand Quad DataRate
 
InfiniBand Quad DataRate
  
=== 7.3.7 Zenobe ===
+
=== Zenobe ===
 
 
 
1 partition<br />
 
1 partition<br />
 
Intel<br />
 
Intel<br />

Revision as of 09:49, 27 April 2018

Introduction

The GIGA provides data storage spaces and high-performance computers to its members.
Upon joining the GIGA, members will be assigned access to a group data storage (group share) for working (storing, updating, deleting) on data and files that need to be shared and accessible to everyone in their respective group/platform.
Another data sharing system has been created to allow the GIGA members to work in a collaborative way similar to group shares but with “non-GIGA” clients.
Members of each specific share are responsible for managing their data according to the rules defined in this document.

Purpose

The purpose of this policy is to inform the community about the available data storage and processing infrastructure, and to define acceptable usage of these resources.
Effective management and use of individual, group and “outside” data storage will enable system administrators to manage GIGA’s computing resources more efficiently.

Storage

Simple representation of our storage structure here

Personal computer

Management and maintenance of personal computers at the GIGA is the role of a specific IT service of ULiège called UDIMED-GIGA.

For new GIGA members, there are two possible scenarios: they use their own personal computer, or they are provided with one (ULiège-owned) by their laboratory.
Supported systems include Windows PCs, Apple Macs and some GNU/Linux distributions.

For any new system, IT specialists advise GIGA members and help select the most appropriate computer or device for their needs.
They select the supplier (among official ULiège suppliers), take care of the purchase order, receive the order, prepare the hardware and install the requested software to provide a ready-to-use system.
Because it knows the system, UDIMED-GIGA can provide more efficient technical follow-up.
Moreover, hardware listed by UDIMED-GIGA is insured against theft, fire and water damage.

Any new computer (property of the GIGA member or the ULiège) must be properly configured before joining the GIGA IT network.

This includes:

  • getting a personal login, password and credentials
  • configuring Intranet and Internet access
  • configuring shared resources (storage, printers...)
  • installing some non-scientific tools (DoX, anti-virus, office software)
  • installing some scientific tools

More info (in French) at http://udimed.ulg.ac.be

Mass Storage

The GIGA provides its members with a professional infrastructure (called "mass storage") which includes a very large disk capacity to store big amounts of data, as well as its equivalent in tapes for backups (tape library). The mass storage is organized to enable the sharing of data using specific group working spaces.

The disk space is organized as clustered storage, which aggregates many hard drives. On top of it sit several layers called "file systems", which are split into filesets. These allow GIGA users and groups to work in compartmentalized spaces, which makes it easier to define quotas, inode limits and specific rules, and to provide better support to all of them.

On the other hand, a tape library is a storage device which contains one or more tape drives, a number of slots to hold tape cartridges, a bar-code reader to identify the cartridges and an automated method for loading them (a robot).

Because this whole infrastructure is "Linux-based", certain constraints apply:

  1. A strict naming convention for both directories and files:
    • No spaces (use underscores "_" instead)
    • No non-English characters (accents, foreign symbols)
    • Allowed characters: a-z, A-Z, 0-9, "_" and "-"
  2. The use of the backup and nobackup directories - described further in this document - has to respect carefully the following rule XXXXXXXXX ??? (that? or force them to ask IT service first ?). See example
  3. Each member must respect the quotas of their home space and of their group. Users generating numerous and big datasets must strictly avoid duplicating their files. See example
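
The naming rule in point 1 can be checked automatically before data is moved to the mass storage. The following is a minimal sketch, not an official GIGA tool: the function name `invalid_components` is hypothetical, and accepting the dot "." (needed for file extensions such as .fastq.gz) is an assumption that goes slightly beyond the characters listed above.

```python
import re

# Characters allowed by the policy: a-z, A-Z, 0-9, "_" and "-".
# ASSUMPTION: "." is also accepted so that file extensions remain valid.
ALLOWED_COMPONENT = re.compile(r"[A-Za-z0-9_.-]+")


def invalid_components(path):
    """Return the components of `path` that break the naming rule."""
    parts = [part for part in path.split("/") if part]
    return [part for part in parts if not ALLOWED_COMPONENT.fullmatch(part)]
```

An empty list means every directory and file name in the path is acceptable; otherwise the returned names are the ones to rename.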

User home

Each GIGA member is given a personal space of 100Gb on the mass storage infrastructure, called "User Home", which is backed up by default.

The User Home is organized into sub-directories:

  • One called nobackup which, as its name indicates, is not backed up
  • The others are generated when the user is assigned to specific projects within their group/lab

Group home

Within a thematic unit (called URT or Research) or a platform (called PTF or Plateforms), each lab has its own "Group Home", divided into sub-directories dedicated to different projects. The head of the lab manages the membership of the lab's Group Home.
Each GIGA member is by definition part of a group or a lab and works on one or more projects. Upon assignment to a project, the member is granted access to a group space shared with the other members assigned to the same project.
Data generated on the GIGA platforms or downloaded from an external source must be stored in a "Share" directory of 2Tb present in every project-specific sub-directory.
The quotas can be expanded if justified. The Group Homes are backed up every night by default.

A "Group Home" is composed of project-specific sub-directories that contain:

  • One directory called nobackup which, as its name indicates, is not backed up
  • In the case of thematic units, two others called Share_IMG and Share_GEN. These are special spaces where data generated for the project by a platform are stored: IMG holds imaging data and GEN holds NGS sequencing data

DoX

For other data types (e.g. Word/Excel/PowerPoint files), UDIMED-GIGA provides the GIGA members with a tool called DoX.
DoX is a ULiège cloud storage system which, like its alternatives (e.g. Google Drive, Dropbox), synchronizes the user's data across their electronic devices (Windows, Mac and Linux computers, but also iPhone, iPad and Android devices) and keeps a version history.
The quota is 100Gb and its content is backed up every night by default.

Cloud storage, in contrast to our physical clustered storage and tapes, is a model of data storage in which the digital data is stored in logical pools while the physical storage spans multiple servers (and often locations).
It is designed for high availability from any electronic device, through web or mobile user identification.

This system provides :

  • An automated synchronization of the data
  • The storage of the different versions of the data
  • An internal editor
  • A way to share data with non-ULiège users (that is, users without credentials)

Sharing outside GIGA

2nd DoX ? (TBD, in testing phase)

GitLab

When working on scientific datasets, GIGA members might need to use, and eventually write, programming code.
The GIGA provides members with a specialized environment called GitLab (Need to talk about space quotas) that allows collaborative work on programming code and/or unformatted text (this does not include Microsoft Office, LibreOffice or OpenOffice documents) and also offers a powerful versioning system. The content is backed up every night by default.

This system provides:

  • Versioning of its content at an atomic level (file by file, down to any single character)
  • A user-friendly web interface (easy access)
  • The ability to work in structured teams with granular credentials control (up to the PIs and/or their own computing specialist(s))
  • The ability to create separate projects called repositories:
    • Either private to the user or their groups, or public, for example to store the annexes of published papers
  • A more robust and powerful, yet harder to handle, way of working on projects through Linux terminals. Presentations and workshops can be given on demand

In order to maintain a manageable environment, users are asked to :

  • Name groups and projects in a readable way. See example
  • Keep their repositories well organized

Backup

All the data stored on the Mass Storage (except for the data stored in nobackup directories) are backed up to tapes every night, in another location, using a synchronizing system called "TSM".

Tapes

The tape library is a robot with 4 drives, currently working on 620 tapes.
It has a retention policy of 28 days for average use of the spaces, which can shift to a shorter period if users' activity on the file systems is high.
Because this process puts quite a heavy load on the system, it runs during the night, in a window between 00:00 and 07:00.

Since the system can handle 2Tb~2.5Tb during that window, users are asked to:

  • Avoid daily changes amounting to more than 2Tb
  • Group small files together as much as they can (for example in zip/rar/tar archives)
  • Ask GIGA's IT before moving more than 1Tb
  • Create and use the nobackup directories as much as they can, to keep fast-changing, long-running or massive work away from the backed-up filesets. If they work with a tool that generates many or large files, they can configure their scripts to write temporary data to one of those directories and keep only the final output in the backed-up spaces
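
Grouping many small files into one archive, as requested above, can be done with Python's standard tarfile module. This is a minimal sketch under stated assumptions: the function name `bundle_directory` and the paths used in the usage note are hypothetical, and the archive should be written outside the directory being packed so that it does not end up inside itself on a later run.

```python
import tarfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Pack every file under `source_dir` into one gzip-compressed tar archive.

    Returns the number of files written into the archive.
    """
    source = Path(source_dir)
    # List the files first, before the archive itself is created.
    files = sorted(p for p in source.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for item in files:
            # Store paths relative to source_dir so the archive can be
            # unpacked anywhere.
            archive.add(item, arcname=item.relative_to(source).as_posix())
    return len(files)
```

For example, `bundle_directory("nobackup/run_01_logs", "MyProject/run_01_logs.tar.gz")` would leave a single file for the nightly backup to process instead of thousands of small ones.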

Offline Archiving

Offline archiving is a long-term solution for older datasets that still need to be kept. When the archive is created, the corresponding data on the mass storage is erased.
An expiration date is still requested; it can be 10 years or more if justified, but once the chosen period has passed, all the data is automatically erased.
Like the nobackup directories, users can create archive directories containing the datasets that have to be backed up offline. (TBC)
(Create a rule to allow users to buy tapes IF they want more flexibility on those conditions ?)

Data Processing

For high-throughput data processing (scientific calculation), GIGA provides its members with different types of computing clusters, each with its pros and cons.

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system.

To be allowed to use the clusters, users need to have basic knowledge of :

  • LINUX: terminal navigation, SSH (secured connections), software handling (different from Mac and Windows). Presentations and workshops can be given on demand. See example
  • SLURM: a user must NEVER run processing on the main frame; they have to use Slurm to submit jobs to a cluster. See example
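
To illustrate what submitting a job through Slurm involves, the sketch below generates the text of a batch script. It is an illustration only: the function name `make_sbatch_script` and the default resource values are hypothetical, and no partition is set because partition names are specific to each cluster. The `#SBATCH` options used (`--job-name`, `--cpus-per-task`, `--mem`, `--time`, `--output`) are standard sbatch options.

```python
def make_sbatch_script(job_name, command, cpus=1, memory_gb=4, hours=1):
    """Return the text of a Slurm batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={memory_gb}G",
        f"#SBATCH --time={hours:02d}:00:00",
        # %j is replaced by Slurm with the job ID.
        f"#SBATCH --output={job_name}_%j.log",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file, for example job.sh, and submitted with `sbatch job.sh`, so that the work runs on a compute node rather than on the machine the user is logged in to.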

The Primary and Secondary clusters also give users transparent access to their private and shared spaces.
Whenever users connect to a cluster, they start in their home, from which they can reach their respective group shares.

Primary cluster

Also called "genetic cluster", it is hosted and managed at SeGI (Service Général d'Informatique de l'ULiège).
Each of its nodes is bought by PIs from specific platforms/research groups. Therefore, users may or may not have access to a given node through the system.
It is built in a collaborative way, meaning that nodes bought by anyone willing to contribute are added to the whole cluster.
It serves public as well as production uses. Users' private and shared storage spaces are linked to it seamlessly (everyone's starting point).
See example

Scratch

Within the cluster, special temporary spaces are defined for the data that users manipulate while executing their workflows.

They come in two different forms:

  • Each node has a directory /local. These are relatively small but extremely fast because they sit on the processing machine. The retention policy is strict: users must purge their own datasets at the end of each run
  • The second one, called /gallia, is bigger, and still extremely fast because it is connected to the network through high-speed hardware. To use it, users must request credentials from GIGA's IT. The retention policy gives them 15 days before an automatic purge
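
The purge-after-each-run rule for /local can be made automatic by letting the script create and delete its own scratch directory. The sketch below is one possible approach, not a GIGA-provided helper: `run_in_scratch` and its parameters are hypothetical names, and it assumes the job is allowed to create a sub-directory directly under the scratch root.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(work, results_dir, scratch_root="/local"):
    """Run `work(scratch_dir)` in a private scratch directory, then purge it.

    `work` must return the files (inside scratch_dir) that are worth keeping;
    they are copied to `results_dir` before the scratch directory is deleted.
    """
    results = Path(results_dir)
    results.mkdir(parents=True, exist_ok=True)
    kept = []
    # TemporaryDirectory removes the scratch directory on exit, even if
    # `work` raises an error.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        for produced in work(Path(scratch)):
            target = results / Path(produced).name
            shutil.copy2(produced, target)
            kept.append(target)
    return kept
```

Intermediate files never leave the fast local disk, and only the final outputs reach the (possibly backed-up) results directory.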

See example

Secondary cluster

Hosted at GIGA, this cluster works like the primary one but requires a separate request for credentials to allow its use with users' private and shared storage spaces.
To obtain them, users must make a request to GIGA's IT and provide their ULiège username and the platform/group they belong to.
See example

Inter Belgian French-universities cluster

The Consortium des Équipements de Calcul Intensif, or CÉCI, exists to share high-performance scientific computing facilities and resources, and to pool know-how and expertise among the universities' clusters. Its system and software infrastructure is similar to GIGA's, in order to make things as easy as possible for users.

GIGA members can use its infrastructure under some conditions:

  • The user must request credentials on the CÉCI website
  • CÉCI is solely for public use; no production is allowed on it. This means users may do occasional runs but not automated continuous workflows
  • GIGA users won't have direct access to their private and group storage. Hence they will have to handle the data flow from our storage to CÉCI's themselves

See example

Use cases

Backup Nobackup

TBA

Avoid duplicates

Because they think it is easier and faster, users tend to simply copy input datasets to other destinations, such as their private home or their group's.
This is a misconception: however fast the storage is, copying still takes time.
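
Copies that already exist can be located by comparing file contents. The sketch below is one possible approach and not a GIGA tool: the function names are hypothetical. It groups files by size first, so that only files sharing a size are read and hashed with SHA-256.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of files under `root` that have identical content."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        # Symbolic links are skipped: a link is a reference, not a copy.
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:  # only hash files that share a size
            for path in paths:
                by_digest[file_digest(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Each returned group lists paths holding the same data; all but one of them are candidates for removal.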

The good practice is to centralize input datasets in one spot with a clear, readable directory structure and simply refer to them as inputs when running the needed software.

From there, what belongs in the project's directory is:

  • The output from workflows
  • Some notes
  • Some plotting


E.g.:
A user is in their home. From there, links lead to their groups, where, among other things, the data generated by one of GIGA's platforms are stored.
That user wants to run a software called STAR (an RNA-Seq tool). The command would look like:

STAR --genomeDir Platform/GEN/References/ --readFilesCommand zcat \
--readFilesIn Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R1.fastq.gz Platform/GEN/MyLab/Share_GEN/GM12878.rep1.R2.fastq.gz \
--outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within \
--twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM \
--runThreadN 20 --outFileNamePrefix "Platform/GEN/MyLab/MyProject/star_GM12878_rep1/"

To explain this example a bit more: the user always has their reference genomes (or whatever type of reference they work with) and the raw data generated by one of the GIGA platforms in the same spot, so those become static values in their scripts. They are:
(myhome/)Platform/GEN/References/ and
(myhome/)Platform/GEN/MyLab/Share_GEN/

The only thing the user has to define is therefore where the data coming OUT of this workflow goes, namely:
(myhome/)Platform/GEN/MyLab/MyProject/star_GM12878_rep1/

This way of working is cleaner, faster and easier for individuals and for teams, makes sharing among coworkers possible AND, more importantly, saves storage space (hence more projects can be created, backups can be kept longer, etc.).
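
The idea of static input locations and a per-run output location can be sketched in a small script. This is an illustration under stated assumptions: the function name `star_command` is hypothetical, the paths and STAR options are copied from the example command above, and the rule for deriving the output directory name from the sample name is only inferred from that single example.

```python
from pathlib import PurePosixPath

# Fixed locations taken from the example: they never change between runs.
REFERENCES = PurePosixPath("Platform/GEN/References")
SHARE_GEN = PurePosixPath("Platform/GEN/MyLab/Share_GEN")
PROJECT = PurePosixPath("Platform/GEN/MyLab/MyProject")


def star_command(sample, threads=20):
    """Build the STAR argument list for one paired-end sample."""
    reads = [str(SHARE_GEN / f"{sample}.R{n}.fastq.gz") for n in (1, 2)]
    # Only the output location depends on the run.
    out_prefix = f"{PROJECT}/star_{sample.replace('.', '_')}/"
    return [
        "STAR",
        "--genomeDir", f"{REFERENCES}/",
        "--readFilesCommand", "zcat",
        "--readFilesIn", *reads,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--limitBAMsortRAM", "16000000000",
        "--outSAMunmapped", "Within",
        "--twopassMode", "Basic",
        "--outFilterMultimapNmax", "1",
        "--quantMode", "TranscriptomeSAM",
        "--runThreadN", str(threads),
        "--outFileNamePrefix", out_prefix,
    ]
```

Calling `star_command("GM12878.rep1")` reproduces the arguments of the command above; the list can be passed to `subprocess.run` from within a Slurm job, and the raw reads are read in place without being copied.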

Right nomenclature

Because GIGA's computing infrastructure is "Linux based", a few good practices apply when it comes to naming.
Nowadays, it accepts more foreign characters but still has issues with them.

Users are therefore asked to use exclusively alphanumeric characters, underscores "_" and dashes "-".

Suppose one wanted to organize a directory structure like this, for example:
/myhome/mygroupshare/Hervé/
/myhome/mygroupshare/ゲノム/
/myhome/mygroupshare/الجينوم/
/myhome/mygroupshare/some other things/

The problem is that GIGA's backup system is quite sensitive: it would run into errors or, even worse, simply skip those directories and their whole content.

The solution is to always use English characters and to replace spaces with underscores, like this:
/myhome/mygroupshare/Herve/
/myhome/mygroupshare/some_other_things/
/myhome/mygroupshare/someOtherThings/
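
Such renaming can be automated. The sketch below is a possible approach rather than an official tool: `sanitize_name` and its `fallback` parameter are hypothetical, and keeping the dot "." for file extensions is an assumption. It strips accents, replaces whitespace with underscores and drops every remaining character outside the allowed set.

```python
import re
import unicodedata


def sanitize_name(name, fallback="unnamed"):
    """Turn one file or directory name into a policy-compliant one."""
    # Split accented letters into base letter + combining mark, then drop
    # everything that is not plain ASCII (marks and non-Latin scripts).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of whitespace into a single underscore.
    underscored = re.sub(r"\s+", "_", ascii_only.strip())
    # ASSUMPTION: "." is kept so that file extensions survive.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "", underscored)
    return cleaned or fallback
```

Names written entirely in a non-Latin script have no ASCII equivalent under this scheme and come back as the fallback value, so they still need a meaningful English name chosen by hand.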

Linux basics

TBA

Slurm basics

TBA

Primary cluster

TBA

Scratches

TBA

Secondary cluster

TBA

CECI cluster

TBA

Hardware specs

Technical explanations:

  • http://www.ceci-hpc.be/clusters.html
  • https://en.wikipedia.org/wiki/InfiniBand#Performance

Primary cluster

11 partitions
Intel and AMD
26 nodes
688 cores
5.53 TB total. 128~512Gb / node
Gigabit ethernet

Secondary cluster

1 partition
AMD
17 nodes
389 (logical) cores
32~2585Gb of memory / node
Gigabit ethernet

CECI cluster

NIC4

1 partition
Intel
128 nodes
2048 (physical) CPU
64Gb of memory / node
InfiniBand Quad DataRate

Vega

1 partition
AMD
43 nodes
2752 (physical) CPU
256Gb of memory / node
InfiniBand Quad DataRate

Hercules

1 partition
Intel
64 nodes
896 (physical) CPU
36~128Gb of memory / node
Gigabit ethernet

Dragon1

1 partition
Intel
26 nodes
416 (physical) CPU
128Gb of memory / node
Gigabit ethernet

Lemaitre2

1 partition
Intel
115 nodes
1380 (physical) CPU
48Gb of memory / node
InfiniBand Quad DataRate

Hmem

1 partition
AMD
17 nodes
816 (physical) CPU
128~512Gb of memory / node
InfiniBand Quad DataRate

Zenobe

1 partition
Intel
582 nodes
13968 (physical) CPU
64~256Gb of memory / node
InfiniBand 2x Quad DataRate + 1x Fourteen DataRate