ICARE computing resources for users

Overview

ICARE provides computational resources to registered and accredited ICARE users. Registered users can run their own codes in a Linux environment that is very similar to the ICARE production environment, with on-line access to the entire ICARE archive. This PaaS (Platform as a Service) offering is especially useful for users who run codes on long time-series data sets and cannot afford to download huge amounts of data to their own facility. It is also useful for maturing and testing codes that are intended to run in operational mode in the ICARE production environment. The service is suitable for both interactive use and massive batch processing exported to the back-end computing nodes of the cluster.

Registration

A specific registration is required to access ICARE computing resources. Because ICARE resources are limited, access is restricted to partners working with ICARE on collaborative projects. You must first register for ICARE data services (see here), then fill out this additional registration form to request an SSH account. You will be asked to provide additional information, including the framework of your request and an ICARE project referent.
If you only want to access ICARE data services (i.e. FTP or web access), please use the data access registration form.

Description of the cluster

The ICARE computing cluster is composed of one front-end server and 114 allocated cores spread over 5 back-end computing nodes (see table):
  • 1 front-end server (access.icare.univ-lille1.fr)
  • 5 computing nodes

Server | Cores allocated to cluster | Hyperthreading | Processor | RAM
Front-end (access.icare.univ-lille1.fr) | 26 | No | 2 x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz | 384 GB
Node 001 | 26 | No | 2 x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz | 384 GB
Node 002-005 | 22 physical cores, 44 logical cores | Yes | 2 x Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | 384 GB


The front-end server is the primary access point to the cluster. It is dedicated to interactive use only: no intensive processing is to be run on it. All intensive processing jobs must be run on the computing nodes and must be submitted through the SLURM job scheduler (see below).

Disk Space

  • Home Directory (40 TB total)
This space should be used for storing files you want to keep in the long term such as source codes, scripts, etc. The home directory is backed up nightly.
Note: home directories are shared by all nodes of the cluster, so any modification made to your home directory on one server is immediately visible on all the others.
  • Main Storage Space /work_users (40 TB total)
This is the main storage space for large amounts of data. This work space is backed up nightly.
  • Scratch Space /scratch (78 TB total)
The scratch filesystem is intended for temporary storage and should be considered volatile. Older files are subject to being automatically purged. No backup of any kind is performed for this work space.
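Because files on /scratch may be purged automatically, it can help to check which of your files are getting old before they disappear. The following is a minimal sketch, not an ICARE tool: the function name list_old_files is hypothetical, and the 30-day threshold is an arbitrary example, since the actual purge policy is not stated here.

```shell
#!/bin/bash
# List regular files under a directory that have not been modified for more
# than a given number of days. The 30-day default is an arbitrary example,
# NOT the actual ICARE purge policy.
list_old_files() {
    local dir="$1"
    local days="${2:-30}"
    find "$dir" -type f -mtime +"$days" -print
}

# Example use on the cluster (the path is an assumption based on the
# /scratch/$USER layout used in the job scripts further down):
#   list_old_files "/scratch/$USER" 30
```

Files reported by this function that you still need can then be copied to /work_users or to your home directory, which are both backed up.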

Logging in

To use the computer cluster, you have to log in to the front-end server access.icare.univ-lille1.fr using your ICARE username and password:
 
ssh -X username@access.icare.univ-lille1.fr

Cluster Software and Environment Modules

We are using the Environment Modules Package to provide a dynamic modification of a user’s environment.

The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles. Each modulefile contains the information needed to configure the shell for an application.

The main module commands are:
module avail                 # list all available modules you can load
module list                  # list your currently loaded modules
module load moduleName       # load moduleName into your environment
module unload moduleName     # unload moduleName from your environment

When you log in to the ICARE cluster, some modules are automatically loaded for your convenience, so your initial module environment is not empty.

  • Display default environment variables
To see the default environment that you get at login, issue the "module list" command.
[ops@access ~]$ module list 
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2                          3) rhel6/matlab/R2018b

  • Display all available software installed on the cluster
[ops@access ~]$  module avail
------------------------------------- /usr/local/modulefiles -------------------------------------------
rhel6/anaconda/2/5.3.1               rhel6/ferret/6.82                    rhel6/icare_env/1-00_with_PYTHON_2.6 rhel6/idl/8.2                        rhel6/matlab/R2012a
rhel6/anaconda/3/5.3.1               rhel6/ferret/6.9                     rhel6/icare_env/2-01_with_PYTHON_2.7 rhel6/idl/8.7.2                      rhel6/matlab/R2018b

  • Show what a module sets for your shell environment
module show rhel6/icare_env/1-00_with_PYTHON_2.6 
-------------------------------------------------------------------
/usr/local/modulefiles/rhel6/icare_env/1-00_with_PYTHON_2.6:
 
prepend-path	 PATH /usr/local/env64_rhel6_1-00/opt/hdf4/bin:/usr/local/env64_rhel6_1-00/opt/netcdf/bin:/usr/local/env64_rhel6_1-00/bin/swig:/usr/local/env64_rhel6_1-00/bin:/usr/local/env64_rhel6_1-00/opt/scilab/bin 
append-path	 PATH /usr/local/env64_rhel6_1-00/opt/scilab/bin:/usr/local/env64_rhel6_1-00/opt/gcc/bin 
setenv		 PYTHONPATH /usr/local/env64_rhel6_1-00/lib64/python2.6/site-packages/grib_api:/usr/local/env64_rhel6_1-00/lib64/python2.6/site-packages:/usr/local/env64_rhel6_1-00/lib/python2.6/site-packages/grib_api:/usr/local/env64_rhel6_1-00/lib/python2.6/site-packages 
prepend-path	 LD_LIBRARY_PATH /usr/local/env64_rhel6_1-00/opt/netcdf/lib64:/usr/local/env64_rhel6_1-00/lib64:/usr/local/env64_rhel6_1-00/opt/scilab/lib64/scilab:/usr/local/env64_rhel6_1-00/opt/netcdf/lib:/usr/local/env64_rhel6_1-00/lib:/usr/local/env64_rhel6_1-00/opt/instantclient_11_2:/usr/local/env64_rhel6_1-00/opt/scilab/lib/scilab 
setenv		 BUFR_TABLES /usr/local/env64_rhel6_1-00/lib/python2.6/site-packages/pybufr_ecmwf/ecmwf_bufrtables 
setenv		 FER_DIR /usr/local/env64_rhel6_1-00/opt/ferret 
setenv		 JAVA_HOME /usr/local/env64_rhel6_1-00/opt/jdk 
setenv		 FER_DSETS /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets 
setenv		 FER_WEB_BROWSER firefox 
setenv		 FER_DATA_THREDDS http://ferret.pmel.noaa.gov/geoide/geoIDECleanCatalog.xml /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets 
setenv		 FER_DATA /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/data /usr/local/env64_rhel6_1-00/opt/ferret/go /usr/local/env64_rhel6_1-00/opt/ferret/examples 
setenv		 FER_DESCR /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/descr 
setenv		 FER_GRIDS /usr/local/env64_rhel6_1-00/opt/ferret/fer_dsets/grids 
setenv		 FER_GO /usr/local/env64_rhel6_1-00/opt/ferret/go /usr/local/env64_rhel6_1-00/opt/ferret/examples /usr/local/env64_rhel6_1-00/opt/ferret/contrib 
setenv		 FER_EXTERNAL_FUNCTIONS /usr/local/env64_rhel6_1-00/opt/ferret/ext_func/libs 
setenv		 FER_PALETTE /usr/local/env64_rhel6_1-00/opt/ferret/ppl 
setenv		 SPECTRA /usr/local/env64_rhel6_1-00/opt/ferret/ppl 
setenv		 FER_FONTS /usr/local/env64_rhel6_1-00/opt/ferret/ppl/fonts 
setenv		 PLOTFONTS /usr/local/env64_rhel6_1-00/opt/ferret/ppl/fonts 
setenv		 FER_LIBS /usr/local/env64_rhel6_1-00/opt/ferret/lib 
setenv		 FER_DAT /usr/local/env64_rhel6_1-00/opt/ferret 
-------------------------------------------------------------------
  • Get help information about a module
module help rhel6/anaconda/3/5.3.1
 
----------- Module Specific Help for 'rhel6/anaconda/3/5.3.1' --------------------
 
This modulefile defines all the pathes and variables
needed to use the ICARE environment anaconda3-5.3.1
.............................................
  • Loading/ unloading modules
Modules can be loaded and unloaded dynamically.
[ops@access ~]$  module list 
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2                          3) rhel6/matlab/R2018b
 
[ops@access ~]$ which matlab
/usr/local/modules/rhel6/matlab/R2018b/bin/matlab
 
[ops@access ~]$ module unload rhel6/matlab/R2018b
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2
 
[ops@access ~]$ module load rhel6/matlab/R2012a
 
[ops@access ~]$ module list
Currently Loaded Modulefiles:
  1) rhel6/icare_env/1-00_with_PYTHON_2.6   2) rhel6/idl/8.2                          3) rhel6/matlab/R2012a
 
[ops@access ~]$ which matlab
/usr/local/modules/rhel6/matlab/R2012a/bin/matlab
  • Unload ALL software modules
The module purge command removes all currently loaded modules. This is particularly useful if you have to run incompatible software (e.g. Python 2.x and Python 3.x). To remove a single module, use the module unload command instead.
[ops@access ~]$ module purge
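
The purge and load steps can be combined so that a session or a job script always starts from a known environment. The sketch below is one possible way to do it; the function name load_clean_env is hypothetical, and it only relies on the module purge and module load commands described above.

```shell
#!/bin/bash
# Start from an empty module environment, then load exactly the modules given
# as arguments, stopping at the first one that fails to load.
load_clean_env() {
    module purge || return 1
    local m
    for m in "$@"; do
        module load "$m" || { echo "cannot load module: $m" >&2; return 1; }
    done
}

# Example: switch to the Anaconda Python 3 module listed by "module avail".
#   load_clean_env rhel6/anaconda/3/5.3.1
#   module list
```

Loading only what a job needs avoids conflicts between the default modules and the ones you request explicitly.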

Running your jobs

No intensive processing is to be run on the front-end node. Processing jobs must be submitted through the SLURM job scheduler to run on the computing nodes. SLURM (Simple Linux Utility for Resource Management) is a workload manager and job scheduling system for Linux clusters.

In the current configuration, all the computing nodes belong to a single partition named "COMPUTE" (i.e. all jobs end up in the same queue). The maximum RAM allowed is 4 GB per job, and the maximum execution time is 24 hours by default (i.e. jobs are automatically killed when this limit is reached). Use the --time option to change the time limit.

The job priority is automatically adjusted based on the resources requested by the user when scheduling the job. The lower the requested resources, the higher the priority.
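
Since priority depends on the resources requested, it can pay off to ask only for the run time a job really needs. As a minimal sketch, assuming you know the expected run time in minutes, the hypothetical helper slurm_time below formats it as days-hours:minutes:seconds, which is one of the formats accepted by the --time option.

```shell
#!/bin/bash
# Convert a duration in minutes into the "d-hh:mm:ss" form accepted by --time.
# slurm_time is an illustrative helper, not a SLURM command.
slurm_time() {
    local total="$1"
    printf '%d-%02d:%02d:00\n' \
        $((total / 1440)) $((total % 1440 / 60)) $((total % 60))
}

# Example: request 90 minutes for a job.
#   sbatch --time="$(slurm_time 90)" submit.sh
```

Whether a limit longer than the 24-hour default is accepted depends on the partition configuration, which is not described here.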

SLURM commands

Jobs can be submitted to the scheduler using sbatch or srun:
  • sbatch: to submit a job to the queue

The job is submitted via the sbatch command. SLURM then assigns a number to the job and places it in the queue. It will execute when the resources are available.
ops@access:~ $ sbatch submit.sh
Submitted batch job 17

Example (submit.sh) for bash users

#!/bin/bash
 
#===============================================================================
# Options SBATCH :
#SBATCH --job-name=TestJob    # Defines a name for the batch job
#SBATCH --time=10:00           # Time limit for the job (format = m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE   # Specifies the file containing the stdout 
#SBATCH -e ERROR_FILE    # Specifies the file containing the stderr
#SBATCH --mem=2000         # Memory limit per compute node for the  job
#SBATCH --partition=COMPUTE   # Partition is a queue for jobs. (Default is COMPUTE)
#SBATCH --mail-type=ALL          # When email is sent to user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr  # User's email address
 
 
### Setting the TMPDIR environment variable, specify a directory that is accessible to the user ID 
export TMPDIR=/scratch/$USER/temp
mkdir -p $TMPDIR
 
###Purge any previous modules
module purge
 
###Load the application
module load  rhel6/anaconda/3/5.3.1   #load module anaconda/Python 3.6.7
 
### Run program
./executable_name
 
#===============================================================================

Example (submit.sh) for tcsh users

#!/bin/tcsh
 
#===============================================================================
# Options SBATCH :
#SBATCH --job-name=TestJob    # Defines a name for the batch job
#SBATCH --time=10:00           # Time limit for the job (format = m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE   # Specifies the file containing the stdout 
#SBATCH -e ERROR_FILE    # Specifies the file containing the stderr
#SBATCH --mem=2000         # Memory limit per compute node for the  job
#SBATCH --partition=COMPUTE   # Partition is a queue for jobs. (Default is COMPUTE)
#SBATCH --mail-type=ALL          # When email is sent to user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr  # User's email address
 
### Setting the TMPDIR environment variable, specify a directory that is accessible to the user ID 
setenv TMPDIR /scratch/$USER/temp
mkdir -p $TMPDIR
 
###Purge any previous modules
module purge
 
###Load the application
module load  rhel6/anaconda/3/5.3.1   #load module anaconda/Python 3.6.7
 
### Run program
./executable_name
 
#===============================================================================
  • srun: to submit a job for interactive execution (as you would execute any command line), i.e. you lose the prompt until the execution is complete.
    Example of a run in the partition COMPUTE for 30 minutes:
ops@access:~ $ srun --partition=COMPUTE --time=30:00 job.sh
  • squeue: to view information about jobs
Usage:
ops@access:~ $ squeue
ops@access:~ $ squeue -u <myusername>

  • scancel: to remove a job from the queue, or cancel it if it is running
ops@access:~ $ scancel <jobid>
ops@access:~ $ scancel -u <myusername> --state=pending  (cancels all pending jobs by <myusername>)
ops@access:~ $ scancel -u <myusername> --state=running  (cancels all running jobs by <myusername>)

  • sinfo: provides information about nodes and partitions
ops@access:~ $ sinfo -N -l
NODELIST NODES PARTITION  STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node001 1 COMPUTE*    idle 26  2:14:1 386225    0 1000  (null) none
node002 1 COMPUTE*    idle 22  2:24:1 386225    0 1000  (null) none
node003 1 COMPUTE*    idle 22  2:24:1 386225    0 1000  (null) none
node004 1 COMPUTE*    idle 22  2:24:1 386225    0 1000  (null) none
node005 1 COMPUTE*    idle 22  2:24:1 386225    0 1000  (null) none

  • scontrol: to see the configuration and state of a job
ops@access:~ $ scontrol show job <jobid>
  • sview: a graphical user interface to view the state of jobs, partitions and nodes
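
The commands above can be chained in a small wrapper that submits a job, captures its job ID, and waits until the job has left the queue. This is only a sketch under stated assumptions: the function names parse_job_id, submit_job and wait_for_job are hypothetical, the parsing relies on the "Submitted batch job N" line shown earlier, and the 30-second polling interval is arbitrary.

```shell
#!/bin/bash
# Extract the numeric job ID from the line sbatch prints on success,
# e.g. "Submitted batch job 17" -> "17".
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Submit a script with sbatch and print the job ID; fail if sbatch fails.
submit_job() {
    local out
    out=$(sbatch "$@") || return 1
    parse_job_id "$out"
}

# Poll squeue until the job no longer appears in the queue.
# "squeue -h -j ID" prints one line per matching job and no header,
# so empty output means the job is neither pending nor running.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-30}"
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example use on the front-end server:
#   jobid=$(submit_job submit.sh)
#   wait_for_job "$jobid"
#   echo "job $jobid has left the queue"
```

The polling loop itself is lightweight, so it is acceptable to run it on the front-end server while the actual processing runs on the computing nodes.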

The following table translates some of the more commonly used options for qsub to their sbatch equivalents:

qsub to sbatch translation

To specify the | qsub option | sbatch option | Comments
Queue/partition | -q QUEUENAME | -p QUEUENAME | Torque "queues" are called "partitions" in SLURM. Note: the partition/queue structure has been simplified, see below.
Number of nodes/cores requested | -l nodes=NUMBERCORES | -n NUMBERCORES | See below
Number of nodes/cores requested | -l nodes=NUMBERNODES:CORESPERNODE | -N NUMBERNODES -n NUMBERCORES |
Wallclock limit | -l walltime=TIMELIMIT | -t TIMELIMIT | TIMELIMIT should have the form HOURS:MINUTES:SECONDS. SLURM supports some other time formats as well.
Memory requirements | -l mem=MEMORYmb | --mem=MEMORY | Torque/Maui: this is the total memory used by the job. SLURM: this is the memory per node.
Memory requirements | -l pmem=MEMORYmb | --mem-per-cpu=MEMORY | This is per CPU/core. MEMORY in MB.
Stdout file | -o FILENAME | -o FILENAME | This combines stdout/stderr on SLURM if -e is not given as well
Stderr file | -e FILENAME | -e FILENAME | This combines stderr/stdout on SLURM if -o is not given as well
Combining stdout/stderr | -j oe | -o OUTFILE and no -e option | stdout and stderr merged to stdout/OUTFILE
Combining stdout/stderr | -j eo | -e ERRFILE and no -o option | stdout and stderr merged to stderr/ERRFILE
Email address | -M EMAILADDR | --mail-user=EMAILADDR |
Email options | -mb | --mail-type=BEGIN | Send email when job starts
Email options | -me | --mail-type=END | Send email when job ends
Email options | -mbe | --mail-type=BEGIN --mail-type=END | Send email when job starts and ends
Job name | -N NAME | --job-name=NAME |
Working directory | -d DIR | --workdir=DIR |
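
To illustrate the table, the following sketch rewrites a few "#PBS" directives of an existing Torque job script into their "#SBATCH" equivalents. It is a partial, illustrative converter, not an official migration tool: the function name pbs_to_sbatch is hypothetical, and only the handful of options listed in the code are handled. Every other line is passed through unchanged and must be reviewed by hand.

```shell
#!/bin/bash
# Rewrite a few "#PBS" directives into "#SBATCH" directives, following the
# qsub-to-sbatch table: -q -> -p, walltime -> -t, pmem -> --mem-per-cpu,
# -N -> --job-name, -M -> --mail-user. Reads stdin, writes stdout.
pbs_to_sbatch() {
    sed -E \
        -e 's/^#PBS[[:space:]]+-q[[:space:]]+/#SBATCH -p /' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+walltime=/#SBATCH -t /' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+pmem=([0-9]+)mb/#SBATCH --mem-per-cpu=\1/' \
        -e 's/^#PBS[[:space:]]+-N[[:space:]]+/#SBATCH --job-name=/' \
        -e 's/^#PBS[[:space:]]+-M[[:space:]]+/#SBATCH --mail-user=/'
}

# Example (file names are placeholders):
#   pbs_to_sbatch < old_torque_job.sh > submit.sh
```

Note that -l mem is deliberately left out: as the table points out, Torque counts total memory for the job while SLURM counts memory per node, so that value cannot be converted mechanically.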

See also

Documentation for SLURM and its commands is available online:
http://slurm.schedmd.com
http://slurm.schedmd.com/man_index.html