Batchfarm

New cluster (kronos/nyx)

The new batchfarm runs on SLURM (SL) (SlurmUsage). Hades users
have to be added to the account hades to be able to run jobs
on the farm. The list of hades users is maintained by J.Markert@gsi.de.

Some rules to work with the new cluster (KronosCluster):

  • Files are written to the /lustre/nyx/hades filesystem.
  • SL does not support filesystems other than /lustre/nyx. Batch
    scripts can use neither the user's home directory nor /misc/hadessoftware.
  • The hadessoftware is distributed to the batchfarm via
    /cvmfs/hades.gsi.de/install/ .....
    The same path is used by hades jessie64 desktop machines.
  • The batch jobs can be submitted to the farm from the kronos.hpc.gsi.de
    cluster. This machine provides our software, the user's home directory and
    a filesystem mount of /lustre/nyx. You can compile and test (run) your
    programs here.
  • A set of example batch scripts for Pluto, UrQmd, HGeant, DSTs and user
    analysis can be retrieved via
    svn checkout https://subversion.gsi.de/hades/hydra2/trunk/scripts/batch/GE
    The folders contain sendScript.sh+jobScript.sh (GE) and
    sendScript_SL.sh+jobScript_SL.sh (SL).
    The general concept is to work with file lists as input to the sendScript,
    which takes care of syncing from the user's home directory to the submission
    directory on /lustre/nyx. The files in the list are split automatically into
    job arrays to minimize the load on the scheduler. The sendScript finally
    calls the sbatch command of SLURM to submit the job. The jobScript is the
    part which runs on the batch nodes.
  • Disk usage on nyx can be monitored here
  • Kronos load can be monitored here
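The file-list/job-array concept above can be sketched roughly as follows. This is a hypothetical minimal sendScript, not the actual hydra2 sendScript_SL.sh; the chunk size, list names and the echoed sbatch call are illustrative:

```shell
#!/bin/bash
# Sketch of the sendScript idea (hypothetical names, not the hydra2 script):
# take a file list, derive the number of array tasks from a chunk size, and
# submit ONE job array via sbatch instead of one job per file, to keep the
# load on the scheduler low.

send_jobs() {
    local filelist=$1 chunksize=$2
    local nfiles ntasks
    nfiles=$(wc -l < "$filelist")
    ntasks=$(( (nfiles + chunksize - 1) / chunksize ))   # ceiling division
    # 'echo' only shows the command here; drop it to actually submit
    echo sbatch --array=1-"$ntasks" jobScript_SL.sh "$filelist" "$chunksize"
}

# demo: 250 input files in chunks of 100 -> one array job with 3 tasks
printf 'file_%03d.root\n' $(seq 1 250) > /tmp/filelist_demo.txt
send_jobs /tmp/filelist_demo.txt 100
```

The real sendScript additionally syncs the inputs from the home directory to the submission directory on /lustre/nyx before submitting.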

 

SLURM tips:

The most relevant commands to work with SL:

  • sbatch   : submits a batch script to SLURM.
  • squeue   : views job and job step information for jobs managed by SLURM.
  • scancel  : signals or cancels jobs, job arrays or job steps.
  • sinfo    : views partition and node information for a system running SLURM.
  • sreport  : generates reports of job usage and cluster utilization from the
               SLURM database.
  • scontrol : views or modifies the Slurm configuration, including job, job step,
               node, partition, reservation, and overall system configuration.
     
 Examples:

squeue -u <user>             : show all jobs of a user
squeue -t R                  : show jobs in a certain state (PENDING (PD),
                               RUNNING (R), SUSPENDED (S), COMPLETING (CG),
                               COMPLETED (CD), CONFIGURING (CF),
                               CANCELLED (CA), FAILED (F), TIMEOUT (TO),
                               PREEMPTED (PR), BOOT_FAIL (BF),
                               NODE_FAIL (NF) and SPECIAL_EXIT (SE))
scancel -u <user>            : cancel all jobs of a user
scancel <jobid>              : cancel the job with the given jobid
scancel -t PD -u <user>      : cancel all pending jobs of a user
scontrol show job -d <jobid> : show detailed info about a job

scontrol hold <jobid>        : hold a pending job (prevent it from starting)
scontrol release <jobid>     : release a held job
scontrol suspend <jobid>     : suspend a running job
scontrol resume <jobid>      : resume a suspended job
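On the batch nodes, each array task can select its slice of the shared file list from the SLURM_ARRAY_TASK_ID environment variable. A hedged sketch of what a jobScript may do (names and the analysis call are placeholders, not the actual jobScript_SL.sh):

```shell
#!/bin/bash
# Sketch of the jobScript side (hypothetical, not the hydra2 jobScript_SL.sh):
# each array task computes which lines of the shared file list belong to it
# from SLURM_ARRAY_TASK_ID (1-based here) and processes only those files.

process_chunk() {
    local filelist=$1 chunksize=$2 task=${SLURM_ARRAY_TASK_ID:-1}
    local first=$(( (task - 1) * chunksize + 1 ))
    local last=$((  task * chunksize ))
    sed -n "${first},${last}p" "$filelist"    # this task's share of the list
}

# demo: with 5 files and chunk size 2, array task 2 gets files 3 and 4
printf 'in_%d.root\n' 1 2 3 4 5 > /tmp/jobdemo.txt
SLURM_ARRAY_TASK_ID=2 process_chunk /tmp/jobdemo.txt 2 | while read -r infile; do
    echo "would analyze $infile"    # replace with the real analysis call
done
```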


Old cluster (prometheus/hera):

The old batchfarm runs on GridEngine (GE). Some rules to
work with the old cluster:

  • Files are written to the /hera/hades filesystem.
  • GE does not support filesystems other than /hera. Batch
    scripts can use neither the user's home directory nor /misc/hadessoftware.
  • The hadessoftware is distributed to the batchfarm via
    /cvmfs/hades.gsi.de/install/ .....
    The same path is used by hades squeeze64 desktop machines.
  • The batch jobs can be submitted to the farm from the pro.hpc.gsi.de
    cluster. This machine provides our software, the user's home directory and
    a filesystem mount of /hera. You can compile and test (run) your
    programs here.

 

http://wiki.gsi.de/cgi-bin/view/Linux/BatchFarm