[Issues] Peregrine cluster down for planned maintenance today

Ewout Helmich helmich at astro.rug.nl
Tue Apr 17 11:29:06 CEST 2018


Hi everyone,

My apologies for the late notice. Peregrine is down today for planned
maintenance; see the announcement below.

Regards,
Ewout Helmich


-----------------

Dear Peregrine users,

Due to network maintenance in the DUO data center, the Peregrine cluster
will be unavailable on April 17. We will also use this downtime for
regular maintenance, which includes the following:

  * install security and (minor) operating system updates;
  * perform file system checks on all Lustre file systems;
  * update the SLURM scheduler to the latest version;
  * reorganize the software repository to allow for automatic selection
    of the right (CPU optimized) version of software modules on each
    type of compute node;
  * install a job profiling framework that allows you to monitor the
    resource usage of your job in much more detail.

You can read more about the last two items in the second issue of The
Flying Falcon newsletter
<https://redmine.hpc.rug.nl/redmine/attachments/download/517/April%203,%202018.pdf>.

The cluster should be back online before the end of the day, and we will
let you know when the maintenance has completed.

To drain all compute nodes and make sure that no jobs are still running
on April 17, we have created a reservation that prevents any job from
starting unless it is guaranteed to complete before that day.
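
As a rough illustration of what this means for queued jobs, the check
can be sketched as follows. This is only a simplified model, not the
scheduler's actual code: the function name can_start and the midnight
start time of the reservation are assumptions made for the example. The
idea is that a job may start only if its start time plus its requested
time limit falls before the reservation begins.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance reservation; the announcement only
# says "April 17", so midnight is a guess made for this illustration.
MAINTENANCE_START = datetime(2018, 4, 17, 0, 0)


def can_start(now, time_limit, reservation_start=MAINTENANCE_START):
    """Return True if a job started at `now` with the requested
    wall-clock `time_limit` is guaranteed to end before the reservation.

    Hypothetical helper, not part of SLURM.
    """
    return now + time_limit <= reservation_start


if __name__ == "__main__":
    noon_before = datetime(2018, 4, 16, 12, 0)
    # A 6-hour job fits before the maintenance window.
    print(can_start(noon_before, timedelta(hours=6)))
    # A 2-day job does not and stays pending until after the maintenance.
    print(can_start(noon_before, timedelta(days=2)))
```

In practice this means that requesting a shorter, realistic time limit
gives a job a better chance of still running before the maintenance.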

Best regards,
The HPC Team