[Issues] Peregrine cluster down

Tue Jul 3 18:41:08 CEST 2018

Hi everyone,

Just got this from the CIT.

Kind regards,

Ewout

---

Dear Peregrine user,

As you might have noticed, the Peregrine filesystems have crashed 
multiple times during the last week. Currently it does not even seem 
possible to make /data available in a stable way.
This is why we have decided to perform unscheduled maintenance and start 
the upgrade of the storage environment on short notice.
Fortunately we have already done a lot of preparation. By running on 
temporary storage while we upgrade the original Peregrine storage 
systems, we can reduce the downtime considerably.

The maintenance will have the following consequences:

  * From now until Friday 6 July, Peregrine will be unavailable. If we
    finish earlier, the system will be made available again sooner.
  * After the downtime Peregrine will be configured with temporary
    storage. This means that in the future we will plan scheduled
    downtime to switch back to the original upgraded storage.
  * All running jobs will unfortunately be lost. Waiting jobs will be
    suspended.
  * Since we don't have a copy of /scratch, this file system will be
    empty when we resume operations.
  * We will, however, provide read-only access to the old /scratch for
    one week to allow you to copy important data.
  * If possible, we will make the login nodes available for read-only
    access to the data. Some reboots will be necessary, however.

We apologize for any inconvenience this unscheduled maintenance will cause.

Kind regards,

Fokke Dijkstra
HPC-team