[Issues] Peregrine cluster down

Ewout Helmich helmich at astro.rug.nl
Wed Jul 4 10:25:40 CEST 2018


Please note that, while Peregrine is down, it is possible to use the DPU
that is normally reserved for coaddition. Select it from the awe-prompt:

awe> dpu.set_dpu_client('coadddpu.astro.target.astro-wise.org')

or update your configuration file (~/.awe/Environment.cfg) by adding (or
changing) the dpu_name option:

dpu_name             : coadddpu.astro.target.astro-wise.org
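If you prefer to script this edit, here is a minimal sketch using Python's standard configparser module. It assumes Environment.cfg follows INI syntax with a [global] section, which may not match every installation; inspect your own file first, and note that the helper function below is illustrative, not part of the awe software.

```python
import configparser
import os

# Location of the Astro-WISE configuration file mentioned above.
cfg_path = os.path.expanduser("~/.awe/Environment.cfg")

def set_dpu_name(path, dpu_name, section="global"):
    """Add or update the dpu_name option in an INI-style config file.

    Assumes the file uses ':' or '=' as key/value delimiter, as in the
    example above. Creates the section if it does not exist yet.
    """
    parser = configparser.ConfigParser(delimiters=(":", "="))
    parser.read(path)  # a missing file is silently treated as empty
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "dpu_name", dpu_name)
    with open(path, "w") as fh:
        parser.write(fh)

# Usage (against a scratch copy rather than the live config):
# set_dpu_name("/tmp/Environment.cfg",
#              "coadddpu.astro.target.astro-wise.org")
```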

You can follow the status of your jobs by selecting this DPU from the
drop-down menu on the web page:

https://dpu.hpc.rug.astro-wise.org/

---

Of course it is also possible to run jobs on your local CPU by replacing:

awe> dpu.run(....)

with

awe> lpu.run(...)

(remove options specific to the DPU, such as dpu_time)
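For example, a task submitted to the DPU versus the same task run locally. The task name and keyword arguments here are placeholders, not a prescription; use whatever your own workflow normally passes to dpu.run, and simply drop the DPU-only options:

awe> dpu.run('SomeTask', ..., dpu_time=3600)   # remote DPU, with time limit
awe> lpu.run('SomeTask', ...)                  # same task on your local CPU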

Regards,

Ewout

On 07/03/2018 06:41 PM, Ewout Helmich wrote:
>
> Hi everyone,
>
> Just got this from the CIT.
>
> Kind regards,
>
> Ewout
>
> ---
>
> Dear Peregrine user,
>
>
> As you might have noticed, the Peregrine filesystems have crashed
> multiple times during the last week. At the moment it does not even
> seem possible to make /data available in a stable way.
> We have therefore decided to perform unscheduled maintenance and to
> start the upgrade of the storage environment on short notice.
> Fortunately, we have already done a lot of preparation, so we can
> minimize the downtime considerably by using temporary storage while
> we upgrade the original Peregrine storage systems.
>
> The maintenance will have the following consequences:
>
>   * From now on until Friday 06-07 Peregrine will be unavailable. If
>     we finish earlier, the system will be made available sooner.
>   * After the downtime Peregrine will be configured with temporary
>     storage. This means that we will later plan scheduled downtime to
>     switch back to the original, upgraded storage.
>   * All running jobs will unfortunately be lost. Waiting jobs will be
>     suspended.
>   * Since we don't have a copy of /scratch, this file system will be
>     empty when we resume operations.
>   * We will, however, provide read-only access to the old /scratch for
>     one week to allow you to copy important data.
>   * If possible we will make the login nodes available for read-only
>     access to the data. Some reboots will be necessary, however.
>
> We apologize for any inconvenience this unscheduled maintenance will
> cause. 
>
> Kind regards,
>
> Fokke Dijkstra
> HPC-team
>
>
> _______________________________________________
> Issues mailing list
> Issues at astro-wise.org
> http://mailman.astro-wise.org/mailman/listinfo/issues
