Events:

2024-04-20 at 18:21 [dardel]
Singularity and other container-based jobs can now be used again on the compute nodes.

We also aim to restart the login nodes on Tuesday the 23rd of April, around 09:00. After the restart, containers will work on those nodes too.

All new jobs will be started on compute nodes where the lustre file-system client has been updated.
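
For users whose jobs use containers, a minimal sketch of a batch script running a Singularity container on a compute node is shown below (the account, image name, and program are placeholders, not values from this announcement):

    #!/bin/bash
    #SBATCH --account=naissXXXX-XX-XX   # placeholder allocation
    #SBATCH --nodes=1
    #SBATCH --time=00:10:00

    # Run a command inside a (hypothetical) container image.
    singularity exec my-image.sif ./my-program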

2024-04-11 at 21:29 [dardel]
Earlier today, starting around 15:15 and lasting a couple of hours, some jobs may have been affected by getting an empty default PDC module. Jobs specifying an explicit PDC/version should not have been affected.
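
As a general precaution against changes to the default module, job scripts can pin an explicit version; a minimal sketch (the version string below is a placeholder, check 'module avail PDC' for the real ones):

    # Load an explicit PDC module version instead of the default,
    # so the job is unaffected if the default changes or is broken.
    module load PDC/23.12   # placeholder version
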
2024-04-05 at 17:39 [dardel]
Containers using user namespaces are disabled until further notice. This means that, for example, Singularity will likely not work, and the same applies to other programs such as Firefox.
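
A quick way to check from a shell whether unprivileged user namespaces are currently available on a node (an informal diagnostic, assuming standard Linux and util-linux tools):

    # A value of 0 means user namespaces are disabled on this node.
    cat /proc/sys/user/max_user_namespaces

    # Alternatively, try to create one directly:
    unshare --user --map-root-user echo 'user namespaces work'
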
2024-03-18 at 19:29 [dardel]
The Dardel login node is having issues and is being rebooted again.

Until this is resolved or an update is available, we ask all users to refrain from actions other than submitting/checking jobs and editing plain files.

Please avoid spawning a new ssh session every other second, initiating massive file transfers, or starting I/O-intensive, multi-CPU/multi-task pre/post-processing analyses of very large data sets, etc.

2024-03-18 at 14:34 [dardel]
The Dardel login node is having issues and is being rebooted.
2024-03-14 at 16:30 [klemming]
The server side upgrade of the Klemming and Scania file systems is now done. Job starts have been resumed. Please report anything out of the ordinary.
2024-03-13 at 15:24 [klemming]
Tomorrow, March 14th, starting at 10:00 CET, we will upgrade the Klemming and Scania file systems to a version that should fix the server side bug. The file systems should stay available during the procedure, but a number of shorter freezes will occur. Already running jobs will be allowed to continue, but new job starts will be delayed during the procedure to minimize the risk of jobs being disturbed.
2024-03-11 at 11:25 [dardel]
The lustre server side of /cfs/klemming/ was restarted shortly after 08:00 this morning.

Any running job making use of /cfs/klemming/ between roughly 2024-03-10 20:00 and the restart this morning was likely affected, potentially getting completely stuck.

As many compute nodes have been flagged as being in poor shape and are not running any jobs, we will take this opportunity to restart them with a bug fix (CAST-35315) aimed at the lustre _client_ kernel bug.

The lustre server side bug remains.

New jobs will be started on compute nodes where the lustre client kernel bug is fixed.

2024-03-10 at 22:07 [dardel]
Login nodes and compute nodes lost contact with important parts of the server side of /cfs/klemming/ roughly an hour ago.

The ability to log in and to access files is seriously affected, as likely is any running job needing access to Klemming.
2024-02-27 at 15:05 [dardel]
The file system issue has been identified, and we are awaiting a fix from the vendor. Jobs are slowly being started, but please be aware that there is a risk of further outages until the fix has been delivered and applied. We are really sorry for the inconvenience.
2024-02-24 at 20:07 [dardel]
Serious file system problems; job starts have been disabled again, and investigation is ongoing.
2024-02-24 at 18:12 [dardel]
System maintenance is done, and Dardel has been running jobs for a few hours.
2024-02-14 at 18:00 [dardel]
Issues related to flapping network connectivity between file servers and compute clients have been addressed. Job starts resumed half an hour ago.

Please be aware that the forthcoming extensive update on 2024-02-19 is still planned, and that the internal bug in the lustre file-system also remains.

Important info can be found at issues/update.

2024-02-12 at 20:55 [dardel]
As the issues continue (also involving flapping connectivity between file servers and clients), no jobs residing under /cfs/klemming will be allowed to enter the running state.

Please find more information on this, and on the forthcoming update starting 2024-02-19, at issues/update.

2024-02-12 at 18:30 [dardel]
Status of the ongoing serious issues regarding the lustre client (/cfs/klemming), and of the forthcoming extensive upgrade starting 2024-02-19, can be found at issues/update.
2024-02-05 at 14:58 [dardel]
After the updates last week (starting Wednesday 2024-01-31), many applications have hit what seems to be an internal bug in the lustre file-system client.

Typically this manifests itself as jobs not terminating/finishing properly: nodes get stuck in the 'completing' state for long periods after a job finishes, and other jobs fail to start up properly on all nodes.
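
To check whether your own jobs are stuck in this state, the standard Slurm queue listing can filter on it, for example:

    # List your jobs currently in the completing (CG) state.
    squeue -u $USER -t COMPLETING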

Several applications seem to be hit by the bug; however, 'vasp' jobs appear to be affected more often.

Work to apply a workaround is ongoing.

2024-02-01 at 13:47 [dardel]
The system software update is finished, and jobs have been running for a while.

Please find the description of the updates.

Please be patient with new login sessions, as it takes time to populate the user's private module cache.

2024-01-24 at 16:16 [dardel]
The system will be unavailable due to a system software upgrade starting Wednesday 2024-01-31 at 10:00. The work is estimated to be finished within two days, with the system available again on Friday 2024-02-02. More information will follow at the beginning of next week.

Please see the announcement: Dardel being updated starting on 31 January.

2024-01-16 at 12:13 [dardel]
POD and workload manager restart/restore complete.
2024-01-16 at 11:27 [dardel]
The POD hosting the slurm workload-manager master daemon got a failure roughly half an hour ago. Restart/restore is in progress.