Events:

2023-12-18 at 15:52 [dardel]
Job starts were resumed half an hour ago. Accesses to files in /cfs/klemming/ are expected to work normally again. Please report anomalies.
2023-12-16 at 14:49 [dardel]
Since this morning, 2023-12-16, a fraction of accesses to certain files residing on certain storage elements in /cfs/klemming/ take a very long time to complete. It is not yet known whether this is related to the file-system itself or to, e.g., the network and/or the clients mounting the file-system.

New job starts are paused until further notice. (Still the case as of 2023-12-18 at 10:00.)

2023-12-11 at 16:33 [dardel]
System software maintenance completed; job starts were resumed half an hour ago. Please report unexpected behaviour.
2023-12-06 at 16:17 [dardel]
System software maintenance planned to start this coming Monday 2023-12-11 at 08:00. The work is estimated to be finished within one to two days, i.e., system on-line again 2023-12-12.
2023-10-12 at 20:40 [dardel]
The unexpected HSN (high speed network) issues have been resolved; jobs have been running for roughly an hour. Today's HSN issue and the planned system software maintenance earlier this week were unrelated.
2023-10-12 at 07:32 [dardel]
All new job starts blocked while investigating if a changed HSN (high speed network) setting may cause HSN performance loss and/or HSN deadlock.
2023-10-11 at 17:07 [dardel]
System software maintenance completed; job starts were resumed half an hour ago. Please report unexpected behaviour.
2023-10-03 at 13:42 [dardel]
System software maintenance planned to start Tuesday 2023-10-10 at 09:00. The work is estimated to be finished within one to two days, i.e., system on-line again 2023-10-11.
2023-09-18 at 15:35 [dardel]
System software maintenance completed, job starts resumed.
2023-09-11 at 16:17 [dardel]
System will be unavailable for system software maintenance starting Monday 2023-09-18 at 09:00. The work is estimated to be finished within two days, i.e., system on-line again 2023-09-19.
2023-09-08 at 17:29 [dardel]
A so-called 'container' and 'pod' containing slurmctld (the scheduler master daemon) restarted unexpectedly a couple of hours ago. During the outage/restart, new job submissions failed. This has now been remedied, and operation should be back to normal.
2023-08-31 at 09:45 [dardel]
The primary login node is being restarted, as it behaved poorly after the system software maintenance.
2023-08-30 at 19:52 [dardel]
System software maintenance completed; job starts were resumed half an hour ago.
2023-08-22 at 13:33 [dardel]
System will be unavailable for system software maintenance starting Tuesday 2023-08-29 at 09:00. The work is estimated to take two days, i.e., system on-line again 2023-08-31.
2023-07-11 at 10:17 [klemming]
A cable replacement will take place for a klemming disk path. During preparation/actual replacement new job starts will be blocked. Accesses to /cfs/klemming/ might experience temporary freezes.

The work is anticipated to take one to two hours.

2023-07-09 at 21:33 [dardel]
The primary Dardel login node is experiencing problems (some processes seem to get stuck accessing the Lustre file-system) and will be rebooted shortly.
2023-07-04 at 11:28 [dardel]
An enforced failover of a flapping klemming disk path is about to be made. The failover is expected to last on the order of minutes. No new jobs will be allowed to start during the failover. File-system accesses will freeze during the failover.

After a successfully completed failover, jobs will be allowed to start running again.

2023-06-29 at 23:00 [dardel]
System work completed. User jobs have been running for a while. Please report unexpected behaviour of batch jobs.
2023-06-22 at 16:29 [dardel]
The planned downtime after midsummer has been changed to affect all of Dardel. It will start at 9.00 on Wednesday, June 28, and last until June 29, with June 30 as a backup.
2023-06-16 at 10:13 [dardel]
The Lustre file systems of Dardel will have a required software update after midsummer. Klemming (used by academic and some industry users) will be down from 9.00, June 27 until June 28. The Scania disk will be down from 9.00 June 29 until June 30. We believe Dardel should only be unavailable when the filesystem you use for the home directory and data is down.
2023-05-26 at 00:08 [dardel]
The PDC login portal (https://loginportal.pdc.kth.se/) is now online again.
2023-05-25 at 21:38 [dardel]
There are issues with using the PDC login portal. Consider it off-line for the time being.
2023-05-22 at 15:53 [dardel]
All nodes in the GPU partition have received a BIOS flash update, addressing a potential Escalation of Privilege situation. Jobs are allowed to run again.
2023-05-19 at 18:10 [dardel]
The internal CEPH storage used by most internal management services has now been restarted. Jobs have been starting for the past half hour. Several jobs that were started before the outage are still running.

Some jobs likely experienced issues during the outage.

(GPUs still off, security audit pushed forward in time.)

2023-05-19 at 11:36 [dardel]
The VM hosting the slurm master daemon, and other services on other VMs needed to manage the system, still have issues. Troubleshooting is ongoing.
2023-05-19 at 08:24 [dardel]
The VM hosting the slurm master daemon seems to have been stopped since 05:12 this morning. No job state changes have been possible since then. Investigation / restart in progress.
2023-05-18 at 23:38 [dardel]
No new job starts are allowed on the GPU partition while a potential Escalation of Privilege situation on GPUs is being investigated.
2023-05-13 at 06:27 [dardel]
The primary login has been redirected to another access node for several hours now. Please report any anomalies.
2023-05-12 at 23:02 [dardel]
The primary login is still having issues. Work is ongoing. Jobs already running or waiting in the queue are so far unaffected.
2023-05-12 at 21:06 [dardel]
The primary login node is unresponsive and is being rebooted.
2023-05-08 at 16:33 [dardel]
Access re-enabled; the system has been running jobs for half an hour. Please report anomalies.
2023-05-07 at 21:57 [dardel]
As several users have experienced, the updates are not yet fully completed and verified. Access remains disabled until further notice.
2023-04-24 at 09:00 [dardel]
For information on forthcoming system availability, please see 'Downtime for Dardel upgrade from 2-5 May 2023'.
2023-04-26 at 23:20 [dardel]
Job queues now restarted.
2023-04-26 at 19:30 [dardel]
During the benchmark testing earlier today we had multiple failures of the SLURM controller for unknown reasons. We hope that we can restart job queues later today. We have also placed a number of potentially problematic compute nodes into maintenance for further investigation.
2023-04-17 at 11:13 [dardel]
Connectivity between compute nodes and external entities is operational again. The exact cause has not been pinpointed, so the problem may happen again.
2023-04-17 at 10:10 [dardel]
Once again, connectivity between compute nodes and external entities, such as license servers, is not reliable. Investigation in progress.
2023-04-14 at 11:15 [dardel]
Connectivity between compute nodes and external entities operational again. New job starts enabled.
2023-04-14 at 09:48 [dardel]
Connectivity between compute nodes and external entities, such as license servers, is currently not reliable. Investigation in progress.
2023-04-12 at 12:21 [dardel]
The primary login node has been rebooted with updated HSN firmware in use; logins are enabled.
2023-04-12 at 11:01 [dardel]
Another flap of the HSN (high speed network) on the primary login node has occurred, and the node is being rebooted, this time with updated firmware to potentially reduce the risk of it happening again.
2023-04-10 at 11:03 [dardel]
A flapping HSN switch port has been reset, the login node rebooted, and logins enabled again.
2023-04-09 at 22:59 [dardel]
Unfortunately, the general login node is still unable to come properly on-line. User jobs already submitted to or running on the compute nodes of the system are progressing normally, with no reported problems. Investigation is continuing.
2023-04-09 at 20:09 [dardel]
The Dardel general login node has been very unresponsive since earlier today. Investigation ongoing.
2023-04-03 at 22:06 [dardel]
Login to the system re-enabled.

Information on how to use the updated system has been sent to the dardel-users mail-list, and can also be found at 'Dardel is updated and the GPU partition is open'.

2023-03-31 at 17:00 [dardel]
The upgrade work is ongoing, albeit somewhat slower than anticipated. The next update will be sent on Monday (April 3) at the latest.

(Please see 'Downtime to update Dardel starting on 27 March 2023' for a description.)

2023-03-20 at 14:53 [dardel]
Reminder: Upgrade set to start on 27 March 2023 at 8.00. The system may be inaccessible/down for the entire week. A block is in place to prevent jobs from running past that point in time.

(Please see 'Downtime to update Dardel starting on 27 March 2023' for a description.)
