Events:

2008-12-19 at 16:46 [xxx (lucidor)]

The central switch of the Lucidor interconnect has had a critical hardware failure. As a consequence the queue on Lucidor is halted. Lucidor is not expected to be available for parallel jobs again until 12/1 at best.

2008-12-18 at 22:55 [xxx (lucidor)]

No nodes are available on Lucidor because the Myricom network seems to have been down for the past 24 hours. We are sorry for the inconvenience.

2008-12-14 at 19:09

Ekman: The node running the scheduler probably has a hardware fault. Any request to the batch-system will be stuck in a queue and not processed until the scheduler-node is on-line again.

2008-12-03 at 08:18

forwarded/informational: components of KTH networks will be upgraded 2008-12-04, starting at 19:00. Minor interruptions might occur.

2008-11-20 at 16:20

Jobs are now running on the clusters again after restoring the AFS-volumes from backup to another server. Please report anything that seems strange.

2008-11-19 at 19:12

update: the restore of the afs-server from tape is well under way.

2008-11-19 at 11:48

further info - a disk within a RAID broke. During recovery (rebuilding the RAID using the spare disk) a second disk failed and the RAID became inoperable. We will try a manual salvage of the RAID first and, if that fails, will need to restore the contents from backup.

2008-11-19 at 11:24

possible afs-problem, all node allocation suspended.

2008-11-14 at 12:37

The queue on Ferlin has now been started again.

2008-11-14 at 09:11

The ferlin system is temporarily stopped due to problems with the ethernet switch-setup.

2008-11-12 at 13:54

Network patching finished for now. Resuming job-start on all systems.

2008-11-12 at 07:54 [xxx (lenngren)]

Problems with mpi-licenses after network break-down. All node-allocation paused on lenngren/lise and lenngren/juliana.

2008-11-11 at 22:00 [xxx (lucidor)]

Cluster closed due to network problems.

2008-11-11 at 20:10

Extreme Networks router pdc1-gw crashed again after Cygate sent defective and too few replacement parts.

2008-11-04 at 20:59

This is for the PDC users that have their $HOME directory on one of CSC's AFS servers _only_: CSC reports problems with _one_ of their AFS servers. If your $HOME directory was there, your job(s) have probably crashed. /PDC staff on behalf of CSC staff.

2008-11-04 at 21:04

forwarded information for CSC users: One of the AFS file servers at CSC is out of order; work to solve the problem is in progress.

2008-11-03 at 10:15 [xxx (lenngren)]

Infiniband now restarted, and a few 'black-hole' nodes also removed. Jobs allowed to start again. If your job experiences unexpected problems, please report them.

2008-11-02 at 17:12

Problems with a router caused PDC networks to break. Connections within, to and from PDC were broken. Many running jobs that require the network were most certainly damaged.

2008-10-31 at 11:39 [xxx (lenngren)]

The batch system on Lenngren is halted because of Infiniband instabilities resulting in lots of crashed jobs. This may be a result of this week's power outage. Therefore no jobs will be run on Lenngren until Monday at the earliest.

2008-10-29 at 21:28

And all sections of lenngren on-line again. All systems should be considered working 'as usual' from now on. In case you find anything not working as expected, please report it to support.

2008-10-29 at 17:01

Ferlin now back online. Lucidor (aka luc2) back online since a couple of hours.

2008-10-29 at 10:45

Power restored. Now initiating the process to salvage systems.

The cause of the UPS shutdown was an overheat condition. Currently it is not clear whether the UPS failed and caused the room to overheat or the room cooling failed and caused the UPS to overheat first. The shutdown temperature was 42 °C, which is lower than the specified maximum temperature of the UPS. Several rectifier modules in the UPS are broken and have to be replaced. Eaton is shipping replacement parts. Until the UPS is operational again, it is bypassed and all calculation nodes are directly on the KTH power distribution. All servers are on a second, separate UPS system.

2008-10-29 at 08:40

Due to a major power failure, most PDC systems are currently unavailable. Investigation is in progress and more information will be forthcoming.

2008-10-29 at 07:55

One afs-server seems to be gone. Major impact on several systems.

2008-10-13 at 10:09

afs-server trevally.pdc.kth.se experienced a kernel panic early this morning. It has now been restarted and salvaged.

2008-10-10 at 23:13

Ferlin: the login-node has been hung since around 17:30 today. It was just power-cycled and then at least came online again.

2008-10-09 at 17:57 [xxx (Hebb)]

The whole machine is now fully functional again.

2008-10-01 at 13:49 [xxx (HSM)]

The tape library and the HSM are now fully functional again.

2008-10-01 at 12:43

Storage update: After 2 1/2 days of work, IBM managed to get the tape library online again. The HSM and TSM systems that access the library have been restarted and should be fully operational. Please report any further problems that we might have missed.

2008-09-29 at 19:00

The automatic tape library controller PC is still fried. IBM will continue the service tomorrow when more and hopefully better parts arrive at around lunchtime. So unfortunately no ETA on a fixed library yet.

2008-09-26 at 18:42

The automatic tape library has a permanent power failure.
IBM service notified, repair will start Monday.
HSM availability degraded.

2008-09-20 at 12:24

redistributed information: Maintenance work on some CSC servers will be performed on Saturday 20 September starting at 10 am. Most UNIX computers at CSC will be affected during this time. Services like email and www will also be affected.

2008-08-26 at 10:04

The AFS server glycine crashed again this morning. Most of it is back on-line again, but unfortunately one of the partitions is still down due to filesystem inconsistencies.

2008-08-25 at 10:32

The AFS server glycine is currently restarting after a crash. It will take some time for the filesystem consistency check to complete before the volumes are available again.

2008-08-20 at 16:36 [xxx (Hebb)]

Hebb will be unavailable Thursday-Friday this week because it is being moved. The whole system will be shut down at 08:00 Thursday, and will be available again as soon as the move is complete.

2008-08-15 at 11:55 [xxx (lenngren)]

The PDC Intro. to HPC Summer School will be held August 18-29. The students will have higher priority on Lenngren during MPI exercises in the second week. Exercises are always held during office hours.

2008-08-13 at 14:52 [xxx (lenngren)]

The login node, lise.pdc.kth.se, will be rebooted today at 15:00 due to kernel problems.

2008-08-04 at 15:52 [xxx (Hebb)]

Hebb was rebooted recently because one midplane was acting strangely. The whole system is now up and running again.

2008-07-28 at 14:15 [xxx (SBC / CBR)]

Impact: All users with files on server glycine.

During debugging of an overload condition on fileserver glycine, the server aborted (crashed) after delivering some debug information.

Currently salvaging file systems on glycine. Typical duration: 50 minutes.

2008-07-18 at 10:29

The login node of the ferlin cluster was restarted at 10:15 without prior warning due to a handling error. It should be back again now. We apologise for any inconvenience caused.

2008-07-17 at 10:43

During July and beginning of August PDC support will be low on staff, but available for your questions during office hours as usual! Please take into account that response times during this period might therefore be longer than what you are commonly used to. We hope that our users will respect that we will answer all questions, one at a time, and that you do not need to re-send your questions in hope of more attention. We at PDC support wish all our users a very relaxing Summer!

2008-07-16 at 13:30 [xxx (SBC / CBR)]

Followup to previous message:
The reason for the downtime was the combination of network problems and broken software (rdisc) which could not recover from the outage.
Status:
~200 nodes up again, more will follow
Easy restarted
Only "itchy" login node available yet
Please report possible problems we have not seen (yet).

2008-07-16 at 10:11 [xxx (SBC / CBR)]

Most of system unavailable. Investigating.

2008-07-13 at 19:12 [xxx (Hebb)]

After some additional disturbance in the system during the weekend, one midplane is now unavailable and will probably stay so the whole of next week.

2008-07-09 at 19:13 [xxx (Hebb)]

One node card unavailable. No ETA on replacement yet.

2008-06-26 at 13:27 [xxx (lenngren)]

The login node, lise.pdc.kth.se, was restarted because it ran out of resources. Everything is now running again.

2008-06-19 at 11:25

On Friday June 20th, the PDC helpdesk will be closed for Midsummer's Eve but we will be back to handle your support requests on Monday morning.

2008-06-18 at 12:15 [xxx (lucidor)]

Network (causing afs-problems) modified. Node allocation resumed.

2008-06-17 at 19:06 [xxx (lucidor)]

afs problems on lucidor; scheduling off-line.

2008-06-10 at 10:52

The network has now been stable for a day and we have re-enabled the allocation on Lenngren (lise, juliana), Lucidor-II, and the SBC-cluster. However, please report any oddities to us. Note that Ferlin still is off the network since it, in fact, was Ferlin that toppled the network in the first place. Some further investigation remains before Ferlin will be back online.

2008-06-09 at 14:34

The persisting problem with the network has been pinpointed and fixed. The network should be back to normal.

2008-06-05 at 20:46

All allocation paused due to (internal?) network hiccups.

2008-06-05 at 20:43

Network problems affecting all systems. Everything might not be back until after the weekend.

2008-04-25 at 10:19

PDC's helpdesk will close at lunchtime Wednesday April 30th and reopen again on Monday May 5th - due to Walpurgis Night[1], Labour Day[2], and a so-called "squeeze day"[3]. We hope the bugs and problems will take these days off, too. [1] http://en.wikipedia.org/wiki/Walpurgis_Night [2] http://en.wikipedia.org/wiki/Labour_day [3] http://www.kth.se/ict/anstallda/personal/1.9081

2008-03-25 at 18:28 [xxx (Hebb)]

The system is now fully functional again, with one new node card and two new CPU cards.

2008-03-25 at 13:00 [xxx (Hebb)]

One CPU card will be changed today between 13:00 and 15:00.

2008-03-19 at 16:38

The PDC helpdesk closes for the Easter Holiday at 12:00 on Thursday March 20th and reopens again on Tuesday morning. Happy holidays!

2008-02-17 at 18:05

We are currently having problems with one of our routers. Until this issue has been resolved network connectivity to and to some extent within PDC may be erratic.

2008-02-14 at 16:30 [xxx (SBC / CBR)]

It looks like it will take the whole evening to get the SBC AFS-servers back on track. We apologise for any inconvenience.

2008-02-14 at 15:00 [xxx (SBC / CBR)]

SBC AFS-servers valine and glycine are acting up. Investigation is in progress.

2008-01-30 at 21:19

Informational: mail1.kth.se has been down for large parts of today, and is expected to be down until at least tomorrow. Mail delivery to it will be delayed. Many, but not all, PDC employees use it to read mail, so their response times could be longer than usual.

2008-01-23 at 11:06

forwarding CSC info: Maintenance work on some CSC servers will be performed on Saturday January 26 starting at 9 am. Most UNIX computers at CSC will be heavily affected during this time. Services like email and www will also be affected.
