Events:

2004-12-17 at 16:45 [xxx (lucidor)]
One of the interactive nodes on Lucidor refused all kinds of connections and had to be rebooted. It is back online now.
2004-12-15 at 14:58 [xxx (SBC / CBR)]
The SBC AFS fileserver cysteine.pdc.kth.se will be restarted at 19:00 today (2004-12-15). The restart is expected to be unnoticeable, but may (depending on the severity of the fault causing the restart) cause a downtime of approximately 45 minutes. Please note that most SBC home directories reside on cysteine. The reason for the restart is that the volserver, which handles, among other things, backups, is unresponsive, which effectively stops backups of the volumes on cysteine until it is fixed.
2004-12-11 at 15:51 [xxx (lucidor)]
Lucidor: gigabit ethernet unreliable. One random 4-port section hangs every 12 to 24 hours. The GigE switches have to be manually powered off and on to recover. Jobs on nodes on frozen ports are affected.
2004-12-09 at 13:44 [xxx (lucidor)]
Hardware maintenance on redundant parts of the lucidor Gigabit ethernet. As a precaution node allocation has been paused between 1400 and 1500 hours.
2004-12-08 at 12:08 [xxx (lucidor)]
Redundant power-supply removed. Switch on-line again. Only jobs executing on nodes h04n{XX} should have been affected.
2004-12-08 at 09:12 [xxx (lucidor)]
An ethernet switch fault has resulted in networking problems on the Itanium system. Job allocation is paused and the system is currently inaccessible from the outside. Work is in progress to resolve the problem.
2004-12-13 at 10:00 [xxx (HSM)]
System closed due to tape library hardware changes (these were planned for last week but were cancelled due to an event external to PDC).
2004-11-30 at 13:31 [xxx (strindberg)]
Nighthawk - /gpfs/projects: one of the disk-servers is being restarted.
2004-11-23 at 17:30 [xxx (lucidor)]
Blumino.pdc.kth.se (login node of Lucidor) ran out of resources and had to be rebooted. Sorry for any inconvenience.
2004-11-22 at 12:30 [xxx (HSM)]
The HSM system will be unavailable for a few hours for an OS upgrade.
2004-11-18 at 13:00 [xxx (HSM)]
The HSM system will be unavailable during the afternoon due to a software upgrade.
2004-11-15 at 17:22 [xxx (HSM)]
The HSM system had to be rebooted due to a piloting error. The culprit is very sorry.
2004-12-03 at 08:00 [xxx (HSM)]
The HSM system will be unavailable during the entire day due to a tape library hardware change.
2004-11-15 at 15:18
Forwarded - Informational: for users with data at nada.kth.se, quote: "Maintenance work on the AFS servers at Nada will be done on 27 November starting at 10:00 am. Most UNIX computers at Nada will be inoperable during this time. Other services, like E-mail and WWW servers, will also be affected."
2004-11-16 at 13:00 [xxx (HSM)]
The HSM system will be down for a hardware upgrade for 5 hours during the afternoon.
2004-11-10 at 16:22
Due to upgrades of the power infrastructure at PDC, a service window has been scheduled for the SBC, Lucidor, Swegrid and Nighthawk clusters between 10:00 and 14:00, Wednesday 2004-11-17. The expectation is that no interruptions will occur. However, users of the login and interactive nodes should be aware that there is a heightened risk of power loss during the service window.
2004-11-08 at 13:49 [xxx (SBC / CBR)]
Due to what is suspected to be AFS client problems, the SBC cluster login node itchy.pdc.kth.se will be restarted within 5 minutes.
2004-11-04 at 13:16
Forwarded - Informational: for users with data at nada.kth.se: "Tonight (Nov 4) at 5 pm it is unfortunately necessary for us to restart the AFS-server gre. The stop should be fairly short, but many computers at Nada could be affected."
2004-10-29 at 23:18
Gaussian G03, revision B.05 was temporarily unavailable the past hour due to a fileserver mishap.
2004-10-28 at 21:36
Lucidor and Linux IA32 systems:
Updated versions of Intel compilers version 7 installed
Updated versions of Intel MKL libraries installed (v7.0.1)
Default version of Intel MKL library bumped to 6.1.1
2004-10-27 at 14:29 [xxx (strindberg)]
Interactive node on Kallsup is down. It may take until tomorrow before it is back up.
2004-10-25 at 16:00
The AFS server carp will be restarted Wednesday 2004-10-27 between 10:00 and 13:00. Files residing on carp will be temporarily unavailable.
2004-10-05 at 15:47 [xxx (SBC / CBR)]
The AFS fileserver cysteine.pdc.kth.se has been restarted, causing home volumes for most SBC users to be temporarily unavailable.
2004-10-04 at 20:24
AFS: fileserver process on carp restarted.
2004-10-04 at 17:20 [xxx (strindberg)]
Kallsup: the login node crashed. Reboot in progress.
2004-10-04 at 12:02
Fileserver process on gills restarted.
2004-10-02 at 10:00
All systems: the AFS server gills has had a fault. Most of PDC's systems affected, ranging from partially to entirely.
The AFS server process accepted requests without serving them. We have restarted the server and it has been up and running again since approx. 11:55.
2004-09-05 at 17:50 [xxx (lucidor)]
Login node (blumino) is being rebooted to reclaim system resources.
2004-09-17 at 10:14 [xxx (SBC / CBR)]
Tuesday 040921 at 10:00 the SBC database server sbcdb.pdc.kth.se will be restarted due to software upgrades. It is expected to be available again no later than 10:30.
2004-09-17 at 09:48 [xxx (SBC / CBR)]
Tuesday 040921 at 10:30, the interactive nodes of the SBC cluster will be shut down for software upgrades and hardware reconfiguration. The nodes are expected to be available again no later than 17:00 the same day.
2004-09-17 at 09:42 [xxx (SBC / CBR)]
Tuesday 040921 the login nodes of the SBC cluster, itchy, scratchy and krusty will be restarted due to software upgrades. The downtime is expected to be 10 minutes, starting 09:00, 09:20 and 09:40 respectively. The cluster job queue will be paused between 09:00 and 10:00.
2004-09-03 at 12:56 [xxx (HSM)]
The HSM and backup systems will be down on Monday 6/9 and Tuesday 7/9 for a tape library system upgrade.
2004-08-25 at 19:26 [xxx (SBC / CBR)]
The login node krusty.pdc.kth.se has been restarted because excessive load made it unresponsive. It should be back again at approximately 19:45. The queue will be paused and sbcweb unavailable until then.
2004-08-13 at 14:44 [xxx (SBC / CBR)]
The login node called itchy (s02n01) will be reinstalled next Wednesday (Aug 18th) between 13:00 and 16:00 (probably shorter). Please don't submit any new jobs from itchy until after that time. Use krusty or scratchy instead.
2004-08-12 at 17:59 [xxx (lucidor)]
The traditional PDC summer school will take place on August 16-27. During these two weeks the students will have higher priority than other users in the queue on Lucidor.
2004-02-06 at 01:00 [xxx (strindberg)]
Kallsup: /gpfs/kallsup should be back online, however with reduced capacity for the time being. Unfortunately some jobs might have been affected during the past hour.
2004-08-05 at 14:05 [xxx (strindberg)]
Kallsup is being rebooted for maintenance.
2004-07-16 at 20:13 [xxx (strindberg)]
/gpfs/projects and /gpfs/scratch are now back online. /gpfs/kallsup, however, is not, and won't be until, hopefully, next week. The queue is now running again.
2004-07-15 at 18:48 [xxx (strindberg)]
GPFS still down due to broken node. Waiting for IBM service.
2004-07-14 at 18:14 [xxx (strindberg)]
The batch system has been paused due to problems with the GPFS filesystem. The system will hopefully be up again sometime tomorrow.
2004-07-08 at 14:52 [xxx (SBC / CBR)]
2004-07-12 13:00 to approximately 15:00, the V-nodes in the SBC cluster will be down for software maintenance (a new kernel and AFS client will be installed).
2004-07-08 at 14:49 [xxx (SBC / CBR)]
2004-07-09 13:00 to approximately 15:00, the U-nodes in the SBC cluster will be down for software maintenance (a new kernel and AFS client will be installed).
2004-07-07 at 11:00 [xxx (lucidor)]
Kernel revision bumped. Executables using MPI (mpich/gm) will have to be recompiled/relinked.
2004-07-02 at 17:02
Informational, for Nada users: Urgent system maintenance on several servers will start at 10 am tomorrow, Saturday (3 July). This will affect nearly all of Nada's computers.
2004-06-22 at 17:30 [xxx (SBC / CBR)]
The SBC cluster node itchy.pdc.kth.se will be down for maintenance on 20040623. The scheduled service window is between 08:00 and 16:00 but will very likely be shorter. Please use krusty.pdc.kth.se or scratchy.pdc.kth.se during the downtime.
2004-06-15 at 14:31
The email system for kth.se will be upgraded during the weekend June 18 through June 20. The work will begin June 18 (Friday) at 18:00. The system will be partially unavailable during the entire weekend. Some preparatory work, which may cause a delay in email delivery, will be performed during the evening of Thursday June 17.
2004-05-26, start at 2004-05-27 09:00
KTHNOC and PDC will test the power redundancy systems (UPSes, generator, -48 VDC supplies) by means of a full-scale power loss simulation. We will disconnect the 6 kV supply to the building, and monitor all systems for behaviour during the outage. Affected: We do not expect any outages, due to extensive testing of various subsystems during the last few weeks, but there is certainly a heightened risk during the test. We will post operators in all rooms, and have the opportunity to switch back immediately if we discover any problems.
2004-05-24 at 17:44 [xxx (lucidor)]
New Intel compilers v.8 available.
module add i-compilers/latest
C/C++: 8.0.066 Fortran: 8.0.046
2004-05-19 at 08:39
Ascension Day (Kristi himmelfärds) is approaching, which is a Swedish holiday. The PDC helpdesk will be closed from May 20th, 2004. It will open again on May 24th, 2004.
2004-05-18 at 15:20 [xxx (SBC / CBR)]
The SBC-cluster login node scratchy.pdc.kth.se is up again and the queue is un-paused.
2004-05-18 at 13:13 [xxx (SBC / CBR)]
Due to a faulty disk on the login node scratchy.pdc.kth.se, the SBC-cluster queue has been temporarily stopped. Estimated service time is 2 hours.
2004-05-13 at 18:03 [xxx (strindberg)]
Nighthawk - node serving /gpfs/scratch went down. Now restarted.
2004-05-13 at 17:00 [xxx (strindberg)]
Nighthawk (kallsup) became fully operational at 13:00.
2004-05-13 at 10:24 [xxx (strindberg)]
Some nodes on the IBM SP (nighthawk, kallsup) are unreachable and others are running slow. PDC is looking into the problem. A fix will likely not be possible until after 5 PM today.
2004-05-11 at 13:12
PGI compilers 5.1-6 installed on all x86 and AMD Linux systems. Use module add pgi/latest to test.
2004-04-14 at 19:00
All compute nodes, sp-doc, tsm et cetera should be back except for the SBC nodes, which are currently being moved (logically) to a network with room for more machines. Please let us know if anything seems broken.
2004-04-05 at 15:17 [xxx (strindberg)]
The Nighthawk login node will be rebooted some time between 1600 and 1700 today, 2004-04-05.
2004-04-01 at 20:00
Mail deliveries to PDC have been quite sluggish most of the day. If you haven't received a reply to mail you sent today, this might well be the reason.
2004-04-01 at 15:58 [xxx (lucidor)]
Default version of Intel compilers changed to version 7.1-41 on Lucidor (IA64) and all IA32 machines.
2004-03-30 beginning 2004-04-14 at 08:00
Planned stop of all calculation nodes starting 2004-04-14 at 08:00 because of maintenance in parts of the power supply system (UPS-B). Network, login nodes and file services stay up and should not be affected. Duration at most 8 hours but probably shorter.
2004-03-28 at 13:48
One of the AFS database machines is down. If anything seems a bit sluggish, this may be the cause.
2004-03-28 at 01:28 [xxx (lucidor)]
Gaussian-03 runs between Thursday evening and now crashed with "error while loading shared libraries" due to a mistake during the compiler upgrade. This has hopefully now been corrected by retrieving the missing files from backup.
2004-03-25 at 22:42 [xxx (lucidor)]
Updated Intel 8.0 compilers on all Intel systems

module add i-compilers/latest
Version 7.1 is still default.
2004-03-25 at 20:59 [xxx (lucidor)]
Updated Intel MKL libraries 6.1.1-004 on all Intel systems

module add mkl/latest

2004-03-25 at 18:42 [xxx (lucidor)]
Updated Intel 7.1 compilers (now 7.1-31) on IA64 and I386. Try them out by:

   module del i-compilers

   module add i-compilers/new

The previous version (7.1-31) will remain the default version for a short period unless there are any problems with the new version.
2004-03-18 at 16:34
Original message: Power failure, all systems down.

Longer explanation: The part of the newly delivered UPS system that powers the computational parts of the PDC clusters had a dramatic failure. For reasons yet unknown both redundant UPSes tripped their 3x315A fuses and took out the main breaker for the building with them. Fortunately the UPS system that powers the file servers survived and the backup diesel kicked in. As there was no power loss for the fileservers, data loss should be minimal. If you still want to know more, contact pdc-staff as usual. 2004-03-18 18:26 / haba.

2004-03-13 at 23:38 [xxx (SBC / CBR)]
The AFS file server alanine.pdc.kth.se is back. The reason for the crash is currently not known.
2004-03-13 at 21:09 [xxx (SBC / CBR)]
The AFS file server alanine.pdc.kth.se is currently down, investigation in progress.
2004-02-18 at 14:30
The license servers will be moved at 17.00 today. They should be up again within two hours.
2004-02-10 at 10:44 [xxx (HSM)]
Problems with a tape drive are currently stopping tape activity.
2004-02-05 at 14:57
General: primary router has a randomly broken linecard. As the backup router is not on location (we are still moving) this led to severe outages in all things relying on network connectivity. Broken linecard grounded.
2004-02-05 at 12:24
General: running intrusive diagnostics on the primary router. All network dependent equipment affected.
2004-02-05 at 09:09 [xxx (SBC / CBR)]
We're having some network problems that cause AFS to be unavailable on the SBC cluster. Problem determination in progress.
2004-02-02 at 15:53
Further moving on 2004-02-03 and 2004-02-04; information can be found through the news&events icon.
2004-01-27 at 09:00 [xxx (HSM)]
The HSM system will be rebooted due to a software upgrade.
2004-01-28 at 18:00 [xxx (SBC / CBR)]
Due to system software upgrades, the login-nodes of the SBC-cluster, scratchy and krusty, will be restarted at 18:00 Wed 28th Jan. They should be up and running again at 18:15 the same day. The queue will be paused during the upgrade. Please contact sbc-staff@pdc.kth.se if you have any questions.
2004-01-20 at 14:52
Nighthawk SP back online.
2004-02-16 at 08:00 [xxx (HSM)]
The HSM system will be relocated to the new computer hall together with the tape library. It will be unavailable for 2-4 days depending on how agreeable Mr. Murphy is.
2004-01-09 at 12:18 [xxx (HSM)]
The HSM system is now back online.
2004-01-08 at 13:46 [xxx (HSM)]
The HSM system is currently down due to hardware problems. The system will hopefully be back again some time tomorrow.
2004-01-07 at 12:50
"God fortsättning!" PDC helpdesk is now open again. (08-790 7800 is manned and mails to pdc-staff@pdc.kth.se will be answered.)