Events:

2001-12-31 at 12:53 [xxx (SBC / CBR)]
sbc-m1 has been rebooted and is now up again.
2001-12-31 at 10:36 [xxx (SBC / CBR)]
sbc-m1.pdc.kth.se is down. This affects all scheduling on the SBC cluster. sbc-m1 will be rebooted later today.
2001-12-29 at 13:58 [xxx (strindberg)]
A full disk partition on the easy node caused scheduler confusion. The problem has been fixed and the queue has been restarted.
2001-12-29 at 13:07 [xxx (strindberg)]
Problem investigation in progress.
2001-12-28 at 13:00 [xxx (selma)]
As interactive logins are few, we'll move some things around in the computer room, which will result in a reboot of Selma. As the jobs can be checkpointed, interruption should be minimal. Selma is planned to be up again at 15:00.
2001-12-19 at 10:36
During the holidays, PDC helpdesk will be closed from 2001-12-24 to 2002-01-04. We reopen 2002-01-07.
2001-12-18 at 17:04 [xxx (strindberg)]
Nighthawk: node crash causes temporary unavailability of parts of /gpfs/projects and /gpfs/scratch.
2001-12-12 at 10:13
There is currently a network problem at SUNET which may affect PDC connectivity.
2001-12-11 at 11:41 [xxx (SBC / CBR)]
Unfortunately, sbc-07 blew the fuse on its power strip, bringing down all other machines on the same strip, i.e. sbc-01 through sbc-06 and the cluster switch.
2001-12-05 at 10:46 [xxx (SBC / CBR)]
The kakbit master machine (sbc-m1) is having some problems. As a result, job starting may be temporarily interrupted. Running jobs should be unaffected.
2001-11-29 at 23:36
The fileserver bream has been recovered and the batch queue on Strindberg has been restarted.
2001-11-29 at 15:26
Fileserver bream will be rebooted shortly; we'll see if that helps.
2001-11-29 at 11:49
We're having some problems with one file server; we're investigating.
2001-11-29 at 10:02 [xxx (HSM)]
hsm.pdc.kth.se will be serviced in about one hour for one hour. The HSM system will be unavailable during this time.
2001-11-28 at 14:00
Due to some required network rerouting and a switch software upgrade, there will be a couple of short network outages next Wednesday, Nov 28, at around 14.00. This will affect the demo root, the cube, the Linux Lab, and some server machines. The operation should be finished in 30 minutes.
2001-11-21 at 14:13 [xxx (strindberg)]
Strindberg: /gpfs/scratch will be unmounted for maintenance during the afternoon today, 2001-11-21. Batch lines for nodes depending on /gpfs/scratch will be held during maintenance.
2001-11-09 at 15:53 [xxx (strindberg)]
Strindberg: /gpfs/scratch faulty disk is replaced.
2001-11-09 at 14:20 [xxx (strindberg)]
Strindberg: /gpfs/scratch - disk fault. Recovery and replacement initiated.
2001-11-08 at 11:18
Boye crashed (repeatedly) with a kernel error. Investigation in progress.
2001-11-06 at 14:57 [xxx (strindberg)]
Nighthawk: temporary unavailability of /gpfs/{projects,scratch} resolved.
2001-11-03 at 18:07 [xxx (SBC / CBR)]
sbc-m1 back up.
2001-11-02 at 20:29 [xxx (SBC / CBR)]
sbc-m1 has crashed. It will be rebooted tomorrow. Kakbit on the SBC cluster will not work until sbc-m1 has been rebooted.
2001-11-02 at 15:14 [xxx (strindberg)]
Nighthawk; problems resolved. N, H, K-class nodes, and /gpfs/{scratch,projects} are back in production.
2001-11-02 at 15:14 [xxx (strindberg)]
Nighthawk; /gpfs/projects and /gpfs/scratch are still unmounted and N, H and K-nodes are kept off-batch until the problem is resolved.
2001-11-02 at 09:41 [xxx (strindberg)]
Nighthawk; /gpfs/projects and /gpfs/scratch will be unmounted at 12:00 today, 2001-11-02, but will be remounted within an hour.
2001-11-01 at 10:41
Boye : Back up again (for now).
2001-11-01 at 08:47
Boye : Continuing problems, machine is up but graphics are dead. The CAVE is down until further notice.
2001-10-31 at 22:47 [xxx (strindberg)]
Nighthawk/kallsup; One of the K-nodes, nf03n01, had a hardware fault and has been restarted.
2001-10-31 at 17:04
Boye crashed. Recovery in progress.
2001-10-29 at 11:25
Boye crashed for unknown reason. Back up now.
2001-10-26 at 13:40
sherman/spdoc shutdown and restart will take place during the afternoon.
2001-10-24 at 09:32 [xxx (HSM)]
The HSM server crashed with a kernel panic at 02:30 this morning. DMF will remain offline for a few hours more.
2001-10-23 at 16:42
Sprat and boye will be brought down for OS upgrades beginning at 9.00 on Wednesday 24/10. They should be back again after lunch.
2001-10-22 at 14:22
Nighthawk: /gpfs/kallsup back on line.
2001-10-20 at 08:52
Nighthawk: hardware fault causes loss of one gpfs filesystem, /gpfs/kallsup.
2001-10-19 at 11:56 [xxx (strindberg)]
Problem caused by Kerberos time-outs. A workaround is in place and we now use another Kerberos server.
2001-10-19 at 06:58 [xxx (strindberg)]
Problem investigation in progress for Strindberg and Kallsup.
2001-10-12 at 12:09 [xxx (strindberg)]
Strindberg; switch operation fault.
2001-10-11 at 10:31
Nighthawk nodes are back online.
2001-10-11 at 09:35
We have some problems with the nighthawk nodes.
2001-10-08 at 16:28
Informational: sherman aka spdoc reboot at 2001-10-08 at 17:00 to activate new software.
2001-10-07 at 15:54
HSM system back online.
2001-10-07 at 14:42
HSM system migration/recall stopped due to a full disk causing problems with a system database.
2001-10-06 at 18:29
Nighthawk: batch nodes are back in production. Note: new system software has been installed and activated on them.
2001-10-06 at 09:34
Nighthawk; HA subsystem inconsistency causes gpfs to unmount.
2001-10-10 at 14:00
The HSM system will be partially unavailable for up to two hours due to tape drive recabling.
2001-10-02 at 18:21
The switch adapter on the Easy node has crashed. Work is in progress.
2001-10-02 at 10:35
One of our AFS servers went down; investigation in progress.
2001-09-28 at 13:35
The HSM machine is down for hardware repairs. Back up in 30 minutes or so.
2001-09-27 at 22:09 [xxx (strindberg)]
Strindberg: a second switch clock fault for the same switch board has occurred. We will disable the suspect switch board and attempt to continue with reduced capacity.
2001-09-27 at 19:30 [xxx (strindberg)]
Strindberg; switch clocking fault. This failure affects nearly all Strindberg functions.
2001-09-27 at 13:53 [xxx (strindberg)]
Strindberg: temporary switch instability might have affected switch-dependent sub-systems, i.e. gpfs.
2001-09-27 at 10:41
The HSM machine may be briefly unavailable this morning.
2001-09-24 at 11:51
There might be some brief network outages due to a faulty switch.
2001-09-20 at 17:08
Networks/General; network improvement work continues. Next session is scheduled to start Wednesday 2001-09-26 at 17:00. Several subsystems may be affected.
2001-09-20 at 10:35
Broken routing fixed. All systems in production again.
2001-09-20 at 09:10
General network instability. All systems may be affected. Investigation in progress.
2001-09-20 at 01:28
Networks; network instability has caused problems, e.g. between some clients and file servers.
2001-09-19 at 09:24
We have some problems on the Nighthawk nodes; the queue is temporarily stopped.
2001-09-14 at 01:11
General: Network reconfiguration was completed several hours ago. SPs have been back in production since midnight.
2001-09-12 at 18:06
General: Reconfiguration of networks to take place starting at 2001-09-13 13:00. Several sub-systems will become unavailable, including the SPs.
2001-09-11 at 13:39
The HSM system will be brought down briefly for a hardware upgrade at 10.00 tomorrow morning (12/9).
2001-09-04 at 15:38
General: UPS maintenance scheduled to start tomorrow, 2001-09-05 at 10:00.
2001-09-04 at 15:37
General: Network reconfig complete. Further reconfigurations will follow within a seven day period.
2001-09-04 at 12:16
General: Reconfiguration of networks to take place starting at 2001-09-04 13:00. Several sub-systems will become unavailable.
2001-08-30 at 10:20
One of the fileservers went south. Should be back in operation.
2001-08-28 at 14:33
The HSM system is back online after repairs.
2001-08-27 at 11:21
The HSM system is down due to a hardware problem. It should be up again sometime tomorrow at the latest.
2001-08-24 at 10:35
We are currently experiencing some problems with the parallel filesystem; the queue is temporarily stopped.
2001-08-23 at 08:08
IBM SP: nodes put down by easy. Soon to be fixed.
2001-08-22 at 08:59
We are currently having problems with the SP. We are working to solve the problem.
2001-08-21 at 17:07
HSM system temporarily stopped due to a broken tape drive. Files not off-line can still be accessed. We expect the drive to be repaired tomorrow.
2001-08-21 at 14:18
Seems like more things are broken, stopping allocation
2001-08-21 at 13:57
GPFS problems on august; rebooting...
2001-08-15 at 14:59
The new HSM system has now been opened. It has a much larger cache filesystem (425GB), can store more data on tape (in excess of 25TB) and both tape and disk access times should be improved. See the HSM Usage Guide for more information.
2001-08-15 at 00:20
General; hardware problems with primary nameserver and mail server. Repair action in progress.
2001-08-14 at 19:27 [xxx (strindberg)]
Strindberg: Scheduling resumed; please report problems to sp2-staff.
2001-08-14 at 19:14 [xxx (strindberg)]
Strindberg: waiting for salvage of /gpfs/projects.
2001-08-14 at 17:59 [xxx (strindberg)]
Strindberg: Follow-on gpfs problems after the HA faults.
2001-08-14 at 15:06 [xxx (strindberg)]
Strindberg: High Availability wrongly assuming nodes down. This caused gpfs to unmount.
2001-06-19 at 13:45
sherman (spdoc) reboot.
2001-06-15 at 17:23
Strindberg login node crashed. Reboot in progress. Will take some time to investigate what went wrong.
2001-05-26 at 05:02 [xxx (strindberg)]
Strindberg/Nighthawk: gpfs/scratch & gpfs/projects restore is supposedly complete. Please report any missing files to pdc-staff.
2001-05-25 at 18:00
File server broken. Some users might not be able to access their data. File server rebuild and restore is in progress.
2001-05-25 at 19:25 [xxx (strindberg)]
Strindberg/Nighthawk; recreating and restore of /gpfs/projects and /gpfs/scratch started.
2001-05-25 at 14:14 [xxx (strindberg)]
Strindberg/Nighthawk: /gpfs/projects and /gpfs/scratch will be unavailable during a filesystem consistency check.
2001-05-23 at 17:13 [xxx (strindberg)]
Strindberg/Nighthawk: broken GPFS disk might cause problems for certain users of gpfs/scratch. Repair action initiated.
2001-05-23 at 12:33 [xxx (strindberg)]
Strindberg; reboot of log in node.
2001-05-21 at 10:22 [xxx (strindberg)]
Strindberg/Nighthawk; recovery of /gpfs/ in progress.
2001-05-18 at 07:10 [xxx (strindberg)]
Strindberg/Nighthawk: nf03n09 has been rebooted.
2001-05-17 at 13:01 [xxx (strindberg)]
Strindberg; HA subsystem causes gpfs not to mount properly on all nodes.
2001-05-07 at 14:07 [xxx (strindberg)]
Strindberg: rebooting the log in node due to excess paging space use.
2001-04-27 at 19:55
On Monday 30/4 PDC Helpdesk will be closed. It will reopen on Wednesday 2/5.
2001-04-27 at 18:17
One AFS fileserver has been down for 15 minutes, but should be back up again.
2001-04-09 at 14:53
Problems with fileserver
2001-03-16 at 17:10 [xxx (strindberg)]
Strindberg/Nighthawk; new SP switch2 activated. Most nodes will be in Interactive state during weekend burn-in.
2001-03-07 at 15:20 [xxx (strindberg)]
Strindberg: GPFS umount due to HA inconsistency.
2001-02-14 at 15:21
Kallsup : The system has been brought up again after a system failure.
2001-02-14 at 11:56
Kallsup: Some kind of hang; investigation in progress.
2001-01-11 at 14:14 [xxx (strindberg)]
Strindberg; gpfs/projects restore is complete.
2001-01-10 at 19:26 [xxx (strindberg)]
Strindberg; gpfs/projects - restore in progress. The filesystem will be mounted when restore is complete.
2001-01-09 at 23:36 [xxx (strindberg)]
Strindberg - /gpfs/projects; recovering the filesystem after migration failure during HW rearrangement (see the news page.)