Events:

1998-12-31 at 12:19 [xxx (strindberg)]
Strindberg: Switch adapter up again. Scheduler resumed.
1998-12-31 at 11:53 [xxx (strindberg)]
Strindberg: PIOFS errors. Scheduler stopped. Investigating.
1998-12-14 at 13:00 [xxx (strindberg)]
Strindberg: Wednesday service window moved to tomorrow, 1998-12-15.
1998-12-11 at 19:00
Network: there was a network outage for a few minutes a while ago.
1998-12-09 at 15:45
Selma: Please read about maintenance in the current news/info.
1998-11-25 at 17:00 [xxx (strindberg)]
Strindberg: all /pfs data lost due to switch fault. Please inform us if your job crashed due to lost data.
1998-11-25 at 13:30 [xxx (strindberg)]
Strindberg: Scheduler temporarily paused due to switch problems.
1998-11-21 at 10:00 [xxx (strindberg)]
Strindberg: /pfs problems remain. Please report if you think your job is affected.
1998-11-19 at 00:31 [xxx (strindberg)]
Strindberg: Restore of /pfs in progress, scheduling as usual.
1998-11-18 at 22:00 [xxx (strindberg)]
Strindberg: scheduling is resumed (currently semi-manual.)
1998-11-18 at 15:00 [xxx (strindberg)]
Strindberg: a first estimate of service window end is at 18:00, 1998-11-18.
1998-11-16 at 20:00 [xxx (strindberg)]
Strindberg: Piofs will be reformatted during the service window Wednesday 1998-11-18. All files will be lost.
1998-11-16 at 14:00 [xxx (strindberg)]
Strindberg: Piofs fault search. Node allocation paused.
1998-11-07 at 23:59
AFS: one fileserver crashed at 1998-11-07 18:45. Please report problems to pdc-staff@pdc.kth.se
1998-11-05 at 18:00 [xxx (strindberg)]
Strindberg: restarting is overdue.
1998-11-05 at 11:15 [xxx (strindberg)]
Strindberg: fileserver node back. Please report any problems with the parallel filesystem.
1998-11-05 at 08:30 [xxx (strindberg)]
Strindberg: loss of one parallel fileserver node. repair in progress.
1998-11-03 at 13:30 [xxx (strindberg)]
Strindberg: Hardware maintenance window delayed until Thursday, 1998-11-05, 15:00 through 17:00.
1998-10-30 at 11:20
Main power outage at 11:03. We went to backup power during the outage but are now back to normal. You should not have noticed anything.
1998-10-29 at 15:00 [xxx (strindberg)]
Strindberg: Hardware maintenance window enabled 1998-11-04, 13:00 through 15:00.
1998-10-26 at 13:20
/pfs has been back for a few hours. Scheduling resumed.
1998-10-26 at 09:33 [xxx (strindberg)]
Problems detected with PIOFS (/pfs), the parallel filesystem on strindberg. Investigation in progress. Scheduling of jobs stopped.
1998-10-23 at 10:00
We inaugurate our first six-wall cave today. All other business runs as usual but you might experience an increased latency trying to get hold of people.
1998-10-15 at 09:15 [xxx (strindberg)]
Strindberg: Network hardware replaced. Strindberg connected again.
1998-10-15 at 07:31 or earlier [xxx (strindberg)]
Strindberg: Network hardware failure. Strindberg without connection to the rest of the world.
1998-10-14 at 17:40 [xxx (strindberg)]
Strindberg: the new system software is in place. Please let us know whether your job was disturbed.
1998-10-14 at 11:38 [xxx (strindberg)]
Strindberg: All nodes will be rebooted during the afternoon, 1998-10-14, to activate new system software.
1998-10-14 at 11:27 [xxx (strindberg)]
Scheduler problems on strindberg. Scheduler temporarily stopped.
1998-10-12 at 22:00 [xxx (strindberg)]
Strindberg: All nodes will be rebooted during the service window allocated for tomorrow, 1998-10-13. This includes the log in node.
1998-10-07 at 12:50
On October 14 there will be preventive maintenance on selma for about 8 hours. Should start in the morning. During this time the machine will not be available.
1998-10-07 at 01:04 - 02:30
AFS file server died. Jobs may have been affected. Please report any problems.
1998-10-05 at 16:38
Network problems (routing related) resolved.
1998-10-05 at 15:48
Network problems, investigation in progress.
1998-09-29 at 16:55
Main computer room cooling: Tomorrow morning there will be service done on this equipment. In theory it should not affect anything.
1998-09-18 at 18:10
AFS: One fileserver salvaging. Scheduling temporarily paused.
1998-09-16 at 18:00 [xxx (strindberg)]
Net: there were problems with the network between strindberg and some servers. Now resolved.
1998-09-12 at 14:00
AFS: One fileserver crashed. We don't think anything was lost.
1998-09-11 at 18:55 [xxx (strindberg)]
Strindberg: Allocation of nodes resumed. It was paused for an hour while babysitting a nervous fileserver.
1998-09-10 at 02:00
Kallsup: It will be several hours before the machine is back up. Hopefully the scheduler will be started sometime in the morning.
1998-09-09 at 20:30
Kallsup: File permissions have been severely screwed up. Until this has been fixed, further logins will be disabled. The scheduler is also stopped. Time to fix: unknown.
1998-08-25 at 13:15
Selma: Running. A patch to fix the "sudden death" problem will be installed tomorrow 1998-08-26 around lunchtime.
1998-08-25 at 13:10 [xxx (strindberg)]
Strindberg: one /pfs-server temporarily stuck. Running jobs might have been affected.
1998-08-25 at 12:10
Selma: Computer is not available. Investigation in progress.
1998-08-24 at 17:00
Kallsup: Operating system hang, dump and reboot in progress.
1998-08-19 at 22:05
Kallsup: The malfunctioning drive holding /home/10 and /home/11 has been replaced. No data was lost. However, running jobs were terminated.
1998-08-19 at 20:30 [xxx (strindberg)]
Strindberg: Problems have been resolved and the system is running normally again.
1998-08-19 at 19:30 [xxx (strindberg)]
Strindberg: a failing job manager node caused global job termination... Investigation in progress.
1998-08-19 at 14:45
Kallsup: Back online. The machine will be shut down after 22:00 tonight for maintenance on /home/10 and /home/11.
1998-08-19 at 14:00
Kallsup: Disk problems. Investigation in progress.
1998-08-10 at 14:00
Selma: Reboot because of file system reconfiguration and a kernel software problem.
1998-08-06 at 17:40
AFS: One fileserver did crash - again. We don't think anything was lost. Please let us know if we're wrong.
1998-08-05 at 08:30
AFS: One fileserver did crash early this morning. Investigation of loss in progress.
1998-08-04 at 16:15
HSM&Kallsup: Kallsup rebooted to enable serviced hardware in the HSM. No access to the HSM is possible during the reboot.
1998-07-30 at 23:30 [xxx (strindberg)]
Strindberg: One /pfs-server had a hardware fault. All jobs using the parallel file system might have faulted.
1998-07-29 at 21:30
Kallsup: Jobs seem to hit `CPU limit exceeded' earlier than intended. Please report problems to kallsup-staff.
1998-07-29 at 15:30
AFS: one fileserver did crash. Back online.
1998-07-29 at 13:25 [xxx (strindberg)]
Strindberg: closing service window. We did apply a switch E-fix.
1998-07-26 at 17:17
HSM&Kallsup: back online. Kallsup users might experience loss of files due to crashed jobs.
1998-07-26 at 01:50
HSM&Kallsup: cannot yet determine the exact cause of the error. Kallsup is offline until at least tomorrow, 1998-07-26 (i.e., later today).
1998-07-25 at 23:00
HSM&Kallsup: SCSI-bus errors on one channel. Investigation in progress.
1998-07-17 at 13:00 [xxx (strindberg)]
Strindberg: /pfs to be reformatted (erased). Please contact sp2-staff for problems you experience related to this.
1998-07-16 at 16:00 [xxx (strindberg)]
Strindberg: /pfs move is complete.
1998-07-13 at 16:00 [xxx (strindberg)]
Strindberg: new attempt to move /pfs on Wednesday, 1998-07-15.
1998-07-13 at 13:00 [xxx (strindberg)]
Strindberg: upgrade complete. PSSP 2.4 installed and running.
1998-07-12 at 16:15 [xxx (strindberg)]
Strindberg: Upgrade of Parallel System software in progress. Upgraded nodes will not be released back to the batch pool before they have been thoroughly tested (just guess if we have previous upgrade experiences...)

This means that the number of available nodes for batch processing will be significantly reduced for a few days.

1998-07-10 at 12:35
Selma: Back in business again.
1998-07-10 at 11:52
Selma: Computer unavailable. Investigation in progress.
1998-07-09 at 12:30 [xxx (strindberg)]
Strindberg: We have restored /pfs to the contents it had as of 1998-07-08 11:00.
1998-07-09 at 11:00 [xxx (strindberg)]
Strindberg: more problems with /pfs, scheduling halted.
1998-07-09 at 10:00 [xxx (strindberg)]
Strindberg: /pfs problem: server process internal error. Pfs has been restarted, and the file system is available again.
1998-07-08 at 15:15 [xxx (strindberg)]
Strindberg: /pfs maintenance continues; jobs that have files not yet available will be deferred until they become available.
1998-07-07 at 11:00 [xxx (strindberg)]
Strindberg: maintenance on /pfs planned for tomorrow, 1998-07-08 between 11:00 and 15:00.
1998-07-06 at 15:19
HSM: Further spare parts that have been shipped were delayed during transport. The filesystem is thus still unavailable.
1998-06-30 at 11:37
Selma: Some filesystems are missing due to a controller failure. We hope to have an interim solution in place around Tuesday afternoon and to be back online before Monday.
1998-06-29 at 16:05
HSM: The problems have been traced to the Maxstrat disk system. HSM has limited accessibility. Queues on selma and kallsup are temporarily frozen to avoid job loss.
1998-06-29 at 14:20
HSM: Temporary problems in HSM, investigation in progress.
1998-06-26 at 21:00 [xxx (strindberg)]
Strindberg: piofs back, scheduling resumed.
1998-06-26 at 15:00 [xxx (strindberg)]
Strindberg: patches applied, new problem: one piofs server node has disk problem.
1998-06-26 at 11:00 [xxx (strindberg)]
Strindberg: we will apply patches on all PEs on 1998-06-26 between 12:00 and 14:00.
1998-06-24 at 17:00 [xxx (strindberg)]
Strindberg: frame 05 put back in batch.
1998-06-23 at 23:20
Selma: System will not be available 1998-06-24 11:30 to 12:30.
1998-06-14 at 14:00 [xxx (strindberg)]
Strindberg: frame 05 seems to be hit by switch problems. Put in service.
1998-06-04 at 18:00 [xxx (strindberg)]
Strindberg: the SMP-nodes, M, are publicly available.
1998-06-04 at 14:30
AFS: replacing bad SIM in one volume database server. Is a transparent operation according to the book.
1998-06-03 at 16:30 [xxx (strindberg)]
Strindberg: Switch problems.
1998-06-01 at 09:00
Kallsup: the system has been rebooted to activate new hardware.
1998-05-28 at 22:51 [xxx (strindberg)]
Strindberg: we just restarted the job manager. All running jobs were probably affected.
1998-05-27 at 10:00 [xxx (strindberg)]
Strindberg/piofs: /pfs has been reset to a clean structure due to a faulty disk. Only the default /pfs home-directory structure remains.
1998-05-27 at 05:15 [xxx (strindberg)]
Strindberg/piofs: the machine will be emptied on 1998-05-27 between 11:00 and 13:00 to check warnings from one of the disks containing piofs (/pfs).
1998-05-21 at 08:34 [xxx (strindberg)]
Strindberg: Software upgrade hiccup fixed by restarting the job manager.
1998-05-21 at 07:52 [xxx (strindberg)]
Strindberg: Easy thinks 114 nodes are not quite fit to run jobs. Investigating.
1998-05-20 at 17:00 [xxx (strindberg)]
Strindberg: Parts of the SP system software will be upgraded to enable support for new SMP nodes.
1998-05-08 at 13:40
All systems: Nothing was lost during the file server reboot 1998-05-06.
1998-05-06 at 15:38
All systems: Major file server reboot. Jobs stopped. Investigation in progress.
1998-05-03 at 20:45 [xxx (strindberg)]
Strindberg: reboot of the log in node due to excessive use of system resources.
1998-04-29 at 11:45
Selma: We're back. Thank you for being patient.
1998-04-28 at 21:30
Network: the main PDC router was restarted at 20:32.
1998-04-28 at 16:48
Selma: System ran out of processes. Salvage in progress...
1998-04-13 at 17:20
AFS: about to start salvage on one fileserver. Node allocation paused during salvage. Files residing on the fileserver will be blocked during salvage.
1998-04-06 at 01:41
Selma will be rebooted 1998-04-06 at 11:00. Some file systems will be changed. Up again after lunch.
1998-03-30 at 15:10
Working microcode has been reinstalled in all tape drives. DMF and ADSM are back online. This means that user data in the HSM system is now available again.
1998-03-30 at 00:30
Major HSM and ADSM tape drive problems. All drives have been taken off-line. DMF (HSM functionality for PDC users) and ADSM (backup services for various PDC users) have been shut down.

We expect to solve this problem during the day.

1998-03-29 at 23:00
HSM system unavailable until search of faulty tape drive completed.
1998-03-26 at 09:30
HSM system and ADSM backups will not be available today due to tape library maintenance. Migrated HSM files are expected to be back online this afternoon.
1998-03-25 at 15:00
Kallsup: Back online.
1998-03-25 at 13:00
Kallsup: operating system crash. Disk problems suspected. System recovery in progress.
1998-03-23 at 08:20 [xxx (strindberg)]
Strindberg: Scheduler resumed.
1998-03-23 at 08:00 [xxx (strindberg)]
Strindberg: The scheduler is currently stopped since there are some network problems.
1998-03-19 at 18:50 [xxx (strindberg)]
Strindberg: There have been problems with one server. Several jobs did not start properly.
1998-03-19 at 18:27 [xxx (strindberg)]
Strindberg: Scheduler resumed operation
1998-03-19 at 18:20 [xxx (strindberg)]
Strindberg: Scheduler temporarily halted due to server problems
1998-03-06 at 01:45
Selma: System will be down for maintenance and diagnostics 1998-03-06 from 13:00 to 14:00 MET
1998-03-03 at 08:30
Kallsup: operating system hang, dump and reboot in progress
1998-03-02 at 17:42
AFS: fileserver salvaged, node allocation resumed.
1998-03-02 at 16:45
AFS: one fileserver went down, node allocation paused while salvaging.
1998-02-27 at 19:30
Network: Restart of primary pdc router.
1998-02-27 at 10:00
Selma: Disk errors, rebooting. Expected downtime unknown. Probably up again in the evening.
1998-02-16 at 12:26
Kallsup: One IO processor is not responding. Controlled reboot at 1998-02-16 14:30 MET. Back again one hour later.
1998-02-13 at 17:00
Selma: Planned downtime for OS upgrade from 1998-02-23 18:00 to 1998-02-25 18:00
1998-02-05 at 17:00
Network: an outage in the production ring caused broken connections between servers and production machines.
1998-02-04 at 10:00
Kallsup: Disk errors! Recovery in progress.
1998-01-23 at 11:00 [xxx (strindberg)]
Strindberg: The system is now open and ready to process your jobs with 132 fresh batch-nodes and four additional nodes for interactive use.
1998-01-20 at 19:48 [xxx (strindberg)]
Strindberg: The upgrade is currently about 40 hours behind schedule. Strindberg will be back at noon on Friday 1998-01-23.
1998-01-16 at 14:20
Selma: Operating System crash. Cause unknown. Rebooting.
1998-01-15 at 14:30
Selma: operating system panic, system restarted.
1998-01-14 at 16:24 [xxx (strindberg)]
Strindberg: the switch network went down due to loss of power in a switch board. Affected users will be contacted.
1998-01-09 at 09:08
Kallsup: LS-Dyna version 940 was upgraded to patch level 1.
1998-01-02 at 09:30
AFS: A disruptive server restart may have caused problems with running jobs