From Moon at STONY-BROOK.SCRC.Symbolics.COM Tue Dec 23 02:36:00 1986
From: Moon at STONY-BROOK.SCRC.Symbolics.COM (David A. Moon)
Date: Dec 22 86 20:36 EST
Subject: Kludging around the KS's flakey disk
In-Reply-To: <861222020255.2.SRA@WHORFIN.LCS.MIT.EDU>
Message-ID: <861222203625.1.MOON@EUPHRATES.SCRC.Symbolics.COM>

    Date: Mon, 22 Dec 86 02:02 EST
    From: Rob Austein

    I just spent five minutes looking at the DISK code.  Based on that
    wealth of experience, it appears to me that if I were to bring up
    the KS with UNSAFE+1 JFCL'd out, it would stop doing a BUGPAUSE
    every hour or so.  Somebody tell me why I shouldn't do this, before
    I wear out the $P keys on the KS's console....

The idea was that if the disk goes unsafe, is reset, and goes unsafe
again within one second, it is probably about to explode and the fire
department should be called.  It looks like the code jumps to UNSAFE
for a lot of different reasons, not all of which are the drive going
unsafe.  The drive also goes unsafe for a lot of different reasons,
some of more consequence than others.  I agree that if you JFCL out
the BUGPAUSE it won't do it any more, so if you think this won't do
irreparable harm to the disk, go ahead.

From sra at XX.LCS.MIT.EDU Mon Dec 22 08:02:00 1986
From: sra at XX.LCS.MIT.EDU (Rob Austein)
Date: Dec 22 86 02:02 EST
Subject: Kludging around the KS's flakey disk
Message-ID: <861222020255.2.SRA@WHORFIN.LCS.MIT.EDU>

I just spent five minutes looking at the DISK code.  Based on that
wealth of experience, it appears to me that if I were to bring up the
KS with UNSAFE+1 JFCL'd out, it would stop doing a BUGPAUSE every hour
or so.  Somebody tell me why I shouldn't do this, before I wear out
the $P keys on the KS's console....
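The check described in Moon's reply above (unsafe, reset, unsafe again
within one second) can be modelled roughly as follows.  This is a
sketch in Python, not the ITS DISK code: the class name, method name,
and return values are hypothetical, and only the one-second window is
taken from the message.

```python
class UnsafeWatch:
    """Hypothetical model of the 'unsafe twice within a second' check."""

    WINDOW_SECONDS = 1.0  # from the message: "within one second"

    def __init__(self):
        # Time of the previous unsafe event, or None if there was none.
        self.last_unsafe = None

    def on_unsafe(self, now):
        """Record an unsafe event at time `now` (in seconds).

        Returns "pause" if the previous unsafe event was within the
        window (standing in for the BUGPAUSE), otherwise "reset"
        (standing in for resetting the drive and carrying on).
        """
        previous = self.last_unsafe
        self.last_unsafe = now
        if previous is not None and now - previous <= self.WINDOW_SECONDS:
            return "pause"
        return "reset"
```

In this model, removing the "pause" branch corresponds to JFCL'ing out
the BUGPAUSE: every unsafe event is simply followed by another reset,
whether or not the drive is in real trouble.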
From ALAN at AI.AI.MIT.EDU Mon Dec 22 05:17:45 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 21 86 23:17:45 EST
Subject: RP06s eat dead bears
In-Reply-To: Msg of Sun 21 Dec 86 13:52:25 EST from David Vinayak Wallace
Message-ID: <133398.861221.ALAN@AI.AI.MIT.EDU>

    Date: Sun, 21 Dec 86 13:52:25 EST
    From: David Vinayak Wallace

    AI hung in a fashion I'd never seen on the KS's: disk accesses
    would hang forever.  Pages in core were easily accessible; ITS ran
    fine unless you tried to touch the disk.  I could create a job and
    run some instructions in low memory, but when I tried to do a
    .CALL OPEN ITS hung.

You must have been lucky.  Our RP06's have been causing ITS to hang in
this way ever since day 1.

    I went upstairs and the system console said:
    DSK: UNIT #1 CAME BACK ONLINE
    DSK: UNIT #0 CAME BACK ONLINE
    DSK: UNIT #1 CAME BACK ONLINE
    and some status registers.  For all three, ER1= 40000 which is
    Drive Unsafe.

Unsafe almost always comes on.  The interesting bits were the ones in
ER3, which according to the crash dump were (according to the crufty
documentation) "AC power low", "DC power low" and "Spare" (!).

    I dumped it to CRASH DSKOFL, not that I think it will help any.

Why?  I managed to get the ER3 bits out of it.

    Notice that the crash file was written out but the dates were not
    set on the file?

DSKDMP never sets write dates, because it doesn't know how to tell the
time.  The DMPCPY program (which TARAKA runs when the system boots)
sets the date on any file it suspects is a crash dump.  In this case,
DMPCPY wasn't able to set the date either, because you had cold booted
the machine, so ITS didn't know what time it was when DMPCPY ran.

    ...
    By the way, should these be going to BUG-ITS or KS-ITS -- I can
    never tell any more.

You sent this to the right place.  KS-ITS is almost -never- the right
place to send a Bug Report.
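For readers checking the register value quoted in the exchange above:
assuming the console prints ER1 in octal (an assumption, though the
usual convention on these machines), 40000 is a single bit.  The
Python sketch below uses hypothetical names; the only fact taken from
the messages is that ER1 = 40000 means Drive Unsafe.  The ER3 bit
positions are not given in the messages, so they are not modelled.

```python
# Mask for the Drive Unsafe condition, as quoted in the message.
# Assumes the console value 40000 is octal.
ER1_DRIVE_UNSAFE = 0o40000


def is_drive_unsafe(er1):
    """Return True if the Drive Unsafe bit is set in an ER1 value."""
    return bool(er1 & ER1_DRIVE_UNSAFE)


# Position of the bit, counting the least significant bit as 0.
DRIVE_UNSAFE_BIT = ER1_DRIVE_UNSAFE.bit_length() - 1
```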
From ALAN at AI.AI.MIT.EDU Mon Dec 22 04:34:01 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 21 86 22:34:01 EST
Subject: Gee, that's not his host's name
In-Reply-To: Msg of Sat 20 Dec 86 13:22 EST from Ramin Zabih
Message-ID: <133383.861221.ALAN@AI.AI.MIT.EDU>

    Date: Sat, 20 Dec 86 13:22 EST
    From: Ramin Zabih

    Typing :FINGER on AI just produced this output:
    ...
    RDZ    Ramin Zabih         F           T23 <>: 709 x8827 RDZ, Zvona (Chaos)

    It seems that someone is confused about the name of the 3600 I'm
    using...

RDZ's host has a short name of "NULL".  He's been expecting the name
"NULL" to break some program ever since he named it that.  I presume
that's why he mailed this bug report to Bug-LISP.

Nothing's broken actually.  I was just hacking him...

From GUMBY at AI.AI.MIT.EDU Sun Dec 21 19:52:25 1986
From: GUMBY at AI.AI.MIT.EDU (David Vinayak Wallace)
Date: Dec 21 86 13:52:25 EST
Subject: No subject
Message-ID: <133197.861221.GUMBY@AI.AI.MIT.EDU>

AI hung in a fashion I'd never seen on the KS's: disk accesses would
hang forever.  Pages in core were easily accessible; ITS ran fine
unless you tried to touch the disk.  I could create a job and run some
instructions in low memory, but when I tried to do a .CALL OPEN ITS
hung.

I went upstairs and the system console said:
DSK: UNIT #1 CAME BACK ONLINE
DSK: UNIT #0 CAME BACK ONLINE
DSK: UNIT #1 CAME BACK ONLINE
and some status registers.  For all three, ER1= 40000 which is Drive
Unsafe.

I dumped it to CRASH DSKOFL, not that I think it will help any.
Notice that the crash file was written out but the dates were not set
on the file?

I cold-booted just in case -- ITS seems to be running fine now.

By the way, should these be going to BUG-ITS or KS-ITS -- I can never
tell any more.
david

From RDZ at AI.AI.MIT.EDU Sat Dec 20 19:22:00 1986
From: RDZ at AI.AI.MIT.EDU (Ramin Zabih)
Date: Dec 20 86 13:22 EST
Subject: Gee, that's not my host's name
Message-ID: <861220132243.9.RDZ@NULLSTELLENSATZ.AI.MIT.EDU>

Typing :FINGER on AI just produced this output:

-User- --Full name--       Jobnam Idle TTY -Console location-
___005 < [not logged in]   HACTRN   23.T05 906 x1729 CENT, OAF
KWH    Ken Haase           HACTRN *:**.T15 Net site PREP (Chaos)
DPH    Daniel Huttenlocher HACTRN   46.T16 723 x8843 Alan, DPH
RDZ    Ramin Zabih         F           T23 <>: 709 x8827 RDZ, Zvona (Chaos)

It seems that someone is confused about the name of the 3600 I'm
using...

From Alan at AI Fri Dec 19 08:54:39 1986
From: Alan at AI (Alan at AI)
Date: Dec 19 86 02:54:39 EST
Subject: OK, I just saw it happen again.
Message-ID: <132522.861219.ALAN@AI.AI.MIT.EDU>

For most of yesterday (Thursday the 18th) COMSAT on MC was catatonic.
Our guess is that it was stuck in a JOB device wait (waiting for the
DQ device).  As soon as Alan looked at the situation COMSAT started
running again, so probably something he did caused COMSAT to get
PCLSR'd out of the system call for the first time all day, and the
second time the timing screw did not occur.

The right thing is for someone to fix the last bug in the JOB/BOJ
code.  A quick fix the COMSAT maintainers might consider is to take an
occasional %PIRLT interrupt to keep its interactions with DQ
lubricated.  A better fix would be for Alan to finish up the improved
Domain Demon interface, so that COMSAT can use it instead, and not be
subject to this particular class of ITS bug.

From ALAN at AI.AI.MIT.EDU Thu Dec 18 08:57:19 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Dec 18 86 02:57:19 EST
Subject: No subject
Message-ID: <132118.861218.ALAN@AI.AI.MIT.EDU>

TTYSET on a TTY opened as a device (rather than as a console) clobbers
the wrong TTYST* words!
From ALAN at AI.AI.MIT.EDU Thu Dec 4 20:00:04 1986
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: Thu, 4 Dec 86 14:00:04 EST
Subject: more fukt
In-Reply-To: Msg of Thu 4 Dec 86 03:20:35 EST from Pandora B. Berman
Message-ID: <126472.861204.ALAN@AI.AI.MIT.EDU>

    Date: Thu, 4 Dec 86 03:20:35 EST
    From: Pandora B. Berman

    ... maybe lester didn't fix the disk hard enough.

I don't think it is related.  This is a problem we have had ever since
we went to two drives.  The symptom is that for no apparent reason a
drive interrupts you and reports that it has just recently come back
online.  Both drives do it.  We have no idea why they do this.  We
also have no idea why the code that we put in to recover from this
doesn't work.  (Perhaps we need to reset the drive harder when this
happens.)  Luckily this doesn't seem to happen all that often, but it
has been the most common reason for AI crashes since we got the new
drive.

It also seems that if one drive is in the middle of a transfer, and
you tell the other drive to do something, the first drive will
interrupt you and complain that you shouldn't bother it while it is
busy.  We simply ignore these complaints, which seems to work just
fine.  (Like it's happened 470 times since AI came up 10 hours ago...)

From CENT at AI.AI.MIT.EDU Thu Dec 4 09:20:35 1986
From: CENT at AI.AI.MIT.EDU (Pandora B. Berman)
Date: Thu, 4 Dec 86 03:20:35 EST
Subject: more fukt
Message-ID: <126255.861204.CENT@AI.AI.MIT.EDU>

it happened again.  dumped to CRASH;FUCKED AGAIN.  maybe lester didn't
fix the disk hard enough.

From CENT at AI.AI.MIT.EDU Thu Dec 4 07:19:00 1986
From: CENT at AI.AI.MIT.EDU (Pandora B. Berman)
Date: Thu, 4 Dec 86 01:19:00 EST
Subject: &^*^&$%!!
Message-ID: <126198.861204.CENT@AI.AI.MIT.EDU>

things were hanging all over, and alan diagnosed that a disk had
briefly gone offline, and AI had not quite recovered correctly.  only
thing i could do, really, was to lift switch 0 and reload.
crash dump (i think) to CRASH;FUCKME HARDER, a fine traditional name.