Tue Nov 22 14:25:04 CET 2016
Date: Saturday, 29 October 1983, 01:52-EDT
From: Christopher C. Stacy <CStacy at MIT-MC>
Subject: ITS wedgeding
To: BUG-ITS at MIT-MC
Cc: TAFT at MIT-MC, KMP at MIT-MC, KLH at SRI-NIC, Moon at SCRC-TENEX,
ELLEN at MIT-MC, JPG at MIT-MC, GSB at MIT-MC
TAFT at MIT-MC 10/28/83 14:31:11 Re: Crash of 10/28 14:27
Subject: Crash of 10/28 14:27
MC was catatonic, but running. Though jobs could be logged in, no
response to anything other than ^G could be obtained. I checked
around to confirm that this was really the case and then stopped it.
MC was also merrily printing lots of my little "Warning: <out of this
or that>" messages every chance it got.
In ITS 1353, crash file CRASH;ITS LOWCOR, SYSjob 88, Core 71, Net 24,
IMP 338, Chaos 255, INet 96, NCP 5, TCP 256, TCPbuf 53, on MIT-MC:
There is definitely something weird going on here.
Of the 50 disk channels, none are free.
There is no more free low core (for making file or network buffers.)
There are about 37 ___nnn CHAOS and 22 ___nnn TCP jobs trying to boot.
They are in LOAD or OPEN, trying to read ATSIGN CHAOS or ATSIGN TCP,
respectively. They all have read 0% of their file.
We can look at a representative wedged server job: user idx 101.
___101 TCP, is blocked inside a LOAD call: NLOADD+6/ SKIPG QSFBS(A) A/ 40
I don't entirely grok this disk code yet, but from the channel state
(%QALBK) and mode (READ/USER DATA), and the comment where he blocks,
it looks like he is waiting for the file channel to receive a buffer
with the first page of the TCPSER file.
There may also be something weird file-wise with this particular job.
Unlike the others, I think it has two (QUSR idx) channels: 17 and 40.
According to PEEK, he has also opened RAT; (no file name), which I
guess accounts for channel 17. But, huh? What? Why?
Now, the TCP situation shows that there can be up to 180. packet
buffers, of which zero are free out of a total zero allocated.
I checked in XBUSER and friends.
There is only one TCP frob around, and he's not doing a whole lot:
TCP index 21, which has no associated job, is in "CLSACK" for SRI-CSL.
Has received FIN for input, State is "Last ACK".
User channel state: (input) Foreign host RESET, retransmit timeout (output).
The close reason is: Closed by user, Closed by foreign host.
This is (I think) reasonable, but does not account for all those other TCP servers!
1. WHAT ARE ALL THOSE NETWORK SERVER JOBS DOING WITHOUT ANY
NETWORK CONNECTIONS? Is it that the connections went away
before the jobs had a chance to get started, or what?
What are all these jobs for, anyway?
They have to load their programs before they can open their network
connection. The incoming RFC (SYN in the TCP case) that started these
jobs probably timed out and was rejected before you dumped the system,
hence no trace of it was left. If they're trying to load the ATSIGN xxx
program (rather than the actual server) that means they haven't even
yet queried the system's RFC (SYN) queue to find out what contact name
(port number) they are supposed to serve.
The system is supposed to filter out duplicate RFCs and SYNs, so if that's
working each of those jobs was created by a separate user attempt to connect.
Of course if they block forever in the LOAD system call, once created they
will never go away.
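
To illustrate the duplicate filtering described above, here is a minimal sketch in Python. It is not the ITS code; the class name, the method names, and the choice of (foreign host, foreign port, contact) as the key are all assumptions made for illustration. It only shows why, with filtering working, each job corresponds to a distinct connection attempt.

```python
class PendingRequestQueue:
    """Toy model of a queue of pending connection requests (RFC or SYN).

    Hypothetical structure: the real system's queue layout is not shown
    in this message.
    """

    def __init__(self):
        # Keys of requests that have already caused a server job to start.
        self._pending = set()

    def offer(self, foreign_host, foreign_port, contact):
        """Return True if this request should start a new server job."""
        key = (foreign_host, foreign_port, contact)
        if key in self._pending:
            return False  # retransmission of a request already queued
        self._pending.add(key)
        return True

    def expire(self, foreign_host, foreign_port, contact):
        """Forget a request that timed out or was rejected."""
        self._pending.discard((foreign_host, foreign_port, contact))


def count_jobs_started(requests):
    """Count how many server jobs a stream of requests would create."""
    queue = PendingRequestQueue()
    return sum(1 for request in requests if queue.offer(*request))
```

In this model a retransmitted request from the same host and port for the same contact starts no second job, but once a request has expired a later retry counts as a new attempt, and a job already blocked in its load is unaffected by the expiry.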
2. WHERE DID ALL THE LOW CORE GO?
This seems to be the reason those server jobs can't get going.
The lack of low core is indeed the reason the servers can't get going; they
can't get a buffer to use to read the disk files they're trying to load.
Even in PDUMP format the first page of the file has to be read into low core
to find out what pages are to be loaded.
One reason for lack of low core is bloatage of directories. I've been meaning
to make a straightforward but not totally simple fix to allow directories to
be stored anywhere in core, but haven't got around to it. I guess you should
bug me about this periodically.
However, in this case that isn't the problem: there are only 9 pages of
directories (DSKUDR in Peek M mode) plus 7 pages of disk buffers. I looked
around at the MEMBLT and MEMPNT tables. A lot of the low core pages are
just user pages of job 100, a worthless TEX job. See the large comment
a few lines below SWPOPG; evidently the system keeps trying to swap
out more and more pages to get some low memory, but keeps swapping out
the wrong jobs.
The right fix is to make it specifically swap out pages in "low" memory
(as counted by LMEMFR) in this case, or else to put the core shuffler
back in to move those pages to "high" memory. Maybe I'll look into this
in a few days when I get time.
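
The first of the two fixes can be sketched as follows. This is a toy model in Python, not the system's swapper: the Page record, the low_limit boundary, and both function names are hypothetical, and real victim selection would weigh much more than frame number. It only contrasts choosing victims without regard to where they live against choosing only pages that actually occupy low memory.

```python
from dataclasses import dataclass


@dataclass
class Page:
    frame: int       # physical page frame number
    owner: str       # job that owns the page (illustrative label)
    swappable: bool  # False for system pages such as buffers


def pick_victims_naive(pages, needed):
    """Pick the first swappable pages, wherever they live in memory."""
    return [p for p in pages if p.swappable][:needed]


def pick_victims_low(pages, low_limit, needed):
    """Pick only swappable pages whose frames lie below low_limit.

    Swapping these out is the only kind of swapout that can raise the
    count of free low-memory pages.
    """
    candidates = [p for p in pages if p.swappable and p.frame < low_limit]
    return candidates[:needed]


def low_frames_freed(victims, low_limit):
    """Count how many of the chosen victims free a low-memory frame."""
    return sum(1 for p in victims if p.frame < low_limit)
```

With a page list in which the swappable high-memory pages come first, the naive picker can swap out any number of pages without freeing a single low frame, which matches the behavior described above.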
3. I heard that typein was echoing very slowly for users during this
wedged period. I wonder whether this was true, and if so, was it true for local
consoles as well as STYs? I don't really understand why this should
be, in either case. When the system is running out of core, I have
usually been able to do SYS$J (provided there was a channel for me
to .OPEN on USR:) and look around.
Lack of low core could not affect echoing on local terminals, and would only
affect echoing on network terminals if there wasn't enough core to make
network packet buffers. Of course maybe the users were confused and
what they really meant was that their programs were busted.