From JNC at XX.LCS.MIT.EDU Wed May 25 00:00:00 1988
From: JNC at XX.LCS.MIT.EDU (J. Noel Chiappa)
Date: Wed 25 May 88, 00:00
Subject: TCB's all in use.
In-Reply-To: <12400803897.66.SRA@XX.LCS.MIT.EDU>
Message-ID: <12401227468.31.JNC@XX.LCS.MIT.EDU>

Rob, your analysis is 100% on the money, and your selected fix also
appears to be the Right Thing.  I suggest we do that.

	Noel

-------

From SRA at XX.LCS.MIT.EDU Tue May 24 00:00:00 1988
From: SRA at XX.LCS.MIT.EDU (Rob Austein)
Date: Tue 24 May 88, 00:00
Subject: TCB's all in use.
In-Reply-To: <384099.880524.ALAN@AI.AI.MIT.EDU>
Message-ID: <12400803897.66.SRA@XX.LCS.MIT.EDU>

The code itself is per spec, although it may not be sufficiently
paranoid.  See RFC 793, pages 38-39.  It includes a diagram which is
somewhat easier to follow than the text.  Figure 13, "Normal Close
Sequence":

      TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

  3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

  4.                                                       (Close)
      TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

  5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

  6.  (2 MSL)
      CLOSED

ITS is party "A" in this case.  COMSAT tells ITS "close this
connection", and ITS sends off a FIN.  Party "B" ACKs the FIN but
doesn't send its own FIN until it feels like it (closing is
half-duplex).  When party "B" decides to close too, it sends a FIN to
ITS (note the odd sequence numbers here).  ITS is supposed to ACK this
FIN so that party "B" knows the connection has indeed been closed in
everybody's opinion (ie, a FIN is considered data to the extent that
it must be ACKed).  So ITS sends the ACK and goes into the TIME-WAIT
state.  If ITS hears nothing for a certain period of time ("2 MSL"),
ITS assumes everything's cool and punts the TCB.  If, however, ITS
gets another FIN from party "B", ITS must assume that the ACK it just
sent got lost, so ITS sends another ACK and resets the timer.

Whew.  No wonder so many implementers get confused by this!  It is
easy to see how a misbehaving TCP on party "B" could keep us wedged
here forever.
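[The TIME-WAIT rule described above can be sketched as a toy state
machine.  This is a Python illustration only; all the names, and the
2-minute MSL, are assumptions for the sketch, not anything from the
ITS sources.]

```python
# Toy model of RFC 793 TIME-WAIT handling as described above.
# Everything here is illustrative; none of these names come from ITS.

MSL = 120  # Maximum Segment Lifetime in seconds (RFC 793 picks 2 minutes)

class TimeWaitTCB:
    """A TCB sitting in TIME-WAIT after ACKing the remote FIN."""

    def __init__(self):
        self.state = "TIME-WAIT"
        self.timer = 2 * MSL    # the "2 MSL" quiet period
        self.acks_sent = 0

    def on_segment(self, fin):
        """A retransmitted FIN means our final ACK was lost:
        re-ACK it and restart the 2-MSL timer."""
        if self.state == "TIME-WAIT" and fin:
            self.acks_sent += 1   # send another ACK (simulated)
            self.timer = 2 * MSL  # ...and reset the timeout

    def on_timeout(self):
        """Heard nothing for 2 MSL: everything's cool, punt the TCB."""
        self.state = "CLOSED"
```

[A peer that never stops retransmitting keeps resetting the timer, so
on_timeout never fires and the TCB stays wedged forever; that is the
lossage being reported in this thread.]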
The RFC says that the only thing you can get when in the TIME-WAIT
state is a retransmission of the other party's FIN.  Perhaps the ITS
code takes that for a law of nature rather than a description of two
working TCPs having a conversation.  It would be interesting to know
if the packet that caused us to get to TSIATW is really the {FIN,ACK}
packet we're assuming.  If not, I'd think that's immediate grounds for
dropping the connection on the floor, since it demonstrates that at
least one party is seriously confused about the current state.

If it really is a FIN that we keep getting over and over and over, it
might be reasonable to keep track of how many times we've gone through
this routine and just punt when it gets ridiculous.  I think this is
even legitimate: either the foreign machine is broken or the
intervening path is consistently losing our ACKs, and in either case
it won't do any good to send more ACKs, so we might as well not
bother.

Of course this is the first time I've ever tried to follow all those
silly state diagrams in TCP, so I might be completely confused.

--Rob

-------

From ALAN at AI.AI.MIT.EDU Tue May 24 07:47:53 1988
From: ALAN at AI.AI.MIT.EDU (Alan Bawden)
Date: May 24 88 01:47:53 EDT
Subject: TCB's all in use.
Message-ID: <384099.880524.ALAN@AI.AI.MIT.EDU>

Here is my diagnosis of the lossage.  Consider the following code:

    ;  TSIATW - Received ACK while in TIME-WAIT state.  This should be
    ;       a re-transmit of the remote FIN.  ACK it, and restart
    ;       2-MSL timeout.

    TSIATW: METER("TCP: ACK in .XSTMW")
            MOVSI T,(TC%ACK)
            TRCPKT R,"TSIATW ACK send in TIME-WAIT"
            CALL TSOSNR             ; Send simple ACK in response.
            JRST TSITM2             ; and restart 2-MSL timeout.

Well, if the guy on the other end keeps sending you ACKs, the timeout
keeps getting reset and the TCB never gets freed.  I have verified
that this is in fact the path that causes the problem by patching that
JRST TSITM2 to be a POPJ P, and watching the stuck TCB's all vanish.
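[Rob's two suggestions above -- drop the connection if the segment
isn't really a FIN retransmission, and punt after a bounded number of
re-ACKs -- can be sketched as follows.  This is a Python illustration
under assumed names; the cap of 10 is invented, and nothing here is
from the ITS sources.]

```python
# Sketch of the proposed fix: in TIME-WAIT, ignore bare ACKs, re-ACK
# only genuine FIN retransmissions, and give up after a bounded number
# of retries ("punt when it gets ridiculous").  All names and the
# retry cap are invented for illustration.

MSL = 120                  # assumed 2-minute MSL, per RFC 793
MAX_FIN_RETRANSMITS = 10   # invented threshold for "ridiculous"

class PatchedTimeWaitTCB:
    def __init__(self):
        self.state = "TIME-WAIT"
        self.timer = 2 * MSL
        self.fin_count = 0

    def on_segment(self, fin, ack):
        if self.state != "TIME-WAIT":
            return
        if not fin:
            # A bare ACK is not a FIN retransmission.  Re-ACKing it
            # just resets the timer forever, so ignore it -- the
            # effect of patching the JRST TSITM2 to a POPJ P,.
            return
        self.fin_count += 1
        if self.fin_count > MAX_FIN_RETRANSMITS:
            # The peer is broken or the path keeps losing our ACKs;
            # more ACKs won't help, so drop the connection.
            self.state = "CLOSED"
        else:
            self.timer = 2 * MSL  # re-ACK and restart the 2-MSL timer
```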
I actually don't understand the logic here; it would seem to me that
you should only be sending an ACK in response to an actual FIN, not
just any ACK.  I didn't look to see if the other guy was sending both
ACK and FIN or just ACK.  Do you suppose it is likely that the other
machines all have this bug as well, and the two are just spinning
their wheels bouncing ACKs back and forth?

There does seem to be other code that handles ACKing of FINs elsewhere
in the TCP code, but I don't understand enough to know whether it is
active when you are in the TIME-WAIT state or not.  Conceivably the
POPJ P, I patched in might be the solution to the problem?

Suggestions?

-------

From Alan at AI.AI.MIT.EDU Mon May 23 20:14:00 1988
From: Alan at AI.AI.MIT.EDU (Alan Bawden)
Date: May 23 88 14:14 EDT
Subject: What's with all these TIMWTs?
In-Reply-To: <424674.880523.GUMBY0@MC.LCS.MIT.EDU>
Message-ID: <19880523181422.3.ALAN@PIGPEN.AI.MIT.EDU>

    Date: Mon, 23 May 88 04:23:59 EDT
    From: David Vinayak Wallace

    Right now there are 26 connections to unix.sri.com in state TIMWT,
    plus one in FINWT1 to ub.cc.umich.edu.  ...  There really only
    looks like there are two lost packets (one to host 0!).  ...

It's not a question of lost packets, it's a question of TCBs, the
per-connection data structure ITS has to maintain throughout the
connection's lifetime.  Unfortunately a connection can live on after a
user process is done with it, while the operating systems do some
final handshaking to close things down cleanly.  It appears that some
new version of Unix is making the rounds that does something that
causes this handshaking to take virtually forever.

AI and MC each have 30 TCB's.  They used to have 20, but I increased
that when this problem first started happening.  I just had to reload
AI for the same reason.  There are crash dumps for the interested in
AI:CRASH;CRASH TCB and MC:CRASH;TCP BITIT.
From ZVONA at AI.AI.MIT.EDU Mon May 23 19:09:29 1988
From: ZVONA at AI.AI.MIT.EDU (David Chapman)
Date: May 23 88 13:09:29 EDT
Subject: No subject
Message-ID: <383531.880523.ZVONA@AI.AI.MIT.EDU>

FTP just failed, complaining that "all sockets in use".  Peek showed
only two FILE jobs and no FTPs besides mine.  What's going on?  How do
I fix it?

-------

From ALAN at MC.LCS.MIT.EDU Wed May 11 22:23:09 1988
From: ALAN at MC.LCS.MIT.EDU (Alan Bawden)
Date: May 11 88 16:23:09 EDT
Subject: And you thought PDUMP finally worked.
Message-ID: <418741.880511.ALAN@MC.LCS.MIT.EDU>

Note the -times- in the following consecutive entries from the mailer
STATS file on MC:

    084735 Note: GC'ing MSGS, 5646555-1620312=4026243
    145143 ===> BUG: FATAL ERROR <===
           Date: 05/11/88 14:51:42
           Autopsy from 22770  Preserved from 22061
           Last UUO = 017100,,062447 at 52657

MC was low on disk space at the time, which is probably what caused
the original error, but what do you suppose was happening during the
6-hour pause?  I'll guess that there is a bug in the PDUMP system call
(probably also having to do with low disk space) that kept COMSAT hung
until Zvona logged in and noticed the problem.  Probably something he
did to try to diagnose the problem PCLSR'd the call, and then it
finished.