Seeking High IMP Reliability

Alex McKenzie
Personal
Professional
INWG Documents
Family
- Alexander A. McKenzie II >
  - Mount Washington >
    
    Genesis of the Observatory
    
    World Record Wind
  - LORAN >
    
    Crusing the Labrador
    
    Acquisition of Canadian sites for Long-Range-Navigation Stations
    
    Sites #1 and #2: Loran Memo #108
    
    LRN Site No. 3
    
    Report of Construction at L.R.N. Site #3, 8/10-11/5 1942
    
    LRN Site No. 4 (Bonavista Point, Newfoundland)
    
    Supplies for Site 4
    
    Drawings Left at Site #4 by A.A. McKenzie
    
    Site 4 Letter of March 24, 1943
    
    LRN Site No. 5
    
    LRN Site No. 8
    
    LRN Site No. 9
    
    Test Plan - Eastern US
    
    LORAN - Part 1
    
    LORAN - Part 2
    
    LORAN - Part 3
    
    End of LORAN
- Genealogy >
  - Athelstan & Burke - 2010
Photos
Edit Website

Seeking High IMP Reliability in Maintenance of the 1970s ARPAnet

© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Published as: D.C. Walden, A.A. McKenzie, W.B. Barker, “Seeking High Reliability in Maintenance of the 1970s ARPAnet,” IEEE Annals of the History of Computing, vol 44, no 2, April-June 2022, pp 8-19
Digital Object Identifier: 10.1109/MAHC.2022.3171970

This republication here of the final submitted draft is in accordance with IEEE policy on Electronic Reprints by authors.

Seeking High IMP Reliability in
Maintenance of the 1970's ARPAnet

DC Walden
AA McKenzie
WB Barker

During the first years of ARPAnet operations, computers were not highly reliable, but the network was built from standard computers and was expected to function as a utility with high reliability. We managed to achieve the desired reliability, as perceived by ARPAnet users, by making innovations in hardware, maintenance procedures, software, and network operations.

The preparation of this paper was spearheaded by David Walden. David recruited the other 2 authors, prepared the outline of the paper he envisioned, and drafted some sections. Sadly David’s deteriorating health prevented him from contributing his usual amount of energy and knowledge, but this paper would not have been written without him.

This paper draws heavily on the personal experiences of the authors, many of which have not been previously reported in the literature. Our focus is on the 1969-1975 time period when ARPAnet was the sole responsibility of the Advanced Research Projects Agency (ARPA). A 2018 paper[1] discusses ARPAnet maintenance after 1975.

Background

In 1969 ARPA awarded a contract to Bolt Beranek and Newman Inc. (BBN) to build, install, and maintain a new type of data communication network based on the concept of packet switching (ARPAnet).[2] The design called for a small computer called an Interface Message Processor (IMP) to be collocated with each of the computers (Hosts) that ARPA desired to have access to the network. Each IMP was to be connected to 2 or more other IMPs by 50kbps telephone circuits to form a loose mesh network. The IMPs provided a standard interface to the hosts and were responsible for breaking Host messages into packets, finding a route through the network to the message destination, managing errors on the circuits, reassembling the packets into the original message at the destination IMP, and delivering the message to the destination Host.[3]

The concept of packet switching had never before been reduced to practice in a substantial network, so ARPAnet was a research project in its own right. Nevertheless, the Host organizations were expected by ARPA to use ARPAnet as a reliable service, so the network users needed to depend on the reliability and availability of the net in the same way they depended on electric power and telephone services. The network itself was designed to be reliable, with the IMPS automatically routing each packet around trouble spots. So long as the design constraint of every IMP connected to at least two others was followed, no single IMP or circuit failure would block the movement of a packet from its source to its destination. However, from a user’s viewpoint, the failure of the IMP at either his location or the destination location was a network outage. Thus the IMPs themselves were a reliability bottleneck.

Over the years there have been many approaches to designing more reliable computers. First, the individual hardware components have become more reliable, from vacuum tubes to solid state components to integrated circuits. At a given level of component technology, efforts to make computers more reliable have focused on replicating hardware and choosing a “majority opinion” when identical components produced different outcomes.[4] Unfortunately there were no minicomputers based on redundant hardware available (within our budget, at least), and the 9-month period between contract signing and first IMP delivery did not afford us the luxury of designing our own. We started with a commercially-available minicomputer with a normal reliability profile and worked to make reliability better over the first several years of network operation. [Eventually we did design a multiprocessor IMP (the Pluribus) to achieve increased performance and reliability[5], but this paper focuses on the steps we took to improve the reliability of the mono-processor IMP.]

ARPAnet was the seed from which the Internet grew, and the role of packet switching in modern data communication is well understood. A good deal has been written about BBN's development of the software that implemented packet switching in the IMPs and made the network reliable. Less well documented are the BBN developments in maintenance and support which allowed the ARPAnet IMPs to meet the reliability goals the user community needed. The IMP network was unique (or an early example) in several aspects of its support and maintenance, quite apart from the idea of packet switching. This paper addresses the developments needed to deal with those attributes, such as:

The IMPs were installed in environments which were indifferent to their support. No on-site person had any contractual responsibility for the IMP.

IMPs were installed at universities and research institutions holding ARPA contracts. In this environment they were objects of interest to graduate students and junior researchers who often wanted to open them up to see what made them tick.

Trouble on the circuits between IMPs was reported by BBN to the telephone company for repair, but BBN was at neither end of most circuits and the telephone company was often reluctant to believe our reports.
Computer users in the early 1970's expected computer up-time no better than 95%.
ARPAnet users expected up-time comparable to the electric power supply, orders of magnitude more reliable than typical computers.
BBN was contractually responsible only for maintenance of IMPs, but to users the “network” was everything along the path from their fingers to the service they wanted to access.

Some of these aspects were understood by the BBN design team from the beginning, and the IMP design attempted to address them. Others arose during the course of our support and maintenance experience.

Initial Design

The design of the original IMP hardware and software was publicly described early on in The interface message processor for the ARPA computer network[6] presented at the Spring Joint Computer Conference in May 1970. It was based on the Honeywell 516 minicomputer, which could accommodate up to 16K of 16-bit words of core memory. This description noted several features intended to facilitate basic operation and maintenance. Frank Heart, the leader of the project team, had spent many years working with real-time defense systems and was fanatical about ensuring that the system design reduced every possible risk of failure that could be foreseen.

Ruggedized Hardware: An IMP was to be installed within 30 feet cable distance of the Host it supported. This would be in a computer room environment, but it was an organizational orphan, not the responsibility of any individual in the Host operations team. There was a fear that it would be shoved in a corner where it might be bumped by the janitor's mop bucket, or blocked from conditioned air by boxes of line printer paper. There was also the fear that it would prove an attractive object to people who wanted to see how it worked. For these reasons, BBN decided that the IMP would be supplied in a cabinet designed for hostile environments with mechanical and thermal shocks. The steel cabinet could withstand hammer blows (as demonstrated by Honeywell at trade shows with a large wooden mallet).

This concern may seem outlandish, but in the mid-70's we had switched to non-ruggedized IMPs. ARPA had invested in the development of a number of expert systems to help with planning logistics and deployment, and arranged for a demonstration of these systems to the Commander in Chief of the Pacific Fleet via the ARPAnet and an IMP at the University of Hawaii. For several weeks before the scheduled demonstration, the Hawaii IMP crashed every weekday between 9 and 10 am Hawaii time. The ARPA contact at U. Hawaii became furious at this unreliability and warned ARPA that the demonstration was likely to be a big embarrassment. ARPA let BBN know that we were in serious trouble if we didn't fix the problem. We had been unable to discover any likely hardware or software cause for the problem, but based on the consistent timing we suspected an environmental problem, for example a power drop when the building air conditioner started each morning, or a blast of warm humid air when some loading dock door was opened to receive a morning shipment. We sent an operator to Hawaii with an air mattress and a sleeping bag and told him to stay with the IMP day and night looking for some environmental change. The next morning a graduate student walked in, opened the IMP cabinet, attached some wires, and the IMP crashed. It turned out he was working on a new Host interface and needed a well-regulated DC power source, so he decided to tap into the IMP power supply. This caused an internal transient that caused the IMP CPU to crash. When the grad student was ordered to leave the IMP alone it stopped crashing.

Protected Memory: Each IMP had 512 words of protected memory which contained reload and restart procedures. If the IMP hardware detected an external power failure a power-fail interrupt transferred control to a Clean Stop routine in this area. When power was restored another hardware interrupt would transfer control to a restart procedure.

The IMP was also equipped with a Watchdog Timer which counted down to zero and, when reaching zero, generated a high-priority hardware interrupt which transferred control to a reload routine. The IMP was equipped with a high-speed paper tape reader which held three copies of the IMP program in sequence. The reload routine assumed that the program was corrupted and read a fresh copy of the program into memory from the tape reader, then transferred control to the restart routine. The main IMP program loop reset the Watchdog Timer on each execution, so an IMP functioning normally would never have a Watchdog Timer interrupt.

Interface Loopback: Both the host and the modem interfaces were designed by BBN and implemented by Honeywell from the standard Honeywell logic boards used in the rest of the machine. These interfaces were full-duplex and symmetric, and could be looped back on themselves under software control to diagnose problems in the interfaces. The Bell 303 modem that connected the IMP to the communication link also had a loop-back capability which took the data through the transmit side of the modem then back through the receive side. This gave us the ability under program control to diagnose problems in the modems as well as the IMP’s modem interface. Each circuit also had an associated voice link that allowed one to talk directly to a person at the modem on the other end of the line. By sending the right tone down this line, one could loop back the modem at the far end of the line, allowing us to diagnose the circuit itself in both directions. [Some of us were able to whistle the correct tone into the headset to activate this loop-back.]

Circuit Error Handling: We expected the majority of errors to arise from noise on the IMP-to-IMP circuits, and believed it was important to detect and correct circuit errors at the lowest possible level. Each circuit interface included hardware that generated (on transmission) and checked (on reception) a 24-bit Cyclic Redundancy Checksum (CRC). This CRC was designed to detect almost all possible errors in the packets (of a little over 1000 bits each) transmitted from IMP to IMP. The modem interface hardware flagged any packets with CRC errors and the software discarded the marked packets. Each packet was given an identifier by the transmitting IMP and if received correctly was positively acknowledged by the receiving IMP. If the transmitting IMP did not receive an acknowledgment within a fixed time it retransmitted the packet, and this process would repeat until the packet was acknowledged and could be discarded by the transmitter.

Dynamic Routing: The path of each packet through the network was determined for that packet at each IMP it passed through. The path computation was dynamic to take account of the instantaneous state of the IMPs and circuits. No manual actions were required if a circuit or IMP went down or was added to or subtracted from the network. Successive packets from one Host to another might take different paths, and a packet might even backtrack if there were a failure along the path it started out on.

Introspection: As the network was a research project in its own right in addition to being a utility, the IMP software was designed to measure and report a large amount of information about its own operation. Some of the information was useful for operation and maintenance; this included Host and circuit up/down status, sense switch settings, circuit error rates, etc. Other information was useful for performance evaluation: this included things such as queue lengths, packet arrival times, and delay and throughput information. The software allowed detailed reporting of any packet with a “trace” bit set, the generation of test traffic from any IMP to any other, and statistics about this traffic. Of course, some of these capabilities had to be used judiciously because of the load they placed on the network.

The software also included a debugging package (DDT), which allowed the examination or modification of any memory location in the IMP. The debugging package could be accessed from any host on the network, including the so-called fake hosts[7] which interfaced with the Teletype connected to each IMP. In this way, a user at an IMP Teletype was able to examine or modify the memory of any IMP on the network, including his own. This capability was key to the ability of the BBN network support staff to loop and unloop interfaces, modify the IMP programs across the net, and gather debugging information. This use of an unlocked terminal was adequate in the first days but in the late 1972 – early 1973 time period, with increasing pressure for network reliability, permission to use an IMP's DDT was restricted to hosts (real or fake) at BBN. For trouble-shooting purposes, it was also possible to use the local IMP Teletype to access that IMP's DDT if a sense switch on the IMP console panel was turned on. This modest control would certainly be considered inadequate in today's Internet environment, but it worked well enough then.

Evolution of the Network

IMPs were installed at the rate of roughly one every month, starting at the beginning of September 1969. The first three were installed in California and the fourth in Utah. The fifth IMP was installed at BBN headquarters in Cambridge MA and the next several in the Boston area. The next expansion was to the Washington DC area. With the concept of the network proven, ARPA asked BBN to design a machine which would allow character-at-a-time terminals to be directly connected to the network without a host. This device would allow ARPA to support network access for researchers at locations that did not have their own host.

The IMP which could also support terminals was called a Terminal IMP, or TIP.[8] The first TIP was installed in mid-1971. It was based on the Honeywell 316 minicomputer, program-compatible with the 516 but slightly slower and able to accommodate 32K words of memory. The 316 did not come in a ruggedized configuration. The IMP program ran in the first 16K words of memory, and the terminal-handling “host” program ran in the second 16K. BBN designed and manufactured the hardware to interface up to 63 terminals running at data rates from 75bps to 9600bps in asynchronous character mode and 19,200 bps in synchronous.

As the network traffic grew, it began to appear that at some sites the Honeywell-based IMPs did not have enough processing power to support the desired volume of packet traffic or the desired number of connected devices. ARPA asked BBN for a higher-capacity IMP and we developed a multiprocessor machine called the Pluribus, based on a Lockheed processor.[9] The Pluribus machine was produced in both IMP and TIP configurations. Of course the Lockheed and Honeywell CPU's used completely different instruction sets, so adjacent network nodes could be running completely different code. This resulted in operational issues discussed below.

Evolution of Network Operation

BBN was not only responsible for building and installing the ARPAnet, but for keeping it operating and fixing problems as they were discovered. As a research project in its own right, the IMP software was not perfect as first designed. Problems were uncovered in the original routing algorithms, the original congestion control mechanisms, and other aspects of the operating software. Many of these issues are discussed in papers from 1971 to 1974.[10]

Just as important, we were responsible for the day-to-day operation of the ARPAnet, correcting circuit failures and IMP hardware failures by promptly calling in the appropriate maintenance organization. For the circuits, that was some local (to the circuit endpoints) telephone company; for the IMPs, in the first years of the network, that was Honeywell Field Service. First, however, we had to notice that something had failed. From the time of the first IMP delivery to UCLA on September 1, 1969 until February 1970 that was difficult; the IMPs were in California and Utah, and BBN was in Massachusetts. Frequent visits were made by BBN engineers to the installed IMPs, and we relied heavily on help from site personnel.

In mid-February 1970 a circuit was installed connecting the Utah IMP to an IMP at BBN, and we could begin to monitor the status of the ARPAnet's components using the reporting routines built into the IMP software, reporting to BBN. At first these reports were sent to the Teletype fake host, and occasionally one of the ARPAnet project team members would wander by to look at the printout for signs of trouble. But this process rapidly became inadequate, and by mid-1971 a host computer (a Honeywell 316) was set up to receive the reports, sort through them for indications of trouble, and in the event of an apparent circuit or IMP failure sound an alarm and flash an indicator light identifying the circuit or IMP. This was the beginning of the steadily increasing functionality of the Network Control Center (NCC). By early 1972 we were using an old time-shared DEC PDP-1 host at BBN to provide a TIP program reload capability, which needed to be specialized for each TIP. Eventually the PDP-1 took on other support tasks. Staffing was increased to a full-time operator during BBN working hours, then to 16-hour-a-day weekday coverage, and by mid-1972 to full 24-hour 7-day coverage, as the network grew in size and the user community demanded more responsiveness. In 1973 we began using TENEX server systems at BBN to provide additional NCC functions. The NCC growth and functionality is described in papers from 1972[11] and 1975.[12]

Early in 1971 the host community agreed on a standard mechanism (Host-to-Host Protocol, or NCP) for information interchange. Once the NCP was implemented in several hosts, it became relatively easy for application programmers to begin using remote network resources. This was the beginning of an increasing pressure on BBN to make the network reliable enough for host applications to depend on it always being available.

Remote Diagnostics: The most powerful tool we had to find and fix problems in the net was the net itself. The IMP fake hosts allowed us to locally or remotely examine registers that contained details about the IMP’s operation, to insert patches into the operating software, and to cause the IMP to issue control signals to the interfaces to the connected modems and hosts, causing those interfaces to loop back to the IMP for diagnostic purposes. We generally accessed an IMP's DDT from a terminal on our time-shared PDP-1 host located in our office or in the Network Control Center. Before mentioning some of the tools we built on top of these capabilities, we should mention that the very fact that we could field a network including such tools illustrates how different the world of networking was from today’s world. These tools allowed anyone connected to the net – either by walking up to the control teletype in a university computer room or from any host computer connected to the net – to see any data going through the net, to insert data into any message on the net, or to take down any IMP on the net. It would have been straightforward to put into the IMP code to allow only authenticated network managers these capabilities, but in the earliest days it never occurred to us that we needed that protection. This changed a little by early 1973 when some minimal controls were put in place.

Software Checksums: We implemented in the IMP’s operational software checksums used for a variety of purposes such as to verify that the data being used had not been corrupted in the IMP. One such was a software checksum on packets received that could be optionally activated in addition to the hardware CRC check, that we used to diagnose IMP hardware problems. If we started seeing corrupted data, we would activate the software checksum, which would cause the IMP to send any corrupted packets to the Network Control Center, where they would be printed out. Most of the packets we received in this fashion were routing packets, which had a fixed format that made it relatively easy to identify the cause of the failure. For example, an intermittent failure in the interface shift register that caused a given bit to stick on would show up as all the bits in a given word to the left of the faulty bit being set to 1. A failure in the logic that transferred the assembled word into memory might show up as the same bit on the same interface being repeatedly changed from a 1 to a 0 in many packets. In each such case, the repair technician would arrive with the correct replacement card and knowing exactly which card in which interface to replace. If the particular circuit wasn’t being heavily used, we might disable the faulty interface and schedule an after-hours service call to replace the card.

When we were having trouble with a circuit, we used these features to diagnose the problem to the circuit itself or one of the two modems in order to provide information to the carrier as to the nature of the problem. To see the characteristics of the line errors encountered, we would activate the software checksum and turn off the hardware checksum check. Now any data error on the circuit would result in a software checksum error that would cause the bad packet to be printed out in Cambridge. We became proficient at identifying problems in the circuits before the carrier was aware of the problem. After links carried on microwave towers were installed in Florida, when thunderstorms popped up, we could see error rates climb, and by looking at the data we could identify lightning strikes and intense rain cells. The first time the Network Control Center operators placed a trouble call to the carrier to report high error rates on a circuit in California, the carrier technician insisted on knowing which end of the circuit we were calling from. They couldn’t accept that we were calling from Massachusetts.

Neighbor Reloads: At one point, an IMP at a remote location with a hard failure crashed and tried three times to reload itself from paper tape as commanded by the watchdog timer. The result was an impressive pile of jumbled paper tape sitting on the floor beside the IMP. Someone thought the pile unsightly and threw it away. When our maintenance person arrived and repaired the failure, after running the diagnostics, he had no way to reload the IMP software. It appeared that we were going to be stuck with the IMP down until we could fly out a replacement paper tape. Instead, we took down the communication line from a neighbor IMP so that the neighbor would send nothing to the failed IMP. At the failed IMP, we then manually set up the channel pointers for that line to point to the entire memory of the IMP and manually executed an IN command to that modem interface. Then from Cambridge we set the neighbor’s output side channel to similarly point to the entire IMP memory and patched that IMP to one time execute an OUT command, causing one long packet of the entire IMP memory to be sent down the line to the failed IMP. With its memory now a copy of its neighbor, we were able to restart the failed IMP with no trouble.

This process of reloading an IMP from its neighbor proved very powerful and useful. It was built into the normal IMP code as a feature and became how we reloaded all IMPs. This became the mechanism for distributing software updates, loading the new software into an IMP at our Cambridge headquarters and the propagating it to each of the other IMPs on the net one by one. The paper tape readers on the IMPs were removed and discarded.

The introduction of the TIPs required additional procedures. Any given TIP did not in general have an adjacent TIP from which it could be reloaded. In addition, it quickly became apparent that each TIP needed to have customization to efficiently manage the collection of devices connected to it. The IMP portion of the program could be reloaded from a neighbor. For the TIP portion of the code, buffers, and any other special parameters, we stored an image on our time-shared DEC PDP-1 which was attached to the network as a host. Once the IMP portion of a TIP was functioning a special download program in the PDP-1 interacted with the DDT at the target TIP to download the memory image for that TIP and start it running. Once we had this process working, we added to the PDP-1 the capability of releasing software patches to all the network IMPs as necessary.

Similarly, after we added Pluribus IMPs it was not always possible to download even the IMP code from a neighbor, as the neighbor might not be code-compatible. The mechanism we evolved was to have an IMP that needed to be reloaded running a bootstrap routine looking for a special message on any of its circuits saying “download yourself from me.”   The NCC would then begin a software transfer in normal ARPAnet packets from a server host (the PDP-1 or a TENEX time-sharing machine at BBN) to a fake host in a neighbor, and the fake host would transform the packets into a special format for software reloads, consisting of a memory address and a portion of a memory image. The first special packet was the “download yourself from me” packet. When the subsequent packets arrived at the IMP to be reloaded, they were written into the memory locations specified by the address in that packet. The final special format packet would specify a start location and tell the reloaded IMP to jump to that location.

Diagnosis vs. Restarting: There is an obvious conflict between getting as much diagnostic information as possible from a failed machine and getting the failed machine restarted and back online. At first a down IMP was kept down until someone local could be talked through the process of recording the contents of important memory locations and CPU registers. This often resulted in extended down time after an apparent program crash. In mid-1974 we created a tiny routine in the IMP loader area to dump the contents of memory to a network TENEX system where the programmers could examine it at leisure. Since an IMP could not communicate through the network unless its program was fully running, this routine communicated only with an immediate neighbor. When the neighbor saw the specially-formatted dump information arriving over one of its attached circuits, it repackaged the dump contents as standard network messages and forwarded them to the standard destination machine for core dumps. This process allowed us to gather all the available diagnostic information while getting the failed IMP back in operation quickly.

Hardware Maintenance

By the beginning of 1972 the IMPs exhibited reliability better than typical computers of the day, around 98% up-time vs. around 95%. But 98% uptime is 2% downtime which means it is down about a half-hour a day. With the redundant routing over alternate communication lines and alternate IMPs, the users essentially never experienced denial of service due to a failure in the interior of the net. Only outages at one of the two end points of a given communication generally caused service denial. Nonetheless, this meant that on an average day, a user of the net would be unable to complete a desired connection for an hour. No matter that the IMPs were as reliable as the hosts that connected to them, users expected that the net was a utility and should be as reliable as the electrical supply or the telephone dial tone. Users were unhappy with the net being unavailable to them for an hour a day; they perceived that the net was “always down.” By mid-1973 there was talk of canceling the entire net. BBN was given a directive to fix it. In early 1974 Ben Barker, an engineer about to receive a Harvard PhD in Applied Math (Harvard had no computer science department at the time), was put in charge of the effort.

At the time, we were subcontracting with Honeywell, the maker of the minicomputers on which the IMP was built, for maintenance of the IMPs. Ben met with Honeywell to explain to them that we needed to dramatically improve reliability. Honeywell’s reaction was, “You’re getting 98%!? We never get that high. How are you doing it?” Ben concluded that the only way we would be able to get control of the problem was to build our own field maintenance group, with 1-man offices in the Boston, Washington, San Francisco, and Los Angeles areas, where we had clusters of IMPs on the net. In building and running this group, we learned a number of lessons on what is important to achieve high availability.

Maintenance Approach:   What we bought from Honeywell was minicomputers with some custom interfaces. What we delivered to ARPA were communication switches. Appropriate maintenance approaches for these different kinds of products are markedly different. One key example of this difference is a guiding philosophy that we adopted: hardware breaks, often intermittently, and to the maximum extent possible, the software needs to run correctly in the face of such errors. Initially, the software developers were reluctant to implement such an approach; they felt that the software had to be able to assume the hardware was working properly or it was impossible to know what would happen. Eventually, as we implemented defensive code and were able to show its effectiveness at finding errors and improving net reliability, they adopted this defensive posture.

To illustrate the difference between the approach that Honeywell maintenance took to a down machine and the approach we adopted, consider what the service technician would do on arrival. When Honeywell arrived at a failing but still running machine, the first thing they would do was to take the IMP down off the net and run diagnostics. If an error appeared, they would replace logic cards until the error went away. Unfortunately, not all the replacement cards were good. Often, this approach led to compounding errors, further obscuring what needed to be done to get the machine back up. Sometimes the technician would not have a functional replacement card for the one that was failing. Machines were often down for hours or days until a functional replacement could be obtained, often until it could be flown in.

Our approach to maintenance was markedly different, made possible by the fact that we controlled the software as well as the hardware. We strove to make the software tolerant of the inevitable and often intermittent hardware errors that we experienced. We built consistency checks into the software, making the operational system the diagnostic that would find errors. The system gathered error data and provided detailed reports back to the Network Control Center that would generally allow us to identify a failing logic card without taking the IMP down.

Once a failing card was identified, we would dispatch a technician to replace that card. We might disable the particular interface with the failing card and then leave the rest of the machine running until after-hours when the IMP was less heavily used and taking the IMP down wouldn’t be troublesome to the host computer’s users. Once the technician was ready at the IMP site with the replacement card, he would open the machine and prepare to replace the card. We would then send an “IMP GOING DOWN” message to the host computers’ users so that they could finish up what they were doing. The technician would take the IMP down, replace the card, and bring the machine back up. Out-of-service time was generally measured in seconds rather than days.

People: We had experience with a variety of service technicians from Honeywell. They ranged from extraordinarily good people who would have machines back up in minutes to others who might take days to get a machine up. The good ones were generally those who had the most experience with the 516 computers and in particular those with more experience on our systems. To build our team, we hired the very best of the Honeywell technicians, one for each of the four cities where we had concentrations of IMPs. We gave them enormous freedom and discretion; as long as their machines were up, pretty much all we asked of them was to be available and responsive to being paged if we needed them. This arrangement provided great personal motivation to keep the machines reliably up. Our technician in Los Angeles was a surfer. He was welcome to surf all day provided that at the end of each run he would check his pager. He would do his maintenance in the middle of the night when generally there was nobody using the net.

Hardware Design: The 516 minicomputer and the custom interfaces that made the 516 into an IMP were generally well-designed and reliable. However, in the days leading up to the scheduled shipment date of the first IMP to UCLA, we were working feverishly to get the IMP hardware and interfaces to work properly. We wrote a diagnostic program which sent data as fast as it could over all host and modem interfaces simultaneously to put maximum stress on the circuitry. The circuitry proved not up to the task. About once per 24 hours, the IMP would crash in an apparently impossible fashion, with the program counter pointing to a location where there were no executable instructions but rather the data buffers that the machine was sending to itself. The data there were not valid instructions. We examined the locations preceding the location where the machine halted and found that they similarly contained data that were not valid instructions, implying that the computer could not have run through the preceding locations to arrive at the location where it crashed. We then searched the entire memory of the IMP to find whether there was anywhere an instruction which could have caused the IMP to transfer control to the location where it crashed. There was not. Given no possible legal way the machine could have legally died at the location where we found it, it had to be a hardware failure.
We concluded that the cause had to be a synchronizer problem.[13] In any circuit that is controlled by two or more clocks that are not synchronized, there is a remote possibility of the two clocks ticking at so close to the same time that the circuit cannot determine which occurred first. In this case, circuits in the IMP had to decide whether the next cycle should go to execute the next instruction or to transferring data in or out of memory for one of the I/O interfaces. In a very rare case – approximately once in one hundred billion (10^11) cycles – the circuit would not have resolved the dilemma in time. In this case, one section of the circuitry would decide to do an I/O data transfer and load the address of the data buffer into the memory address register while other circuits would decide to execute an instruction. The result was to treat the data in the I/O buffer as an instruction, causing the machine to halt in the data buffer area. A machine such as the IMP with a 1-microsecond clock rate executes 1011 cycles in approximately 24 hours.
We rewired the logic in the central timing chain of the 516 minicomputer to clock the data into this decision logic a quarter microsecond earlier. Allowing the circuitry this extra 250 nanoseconds to settle extended the mean time between failures from 1 day to some number of earth lifetimes.
We made the change with less than 24 hours to go until the IMP was to ship to UCLA. We were sufficiently confident that we had fixed the problem that we shipped on schedule even though the IMP had not completed one MTBF. We never saw a repetition of the problem in the life of the ARPAnet. There were, however, other machine design issues that needed to be addressed.
Display Lamps: For the second generation of IMPs, we shifted from the Honeywell 516 to the less expensive and slower 316. We were very dependent on the 316’s front panel buttons and lights to set up and monitor the machine’s operation. Where the ruggedized 516 used incandescent bulbs behind screw-on lenses for the panel lamps, the 316 substituted spring-return push levers with an incandescent bulb soldered inside each one. It was an unfortunate design since the lever would snap back up when the operator’s finger released it, just as the bulb was starting to light up from the in-rush of current through the cold filament. The result was that typically each month on each IMP, at least one or two bulbs would fail because of normal use and need to be replaced. Replacing these bulbs involved bringing the IMP down, disassembling the front panel, unsoldering the leads of each dead bulb, soldering in a new bulb, and reassembling the machine. This process generally took a couple of hours, during which time the IMP remained down. Worse, once an IMP was down, other issues would often arise in trying to bring it back up. It would typically take hours, sometimes days, to get an IMP back up.

To address this, we designed a custom printed circuit board on which we mounted a row of push-button switches each with an LED lamp built in. We retrofitted all the 316 IMPs in the field as well as each new IMP as it came to us from Honeywell with this card and never again had to take a machine down to replace bulbs.

Waddington Effect: As per Honeywell’s recommendation for the 516, we took the IMPs down for preventive maintenance once a month. As mentioned above, we found that once an IMP was taken down, there was a substantial likelihood that we would not be able to get it back up for hours or even days. Taking an IMP down for preventive maintenance more frequently than necessary dramatically increased the down time. This effect was observed in British bombers in the second World War. Conventional wisdom dictated that taking bombers out of service for preventive maintenance more frequently would increase the percentage of the time that the plane was mission-ready. A British biologist named C. H. Waddington did studies that showed that to the contrary, more recently maintained bombers suffered increased failures and mission unreadiness.[14] Once his recommendations to increase the interval between maintenance and to eliminate all preventive maintenance tasks that could not be demonstrated to increase mission-readiness, the number of effective flying hours of the fleet increased 60%.

We had no knowledge of this work at the time, but it was clear that when IMPs went down for preventive maintenance, they often didn’t come up again for a long time, and that this was a substantial contributor to the lower-than desired availability statistics. We chose to eliminate taking IMPs down for preventive maintenance altogether unless there was a clear reason to do so in a particular case. Our new stand-up PM would consist of removing and cleaning the air filters, feeling the internal temperature, and checking and if necessary adjusting the supply voltages, all with the IMP running. If these all seemed OK, we would close it up and leave. The improvement in the availability statistics was dramatic.

Magic Modem: While the network topology was designed to generally provide an alternate route around an IMP that was down, there were instances of stub nodes that had only one link to another IMP. If that neighbor IMP was down, due to failure or being taken off-line for maintenance, the stub node would be isolated, rendering it effectively also down. We strove to design a device that a technician could bring on a service call which would allow direct connection from one modem to another, bypassing a down IMP and creating a single link out of two links that ordinarily both connected to the now-down IMP. The complexity and amount of buffering required in such a box was a function of how much the speed of the two links might differ. We discovered that the clocking of all links in the net ran at quite precisely the same speed with virtually no phase slippage. This meant that no buffering was needed. The device we built, dubbed a “magic modem”, was simply two pair of wires with connectors at the ends which plugged into the transmit and receive data lines of the Bell 303 modems to which the two lines from the IMP ordinarily connected. The technicians would connect them to the modems whenever they had a down machine. In addition to eliminating taking down stub nodes when a neighbor was down, this device improved network performance appreciably. In the early days of the net, there were only two cross-country nodes. If one of these two lines was down, for maintenance or otherwise, all cross-country traffic had to go over the other line, which could become overloaded. By bypassing the down machine with a magic modem, we were able to eliminate much of this overload.

Busy Machines: A variety of hardware and software problems would manifest themselves as the IMP being slow to respond. The IMP software would increment a counter each time it completed a pass through its background loop. As one of the diagnostics that we ran as part of the operational system, we wrote a program on the PDP-1 which would read this counter, wait a fixed interval, and read it again, giving us the rate at which the machine was completing its background loop. If this rate was unusually low, it would suggest that something was keeping the IMP too busy. In that case, we would examine the interrupt return vectors which would tell us where the machine was executing when it was last interrupted, which told us where the IMP was spending its time. Occasionally, this would lead us to a hardware error. More typically, it would point to an external issue such as host personnel debugging their network interface and sending the IMP illegally-formatted messages as fast as they could. There were times when we would call up surprised host people to tell them what was wrong with their messages.

Real-Time Clock Check: In another example of a diagnostic built into the operational system, the IMP had a real-time clock (RTC) that kept track of time to allow the IMP to time out how long it had been waiting for a receipt acknowledgement from its neighbor among many other things. It was important that these intervals be consistent across the net. In at least one instance, an IMP’s RTC started running at a radically wrong rate producing great confusion among the IMPs. After we replaced the clock card, we wrote a PDP-1 program to read a given IMP’s RTC, wait a fixed interval, and read it again to verify that the clock was running at the correct rate. If it was not, a technician would be sent to replace the clock card.

Pandemic: We had an incident when the entire net went down. We were able to deduce that all IMPs were routing all packets over the shortest path to Harvard. In examining the routing tables of each IMP, we determined that all IMPs believed that packets could get from Harvard to any destination in zero hops. The Harvard IMP had become a black hole; all traffic to and from all destinations was being sent to the Harvard IMP and could never get out again. Fortunately, it was only a few miles from BBN to Harvard, so we were able to get a technician there quickly, and he promptly took the IMP off-line and the rest of the net returned to normal operation. Diagnostics on the Harvard IMP showed that one of its core memory stacks had failed. The particular stack involved was the one that contained the routing tables. The content of all locations in that stack read as all zeros. The IMP concluded that all destinations on the net were zero hops away and transmitted this information to its neighbors and out into the net to form the basis of the routing decisions for all IMPs. To prevent this sort of failure from recurring, we added to the operating software a checksum of the routing tables that was verified before using the table. We never had a recurrence of the Harvard pandemic.

Problems uncovered by these diagnostics led us to think about other checks that we could make on the running program to isolate failures to a particular hardware component. One such mechanism was a Verification program we implemented on the PDP-1. This program interacted with an IMP's DDT to retrieve a copy of core memory and compare it to a stored memory image. This allowed us to find several cases where a memory stack was consistently picking or dropping a bit.

Spare IMP: As the net grew, it was decided that it made sense to have an entire spare IMP that could be flown anywhere on short notice to replace a badly failed machine. The IMP was a large device, housed in a 6-foot equipment rack. This was ungainly to air freight. Most freight shipped in the luggage compartment of passenger flights which were much more readily available to a given destination than were freight flights. Unfortunately, the 6-foot IMP was too tall for passenger flights. This spare IMP was implemented on a 316 computer built in two 3-foot racks with interconnecting cables. These short racks fit into passenger jet cargo holds and greatly improved how quickly we could get the spare to its destination.

We recall only one incident in which we actually deployed the spare. A line power surge had caused the oil-filled line voltage filter in one of the ruggedized 516 IMPs in California to explode, spreading burning oil and smoke throughout the interior of the IMP. It took a few weeks to repair and clean up the interior of the IMP, but we had the spare there and running within a day or two of the initial explosion.

International: We deployed IMP variants to a few locations in Western Europe. These machines were spread out among a number of countries, not close enough together to justify having a resident repair technician. At the same time, it clearly did not work to try to subcontract for maintenance with people totally unfamiliar with our machines. We chose to hire on a part-time basis one individual at each location and bring him back to Cambridge for intensive familiarization and training on our machines and back those individuals up with our technicians from Cambridge on trans-Atlantic flights at the first indication of a challenging problem.

Thermal Testing:   One of the IMPs would fail each weekend in winter. It turned out that the site turned the heat way down when they shut down for the weekend. That IMP had a thermal sensitivity to cold. After we chased down and repaired the fault, we decided to put all future IMPs through thermal testing prior to delivery. We built a pair of test chambers out of plywood and 2X12 timbers with 12 inches of fiberglass insulation on all 6 sides of the chamber. In each, we installed a Sears Roebuck whole house heating and air conditioning system, capable of heating and cooling the chambers from 32 to 120 degrees. We put each new IMP inside and went through a few days of hot, cold, and thermal shock testing. These tests turned up a surprising number of thermally sensitive cards that we replaced prior to shipment. The infant mortality rate of newly-installed systems declined dramatically. In addition, an unused chamber did a fine job of chilling beer.

Results: These and other techniques led to an improvement in average availability of the IMPs from approximately 98% to approximately 99.98% – a 100-fold reduction in down time. The user community perception changed from “The net is always down” to “The net is always up.” A 99% reduction in down time also led to dramatically reduced labor requirements per machine. Improved diagnostic information meant reduced demand for replacement parts. Our changes resulted in cost efficiency in addition to the operational efficiency.

Network Definition Expansion

By the late summer of 1973 there were 20 TIPs installed in the network. ARPA would pick a place where they had a contract or other agreement and have a TIP installed there. They would add a little money to the contract and tell the contractor to buy a bunch of modems or order them from the local Telco, then connect the modems to the TIP's terminal ports and have the local telephone company connect dial-up lines to the modems in a hunt group. ARPA would give the hunt group telephone number to other ARPA contacts in the local area and tell them they could get access to ARPAnet servers by having their terminals connect to that local number. Often the primary purpose of these users was to send and receive electronic mail. The contractor, having carried out ARPA's instructions to provide the modems and lines, was totally uninterested in how they actually functioned. For the dial-up users, everything between their terminal and the email server they were trying to access was “the ARPAnet,” and if anything kept them from getting to their mailbox they complained “the network is down!”

It became apparent that even though BBN had no responsibility for or control over the modems and dial-up facilities, the Network Control Center was the only facility that could serve as a central point of contact to help users who felt “the network is down.” ARPA realized that such complaints reflected poorly on the reputation of their network, and reluctantly provided funding for the NCC to expand its responsibilities to include the entire set of TIP dial-up arrangements.

The first response of the NCC was to add a person to the staff who learned a bit about each of the popular server systems. This person could act as the first point of contact for dial-up users with problems, and could talk the user through procedural difficulties. We quickly discovered, however, that user problems often arose from dial-up ports being broken or disconnected somewhere between the phone line and the TIP. To expedite the detection and repair of these problems we installed an auto-dialer connected to an outgoing WATS[15] line and driven by a program on a BBN TENEX machine. The auto-dialer program was given a list of all the TIP dial-up numbers and periodically called each one to see if a connection could be established. If so, fine; if not then additional work had to be done by the NCC operators to pin down the source of the trouble and call someone at the site to have it repaired. The first step was to have an operator listen in on an extension to the auto-dialer's calls to numbers with trouble. We occasionally made mistakes. One evening an operator heard a person answer and exclaim “It’s that damn fool with the whistle again!” That number was quickly removed from the database of modem numbers! Eventually the operators could track actual troubles to a failed interface card in the TIP, a disconnected modem, or a failed modem; once the problem was identified we could get repair initiated.

Yet another expansion of the NCC responsibility was to begin keeping track of the up/down status of several service hosts. Sometimes when a terminal user was trying to access their electronic mail, or obtain some other service, their perception that “the network is down” was caused by the service host itself being down to the network. Sometimes the service host personnel were well aware that their machine was down and were busily working to get it back up. If that was the case we could inform the user. However, it was remarkably common for service hosts to be running happily for local users but down to the network, perhaps because their NCP routine was malfunctioning. By early 1974 the NCC took on responsibility for tracking the availability of the most frequently-used service hosts, and were often able to alert the operators of those hosts before there were too many unhappy users failing to get access through the network.

Conclusions

We draw the following conclusions from our experience maintaining and supporting the ARPAnet IMPs:

Computers in the late 1960s and early 1970s had a reliability of about 95%. We were able to increase the reliability of the ARPAnet IMP computers to 99.98%.

Good people in maintenance and support make a difference. Appointing an energetic and ambitious young Harvard PhD, with experience in both hardware design and programming, to lead the maintenance and support effort, and giving him considerable autonomy to meet reliability goals worked well. Staffing the field service department with experienced technicians and giving them incentives to increase reliability and uptime also worked well.

A running program is a good diagnostic tool. Having the IMP program constantly checking the hardware for failures often allowed us to fix failing components before they took a machine down.

The network, used properly, was an enormous aid to diagnosis. It took some time before we realized the extent to which we could apply diagnostics from host computers on a round-the-clock basis.

The Waddington Effect pertained as much to the IMP computers as to WWII bombers. It is sad that none of us were made aware of the Waddington Effect during our formal education but it was classified until 1973, and we had to rediscover it for ourselves.

Computers are delivered with faults, which can be fixed before they are deployed. Examples: the synchronizer glitch, the design of the console switches on the Honeywell 316, components which did not meet the manufacturer’s specifications disclosed in our thermal testing facility.

As time passed the users' view of what constituted “the network” expanded, and this drove further innovation in maintenance and support mechanisms and procedures. Today hardware reliability is far better than it was 50 years ago, and support and maintenance issues are quite different. But as the October 4, 2021 Facebook/Instagram outage demonstrated, it is still a challenge to keep a widespread network built from computers running flawlessly.

Acknowledgements

We thank David Hemmendinger for his comments on an early draft which helped immensely with the focus of this paper. We thank the anonymous reviewers for their helpful suggestions on presentation and references.

References

A. Avizienis, “Design of fault-tolerant computers” in 1967 Fall Joint Computer Conf., Proc. AFIPS Conf., vol. 31. Anaheim, CA, USA, Nov. 14-16, 1967: pp 733-743

G.D. Cole, “Computer Network Measurements: Techniques and Experiments,” Ph.D. dissertation, School Eng. Appl. Sci., Univ. California, Los Angeles, 1971

W.R. Crowther, J.M. McQuillan, D.C. Walden, “Reliability Issues in the ARPA Network” in Proceedings of the ACM/IEEE Third Data Communications Symposium, St. Petersburg, FL, USA, Nov. 1973: pp.159-160

F.E. Heart, R.E. Kahn, S.M. Ornstein, W.R. Crowther, D.C. Walden, “The interface message processor for the ARPA computer network,” in 1970 Spring Joint Computer Conf., AFIPS Conf., vol. 36, Atlantic City, NJ, USA; May 5-7, 1970: pp. 551-567;

F.E. Heart, S.M. Ornstein, W.R. Crowther, W.B. Barker, “A New Minicomputer/Multiprocessor for the ARPA Network,” in 1973 National Computer Conf, AFIPS Conf., vol. 42, New York, NY, USA, June 4-8, 1973: pp. 529-537;

R.E. Kahn and W.R. Crowther, “A Study of the ARPA Network Design and Performance,” BBN Inc., Cambridge MA, USA, Rep. 2161, October 1971

D. Katsuki, E.S. Elsam, W.F. Mann, E.S. Roberts, J.G. Robinson, F.S. Skowronski, E.W. Wolf, “Pluribus – An Operational Fault-Tolerant Multiprocessor,” in Proceedings of the IEEE, vol 66, issue 10, October 1978: pp. 1146-1159;

W. M. Littlefield and T. Chaney, “The Glitch Phenomenon,” Computer Systems Laboratory of Washington Univ., St. Louis, MO, USA, Technical Memorandum No. 9, 1966

A.A. McKenzie, B.P. Cosell, J.M. McQuillan, M.J. Thrope, “The Network Control Center for the ARPA Network,” in Proceedings of the First International Conference on Computer Communication, Winkler, Ed., Washington, DC, USA, Oct. 1972: pp. 185-191;

A.A. McKenzie, “The ARPA Network Control Center,” in Proceedings of the ACM/IEEE Fourth Data Communications Symposium, Quebec City, Canada, Oct. 1975: pp. 5-1 to 5-6

J.M. McQuillan, W.R. Crowther, B.P. Cosell, D.C. Walden, “Improvements in the Design and Performance of the ARPA Network,” in 1972 Fall Joint Computer Conf., AFIPS Conf., vol. 41; Anaheim, CA, USA, Dec. 5-7 1972: pp 741-754

J.M. McQuillan, “Report on Changes to ARPAnet Routing in the Last Half of 1973,” in Proceedings of the ILTAM 1974 International Seminar on Performance Evaluation of Data Processing Systems, Weizmann Institute of Science, Rehovot, Israel, July 1974

S.M. Ornstein, F.E. Heart, W.R. Crowther, H.K. Rising, S.B. Russell, A. Michael, “The Terminal IMP for the ARPA Computer Network,” in 1972 Spring Joint Computer Conf., AFIPS Conf., vol. 40, Atlantic City, NJ, USA, May 16-18, 1972: pp. 243-254;

S.M. Ornstein, W.R. Crowther, M.F. Kraley, R.D. Bressler, A. Michel, F.E. Heart, “PLURIBUS – A Reliable Multiprocessor,” in 1975 National Computer Conf., AFIPS Conf., vol. 44, Anaheim, CA, USA, May 19-22, 1975: pp. 551-559;

Ornstein, S.M.; “Appendix I – The Synchronizer Problem,” in Computing in the Middle Ages, 1stBooks, 2002, pp 256-260

L.G. Roberts and B.D. Wessler, “Computer network development to achieve resource sharing,” in 1970 Spring Joint Computer Conf., AFIPS Conf., vol. 36, Atlantic City, NJ, USA; May 5-7, 1970: pp. 543-549

B. Fidler and A.J. Russell, . “Financial and Administrative Infrastructure for the Early Internet: Network Maintenance at the Defense Information Systems Agency.” in Technology and Culture, vol 59,4, October 2018; pp 899-924;

D.P. Siewiorek and R.S. Swarz; Reliable Computer Systems: Design and Evaluation (3rd Edition); A K Peters/CRC Press; December 1998

J. von Neumann. Jan 4-15, 1952. Lectures on Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. California Institute of Technology, notes by R.S. Pierce. Available: https://static.ias.edu/pitp/archive/2012files/Probabilistic_Logics.pdf

C.H. Waddington; O.R. in World War 2: Operational Research Against the U-Boat, Elek Science, London; 1973

D. Walden and the “IMP Software Guys”; “The Arpanet IMP Program: Retrospective and Resurrection” in IEEE Annals of the History of Computing, vol 36, no 2, April-June 2014; pp 28-39

David C Walden: BA (Mathematics) San Francisco State College. David worked for 3 years at MIT Lincoln Laboratory, and 27 years at BBN, where he was part of the team which developed the initial ARPAnet and later became a technical manager and general manager. Since retiring he studies and writes about quality management, computing history, and digital typography. David died in his sleep on April 27, the day this article was accepted for publication. More information about his work is available at walden-family.com/dave/davehome.htm

Alexander A McKenzie: BS Stevens Institute of Technology, MS (Computer Science) Stanford University. Alex worked for 3 years at Honeywell writing FORTRAN compilers, and 29 years at BBN. He joined the ARPAnet project 2 years after it started and spent the rest of his career involved with the development and use of computer networking. Alex served as the editor of many network protocols, and was Session Layer Chair in the Open Systems standardization effort. More information is available at alexmckenzie.weebly.com . Contact Alex at [email protected]

W Ben Barker: BA (Chemistry and Physics), MS, Ph.D. Applied Mathematics, Harvard University. BBN 1969 – 1995; Senior Engineer on ARPAnet, Director of Manufacturing and Field Service; President, Chairman BBN Computer Corporation; President, BBN Advanced Computers Inc.; President, BBN Communications; President, LightStream Corporation (sold to Cisco); BBN Senior Vice President, Chief Technology Officer. President / CEO of Nasdaq-listed Data RACE, Inc. Author of papers, patents, and congressional testimony on computer architecture, packet switching, laser physics, and national competitiveness. Harvard visiting committee on Information Technology; board member of various eleemosynary entities.

[1] See: Russell 2018

[2] See: Roberts 1970

[3] See: Walden 2014

[4] See: von Neumann 1952, Avizienis 1967, Siewiorek 1998

[5] See: Katsuki 1978

[6] See: Heart 1970

[7] A fake host was an IMP internal software package which interacted with the IMP’s message processing functions as though it were an external Host

[8] See: Ornstein 1972

[9] See: Heart 1973, Ornstein 1975

[10] See: Cole 1971, Kahn 1971,McQuillan 1972, Crowther 1973, McQuillan 1974

[11] See: McKenzie 1972

[12] See: McKenzie 1975

[13] See: Littlefield 1966; Ornstein 2002, Appendix I – The Synchronizer Problem, pp 256-260

[14] See: Waddington 1973

[15] “Wide Area Telephone Service” A bulk rate telephone service that allowed a user to make unlimited calls to a defined service area for a fixed monthly charge. Our defined area was the contiguous 48 US states.

Alex McKenzie

Seeking High IMP Reliability in Maintenance of the 1970s ARPAnet