"Linux Gazette...making Linux just a little more fun!"

The Answer Guy

By James T. Dennis, linux-questions-only@ssc.com
Starshine Technical Services, http://www.starshine.org/

Automated Recovery from System Failures

From anonymous on the L.U.S.T List on 2 Sep 1998

And there will be no human to manually check on the partitions after a power failure.

What's wrong with e2fsck? TTYL!

I was thinking about this recently and I came upon an interesting idea. (I think a friend of mine used the following trick in a commercial product he built around Linux).

The trick is to install two root filesystems (preferably on different drives -- possibly even on different controllers). One of them is the "Rescue Root"; the other is the "Production Root." You then configure the "rescue root" partition as the default LILO device and modify the shutdown sequence to override that default with an /sbin/lilo -R command.
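The shutdown-side half of this trick can be sketched as follows. This is a minimal sketch rather than a tested implementation: the image label "production" and the helper names are assumptions made for illustration. The only thing taken from the text is the use of /sbin/lilo -R, which sets a command line for the next boot only, so the rescue root remains the standing default.

```python
import subprocess

LILO = "/sbin/lilo"


def lilo_once_command(label):
    """Build the argv that asks LILO to boot `label` on the next boot only."""
    if not label or any(c.isspace() for c in label):
        raise ValueError("LILO image label must be a single non-empty word")
    return [LILO, "-R", label]


def arm_production_boot(label="production", dry_run=True):
    """Intended to be called late in a clean shutdown.

    "production" is a hypothetical lilo.conf image label. With dry_run=True
    the command is only returned, not executed.
    """
    argv = lilo_once_command(label)
    if not dry_run:
        # Raises CalledProcessError if lilo exits with a non-zero status.
        subprocess.run(argv, check=True)
    return argv
```

If the machine loses power, this step never runs, so the next boot falls through to the default (rescue) image.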

If the system boots from the rescue root, it is because the system was booted irregularly: the standard shutdown sequence was not run. The rescue root can then run various diagnostics on the production root and other filesystems. If necessary it can newfs (mke2fs, on an ext2 system) and restore the full production environment (from another, normally unused, directory, partition, or drive). The design of the rescue root is a matter for some consideration and research.
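One diagnostic the rescue root might run is a non-interactive e2fsck, deciding what to do next from its exit status. The sketch below uses the exit-status bits documented in the e2fsck(8) man page. The three decisions ("ok", "reboot", "escalate") and the function names are assumptions for illustration, not part of any existing tool.

```python
import subprocess

# Exit-status bits as documented in e2fsck(8).
ERRORS_CORRECTED = 1
REBOOT_NEEDED = 2
ERRORS_UNCORRECTED = 4
OPERATIONAL_ERROR = 8
USAGE_ERROR = 16
CANCELLED = 32
LIBRARY_ERROR = 128

# Anything in this mask means the check did not leave a trustworthy filesystem.
FATAL = (ERRORS_UNCORRECTED | OPERATIONAL_ERROR | USAGE_ERROR
         | CANCELLED | LIBRARY_ERROR)


def classify_fsck_status(status):
    """Map an e2fsck exit status to 'ok', 'reboot', or 'escalate'."""
    if status & FATAL:
        return "escalate"   # e.g. go on to rebuild and restore
    if status & REBOOT_NEEDED:
        return "reboot"
    return "ok"             # clean, or errors were corrected


def check_filesystem(device, run=subprocess.call):
    """Run 'e2fsck -p' (automatic repair) and return (status, decision).

    `run` is injectable so the logic can be exercised without a real disk.
    """
    status = run(["/sbin/e2fsck", "-p", device])
    return status, classify_fsck_status(status)
```

What "escalate" should actually do (rebuild the filesystem, restore from the spare copy, or stop and wait) is exactly the design question left open above.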

Normally the system will boot into "production" mode. Periodically it can mount the alternative root fs to do filesystem checks and/or an extra filesystem to do backups (of changes to the configuration files). You can ensure that these configuration backups are done under a version control system so that degenerative sets of changes can be automatically backed out in an orderly fashion.

If you combine this with a watchdog timer card and a set of appropriate system monitoring daemons (which all talk to a dispatcher that periodically resets the watchdog timer), you should have a system whose autorecovery is about as bulletproof as is possible on PC equipment.
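The dispatcher idea can be sketched like this: each monitoring daemon reports in, and the watchdog is reset only while every monitor has reported recently. The class name, the monitor names, and the max_age policy are all assumptions. The pet callable stands in for whatever resets the real timer (on Linux that might be a write to a watchdog device that is held open), and that wiring is deliberately left out.

```python
import time


class Dispatcher:
    """Resets the watchdog only while every monitor has reported recently."""

    def __init__(self, monitors, max_age, pet, clock=time.monotonic):
        self.max_age = max_age      # seconds a report stays valid
        self.pet = pet              # callable that resets the watchdog timer
        self.clock = clock          # injectable for testing
        now = clock()
        self.last_seen = {name: now for name in monitors}

    def report_ok(self, name):
        """Called by a monitoring daemon when its check passes."""
        if name not in self.last_seen:
            raise KeyError(name)
        self.last_seen[name] = self.clock()

    def stale(self):
        """Names of monitors whose last good report is too old."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.max_age)

    def tick(self):
        """Reset the watchdog if all monitors are fresh; otherwise do nothing.

        Returning False means the timer is left to expire, which lets the
        hardware reboot the machine.
        """
        if self.stale():
            return False
        self.pet()
        return True
```

The point of the design is that a hung daemon does not have to be detected explicitly: it simply stops reporting, the dispatcher stops resetting the timer, and the card does the rest.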

I should note that I haven't prototyped such a system yet. I've just thought of it. A friend of mine also suggested that we devise a way to have another proximate system also doing monitoring (possibly via a null modem). He says he knows how to make a special cable which would plug into the guard dog's printer/parallel port (guard dog is what I've been calling the hypothetical proximate system) and would be run into the case of the system we're monitoring, where it would be fitted over the reset pins. This, with a small driver, should be able to strobe the reset line.

(In fact I joked that we could create a really special cable that would daisy chain to as many as eight other systems and allow independent reboot of any of them).

In any event the monitor system would presumably monitor some/most of the same things as the watchdog timer; so I don't know what benefit it would ultimately offer (unless it was prepared to do or initiate failover to another standby system).

Perhaps this idea might be of interest to the maintainer of the High-Availability HOWTO (Harald Milz -- whom I've blind copied on this message). It's not really "High Availability" but "Automated Recovery" which might be sufficiently close for many applications. (i.e. if a web, mail, dns, or ftp server's downtime can be reduced from "mean hours per incident" to "mean minutes per incident" most sysadmins still get lots of points).

Automated Recovery from System Failures

From R P Herrold on 04 Sep 1998

We build custom Linux solution boxen. In our build outline, we take this concept a step further in setting up a Red Hat system -- we carry a spare /boot partition:

(extract)
(base 5.0 install)

Part     name            size    Cyl     cume    actual min
====     ==========      ====    ====    ====    ==========

 1       /boot           20      ___      20
 2       root            30      ___      50     23
                         (/bin           ___ M)
                         (/lib           ___ M) modules
                         (/root          ___ M)
                         (/sbin          ___ M)
 3       swap            30      ___      80
 4       (extended)
 5       /mnt/spare      30      ___     110     1

... The minima in a 'stripped down' / [root] partition vary depending on where /lib, /var, and /usr end up -- of late, a lot of distributions' packages feel a need to live in /bin or /sbin unnecessarily -- and probably should be in the /usr tree ... Likewise, if a package is NOT statically linked, one can end up with problems, if a partition randomly decides to 'go south.'
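The "cume" column in the outline above is just a running total of the "size" column, which is easy to recompute when the sizes change. A minimal sketch follows. The sizes are copied from the outline and assumed to be megabytes, and the function name is made up for illustration.

```python
from itertools import accumulate

# (partition number, name, size) as listed in the outline; the extended
# partition (4) carries no size of its own and is omitted.
layout = [
    (1, "/boot", 20),
    (2, "root", 30),
    (3, "swap", 30),
    (5, "/mnt/spare", 30),
]


def cumulative_sizes(parts):
    """Return the running total of partition sizes, in listing order."""
    return list(accumulate(size for _, _, size in parts))


for (num, name, size), cume in zip(layout, cumulative_sizes(layout)):
    print(f"{num:>2}  {name:<12} {size:>4} {cume:>5}")
```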

I was thinking about this recently and I came upon an interesting idea. (I think a friend of mine used the following trick in a commercial product he built around Linux).

... We use the 'trick' as well

The trick is to install two root filesystems (preferably on different drives -- possibly even on different controllers). One of them is the "Rescue Root"; the other is the "Production Root." You then configure the "rescue root" partition as the default LILO device and modify the shutdown sequence to override that default with an /sbin/lilo -R command.

... carrying the full [root] partition

I should note that I haven't prototyped such a system yet. I've just thought of it. A friend of mine also suggested that we devise

... It works, and can avoid the need to keep a live floppy drive in a host which would otherwise require one for emergency purposes ... aiding in avoiding physical security issues

[ normally I remove sig blocks, but since he copyrighted his... I guess I'll leave it in. Curious one should post a copyright into open mailing lists, though. -- Heather ]

.-- -... ---.. ... -.- -.--
Copyright (C) 1998 R P Herrold
herrold@usa.net NIC: RPH5 (US)
My words are not deathless prose,
but they are mine.
Owl River Company 614 - 221 - 0695
"The World is Open to Linux (tm)"
... Open Source LINUX solutions ...
info@owlriver.com

Copyright © 1998, James T. Dennis
Published in Linux Gazette Issue 34 November 1998
