Recovery Using a Backup
Before Disaster
You don't need me to tell you about this. Have a backup drive. Run your backup scripts frequently. All the usual stuff.
When Disaster Seems to be Approaching (or has Already Occurred)
In all three recent cases where I had a drive go bad, the failure was sudden but not immediate. The machine seemed slower than normal, and the drive made unhealthy sounds as it tried to read and re-read sectors.
If you are interested in trying to predict if your hard drive will go bad, and why, this article from Google discusses their experiences with hard drive failures. A short simple summary of their conclusions appears here.
SMART
Another possible indicator that things are going wrong is that the drive reports through SMART that it is in trouble. (You can read the SMART status of an internal disk by looking at the bottom of the Disk Utility.app window.) My experience has been that SMART has never caught a problem for me; all three disks that went bad on me reported SMART as just fine, but heck, I guess it probably reports problems in some situations.
Google's experience matches mine, that SMART has the potential for detecting drives that will fail, but that it's by no means perfect. In particular the Google paper suggests investigating the SMART data carefully and tells you what to look for. Right now I know of no GUI app that does this, but the command-line app smartmontools does do so. You can install this as usual via
You can google for smartmontools to find various pages that tell you how to use the package; a reasonable such page is here.
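As a minimal sketch of what "investigating the SMART data" might look like on the command line: `smartctl -A` prints the drive's attribute table, and a small filter can flag counters that are no longer zero. The helper name `smart_warnings` is hypothetical, and the attribute names are an assumption (they are reallocation and pending-sector counters that many drives report, not a list taken from the Google paper); adjust them to whatever your drive actually shows.

```shell
# Sketch: flag non-zero raw values for a few SMART attributes.
# Assumption: stdin is the attribute table printed by `smartctl -A`, where
# column 2 is the attribute name and column 10 is the raw value.
# Assumption: the attribute names below match what your drive reports;
# vendors differ, so edit the list as needed.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && ($10 + 0) > 0 {
            print $2 "=" $10
        }
    '
}
```

Typical use would be something like `sudo smartctl -A /dev/disk0 | smart_warnings` (the device name is only an example). No output means none of the listed counters has moved off zero, which, as noted above, is no guarantee the drive is healthy.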
Unfortunately SMART is not propagated over USB or firewire, in one more example, as if such were needed, of the all around stunning incompetence of this industry, so you can only use it to monitor your internal drives.
Summary of recovering from disaster
Once you have reason to believe (SMART, unexpected slowness, repeated clicking sounds etc) that your drive is close to death, don't wait. Once again, in the three recent cases where I had drives fail, the drive went from apparently working, though strangely, to not working in about three hours.
You want to get one last backup out of it (using our previous scripts) as soon as you can. The way I handle this is as follows (the steps will all be discussed in more detail below, once you understand the big picture).
First, whatever you do, you do not want to jeopardize your backup drive.
1. Acquire the drive that will be the replacement for the sick drive. This is probably a FireWire or USB drive. (If it's a boot drive, and your Mac only boots off FireWire, then make sure it is a FireWire drive.)
2. Immediately copy your last good backup from the backup drive to this new drive. At this stage, unmount and power down your backup drive. Whatever happens, you do not want to make some screwup that hurts that drive.
3. Now that you have replicated your last backup on the new drive, try to get one last rsync from the sick drive to the new drive. Note that we first copied over the previous backup to the new drive so as to minimize the amount of work that has to be done during this backup. A full backup of a large drive can take five or six hours; if your hard drive only has two hours of life left in it, you want to use that time copying over whatever changed since the last backup, not all the stuff that's already backed up.
4. Ideally this last rsync succeeds. Even if the drive dies partway through, at least you got some of the new stuff off it. In the worst case, of course, you can just go back to step two, restore from your backup drive, and accept the loss of a few days' work.
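The reasoning behind copying the old backup first can be made concrete with a little arithmetic. The sketch below uses a hypothetical helper, `estimate_minutes`, and purely illustrative figures (300 GB on the drive, 5 GB changed since the last backup, 15 MB/s sustained); all three numbers are assumptions, not measurements.

```shell
# Sketch: rough copy time in whole minutes.
# $1 = amount of data in MB, $2 = sustained transfer rate in MB/s.
# Both values are guesses you supply; integer division rounds down.
estimate_minutes() {
    echo $(( $1 / $2 / 60 ))
}

# Illustrative comparison (assumed figures, see above):
#   estimate_minutes 307200 15    # full 300 GB copy
#   estimate_minutes 5120 15      # only the 5 GB that changed
```

With those assumed figures the full copy comes out at about 341 minutes and the incremental one at about 5, which is the whole point of pre-loading the new drive from the backup before touching the sick drive.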
You now need to do two final things to test your backup. The first is a very technical item which is to ensure that the date of the kext directory is more recent than the date of the unix image file. The only thing you need to know about this step is that doing this will force MacOS X to rebuild various cache files which will make the first boot very slow, but subsequent boots a whole lot faster. The second is to fix the permissions on it, which will basically get a few permissions right for a boot drive that rsync is unable to copy properly (remember what we said about the boot file BootX or boot.efi being weird).
Let's now go through each step in detail:
rsync from your backup drive to the new boot drive
You will want to run a script like the one below. The important parts, as before, are just a few lines, but it's a whole lot easier to have the script remember to test that the new drive has permissions enabled and that you are running as superuser than to remember this all yourself, especially as you are panicking about your dying hard drive.
As always, you will change the values of the various DEFINES to those appropriate to your installation.
#!/bin/bash
#This script performs recovery from a backup.
# The variables you'd need to set to modify it for your needs are clustered
# below.
#The script must be run as superuser, ie sudo backupScript
#===============================================================================
HOME_DIR=/Users/mjh
RSYNC_LOCAL=$HOME_DIR/bin/rsync
BACKUP_EXCLUDES=$HOME_DIR/bin/backup_excludes.txt
MAIL_ADDR=mjh@bluecloud.com
LOG_FILE=$HOME_DIR/Library/Logs/backup.log
#Using $$ below uses the process ID in the file name and thus makes it unique.
TMP_FILE=/tmp/myBak.$$.txt
SRC_VOL=/Volumes/Backup400GB
DST_VOL=/Volumes/NewBootDrive
RST_NAME=Restore
SRC_DIR=$SRC_VOL/Backups/iMac/1/
DST_DIR=$DST_VOL/
#===============================================================================
ReportErrorAndExit()
{
#This function reports an error.
# It takes a compulsory argument, $1, a string that describes the error, and
# an optional argument, $2.
# If $2 is anything, the error string is only logged to stderr, otherwise
# it is also logged to the log file.
# The error string is also mailed to $MAIL_ADDR.
#Set bash "word" separator to newline only.
# (If we didn't do this, the string argument passed in would not be
# treated as a single $1 argument.)
# Normally you'd want to restore this after you're done, but we're exiting
# at the end of this function so that's not necessary.
IFS=$'\n'
ERROR_STRING="RESTORE $RST_NAME: $1"
if [[ $2 ]]; then
echo $ERROR_STRING >&2
else
echo $ERROR_STRING >&2
echo $ERROR_STRING >> $LOG_FILE
fi
mail -s $ERROR_STRING $MAIL_ADDR </dev/null &> /dev/null
exit 1
}
#===============================================================================
#Test user is root
if [[ `id -u` != 0 ]]; then
ReportErrorAndExit "user is not root" DONT_LOG
fi
#...............................................................................
echo "============================================================" >> $LOG_FILE
echo `date` >> $LOG_FILE
echo "Start restore $RST_NAME" >> $LOG_FILE
#...............................................................................
#Test src volume exists
if [[ ! -d $SRC_DIR ]]; then
ReportErrorAndExit "$SRC_DIR does not exist"
fi
#Test dst directory exists
if [[ ! -d $DST_DIR ]]; then
ReportErrorAndExit "$DST_DIR does not exist"
fi
#...............................................................................
#Force the restore drive to have permissions enabled
#This (helpfully non-documented, no-built in help --- thanks Apple) command will
# force permissions to be enabled for the restore drive.
# http://www.macosxhints.com/article.php?story=20020925051644480
vsdbutil -a $DST_VOL
#Test that the restore drive has permissions enabled; otherwise we
# have the obvious problem of permissions not stored correctly.
PERMISSIONS_ENABLED_ON_BACKUP=`diskutil info $DST_VOL | grep "Owners" | awk '{print $2}'`
if [[ $PERMISSIONS_ENABLED_ON_BACKUP != "Enabled" ]]; then
ReportErrorAndExit "restore drive does not have permissions enabled"
fi
#-------------------------------------------------------------------------------
#Switch off spotlight indexing until after the restore is done
# (otherwise the disk head jumps around writing out spotlight data then rsync data)
#mdutil -i off $DST_VOL
sync
#-------------------------------------------------------------------------------
#Do the actual restore.
INITIAL_SIZE=`df -k $DST_VOL | grep "^/" | awk '{print $4}'`
INITIAL_SECONDS=`date "+%s"`
$RSYNC_LOCAL -axHEy --delete --delete-after \
--delete-excluded --exclude-from=$BACKUP_EXCLUDES \
--ea-checksum \
--stats --progress \
$SRC_DIR $DST_DIR
RSYNC_ERROR_CODE=$?
RSYNC_PHASE=1
#Copy the boot files (BootX on PPC, boot.efi on Intel) in separate passes.
if [[ $RSYNC_ERROR_CODE == 0 ]]; then
BOOT=/System/Library/CoreServices/BootX
if [[ -e $SRC_DIR/$BOOT ]]; then
$RSYNC_LOCAL -aEW --delete --delete-after \
--ea-checksum \
$SRC_DIR/$BOOT $DST_DIR/$BOOT
RSYNC_ERROR_CODE=$?
RSYNC_PHASE=2
fi
fi
if [[ $RSYNC_ERROR_CODE == 0 ]]; then
BOOT=/System/Library/CoreServices/boot.efi
if [[ -e $SRC_DIR/$BOOT ]]; then
$RSYNC_LOCAL -aEW --delete --delete-after \
--ea-checksum \
$SRC_DIR/$BOOT $DST_DIR/$BOOT
RSYNC_ERROR_CODE=$?
RSYNC_PHASE=3
fi
fi
#...............................................................................
FINAL_SECONDS=`date "+%s"`
let DURATION_SECONDS=$(($FINAL_SECONDS - $INITIAL_SECONDS))
let DURATION_HOURS=$(($DURATION_SECONDS/3600))
let DURATION_SECONDS=$(($DURATION_SECONDS-$DURATION_HOURS*3600))
let DURATION_MINUTES=$(($DURATION_SECONDS/60))
let DURATION_SECONDS=$(($DURATION_SECONDS-$DURATION_MINUTES*60))
echo
echo "Restore Duration hr min s =" $DURATION_HOURS $DURATION_MINUTES $DURATION_SECONDS >> $LOG_FILE
if [[ $RSYNC_ERROR_CODE != 0 ]]; then
ReportErrorAndExit "*** rsync reported error $RSYNC_ERROR_CODE in phase $RSYNC_PHASE"
fi
#-------------------------------------------------------------------------------
#Proactively repair the restore disk
#1 Get the device node for the backup volume.
# We will need this later.
DST_VOLUME_DEV=`diskutil info $DST_VOL | grep "Device Identifier: " | awk '{ print $3 }'`
#2 Loop trying to unmount the backup volume.
# This may take a few tries because Spotlight may be busy indexing the volume.
COUNTER=0
while [[ $COUNTER -lt 3 ]]; do
diskutil unmount $DST_VOL &> /dev/null
UNMOUNT_ERROR_CODE=$?
if [[ $UNMOUNT_ERROR_CODE == 0 ]]; then
let COUNTER=3;
else
let COUNTER=$COUNTER+1
echo "Could not unmount. Waiting 60 seconds. Attempt $COUNTER of 3."
sleep 60
fi
done
#3 Once we unmounted successfully, remount the drive
# We should now be cleared to run diskutil repairVolume without problems
# when the repair tries to unmount the volume.
diskutil mount $DST_VOLUME_DEV &> /dev/null
#...............................................................................
INITIAL_SIZE=`df -k $DST_VOL | grep "^/" | awk '{print $4}'`
INITIAL_SECONDS=`date "+%s"`
rm $TMP_FILE &> /dev/null
touch $TMP_FILE
tail -f $TMP_FILE &
echo "dm rv"
diskutil repairVolume $DST_VOL &> $TMP_FILE
REPAIR_ERROR_CODE=$?
kill %1 #Kill the tail command above.
if [[ $REPAIR_ERROR_CODE != 0 ]]; then
cat $TMP_FILE >> $LOG_FILE
rm $TMP_FILE &> /dev/null
ReportErrorAndExit "*** diskutil reported error $REPAIR_ERROR_CODE"
else
rm $TMP_FILE &> /dev/null
fi
sync
#...............................................................................
FINAL_SIZE=`df -k $DST_VOL | grep "^/" | awk '{print $4}'`
FINAL_SECONDS=`date "+%s"`
let CHANGE_IN_SIZE=$(($INITIAL_SIZE - $FINAL_SIZE ))
let DURATION_SECONDS=$(($FINAL_SECONDS - $INITIAL_SECONDS))
let DURATION_HOURS=$(($DURATION_SECONDS/3600))
let DURATION_SECONDS=$(($DURATION_SECONDS-$DURATION_HOURS*3600))
let DURATION_MINUTES=$(($DURATION_SECONDS/60))
let DURATION_SECONDS=$(($DURATION_SECONDS-$DURATION_MINUTES*60))
echo "Repair Duration hr min s =" $DURATION_HOURS $DURATION_MINUTES $DURATION_SECONDS >> $LOG_FILE
echo "Repair Change in size KB MB =" $CHANGE_IN_SIZE \
$(( ($CHANGE_IN_SIZE+512)/1024 )) >> $LOG_FILE
#-------------------------------------------------------------------------------
#Switch spotlight indexing on again.
#mdutil -i on $DST_VOL
echo "============================================================" >> $LOG_FILE
#===============================================================================
This is a pretty trivial modification of the first (local backup) script. Obviously, if you are in a hurry and fairly confident about things, you can omit the code at the end that runs diskutil to check the file system on the new boot volume.
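If you end up maintaining several variants of this script, the duration arithmetic and the unmount retry loop are natural candidates for small helpers. The following is only a sketch of one way to factor them out; `format_duration` and `retry` are hypothetical names that do not appear in the script above.

```shell
# Hypothetical helpers; neither name appears in the original script.

# Print a duration given in seconds as "hours minutes seconds".
format_duration() {
    local total="$1"
    echo "$(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))"
}

# Run a command up to $1 times, sleeping $2 seconds after each failed
# attempt except the last. Returns 0 on the first success, 1 otherwise.
retry() {
    local attempts="$1"
    local delay="$2"
    local n=1
    shift 2
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$(( n + 1 ))
    done
    return 1
}
```

In the script these might be used as `retry 3 60 diskutil unmount "$DST_VOL"` and `echo "Restore Duration hr min s = $(format_duration $(( FINAL_SECONDS - INITIAL_SECONDS )))"`. Unlike the original loop, `retry` reports failure to its caller, so the script could choose to stop if the volume never unmounts.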
rsync from the sick drive
It is obvious, with a little thought, that the script above is, with only very minor modifications, exactly what we want to use to pull data off the sick drive onto the new boot drive. The only modifications needed are
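- change SRC_VOL to / (and SRC_DIR to match, since SRC_DIR is built from SRC_VOL plus the backup path, which does not exist on the sick drive), and
- (if you wish) omit everything in the script after the comment #Proactively repair the restore disk, ie omit all the code to run the file system check on the new drive.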
It may help, while the sick drive does its last rsync, to keep it as cool as possible since the drive may be overheating as it dies. As I mentioned before, the best way to do this is putting a frozen medical icepack against it.
If you can't get everything you want off the drive, if feasible put the drive (in a plastic baggie) in the fridge and let it cool overnight. If it's an external drive, this is obviously easy. If it's a portable this is feasible. If it's a desktop machine, maybe just leave it overnight, but try to keep it as cool as possible. The delay may help, as may the fact that it is now quite a bit cooler, and you can try again. Use common sense, of course --- getting your drive wet or covered with ice is not going to be good for it.
additional steps for a boot drive
At this stage, if the drive that failed was a data drive, you're done. You restored the data from the backup. If you were lucky, you were then able to restore whatever changed data existed from the sick drive.
But if this is a boot drive, you have a few more steps left.
First you want to fix up the boot drive.
Next you want to use the boot drive to boot a known good computer (presumably
the one which you are using to do all these rsyncs).
Finally you will try to use this boot drive to reboot the sick computer.
fix dates (to ensure faster future boots)
Type
sudo touch /Volumes/NewBootDrive/System/Library/Extensions
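If you would like to see whether the touch is needed at all, the dates can be compared first. This is a sketch under two assumptions: that the "unix image file" mentioned earlier is mach_kernel at the top of the restored volume, and that a helper called `needs_touch` (a hypothetical name) is acceptable to you.

```shell
# Sketch: succeed (return 0) when the Extensions directory is NOT newer
# than the kernel file, ie when the touch above is still needed.
# Assumption: the kernel lives at <volume>/mach_kernel.
# $1 = mount point of the restored boot volume.
needs_touch() {
    local ext_dir="$1/System/Library/Extensions"
    local kernel="$1/mach_kernel"
    # -nt is true when the left-hand file has the later modification time.
    if [ "$ext_dir" -nt "$kernel" ]; then
        return 1
    fi
    return 0
}
```

For example, `needs_touch /Volumes/NewBootDrive && echo "run the touch command"`. Running the touch unconditionally is harmless, so the check is purely informational.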
fix permissions on the new boot drive
Either run Disk Utility.app to do this or type
sudo diskutil repairPermissions /Volumes/NewBootDrive
set the good computer to boot off the new boot drive
This is obvious; just use the Startup Disk preference panel.
reboot the good computer
You will probably want to hold down command-v as you reboot. This will cause a verbose boot which will display all sorts of technical information as the boot proceeds. Even if you don't understand this stuff, it is comforting to see that something is happening during the boot. As mentioned, this first boot can take quite a long time. Give it maybe five minutes before panicking that something is wrong.
If anything goes wrong the main thing to try is to power down, power up, and hold down the shift and option keys. As the machine powers up it will look for anything it can boot off --- CDs, hard drives, network and so on, and after a few minutes (this process is fairly slow) it will show you a little list of icons of the various sources available for booting. Choose your new hard drive and continue.
Note that I told you to hold both option and shift. These, in fact, are giving two separate commands to the boot process. The option tells the boot process to search for all possible boot devices. The shift tells the boot to proceed in "safe mode" which, as the name suggests, tries to be very careful (and very slow compared to a normal boot) as it goes through the boot process.
Assuming this works fine, you should use the Startup Disk preference panel to restore booting to whatever you were using for the good computer, power down, and transfer the new drive to the sick computer.
Note that this hard drive, which you've just booted off, is a perfectly good
replication of your previous computer, but it can run (and provide you with
all your previous data) on any new computer.
(There is one practical issue you have to be aware of in all this, namely that
Intel Macs are considered different machines from PPC macs when it comes to
booting, so you cannot take a boot drive from a PPC Mac and use it on an Intel
Mac or vice versa. With Leopard, this may change, but for now it is something
you need to bear in mind.)
reboot the sick computer
Once again power up holding down shift, option and command-v, and with luck, after some time, you'll be presented with the option of booting off your new disk, the boot will proceed just fine, and life is good.
If anything goes wrong, try booting off the boot CD that came with the computer. You want to make sure that the computer is at least capable of booting.
Depending on exactly what went wrong with your hard drive, it is possible that the broken hard drive itself will prevent the computer from booting. When the mac boots, it tests various components at startup. If an internal drive is so dead that it fails this power on self test, then your machine may simply not continue with the boot. (This would appear to be a bug, since there is nothing actually stopping the computer from booting off FireWire, but that's the way things are.)
If this is the case, you will have to remove the hard drive (which is, of course, more or less easy depending on your computer model). Your mac will also probably not boot if there is no internal hard drive attached even if you are booting off an external drive so you may have to replace the sick drive with something. A possible choice is some ancient hard drive of now laughably small capacity that you've been keeping around. Another possibility is a drive that is sick, so not trustworthy for use, but not so sick that it prevents firmware from booting. (The bottom line is keep old drives around; they may come in handy.)
As an example, when the 30GB drive in my TiBook died, the machine would not boot (from CD or otherwise). I replaced the drive with a 2GB from maybe ten years ago, and that was enough to get firmware past the stage of probing buses and on to booting off Firewire.
In the absolute worst case scenario, where you can boot off CD but not off Firewire, perhaps your best bet is to buy yet another new drive, install a clean MacOS X onto that new drive, install all the system updates and so on over that new MacOS X, and then manually copy over whatever you care about from the previously created drive. A hassle, but better than nothing.
If, after this, things still aren't working, well, you can either continue screwing around on your own, you can call a computer-savvy friend, or you can call a professional. Remember, however, that you only have one copy of your most recent backup, the one you just took a few minutes ago. Whatever you do, you do not want to screw that up.
Your new drive works. Now what?
At this stage what to do next is up to you.
The cheapest choice is to run your existing computer off your new drive from this point on. Is this realistic? Well maybe.
As we've mentioned, if the bad drive prevents the computer from booting, somehow you will have to extract it and replace it with a drive that is less bad. Or you'll have to give up on that computer, throw it away, buy a new computer, and use your new drive to copy over whatever data you care about to the new computer.
On the other hand, if the drive is not that dead but is simply bad in some areas, so that it can no longer handle data, then you can run happily off the external drive. You will probably want to rename the bad drive as something like BAD, run the Erase tab of Disk Utility to wipe this bad disk, and then, every time you reboot, run Disk Utility and Unmount the bad volume. (You want to wipe this disk and unmount it because, unmounted, the OS won't touch it and your life is fine. If the disk is mounted and the OS tries, for some reason, to read or write from it, and that read or write fails (as it well might, since this is a bad disk), the rest of the computer will freeze.)
The disk may, depending on how it has failed, occasionally make screeching noises or otherwise behave weirdly, but as long as you never use it (ideally by unmounting it if it ever tries to mount) you should be fine. The worst that can happen is that one day it dies so badly the computer no longer boots, but in that case your external boot drive is still fine and can be moved elsewhere --- all that has died is the internal drive which you will now, as discussed above, have to remove and replace.
This situation of running off an external drive may be acceptable or it may not. Obviously it makes a portable a lot less portable, and a FireWire boot drive is going to be slower than the internal drive of an iMac or a tower Mac. You may prefer to just install a new hard drive, which is easy for a tower mac, I am guessing not too difficult for an iMac, and difficult to damn near impossible on a portable. With a portable, you may just want to accept your fate, give the machine to the kids or a friend or whatever, but run it from this point on as a desktop machine. My server is an old portable whose internal drive died, and it runs fine off a FireWire boot drive.
what to do with the sick drive?
You may have an aversion to throwing things away, especially a drive that you
are not sure is bad or not. However there are ways you can still use a dubious
drive until it absolutely dies.
One possibility is to tie it to another drive of the same size and run the two
as a RAID-1 system. (You can use Disk Utility to create a RAID-1 system.)
Another possibility is to use it to perform hourly backups of rapidly changing
material (that is, of course, backed up in the usual fashion less frequently).
Or you can do what some people do and put swap space and /tmp on such a drive.
The main thing is to make sure that a drive you don't trust is being used only in some sort of redundant fashion, so that if it dies at any point, as you expect it to, it's no great tragedy.
Be Prepared!
When something goes wrong is not the time to be trying out all these steps for the first time. I'd recommend that you run through them all right now or as soon as possible. Check each step:
- recover from the most recent backup to the new boot drive
- rsync from the (supposedly sick, though not really in this rehearsal) computer to the boot drive
- run the touch command to fix dates
- run diskutil to fix permissions
- use System Preferences to set the new boot drive to boot the machine
- reboot (holding down command-V)
- once booted, check that everything on your new boot drive looks just like
it does on the internal drive (the one that is supposedly, not really, sick)
Now is the time to learn how long each of these steps takes and whether there is something wrong in one of the backup scripts, not a year from now when a drive dies for real.
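For the last check in the list, eyeballing the two drives only goes so far. One possible way to compare a directory on the internal drive with its copy on the new boot drive is a recursive diff; the sketch below wraps it in a hypothetical helper, `compare_trees`. Bear in mind that `diff -rq` reads every file on both sides, so on a large tree it will take a while, and that files which legitimately changed after the rsync will show up as differences.

```shell
# Sketch: list files that differ, or exist on only one side, between two
# directory trees. Exit status is 0 when the trees match, non-zero otherwise.
# $1 = original directory, $2 = restored copy.
compare_trees() {
    diff -rq "$1" "$2"
}
```

A rehearsal run might look like `compare_trees "$HOME/Documents" "/Volumes/NewBootDrive$HOME/Documents"`, starting with a directory you care about rather than the whole drive.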

