Backup Software
From Wiki99
Introduction to rsync
Now that we have a hard drive on which to back up, we need backup software. Fortunately, the UNIX world provides us with a fine program for this task, named rsync.
rsync can do many things, but stripped to the essentials, what it does is make one directory (let's call it dst) recursively look exactly like another directory (let's call it src). If dst is empty, rsync is simply a recursive copying program; but if dst already has some files in it, rsync will copy over only those files in src that are not in dst, or that are in dst but differ in some way from their identically named counterparts in src. To make this ability even more useful, either src or dst can be on a remote computer rather than local.
It should be pointed out that when rsync matches one directory with another, it matches all the files exactly. Not only are the contents matched, but also the permissions, dates associated with the file, extended attributes and resource forks, and any other attributes the file may have. Where it makes sense, even multiple hard links representing a file are transferred as hard links, not as multiple copies of the file. Thus an rsync duplication of one hard drive to another gives a more faithful copy than any other copying program I know of, certainly much better than a Finder copy.
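To make the "copy only what is missing or different" idea concrete, here is a minimal Python sketch. It is an illustration of the behaviour described above, not how rsync is implemented: the function name mirror is made up for this example, it compares whole file contents rather than using rsync's own checks, it preserves only the metadata that shutil.copy2 handles, and it never deletes anything from dst.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def mirror(src: Path, dst: Path) -> list[str]:
    """Copy into dst every file under src that is missing from dst or differs.

    Returns the relative paths that were copied. Hypothetical helper:
    unlike rsync it compares full contents and never deletes from dst.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        rel = path.relative_to(src)
        target = dst / rel
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        elif not target.exists() or not filecmp.cmp(path, target, shallow=False):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also copies timestamps and mode bits
            copied.append(rel.as_posix())
    return copied


def demo() -> tuple[list[str], list[str]]:
    """Mirror a small tree twice, editing one file in between."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp, "src"), Path(tmp, "dst")
        (src / "docs").mkdir(parents=True)
        (src / "a.txt").write_text("alpha")
        (src / "docs" / "b.txt").write_text("beta")
        first = mirror(src, dst)    # dst is empty, so everything is copied
        (src / "a.txt").write_text("ALPHA, edited")
        second = mirror(src, dst)   # only the edited file is copied again
        return first, second
```

Calling demo() shows the two cases from the text: the first pass behaves like a plain recursive copy, and the second pass touches only the file that changed.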
Introduction to hard links
Given these abilities of rsync, it is clear that this represents a good basis for a backup scheme, but rsync can go even further to provide multiple backups at a low cost in extra disk space. To understand how this works, you need a good understanding of the concept of hard links.
UNIX file systems distinguish between a file and a name for a file.
The file (more technically, the inode) consists of the file data and
metadata, while a file name points to where this file data and metadata can be
found on the disk. Because of this split, there is nothing to prevent a file
having multiple names, all of which point to the same inode, i.e. to the same
file data.
A hard link is simply one of these names pointing to the inode.
Note that a hard link is different from either a symbolic link or an alias.
Both a symbolic link and an alias point not to an inode, but to another file
name.
Both hard links and symbolic links have their uses. Symbolic links, for example,
can point to files on other file systems, whereas hard links can only
refer to files on the same file system (i.e. on the same partition of the same
hard drive).
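The idea that two names can share one inode is easy to check for yourself. The following is a small Python sketch (the function name hard_link_facts is invented for this example, and the file names are just the ones used elsewhere on this page); it works in a temporary directory, so it touches nothing of yours.

```python
import os
import tempfile
from pathlib import Path


def hard_link_facts() -> dict:
    """Create a file, give it a second name, and report what the OS sees."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp, "myFile")
        original.write_text("version 1")
        second_name = Path(tmp, "hardLinkToMyFile")
        os.link(original, second_name)  # add a second name for the same inode

        # Writing through one name is visible through the other,
        # because there is only one copy of the data.
        second_name.write_text("version 2")

        return {
            "same_inode": original.stat().st_ino == second_name.stat().st_ino,
            "link_count": original.stat().st_nlink,
            "read_via_original": original.read_text(),
        }
```

The link count reported by the file system is the number of names the inode currently has, which is the figure that matters for the backup scheme.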
As a technical point, aliases have some aspects of both symbolic links and hard links. In Tiger, an alias has both a part that behaves like a symbolic link, which is used preferentially, and a part that behaves like a hard link, which is used if the symbolic link part fails to resolve.
What makes hard links especially useful for backups is that, unlike
symbolic links, all hard links are symmetrical.
If you have a file named myFile and you create a symbolic link to it, named
symLinkToMyFile, then delete myFile, the data and metadata for myFile are
gone, destroyed forever. The symbolic link symLinkToMyFile still exists, but
is useless; it just points to nowhere.
However, when you create a hard link to myFile, named hardLinkToMyFile, then
the inode data for myFile is modified to indicate that myFile now has two names.
When you "delete myFile", you are not actually deleting the file; all you are
deleting is one of the names for myFile, and so the inode data for myFile will
be modified to indicate that myFile now has only one name, namely hardLinkToMyFile.
It is only when you delete the last name of a file that the file's inode, data
and metadata are destroyed.
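The difference in behaviour when myFile is deleted can be sketched in a few lines of Python. This is only an illustration in a temporary directory; the function name delete_original is made up, and the file names are the ones from the paragraphs above.

```python
import os
import tempfile
from pathlib import Path


def delete_original() -> dict:
    """Delete myFile and see what its symbolic link and hard link can still do."""
    with tempfile.TemporaryDirectory() as tmp:
        my_file = Path(tmp, "myFile")
        my_file.write_text("precious data")
        sym = Path(tmp, "symLinkToMyFile")
        hard = Path(tmp, "hardLinkToMyFile")
        os.symlink(my_file, sym)   # points at the *name* myFile
        os.link(my_file, hard)     # a second name for the same inode

        links_before = hard.stat().st_nlink
        my_file.unlink()           # removes one name, not the inode

        return {
            "links_before": links_before,
            "links_after": hard.stat().st_nlink,
            "symlink_still_listed": sym.is_symlink(),
            "symlink_resolves": sym.exists(),   # follows the link
            "data_via_hard_link": hard.read_text(),
        }
```

After the deletion the symbolic link is still present in the directory but no longer resolves, while the hard link still reads the original data and the link count has simply dropped by one.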
Using hard links for backup
Consider now the process of backing up. You perform an initial backup and copy some huge number of files to the backup hard drive. A week later you want to perform a second backup. For this second backup, chances are that more than 95% of the files on your hard drive have not changed since the first backup. What you would like, in the directory holding the second backup, is for every file that has changed to be copied over, but for every file that is unchanged, rather than copying the same file again, to simply create a hard link in the current backup directory that points to the inode of the file created in the previous backup directory.
This hard link based strategy means that while our first backup may take, say, 80GB, the second backup, which consists of writing a lot of hard links and a few changed files to the backup hard drive, may take up only, say, 1GB. This, in turn, means that we can afford to maintain many different backups for successive dates. For some purposes, we will only want to restore the most recent backup. But sometimes we may want to go back to a file that was deleted three months ago, and our multi-backup strategy can ensure that we have a copy.
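The strategy can be sketched in Python as follows. This is a toy stand-in for what rsync does for us, under stated assumptions: the function name snapshot and the directory names week1 and week2 are invented for the example, unchanged files are detected by comparing full contents (rsync uses its own, cheaper checks), and the destination directory is assumed not to exist yet.

```python
import filecmp
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot(src: Path, dest: Path, previous: Optional[Path] = None) -> None:
    """Write what looks like a full copy of src into dest.

    Files identical to their counterpart in `previous` are hard-linked to it
    instead of being copied; everything else is copied afresh.
    """
    for path in sorted(src.rglob("*")):
        rel = path.relative_to(src)
        target = dest / rel
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(path, old, shallow=False):
            os.link(old, target)        # unchanged: just add another name
        else:
            shutil.copy2(path, target)  # new or changed: store a real copy


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "src")
        src.mkdir()
        (src / "big.bin").write_bytes(b"x" * 100_000)
        (src / "notes.txt").write_text("first draft")
        week1, week2 = Path(tmp, "week1"), Path(tmp, "week2")

        snapshot(src, week1)                       # initial full backup
        (src / "notes.txt").write_text("second draft, now longer")
        snapshot(src, week2, previous=week1)       # second backup

        def same_inode(name: str) -> bool:
            return (week1 / name).stat().st_ino == (week2 / name).stat().st_ino

        return {
            "unchanged_is_hard_link": same_inode("big.bin"),
            "changed_is_hard_link": same_inode("notes.txt"),
            "old_version_kept": (week1 / "notes.txt").read_text(),
        }
```

In the demo the large unchanged file costs no extra data space in the second backup, the edited file gets its own copy, and the first backup still holds the earlier version.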
Deleting old backups
Note, as already mentioned, that a file's data is destroyed only when the last hard link to it is destroyed. This means that if you are running out of space on your backup hard drive, you can delete, say, the oldest three backups, and all that will be destroyed are a large number of hard links and any old files that exist only in those three backups and not in any of the newer ones. The corollary is that when you delete these three backups you may not free up as much space as you expected because, in truth, you didn't destroy that many files: once again, all you destroyed were a whole lot of hard links and the few files that existed only in those three backups.
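If you want a rough idea of how much space deleting one backup would really free, you can count only the files whose link count is 1. The Python sketch below does that; the function name reclaimable_bytes and the directory names are invented, and it assumes each file has at most one name inside any single backup directory, so treat the result as an estimate.

```python
import os
import shutil
import tempfile
from pathlib import Path


def reclaimable_bytes(backup: Path) -> int:
    """Estimate the bytes freed by deleting `backup`.

    Only files with a single name count: anything with more links is
    still referenced from some other backup and will survive.
    """
    total = 0
    for path in backup.rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        info = path.stat()
        if info.st_nlink == 1:
            total += info.st_size
    return total


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        oldest, newer = Path(tmp, "oldest"), Path(tmp, "newer")
        oldest.mkdir()
        newer.mkdir()
        (oldest / "shared.bin").write_bytes(b"x" * 1000)
        (oldest / "only_old.txt").write_bytes(b"y" * 10)
        os.link(oldest / "shared.bin", newer / "shared.bin")  # kept by newer backup

        estimate = reclaimable_bytes(oldest)
        shutil.rmtree(oldest)                                 # delete the old backup
        survivor = newer / "shared.bin"
        return {
            "estimate": estimate,
            "survivor_size": survivor.stat().st_size,
            "survivor_links": survivor.stat().st_nlink,
        }
```

In the demo only the 10-byte file that existed solely in the old backup is counted as reclaimable; the 1000-byte shared file is still intact in the newer backup after the old one is removed.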
Miscellaneous
Using hard links for backups is described in more detail here. The details given there for how to get rsync to do what is needed are very out of date, but the background material is correct.
With this theoretical background in place, we still need to acquire some basic software, and then write a script to handle various messy details of the backup process.
Note: There is one important aspect of backing up that we are ignoring here, and this is the backing up of MySQL databases. After we've covered MySQL we will discuss this and you'll see why they have to be treated differently from most other files.
Obtaining rsync
The first thing we need to do is get hold of a copy of rsync. The problem we face is that HFS+ has various non-standard metadata items that do not have counterparts in other UNIXes. For this reason, the standard open source code for rsync is less than ideal for use on an HFS+ volume. In theory all the various UNIXes are converging on a common concept of extended attributes to describe metadata, so perhaps in a few years this will no longer be an issue, but for now it means that you cannot just go to the standard rsync web site and download some source code.
Apple, to their eternal shame, shipped, with Tiger, a version of rsync that claimed to support the various HFS+ metadata, but which was so buggy as to be completely useless. As of 10.4.7, the rsync that ships with Tiger is now OK but is not ideal and should be used only if you have no choice.
The ideal rsync to use is the one provided by http://www.onthenet.com.au/~q/rsync/ which has most of the most recent rsync features, but works correctly with HFS+. In particular, the two useful features it supports that are not supported by the Apple rsync are multiple link-dest directories and fuzzy matching. (These features will be explained in time.) Unfortunately (at least when I tried it), neither the PPC version of the rsync supplied there (running emulated) nor the universal version worked on an Intel Mac, so if you have an Intel Mac it seems that for now you are stuck with the Apple rsync and have to give up the fancy features of newer rsyncs.
If you have a PPC Mac and so want to use this improved rsync, download the binary from the web page and store it somewhere on your file system. One possibility is in /usr/local/bin, another possibility is in ~/bin.
Obtaining powershift
We need one more program which will handle renaming our directories after a backup. This program is called powershift.
Go to http://www.math.ualberta.ca/imaging/rlbackup/, look for a file available for download named something like rlbackup-2.20.tar.gz, and download it. (This web page describes a complicated network backup scheme that is of no interest to us; all we care about is one small program that is part of this scheme.)
Use StuffIt or whatever to expand the file you just downloaded. Then, in a terminal window, cd to the directory it expanded into and run make.
If you're lucky, this will build the program we want without errors. If you're unlucky, the makefile for the program has not yet been fixed to work properly under Mac OS X. In that case, open the file Makefile in a terminal text editor
and search down a few pages for two lines that look like
powershift: powershift.cc
	$(CPP) $(OPT) -static -o powershift powershift.cc
Remove the -static so that the lines now look like
powershift: powershift.cc
	$(CPP) $(OPT) -o powershift powershift.cc
Save (ctrl-O) and exit (ctrl-X), then run make again.
This time the make should complete (very fast) and you will find in this directory an executable named powershift which you should move to the same place where you earlier put your copy of rsync.
How the Backup Script Works
We now have all the pieces in place to write a backup script.
The idea is that on the backup hard drive we will create a directory at the root level called Backups. In that directory we will create a directory for each hard drive we plan to back up. For example there might be one directory called Backups/server and another called Backups/portable. In each of these directories, the backups will be named Backups/server/1, Backups/server/2, Backups/server/3 and so on. The most recent backup will be in Backups/server/0 while the backup is in progress. If anything goes wrong, it will remain there, named as Backups/server/0. You can fix whatever went wrong, run the backup script again, and things will continue just fine. But at any given time, the backup in Backups/server/0 cannot be trusted as being complete; it is only after a backup completes successfully that each backup is renamed, so Backups/server/2 will be renamed to Backups/server/3, Backups/server/1 will be renamed to Backups/server/2, and Backups/server/0 will be renamed to Backups/server/1. This is performed by powershift.
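The renaming step on its own can be sketched in Python like this. It is only the plain "shift everything up by one" part described above, not powershift itself; the function name rotate is invented for the example. The one detail worth noticing is the order: the highest-numbered directory has to be renamed first so that no rename lands on a name that is still in use.

```python
import tempfile
from pathlib import Path


def rotate(root: Path) -> None:
    """Rename root/N to root/N+1 for every numbered backup directory."""
    numbers = sorted(
        (int(p.name) for p in root.iterdir()
         if p.is_dir() and p.name.isascii() and p.name.isdigit()),
        reverse=True,  # highest first, so 2 -> 3 happens before 1 -> 2
    )
    for n in numbers:
        (root / str(n)).rename(root / str(n + 1))


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp, "Backups", "server")
        for name in ("0", "1", "2"):
            (root / name).mkdir(parents=True)
            (root / name / "label").write_text(f"was {name}")
        rotate(root)  # run only after the backup into 0 has completed
        return [f"{p.name}: {(p / 'label').read_text()}"
                for p in sorted(root.iterdir())]
```

After rotate there is no directory 0 any more, which is what marks the backup as complete: the freshly finished backup is now 1 and the next run starts a new 0.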
Powershift is actually slightly smarter than just incrementing the number of each backup by 1. What it will do is preserve the most recent n backups, say, 5. For the backups older than the most recent 5, it will preserve them according to an exponentially decaying series. That is, it will preserve backup 5, 5+2, 5+4, 5+8, 5+16, 5+32... The way it numbers the backups as it does this is not always obvious, but that's essentially what it's doing.
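To see what that series looks like in numbers, here is a tiny Python sketch that just lists the backup ages named in the paragraph above. It is not powershift's actual algorithm or numbering (which, as noted, is not always obvious); the function name kept_ages and its parameters are made up, and it assumes the most recent backup is number 1.

```python
def kept_ages(recent: int, oldest: int) -> list[int]:
    """List the backup ages kept: 1..recent, then recent+2, recent+4, recent+8, ...

    `recent` is how many of the newest backups are always kept and
    `oldest` is the highest age we are willing to consider.
    """
    ages = list(range(1, recent + 1))
    step = 2
    while recent + step <= oldest:
        ages.append(recent + step)
        step *= 2  # each gap is twice the previous one
    return ages
```

For example, kept_ages(5, 40) gives [1, 2, 3, 4, 5, 7, 9, 13, 21, 37]: ten directories cover a span of 37 backups, which is why the scheme is so economical for long histories.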
One additional item I add to the backup is that after rsync has copied over modified data and made all the hard links, and after powershift has renamed all the backup directories, I run a (command-line) copy of Apple Disk Utility to scan the backup hard drive and repair any file system inconsistencies. It obviously does us no good to be backing up data to a hard drive that has an inconsistent file system.
So the brief summary of a backup script is:
- Test that we are running as superuser. (Since the backup is supposed to read everything off the hard drive, it needs to run as superuser.)
- Test, as far as is practical, that the source and destination hard drives are connected and visible to the OS. (I keep my backup hard drives switched off when I am not backing up to them. This obviously helps protect them from, say, a rogue program or my clumsiness. This test catches the times when I forget to switch the backup hard drive on before doing a backup.)
- Test that the destination hard drive has permissions enabled. (Apple, by default, does not enable permissions on external hard drives, i.e. hard drives connected via USB or FireWire. This makes life easier if you are moving hard drives between different computers, but for our backup to work properly, the destination hard drive must have permissions enabled.)
- If this is the first backup to the backup directory, I tell rsync to dump a display of each file as it is being handled to the terminal. This is comforting in letting us know that something is happening (especially since the first backup can take a long time, perhaps three or four hours depending on how much data you have on the source hard drive).
- For subsequent backups I don't dump this data to the terminal, since displaying the rapidly streaming list of files takes enough CPU to slow down the backup.
- After rsync is done, we run powershift.
- And after powershift we run Apple Disk Utility.
Along the way we do various things to test for errors, measure how long the backup took and how much disk space it required on the backup drive, and how long it took for disk utility to run and whether it made any modifications to the backup drive.
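The first few steps above can be sketched in Python as follows. This is not the backup script itself, only an outline of its opening logic under stated assumptions: the function names preflight and rsync_command are invented, the permissions check, powershift, Disk Utility and timing steps are left out, and the rsync options shown are only the generic ones (-a for archive mode, -H to preserve hard links, -v for the file listing, --link-dest for the previous backup). Any options needed for HFS+ metadata depend on which rsync you installed and are not shown.

```python
import tempfile
from pathlib import Path


def preflight(source: Path, backup_root: Path, euid: int) -> list[str]:
    """Return a list of problems; an empty list means it is safe to start."""
    problems = []
    if euid != 0:
        problems.append("not running as superuser")
    if not source.is_dir():
        problems.append(f"source {source} is not visible")
    if not backup_root.is_dir():
        problems.append(f"destination {backup_root} is not visible")
    return problems


def rsync_command(source: Path, backup_root: Path) -> list[str]:
    """Build the rsync argument list for a backup into backup_root/0."""
    in_progress = backup_root / "0"
    previous = backup_root / "1"
    cmd = ["rsync", "-a", "-H"]
    if previous.is_dir():
        # Later backups: hard-link unchanged files against the last backup
        # and stay quiet, since the file listing slows things down.
        cmd.append(f"--link-dest={previous}")
    else:
        cmd.append("-v")  # first backup: show each file as it is handled
    # Trailing slashes make rsync copy the contents of source into 0.
    cmd += [str(source).rstrip("/") + "/", str(in_progress) + "/"]
    return cmd


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        source, root = Path(tmp, "src"), Path(tmp, "Backups", "server")
        source.mkdir()
        root.mkdir(parents=True)
        first = rsync_command(source, root)
        (root / "1").mkdir()              # pretend one backup already exists
        later = rsync_command(source, root)
        return {
            "first_is_verbose": "-v" in first,
            "later_is_verbose": "-v" in later,
            "later_links": any(a.startswith("--link-dest=") for a in later),
            "not_root": preflight(source, root, euid=501),
            "all_ok": preflight(source, root, euid=0),
        }
```

In a real script you would pass os.geteuid() as euid and hand the list returned by rsync_command to subprocess.run. Keeping the checks and the command building as plain functions that return values makes them easy to try out without being root and without touching a real backup drive.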
You will probably find it interesting to run Activity Monitor (displaying disk IO) while you are running a backup.
With this in mind, you should now be in a position to understand the logic of the backup script in the next section. After I discuss some of the finer details of this script, I'll present a second script, slightly more complicated, showing how you can back up across the network. After this, you should be able to modify these scripts to achieve your particular backup needs.

