The problem
Our shiny new VI3 setup works really well, but the backup chapter still needs work. I P2V-ed all our Linux boxes to VMs, so the existing rsnapshot file-level backups still run. So far so good.
But, in addition to file-level backups, I also want full VM backups, each day, both on-site and off-site. As a matter of fact, I also want some sort of versioning system, to have multiple full off-site VM backups. I don’t want to install some mega expensive disk array that contains X times the ~900 GB of space all my raw VMs suck up.
What I want is a very simple, efficient and elegant setup, without all kinds of fancy stuff and graphical bells and whistles. I’m running UNIX systems for a living so I’m not afraid of console utilities.
After doing some research I was unable to find a suitable existing solution; the ones that come close are either commercial and expensive, or require too much complicated crap to be installed.
The solution
Our VMware license includes a license for VMware Consolidated Backup (VCB). Being a great company, VMware has plugins and manuals for all the major closed-source, expensive, buggy black-box enterprise backup suites, but documentation about their command line tools is pretty lame and comes down to one lousy console screen of help text.
Luckily, it seems that in order to make full VM backups you actually need just one command (vcbMounter.exe).
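For reference, a single full VM backup with vcbMounter comes down to something like this (the host name, credentials and paths are made-up placeholders, and the exact flags may differ slightly per VCB version, so check vcbMounter's own help output):

    vcbMounter.exe -h virtualcenter.example.local -u vcb-backup -p secret -a ipaddr:webserver01.example.local -t fullvm -r D:\vcb-backups\webserver01

The -t fullvm switch exports the complete VM (vmdk plus vmx files) instead of doing a file-level mount, and -r points at the local directory the backup ends up in.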
Since the open source program rsync has served me really well in the past, I decided to use it again for our VMware backups.
My setup uses two machines (both running Windows Server 2003, as VCB runs only on Windows): one hooked up to our SAN, running the VCB software, and one off-site machine housing the archive. Both machines are modest 1U Supermicro boxes, with 4 x 1 TB SATA in RAID5 on Areca controllers. They are connected via our dedicated WAN link at 100 Mbit/s.
It basically comes down to:
- Full VM backups are created locally with VCB; old backups are first deleted (because VCB refuses to overwrite old backups)
- The new backups are transferred to the remote site efficiently and securely using rsync and OpenSSH
- The off-site server uses Volume Shadow Copy to create a history of full VM backups
Steps 1 and 2 are done using this batch file (rename to .bat/.cmd).
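In outline, such a batch file could look like this; the VM names, paths, host names and ssh key location below are hypothetical placeholders, so adjust them to your own environment:

    @echo off
    REM Step 1: remove yesterday's local backups (VCB refuses to overwrite them)
    rmdir /s /q D:\vcb-backups
    mkdir D:\vcb-backups

    REM Step 1: create fresh full VM backups over the SAN with vcbMounter
    vcbMounter.exe -h virtualcenter.example.local -u vcb-backup -p secret -a ipaddr:webserver01 -t fullvm -r D:\vcb-backups\webserver01
    vcbMounter.exe -h virtualcenter.example.local -u vcb-backup -p secret -a ipaddr:mailserver01 -t fullvm -r D:\vcb-backups\mailserver01
    REM ... one line per VM ...

    REM Step 2: push the backups to the off-site box with rsync over OpenSSH
    REM --inplace updates the existing remote files instead of recreating them
    rsync -av --inplace --delete -e "ssh -i /cygdrive/c/backup/id_rsa" /cygdrive/d/vcb-backups/ backupuser@offsite.example.com:/cygdrive/e/vcb-backups/

Note the cygwin-style /cygdrive paths: the cwRsync build of rsync (see the caveats below) expects those instead of normal Windows paths.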
By using the --inplace option, we actually update the old backup files on the remote server. This is an important detail, because without it each file would be deleted and recreated, thereby killing the efficiency of the VSS part later.
The backup of all our VMs together is about 500 GB (VCB strips out redundant unused space, saving about 400 GB already at this stage).
The link to our remote site is 100 Mbit/s, which even in the most optimistic case moves only about 36 GB/hour (roughly 80% of the theoretical 12.5 MB/s), so synchronizing the full 500 GB would take at least 13-14 hours. In practice it would take even longer, and is thus impractical to use.
Using rsync, only the diffs are sent over the line.
In our situation, with 10 VMs running websites, e-mail, databases, file servers, applications etc., the first results show that the daily diffs are somewhere between 20 and 30 GB. This would theoretically take less than an hour to transport.
The practical situation is a lot different. Although the actual amount of changed data is reasonably small, running rsync with its rolling checksums over half a terabyte of binary chunks also takes hours.
My real-world runs show that the VCB backups themselves take about 1.5 hours to execute, yielding a directory with ~500 GB of backups. This then gets rsync-ed to the remote site, which takes 5-8 hours (as seen during the last week). This is a workable solution for a daily schedule.
The partition that houses the data on the remote server has Volume Shadow Copy enabled, and creates Shadow Copies daily at the appropriate time (30 minutes before the other site initiates the rsync step).
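Shadow copies can be scheduled from the Shadow Copies tab on the volume's properties, or scripted; on Windows Server 2003 something along these lines should do (the drive letter and time are examples, and the /st time format may need tweaking, see schtasks /?):

    REM take a snapshot of the backup volume right now:
    vssadmin create shadow /for=E:

    REM or have the task scheduler do it daily, before the nightly rsync run:
    schtasks /create /tn "Shadow copy E:" /tr "vssadmin create shadow /for=E:" /sc daily /st 23:30:00 /ru SYSTEM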
The following picture shows that we now have 5 full copies available of our 500 GB directory, but instead of an extra 5 x 500 GB = 2.5 TB, it merely takes up an extra 120 GB:
At this stage we’ve got:
- Daily full backups of our VMs on-site
- Multiple full backups of our VMs off-site
Caveats
- To prepare everything, I needed a full copy of the 500 GB tree on both machines. Initially I planned on using rsync and OpenSSH for this as well, but it turned out that the OpenSSH daemon on Windows is very slow. With my systems (dual Xeon 3 GHz etc.) connected via gigabit, the throughput maxed out at about 6-7 MB/sec (Linux to Linux: > 30 MB/sec).
Instead of using rsync/OpenSSH, I simply mounted the disk with CIFS and copied over the whole tree.
Subsequent transfers are already limited to about 12 MB/sec by our uplink speed, but that is not a problem in the real-world scenario.
- I have used the cwRsync package to install rsync and OpenSSH on Windows. OpenSSH with public key authentication between two Windows systems is possible and runs perfectly fine, but setting things up can be a bit hairy, especially if you’re used to UNIX systems…
- To secure things, you should restrict access to the OpenSSH daemon; I have used the built-in Windows Firewall to accomplish this, and it works fine (a command line sketch follows after this list).
- This article describes only half of the story; the other half is called restore. The restore utility that comes with VCB (vcbRestore.exe) is pretty buggy and inflexible: it is hard to restore VMs to a different place or with a different name. As some people have found out, it is possible to use VMware Converter to convert/restore VCB backups to a different system (VMware ESX, Server, Workstation, etc.), but even though VMware now claims Converter can do it, this step still requires manual fiddling with vmx and vmdk files.
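As for the firewall caveat above: the Windows Firewall that ships with Server 2003 SP1 can be configured from the command line as well. A sketch, allowing port 22 only from the IP of the machine that pushes the backups (the address below is a placeholder):

    netsh firewall add portopening protocol=TCP port=22 name="OpenSSH" mode=ENABLE scope=CUSTOM addresses=203.0.113.10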
I have recently updated VMware Converter to 3.0.2u1 build 62456, and now everything works like a charm 🙂
It is installed on the same machine as VCB, so Converter has direct access to the backups. The restore process is very straightforward and easy to understand. The software allows you to change the disk size of the restored VM, the datastore where the VM will be put, and the name. This name is then reflected at low level, so not only the ‘friendly name’, but also the VMDK files get the new name. I have restored several machines and it worked without a glitch. The only downside is that the restore takes place over the network, which is a bit slower than the backup process, which runs over fibre channel. But with gigabit ethernet, restoring a small VM of 4 GB only took a few minutes.
This way of restoring also allows you to restore a VM onto a totally different system. This might come in handy for Disaster Recovery, where you might be forced to revive a VM on VMware Server or even VMware Workstation.