Friday, October 7, 2011

A perpetual SVN backup method to the cloud

I came up with this method a long time ago to provide redundancy and archiving for the SVN servers that I operate. I don't think I've seen a similar solution online, so I thought I would share it, since it works pretty well for me.

Basic Redundancy

The first step, of course, is to make sure that you provide a good amount of local redundancy. This is fairly standard procedure: RAID the drives the SVN repo lives on. Now some people might stop there and say a mirror is good enough, like one company I remember that lost all its data because a disgruntled someone just did a DROP TABLE in MySQL. For those of us who are more paranoid... we do not stop here.

The next step in my redundancy strategy for SVN is to have a mirror server that uses svnsync to maintain a second copy of the repo. If the main server fails, a simple DNS switch and everybody is back in business. The mirror server runs in a perpetual loop: sync, wait a minute, then sync again. I know this is probably a little wasteful because it pings the main server so often, but in practice I've found it much more stable than using post-commit hooks to notify the mirror server. Some online examples of svnsync tell you to call svnsync from a post-commit hook on the main server. I've found that this presents two problems.

The first problem I encountered is that SVN is designed not to send the commit-complete message to the client until the entire post-commit hook has finished. So on particularly large commits, waiting for a sync to the mirror wastes time on the client side.

The other problem is that syncing by sending data from one Apache server to another is extremely slow, which is why I always call svnsync on the mirror server against a UNC file path and not over https.
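For reference, here is a minimal sketch of that sync loop; the drive letters, share name, and repo paths are just placeholders, and the exact file:// UNC syntax may differ on your setup:

@echo off
REM One-time setup (run once on the mirror):
REM    svnsync init file:///E:/svnmirror file://mainserver/svnshare/svndb
REM Perpetual loop: sync, wait about a minute, sync again.
:LOOP
svnsync synchronize file:///E:/svnmirror
REM Crude one-minute sleep; newer Windows versions could use "timeout /t 60" instead.
ping -n 61 127.0.0.1 > nul
goto LOOP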

Archiving to the Cloud

So now I have a redundant main server, and a redundant mirror that lags at most a minute behind the latest revision. Perfect, right? Not yet. Any disaster plan should always have an offsite backup solution. I could use an offsite mirror, but that is cost prohibitive given the size of my repository. It's VERY large. So instead, I archive to the cloud.

The post-commit hook is fairly simple. All I do is write the revision number of the commit to a file on the server. This is so the post-commit hook can exit as quickly as possible, because of the first problem mentioned above. Later at night, the server loops through the directory where all of the day's commits have been recorded and runs a simple batch file on each one.
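A minimal sketch of what that hook looks like; the queue directory is a placeholder (SVN passes the repo path as %1 and the new revision number as %2 to post-commit.bat):

@echo off
REM post-commit.bat: queue the revision number and exit immediately
REM so the client is not kept waiting. E:\svnqueue must already exist.
echo %2 > E:\svnqueue\%2.rev

The nightly batch file that runs against each queued revision: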

@echo off
call setPath.bat
REM Dump just this one revision and compress it.
svnadmin dump E:\svndb -r %1 --incremental | bzip2 > svnarchive.%1.dump.bz

:UPLOAD
echo "Upload dump!"
s3cmd put svnarchive.%1.dump.bz s3://svnarchive/svnarchive.%1.dump.bz

REM Compute the local MD5 of the compressed dump (first 32 chars of md5sum's output).
for /f "delims=" %%v in ('md5sum svnarchive.%1.dump.bz') do set TOOLOUTPUT=%%v
SET LOCAL_MD5=%TOOLOUTPUT:~0,32%
echo "Local md5 = %LOCAL_MD5%"

REM Ask S3 for the MD5 it calculated; the substring offsets below are tied
REM to the layout of s3cmd's "info" output.
s3cmd info s3://svnarchive/svnarchive.%1.dump.bz | grep MD5 > rev%1.txt
set /p TOOLOUTPUT= < rev%1.txt
del rev%1.txt
SET REMOTE_MD5=%TOOLOUTPUT:~14,47%
echo "Remote md5 = %REMOTE_MD5%"
REM Keep re-uploading until the local and remote MD5s agree.
if "%LOCAL_MD5%" == "%REMOTE_MD5%" (
	echo "MATCH!"
	goto END
) else (
	echo "DOES NOT MATCH!"
	goto UPLOAD
)

:END
del svnarchive.%1.dump.bz

Simply put, this batch file takes the revision number as an argument, does an incremental dump of just that revision, and compresses the output with bzip2. Then it uploads the result to S3, compares S3's calculated MD5 with the locally calculated MD5, and if they match, voila! Delete the dump and move on to the next revision. If not, it tries again and again until it succeeds. Using the MD5 that S3 calculates is a simple way to verify the upload while keeping the bandwidth in and out of S3 as low as possible. The advantage of this method is that it is not always safe to copy an SVN repo while it is running, so a simple file-copy backup of a repo is insufficient. You could do an svnadmin hotcopy before running regular backup software on it, but hotcopying a large repo can take a long time. Doing a per-revision backup like this is quick and cheap in comparison.
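For completeness, the nightly loop that feeds revision numbers into the batch file above can be as simple as this; the queue directory and the script name archiveRev.bat are placeholders:

@echo off
REM Archive every revision queued by the post-commit hook during the day,
REM then remove the queue entry once the archive script returns.
for %%F in (E:\svnqueue\*.rev) do (
	call archiveRev.bat %%~nF
	del "%%F"
)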


Additional Steps

In addition to this, regularly running integrity checks is always a good idea. I can run an integrity check on any revision by re-dumping that revision and checking its MD5 against the one in S3. At the end of each week, I re-verify the MD5 of every revision committed that week. Yearly, I also do a full verification from revision 1 to N. If any differences are discovered, it's time to check the mirror and see which copy is right, because the corruption could have come from the main repo or from S3! In the event of a catastrophic failure that takes out both the main and the mirror server, I also have a recovery script that grabs every single revision from the bucket and does an svnadmin load into a new repo. Something like:

REM Pull every archived revision from %1 to %2 back out of the bucket, in order,
REM and load each one into the new repository.
for /L %%V in (%1,1,%2) do (
	s3 get svnarchive/svnarchive.%%V.dump.bz /nogui
	bunzip2 svnarchive.%%V.dump.bz
	svnadmin load svndb < svnarchive.%%V.dump
	del svnarchive.%%V.dump
)
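
And the weekly and yearly re-verification mentioned above can reuse the same dump-and-compare trick as the upload script, just without the upload. A rough sketch, assuming the same paths and bucket as before, and relying on svnadmin and bzip2 producing the same bytes they did when the revision was first archived:

@echo off
call setPath.bat
REM Re-dump revision %1 and compare its MD5 against the MD5 S3 has on record.
svnadmin dump E:\svndb -r %1 --incremental | bzip2 > verify.%1.dump.bz
for /f "delims=" %%v in ('md5sum verify.%1.dump.bz') do set TOOLOUTPUT=%%v
set LOCAL_MD5=%TOOLOUTPUT:~0,32%
s3cmd info s3://svnarchive/svnarchive.%1.dump.bz | grep MD5 > verify%1.txt
set /p TOOLOUTPUT= < verify%1.txt
del verify%1.txt
set REMOTE_MD5=%TOOLOUTPUT:~14,47%
if "%LOCAL_MD5%" == "%REMOTE_MD5%" (echo Revision %1 OK) else (echo Revision %1 MISMATCH - check the mirror)
del verify.%1.dump.bz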


Limitations

Amazon allows an unlimited number of files per S3 bucket, so theoretically this archival system could go on forever, but S3 has a file size limit of 5GB per object. Anyone who commits 5GB at a time to SVN needs to be shot.

The cloud is not always safe. There have been a couple of reports of corruption and failures with the S3 service, but for the most part it has been reliable for me.

Restoring from S3 could take a while, depending on my downstream bandwidth, but if BOTH the mirror and the main server went down, I think I have bigger problems, like the big one finally hitting Southern California, or someone burning down the building.

This has worked fairly well for me for quite a while now, and I hope it gives someone else ideas on cost-effective backups, redundancy, and archiving. I do know that people recommend having multiple repositories, but the way we have our projects set up, it's much easier to keep them in one repo.