Tuesday 20 August 2013

Using Glacier for long-term backups

Recently I've been trying to find a new home for 3TB of old EBS data to save on cost. S3 was a consideration, although it's nearly as expensive, and my past experiences mounting it via FUSE haven't always been great. Another option was to 'bring the data home', but keeping 3TB of old data on expensive hardware didn't seem worth it.

Amazon Glacier is something we'd been meaning to look at for a while, so it seemed the perfect time to give it a go.

I found a super guide to using Glacier on blog.tkassembled.com, although for our RHEL-based Linux instances it required some tweaking.

To install Glacier-Cmd on a RHEL-based Linux instance:

# yum install python-setuptools git
# pip-2.6 install boto --upgrade
# git clone git://github.com/uskudnik/amazon-glacier-cmd-interface.git
# cd amazon-glacier-cmd-interface
# python setup.py install
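
One caveat with the steps above: they assume pip-2.6 is already on the box. If it isn't, easy_install (which comes with the python-setuptools package from the yum line) can bootstrap it before the boto upgrade - a hedged extra step, as some images ship pip already:

# easy_install pip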

I also found that splitting the data up into 200MB chunks, as advised by Amazon, took an age using gzip, so I ended up using pigz (a parallel gzip) instead:

# BACKUP_TIME="$(date +%Y%m%d%H%M%S)"
# tar cvf - /mnt/s10_1 | pigz -4 | split -a 4 --bytes=200M - "s10_1.$BACKUP_TIME.tar.gz."
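
The reverse of that pipeline is worth keeping to hand for when the chunks eventually come back out of Glacier. A minimal sketch, assuming all the pieces for one backup have been downloaded into the current directory:

# cat "s10_1.$BACKUP_TIME.tar.gz."* | pigz -d | tar xvf -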

The format of the ~/.glacier-cmd config file has now changed too - here is mine:

[aws]
access_key=REMOVED
secret_key=REMOVED

[glacier]
region=us-east-1
logfile=~/.glacier-cmd.log
loglevel=INFO
output=print
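
With the config in place, pushing the chunks up is just a loop over the split files. Roughly what I ran - the vault name is illustrative, and the exact sub-commands and options may differ between Glacier-Cmd versions, so check glacier-cmd --help:

# glacier-cmd mkvault s10-archive
# for f in s10_1.$BACKUP_TIME.tar.gz.*; do glacier-cmd upload s10-archive "$f" --description "$f"; done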


Generally I've been impressed with Glacier, and Glacier-Cmd is a great little tool. Others I've heard about but not yet used are MT-AWS-Glacier and Glaciate Archive.

I should note that it took a good week and a half to split, compress and upload 3TB of data to Glacier using a c1.medium instance in EC2. Glacier certainly isn't 'fast archive', and if you are looking to upload or download content quickly then this isn't for you - even a simple listing of the contents of a Glacier vault can take 4+ hours.
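
That listing is itself an asynchronous job: you ask Glacier for an inventory, wait for the job to complete (typically several hours), and only then fetch the result. With Glacier-Cmd it looks roughly like this - again, sub-command names as I recall them from its README, so verify against glacier-cmd --help:

# glacier-cmd inventory s10-archive
# glacier-cmd listjobs s10-archive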
