Angry Bits

Words on bytes and bits

Backup and Restore

31st March is apparently World Backup Day.

Curiously, a few days ago we faced a disaster on a server we use for some of our services (some static websites and some internal services like source code repositories, a wiki, etc.). Both of its drives failed (we are using RAID 1).

Fortunately we have been very careful about backups since day 1. Backups saved us multiple times in the past, and not doing them made me lose a lot of data in my early computing years, so we know how important they are.

Still, a hard drive failure is a pain in the neck, especially for centralized, non-replicated services, because of the downtime required to restore them.

So what happens when a disaster like this hits your server?

Five stages of grief

Before starting the recovery, you must accept the death of your hard drives.

Phase 1: Denial

The SkiceLab main website was not accessible: it was not loading the stylesheets for some reason. I logged into the server and found out there were some HD errors. I could not believe it. HD errors bring nothing but annoyance.

Phase 2: Anger

On the very same day of the disaster, Hetzner Robot was not accessible due to some unrelated issues. I couldn't ask for assistance, and that made me mad. We called them by phone and they told us the service would be restored within an hour.

Phase 3: Bargaining

We had to wait much longer. In the meantime I could analyze our drives, looking for some hope. Both drives were raising errors, but I still believed only one of them was corrupted.
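
For the record, these are the kinds of checks involved, as a minimal sketch. It assumes Linux software RAID (mdadm) and smartmontools; the device names are just examples:

    cat /proc/mdstat                    # RAID status: [U_] or [_U] means a degraded array
    smartctl -H /dev/sda                # overall SMART health verdict for the first drive
    smartctl -a /dev/sda | grep -i -E 'reallocated|pending|uncorrectable'
    smartctl -H /dev/sdb                # same checks for the second drive
    dmesg | grep -i -E 'ata|i/o error'  # kernel-level I/O errors logged for the drives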

Phase 4: Depression

During this phase I tried to do a full backup instead of relying on our partial backups. I didn't want to lose time re-installing the system, and the boredom of the system administration tasks required to make things work the same way as before made me sad and desperate at the same time.

The Hetzner backup server was really slow, though, and it literally took a day before I was able to fully mirror our services. Even then, some of the drive corruption kept us from getting a perfect mirror.

Phase 5: Acceptance

OK, some minimal amount of downtime had to be accepted in order to restore all the services. Before doing a full restore I decided to at least bring back the most essential one, the source repositories, using the partial backup (the one with the cannot-lose stuff) and a VM on Azure (which is free for startups subscribed to Microsoft's BizSpark program).

This could have been done from the beginning. We decided to move the services that should always be accessible to cloud services, to make them easier to restore and replicate in the future. We already use cloud services for all the customer-facing services for the same reason; we just realized that we need to take more care of our own too.

Restore

Most backup-related articles on the web talk a lot about backups and very little about restore. Obviously backup is crucial, but restore is an important topic to discuss too.

As I said, we made both a partial and a full backup, but the second one was corrupted because it was taken only after the RAID failure. So, how do you restore a service after a disaster? These are the steps we followed (a condensed shell sketch of them comes right after the list):

  1. Install a minimal system. Hetzner has a very useful rescue system to install an OS from scratch: you just need to set up the partitions and wait a couple of minutes.

  2. Copy the whole backup to a local directory (e.g. /root/backup) and configure ssh so you can connect after reboot (copy your public key to /root/.ssh/authorized_keys and run chmod 0700 /root/.ssh && chmod 0600 /root/.ssh/authorized_keys), then reboot.

  3. Restore /etc/. Skip /etc/fstab: the partition UUIDs could be different and you might have changed the partitioning (in my case I changed the root filesystem to XFS instead of EXT4, and I also gave more space to the LVM group).

  4. Restore home directories and ssh keys. You might want to restart ssh so that you get back the same security policy you were using before the disaster (i.e. password login disabled, root login disabled, etc.). You can then connect with your own user instead of root. Note that after you have restored /etc/passwd, /etc/shadow, /etc/group, the home directories and the sudo configuration, all the previous users of the system should work again, without a reboot.

  5. Restore all the data used by your programs: /var/www/, /var/lib/lxc, etc.

  6. Restore custom scripts. I had all of them located in /usr/local/bin.

  7. Re-install the packages. You should have saved the state of your package system. In my case I had only saved the /var/lib/dpkg/status file; it's a suboptimal solution, but I was still able to recover everything:

    1. restore /var/lib/dpkg/status
    2. dpkg -l | grep '^ii ' | awk '{print "apt-get --reinstall -y install", $2}' > /tmp/reinstall
    3. run sh /tmp/reinstall twice (the first run will install everything, but with some issues)

    Please note that if the new system has some packages installed that were not in your previous system, restoring the status file makes dpkg forget about them, and their files will remain on the drive as garbage. In my case it was just dnsmasq: I installed it again and then removed it.
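
To make the sequence more concrete, here is the condensed shell sketch mentioned above, covering steps 2 to 7. It assumes the backup sits under /root/backup and mirrors the old filesystem layout; the rsync invocations and exact paths are illustrative, not literally what we ran:

    # 2. ssh access after reboot (the public key could also come from your workstation)
    mkdir -p /root/.ssh
    cp /root/backup/root/.ssh/authorized_keys /root/.ssh/authorized_keys
    chmod 0700 /root/.ssh && chmod 0600 /root/.ssh/authorized_keys

    # 3. /etc, keeping the freshly generated fstab
    rsync -a --exclude fstab /root/backup/etc/ /etc/

    # 4. home directories, then pick up the restored sshd policy
    rsync -a /root/backup/home/ /home/
    service ssh restart

    # 5. and 6. application data and custom scripts
    rsync -a /root/backup/var/www/ /var/www/
    rsync -a /root/backup/usr/local/bin/ /usr/local/bin/

    # 7. packages: restore the dpkg status file, then reinstall everything twice
    cp /root/backup/var/lib/dpkg/status /var/lib/dpkg/status
    apt-get update
    dpkg -l | grep '^ii ' | awk '{print "apt-get --reinstall -y install", $2}' > /tmp/reinstall
    sh /tmp/reinstall && sh /tmp/reinstall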

As we use Linux Containers (LXC) for most of our subsystems, I had to do the following additional steps:

  1. Restore the LVM partitions (I had a backup of the LVM partition info and full backups of all containers -- the full backups worked for them, otherwise I would have had to repeat some of the steps above for each container). See the sketch after this list.
  2. Ensure all of them can start and work properly.
  3. Ensure your web server starts correctly and forwards requests to the proper LXC container.
  4. Reboot (just to be sure everything works after a reboot ;-) ).
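
Again as a rough sketch only: this is roughly what the container part looks like if the logical volumes are recreated by hand from the saved LVM info and each container has a full backup under /root/backup/lxc/<name>. The volume group, container name and size below are made up, and the directory-backed container layout is an assumption:

    lvcreate -L 20G -n wiki vg0                 # recreate one LV per container
    mkfs.ext4 /dev/vg0/wiki
    mount /dev/vg0/wiki /var/lib/lxc/wiki
    rsync -a /root/backup/lxc/wiki/ /var/lib/lxc/wiki/

    lxc-start -n wiki -d                        # start the container in the background
    lxc-info -n wiki                            # check that it is actually running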

Easy, isn't it? It looks quick too... it was not! Restoring the data from the Hetzner backup was incredibly slow, probably due to a connection issue between our host and the backup host.

Conclusions

We had changed some things in the monitoring system recently and hadn't tested them, so the error reporting system did not notify us of the RAID failure. I don't think the drives crashed together on the same day, but because of the missing notifications we were not able to take the proper recovery actions in time. Always check your monitoring and notification system.
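
For what it's worth, even a crude check would have caught this. A minimal sketch, assuming Linux software RAID and a working local mail setup (the cron file and mail address are made up; mdadm's own --monitor mode is the proper tool for this job):

    # /etc/cron.d/raid-check -- mail the admin if any md array reports a failed member
    */30 * * * * root grep -q '\[.*_.*\]' /proc/mdstat && echo "RAID degraded on $(hostname)" | mail -s "RAID failure" admin@example.com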

We had bad luck with the Hetzner support system, which was down the very morning of the disk outage. Shit happens; we need to take it into account.

The Hetzner backup system is not that bad, but it's not that good either. The incredibly slow connection made the disaster even more painful.

VMs in the cloud are more reliable and less disaster-prone than the old hosting solution. We still need dedicated hosting because it offers better hardware for testing at a convenient price, but we will move all the indispensable services to the cloud (after all, Azure is free for 3 years; hopefully after that time your startup will be able to pay for cloud hosting). Even without replication, proper snapshots of the VM images let you recover your internal services in no time.

Please note, even with VMs, you should still do your data backups.

backup sysadm
