From Cloud to Tape and Back. Google’s Gmail outage.
I’ve been reading through a bunch of articles about the Gmail incident, and two interesting points jumped out at me that I thought I would write about. The first is not really new to me: the trust that must exist between a customer and a cloud provider when the customer hands over its most important data. The second is Google’s use of tape backups.
Here is some quick background: about 34,000 users woke up on Sunday, February 27th to find that their Gmail accounts had been completely wiped clean. All messages, contacts and chat logs were gone.
“Imagine the sinking feeling of logging in to your Gmail account and finding it empty. That’s what happened to 0.02% of Gmail users yesterday, and we’re very sorry. The good news is that email was never lost and we’ve restored access for many of those affected. Though it may take longer than we originally expected, we’re making good progress and things should be back to normal for everyone soon.” (via Google’s Gmail Blog)
Ouch. I don’t care how smart you may think you are; if you were a user who got hit with that on Sunday, I imagine you felt pretty helpless. At that moment, all you could really do was scour news sites, forums and groups, hoping to find out that you weren’t alone and that Google was actively trying to recover your mailbox. It turns out that this time the cause was a storage software update that introduced an unexpected bug. This brings me to the tapes, which are the part I found most interesting.
Google was forced to begin restoring mail from its TAPE backups. Virtual Tape Libraries (VTLs) and disk-to-disk backups are all the rage (especially if you are free from regulatory constraints), but in this particular instance it was tape to the rescue. I can only guess that the bug wiped the data and that the deletion was replicated across the many different sites and disk backups that Google maintains for its Gmail infrastructure. I also assume that, since they are restoring from tape, some users will experience some data loss (at least for mail that arrived between the completion of the tape backup and the corruption). Tape is also the reason it is taking days rather than hours to restore the mailboxes completely.
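To make that guess a bit more concrete, here is a toy Python sketch of the failure mode I have in mind. It is purely illustrative and is not a model of Google’s actual systems: the `ReplicatedStore` class, its methods, and the "alice" mailbox are all hypothetical names I made up. It assumes live replicas faithfully copy every change, including a bad delete, while an offline copy taken earlier is untouched.

```python
import copy


class ReplicatedStore:
    """Toy mailbox store (hypothetical): every change hits all live replicas."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, user, messages):
        # Writes are copied to every live replica.
        for replica in self.replicas:
            replica[user] = list(messages)

    def delete(self, user):
        # Deletes are copied just as faithfully as writes.
        for replica in self.replicas:
            replica.pop(user, None)

    def snapshot(self):
        # Stand-in for an offline (tape) backup: a detached point-in-time copy.
        return copy.deepcopy(self.replicas[0])

    def restore(self, offline_copy, user):
        # Bring back whatever the offline copy held for this user.
        self.put(user, offline_copy[user])


store = ReplicatedStore(replica_count=3)
store.put("alice", ["msg1", "msg2"])
tape = store.snapshot()                       # backup completes here
store.put("alice", ["msg1", "msg2", "msg3"])  # mail arriving after the backup
store.delete("alice")                         # the buggy update wipes the mailbox

# No live replica still holds the mailbox, because the delete was replicated.
survivors = [r for r in store.replicas if "alice" in r]

store.restore(tape, "alice")
restored = store.replicas[0]["alice"]
print(len(survivors), restored)
```

In this sketch the restored mailbox is missing "msg3", which is the gap between the backup and the corruption that I am assuming some users will see.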
For whatever reason, I just didn’t think tape would have been part of the Google infrastructure. It seems like such old tech for a company that has had some very innovative datacenter ideas, including modular trailers, grid computing and decentralized power supply backups.
That said, I’m sure at least 34,000 people are glad tape is still alive and kicking at Google.