A series of lightning strikes near a Google data center in Belgium has resulted in the permanent loss of some information stored on the affected disks, various media outlets are reporting.
According to CNN and BBC News, a power grid located near the data center was hit four times during an electrical storm, causing a brief power outage. Automatic backup systems quickly restored power, and the majority of the data was recovered, but a minuscule fraction was reportedly permanently affected.
In an online incident report, Google said that the lightning strikes caused errors to occur on “a small proportion of Google Compute Engine persistent disks” in part of western Europe between Thursday, August 13 and Monday, August 17. Affected disks “sporadically returned I/O errors to their attached GCE instances, and also typically returned errors for management operations.”
Google also said that in “a very small fraction of cases,” or less than 0.000001 percent of the disk space in the region, there was “permanent” data loss. “Google takes availability very seriously, and the durability of storage is our highest priority. We apologize to all our customers who were affected by this exceptional incident,” the company said.
Taking steps to deal with similar issues in the future
Representatives at French startup Azendoo told CNN that the company’s services were down for 12 hours, and that while Google recovered a small portion of its data, Azendoo had to recover most of the information itself from backups stored in another data center.
Google said it had investigated the incident and identified several hardware and software issues that contributed to it, and that it was in the process of improving these aspects of its services. The company also told BBC News that it was working to improve its response procedures to make future losses less likely.
“Google has an ongoing program of upgrading to storage hardware that is less susceptible to the power failure mode that triggered this incident,” the company explained in its report. “Since the incident began, Google engineers have conducted a wide-ranging review across all layers of the datacenter technology stack, from electrical distribution systems through computing hardware to the software controlling the GCE persistent disk layer.”
“Several opportunities have been identified to increase physical and procedural resilience,” the firm added. Among them are continuing to upgrade hardware to improve cache data retention during a power outage, developing techniques to increase data durability, and improving engineers’ response procedures for future outages.