How a routing misconfiguration caused the Facebook outage

By ediallo - 5 October 2021 - 15:10

Facebook and its affiliated services stopped working for nearly six hours following a misconfiguration of the company's backbone routers. The outage, historic in its duration, comes at a tense time for the company, which a former employee accuses of lying about its moderation policy.

Facebook and its various products (Facebook, Messenger, Instagram, WhatsApp, Oculus and Workplace) suffered a total outage lasting nearly six hours.

It affected users around the world, with very uneven consequences: in some countries, where SMS is still very expensive, these social networks serve as the main channels of communication.

The incident began shortly before 4pm French time, according to user reports on the specialized site Downdetector, which showed that the company's websites had become unreachable.

Error messages were displayed, referring to domain names that could not be found. Mobile applications displayed no error message but stopped refreshing, and were likewise unable to reach Facebook's domains. The outage was finally resolved at midnight French time.

The company's internal systems were also affected by the outage. Some employees were unable to enter the company's headquarters because their badges had stopped working, a New York Times reporter said on Twitter. Employees also reportedly relied heavily on SMS and Microsoft Outlook email to communicate.

A CHANGE IN CONFIGURATION AT FAULT

During the outage, Facebook communicated very little. It was not until the end of the incident that the Menlo Park company gave more details in a blog post signed by Santosh Janardhan, VP Engineering and Infrastructure.

Putting an end to speculation, he explained that the failure was caused by "a faulty configuration change", and more precisely "a configuration change on the backbone routers that coordinate the network traffic between our data centers".

To fully understand this incident, we need to go back to the way the Internet works as an infrastructure. As its name implies, the Internet is a set of interconnected networks that allow client terminals and servers to communicate efficiently using a common communication protocol (the Internet Protocol, abbreviated as IP).

Domain names play a fundamental role in this infrastructure: they give a resource (for example, a website) an intelligible and easily remembered name, which is then translated into the IP address needed to reach it.
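To make the name-to-address translation concrete, here is a minimal sketch in Python that asks the local resolver for the addresses behind a name. It relies only on the standard `socket` module; the function name `resolve` is an assumption chosen for this illustration, and the sketch says nothing about how Facebook's own systems work.

```python
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Return the IP addresses the local resolver finds for hostname.

    An empty list means the name could not be resolved, for instance
    because no DNS server gave a usable answer.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Raised when the lookup fails (unknown name, unreachable servers...).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address in text form.
    addresses: List[str] = []
    for info in infos:
        address = info[4][0]
        if address not in addresses:
            addresses.append(address)
    return addresses
```

Calling `resolve("example.com")` on a connected machine should return one or more addresses, while a name whose DNS servers cannot be reached yields an empty list, which is roughly the situation browsers translated into a "domain not found" style error.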

THE BORDER GATEWAY PROTOCOL AT THE HEART OF THE FAILURE

More specifically, during the outage, the authoritative DNS (Domain Name System) servers for Facebook.com (those that hold the reference data for the domain) were not responding, reports Stéphane Bortzmeyer, system and network architect at Afnic (Association française pour le nommage Internet en coopération), on his blog.

This is the reason why the websites of Facebook and its subsidiaries displayed an error message related to the domain name.

This failure is the consequence of a faulty operation involving the Border Gateway Protocol (BGP). A cornerstone of how the Internet operates, BGP is a routing protocol used to exchange routing and network reachability information between Autonomous Systems (AS).

An AS is a very large network or group of networks under the control of a single entity with a consistent internal routing policy. Every computer or device that connects to the Internet is connected to such a system.

This protocol lets these large systems learn the routes that connect them to other networks. By incorrectly changing the configuration of its routers, Facebook withdrew the routing information that allowed the rest of the Internet to reach its servers, making them invisible and causing this massive outage.
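The effect of a disappearing route can be sketched with a toy routing table. This is a deliberately simplified model, not an implementation of BGP: the class `RoutingTable`, its method names, the prefixes (taken from address ranges reserved for documentation) and the AS numbers (also from a documentation range) are all assumptions made for the example.

```python
import ipaddress
from typing import Optional


class RoutingTable:
    """Toy view of the routes one network has learned from its neighbours."""

    def __init__(self) -> None:
        # Maps an announced prefix to the number of the AS that originated it.
        self.routes = {}

    def announce(self, prefix: str, origin_as: int) -> None:
        """Record that origin_as says it can deliver traffic for prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix: str) -> None:
        """Forget a previously announced prefix (no error if unknown)."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[int]:
        """Return the AS for the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", origin_as=64500)     # hypothetical prefix hosting DNS servers
table.announce("198.51.100.0/24", origin_as=64501)  # unrelated network

before = table.lookup("192.0.2.53")
table.withdraw("192.0.2.0/24")                      # the faulty change removes the route
after = table.lookup("192.0.2.53")
print(before, after)  # 64500 None
```

Once the prefix is withdrawn, the lookup finds no route at all: the servers may still be running, but other networks no longer know where to send packets for them.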

AN ALREADY DIFFICULT CONTEXT FOR FACEBOOK

The incident, historic in its duration, sent Facebook's share price down. At the close of Wall Street, it had fallen 4.89% to $326.23. The U.S. company also lost about $545,000 in advertising revenue per hour during the outage, according to Bloomberg estimates.
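As a back-of-the-envelope check on these figures, the short calculation below multiplies the hourly estimate by the length of the outage and derives the share price implied before the fall. It assumes a full six hours (the outage lasted slightly less, so the total is an upper bound) and that the 4.89% drop is measured against the previous close; the variable names are chosen for this illustration only.

```python
HOURLY_AD_REVENUE_LOSS = 545_000  # dollars per hour, Bloomberg estimate
OUTAGE_HOURS = 6                  # "nearly six hours": treated as an upper bound
CLOSING_PRICE = 326.23            # dollars, at the close of Wall Street
DROP = 0.0489                     # 4.89%, assumed relative to the previous close

total_ad_loss = HOURLY_AD_REVENUE_LOSS * OUTAGE_HOURS
previous_close = CLOSING_PRICE / (1 - DROP)

print(f"Advertising revenue lost: at most about ${total_ad_loss:,}")
print(f"Implied previous close: about ${previous_close:.2f}")
```

Under these assumptions the lost advertising revenue comes to at most about $3.27 million, and the implied previous closing price is about $343.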

Most importantly, the outage came at a tense time for the U.S. company. Frances Haugen, a whistleblower and former product manager at Facebook, gave an interview to the American channel CBS in which she accused Facebook of having chosen "profit over safety".

Haugen, who is the source of the Facebook Files, a series of damning documents published in the Wall Street Journal, claims that the social network is lying about its policy to fight hate and disinformation. She is to be heard this Tuesday by the U.S. Congress. She has also sent the documents in her possession to several European governments, including France.

Facebook is also under strong regulatory pressure, especially from Brussels. European Commissioner Thierry Breton took advantage of the outage to point out in a tweet that "in the global digital space, everyone could suffer a shutdown".

In the midst of discussions on the DSA/DMA package, he reiterated his ambition to provide "Europeans with a better digital via regulation (...)". Note, however, that regulating large platforms will not prevent a misconfiguration from occurring.

ALICE VITARD