How Do Facebook And Google Manage Software Releases Without Causing Major Problems?


(Source: I used to be one of several engineers designated as "pushers"[1] at Facebook around 2008-2009. The more hazardous portions of this job have since rightly been entrusted to people dedicated to the release process.)

Facebook was built from the beginning around the idea of zero downtime. In my 4.5 years there, I can only remember a handful of incidents (one caused by me) where there was a widespread site disruption[2]. Pushing multiple releases daily without causing downtime shaped much of what we did within engineering for the web app.

Pushing in multiple phases.

We broke pushing new code into several "phases" - each designed to catch different problems. As of 2011, the phases were:

  • latest - this wasn't so much a part of deployment as a version of the site running the latest code at all times. Employees would use this site and find any major bugs almost instantly.
  • p1 - A handful of servers that would be the first to run the new code in production. The goal was to catch any obvious fatals or warnings in the logs before the release gained wide distribution. Any engineers with code in the release would gather in IRC and watch the logs as this went out.
  • p2 - We'd then push to a larger set of servers on the web tier. The number of servers in p2 increased over time, but I believe it hovered around 5%. This offered several opportunities, including catching long-tail fatals and monitoring CPU, memory, memcache fetches, DB queries, external service use, and key user metrics on those servers for any anomalies.
  • p3 - the entire web tier. At this point, people were pretty confident in the release; it was rare that anything major happened here that hadn't been caught beforehand.

There were several more phases for internal tiers, but that's mostly unrelated to the question.
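
To make the progression concrete, here's a minimal sketch of a phased-rollout driver in Python. This is an illustration, not Facebook's actual tooling; push_release, error_rate, the phase fractions, and the thresholds are all invented:

    import time

    # Fraction of the web tier running the new code at each phase
    # (illustrative numbers, not Facebook's real configuration).
    PHASES = [
        ("p1", 0.001),  # a handful of servers
        ("p2", 0.05),   # roughly 5% of the tier
        ("p3", 1.0),    # everything
    ]

    def push_release(host, release):
        # Stand-in for whatever actually ships code to a server.
        pass

    def error_rate(hosts):
        # Stand-in for tailing the fatal/warning logs on these hosts.
        return 0.0

    def deploy(release, servers, soak_seconds=300):
        """Widen the rollout phase by phase, watching logs in between."""
        for name, fraction in PHASES:
            batch = servers[: max(1, int(len(servers) * fraction))]
            for host in batch:
                push_release(host, release)
            time.sleep(soak_seconds)  # engineers watch the logs here
            if error_rate(batch) > 0.01:
                raise RuntimeError(f"aborting: {name} looks unhealthy")

    deploy("r2011-06-14", [f"web{i:04d}" for i in range(2000)], soak_seconds=0)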

Multiple versions of code running simultaneously.

Given Facebook's release system, all code had to be written with the idea that it would execute alongside previous releases. This made some things more difficult, like changing data stores. For example, moving a field from one MySQL table to another took five steps (sketched in code after the list):

  1. Introduce double writing.  Both the old and new table would need to be written to on any update.
  2. Push new release.
  3. Run a migration to copy all the old data over to the new format.
  4. Switch reads from the old system to the new system, remove old read/write code.
  5. Push new release.
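
Here's a minimal sketch of steps 1 and 4 in Python. The table and column names and the db handle are invented; the point is that the old release and the new release can both run safely at every step:

    def update_address(db, user_id, address):
        # Step 1: double writing. Servers still on the old release read
        # users_legacy, so every update must land in both places.
        db.execute(
            "UPDATE users_legacy SET address = %s WHERE id = %s",
            (address, user_id),
        )
        db.execute(
            "INSERT INTO addresses (user_id, address) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE address = VALUES(address)",
            (user_id, address),
        )

    def get_address(db, user_id):
        # Step 4, shipped in a later release: reads move to the new table
        # only after the step-3 backfill has copied all the old rows over.
        return db.execute(
            "SELECT address FROM addresses WHERE user_id = %s", (user_id,)
        )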

It also shaped the way we developed and released features. Rather than long-lived feature branches[3], we used...

Feature gating.[4]

Facebook has an internal system called Gatekeeper. It allows people to enable/disable/throttle features in real time without requiring code changes. To gate code, you'd write something like:
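
(The original code snippet didn't survive republication. Here's a minimal Python sketch of the pattern; gatekeeper_allows, ROLLOUT, and the feature name are invented stand-ins for the real Gatekeeper API:)

    import hashlib

    # Live-tunable rollout fractions, 0.0 to 1.0. In the real system this
    # state lived in Gatekeeper and could change without a code push.
    ROLLOUT = {"new_profile_design": 0.05}

    def gatekeeper_allows(feature, user_id):
        # Hash the user into [0, 1) and compare against the feature's
        # current fraction, so each user gets a stable answer as the
        # throttle is raised.
        if feature not in ROLLOUT:
            return False
        h = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        return int(h, 16) / 16**32 < ROLLOUT[feature]

    def render_profile(user_id):
        if gatekeeper_allows("new_profile_design", user_id):
            return "new profile page"  # the gated feature
        return "old profile page"      # existing behavior

Raising the fraction widens the audience in real time; no deploy required.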

This decoupled feature release from code deployment. Features might be rolled out over days or weeks while we watched user metrics and performance and made sure the backing services were in place to scale. It also meant code was written to work correctly whether a particular feature was enabled or not.

Versioned static resources across the web tier.

Facebook versioned static resources per release, which interacts badly with a multi-phase deployment: a user could be served a page generated by the new release that references the new version of a static resource, but the CDN's request for that resource could land on a server still running the old release.

To solve this, Facebook served static resources from a database. Before p1 began, we could push the new release's static resources to every web server; from then on, even servers running the old code could serve the correct resources for the new release.[5]
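
A minimal sketch of the idea, with an in-memory dict standing in for the database and invented paths and version names:

    # Shared store keyed by (path, release_version), populated for the new
    # release before p1 starts, so every web server can answer for either
    # version regardless of which code it runs.
    STATIC_STORE = {
        ("css/profile.css", "r100"): b"/* old styles */",
        ("css/profile.css", "r101"): b"/* new styles */",
    }

    def serve_static(path, version):
        # The version is baked into the URL the page emits, so whichever
        # server the CDN hits, the lookup is unambiguous.
        try:
            return STATIC_STORE[(path, version)]
        except KeyError:
            raise FileNotFoundError(f"{path} at {version} not pushed yet")

    # A page rendered by r101 links to /static/r101/css/profile.css; a
    # server still on r100 code can serve it straight from the store.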

[1] Essentially an engineer (and later two) designated to herd cats during the potentially multi-hour process.

[2] This doesn't mean we didn't create bugs. "Move fast and break things" was the company engineering motto.

[3] In the interest of honesty, we did once attempt to use a long-lived feature branch for the 2009 site redesign. It wasn't pretty.

[4] FeatureToggle

[5] There's a good writeup of how to handle static resources here: http://www.phabricator.com/docs/...

This question originally appeared on Quora.