The gold standard for systems performance measurement is a load test, which is a deterministic process of putting a demand on a system to establish its capacity. For example, you might load test a web search cluster by playing back actual logged user requests at a controlled rate. Load tests make great benchmarks for performance tuning exactly because they are deterministic and repeatable. Unfortunately, they just don't work for some of us.
At Foursquare, we push new versions of our application code at master/HEAD to production at least daily. We are constantly adding features, tweaking how old features work, doing A/B tests on experimental features, and doing behind-the-scenes work like refactoring and optimization to boot. So any load test we might create would have to be constantly updated to keep up with new features and new code. Such a test would be reminiscent of a bad unit test that simply restates the code being tested: duplicated effort for dubious gain.
To make things even worse, many of our features depend on large amounts of data. For example, to surface insights after you check in to a location on Foursquare we have to consider all your previous check-ins, your friends' check-ins, popular tips at the venue, nearby venues that are popular right now, and so on. Creating an environment in which we could run a meaningful load test would require us to duplicate a lot of data, perhaps as much as the whole site's. A lot of data means a lot of RAM to serve it from, and RAM is expensive.
So we usually choose not to attempt these “canned” load tests. In lieu of a classic load test, our go-to pre-launch performance test is what we call a “dark test.” A dark test involves generating extra work in the system in response to actual requests from users.
For example, in June 2012, we rolled out a major Foursquare redesign in which we switched the main view of the app from a simple list of recent friend check-ins to an activity stream which included other types of content like tips and likes. Behind the scenes, the activity stream implementation was much more complex than the old check-in list. This was in part because we wanted to support advanced behavior like collapsing (your friend just added 50 tips to her to-do list, we should collapse them all into a single stream item).
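To make the collapsing idea concrete, here is a minimal sketch of grouping consecutive stream items of the same type from the same user into a single collapsed item. All names and the item shape are illustrative assumptions, not Foursquare's actual implementation.

```python
from itertools import groupby

def collapse(items):
    """Collapse runs of consecutive stream items that share the same
    user and type (e.g. 50 tip additions) into one grouped item."""
    collapsed = []
    for (user, kind), run in groupby(items, key=lambda i: (i["user"], i["type"])):
        run = list(run)
        if len(run) == 1:
            # A lone item passes through unchanged.
            collapsed.append(run[0])
        else:
            # A run becomes a single summary item carrying a count.
            collapsed.append({"user": user, "type": kind, "count": len(run)})
    return collapsed
```

The real implementation has to do this over arbitrary ranges of activity (see below), which is where much of the complexity comes from.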
Before and after the redesign
Perhaps surprisingly, the biggest driver of additional complexity was the requirement for infinite scroll, which meant we needed to be ready to materialize any range of activity for all users. Since the intention was for the activity stream to be the main view a user sees upon opening the Foursquare app, we knew that the activity stream API endpoint would receive many, many requests as soon as users started to download and use the new version of the app. Above all, we did not want to make a big fuss about this great new feature and then give our users a bad experience by serving errors to them when they tried to use it. Dark testing was a key factor in making the launch a success.
The first version of the dark test was very simple: whenever a Foursquare client makes a request for the recent check-ins list, generate an activity stream response in parallel with the recent check-ins response, then throw the activity stream response away. We then hooked this up to a runtime control in our application which permitted it to be invoked on an arbitrary percentage of requests, so we were able to generate this work for 1 percent, 5 percent, 20 percent, and so on, of all check-in list requests. By the time we were a few weeks out from the redesign launch, we were running this test 24/7 for 100 percent of requests, which gave us pretty good confidence that we could launch this feature without overloading our systems.
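The shape of that mechanism can be sketched in a few lines. This is an illustrative sketch, not Foursquare's code: the knob name and handler signature are assumptions, the dark work runs inline rather than truly in parallel, and in production the percentage would come from a live config service rather than a module-level variable.

```python
import random

# Hypothetical runtime-tunable knob: fraction of requests that also
# exercise the new code path. 0 disables the dark test, 100 runs it
# on every request.
DARK_TEST_PERCENT = 20

def handle_checkins_request(user_id, fetch_recent_checkins, build_activity_stream):
    """Serve the old response; on a sampled fraction of requests, also
    generate the new activity-stream response and discard it."""
    response = fetch_recent_checkins(user_id)
    if random.uniform(0, 100) < DARK_TEST_PERCENT:
        try:
            build_activity_stream(user_id)  # result is thrown away
        except Exception:
            pass  # a dark-test failure must never affect the real response
    return response
```

The key properties are that the user always gets the old, trusted response, and that the new code path sees real production traffic at whatever rate the knob allows.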
Click here to read the full post.
- Cooper Bethea (@cooperb)