Netflix in Flames


Recently I came across this blog post. The title sounded quite intriguing and when reading it, I found some points of interest that I will summarize and present at the end of my blog post.

Here is the digest of that post.

Netflix moved its next-generation Netflix.com website to Node.js and hit performance issues: the web application’s response time increased at a steady rate of 10 ms/hour, with an accompanying increase in CPU usage. Monitoring of the Node.js heap showed no signs of a memory leak (V8’s heap size stayed flat).

The troubleshooting steps described in the blog article boil down to using the cool Flame graphs visualization tool to find the most frequent (and, therefore, “hot”) code paths, which are drawn, metaphorically, as flames. For those who have a background in troubleshooting Java applications, the closest concept is the repeated JVM thread dump technique (kill -3 YOUR_JAVA_PID, or using the Java VisualVM tool).
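The idea behind a flame graph can be sketched in a few lines: take many stack samples, fold each one into a single string, and count how often each folded stack shows up. The sample data and the function names (foldStacks, hottest, handleRequest and so on) are made up for illustration; real flame graphs are built from profiler output, not from hand-written arrays.

```javascript
// Hypothetical stack samples, outermost frame first (made-up data).
var samples = [
    ["main", "handleRequest", "matchRoute"],
    ["main", "handleRequest", "matchRoute"],
    ["main", "handleRequest", "render"],
    ["main", "handleRequest", "matchRoute"],
    ["main", "gc"]
];

// Fold each stack into "a;b;c" and count identical stacks.
function foldStacks(stacks) {
    var counts = Object.create(null);
    stacks.forEach(function (stack) {
        var key = stack.join(";");
        counts[key] = (counts[key] || 0) + 1;
    });
    return counts;
}

// Return the folded stack seen in the most samples (the "hottest" path).
function hottest(counts) {
    return Object.keys(counts).sort(function (a, b) {
        return counts[b] - counts[a];
    })[0];
}

var folded = foldStacks(samples);
console.log(hottest(folded), folded[hottest(folded)]);
```

In this toy data set the matchRoute path accounts for 3 of the 5 samples, so it would be the widest “flame” in the picture.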

At the end of their CSI: Netflix troubleshooting show, it was revealed that it was, in fact, their own web application that was causing the trouble: it was programmatically adding static route handlers to Express.js (a JavaScript web framework for Node.js) at the rate of 10 instances an hour, which resulted in an ever-growing list of route handlers, most of them duplicates. Every such addition bumped the response time of their web app by 1 ms on average.

When developing their application, the Netflix developers assumed that the Express.js library would automatically remove duplicate route handlers through some sort of hash-based map mechanism (this is the assumption most developers would normally make when working with more conventional server-side languages).

That’s not the case with Express.js. Express developers simply stash the routing table in a global array, a quite common design decision in the brave JavaScript world, as the present ECMAScript 5 spec does not offer the hungry masses of JavaScript developers a Map object.
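To see why an array without a duplicate check hurts, here is a toy model of such a routing table. It is not Express code: addRoute, lookupCost and the paths are invented for this sketch, and it only assumes that routes are scanned front to back until one matches.

```javascript
// Toy routing table: a plain array scanned front to back.
// This is NOT Express code, only a model of the behaviour described above.
var routes = [];

function addRoute(path, handler) {
    routes.push({ path: path, handler: handler }); // no duplicate check
}

// Return how many entries were inspected before a match was found.
function lookupCost(path) {
    for (var i = 0; i < routes.length; i++) {
        if (routes[i].path === path) {
            return i + 1;
        }
    }
    return routes.length; // no match: the whole array was scanned
}

function noop() {}

// The same static route registered over and over, ahead of the real one.
for (var n = 0; n < 100; n++) {
    addRoute("/static", noop);
}
addRoute("/home", noop);

console.log(routes.length);       // 101
console.log(lookupCost("/home")); // 101
```

Every duplicate in front of “/home” adds one more comparison to each request for it, which is the linear growth the symptoms pointed at.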

Of course, JavaScript is an extremely expressive language and you can always hack up your own band-aid JavaScript Map with just a few lines of code:


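var Map = Object.create(null); // Look Mom, this is my own Map! (no inherited keys such as toString)
. . .
if (!(someKey in Map)) { // unlike !Map[someKey], this also keeps falsy values such as 0 or ""
    Map[someKey] = someValue;
}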

Or wait (twiddling your thumbs for lack of anything better to do) for JavaScript engines to implement the upcoming ECMAScript 6 spec, which finally introduces the native Map object.
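For comparison, a put-if-absent check with the native ES6 Map might look roughly like this. The helper name putIfAbsent and the keys and values are placeholders chosen for this sketch.

```javascript
var handlers = new Map();

// Store a value only if the key has not been seen before.
function putIfAbsent(map, key, value) {
    if (!map.has(key)) {
        map.set(key, value);
    }
    return map.get(key);
}

putIfAbsent(handlers, "/static", "first");
putIfAbsent(handlers, "/static", "second"); // ignored: key already present
putIfAbsent(handlers, "toString", 0);       // no clash with inherited names; falsy value is kept

console.log(handlers.size, handlers.get("/static"));
```

The map ends up with two entries, and “/static” still holds the first value that was stored for it.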

The final code fix was about making sure that only unique route handlers were added by the web application to the Express router table. That’s all there was to it, really. Case closed.
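The shape of such a fix can be sketched as a guard in front of the registration call. This is only a model under stated assumptions: the routes array stands in for the framework’s table, addRouteOnce is a hypothetical helper, and treating the path alone as the uniqueness key is a simplification.

```javascript
var routes = [];                      // toy routing table, not Express itself
var registered = Object.create(null); // paths already added

// Add a handler only the first time a path is seen; report whether it was added.
function addRouteOnce(path, handler) {
    if (path in registered) {
        return false;
    }
    registered[path] = true;
    routes.push({ path: path, handler: handler });
    return true;
}

function noop() {}

// Simulate a day of periodic refreshes that try to re-register the same route.
for (var hour = 0; hour < 24; hour++) {
    for (var i = 0; i < 10; i++) {
        addRouteOnce("/static", noop);
    }
}

console.log(routes.length); // 1
```

After 240 registration attempts the table still holds a single entry, so the cost of walking it no longer depends on uptime.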

The author climbed down from the eye-catching (and quite controversial) “Node.js in Flames” title, kicked Express.js on the way out (funny that JavaScript itself was not blamed at all), and made it all look like “mission accomplished”.

The troubleshooting saga could have been much shorter (though less exciting) if the author had done a couple of simple things:

  1. Review symptoms of service degradation:
    • CPU usage growing (almost) in direct correlation with the response latency is a hallmark of a computational task whose cost increases linearly (traversing the list of route handlers in this case)
    • The magic number 10 popping up on a couple of occasions should have rung a bell (what if we change our procedure to add 5 instances an hour instead?)
    • A simple cause-effect diagram of the code would probably be enough to spot the root cause
  2. Contact the Express.js dev team and tell them that the client application is adding duplicate values; or inspect the source in parts around the web app / Express touch points
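The “magic number 10” check is simple arithmetic and can be written down directly. The two constants are the figures quoted in the post; the function name and the 24-hour horizon are chosen only for illustration.

```javascript
// Numbers quoted in the post.
var HANDLERS_PER_HOUR = 10;
var MS_PER_HANDLER = 1;

// Extra latency accumulated after a given number of hours of uptime.
function extraLatencyMs(hours, handlersPerHour) {
    return hours * handlersPerHour * MS_PER_HANDLER;
}

console.log(extraLatencyMs(1, HANDLERS_PER_HOUR));  // 10
console.log(extraLatencyMs(24, HANDLERS_PER_HOUR)); // 240
console.log(extraLatencyMs(24, 5));                 // 120, the "what if 5?" experiment
```

If halving the rate of additions halves the slope of the latency curve, the additions are the cause, and no profiler is needed to say so.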

One might be tempted to say, “Sure, it is easy to see clearly with 20/20 hindsight and a full post-mortem analysis done”, but that is not the point here. The basic steps in troubleshooting are about uncluttering the “crime scene” by eliminating the unlikely causes. They took some steps in the right direction, such as not blaming the EC2 service and making sure there were no memory leaks in the web app, but then ran out of ideas.

The post also served to illustrate several emerging trends in modern web development.

  1. Companies are more comfortable with adopting new technologies, e.g. Node.js (the current version is only at a 0.10 level), with the “ugly duckling” JavaScript becoming an accepted technology on the server side
    • As Jeff Atwood, a co-founder of Stack Overflow, once put it, “Any application that can be written in JavaScript, will eventually be written in JavaScript”, so stay tuned …
  2. Development is fueled by a symbiosis of systems with varying degrees of production quality
  3. JavaScript can easily bite opinionated developers coming from other languages that have a rich set of collection libraries (and other “goodies”)
  4. Developers have a propensity to (very quickly?) build solutions without a clear picture of system interactions and interdependencies; they can also (very quickly?) start blaming the whole software stack without conducting a proper differential diagnosis (as in the House, M.D. medical drama)
  5. There is a lot of sensationalism on the web and you need to develop some internal filter to separate the wheat from the chaff, so to speak