tl;dr version: the web is waaaay too slow, and every time you write something off as "just taking a couple of milliseconds", you're part of the problem. Good engineering is about tradeoffs, and all engineering requires environmental assumptions -- even feature testing. In any case, there are good, reliable ways to use UA detection to speed up feature tests in the common case, which I'll show, and to which the generic arguments about UA vs. feature testing simply don't apply. We can and should go faster. Update: Nicholas Zakas explains it all, clearly, in measured form. Huzzah!
Performance Innumeracy
I want to dive into concrete strategies for low-to-no false positive UA matching for use in caching feature detection results, but first I feel I need to walk back to some basics, since I've clearly lost some people along the way. Here are some numbers every developer (of any type) should know, borrowed from Peter Norvig's indispensable "Teach Yourself Programming in Ten Years":
Approximate timing for various operations on a typical PC:
execute typical instruction          | 1 nanosec (1/1,000,000,000 sec)
fetch from L1 cache memory           | 0.5 nanosec
branch misprediction                 | 5 nanosec
fetch from L2 cache memory           | 7 nanosec
mutex lock/unlock                    | 25 nanosec
fetch from main memory               | 100 nanosec
send 2K bytes over 1Gbps network     | 20,000 nanosec
read 1MB sequentially from memory    | 250,000 nanosec
fetch from new disk location (seek)  | 8,000,000 nanosec
read 1MB sequentially from disk      | 20,000,000 nanosec
send packet US to Europe and back    | 150 milliseconds = 150,000,000 nanosec
That data's a bit old -- 8ms is optimistic for a HD seek these days, and SSD changes things -- but the orders of magnitude are relevant. For mobile, we also need to know:
fetch from flash storage                                      | 1,300,000 nanosec
60Hz time slice                                               | 16,000,000 nanosec
send packet outside of a (US) mobile carrier network and back | 80-800 milliseconds = 80,000,000 - 800,000,000 nanosec
The 60Hz number is particularly important. To build UI that feels not just fast, but instantly responsive, we need to be yielding control back to our primary event loop in less than 16ms, all the time, every time. Otherwise the UI will drop frames and the act of clicking, tapping, and otherwise interacting with the app will seem "laggy" or "janky". Framing this another way, anything your webapp blocks on for more than 16ms is the enemy of solid, responsive UI.
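To make that budget concrete, here's a minimal sketch -- mine, not from any particular library -- of slicing a long-running job so control returns to the event loop well inside each frame; the processInSlices/processItem names and the 10ms slice size are illustrative assumptions:

// Process a large array without blocking the event loop for a whole frame.
// Each slice works for at most ~10ms, then yields via setTimeout so input,
// rendering, and other handlers can run before the next slice starts.
function processInSlices(items, processItem, done) {
  var i = 0;
  function slice() {
    var start = +new Date();
    while (i < items.length && (+new Date() - start) < 10) {
      processItem(items[i++]);
    }
    if (i < items.length) {
      setTimeout(slice, 0);  // yield; stay under the 16ms frame budget
    } else if (done) {
      done();
    }
  }
  slice();
}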
Why am I blithering on and on about this? Because some folks continue to mis-prioritize the impact of latency and performance on user satisfaction. Google (my employer, who does not endorse this blog or my private statements in any way) has shown that seemingly minor increases in latency directly impact user engagement, and that major increases in latency (> 500ms) can reduce traffic and revenue significantly. Latency, then, along with responsiveness (do you drop below 60Hz?), is a key metric for measuring the quality of a web experience. It's no accident that Google employs Steve Souders to help evangelize the cause of improving performance on the web, and has gone so far as to build products like Chrome & V8, which have making the web faster as a core goal. A faster web is a better web. Full stop.
That's why I get so deeply frustrated when we get straw-man based, data-challenged advocacy from the maintainers of important bits of infrastructure:
This stuff is far from easy to understand; even just the basics of feature detection versus browser detection are quite confusing to some people. That’s why we make libraries for this stuff (and, use browser inference instead of UA sniffing). These are the kind of efforts that we need, to help move the web forward as a platform; what we don’t need is more encouragement for UA sniffing as a general technique, only to save a couple of milliseconds. Because I can assure you that the Web never quite suffered, technologically, from taking a fraction of a second longer to load.
What bollocks. Not only did I not encourage UA sniffing "as a general technique", latency does in fact hurt sites and users -- all the time, every day. And we're potentially not talking about "a couple of milliseconds" here. Remember, in the context of mobile devices, the CPUs we're on are single-core and clocked in the 500MHz-1GHz range, which directly impacts the performance of single-threaded tasks like layout and JavaScript execution -- which, by the way, happen in the same thread. In my last post I said:
...if you’re a library author or maintainer, please please please consider the costs of feature tests, particularly the sort that mangle the DOM and/or read back computed layout values
Why? Because many of these tests inadvertently force layout and style re-calculation. See, for instance, this snippet from has.js:
// ("input" here is a freshly created <input> element and "de" is the
//  documentElement -- see has.js itself for the full context)
if(has.isHostType(input, "click")){
  input.type = "checkbox";
  input.style.display = "none";
  input.onclick = function(e){
    // ...
  };
  try{
    de.insertBefore(input, de.firstChild);
    input.click();
    de.removeChild(input);
  }catch(e){}
  // ...
}
Everything looks good. The element is display: none; so it shouldn't be generating render boxes when inserted into the DOM. Should be cheap, right? Well, let's see what happens in WebKit. Debugging into a simple test page with equivalent code shows that part of the call stack looks like:
#0 0x0266267f in WebCore::Document::recalcStyle at Document.cpp:1575
#1 0x02662643 in WebCore::Document::updateStyleIfNeeded at Document.cpp:1652
#2 0x026a89fd in WebCore::MouseRelatedEvent::receivedTarget at MouseRelatedEvent.cpp:152
#3 0x0269df03 in WebCore::Event::setTarget at Event.cpp:282
#4 0x026af889 in WebCore::Node::dispatchEvent at Node.cpp:2604
#5 0x026adbcb in WebCore::Node::dispatchMouseEvent at Node.cpp:2885
#6 0x026ae231 in WebCore::Node::dispatchSimulatedMouseEvent at Node.cpp:2816
#7 0x026ae3f1 in WebCore::Node::dispatchSimulatedClick at Node.cpp:2837
#8 0x02055bb5 in WebCore::HTMLElement::click at HTMLElement.cpp:767
#9 0x022587e6 in WebCore::HTMLInputElementInternal::clickCallback at V8HTMLInputElement.cpp:707
...
Document::recalcStyle() can be very expensive, and unlike painting, it blocks input and other execution. And the cost at page load is likely to be much higher than at other times, since there will be significantly more new styles streamed in from the network to satisfy for each element in the document when this is called. This isn't a full layout, but it's most of the price of one. Now, you can argue that this is a WebKit bug and I'll agree -- synthetic clicks should probably skip this -- but I'm just using it as an illustration to show that what browsers are doing on your behalf isn't always obvious. Once this bug is fixed, this test may indeed be nearly free, but it's not today. Not by a long shot.
Many layouts in very deep and "dirty" DOMs can take ten milliseconds or more, and if you're doing it from script, you're causing the system to do lots of work which it's probably going to need to throw away later when the rest of your markup and styles show up. Your average, dinky test harness page likely under-counts the cost of these tests, so when someone tells me "oh, it's only 30ms", not only do my eyes bug out at the double-your-execution-budget-for-anything number, but also at the knowledge that in the real world, it's probably a LOT worse. Just imagine this happening in a deep DOM on a low-end ARM-powered device where memory pressure and a single core are conspiring against you.
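If it's not obvious how a test ends up paying for this, here's an illustrative sketch (the probe element and class name are made up; this isn't the has.js code above): interleaving a DOM write with a read of computed layout forces the engine to flush style -- and usually layout -- synchronously, right in the middle of page load.

var probe = document.createElement("div");
probe.className = "feature-probe";             // hypothetical test styling
document.documentElement.appendChild(probe);   // style/layout are now dirty

// Reading a computed layout value here forces a synchronous style recalc
// (and typically layout) so the answer can be accurate -- work that will
// largely be thrown away once the rest of the markup and styles arrive.
var width = probe.offsetWidth;

document.documentElement.removeChild(probe);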
False Positives
My last post concerned how you can build a cache to eliminate many of these problems if and only if you build UA tests that don't have false positives. Some commenters can't seem to grasp the subtlety that I'm not advocating for the same sort of lazy substring matching that has deservedly gotten such a bad rap.
So how would we build less naive UA tests that can have feature tests behind them as fallbacks? Let's look at some representative UA strings and see if we can't construct some tests for them that give us sub-version flexibility but won't pass on things that aren't actually the browsers in question:
IE 6.0, Windows:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
FF 3.6, Windows:
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.13) Firefox/3.6.13
Chrome 8.0, Linux:
Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Ubuntu/10.10 Chromium/8.0.552.237 Chrome/8.0.552.237 Safari/534.10
Safari 5.0, Windows:
Mozilla/5.0 (Windows; U; Windows NT 6.1; sv-SE) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4
Some features start to jump out at us. The "platform" clauses -- that bit in the parens after the first chunk -- contain a lot of important data and a lot of junk. But the important stuff always comes first. We'll need to allow but ignore the junk. Next, the stuff after the platform clause is good, has a defined order, and can be used to tightly form a match for browsers like Safari and Chrome. With this in mind, we can create some regexes that don't allow much in the way of variance but do allow the sub-minor version to match, so we don't have to update these every month or two:
IE60 = /^Mozilla\/4\.0 \(compatible; MSIE 6\.0; Windows NT \d\.\d(.*)\)$/;
FF36 = /^Mozilla\/5\.0 \(Windows; U;(.*)rv\:1\.9\.2.(\d{1,2})\)( Gecko\/(\d{8}))? Firefox\/3\.6(\.\d{1,2})?( \(.+\))?$/;
CR80 = /^Mozilla\/5\.0 \((Windows|Macintosh|X11); U;.+\) AppleWebKit\/534\.10 \(KHTML\, like Gecko\) (.+)Chrome\/8\.0\.(\d{3})\.(\d{1,3}) Safari\/534\.10$/;
These look pretty wordy, and they are, because they're designed NOT to let through things that we don't really understand. This isn't just substring matching on the word "WebKit" or "Chrome", this is a tight fit against the structure of the entire string. If it doesn't fit, we don't match, and our cache doesn't get pre-populated. Instead, we do feature detection. Remember, false positives here are the enemy, so we're using "^" and "$" matches to ensure that the string has the right structure all the way through, not just at some random point in the middle, which UA's that parade around as other browsers tend to do.
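To see the difference, run the Safari 5 UA string from above through both styles of test: a lazy substring check "recognizes" it as WebKit-and-therefore-whatever-we-assumed, while the anchored CR80 pattern rejects it and we fall back to feature detection. (This is just an illustration; nothing here ships in a library.)

var safariUA = "Mozilla/5.0 (Windows; U; Windows NT 6.1; sv-SE) " +
               "AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4";

// Lazy substring matching: "it says WebKit, ship it Chrome's answers!" Wrong.
/AppleWebKit/.test(safariUA);  // true  -- false positive

// The anchored pattern demands the whole Chrome 8 structure and rejects it,
// so this UA falls through to real feature detection instead of a bad cache.
CR80.test(safariUA);           // false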
Here's some sample code that incorporates the approach:
(function(global){
  // The map of available tests
  var featureTests = {
    "audio": function() {
      var audio = document.createElement("audio");
      return audio && audio.canPlayType;
    },
    "audio-ogg": function() { /*...*/ }
    // ...
  };

  // A read-through cache for test results.
  var testCache = {};

  // An (exported) function to run/cache tests
  global.ft = function(name) {
    return testCache[name] = (typeof testCache[name] == "undefined") ?
                              featureTests[name]() :
                              testCache[name];
  };

  // Tests for 90+% of current browser usage
  var ua = (global.navigator) ? global.navigator.userAgent : "";

  // IE 6.0/WinXP:
  var IE60 = /^Mozilla\/4\.0 \(compatible; MSIE 6\.0; Windows NT \d\.\d(.*)\)$/;
  if (ua.search(IE60) == 0) {
    testCache = { "audio": 1, "audio-ogg": 0 /* ... */ };
  }

  // IE 7.0
  // ...
  // IE 8.0
  // ...

  // IE 9.0 (updated with fix from John-David Dalton)
  var IE90 = /^Mozilla\/5\.0 \(compatible; MSIE 9\.0; Windows NT \d\.\d(.*)\)$/;
  if (ua.search(IE90) == 0) {
    testCache = { "audio": 1, "audio-ogg": 0 /* ... */ };
  }

  // Firefox 3.6/Windows
  var FF36 = /^Mozilla\/5\.0 \(Windows; U;(.*)rv\:1\.9\.2.(\d{1,2})\)( Gecko\/(\d{8}))? Firefox\/3\.6(\.\d{1,2})?( \(.+\))?$/;
  if (ua.search(FF36) == 0) {
    testCache = { "audio": 1, "audio-ogg": 1 /* ... */ };
  }

  // Chrome 8.0
  var CR80 = /^Mozilla\/5\.0 \((Windows|Macintosh|X11); U;.+\) AppleWebKit\/534\.10 \(KHTML\, like Gecko\) (.+)Chrome\/8\.0\.(\d{3})\.(\d{1,3}) Safari\/534\.10$/;
  if (ua.search(CR80) == 0) {
    testCache = { "audio": 1, "audio-ogg": 1 /* ... */ };
  }

  // Safari 5.0 (mobile)
  var S5MO = /^Mozilla\/5\.0 \(iPhone; U; CPU iPhone OS \w+ like Mac OS X; .+\) AppleWebKit\/(\d{3,})\.(\d+)\.(\d+) \(KHTML\, like Gecko\) Version\/5\.0(\.\d{1,})? Mobile\/(\w+) Safari\/(\d{3,})\.(\d+)\.(\d+)$/;
  if (ua.search(S5MO) == 0) {
    testCache = { "audio": 1, "audio-ogg": 0 /* ... */ };
  }

  // ...
})(this);
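Call sites then go through ft() and never know (or care) whether the answer came from the pre-seeded cache or from running the test; a minimal usage sketch (the fallback branch is just illustrative):

// Known UA: reads the pre-built cache entry. Unknown UA: runs the feature
// test exactly once and memoizes the result. Either way, same call site.
if (ft("audio") && ft("audio-ogg")) {
  // safe to rely on <audio> with Ogg sources
} else {
  // degrade gracefully: different codec, a plugin, or no audio at all
}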
New versions of browsers won't match these tests, so we won't break libraries in the face of new UAs -- assuming the feature tests also don't break, which is a big if in many cases -- and we can go faster for the majority of users. Win.
I've been having a several-day mail, IRC, and twitter discussion with various folks about performance and the feature-detection religion/technique, particularly on mobile where CPU ain't free. So what's the debate? I say you shouldn't be running tests in UA's where you can dependably know the answer a priori.
Wait, what? Why does Alex Russell hate feature testing, kittens, and cute fuzzy ducklings?
I don't. Paul warned me that my approach isn't going to be popular at first glance, but hear me out. My assumptions are as follows:
- Working is better than busted
- Fast is better than slow
- No browser vendor changes the web-facing features in a given version. Evar. Does not happen.
If you buy those, then I think we can all get some satisfaction by retracing our steps and asking, seriously, what is the point of feature testing?
Ok, I'll go first: feature testing is motivated by a desire not to be busted, particularly in the face of new versions of UA's which will (hopefully) improve standards support and reduce the need for hacks in the first place. Sensible enough. Why should users wait for a new version of your library just 'cause a new browser was released, or because you didn't test on some version of something?
Extra bonus: if you don't mind running them every time, you can write just the feature test and your work is done, now and in the future! Awesome! Except some of us do mind. Yes, things are now resilient in the face of new UA's and new versions of old ones, but only on the back of testing for everything you need every time you load a library on a page. Slowly. Veeerrrrry slooowly.
Paul suggested that some library could use something like Local Storage to cache the results of these tests locally, but this hardly seems like an answer. First, what if the user upgrades their browser? Guess you have to cache and check against the UA string anyway. And what about the cost of going to storage? Paul reported that these tests can be wicked expensive to run at all, on the order of 30ms for the full suite (which you hopefully won't hit...but sheesh). Reported worst-case for has.js is even worse. But apparently going to Local Storage is also expensive. And we're still running all these tests in the fast path the first time anyway. If we think that they're so expensive that we want to cache the results, why don't we think they're so expensive that we don't want to run them in the common case?
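For the record, here's roughly what the Local Storage approach has to grow into once you account for upgrades -- a hedged sketch of my own, not something has.js or any other library actually ships. Note that the UA string ends up in the cache key anyway, and the first visit still runs every test:

// Cache feature-test results in localStorage, keyed by the full UA string so
// that a browser upgrade naturally invalidates stale answers. We still pay
// for storage access and JSON parsing on every load.
function loadTestCache() {
  try {
    var key = "ft-cache:" + navigator.userAgent;
    var cached = window.localStorage && localStorage.getItem(key);
    return cached ? JSON.parse(cached) : null;
  } catch (e) {
    return null;  // storage disabled, quota exceeded, private mode, etc.
  }
}

function saveTestCache(results) {
  try {
    localStorage.setItem("ft-cache:" + navigator.userAgent,
                         JSON.stringify(results));
  } catch (e) { /* best effort */ }
}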
Now for a modest proposal: feature tests should only ever be run when you don't know what UA you're running in.
Feature testing libraries should contain pre-built caches -- the kind that come with the library, not the kind that get built on the client -- but they should only be consulted for UA versions that you know you know. If we assume that behavior for a UA/version combination never changes, we've got ourselves a get-out-of-jail-free card. Libraries can have O(1) behavior in the common case, and in the situations where feature testing would keep you from being busted, you're still not busted.
So what's the cost to this? Frankly, given the size of some of the feature tests I've seen, it's going to be pretty minimal vs. the bloat the feature tests add. All performance work is always a tradeoff, but if your library thinks it's important not to break and to be fast, then I don't see many alternatives. New versions of libraries can continue to update the caches and tests as necessary, keeping the majority of users fast, while at the same time keeping things working in hostile or unknown environments.
Anyhow, if you're a library author or maintainer, please please please consider the costs of feature tests, particularly the sort that mangle the DOM and/or read back computed layout values. Going slow hurts users, hurts the web, and hurts the culture of performance that's so critical to keeping the platform a viable contender for the next generation of apps. We owe it to users to go faster.
A quick aside: I hesitated writing this for the same reasons that Paul cautioned me about how unpopular this was going to be: there's a whole lot of know-nothing advocacy that's still happening in the JS/webdev/design world these days, and it annoys me to no end. I'm not sure how our community got so religious and fact-disoriented, but it has got to stop. If you read this and your takeaway was "Alex Russell is against feature testing", then you're part of the problem. Think of it like a feature test for bogosity. Did you pass? If so, congrats, and thanks for being part of the bits of the JavaScript universe that I like.
or:
Why I spent last summer rebooting Windows
or:
How to compress several months of research into 50 lines of code and unblock a product launch, for fun and profit. Home parlor entertainment edition. Fun for all ages. Warranty void if seal broken. Offer not available in some states.
Now that others have found the same technique that I used to drop Chrome and Chrome Frame cold start times by more than 2/3 last summer, I feel compelled to explain what's going on. Particularly as there seems to be some confusion over in the Hacker News thread.
First, a couple of notes and caveats. SSDs are coming (see: ChromeOS, iOS, etc.), and in an SSD world, none of what follows will matter. Some of it could even hurt you. Other concerns -- drive-internal parallelism, bus bandwidth/latency, and the amount of crapware (that's you, Sophos/Bit9/etc.) between your driver and your LoadLibrary call -- are going to dominate. Also note that what I'm about to say is app specific. If you want your app to start faster, collect data on real-world systems and go from there. Cargo-cult hacks at your own peril.
Right then, spinning disks.
Good programs start fast, and that includes when you start them just after the OS has started. In this state, many of the caches the OS populates on your behalf are unlikely to be full of goodies that'll help you start faster, so you pay full freight for all of your dependencies. This is called a "cold start". When you start cold you have to go get everything you need from disk. Usually spinning disk. With > 10ms average latencies for random reads. Ouch.
For a sense of scale, consider that warm starts of Chrome (on Windows) take just ~200ms. That's ~200ms for both the browser and renderer processes. Twenty random reads from disk (20 × ~10ms) double that. Last summer, depending on OS version, disk speed, and the amount of Prefetch/SuperFetch training, cold starts ranged from 2000-3700ms. Fortunately, browsers are the sorts of things that users tend to keep open, and if they do close them, to re-open them later in the same session. Cold start hurts, but it's not most users' primary experience of a browser.
But that doesn't hold for a plugin, particularly one that isn't already ubiquitous. Yes, Flash loads quickly, but for most sites it starts fast because some other site had previously loaded it. Cold start matters even more for Chrome Frame. You've already started your browser, so why is this page taking so long to come in? It's not enough that the network response is being buffered and the page will feel instant when the renderer finally starts...we need pixels on screen stat. Hence began my summer of rebooting. We couldn't ship Chrome Frame to Stable until cold start was sub-second.
Using custom logging via ETW as a starting point, I quickly found that most of the time spent in Chrome startup wasn't going to subprocess or thread creation, cache loading, profile reading, or any of the other obvious big-ticket items. Tremendous discipline has been enforced in the Chromium codebase, ensuring that things which might otherwise block startup tend to happen asynchronously, keeping your experience responsive.
What's left? ETW, Sawbuck, and XPerf eventually showed that we were thrashing the disk with lots of little 16-64K reads into chrome.dll. Why not 4K, the size of a single page? The Windows memory manager uses a read-ahead optimization when a hard fault occurs. That is to say, when your program tries to execute some bit of code from a library that isn't yet in memory, the memory manager will go get the bits that correspond to the code in question, plus a little extra. Compilers optimize for code locality, so you're probably going to need the other pages too. Read-ahead (probably) saves you another expensive seek. If you don't use those pages, no biggie. Windows is smart enough to prioritize caches. Lots of little 16-64K reads (aka: random I/O!) matched up with the hard page fault data...bingo. Even with SuperFetch doing its thing on modern Windows versions, we were seeing lots of slow (>10ms on a 5400rpm disk) reads thanks to hard page faults.
Why were we getting these faults? To be entirely honest, we never quite figured this out. SuperFetch/PreFetch do their thing before your program ever hits its CRT main (the very first program-supplied entry point), meaning frequently used pages should already be available. Part of the reason they weren't might be the different patterns of access between the Chrome Browser and Renderer processes. The same binary (chrome.exe) and main library (chrome.dll) are used for both, but the bits that each uses are pretty radically different. There's no access to WebKit code from the Browser process, e.g., but that's most of what the Renderer uses. Other theories included some cap on the number of pages that SuperFetch will remember and log. Whatever the reason, fault location graphs showed a pretty violent thrashing toward the top end of chrome.dll.
Idea time: how can we pull all of the pages into the standby list -- the bit of the Virtual Memory cache with the lowest priority -- with a single seek, i.e. one big sequential read? What can we do to slurp the entire DLL into memory so that when the program needs some bit of the library, it gets it from memory, not from spinning (read "slow") disk?
We tried loading the library and walking the pages sequentially. No joy...well, not much. Certainly not enough. More on that in a second. What about pulling the DLL into the disk cache? Hrmmm. A quick test showed that this worked great on Windows Vista. Just fread the sucker in and boom. ~750ms cold starts!
Except on XP. Balls.
Seemingly the memory manager bypasses the disk cache for faults on XP. Back to the bat-DLL-memory-walk approach, Batman! Yes, this works on XP, albeit with some caveats -- namely that it's slower than the other method. We still get stalls and slow seeks, but things improve to the 1300ms range. Average cold starts cut in half. We'll take it. The final patch isn't pretty, but it does something like what we want. See this file, look for PreReadImage, which gets used in Chromium's WinMain, which does custom DLL loading here, and only on the first (parent, aka browser) process that's created. Voila! Really, really fast starts nearly everywhere.
So, is doing this a good idea?
It's not a slam dunk. For users with SSDs, this is more data than they strictly need, possibly slowing starts marginally for them. We'll probably have to revisit this at some point. Also, the Vista+ approach potentially pollutes the disk caches. Not great, but at least the pages in question won't cause unnecessary thrashing unless the system is already under memory pressure, in which case the impending startup operations (allocating enough heap for V8, renderer process startup, etc.) are going to hurt no matter what. Also, we're not exactly sipping data here. As the size of chrome.dll grows, so will the time required to pull it across the bus, even if the reads are serial. On my most recent dev channel rev, that's a 23MB DLL, so if we get 75-100MB/s across the bus (optimistic for these old drives), we're looking at a couple of hundred milliseconds just spent reading. Beats a ton of random seeks, but it's still not great. Ideally we'd be able to use PGO/WPO to scope down the amount we really need to read by clumping early-use stuff toward the front of the binary, but so far this hasn't panned out, thanks to the aforementioned multi-process/single-DLL thing. Turns out binary re-writing on Windows isn't trivial. Perhaps the biggest downside is that the XP experience isn't ideal yet, and users on XP are the most likely to have neither SSDs nor particularly speedy spinning disks. They need the help most. We were able to do something for them thanks to some dark magic in the Windows XP Prefetch system, but that's a tale for another day. Suffice to say I now have tremendous respect for the folks who built these systems. Some parts of Windows truly are beautiful.
All in all, this hack has panned out in the real world in a pretty compelling way. Our start time histograms show the improvement in the real world (thanks to all of you who turned on reporting!), and Chrome Frame was able to ship to Stable last fall after we were relatively certain that we weren't degrading overall system performance for users.
If you've read all of this, I hope that for your sake it's with a sense of historical bemusement. What? They didn't all have SSDs/quantum-rent-an-exabytes? Pssh. How did they get by? Well, now you know. One slow reboot of Windows at a time.
Bootnote: I should mention that it wasn't a straight line to suss out what actually happens in a cold start, pin down causality, understand the differences between XP and Vista+, and convince ourselves with real-world data that this was really our best option. It took a long time. Also, at some point, I stopped borrowing Amit's copies of Windows Internals 4 and Windows Internals 5 and bought my own. You'll need both if you still work with XP. Recommended.
Footnote to the Bootnote: Obvious Windows Hacker question: why not just do something like read in the DLL from the GCF BHO when you start up? Y'all get your BHO component loaded at IE launch, after all.
Good thought, Obvious Windows Hacker! Sadly, this is evil...or at least not very sporting. Even with low-priority I/O on Vista+, it's sort of punitive to go do tons of I/O near when the user is probably starting to browse sites that might not make use of our BHO-based goodness, don't you think? Besides, if we can get an honest startup win, why cheat? Honest wins pay off for both Chrome and Chrome Frame, which is doubly awesome.
Ok, ok. I've done my bitching and moaning, for years in person and more recently in blog form. I've listened to the interminable discussions at conferences about how everyone knows exactly what's wrong with Twitter and how terribly hard it is to be so popular, and to this day I've escaped such a fate -- to the extent that Pete even made a fake me who, true to form, is nearly as infrequent a poster as I am here. But no longer. Behold! An empty, lame n00b account!