By Zed A. Shaw

poll, epoll, science, and superpoll

The next thing I wanted to solve for Mongrel2 is implementing an epoll backend for handling I/O. Rather than just implement it, I wanted to reconfirm that epoll was superior in the ways claimed based on recent research. I decided to rerun benchmarks everyone had used, but to change how they were being used based on some papers I read and original research I had done a while ago. Of course I totally can't find any of those papers because they're gone, but I decided to go try science and see what I found.

What I found is actually leading me to try an experiment in Mongrel2 I'm gonna call "superpoll" for lack of a better name. But first, let me get into why the touted benefits of epoll and most of the information out there is mostly bullshit because of some falacies people seem to have about epoll (mostly perpetuated by epoll's proponents).

Fallacy 1: epoll Is O(1)

The general belief among programmers is that epoll is O(1) and poll is O(N). Yes, they actually believe that epoll will process 1 event at the same speed it processes 10k or 100k or 1M events. That's what O(1) means to me, so I'm not sure where they're getting that epoll does anything in constant time.

Where the confusion lies is in what N stands for in the case of poll vs. epoll, which gave me the idea for how the stats are done wrong. They're both O(N), it's just with poll N=total file descriptors (FDs), and with epoll N=active FDs. This is important because much of the information, like this just flat out gets it wrong.

What they do is they setup tests where they have a small number of active FDs, then vary the total number of FDs, and of course epoll wins because it is O(N=active). Their little tests show epoll being basically O(1) because they don't ever increase the active levels very much. However if you set the active equal to the total, epoll clearly shows an O(N) performance graph that matches the active level.

How I tested this is I found the test case everyone seems to use and then ripped out any AIO stuff to remove any confounding. I then started graphing epoll's performance as the ratio of active/total approached 1.0. Sure enough, epoll is O(N=active), not O(1) at all. For it to be O(1) the graph would need to be a straight line no matter how much you throw at it that's active, but that's not the case at all.

I'll have a better graph that summarizes this whole finding in one neat little package for you to ponder.

Fallacy 2: Number Of Active/Total Matters

The next fallacy people seem to have is that epoll wins if there's a high number of total FDs, but poll wins at lower counts. The idea is that if fallacy 1 is true, then when you have 10k FDs epoll will do better, but if you have 10 FDs poll will do better.

Now that I knew fallacy 1 was false, that epoll was O(N=active) while poll was O(N=total), I started testing this. What I did was set active equal to the total in the test, and then measured the performance. The idea is that if epoll's performance is based on active FDs, while poll's is based on total, then setting them equal would remove that as a confounding element and give me an idea of the raw performance of each.

Again, what I found is that size of active or total FDs did not matter. You could throw the same number at both and they both did the same amount of work, and actually, this shocked me, but epoll did worse every time active==total. I kind of didn't believe that, since epoll is supposed to be faster, but not at all. Poll out does it by about 15-20% when the active==total.

Before we get into the implications of that, just understand that if you think poll will win with small numbers of file descriptors and epoll with large numbers, you are completely wrong. What is the determinant, and so far the only determinant of whether poll or epoll is better is....

The Active/Total Ratio

After seeing the result of poll winning when active==total, and epoll winning at active < total, I thought, well that's an easy thing to test and graph. What I wanted then was to test the following hypothesis based on what I knew so far:

The ratio of epoll/poll event performance would drop as the ratio of active/total approached 1.0.

I suspected that, since poll was about 15%-20% faster when the active/total ratio was 1.0, then when the ratio was 0.8 the performance of poll and epoll would be equal.

So I sat down all day and ran tons of runs on my computer comparing epoll/poll performance as the active/total ratio increased. I went with something simple and did 1000 total, then slowly increased by 100. What I got was this graph:

1k graph

What you see in this graph is on the y-axis (the left) is the speed of poll over epoll as a multiplier, with higher numbers meaning epoll is faster. The red line shows you where they break even. The x-axis (numbers on the bottom) is the active/total ratio, so you can see that my hypothesis was correct. As the active/total goes to 1.0, the epoll/poll drops, but how it drops is awesome.

This graph was very surprising. Kind of shocking even, since it's showing that the performance break even for poll vs. epoll is an active/total ratio of 0.6! Yes, in this test poll is just as fast as epoll when you only have 60% of your network connections active. I couldn't believe it. There's no way epoll was that bad compared to poll.

What this graph meant is simply that if your server was even moderately overloaded, and I think 60% active traffic isn't unreasonable on quite a few server types, then epoll is hurting your performance. Even more importantly, it's a potential 20% or more performance hit.

Nah, that can't be right. I could see a break even point of 80%, but 60%? 0.6 active/total ratio? Hello no, that's much much too low. Once you start factoring other performance hits like epoll's requirement of a syscall for every insert/delete of an FD into the pool then you're talking almost a 50/50 split on which one to use.

Never Trust Your Own Lies

This required further testing. Lucky for me doing this with even larger loads and then hanging out playing guitar is actually fun. What I did next was slowly increase the total FDs to find a statistically different result, and settled on 10,000 total FDs. Then I ran the whole test suite, which took forever, and came up with this graph:

10k graph

Holy crap it's the same curve, same result, except for one difference: epoll doesn't win as much. This graph seems to indicate that as you increase the total FDs, epoll starts to lose its magic a bit. I imagine that there's a point, probably in the millions, where poll and epoll just are the same in performance, and really neither is gonna work well.

The break even point was also 60% (0.6), but slightly different, but I still had to confirm it. The chance that these numbers would come out to have the same break point was just too good.

So I ran them again. And again, same result. I did it one more time, this time with randomly selected totals ranging all over, and again same results. After this many graphs looking exactly the same, I can only conclude that my hypothesis was right, and that I can come up with this:

epoll is faster than poll when the active/total FD ratio is < 0.6, but poll is faster than epoll when the active/total ratio is > 0.6.

What Does This Mean?

Another way to put this is if your server or protocol is the type that transmits a ton of data and can't tolerate idle resources, then poll will win. This would be the case for data servers, media servers, streaming data, image caches, anything where your goal is to pump data, lots of it, and get it out fast. In this case that's because your active/total ratio is greater than 0.6 most of the time.

But, if your server is the kind with lots of idle connections just hanging out, then epoll wins not poll. That's because your active/total ratio is < 0.6 most of the time.

The active/total ratio also gives you a good scaling metric, since now you can measure for it and then figure out which one is best to use.

Interestingly enough, Mark McGranaghan came to the same conclusion but about threads vs. events but puts it into a slight different model that's much more complex (and also not backed by evidence, which is sorta lame).

I'd make a further hypothesis that you'll find this same result applies to threads, where if the active/total ratio is over say 0.6 then threads win, but if it's below 0.6 "events" win.

Now What About This Superpoll?

Let's for the rest of this article call "active/total ratio" just ATR. Way easier.

Here's the problem with all of this, everyone is going to put on their binary logic hats and argue about how server X is better with Y. Where X is some protocol, and Y is one of poll or epoll. They obviously won't make the real mental leap here which says you need both. I'd even go so far as to say it'd be damn hard to make one system that handles both load types and keep it efficient.

Normally I'd then study the protocol and traffic patterns then pick one or the other, just like all of you. However Mongrel2 (and dare I say all modern web servers) has to handle both traffic situations thanks to new modern asynchronous communication models. It's gotta handle epoll type traffic with low ATR as in http long poll, async flash sockets, web sockets, etc. It also has to handle poll type traffic with high ATR like in cranking out files and shoving raw data around. In some cases it has to do both in one server, really complicating things.

And with a threshold of just 0.6 for the ATR where one or another wins, it's hard to just pick one and stick with it. We could say just use epoll, but then we're losing a whopping 20% on high ATR. Even as the total increases epoll doesn't do as hot as it did before, so the syscall penalty it has for each socket insert/removal is gonna make it pointless. However you can't go with poll because there's a lot of traffic that just would cause the server to get killed, like with chat.

What I need, and probably everyone needs, is both. I need a poll that can take and move FDs to epoll when the ATR is < 0.6, then move some back when the ATR > 0.6. In fact, I need it to do this depending on where the FD is, in epoll or poll. What I need, is you guessed it, superpoll.

Yeah cheesy name, but here's what I've got in mind, and will try out this week:

  1. poll's job is cranking on high ATR traffic.
  2. epoll's job is babysitting low ATR traffic.
  3. Since poll is mostly active stuff, it has to be in the main processing loop.
  4. Since epoll is mostly idle stuff, it can hang out in a second process or thread.
  5. They need to move batches of FDs between them.
  6. poll needs to move batches of idle FDs to epoll when ATR < 0.6.
  7. epoll needs to move batches of active FDs to poll when ATR > 0.6.
  8. But, this is poll-ATR and epoll-ATR, not a global ATR.

So here's the crazy scheme that just might work, and could be very portable to other event systems that come out later (kqueue, SIGIO, AIO).

  1. The main loop uses poll and tracks the poll-ATR.
  2. A second epoll process is started that also tracks epoll-ATR.
  3. Between them is a domain socket and they use ZeroMQ to talk.
  4. When poll runs and we see an ATR < 0.6, we "compact" the idle sockets and send them to the epoll over 0MQ.
  5. When epoll runs and we see an ATR > 0.6, we also compact and send the active sockets to poll.
  6. Since epoll is generally faster, make it hard to get out of epoll.

With this plan, we can have a second process who's job is to just do nothing but watch insane numbers of idle connections and report them to the main server with ZeroMQ. In the main server we just crank on active connections all day long, but when things go quiet, we hand them off to epoll to babysit. With a bit of fine tuning of the thresholds, we can then make the server adaptive about what loads it can handle.

For example, if I know that one of my servers is going to be mostly idle connections then I set the poll-ATR fairly low so that they get sent over more often, and the epoll-ATR fairly high so they stay there. If I'm running a media server then I'd want the inverse.

Another thing about using 0MQ is that it will work the same whether I make the epoller a process or a thread, or whether it uses epoll or some other low ATR friendly event system.

The Crazy Plan

That my friends is the crazy plan for Mongrel2. The evidence supports that we need both poll and epoll in modern web servers due to the nature of the protocols being used. The plan is to create a super poll where there's two processes that handle the different loads. If that doesn't work then we'll try having poll wait on the epoll FD in a single process.

You should feel free to take the test case I did and try it yourself, just to confirm the numbers. If you want to try kqueue in the same test go ahead, or SIGIO could be interesting too.