Low Latency Optimization: Understanding Pages (Part 1)

  • So many puzzling things here, from the brand new user account created to post this (a portmanteau of Jump Trading and Citadel), to the very minimal information presented (even my own article on it for my software covers about as much), to the people in the comments here conflating virtual memory with hard disk paging in spite of TFA, to red herring comments about RT scheduling, ...

  • Low latency trading (sometimes referred to as HFT) focuses heavily on data locality. That is, making sure the critical path, from market data coming in to a new order or a cancel going out (tick-to-trade), operates in cache as much as possible and avoids main-memory access as much as possible. Pack all the needed data together into as few cachelines as possible. To keep that data in cache so tick-to-trade really does operate out of it, warming the caches between orders is sometimes employed too. This prevents those cachelines from being evicted: you send fake orders that won't actually go out, but that pull the needed cachelines back in before the real orders, so the caches are hot when they are needed.
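    A minimal sketch of the warming idea (the struct layout and names here are hypothetical, just to illustrate the technique): pack the hot-path state into aligned cachelines, then prefetch every line of it before the next tick arrives.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical hot-path state: everything tick-to-trade needs,
     * packed into as few cachelines as possible. */
    struct hot_path_state {
        uint64_t last_price;          /* latest market data */
        uint64_t position;            /* current position */
        uint64_t next_order_id;       /* next order id to use */
        char     order_template[40];  /* pre-built order message */
    } __attribute__((aligned(64)));   /* start on a cacheline boundary */

    static struct hot_path_state g_state;

    /* Warm the caches: touch every cacheline of the hot state so it is
     * resident before the next real tick arrives. */
    static void warm_cache(void) {
        const char *p = (const char *)&g_state;
        for (size_t off = 0; off < sizeof g_state; off += 64)
            __builtin_prefetch(p + off, 1 /* for write */, 3 /* keep in all levels */);
    }

    int main(void) {
        warm_cache();
        printf("hot state: %zu bytes, %zu cacheline(s)\n",
               sizeof g_state, (sizeof g_state + 63) / 64);
        return 0;
    }
    ```

    Real warming is usually done by running the actual send path with a flag that suppresses the wire write, so the code and branch predictors get warmed as well, not just the data.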

  • I've been working for HFT firms since I moved to NYC over a decade ago. The article looks like a good summation of HugePage benefits (I'm a sysadmin, not a programmer, so I understand it only on a topical level).

    What I do find fascinating is that HRT is actively blogging about this stuff. Ten years ago, everyone in the biz was super secretive and never made any public announcement about what we did - even stuff that I would take from HPE and RHEL low latency manuals (which were public knowledge). You never said anything publicly because protecting the "secret ingredients" of the trading system was paramount and any disclosure was one step towards breaking that barrier.

    Now, I'm seeing HFT companies post articles like this, and I'm thinking it has to be for recruiting. Why else would they do it?

    Anyway, as a side note, if you liked this article, you'd also probably like this:

    http://hackingnasdaq.blogspot.com/

    It was one of my favorite reads because it was written by someone going through the journey of low latency exploration - before everything was taken over by FPGAs.

  • If any of this is interesting to you, but you would like some deeper content to bite into, perhaps start here:

    https://lmax-exchange.github.io/disruptor/disruptor.html

    This technical paper sent me on a multi-year journey regarding one simple question: "If this stuff is fast enough for fintech, why can't we make everything work this way?" Handling millions of requests per second on 1 thread is well beyond the required performance envelope for most public "webscale" products.
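    The core trick in the Disruptor paper is a preallocated ring buffer coordinated by sequence counters rather than locks. A minimal single-producer/single-consumer sketch of that idea in C (the names are mine, not LMAX's):

    ```c
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SIZE 1024  /* power of two so we can mask instead of mod */

    struct ring {
        _Atomic uint64_t head;   /* next slot the producer will write */
        _Atomic uint64_t tail;   /* next slot the consumer will read */
        int64_t slots[RING_SIZE];
    };

    /* Producer side: claim the next sequence, write, then publish. */
    static int ring_push(struct ring *r, int64_t v) {
        uint64_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint64_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (h - t == RING_SIZE) return 0;               /* full */
        r->slots[h & (RING_SIZE - 1)] = v;
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 1;
    }

    /* Consumer side: read up to the published sequence, then advance. */
    static int ring_pop(struct ring *r, int64_t *out) {
        uint64_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint64_t h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t == h) return 0;                           /* empty */
        *out = r->slots[t & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return 1;
    }

    int main(void) {
        static struct ring r;       /* zero-initialized: empty ring */
        for (int64_t i = 0; i < 100; i++)
            ring_push(&r, i);
        int64_t v, sum = 0;
        while (ring_pop(&r, &v))
            sum += v;
        printf("drained sum = %lld\n", (long long)sum);  /* 0+1+...+99 = 4950 */
        return 0;
    }
    ```

    The real Disruptor adds per-consumer sequences, batching, and cacheline padding around the counters to avoid false sharing, but the no-locks-on-the-hot-path shape is the same.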

  • This article is pretty thin but it's not wrong.

    If you're interested in consistent low latency, you do need to avoid TLB misses, and also page faults, cache contention, cache-coherency delays from the CC protocol (MOESI/MESI(F), i.e. making sure no other cores are accessing your memory), and branch mis-prediction - and that's after you have put all your cores' threads into SCHED_FIFO. Using https://lttng.org/ can be really helpful for checking what's actually happening.
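    For reference, moving the current thread onto SCHED_FIFO on Linux is only a few lines; it normally requires CAP_SYS_NICE or root, so this sketch just reports failure otherwise.

    ```c
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        /* On Linux, real-time priorities for SCHED_FIFO run 1..99. */
        int lo = sched_get_priority_min(SCHED_FIFO);
        int hi = sched_get_priority_max(SCHED_FIFO);
        printf("SCHED_FIFO priority range: %d..%d\n", lo, hi);

        struct sched_param sp = { .sched_priority = 80 };  /* arbitrary high value */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler (needs CAP_SYS_NICE or root)");
        else
            printf("now running under SCHED_FIFO\n");
        return 0;
    }
    ```

    A word of caution: a busy-looping SCHED_FIFO thread can starve everything else on that core, which is why it's usually paired with isolated cores rather than run on a shared box.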

  • Feels a bit blogspammy. Drepper's article is linked to for good reason.

  • The OS page size is such a prevalent notion in software that it's almost shocking.

    I was oblivious to it a year ago, before I got interested in database internals.

    Something I found interesting: there's a recent presentation by Neumann about the Umbra DBMS where he fields a question about hugepages at the end, and I recall him saying they don't use them.

    I know Oracle and MySQL recommended Transparent Hugepages, IIRC.
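    For what it's worth, opting a region into Transparent Hugepages from user space is just an madvise call; whether the kernel actually honors it depends on the mode in /sys/kernel/mm/transparent_hugepage/enabled. A small sketch:

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 16 * 1024 * 1024;  /* 16 MiB: room for several 2 MiB pages */

        /* Anonymous private mapping; THP can back it with 2 MiB pages. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask the kernel to prefer huge pages for this range
         * (this is what THP's "madvise" mode keys off). */
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");  /* non-fatal: THP may be disabled */

        memset(buf, 0, len);  /* touch the pages so they are actually allocated */
        printf("mapped %zu MiB with MADV_HUGEPAGE\n", len >> 20);
        munmap(buf, len);
        return 0;
    }
    ```

    Whether buffers like this actually got 2 MiB pages can be checked afterwards in /proc/self/smaps (the AnonHugePages field).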

  • So useless.

  • If you're doing truly low latency stuff you shouldn't be swapping at all, everything should be 100% resident in memory at all times. So "pages" are totally irrelevant to you. (You should also probably be using something like the PREEMPT_RT patchset, adjust scheduling priorities and try your best to ensure that the CPU core(s) your app is running on aren't burdened by serving interrupts. Plus likely a lot of other stuff that I haven't touched on in this brief comment.)
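    A sketch of that "everything resident, pinned to a quiet core" setup on Linux (the core number is hypothetical, and mlockall typically needs RLIMIT_MEMLOCK raised or CAP_IPC_LOCK):

    ```c
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Pin this process to one core - core 3 here is a placeholder;
         * ideally it is one shielded from the scheduler and IRQs
         * (e.g. via isolcpus and IRQ affinity settings). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            perror("sched_setaffinity");

        /* Lock all current and future pages into RAM:
         * no swapping, no major page faults on the hot path. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall (raise RLIMIT_MEMLOCK or grant CAP_IPC_LOCK)");
        else
            printf("address space locked resident\n");
        return 0;
    }
    ```

    Even with everything resident, pages aren't entirely irrelevant: TLB misses on 4 KiB pages still cost cycles, which is exactly the case the article's huge pages address.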