Here is an article on the Meta engineering blog about BOLT:
https://engineering.fb.com/2018/06/19/data-infrastructure/ac...
The gains depend on how much time the workload spends in the kernel. The Propeller team showed a 2% performance improvement over PGO+LTO in this LLVM discussion post: https://discourse.llvm.org/t/optimizing-the-linux-kernel-wit...
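To make "time spent in the kernel" concrete, here is a minimal sketch (my own illustration, not from the linked posts) that splits a process's CPU time into user and kernel components with getrusage(); in practice running the workload under time or perf stat gives the same breakdown:

    /* Minimal sketch: split a process's own CPU time into user vs. kernel
     * components via getrusage() (POSIX). The workload below is a stand-in
     * with both a compute-heavy and a syscall-heavy phase. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        volatile unsigned long sum = 0;
        for (unsigned long i = 0; i < 50UL * 1000 * 1000; i++)
            sum += i;                            /* user-space work */
        for (int i = 0; i < 100000; i++) {       /* syscall-heavy work */
            FILE *f = fopen("/dev/null", "w");
            if (f) { fputc('x', f); fclose(f); }
        }

        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) != 0) {
            perror("getrusage");
            return EXIT_FAILURE;
        }
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        printf("user: %.3fs  kernel: %.3fs  kernel share: %.1f%%\n",
               user, sys, 100.0 * sys / (user + sys));
        return 0;
    }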
More details about Propeller are available in a recently published paper: https://research.google/pubs/propeller-a-profile-guided-reli...
Is this specific to big programs running in data centers, or could it be applied to programs people run locally on their computers or phones?
I'd love to know whether the kernel could be optimized this way for PC / Android users too.
I seem to remember that glandium did something like this on the Firefox binary to optimize startup, with an impressive speedup. Can't find the details, though.
If some of the optimization gain comes from poor alignment of potentially fused operations that can't be fused because they straddle a 64-byte cache-line boundary, it feels like something the compiler should be aware of and mitigate.
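For what it's worth, compilers can already pad hot code so a compare-and-branch pair doesn't get split across a cache-line boundary; the catch is that without a profile they don't know which branches are hot. A rough sketch of the kind of manual mitigation available today (my own example using a GCC/Clang attribute, not BOLT output; the flags mentioned in the comment vary by compiler and version):

    /* Pad the hot function to a 64-byte boundary so its first instructions,
     * including any macro-fusable cmp+jcc pair at the top of the loop, are
     * less likely to straddle a cache-line boundary. Alignment knobs such as
     * GCC's -falign-functions=64 / -falign-loops=64 do the same thing
     * globally, at the cost of bloating cold code. */
    #include <stdio.h>

    __attribute__((aligned(64), noinline))
    static long count_negatives(const int *v, long n) {
        long hits = 0;
        for (long i = 0; i < n; i++) {
            /* The compare-and-branch generated here is the kind of pair
             * that only macro-fuses when it stays within one fetch line. */
            if (v[i] < 0)
                hits++;
        }
        return hits;
    }

    int main(void) {
        int data[256];
        for (int i = 0; i < 256; i++)
            data[i] = (i % 3 == 0) ? -i : i;
        printf("%ld negatives\n", count_negatives(data, 256));
        return 0;
    }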
Aren't profile-guided optimizers capable of doing similar optimizations?
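For reference, here is roughly how I understand the two workflows compare on a toy program (my own sketch; the example and the exact command lines are assumptions on my part and vary by toolchain and BOLT version, see the BOLT docs linked elsewhere in the thread):

    /* Toy program whose hot/cold branch split is only knowable from a
     * profile, so both compile-time PGO and a post-link pass need one;
     * BOLT just applies the profile after linking, when the final code
     * layout and addresses are known.
     *
     * Compile-time PGO (Clang):
     *   clang -O2 -fprofile-generate pgo_demo.c -o pgo_demo
     *   ./pgo_demo                     # training run, writes default.profraw
     *   llvm-profdata merge -o pgo_demo.profdata default.profraw
     *   clang -O2 -fprofile-use=pgo_demo.profdata pgo_demo.c -o pgo_demo
     *
     * Post-link BOLT pass on the already-linked binary:
     *   clang -O2 -Wl,--emit-relocs pgo_demo.c -o pgo_demo
     *   perf record -e cycles:u -j any,u -- ./pgo_demo
     *   perf2bolt -p perf.data -o pgo_demo.fdata ./pgo_demo
     *   llvm-bolt pgo_demo -o pgo_demo.bolt -data=pgo_demo.fdata \
     *       -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
     */
    #include <stdio.h>
    #include <stdlib.h>

    static long work(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            if (i % 100 != 0)       /* hot side: ~99% of iterations */
                acc += i;
            else                    /* cold side */
                acc -= i * 3;
        }
        return acc;
    }

    int main(int argc, char **argv) {
        long n = (argc > 1) ? atol(argv[1]) : 100000000L;
        printf("%ld\n", work(n));
        return 0;
    }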
If it increases performance by 5%, then roughly 5% of execution time was being spent on branch mispredictions and instruction cache misses... which sounds utterly implausible. Conventional wisdom has it that instruction caching is not a problem because, whatever the size of the binary, it is dwarfed by the size of the data. And hot loops are generally no more than a few KB at most anyway. I'm skeptical.
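Back-of-the-envelope on that first step (my own arithmetic, not from the article): a 5% speedup means the new runtime is 1/1.05 of the old one, so about 4.8% of the original runtime would have been going to those front-end stalls, which is roughly the figure above:

    /* Relation between a claimed speedup s and the fraction of the original
     * runtime that must have been eliminated: new = old / (1 + s), so the
     * eliminated fraction is 1 - 1/(1 + s). For s = 5% that is ~4.8%. */
    #include <stdio.h>

    int main(void) {
        double s = 0.05;                              /* claimed speedup */
        double eliminated = 1.0 - 1.0 / (1.0 + s);
        printf("A %.0f%% speedup removes %.1f%% of the original runtime\n",
               s * 100.0, eliminated * 100.0);
        return 0;
    }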
This looks like iffy blogspam of https://github.com/llvm/llvm-project/blob/main/bolt/docs/Opt...
The claim in it is 'up to 5% improvement', so the title seems overstated as well.
Interestingly, Google went ahead and made its own version of BOLT after it was already open source.
It's called Propeller, and it purportedly had some advantages, afaik.
Anyone know if such large-scale experiments have been conducted with this for the sake of comparison?