This post is literally about saving one byte, at the cost of being slower and not producer-consumer friendly. Not very interesting.
The flag could have been hidden as a spare bit in any other field. Then it could at least be masked off with a simple AND operation, which is usually faster than branching, especially on pipelined CPUs.
Update: Quick implementation: https://gist.github.com/dpc/a194b7784adfa150a450
This fix for the concurrency issue is an ugly hack. I'm not sure it's even correct in this particular scenario, and it's definitely not proper for anything that aspires to be good reusable code. I'd advise pushing the atomicity requirement onto the caller: IRQs should have been disabled by the calling code.
The "register" keyword is obsolete. There's no point in using it.
I did something similar back in the DOS era for a serial port library: https://github.com/kstenerud/DOS-Serial-Library/blob/master/...
You still have a buffer of size X that may store X-Y items. I fail to see the "zero memory waste" (not that it's a good tradeoff).
Gosh, all that to save one byte? ;-)
A conceptually simpler way to do this is to assign one (or more) extra bits in the head and tail pointers. For example, for a 512-entry ring, use 16-bit indices. Whenever you index the ring, AND the index with 511 before performing the lookup.
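
For what it's worth, here is a rough sketch of that idea in C. It is not the code from the post, and the names (`struct ring`, `ring_put`, `ring_get`, `ring_count`) are made up for illustration. It assumes a single context touching the ring (no IRQ or thread safety) and a power-of-two size, so that the 16-bit indices can run freely and wrap at 65536 without upsetting the mask.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 512u              /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)  /* 511 */

/* Hypothetical layout: indices are free-running and only masked on access. */
struct ring {
    uint16_t head;             /* total items written, modulo 65536 */
    uint16_t tail;             /* total items read, modulo 65536 */
    uint8_t  data[RING_SIZE];
};

/* Number of stored items; unsigned wraparound keeps the difference correct. */
static uint16_t ring_count(const struct ring *r)
{
    return (uint16_t)(r->head - r->tail);
}

/* Returns false if the ring already holds RING_SIZE items. */
static bool ring_put(struct ring *r, uint8_t v)
{
    if (ring_count(r) == RING_SIZE)
        return false;
    r->data[r->head & RING_MASK] = v;   /* AND with 511 before indexing */
    r->head++;
    return true;
}

/* Returns false if the ring is empty. */
static bool ring_get(struct ring *r, uint8_t *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->data[r->tail & RING_MASK];
    r->tail++;
    return true;
}
```

With this layout, empty is `head == tail` and full is `head - tail == 512`, so all 512 slots are usable and no separate flag is needed. The cost is the extra index bits rather than a wasted slot.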