From Kafka to ZeroMQ for real-time log aggregation

  • I used to be on a team responsible for a single small-ish Kafka cluster (6-12 nodes) doing non-trivial throughput on bare metal. Without commenting on whether ZeroMQ is the right alternative: I can understand being scared off. Our hand was forced, so we had to go the other way and dig into what was actually going on in Kafka.

    The kicker is that Kafka can be rock solid in terms of handling massive throughput and reliability when the wheels are well greased, but there are a lot of largely undocumented lessons to learn along the way regarding configuration and certain surprising behaviors that can arise at scale (such as https://issues.apache.org/jira/browse/KAFKA-2063, which our team ran into maybe a year ago and which is only being fixed now).

    Symptoms of these issues can cause additional knock-on effects with respect to things like leader election (we wound up with a "zombie leader" in our cluster that caused all sorts of bizarre problems) and graceful shutdowns.

    Add to that the fact that the software is still very much under active development (sporadic partition-replica drops after an upgrade from 0.8.1 to 0.8.2; we had to apply some small but crucial patches from Uber's fork) and that it needs a certain level of operational maturity to monitor it all ... it's easy to get nervous about what the next "surprise" will be.

    Having said all that, I'd use Kafka again in a heartbeat for those high volume use cases where reliability matters. Not sure I'd advise others without similar operational experience to do the same for anything mission critical, though -- unless you like stress. That stress is why Confluent is in business. :)

  • To me it sounds like Kafka was not understood in full detail (maybe because of missing documentation or the high complexity), so they switched to a system they built themselves. Naturally they know in full detail what is going on and can set the system up as needed.

    I am wondering if working on solving the actual problems with Kafka would have been the better route. I've never used Kafka and I find ZeroMQ great, but reading that their logging solution drops log messages is a huge no-go for operations. How can you claim to run a serious business and say "babies will die" when you can't be sure you'll be able to find problems?

    Because, when will you lose logs? Not in normal operation, but when weird things happen. When networking has a hiccup. When load on the system is too high, so most likely when many people are using your service. Exactly when the shit hits the fan. And you just made the decision that it's OK to drop log messages in such cases? That's not good.

    I think you should either dive into Kafka/ZooKeeper and fix your problems or switch to another logging solution. You should probably just drop that nonsense "streaming and real-time logs" requirement, live with a log delay of a few seconds, and build something really stable instead of something inherently unstable. Honestly, just collecting syslogs on the core VM and sending them to a central server would have been the better solution. Better than looking into fancy real-time, streaming logs on a Sunday night because the system is having a breakdown and you can't even be sure you aren't missing essential logs.

  • I don't understand why people need such ridiculously fast systems when we are using RabbitMQ and crappy Apache Flume and we generate more than 5k messages/second with spikes of 50k. Please, author of the article, tell me your metrics.

    And our log messages are ridiculously big at times (15k to as big as 50k).

    Our pipe never has problems. What fails for us is Elasticsearch. In fact, at one point in the past we did 100k messages/s when we embarrassingly had debug turned on in production, and RabbitMQ did not fail, but Elasticsearch did, and sadly Flume did as well (I tried to get rid of Flume with a custom Rust AMQP-to-Elasticsearch client, but at the time the libraries had some bugs... maybe I will check out Mozilla Heka again someday).

    There is this sort of beating of the developer chest at a lot of tech companies: hey, listen, we are ultra important and we are dealing with ridiculous traffic and we need ultra-high performance. Please tell/show me these numbers... or maybe stop logging crap you don't need to log.

    Or maybe I'm wrong and we should log absolutely everything, and Auth0 made the right choice given their needs (let's assume they have millions of messages a second); I still think I could make a sharded RabbitMQ go pretty far.

    This goes for other technology as well. You don't need to pick a hot, glamorous NoSQL store when PostgreSQL or MySQL and a tiny bit of engineering will get the job done just fine, particularly when mature solutions give you so many things free out of the box (RabbitMQ gives you a ton of stuff, like a cool admin UI and routing, that you would have to build yourself on ZeroMQ).

  • Did you ever try running 5 ZKs in the ensemble? 3 is the absolute minimum that survives a single machine failure. If you are having trouble with availability, it seems natural to increase your safety factor there.

    I was surprised by the contrasting sense of the importance of delivery guarantees in the article. At the start, losing a message was akin to the death of a child. At the end, a shrug. Now every single machine failure (or even an ømq process restart) will lose you the log messages stored in memory :(.

    Glad to hear you found a solution that worked for you though! Would love to hear about difficulties you had with the new system, in particular adding brokers.
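    The ensemble-sizing suggestion above comes down to ZooKeeper's majority-quorum rule. A minimal sketch of the arithmetic (the function names are just for illustration):

```python
# ZooKeeper stays available only while a strict majority (quorum) of the
# ensemble is up. An ensemble of n nodes therefore tolerates
# floor((n - 1) / 2) failures: going from 3 to 5 nodes doubles the number
# of machines you can lose.

def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures the ensemble can survive while keeping a quorum."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

    Note that even-sized ensembles buy nothing: 4 nodes tolerate the same single failure as 3, which is why odd sizes are the convention.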

  • FYI, Kafka doesn't need to fetch from disk every time, as it caches the logs pretty aggressively (via the OS page cache), as long as you have enough memory.

    Running ZK and Kafka on the same nodes is likely not the best thing.

  • FWIW, you can get Kafka packaged as a fully managed and HA service from https://aiven.io on AWS and also Azure, GCE and DigitalOcean.

    But if Auth0 runs their entire operation on AWS, maybe Kinesis would have been a more natural transition.

  • The author correctly points out that he is comparing apples to oranges.

    Kafka gives you features that certain systems cannot live without, like on-disk persistence (saved my life a couple of times) and topics. Filtering messages on the client side, like ZeroMQ does, is not an option in many cases; just think about security. I think Kafka has a long way to go before it can be used as a general message queue (many features are not there yet, like visibility timeouts, for example), but if you can manage ZooKeeper and have the means to work with it (somebody understands it and knows its quirks), it can provide a reliable platform for distributing a large number of messages with low latency and high throughput, just like it does at LinkedIn.

  • With ZeroMQ I had the worst possible results and experience. Honestly, much of what it claims is bogus. It is highly optimized for certain cases and utterly useless for distributed systems. Try to find out in PUB/SUB what the IP addresses of the subscribers are: not possible. In many cases you will be much better off learning TCP/IP yourself. In the mentioned case, you simply iterate over the vector of subscribers yourself, which is much more powerful and the sane default. It seems at some point people confused internal networking solutions with the Internet.
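    The "iterate over the vector of subscribers" approach can be sketched with plain sockets; this is an illustrative stand-in, not a production fan-out, and the class and method names are invented:

```python
import socket

# Plain-TCP fan-out: the publisher keeps its own list of subscriber
# connections, so it always knows who they are (unlike ZeroMQ PUB/SUB,
# where subscriber identities are hidden inside the socket).
class Publisher:
    def __init__(self):
        self.subscribers = []   # list of (address, socket) pairs we control

    def add(self, addr, sock):
        self.subscribers.append((addr, sock))

    def publish(self, payload: bytes):
        dead = []
        for addr, sock in self.subscribers:
            try:
                sock.sendall(payload)       # deliver to each subscriber in turn
            except OSError:
                dead.append((addr, sock))   # remember broken connections
        for pair in dead:
            self.subscribers.remove(pair)   # drop them from the vector

    def addresses(self):
        # The subscriber addresses are right here, available for inspection.
        return [addr for addr, _ in self.subscribers]
```

    The point of the sketch is visibility and control: delivery policy (retry, drop, disconnect) is a per-subscriber decision in your own loop instead of opaque library behavior.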

  • ZMQ's default behavior (and in some cases only behavior) of dropping new messages when buffers are full made it a no-go for my client. We ended up switching away from ZMQ to a more traditional durable queue, saved a ton of code complexity, and got a lot of reliability in the process. Having now researched it, I can't think of a reason I'd ever use ZMQ again. I'll either use a durable queue when I care about message delivery, or something much more traditional when I don't.
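    The drop-on-full behavior described above (ZeroMQ's high-water mark on PUB sockets) can be mimicked with a bounded queue. This is a stdlib sketch of the pattern, not ZeroMQ's actual implementation:

```python
import queue

# A bounded buffer that silently discards new messages once full,
# mimicking what a ZeroMQ PUB socket does at its high-water mark (SNDHWM).
class DroppingBuffer:
    def __init__(self, hwm: int):
        self.q = queue.Queue(maxsize=hwm)
        self.dropped = 0

    def send(self, msg) -> bool:
        try:
            self.q.put_nowait(msg)     # enqueue while below the high-water mark
            return True
        except queue.Full:
            self.dropped += 1          # discard with no error to the sender
            return False

buf = DroppingBuffer(hwm=2)
results = [buf.send(i) for i in range(5)]
print(results, buf.dropped)   # [True, True, False, False, False] 3
```

    A durable queue inverts this trade-off: the producer blocks (or the broker persists to disk) instead of losing data, which is exactly the property the comment is after.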

  • Maybe TANK ( https://github.com/phaistos-networks/TANK ) would have been a good alternative here. No feature parity with Kafka, but setting it up is a matter of running one binary and creating a few topics, and it is faster than Kafka for produce/consume operations. (Disclosure: I am involved in its development.)

  • Did you consider MQTT? It sounds to me like a more natural choice.

  • You probably should have been running ZK and the Kafka queues separate from the CoreOS/container shenanigans.

    If deployed using the Netflix co-processes, both are very durable.

  • You can deploy Kafka using DC/OS, and it takes care of HA for you. DC/OS is quickly becoming the go-to solution for database deployments; ArangoDB even recommends it as the default.

    Now, about Kafka vs. ZeroMQ: you want Kafka if you cannot tolerate the loss of even a single message. The append-only log with committed reader positions is a perfect fit for that.
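    The append-only-log-plus-committed-offsets model can be sketched in a few lines; this is a toy illustration of the idea, not Kafka's API, and the class and method names are invented:

```python
# Messages are never removed on read; each consumer just commits how far it
# has gotten. A crashed reader resumes from its last committed offset, so
# nothing is lost (only possibly re-read).
class Log:
    def __init__(self):
        self.entries = []          # append-only message store
        self.committed = {}        # consumer id -> next offset to read

    def append(self, msg) -> int:
        self.entries.append(msg)
        return len(self.entries) - 1   # offset of the new message

    def poll(self, consumer: str, max_records: int = 10):
        start = self.committed.get(consumer, 0)
        return self.entries[start:start + max_records]

    def commit(self, consumer: str, offset: int):
        self.committed[consumer] = offset  # persisted durably in a real system

log = Log()
for m in ("a", "b", "c"):
    log.append(m)
batch = log.poll("reader-1", max_records=2)   # ['a', 'b']
log.commit("reader-1", 2)
print(log.poll("reader-1"))                   # ['c']
```

    Committing *after* processing gives at-least-once delivery; committing before processing gives at-most-once. The log itself never drops anything either way.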

  • I'm a total message-queue noob. What are the use cases for them?

    I used MQTT but only as a message bus.

  • Did you look at nsq.io or NATS?

  • Why don't all these companies ever just use real enterprise software?

    There are about a dozen message systems out there that will handle much more than Kafka with minimal or no operational overhead while supporting everything they need.

  • I came up with a very different solution for real-time access to logs: tail them to Slack. It's not an aggregation solution and doesn't work well if you have chatty logs with nothing to filter on, but if you just want to be notified when things are happening in the logs, it's pretty nice and doesn't need any infrastructure.

    http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-...
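    The tail-to-Slack idea boils down to a filter plus a webhook POST. A minimal sketch, assuming a Slack incoming webhook; the URL and patterns below are placeholders, not from the linked post:

```python
import json
import urllib.request

# Scan log lines for interesting patterns and forward matches to a Slack
# incoming webhook. Placeholder values for illustration only.
PATTERNS = ("ERROR", "FATAL", "Traceback")
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def should_notify(line: str) -> bool:
    """Only forward lines matching one of the configured patterns."""
    return any(p in line for p in PATTERNS)

def notify(line: str):
    body = json.dumps({"text": line}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)   # fire-and-forget; no retries in this sketch

# Typical use: pipe `tail -F app.log` into a loop that calls
# notify(line) whenever should_notify(line) is true.
```

    The obvious caveat matches the comment's own: this notifies, it doesn't aggregate, and a flood of matches will hit Slack's rate limits.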

  • 2015

  • Anyone use collectd + RRD for this purpose? Still trying to understand at what point it's worth moving to something else.

  • So you used Kafka for something that should have been handled by MQTT or ZeroMQ in the first place?

  • But why ZeroMQ and not nanomsg?