This just in: Uber rediscovers what all of us database people already knew, namely that structured data is usually way easier to compress, store, index, and query than unstructured blobs of text. That's why we kept telling you to stop storing JSON in your databases.
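To put a rough number on that, here's a toy sketch (my own, not anything from the article): the same 100k synthetic events gzipped once as free-text JSON lines and once as three plain columns. The fields are invented; the point is that the columnar layout drops the per-line field names and punctuation and groups similar values together, which is exactly the redundancy being described.

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Toy comparison, not Uber's pipeline: 100k synthetic events encoded as
    // free-text JSON lines vs. as three separate columns, then gzipped.
    public class StructuredVsText {
        static int gzipSize(String s) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.size();
        }

        public static void main(String[] args) throws Exception {
            StringBuilder json = new StringBuilder();
            StringBuilder ts = new StringBuilder(), user = new StringBuilder(), lat = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                long t = 1_700_000_000L + i;
                int u = i % 500, ms = (i * 37) % 250;
                // Unstructured: every line repeats the field names and formatting.
                json.append("{\"ts\":").append(t)
                    .append(",\"user\":\"u").append(u)
                    .append("\",\"latency_ms\":").append(ms).append("}\n");
                // Structured: each field stored as its own column, names stored once elsewhere.
                ts.append(t).append('\n');
                user.append(u).append('\n');
                lat.append(ms).append('\n');
            }
            System.out.println("gzipped JSON lines:   " + gzipSize(json.toString()) + " bytes");
            System.out.println("gzipped column store: "
                    + (gzipSize(ts.toString()) + gzipSize(user.toString()) + gzipSize(lat.toString()))
                    + " bytes");
        }
    }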
"Page not found"
Apparently the Uber site noticed I'm not in the USA and automatically redirects to a localized version, which doesn't exist. If their web development capabilities are any indication, I'll skip their development tips.
I'm not trying to flame bait here, but this whole article refutes the "Java is Dead" sentiment that seems to float around regularly among developers.
This is a very complicated and sophisticated architecture that leverages the JVM to the hilt. The "big data" architecture that Java and the JVM ecosystem present is really something to be admired, and it can definitely move big data.
I know that competition to this architecture must exist in other frameworks or platforms. But what exactly would replace the HDFS/Spark/YARN stack described in the article? Are there non-JVM equivalents of this stack, or of other big data projects like Storm, Hive, Flink, and Cassandra?
And granted, Hadoop is somewhat "old" at this point. But I think it (and Google's original MapReduce paper) significantly moved the needle in terms of architecture. Hadoop's MapReduce might be dated, but HDFS is still being used very successfully in big data centers. Has the cloud and/or Kubernetes completely replaced the described style of architecture at this point?
Honest questions above, interested in other thoughts.
I really didn't know whether this was going to be an article about structuring sequential information or about a more efficient way to produce wood. Hacker news!
I clicked, found out, and was disappointed that this wasn't about wood.
Maybe I should start that woodworking career change already.
Always bugged me that highly repetitive logs take up so much space!
I'm curious: are there any managed services or simple-to-use setups to take advantage of something like this for massive log storage and search? (Most hosted log aggregators I've looked at charge by the raw-text GB processed.)
This is basically sysadmin 101, however.
Compressing logs has been a thing since the mid-1990s.
Minimizing writes to disk, or setting up a way to coalesce the writes, has also been around for as long as we have had disk drives. If you don't have enough RAM on your system to buffer the writes so that more of the writes get turned into sequential writes, your disk performance will suffer - this too has been known since the 1990s.
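For anyone newer to this, the coalescing part really is just a big buffer in front of the file handle. A minimal Java sketch of the idea (generic, not from the article): wrap the output stream in a large BufferedWriter so thousands of tiny log lines turn into a few big sequential writes.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Minimal illustration of write coalescing: log lines accumulate in a large
    // in-memory buffer and only hit the disk in big sequential chunks.
    public class CoalescedLogWriter implements AutoCloseable {
        private final BufferedWriter out;

        public CoalescedLogWriter(Path path, int bufferBytes) throws IOException {
            this.out = new BufferedWriter(
                    new OutputStreamWriter(
                            Files.newOutputStream(path, StandardOpenOption.CREATE, StandardOpenOption.APPEND),
                            StandardCharsets.UTF_8),
                    bufferBytes);
        }

        public void log(String line) throws IOException {
            out.write(line);   // stays in RAM until the buffer fills
            out.write('\n');
        }

        @Override
        public void close() throws IOException {
            out.close();       // flushes whatever is left, then closes the file
        }

        public static void main(String[] args) throws IOException {
            try (CoalescedLogWriter w = new CoalescedLogWriter(Paths.get("app.log"), 1 << 20)) {
                for (int i = 0; i < 100_000; i++) {
                    w.log("request " + i + " handled in " + (i % 50) + " ms");
                }
            } // a handful of large writes instead of 100k tiny ones
        }
    }

Real servers obviously have to balance this against losing the buffered tail on a crash, which is the usual durability-vs-throughput trade-off.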
Man, I was expecting constraint linear programming.
Log4j is 350,000 lines of code ... and you still need an add-on to compress logs?
Given the small size of the logs at my employer, they can probably be compressed to a few tens of KB. Compression is always a good thing, but especially when you need to cut cloud storage costs.
Disclaimer: I run Developer Relations for Lightrun.
There is another way to tackle the problem for most normal, back-end applications: Dynamic Logging[0].
Instead of adding a large amount of logs during development (and then having to deal with compressing and transforming them later), one can choose to only add the logs required at runtime.
This is a workflow shift, and as such should be handled with care. But for the majority of logs used for troubleshooting, it's actually a saner approach: Don't make a priori assumptions about what you might need in production, then try and "massage" the right parts out of it when the problem rears its head.
Instead, when facing an issue, add logs where and when you need them to almost "surgically" only get the bits you want. This way, logging cost reduction happens naturally - because you're never writing many of the logs to begin with.
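For readers who want the flavor of this without any particular product: plain Log4j 2 already lets you flip a logger's level at runtime via Configurator.setLevel (assuming log4j-core is on the classpath). It's a crude approximation, since it only toggles statements you already wrote rather than adding new log points on the fly as described above, but it shows the "only pay for logs while you're actually investigating" idea:

    import org.apache.logging.log4j.Level;
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.apache.logging.log4j.core.config.Configurator;

    // Crude approximation of on-demand logging with plain Log4j 2 (not the
    // Lightrun product): DEBUG is off by default and only enabled while
    // someone is actually investigating an issue.
    public class OnDemandLogging {
        private static final Logger log = LogManager.getLogger(OnDemandLogging.class);

        static void handleRequest(String user) {
            // Costs almost nothing while the logger level is above DEBUG.
            log.debug("handling request for user={}", user);
            // ... actual work ...
        }

        public static void main(String[] args) {
            handleRequest("alice");  // nothing logged, nothing stored

            // An operator flips the level at runtime (e.g. from an admin endpoint):
            Configurator.setLevel(OnDemandLogging.class.getName(), Level.DEBUG);

            handleRequest("bob");    // now the debug line is emitted
        }
    }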
Note: we're not talking about removing logs needed for compliance, forensics, or other regulatory reasons here, of course. We're talking about those logs that are used by developers to better understand what's going on inside the application: the "print this variable" or "show this user's state" or "show me which path the execution took" type logs, the ones you look at once and then forget about (while their costs pile on and on).
We call this workflow "Dynamic Logging", and have a fully-featured version of the product available for use at the website with up to 3 live instances.
On a personal - albeit obviously biased - note, I was an SRE before I joined the company, and saw an early demo of the product. I remember uttering a very verbal f-word during the demonstration, and thinking that I want me one of these nice little IDE thingies this company makes. It's a different way to think about logging - I'll give you that - but it makes a world of sense to me.
> After implementing Phase 1, we were surprised to see we had achieved a compression ratio of 169x.
Sounds interesting; now I want to read up on CLP. Not that we have much log text to worry about.
I'm more surprised that they don't just ship logs directly to a central log-gathering system instead of saving plain files and then moving them around.
Am I the only person who expected an article about logging? Like, as in cutting down trees ;=)
Says the person who at work just added structured logging to our new product.
Original CLP Paper: https://www.usenix.org/system/files/osdi21-rodrigues.pdf
GitHub project for CLP: https://github.com/y-scope/clp
The interesting part of the article isn't that structured data is easier to compress and store; it's that there's a relatively new way to efficiently transform unstructured logs into structured data. For those shipping unstructured logs to an observability backend, this could be a way to save significant money.
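Roughly, the trick is to split every message into a template and its variable tokens, store each distinct template once in a dictionary, and keep only a template ID plus the variables per message. Here's a heavily simplified sketch of that idea (my own toy, not CLP's actual algorithm; the variable-detection regex is invented):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Heavily simplified template extraction in the spirit of CLP (not the real
    // algorithm): variable-looking tokens are pulled out of each message, leaving
    // a small dictionary of repeated templates plus per-message variable lists.
    public class LogTemplater {
        // Invented heuristic: integers, hex values and email-like tokens count as "variables".
        private static final Pattern VARIABLE =
                Pattern.compile("\\d+|0x[0-9a-fA-F]+|[\\w.-]+@[\\w.-]+");

        record Encoded(int templateId, List<String> variables) {}

        private final Map<String, Integer> templateIds = new LinkedHashMap<>();

        public Encoded encode(String message) {
            List<String> vars = new ArrayList<>();
            StringBuilder template = new StringBuilder();
            for (String token : message.split(" ")) {
                if (VARIABLE.matcher(token).matches()) {
                    vars.add(token);
                    template.append("<VAR> ");
                } else {
                    template.append(token).append(' ');
                }
            }
            int id = templateIds.computeIfAbsent(template.toString(), t -> templateIds.size());
            return new Encoded(id, vars);
        }

        public static void main(String[] args) {
            LogTemplater t = new LogTemplater();
            System.out.println(t.encode("Task 1234 finished in 87 ms"));
            System.out.println(t.encode("Task 99871 finished in 3 ms"));
            // Both lines share one template; only the numbers are stored per message.
        }
    }

Templates repeat heavily in real logs, so the dictionary stays tiny and the variable streams compress well on their own, which is presumably where ratios like the 169x quoted above come from.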