Having worked on a machine learning time series document search solution for the last 2 years, I know exactly why the cost of this is so high. Running logs through a model must be VERY expensive.
I had a good friend at Splunk who passed a few years ago. He was working on something similar, well before we had decent models. His anomaly detection used differences in regular expression patterns to detect "strange things". I guess that's why he carried the title "Chief Mind".
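Roughly, that approach can be sketched like this (my own reconstruction in Python, not his actual code): mask the variable parts of each log line into a template, then flag templates that are new or whose frequency shifts between a baseline window and the current one.

    import re
    from collections import Counter

    # Mask variable fields so lines with the same "shape" collapse to one template.
    MASKS = [
        (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
        (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template(line):
        for pattern, placeholder in MASKS:
            line = pattern.sub(placeholder, line)
        return line.strip()

    def strange_things(baseline, window, ratio=3.0):
        """Templates that are new or whose relative frequency jumped vs. the baseline."""
        base = Counter(template(l) for l in baseline)
        curr = Counter(template(l) for l in window)
        base_total = max(sum(base.values()), 1)
        curr_total = max(sum(curr.values()), 1)
        flagged = []
        for tpl, count in curr.items():
            base_rate = base.get(tpl, 0) / base_total
            curr_rate = count / curr_total
            if base_rate == 0 or curr_rate / base_rate > ratio:
                flagged.append((tpl, base_rate, curr_rate))
        return flagged

No model needed, which is probably why it was feasible back then.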
I'm excited about where ML and time series data are going. It's going to be interesting!
> Here's a 2 minute demo of what it looks like: https://youtu.be/t83Egs5l8ok.
The problem with this demo is that it uses a scenario where something is 100% broken because of a change made immediately before the failure. That's not hard to debug, and I don't really see value there.
The scenarios that could use this sort of tool are things like someone turning on a flag that breaks 1% of a specific endpoint but only 0.1% of overall requests. So something sub-alert level, with no immediately obvious cause and effect. If you can detect something like that without generating a ton of noise and give a hint to the root cause, then that'd be something killer.
It's a cool idea and I can see the value. We've had scenarios like the one I mentioned (and worse) go undetected because of the noise Sentry generates. If you can solve that then you've really got something.
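For concreteness, the kind of per-endpoint check that would catch the scenario above looks roughly like this (a toy sketch over made-up (endpoint, status) pairs, not a claim about how your product works):

    from collections import defaultdict

    def error_rates(requests):
        """requests: iterable of (endpoint, http_status) pairs."""
        totals, errors = defaultdict(int), defaultdict(int)
        for endpoint, status in requests:
            totals[endpoint] += 1
            if status >= 500:
                errors[endpoint] += 1
        return {ep: errors[ep] / totals[ep] for ep in totals}

    def regressed_endpoints(baseline, current, min_jump=0.005):
        """Endpoints whose error rate rose noticeably vs. baseline,
        even if the overall error rate barely moved."""
        base, curr = error_rates(baseline), error_rates(current)
        return {
            ep: (base.get(ep, 0.0), rate)
            for ep, rate in curr.items()
            if rate - base.get(ep, 0.0) >= min_jump
        }

The hard part is running that across every endpoint, flag, and release without drowning people in alerts, which is where the noise problem comes back in.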
If rnd() > .5 printf("it's DNS!")
95.8% of the time it's kind of obvious what happened, at least with reasonable monitoring. Digging through logs is for the other 4.2% of the time. Having done that kind of thing more than once, I don't see ML as being helpful. You often end up writing scripts to search for specific combinations of events that are only identifiable after the incident has happened.
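Those scripts tend to be throwaways along these lines (the log format and markers here are made up): find the requests that hit two specific events, which you only knew to look for after the fact.

    import re

    REQ_ID = re.compile(r"req=([a-f0-9]+)")

    def requests_with_both(log_lines, marker_a="cache miss", marker_b="retries exhausted"):
        """Request IDs whose log lines contain both markers."""
        seen_a, seen_b = set(), set()
        for line in log_lines:
            m = REQ_ID.search(line)
            if not m:
                continue
            rid = m.group(1)
            if marker_a in line:
                seen_a.add(rid)
            if marker_b in line:
                seen_b.add(rid)
        return seen_a & seen_b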
do you not take volume of logs into consideration for pricing then?
I give you credit for working in this space and trying to create a more automated approach... I spent many years in the app performance world both as a consultant and working on products, so again - good on you.
For what it's worth, my immediate reaction is that you might work on different terminology in how you present what your product does. I get that you are trying to create a contrived example in order to demo the product and show value, and that can be a very difficult thing to do. That said, in my line of thinking, an HTTP 500 isn't actually the root cause; it's a symptom of the cause. The password being set incorrectly isn't the root cause either. The real root cause is something in the deployment pipeline, the configuration control, the change management, the architecture, etc., that got us to this point.
I guess I'm struggling here a bit too, because I think of how many times I was the manual version of this: I'd show information like this to a client's technical team and still have to absolutely spoon-feed them on how to remedy it. I remember a team that was supposed to be crack guys from a vendor, an app team, etc., who had been working on a problem for months that I fixed in a matter of hours, because they just didn't understand what the line in the log meant. So it isn't clear to me how your product actually creates better visibility and interpretation of the problem toward a solution.
In the ten or so years I did that kind of work, what really stood out to me was that the seemingly obvious tech issues weren't obvious to the client personnel because of a lack of education / experience / training, but more often than not the real problems were much, much larger architectural issues way beyond just the message in the log. Those are much harder to both identify and correct, and products like yours and the ones you integrate with are almost just a band-aid on them.
So, take that for what it's worth - again, good work trying to improve the state of the art in this area.