How did you come up with those magic correlation numbers?
Is this generally just sniffing surface quality and quantity of written code, or does it also consider:
- how architecturally sound the system is,
- whether the features introduced and their implementations make sense,
- how that power is exposed to users, and whether the UI is approachable and efficient,
- user feedback resulting from the effort,
- long-term sustainability and the technical debt left behind (inadvertently or deliberately),
- healthy practices for things like passwords and sensitive data,
etc.?
I'm glad to see an effort at capturing better metrics, but my own feeling is that trying to precisely measure developer productivity is like trying to measure IQ: it's a fool's errand, and all you wind up capturing is one corner of a larger picture. Your website shares zero information prior to login, and I'm looking forward to you elaborating a little more on your offering!
EDIT: Would also love to hear feedback from developers at the startups you tested at. Did they like it, and did they feel it better reflected their efforts during periods they felt productive vs. not? Was there any initial or ongoing resistance or skepticism? Did it make managers more aware of factors not traditionally captured by the alternative metrics you mentioned?
Our metric is approximately "hours of work for an expert engineer." Here are some example open source PRs and their output metrics calculated by our algorithm:
https://github.com/PostHog/posthog/pull/25056: 15.266 (Adds backend, frontend, and tests for a new feature)
https://github.com/microsoft/vscode/pull/222315: 8.401 (Refactors code to use a new service and adds new tests)
https://github.com/facebook/react/pull/27977: 5.787 (Small change with extensive, high effort tests; approximately 1 day of work for expert engineer)
https://github.com/microsoft/vscode/pull/213262: 1.06 (Mostly straightforward refactor; well under 1 day of work)
If you build something that doesn't solve problems with impact to the business, your real productivity is zero. How does this account for that?
https://blog.pragmaticengineer.com/the-product-minded-engine...
As soon as people know how the metric is calculated, they will game that metric and it will cease to be useful.
Hey HN! I'm one of the co-founders of Weave, and I wanted to jump in here to share a bit more.
Building this has been a wild ride. The challenge of measuring engineering output in a way that’s fair and useful is something we’ve thought deeply about—especially because so many of the existing metrics feel fundamentally broken.
The 0.94 correlation is based on rigorous validation with several teams (happy to dive into the details if anyone’s curious). We’re also really mindful that even the best metrics only tell part of the story—this is why our focus is on building a broader set of signals and actionable insights as the next step.
Would love to hear your thoughts, feedback, or even skepticism—it’s all helpful as we keep refining the product.
Let me just ignore my natural disdain for the whole thing (as an engineer and a manager).
> We’ve developed a custom model that analyzes code and its impact directly...
This is a bold claim, all things considered. Don't you need to fine-tune this model for every customer, since their business metrics are likely vastly different? How do you measure the impact of refactoring? What about regressions or design mistakes that only surface after months or even years?
I'm looking forward to developers setting up LLM prompts to make their code seem more complex and like it required more effort.
What do you see as the major threats to validity for your approach?
Pretty dumb to think you can infer effort from the code itself. You make one "smart invocation" to a remote microservice and replace 1000 lines of code!
The information for effort is not available at the code level - sorry to burst your bubble.
"Hello Jane, please have a seat. We need to talk about your productivity. Yes, I know you helped the team through a crunch and delivered the new feature, which works flawlessly and is loved by our users. And our balance sheet is much healthier after you found that optimization that saves us $1mm/year. We also appreciate that younger teammates look to you for guidance and learn a lot from you.
But you see, the AI scored your productivity at 47%, barely "meets expectations", while we expect everyone to score at least 72%, "exceeds expectations". How is that calculated? The AI is a state of the art proprietary model, I don't know the details...
Anyways, we've got to design a Performance Improvement Plan for you. Here's what our AI recommends. We'll start with the TPS reports..."