Solving Math Word Problems

  • > Richard, Jerry, and Robert are going to share 60 cherries. If Robert has 30 cherries, and has 10 more than Richard, how many more cherries does Robert have than Jerry?

    > answer:

    > Robert has 30 + 10 = 40 cherries.

    > If there are 60 cherries to be shared, then Richard and Jerry will have 60 - 40 = 20 cherries each.

    > Robert has 40 - 20 = 20 more cherries than Jerry.

    Um, the answer is "correct" but isn't the actual reasoning wrong? Robert already has 30; it's Richard who has 30 - 10 = 20, and Jerry gets the remaining 60 - 30 - 20 = 10.

    Robert has 30

    Richard has 20

    Jerry has 10

    Hence they split the 60 this way.
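    The corrected split checks out with plain arithmetic (a quick sketch; the variable names are mine, not from the problem):

    ```python
    # Verify the commenter's corrected reasoning for the cherry problem.
    total = 60
    robert = 30            # given: Robert has 30 cherries
    richard = robert - 10  # given: Robert has 10 more than Richard
    jerry = total - robert - richard  # Jerry gets the remainder

    print(robert, richard, jerry)   # the 30/20/10 split
    print(robert - jerry)           # how many more Robert has than Jerry
    ```

    This makes the quoted model answer's first step visibly wrong: Robert does not have 30 + 10 = 40; he has 30, and the "+ 10" relates him to Richard.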

  • This might work better if GPT-3 is used to rewrite each statement into an algebraic equation, and then an equation solver is used to solve the system.

  • It’s frustrating how myopic these papers can be. It seems like the goal of the paper is solely to work within the GPT framework to test the theory of verifiers. Why not try verifiers out with other models? Perhaps it’s not a fair comparison, but I remember a Kaggle competition [0] from six years ago which involved building models to solve grade school science multiple choice questions. A simple word2vec model could already achieve 50% accuracy. Despite multiple choice being (maybe?) easier than free response, I’m just skeptical that the way to solve these problems is to throw billions of weights at them. I’m also not convinced that this new dataset avoids having a much smaller effective template space, in which case the models are still just memorizing templates.

    [0]: https://www.kaggle.com/c/the-allen-ai-science-challenge/over...
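    For reference, the kind of similarity baseline that competition invited looks roughly like this: embed the question and each answer option, then pick the most similar option. A toy sketch with made-up 3-d vectors standing in for real word2vec embeddings:

    ```python
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def pick_answer(question_vec: np.ndarray, option_vecs: list) -> int:
        """Index of the option whose embedding is closest to the question's."""
        return int(np.argmax([cosine(question_vec, v) for v in option_vecs]))

    q = np.array([1.0, 0.2, 0.0])
    options = [
        np.array([0.0, 1.0, 0.0]),  # nearly orthogonal to the question
        np.array([0.9, 0.3, 0.1]),  # close to the question
    ]
    print(pick_answer(q, options))
    ```

    In practice the question and option vectors would be averaged word2vec embeddings of their tokens; the point is only that such a baseline is tiny compared to a billion-parameter model.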

  • For a moment there, the title had me hoping that they were working on the generally undecidable https://en.m.wikipedia.org/wiki/Word_problem_(mathematics)

  • Scoring 55% on a test like this should not be considered a great accomplishment. A sign of progress, yes, but not an accomplishment by itself.

    This is still simply a system that is good at guessing. It does not know anything.