That's an interesting space to explore! I'm curious about the baseline in the benchmarks: which prompts did you use for those? I ask because some of the resulting prompts seem fairly generic, and I wonder whether you could just add them to every prompt across the board and also see an improvement. Things like "Identify the question (what are you trying to find?)".
In the same vein, wouldn't it be interesting to measure which part of the prompt most contributed to better solving the problem? Surely some parts will be just noise and can be trimmed away.
Also wondering what this does, since the model probably won't (can't?) actually read the problem multiple times:
> Read the problem carefully (multiple times).
This is a really cool idea! I recently came across another project on GitHub, https://github.com/tensorzero/tensorzero, that explores a similar direction. You might find it interesting, and perhaps it could offer some inspiration or useful insights for your work as well.
Hey this looks interesting... would love to discuss more... can we connect? pratikkhedikar10@gmail.com
If I jump in and, say, manually 'tweak' one of those JSON strategies because I think I have a better idea, what happens next? Does the LLM just roll with my brilliant human intervention, or could it eventually 'learn' that my tweak was actually counterproductive and refine it back (or away from my edit)?
You should take a look at case-based reasoning (CBR). It seems to fit the path you're on perfectly; you've basically just rediscovered the CBR cycle.
How do you foresee a system like this efficiently managing and relying on a set of strategies whose size can grow without bound?
I would like to see some interesting input/output pairs. Do you have any?
Thanks for checking this out! A few additional details that didn't fit in the main post:
The system maintains two separate limits: a storage limit (max 10 strategies per problem type in the database) and an inference limit (max 3 strategies applied per query). This keeps the database manageable while ensuring the system prompt doesn't get too long.
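To make the two limits concrete, here's a minimal sketch of how they might interact. The names (`Strategy`, `store_strategy`, `select_for_inference`) and the eviction-by-success-rate policy are just for illustration, not the actual plugin code:

```python
from dataclasses import dataclass

MAX_STORED_PER_TYPE = 10   # storage limit: strategies kept per problem type
MAX_APPLIED_PER_QUERY = 3  # inference limit: strategies injected per query

@dataclass
class Strategy:
    text: str
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

def store_strategy(db: dict[str, list[Strategy]], problem_type: str, strategy: Strategy) -> None:
    """Add a strategy; if the per-type cap is exceeded, drop the weakest ones."""
    bucket = db.setdefault(problem_type, [])
    bucket.append(strategy)
    if len(bucket) > MAX_STORED_PER_TYPE:
        bucket.sort(key=lambda s: s.success_rate, reverse=True)
        del bucket[MAX_STORED_PER_TYPE:]

def select_for_inference(db: dict[str, list[Strategy]], problem_type: str) -> list[Strategy]:
    """Pick at most MAX_APPLIED_PER_QUERY strategies to append to the system prompt."""
    ranked = sorted(db.get(problem_type, []), key=lambda s: s.success_rate, reverse=True)
    return ranked[:MAX_APPLIED_PER_QUERY]
```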
One interesting finding was that strategies only get used for inference once they have at least 5 attempts and a 40% success rate. This prevents the system from applying unproven strategies to new problems.
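Continuing the illustrative sketch above, that maturity gate is just a filter applied before selection (the thresholds mirror the description here, not the real implementation):

```python
MIN_ATTEMPTS = 5
MIN_SUCCESS_RATE = 0.4

def is_proven(s: Strategy) -> bool:
    # A strategy must have enough attempts and a high enough hit rate to be used.
    return s.attempts >= MIN_ATTEMPTS and s.success_rate >= MIN_SUCCESS_RATE

def select_proven_for_inference(db: dict[str, list[Strategy]], problem_type: str) -> list[Strategy]:
    eligible = [s for s in db.get(problem_type, []) if is_proven(s)]
    eligible.sort(key=lambda s: s.success_rate, reverse=True)
    return eligible[:MAX_APPLIED_PER_QUERY]
```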
The approach works particularly well with reasoning models like DeepSeek-R1 and QwQ - the learned strategies seem to guide their thinking process effectively.
I'm especially curious about:
1. How this might work with different model families
2. Whether the community sees value in sharing strategy databases between users
3. Ideas for extending beyond text-based reasoning to multimodal problems
The plugin integrates with our broader optillm project which has other inference optimization techniques. You can combine SPL with methods like mixture-of-agents or MCTS using the "&" operator.
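For example, assuming optillm is running locally as an OpenAI-compatible proxy (the base_url, port, and the exact "spl"/"moa" slugs below are assumptions for this example; check the optillm README for the current names), combining SPL with mixture-of-agents looks roughly like this:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local optillm proxy
# (port and API key here are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    # The "&" operator combines techniques: SPL together with
    # mixture-of-agents (moa) on top of the underlying model.
    model="spl&moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}],
)
print(response.choices[0].message.content)
```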
Next I'm thinking about meta-learning - having the system learn how to create better strategies more efficiently. Also exploring collaborative strategy sharing.
Would love to hear thoughts on the approach or if anyone has ideas for other problem domains where this might be useful!