> and are looking at other models we can use and fine tune.
This seems to be a common misconception in the industry: fine-tuning a model will almost certainly lower the quality of your responses in this situation. You do not want to waste your time fine-tuning on low-quality code that will not be present in-context anyway.
You are going to be stuck with off-the-shelf commercially licensed models for now, which will be effectively useless on codebases that extend beyond their fairly limited context (8k tokens, <1k SLOC). It is very likely that the tool you're searching for simply isn't ready yet.
You're looking for "HumanEval" scores. I'm not saying it's the best way to test this, but it's the only standard benchmark I know of that code models are commonly compared on.
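If you want to run HumanEval yourself, OpenAI's human-eval harness works roughly like the sketch below; generate_one_completion is just a placeholder for however you call the model you're testing.

    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # placeholder: call your model here and return only the generated completion
        raise NotImplementedError

    problems = read_problems()  # 164 hand-written Python problems with unit tests
    samples = [
        {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)
    # score with the harness's CLI: evaluate_functional_correctness samples.jsonl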
The current best models I'm aware of that you'd want to try are WizardCoder (15B), StarCoder (15B), and Replit's code model (3B). Replit's instruct model is interesting because of its competitive performance despite being only a 3B model, so it's the easiest/fastest to run (quick loading sketch below).
Perhaps interestingly, none of these are based on LLaMA.
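If you want to poke at the Replit one, loading it through Hugging Face transformers looks roughly like this. Just a sketch assuming the replit/replit-code-v1-3b checkpoint (the instruct variant's repo id may differ); trust_remote_code is needed because the repo ships its own model code.

    # Rough sketch: load Replit's 3B code model via transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "replit/replit-code-v1-3b"  # assumed checkpoint; swap in the instruct repo if you prefer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         temperature=0.2, top_p=0.95,
                         eos_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True))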
https://github.com/abacaj/code-eval - a large, mostly up-to-date list of benchmark results
https://huggingface.co/WizardLM/WizardCoder-15B-V1.0 - has a chart with a mostly up-to-date comparison
EDIT: License-wise, I think you may be able to use Replit's model and StarCoder commercially, but I don't think you're allowed to use WizardCoder outside of academic work.