Ask HN: How do you evaluate LLMs before deploying to production?