Show HN: Tonic Validate Metrics – an open-source RAG evaluation metrics package

  • Related — are there any good end-to-end benchmark datasets for RAG? End-to-end meaning not just (context, question, answer) tuples (which ignore retrieval) but (document, question, answer) tuples. I know NQ (Natural Questions) is one such dataset:

    https://ai.google.com/research/NaturalQuestions

    But I don't see this dataset mentioned much in RAG discussions.
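
    To make the comment's distinction concrete, here is a minimal sketch of the two tuple shapes (the class and field names are illustrative, not drawn from NQ or any other dataset):

    ```python
    from dataclasses import dataclass

    # Illustrative schemas only. A (context, question, answer) tuple hands
    # the gold passage straight to the generator, so it scores generation
    # alone; a (document, question, answer) tuple supplies the full source
    # document, so the retriever must first locate the relevant passage,
    # exercising the RAG pipeline end to end.

    @dataclass
    class GenerationOnlyExample:
        context: str   # gold passage; retrieval is bypassed
        question: str
        answer: str

    @dataclass
    class EndToEndExample:
        document: str  # full document; retrieval quality is part of the test
        question: str
        answer: str
    ```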

  • How does it compare to https://github.com/explodinggradients/ragas?

  • Hi all, if anyone has any questions about the open source library, Joe and I will be around today to answer them.

  • This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
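
    For readers unfamiliar with the pattern being questioned here, below is a minimal sketch of LLM-assisted evaluation. It is not Tonic Validate's actual API; it calls the OpenAI Python SDK directly, and the prompt, judge model, and 0–5 scale are illustrative assumptions:

    ```python
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative judge prompt; the real package's prompts may differ.
    PROMPT = """You are grading a RAG system's answer.
    Question: {question}
    Reference answer: {reference}
    System answer: {candidate}
    Reply with a single integer from 0 (completely wrong) to 5 (fully correct)."""

    def llm_judge_score(question: str, reference: str, candidate: str) -> int:
        """Ask the judge model for a 0-5 correctness score and parse it."""
        resp = client.chat.completions.create(
            model="gpt-4",   # assumed judge model
            temperature=0,   # damp run-to-run variance in the judge
            messages=[{"role": "user", "content": PROMPT.format(
                question=question, reference=reference, candidate=candidate)}],
        )
        # Raises ValueError if the judge replies with anything but an integer,
        # which is itself one of the reliability risks the comment raises.
        return int(resp.choices[0].message.content.strip())
    ```

    Pinning temperature to 0 reduces, but does not eliminate, variance in the judge's scores, which is one concrete form of the safety concern above.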

  • This is cool. What are your plans for supporting and building upon this going forward?

  • So disappointed :( I saw Metric and RAG (I thought it would be Red-Amber-Green) and I was hoping for some cool metrics/heatmap thingie...

    I wish you the best though!

  • If you build a dataset of questions with responses to test your RAG app with this metrics package, how do you know whether the distribution of those questions matches, in any way, the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into your RAG app's evaluation.
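
    One coarse way to quantify that concern is to embed both question sets and compare them. A hedged sketch, assuming the sentence-transformers package; the model name and the rough threshold in the docstring are illustrative, not calibrated:

    ```python
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def centroid_similarity(test_questions: list[str],
                            prod_questions: list[str]) -> float:
        """Cosine similarity between the mean embeddings of the hand-made
        test questions and of sampled production questions. A value well
        below ~0.9 hints that the test set has drifted from real traffic
        (the threshold is a rough illustration, not a calibrated cutoff)."""
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
        a = model.encode(test_questions).mean(axis=0)
        b = model.encode(prod_questions).mean(axis=0)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ```

    Periodically sampling production questions and re-running a check like this is one way to catch the bias the comment describes before it skews the metrics.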

  • Very cool! Looking forward to trying it.