How are generative AI companies monitoring their systems in production?

  • We struggled with this ourselves while building LLM-based products and then open-sourced our observability/monitoring tool [1]. Many teams use it to track RAG and agents in production, run custom evals on the production traces (focused on hallucination), and track how metrics differ across releases or customers (rough usage sketch below). Feel free to DM me if there's something specific you're looking to solve; happy to help.

    [1] https://github.com/langfuse/langfuse
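
    To make the "track traces, attach eval scores" workflow concrete, here's a minimal sketch assuming the Langfuse Python SDK's v2-style low-level client (langfuse.trace / trace.span / trace.generation / trace.score); names and values are illustrative, so check the repo's docs for the current API:

      from langfuse import Langfuse

      # assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set in the environment
      langfuse = Langfuse()

      # one trace per user request, e.g. a single RAG query
      trace = langfuse.trace(name="rag-query", user_id="user-123", input={"question": "..."})

      # log the retrieval step as a span and the LLM call as a generation
      trace.span(name="retrieval", input="...", output=["chunk-1", "chunk-2"])
      trace.generation(name="answer", model="gpt-4", input=[{"role": "user", "content": "..."}], output="...")

      # attach a custom eval score (e.g. from an offline hallucination check) to the trace
      trace.score(name="hallucination", value=0.9)

      langfuse.flush()  # make sure buffered events are sent before the process exits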

  • There are quite a few LLM monitoring tools on the market, but for monitoring (or evaluating) RAG systems specifically, I found Ragas to be the most helpful (quick sketch below): https://blog.langchain.dev/evaluating-rag-pipelines-with-rag...
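
    As a rough sketch of what a Ragas eval loop over logged production samples can look like (assuming the ragas evaluate() API with its built-in faithfulness / answer_relevancy metrics and a datasets.Dataset input; column and metric names may differ across versions):

      from datasets import Dataset
      from ragas import evaluate
      from ragas.metrics import faithfulness, answer_relevancy

      # a handful of logged production samples: question, generated answer, retrieved contexts
      samples = {
          "question": ["What is our refund window?"],
          "answer": ["Refunds are accepted within 30 days of purchase."],
          "contexts": [["Our policy allows refunds within 30 days of purchase."]],
      }

      # the metrics use an LLM judge under the hood, so OPENAI_API_KEY (or similar) must be configured
      result = evaluate(Dataset.from_dict(samples), metrics=[faithfulness, answer_relevancy])
      print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.93}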

  • LangKit is an option. WhyLabs has published a number of blog posts on this subject recently: https://whylabs.ai/blog/posts/monitoring-llm-performance-wit...

  • we built https://klu.ai/ for this

    ======

    outside of us, here's what I see happening

    80% of folks aren't building in prod

    if you pull apart the 20% that are building, here's what I've seen, from largest to smallest group:

    1. most people are not monitoring at all

    2. home-grown solutions logged into existing observability/analytics platforms

    3. LLMOps tooling like Klu

    my 2 cents on the unfortunate truth: I think many of the AI bolt-on features are living the classic feature lifecycle: they get launched, no one monitors them for improvement, and feature retention sucks, so there's no top-down push to prioritize them. the people who are measuring and improving are exceptional builders regardless of LLMs/RAG.

  • Also very interested in this question. We're looking at TruEra for observability.

  • It's crucial for AI companies to monitor their systems in production continuously. Not only does this ensure the system's performance and reliability, it also helps identify and address issues or biases as they arise.

    Many AI companies use a combination of real-time monitoring, automated alerts, and regular audits to maintain the quality and fairness of their AI systems. It's an ongoing process that plays a vital role in responsible AI development.

    In case you have an AI project in mind, feel free to contact us! https://www.ratherlabs.com

  • We are also looking for a solution to this. Currently evaluating LangSmith.

  • I've been looking into this question for a bit. [1]

    Here are my notes on evals (with a quick model-graded eval sketch after the quoted bits) --

    Things to consider when comparing options:

    1) “Types of metrics supported (only NLP metrics, model-graded evals, or both), level of customizability; supports component eval (i.e. single prompts) or pipeline evals (i.e. testing the entire pipeline, all the way from retrieval to post-processing)”

    2) “+method of dataset & eval management (config vs UI), tracing to help debug failing evals”

    3) “If you wanted to go deeper on evaluation, I'd probably also add:

    What to evaluate for:

    - Hallucination

    - Safety

    - Usefulness

    - Tone / format (eg conciseness)

    - Specific regressions

    Tips:

    - Model-graded evaluation is taking off

    - Use GPT-4, GPT-3.5 is not good enough [for evals]

    - Most big companies have some human oversight of the model-grading

    - Conversational simulation is an emerging idea building on top of model-graded eval” - AI Startup Founder
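
    To make the "model-graded evaluation with GPT-4" tip concrete, here's a minimal sketch of a grader that checks a RAG answer for hallucination against its retrieved context (the prompt and pass/fail rubric are my own illustration, assuming the openai>=1.0 Python client and OPENAI_API_KEY in the environment):

      from openai import OpenAI

      client = OpenAI()

      GRADER_PROMPT = """You are grading the answer of a RAG system.

      Context:
      {context}

      Question: {question}
      Answer: {answer}

      Is every claim in the answer supported by the context?
      Reply with exactly one word: PASS or FAIL."""

      def grade_faithfulness(question: str, answer: str, context: str) -> bool:
          # deterministic grading call; GPT-4 rather than GPT-3.5, per the note above
          resp = client.chat.completions.create(
              model="gpt-4",
              temperature=0,
              messages=[{"role": "user", "content": GRADER_PROMPT.format(
                  context=context, question=question, answer=answer)}],
          )
          return resp.choices[0].message.content.strip().upper().startswith("PASS")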

    ---

    Here are a few that people are using for evals at production scale:

    * Honeyhive https://honeyhive.ai

    * Gentrace https://gentrace.ai

    * Humanloop https://humanloop.com

    * Gantry https://www.gantry.io

    I've done calls with the founders of three of those four, and I've talked with enterprise customers who've been evaluating a couple of those.

    I see there are a few others mentioned in this thread (Langfuse, TruEra, LangKit/WhyLabs) that I haven't heard about from customers but that also look promising. There's also LangSmith, which I do know is popular amongst enterprises (enterprises hear of LangChain and see that they have a big enterprise-oriented offering), but I haven't talked with anyone who uses it.

    Then for evals at prototyping scale, there are various small and open-source tools that I've collected here: https://llm-utils.org/List+of+tools+for+prompt+engineering

    [1]: I'm working on an AI infra handbook. Email me (address in profile) if you can review/add comments to my draft. It's 23 pages long :x