Ask HN: Has anyone trained a personal LLM using their personal notes?

  • Maybe not exactly what you’re asking, but I started doing talk therapy last year. It’s done virtually and I record the session with OBS. As soon as the recording finishes, the following happens:

    - The audio is preprocessed (chunked) and sent to Whisper to generate a transcript

    - The transcript is sent to GPT-4 to generate a summary, action items, and notes on the concepts introduced, with additional information

    - The next meeting’s date/time is added to my calendar

    - A chatbot is created that lets me chat with each session, including having it play the role of the therapist and continue the conversation (with the entire context of what I actually talked about)

    It’s been exceedingly helpful to be able to review all my therapy sessions this way.
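
    For the curious, the glue code is short. Here is a minimal sketch using the official openai Python package (>= 1.0) - the filename and the summary prompt are illustrative, and it skips the chunking step mentioned above:

      # Sketch of the post-recording steps: Whisper transcript, then a
      # GPT-4 summary. Assumes a file small enough to skip chunking.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      with open("session.m4a", "rb") as audio:  # illustrative filename
          transcript = client.audio.transcriptions.create(
              model="whisper-1", file=audio
          ).text

      summary = client.chat.completions.create(
          model="gpt-4",
          messages=[
              {"role": "system", "content": (
                  "Summarize this therapy session: a short summary, "
                  "action items, and the concepts introduced, with "
                  "additional information on each concept.")},
              {"role": "user", "content": transcript},
          ],
      ).choices[0].message.content
      print(summary)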

  • I've been disciplined (perhaps obsessive at times) about keeping a daily diary for many years, and I was interested in being able to query my diary locally via AI. I found a solution that works surprisingly well using GPT4ALL.

    I found GPT4ALL (https://gpt4all.io) to have a nice-enough GUI, it runs reasonably quickly on my M1 MacBook Air with 8 GB of RAM, and it can be set up as a completely local solution - not sending your data to the Goliaths.

    GPT4ALL has an option to access local documents via the Sbert text embedding model (i.e., RAG).

    My specific results have been as follows: using Nous Hermes 2 Mistral DPO and Sbert, I indexed 153 days of my daily writing (most days I write between 2,000 and 3,000 words).

    Asking a simple question like "what are the challenges faced by the author?" produces remarkable, almost spooky results (which I won't share here) - in my opinion spot-on regarding my own challenges over that period - and Sbert provides references to the documents it used to generate the answer. Options are available to reference an arbitrary number of documents; the default is 10. Ideally I'd like to have it reference all 153 documents in the query - I'm not sure if it's a RAM or a token issue, but increasing the number of documents referenced has resulted in machine lock-ups.
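
    If anyone wants to see roughly what the local-documents feature is doing, here is a minimal local-only sketch of the same idea (my own reconstruction, not GPT4ALL's actual code; the model file name is whatever you have downloaded):

      # Local RAG sketch: embed diary entries with an SBert model,
      # retrieve the closest ones, feed them to a local LLM. Nothing
      # leaves the machine.
      from pathlib import Path
      import numpy as np
      from sentence_transformers import SentenceTransformer
      from gpt4all import GPT4All

      embedder = SentenceTransformer("all-MiniLM-L6-v2")  # an SBert model
      llm = GPT4All("Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf")  # or your model

      entries = [p.read_text() for p in sorted(Path("diary").glob("*.txt"))]
      index = embedder.encode(entries, normalize_embeddings=True)

      question = "What are the challenges faced by the author?"
      q = embedder.encode([question], normalize_embeddings=True)[0]
      top = np.argsort(index @ q)[::-1][:10]  # the GUI's default is 10 docs

      context = "\n---\n".join(entries[i] for i in top)
      print(llm.generate(
          f"Using only these diary excerpts, answer the question.\n\n"
          f"{context}\n\nQuestion: {question}\nAnswer:"))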

    Anyhow - that's my experience - hope it's helpful to someone.

  • Played around with fine-tuning, but ended up just experimenting with RAG.

    One thing I haven’t worked out yet is how the agent can reliably decide whether it should do a “point retrieval query” or an “aggregation query.”

    Point query: embed and do vector lookup with some max N and distance threshold. For example: “Who prepared my 2023 taxes?”

    Aggregation query: select a larger collection of documents (1k+) that possibly don’t fit in the context window and reason over the whole collection. For example: “Summarize all of the correspondence I’ve had with tax preparation agencies over the past 10 years”

    The latter may be solved with just a larger max N and larger context window.

    Almost like it’s a search lookup vs. a map-reduce.
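
    The control flow I have in mind, as a sketch - llm() and top_docs() are hypothetical stand-ins for the model call and the vector lookup:

      # Route between a point lookup and a map-reduce aggregation.
      # llm() and top_docs() are hypothetical helpers; only the
      # control flow is the point here.
      def answer(query: str) -> str:
          kind = llm(
              "Does this query ask for a specific fact (POINT) or a "
              "summary over many documents (AGGREGATION)? One word.\n\n"
              f"Query: {query}"
          ).strip().upper()

          if kind == "POINT":
              # search lookup: small max N plus a distance threshold
              docs = top_docs(query, max_n=5, max_dist=0.3)
              return llm(f"Answer from these documents:\n{docs}\n\nQ: {query}")

          # map-reduce: pull far more docs than fit in one context
          # window, summarize in chunks (map), then combine (reduce)
          docs = top_docs(query, max_n=1000)
          partials = [
              llm(f"Summarize whatever is relevant to '{query}':\n{docs[i:i+50]}")
              for i in range(0, len(docs), 50)
          ]
          return llm(f"Combine these partial answers to '{query}':\n"
                     + "\n".join(partials))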

  • I want this for my photos.

    I'm not a good photographer, but I have taken tens of thousands of photos of my family. I would love to provide a prompt for a specific day and set of people and have it create a photo that I was never able to take. I don't mind that it's not "real" because I find photography to be philosophically unreal as it is. I want it to look good, and to inspire my mind to recreate the day however it can imagine.

    And I want to do it locally, without giving away my family's data and identity.

  • Going to wait for longer-context local models. Fine-tuning/training is lossy compression of your notes into the model weights -- there isn't much value in a vaguely remembered copy of some of my notes. This is why other comments are pointing you towards Retrieval Augmented Generation instead, where the relevant notes are losslessly added to the prompt.

  • PrivateGPT is a nice tool for this. It's not exactly what you're asking for, but it gets part of the way there.

    https://github.com/zylon-ai/private-gpt

  • I’ve tried several different systems; nothing really stands out.

    That being said, I’m trying to document as much of my life as I can in anticipation of such programs existing in the near future. I’m not going overboard, but, for example, I never really kept a personal diary before, and now I try to jot down something every day: my thought processes on things, what actions were taken and why.

    I’m looking forward to the day when I have an AI assistant (locally hosted and under my control, of course) who can help me with decision-making based on my previous actions. It would be neat to compare/contrast how I do things now with how the future me does them.

  • Gianluca Nicoletti, an Italian journalist, writer, and radio host, is training an LLM on all of his writings as a support for his autistic child for when he is no longer here. The software will speak with his voice.

    https://www.lospessore.com/13/07/2023/una-chatbot-per-contin...

  • Not mine, and not an endorsement - I haven't played with it beyond the initial installation - but Khoj has an open-source offering for this. Check it out: https://khoj.dev

  • I started fine-tuning GPT-3.5 on a decently large corpus of my text messages and emails, and it pretty much generated schizophrenic output. I don’t think I did a very good job of curating the text that went into the fine-tuning set, and I want to try again.
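
    For reference, the fine-tuning data format is JSONL of chat examples, so "curating" really means deciding which exchanges become training pairs. A sketch of the shape (the exchange below is made up):

      # Sketch: prepare and launch a GPT-3.5 fine-tune. The example
      # exchange is invented; the hard part is filtering thousands of
      # real messages down into clean pairs like this.
      import json
      from openai import OpenAI

      client = OpenAI()

      examples = [
          {"messages": [
              {"role": "system", "content": "You text like me."},
              {"role": "user", "content": "hey, dinner friday?"},
              {"role": "assistant", "content": "can't friday, sat works tho"},
          ]},
          # ...one object per curated exchange...
      ]

      with open("train.jsonl", "w") as f:
          for ex in examples:
              f.write(json.dumps(ex) + "\n")

      upload = client.files.create(file=open("train.jsonl", "rb"),
                                   purpose="fine-tune")
      job = client.fine_tuning.jobs.create(training_file=upload.id,
                                           model="gpt-3.5-turbo")
      print(job.id)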

  • It sounds like you want RAG instead of training or even fine-tuning a model.

    Have you looked into the OpenAI APIs? They make it relatively easy to do, assuming you have some basic programming knowledge.
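
    For example, with just the embeddings and chat APIs it comes down to something like this minimal sketch (notes.txt with one note per line is illustrative):

      # Minimal RAG over the OpenAI APIs: embed the notes once, then
      # embed each question, pick the nearest notes, and put them in
      # the prompt.
      import numpy as np
      from openai import OpenAI

      client = OpenAI()

      def embed(texts):
          out = client.embeddings.create(model="text-embedding-3-small",
                                         input=texts)
          v = np.array([d.embedding for d in out.data])
          return v / np.linalg.norm(v, axis=1, keepdims=True)

      notes = open("notes.txt").read().splitlines()  # illustrative corpus
      note_vecs = embed(notes)

      question = "Who prepared my 2023 taxes?"
      nearest = np.argsort(note_vecs @ embed([question])[0])[::-1][:5]

      reply = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user", "content":
                     "Notes:\n" + "\n".join(notes[i] for i in nearest)
                     + f"\n\nQuestion: {question}"}])
      print(reply.choices[0].message.content)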

  • The folks at GitBook are kind enough to give me an LLM over my notes: https://til.bhupesh.me

  • Somewhat related, for those of us who don’t take extensive notes: are there nicely packaged plugins for RAG in email, especially for e.g. Outlook or Apple Mail?

  • Has anyone seen or used something that can train on a complete iMessage history?

    Presumably, I have more than enough messages from me along with responses from others to chat with a version of myself that bears an incredible likeness to how I speak and think. In some cases, I'd expect to be able to chat with an LLM of a given contact to see how they'd respond to various questions as well.
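
    Haven't seen a packaged tool, but on macOS the history is a local SQLite database, so getting the raw text out is the easy part. A sketch - the schema details here are from memory (newer macOS versions put some bodies in attributedBody instead of text), and your terminal needs Full Disk Access:

      # Pull raw iMessage text out of chat.db to build a corpus for
      # fine-tuning or RAG. Verify the schema with sqlite3 .schema.
      import sqlite3
      from pathlib import Path

      db = Path.home() / "Library/Messages/chat.db"
      con = sqlite3.connect(f"file:{db}?mode=ro", uri=True)  # read-only

      rows = con.execute("""
          SELECT handle.id, message.is_from_me, message.text
          FROM message
          JOIN handle ON message.handle_id = handle.ROWID
          WHERE message.text IS NOT NULL
          ORDER BY message.date
      """).fetchall()

      for contact, from_me, text in rows[:20]:
          print(("me" if from_me else contact) + ": " + text)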

  • Not on my notes, but I have used GPT4All to chat with the Dapr documentation. I downloaded the md files from the docs GitHub repo and loaded the directory into GPT4All.

    It's not "training" a model, but it works pretty well.

  • I have a large org-roam note system. I would like to create a pipeline where I can ask natural-language questions and have it build SQLite queries to efficiently crawl the database and find what I want. I haven't gotten around to it, though.
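
    Something like this sketch, maybe - note that the db path and the schema summary are assumptions (check your org-roam-db-location and the actual .schema output; org-roam stores string values with surrounding quotes):

      # Sketch: have an LLM write SQL against org-roam's SQLite db,
      # then run it read-only and return the rows.
      import sqlite3
      from pathlib import Path
      from openai import OpenAI

      client = OpenAI()
      DB = Path.home() / ".emacs.d/org-roam.db"  # assumed default location

      SCHEMA = ("Tables include nodes(id, file, title, level), "
                "tags(node_id, tag), links(source, dest, type). "
                "String values are stored with surrounding double quotes.")

      def ask(question):
          sql = client.chat.completions.create(
              model="gpt-4",
              messages=[{"role": "user", "content":
                         f"SQLite schema: {SCHEMA}\nWrite a single SELECT "
                         f"statement (no prose, no markdown) answering: "
                         f"{question}"}],
          ).choices[0].message.content.strip()
          con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)  # read-only
          return con.execute(sql).fetchall()

      print(ask('Which notes link to the node titled "GTD"?'))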

  • I have heard good things about the Notion AI add-on, although I haven’t tried it myself.

  • I’m literally working on it right now - DM me on X if you wanna pair or something.
