Keep your data private and don't leak it to third parties. Use something like privateGPT (32k stars). Not your keys, not your data.
"Interact privately with your documents using the power of GPT, 100% privately, no data leaks"[0]
Is it going to send my personal data to OpenAI? Isn't that a serious problem? It doesn't sound like a wise thing to do, at least not without first redacting all sensitive personal data. Am I missing something?
This readme is very confusing. It says we're going to use the GPT-2 tokenizer, and use GPT-2 as an embedding model. But looking at the code, it seems to use the default LangChain OpenAIEmbeddings and OpenAI LLM. Aren't those text-embedding-ada-002 and text-davinci-003, respectively?
I don't understand how GPT-2 enters into this at all.
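For reference, here's roughly what those LangChain defaults do (a sketch assuming the pre-0.1 `langchain` package layout; the default model names are the ones the parent comment cites):

```python
# A sketch of what the repo's code appears to call, assuming the
# pre-0.1 langchain package layout:
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI

embeddings = OpenAIEmbeddings()  # default model: text-embedding-ada-002
llm = OpenAI()                   # default model: text-davinci-003

# Both constructors read OPENAI_API_KEY and send your text to OpenAI's API,
# regardless of which tokenizer the README mentions.
```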
Is there a company that makes a hosted version of something like this? I quite want a little AI that I can feed all my data and then ask questions.
I don't get it, GPT-2 is (one of the few) open models from OpenAI, you can just run it locally, why would you use their API for this? https://github.com/openai/gpt-2
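For the record, running it locally is a couple of lines with `transformers` (the model id `gpt2` is the Hugging Face mirror of those open weights):

```python
# GPT-2 runs fine locally via Hugging Face transformers; no API key needed.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("My documents say that", max_new_tokens=25)[0]["generated_text"])
```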
Am I the only one who doesn't need to search across my data? What are the use cases here
Anyone know how Milvus, Quickwit, and Pinecone compare?
I've been thinking about whether there are consulting opportunities with local businesses for LLMs, fine-tuning/vector search, and chatbots. Also building tools that make it easy to drag and drop files and get personalized inference. Recently this one popped into my LinkedIn feed: https://gpt-trainer.com/ . I've found a few others for documents, too.
Nope nope, I wouldn't want to compete with that on pricing. Running local open-source LLMs on a 3090 would also be a cool service, but it wouldn't scale at all.
Are there any other fine-tuning or vector-search startups you've seen?
Why have the OpenAI dependency when there are local embedding models that would be both faster and more accurate?
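For example (my pick, not something this project uses), sentence-transformers gives you local embeddings in a few lines:

```python
# Local embeddings with sentence-transformers; nothing leaves your machine.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast local model
vectors = model.encode(["what's in my documents?", "a passage from one of them"])
print(vectors.shape)  # (2, 384): one 384-dim vector per input
```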
I'm working for a company that acts as a security layer between any sensitive enterprise data and the LLMs. Regardless of the model (HF, ChatGPT, Bard), and regardless of the medium - conversational data, PDFs, knowledge bases like Notion, etc. It hides the sensitive data, preventing risky use while enabling fact-checking at the same time. Happy to make an intro if that's what you're looking for! tothepoint.tech
Also what does this do that llamaindex doesn't?
gpt4all does this fully locally. I recommend that those with a decent GPU give it a go.
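Something like this, assuming the gpt4all Python bindings (the model filename is just an example; pick any from their download list):

```python
# Fully local generation via the gpt4all Python bindings.
# The model name is an assumption; any gpt4all-compatible model works.
from gpt4all import GPT4All

model = GPT4All("ggml-gpt4all-j-v1.3-groovy")  # weights download locally on first use
print(model.generate("Summarize my notes on vector search.", max_tokens=100))
```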
Don't build a personal ChatGPT, and don't let OpenAI, Microsoft and their business partners (and probably the US government) have a bunch of your personal and private information.
Please cite this reference in your readme/blog, as it is the original source for your work and provides the background on the tradeoff between the two approaches: 1) fine-tuning vs. 2) search-ask:
https://github.com/openai/openai-cookbook/blob/main/examples...
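For readers who haven't seen the cookbook: fine-tuning bakes knowledge into the model's weights, while search-ask embeds your documents, retrieves the most relevant chunks per question, and stuffs them into the prompt. A minimal sketch of the search-ask loop (`embed` and `ask` are placeholders for whatever embedding model and LLM you use, not the cookbook's actual code):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_ask(question, chunks, embed, ask, k=3):
    # Rank document chunks by similarity to the question, then ask the LLM
    # with only the top-k chunks as context.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    context = "\n\n".join(ranked[:k])
    return ask(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```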
The author has a demo of this here: https://www.swamisivananda.ai/
The most frustrating thing about the many, many clones of this exact type of idea is that pretty much all of them require OpenAI.
Stop doing that.
You will have way more users if you make OpenAI (or anything that requires cloud) the 'technically possible but pretty difficult set of hoops to make it happen' option, instead of the other way around.
The best way to build these apps, IMO, is to make them work entirely locally, with a string in a .toml file that's swappable to any Hugging Face model (see the sketch below). Then if you really want the OpenAI crap, you can enable it with a Docker secret or a `pass` entry holding the key, plus a config change.
The default should be local-first: do as much as possible on-device, and only if the user /really/ wants to, have the collated prompt send a minimal set of tokens to OpenAI.
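Concretely, something like this (a sketch; the config keys and model id are made up, not from any particular project):

```python
import tomllib  # stdlib in Python 3.11+; `pip install tomli` on older versions

# Hypothetical config file contents; the model string is the swappable part.
CONFIG = """
[llm]
backend = "huggingface"   # flip to "openai" only if you really want it
model = "gpt2"            # any Hugging Face text-generation model id
"""

cfg = tomllib.loads(CONFIG)["llm"]

if cfg["backend"] == "huggingface":
    # Local-first default: everything runs on your own hardware.
    from transformers import pipeline
    llm = pipeline("text-generation", model=cfg["model"])
    print(llm("Local-first means", max_new_tokens=20)[0]["generated_text"])
else:
    # The cloud path stays possible but opt-in: the key comes from a Docker
    # secret or `pass`, never from the config file itself.
    raise RuntimeError("Set OPENAI_API_KEY via your secret manager first.")
```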