Hacker News

Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?

by jbryu on 5/21/2024, 8:55:49 PM with 12 comments

by bicx on 5/22/2024, 5:12:51 AM
Google scrapes like a maniac. And for profit. Many others do the same.
A website can put up a TOS prohibiting such use, but my understanding is that it is essentially unenforceable if the site is publicly accessible.
The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...
If you’re trying to prevent scraping of your data, your best option is to not make it public.
by Nextgrid on 5/21/2024, 11:22:18 PM
If you can paste the URL into a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different from a remotely hosted browser you control via natural language, or asking a human assistant to do it and email you the result.
by persedes on 5/21/2024, 11:53:57 PM
I've encountered a couple of robots.txt files that specifically block popular LLM crawlers from certain areas. Example:
https://www.sigmaaldrich.com/robots.txt
by icedchai on 5/21/2024, 11:50:38 PM
My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.
by brianjking on 5/21/2024, 9:28:29 PM
You can opt out.
https://platform.openai.com/docs/gptbot
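For anyone wondering what that opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, `example.com`, the paths, and the `is_allowed` helper are all made up for illustration; the only thing taken from the linked docs is that the crawler identifies itself as `GPTBot`. It only shows how a well-behaved crawler would interpret the rules, and robots.txt does nothing against a client that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot everywhere, and blocks every
# other crawler only from /private/. Domain and paths are placeholders.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Swapping `Disallow: /` for a narrower path under the `GPTBot` entry gives the "only certain areas" variant.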
by tripplyons on 5/21/2024, 10:57:47 PM
Scraping and violating a TOS are not illegal, but they can get you blocked.
by xcasperx on 5/22/2024, 6:43:21 PM
I believe this is current precedent around scraping:
https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
by brudgers on 5/22/2024, 5:28:15 AM
Terms of service enforcement is a matter of civil law.
Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.
by mensetmanusman on 5/21/2024, 11:58:27 PM
Preventing scraping also entrenches Google for eternity.
by rl3 on 5/22/2024, 12:23:59 AM
The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.
by 8note on 5/22/2024, 1:56:33 AM
Why? It's another user agent. Curl does the same thing, as do Chrome and Firefox.
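To make the "just another user agent" point concrete: from the server's side, the client identity is a self-reported header string. A rough sketch with the standard library follows; `build_request` and the `ExampleAgent/1.0` string are invented for illustration, and nothing is sent over the network.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) an HTTP request with a chosen User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Hypothetical agent name; any client can put any string here.
    req = build_request("https://example.com/", "ExampleAgent/1.0")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

The server only sees whatever string the client chooses to send, which is why user-agent blocking depends on the client identifying itself honestly.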
by aaron695 on 5/22/2024, 12:34:16 AM
[dead]