If you can paste the URL into a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different from a remotely hosted browser you control via natural language, or from asking a human assistant to do it and email you the result.
I've encountered a couple of robots.txt files that specifically block popular LLM crawlers from certain areas. Example:
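Roughly what those look like (an illustrative sketch, not a quote of any particular site; GPTBot, ClaudeBot, and Google-Extended are real crawler tokens published by OpenAI, Anthropic, and Google, but the paths are made up):

    # Hypothetical robots.txt blocking LLM crawlers from one area
    User-agent: GPTBot
    Disallow: /articles/

    User-agent: ClaudeBot
    Disallow: /articles/

    # Google-Extended controls AI-training use, not Search indexing
    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /

Of course, robots.txt is purely advisory; it only stops crawlers that choose to honor it.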
My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.
You can opt out (via robots.txt).
Scraping and violating a site's TOS are not illegal, but they can get you blocked.
I believe this is the current precedent around scraping:
Terms of service enforcement is a matter of civil law.
Your terms of service only have teeth if you have the legal wherewithal to go after those who violate them. Otherwise you're toothless.
Preventing scraping also entrenches Google for eternity: the incumbents have already crawled the open web, while any new entrant gets blocked at the door.
The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.
Why? It's another user agent. curl does the same thing, as do Chrome and Firefox.
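A minimal sketch of the point (the URL is a placeholder and the UA strings are illustrative, not exact copies of real clients): the server sees nothing but a self-reported User-Agent header, so curl, Firefox, and an LLM agent fetching a public page are indistinguishable apart from that label.

    import urllib.request

    URL = "https://example.com/"  # placeholder; any public page works

    # Three self-reported identities requesting the same public resource.
    for ua in ("curl/8.5.0",
               "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
               "SomeLLMAgent/0.1"):  # hypothetical agent name
        req = urllib.request.Request(URL, headers={"User-Agent": ua})
        with urllib.request.urlopen(req) as resp:
            # Same status, same bytes, regardless of who claims to be asking
            print(ua.split()[0], "->", resp.status, len(resp.read()), "bytes")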
Google scrapes like a maniac. And for profit. Many others do the same.
A website can put up a TOS prohibiting such use, but my understanding is that it's essentially unenforceable if the site is publicly accessible.
The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...
If you’re trying to prevent scraping of your data, your best option is to not make it public.