Related thread for the official blog post: https://news.ycombinator.com/item?id=40799275
Side note: They seem to serve a different robots.txt depending on User-Agent and IP: https://merj.com/blog/investigating-reddits-robots-txt-cloak...
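For anyone who wants to check the User-Agent part themselves, here is a rough sketch of one way to do it: request the same robots.txt under several User-Agent strings and group the agents by a hash of the response body. The function names (`group_by_body`, `fetch_over_http`) are made up for illustration, and the fetcher is passed in as a parameter so the grouping logic can be tried offline. It cannot detect IP-based differences from a single machine, and it says nothing about what Reddit actually returns.

```python
import hashlib
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_over_http(user_agent: str,
                    url: str = "https://www.reddit.com/robots.txt") -> bytes:
    """Fetch the robots.txt body while presenting the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def group_by_body(user_agents: Iterable[str],
                  fetch: Callable[[str], bytes]) -> Dict[str, List[str]]:
    """Map the SHA-256 of each response body to the agents that received it."""
    groups: Dict[str, List[str]] = {}
    for agent in user_agents:
        digest = hashlib.sha256(fetch(agent)).hexdigest()
        groups.setdefault(digest, []).append(agent)
    return groups
```

Calling `group_by_body(["Googlebot", "ExampleBot/1.0"], fetch_over_http)` and getting more than one key back would mean the bodies differed between those agents. The agent strings are placeholders, and a server may also vary its response for reasons unrelated to cloaking (rate limiting, for example).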
If anyone is curious how deeply destructive, how deeply approved by the NSA, or how deeply self-sabotaging for society the modern AI training data pipeline is becoming, I’d refer them to SB 1047.
OpenAI is openly collaborating with the NSA; Google is manipulating the definition of a web crawl; Anthropic has installed a bunch of humanitarians from Jump Trading as the leading mech interp group, which makes strident claims about how all this stuff works based on weights you do not and never will have access to.
They’re telling you: “And you will do nothing, because you can do nothing.”
I invite you to join me in proving that we can in fact do something.
Yup, this is what I thought would happen. It wouldn't surprise me if Reddit goes a step further and requires login to view pages.
AI is the death knell of the web as we've known it for the past three decades. Information that was once freely available will retreat behind login walls, and sites will charge big companies for access to train their models.
I wonder if some standardised data API will be settled upon. Perhaps one already exists?
https://www.reddit.com/robots.txt has additional comments:
# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
User-agent: *
Disallow: /
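To make the effect of those two directives concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses only the `User-agent` and `Disallow` lines quoted above (the comment lines are left out) and asks whether a given agent may fetch a URL. The helper name `is_allowed` and the agent strings are illustrative, and this only shows how a compliant parser reads this one version of the file, not what any particular crawler does in practice.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above; comment lines omitted for brevity.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Agent names are placeholders; the wildcard group applies to all of them.
    for agent in ("Googlebot", "ExampleBot"):
        print(agent, is_allowed(agent, "https://www.reddit.com/r/programming/"))
```

Because the only group is `User-agent: *` with `Disallow: /`, every agent is refused for every path under this version of the file.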
Technically this title violates HN's title policy, as it should just be "Reddit's robots.txt" or something, but "Reddit's robots.txt changed" is more useful. I'm curious to see if mods change it.
Is it a good time to start a competitor? Given that Reddit might take Quora's path to oblivion.
This is not good.
80% of my Google searches for other people's opinions now end with "site:reddit.com", and there are surprisingly many of them. The alternative is Reddit's own search, and it tends to produce less relevant results.
Every search engine other than Google has stopped indexing pages from Reddit.
Google has not commented on whether they plan to respect it. Rich Results[0] say they're using a version from June 25. The new version was last modified July 1.
Also it seems that since 2018 it has not actually changed, lol http://web.archive.org/web/20180501000000*/https://old.reddi...
Most robots don't honour robots.txt anyways...
lol what? Just 20-30 minutes ago I saw this at #52, and when I tried to find it again just now it was at #468 https://archive.ph/ReTR5 but I wonder if that is algorithmically natural or whatnot, lol
Oh great now we can rely on Reddit's own renowned search functionality /s
What are they thinking?
Interesting. There have been all sorts of "Reddit's content is key to search results, adding 'reddit' to search queries makes the results good" stories. And there's been a lot of talk about how some of the big ML makers, notably Google, depend on Reddit's content to train their AI. And Google has that recent $60 million deal for content. So clearly Reddit's execs have been talking about how their content is valuable and they shouldn't give it away for free.
But at the same time, blocking search engines from indexing your social media site is a dangerous game. Any search engine that respects this is gonna effectively de-list Reddit. That's no good for views, and views are what make Reddit money. Presumably they have negotiated private deals with Google and probably Microsoft for this and are trying to sell their data to ML companies, because otherwise this would seem suicidal.
Kind of a shame. The information is still going to get shared around to all the giant corporations, but Reddit will presumably make it harder to access for all the little guys. And the more they tie the content to dollars, the more managers on the inside will start doing stupid things to try and generate more of whatever the most valuable kinds of content are.