Crawlers impact the operations of the Wikimedia projects

  • The dumbest part of this is that all Wikimedia projects already export a dump for bulk downloading: https://dumps.wikimedia.org/

    So it's not like you need to crawl the sites to get content for training your models...
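
    For anyone who hasn't used them: mirroring a dump is a short script, not a crawl. A minimal sketch, assuming the usual enwiki-latest-pages-articles.xml.bz2 file name (check the index at dumps.wikimedia.org for the exact path):

      import requests

      # Assumed file name; verify it against the dump index before relying on it.
      DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                  "enwiki-latest-pages-articles.xml.bz2")

      with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
          resp.raise_for_status()
          with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
              for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                  out.write(chunk)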

  • This phenomenon is the wilful destruction of valuable global commons at the hands of a very small number of companies. The number of individually accountable decision-makers driving this destruction is probably in the dozens or low hundreds.

  • This has become a concern for the Arch Linux wiki, which now makes you pass a proof-of-work challenge to read it, a challenge my anti-fingerprinting browser fails every time. That puts a burden on human readers while being only a minor, temporary annoyance for the bots. Think about it: the T in CAPTCHA stands for "Turing". What is the design goal of AI? To create machines that can pass a Turing test.

    I fear the end state of this game is the death of the anonymous internet.

  • A while back I wrote up a way to turn the big Wikipedia XML dump into a database. Not a generic table with articles but thousands of tables, one for each article "type". I'm not sure if this is still the best way to go about it.

    https://feder001.com/exploring-wikipedia-as-a-database-part-...
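
    Not the method from that write-up, just a rough sketch of the general idea against the standard pages-articles dump; SQLite and the first-infobox-name "type" heuristic are my own stand-ins for whatever the post actually does:

      import bz2
      import re
      import sqlite3
      import xml.etree.ElementTree as ET

      DUMP = "enwiki-latest-pages-articles.xml.bz2"   # assumed dump file name
      db = sqlite3.connect("wiki.db")

      def local(tag):
          # Dump elements are namespaced, e.g. "{http://www.mediawiki.org/xml/export-0.10/}page"
          return tag.rsplit("}", 1)[-1]

      def article_type(wikitext):
          # Crude stand-in: use the first infobox name as the article "type".
          m = re.search(r"\{\{Infobox\s+([^|\n}]+)", wikitext or "")
          return m.group(1).strip().lower() if m else "untyped"

      for _, elem in ET.iterparse(bz2.open(DUMP, "rb")):
          if local(elem.tag) != "page":
              continue
          title = text = None
          for child in elem.iter():
              if local(child.tag) == "title":
                  title = child.text
              elif local(child.tag) == "text":
                  text = child.text
          table = "articles_" + re.sub(r"\W+", "_", article_type(text))
          db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (title TEXT PRIMARY KEY, wikitext TEXT)')
          db.execute(f'INSERT OR REPLACE INTO "{table}" VALUES (?, ?)', (title, text))
          elem.clear()  # keep parsed elements from accumulating in memory

      db.commit()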

  • Contrary to what most commenters assume, the high bandwidth usage is not coming from scraping text, but images. They are pretty clear about it:

    > Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.

  • Maybe this is an insane idea, but ... how about a spider P2P network?

    At least for local AIs it might not be a terrible idea. Basically a distributed cache of the most common sources our bots might pull from. That would mean only a few fetches from each website per day, and then the rest of the bandwidth load can be shared amongst the bots.

    Probably lots of privacy issues to work around with such an implementation though.
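
    A very rough sketch of the bot-side half of that, with a local SQLite table standing in for the shared P2P layer (the TTL, schema, and everything else here are made up for illustration):

      import hashlib
      import sqlite3
      import time

      import requests

      TTL = 24 * 3600  # hit each origin for a given URL at most once per day

      # In a real design this table would live in a DHT shared between bots;
      # SQLite is just a stand-in so the sketch runs locally.
      cache = sqlite3.connect("shared_cache.db")
      cache.execute("CREATE TABLE IF NOT EXISTS pages (key TEXT PRIMARY KEY, fetched_at REAL, body BLOB)")

      def fetch(url):
          key = hashlib.sha256(url.encode()).hexdigest()
          row = cache.execute("SELECT fetched_at, body FROM pages WHERE key = ?", (key,)).fetchone()
          if row and time.time() - row[0] < TTL:
              return row[1]   # served from the shared cache, no load on the origin
          body = requests.get(url, timeout=30).content
          cache.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (key, time.time(), body))
          cache.commit()
          return body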

  • Previous discussions:

    (91 points, 30 days ago, 101 comments) https://news.ycombinator.com/item?id=43555898

    (49 points, 29 days ago, 45 comments) https://news.ycombinator.com/item?id=43562005

  • Wikimedia's recent post completely misses the mark. What they're experiencing isn't merely bulk data collection – it's the unauthorized transformation of their content infrastructure into a free API service for commercial AI tools.

    It's not crawling for training that is the issue, and it's an oversimplification to say that AI companies are merely "training" on someone's data.

    When systems like Claude and ChatGPT fetch Wikimedia content to answer user queries in real time, they're effectively using Wikimedia as an API – with zero compensation, zero attribution, and zero of the typical API management that would come with such usage. Each time a user asks these AI tools a question, they may trigger fresh calls to Wikimedia servers, creating a persistent, on-demand load rather than a one-time scraping event.

    The distinction is crucial. Traditional search engines like Google crawl content, index it, and then send users back to the original site. These AI systems instead extract the value without routing any traffic back, breaking the implicit value exchange that has sustained the web ecosystem.

    Wikimedia's focus on technical markers of "bot behavior" – like not interpreting JavaScript or accessing uncommon pages – shows they're still diagnosing this as a traditional crawler problem rather than recognizing the fundamental economic imbalance. They're essentially subsidizing commercial AI products with volunteer-created content and donor-funded infrastructure.

    The solution has been available all along. HTTP 402 "Payment Required" was built into the web's foundation for exactly this scenario. Combined with the Lightning Network's micropayment capabilities and the L402 protocol implementation, Wikimedia could:

      - Keep content free for human users
      
      - Charge AI services per request (even fractions of pennies would add up)
      
      - Generate sustainable infrastructure funding from commercial usage
      
      - Maintain their open knowledge mission while ending the effective subsidy
    
    Tools like Aperture make implementation straightforward – a reverse proxy that distinguishes between human and automated access, applying appropriate pricing models to each.
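
    The basic shape of that is simple to sketch. This is only an illustration, not how Aperture actually works internally: the challenge header format follows my reading of the L402 spec and should be checked against it, and the bot heuristic, upstream URL, and credential check are placeholders.

      from wsgiref.simple_server import make_server
      import urllib.request

      UPSTREAM = "https://wiki.example.org"   # placeholder origin to proxy

      def looks_automated(environ):
          # Placeholder heuristic; a real gateway would use richer signals.
          ua = environ.get("HTTP_USER_AGENT", "").lower()
          return not ua or "bot" in ua or "python-requests" in ua

      def has_valid_l402(environ):
          # Real verification would check the macaroon and the Lightning payment
          # preimage; this sketch only checks that some credential is present.
          return environ.get("HTTP_AUTHORIZATION", "").startswith("L402 ")

      def app(environ, start_response):
          if looks_automated(environ) and not has_valid_l402(environ):
              # 402 challenge as I understand L402: a macaroon plus a Lightning invoice.
              challenge = 'L402 macaroon="<base64-macaroon>", invoice="<bolt11-invoice>"'
              start_response("402 Payment Required",
                             [("WWW-Authenticate", challenge),
                              ("Content-Type", "text/plain")])
              return [b"Automated access requires payment; human browsing stays free.\n"]
          # Human (or paid-up) traffic is proxied through to the origin unchanged.
          with urllib.request.urlopen(UPSTREAM + environ.get("PATH_INFO", "/")) as resp:
              body = resp.read()
          start_response("200 OK", [("Content-Type", resp.headers.get("Content-Type", "text/html"))])
          return [body]

      if __name__ == "__main__":
          make_server("", 8080, app).serve_forever()

    Real macaroon minting, invoice generation, and payment verification are exactly what Aperture and the Lightning tooling exist for; the point is only that the 402 handshake itself is mechanically simple.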

    Instead of leading the way toward a sustainable model for knowledge infrastructure in the AI age, Wikimedia is writing blog posts about traffic patterns. If your content is being used as an API, the solution is to become an API – with all the management, pricing, and terms that entails. Otherwise, they'll continue watching their donor resources drain away to support commercial AI inference costs.

    I suspect several factors contribute to this resistance:

    Ideological attachment to "free" as binary rather than nuanced: Many organizations have built their identity around offering "free" content, creating a false dichotomy where any monetization feels like betrayal of core values. They miss that selective monetization (humans free, automated commercial use paid) could actually strengthen their core mission.

    Technical amnesia: The web's architects built payment functionality into HTTP from the beginning, but without a native digital cash system, it remained dormant. Now that Bitcoin and Lightning provide the missing piece, there's institutional amnesia about this intended functionality.

    Complexity aversion: Implementing new payment systems feels like adding complexity, when in reality it simplifies the entire ecosystem by aligning incentives naturally rather than through increasingly byzantine rate-limiting and bot-detection schemes.

    The comfort of complaint: There's a certain organizational comfort in having identifiable "villains" (bots, crawlers, etc.) rather than embracing solutions that might require internal change. Blog posts lamenting crawler impacts are easier than implementing new systems.

    False democratization concerns: Some worry that payment systems would limit access to those with means, missing that micropayments precisely enable democratization by allowing anyone to pay exactly for what they use without arbitrary gatekeeping.

  • From what I understand, the problem is not really scrapers that "pound" the service with thousands of requests for the same things, but that they are scraping the whole of Wikipedia, including heavy content like video that is rarely accessed.

    If that is the case, I find it a little concerning that Wikipedia's model is based on having most resources rarely accessed.

    Otherwise, if my understanding is wrong, it would mean that AI companies are constantly re-scraping the same content for changes, like a search engine would, but that makes little sense to me since I would guess that models are only trained once every few months at most.

    And I also don't understand how they were not already encountering this problem with the existing constant crawling by search engines...

  • > When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn’t been requested in a while, its content needs to be served from the core data center.

    Maybe a similar system could be set up in which bot requests present their latest cached version or a hash of the requested content before a full response is granted. That way, if the local copy is recent, the request doesn't burden the server with content the bot has already seen, and the bot can serve its users from the version it has stored locally.
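
    HTTP already has most of this in the form of conditional requests: a well-behaved scraper can send If-None-Match with the ETag it last saw (or If-Modified-Since with a date) and get a cheap 304 back when nothing has changed. A minimal bot-side sketch, assuming the origin actually emits ETag headers:

      import requests

      etags = {}   # url -> (etag, cached body); a real bot would persist this

      def polite_get(url):
          headers = {}
          if url in etags:
              headers["If-None-Match"] = etags[url][0]
          resp = requests.get(url, headers=headers, timeout=30)
          if resp.status_code == 304:
              return etags[url][1]   # unchanged: reuse the local copy, cheap for the server
          resp.raise_for_status()
          if "ETag" in resp.headers:
              etags[url] = (resp.headers["ETag"], resp.content)
          return resp.content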

  • The comment below the post makes a lot of sense:

    > I suggest Wikimedia distribute Wikimedia Commons content on tape. The largest tape drive (IBM 3592) can store 50 TB per cartridge. The total size of Wikimedia Commons is 610.4 TB, so fewer than 15 tapes are needed to store the entire site. You could lend the tapes to any company that wants your content, if they promise to return them within a set period of time.

  • Don't block them; feed them delirious, fever-dreamed llama slop.

  • Nothing speaks quite so clearly to the ideological lean of the Wikimedia Foundation as their choice of social media links: “Share on: Mastodon, Bluesky”

  • Here's what I don't get. Wikimedia claims to be a nonprofit for spreading knowledge. They sit on nearly half a billion dollars in assets.

    Every consumer of their content would prefer a firehose of content deltas over having to scrape for diffs.

    They obviously have the capital to provide this, and still grow their funds for eternity without ever needing a single dollar in external revenue.

    Why don't they?
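
    For what it's worth, consuming a change firehose is the easy part on the client side. A minimal sketch of following a server-sent-events change feed, modeled on Wikimedia's EventStreams recent-changes stream (which publishes change events rather than full content deltas); the URL is as I recall it and should be checked against their docs:

      import json

      import requests

      # Assumed endpoint, from memory of the EventStreams documentation.
      STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

      with requests.get(STREAM, stream=True, timeout=60) as resp:
          resp.raise_for_status()
          for line in resp.iter_lines():
              if not line.startswith(b"data: "):
                  continue   # skip SSE comments and the event:/id: fields
              change = json.loads(line[len(b"data: "):])
              # A downstream consumer would refetch or diff only the changed titles.
              print(change.get("wiki"), change.get("title"), change.get("type"))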