Hacker News

OpenAI scraping Reddit through redlib instances

by udev4096 on 6/8/2025, 2:53:41 PM with 2 comments

by gkbrk on 6/8/2025, 3:48:48 PM
The author used to run a public Arch mirror under mirror.ext4.xyz, so it's not exactly an unknown domain.
Combined with the fact that a lot of their self-hosted stuff, including the Reddit front-ends, is in the Certificate Transparency logs [1], it's not hugely surprising that web crawlers would run into them.
[1]: https://crt.sh/?q=ext4.xyz
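To illustrate how easily hostnames fall out of Certificate Transparency data, here is a minimal sketch. It assumes entries shaped like crt.sh's JSON output, where a `name_value` field holds one or more newline-separated names; the function name and the `example.org` sample data are hypothetical, and this is not a claim about how any particular crawler works.

```python
def hostnames_from_ct_entries(entries):
    """Collect the unique hostnames found in CT log search results.

    Assumes each entry is a dict with a `name_value` string that may
    contain several names separated by newlines.
    """
    names = set()
    for entry in entries:
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name:
                names.add(name)
    return sorted(names)


# Hypothetical sample data, not real log contents.
sample = [
    {"name_value": "mirror.example.org\nexample.org"},
    {"name_value": "redlib.example.org"},
    {"name_value": "Redlib.example.org"},  # duplicate, differs only in case
]

print(hostnames_from_ct_entries(sample))
```

Anyone can run this kind of lookup, so a subdomain that has ever received a publicly trusted certificate should be treated as publicly known.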
by unstablediffusion on 6/8/2025, 3:29:35 PM
that's not scraping, that's web search.
their scrapers wouldn't identify themselves