I was focusing mostly on cybersecurity-related subreddits because the vulnerability and exploit discussions were of great value to me.
I built a little scraper in Go that stores the JSON data (instead of the HTML that the ArchiveTeam Warrior stores) to save disk space. [1]
The problem with Reddit's API is that every listing endpoint only returns 1,000 entries across 10 pages. So hot/top/new and search results are all capped: if a keyword matches more links than that, you won't discover them.
So to discover more posts you need a fairly specific keyword list, and then you search each subreddit for every entry in that list.
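A rough sketch of that approach in Go: the `search.json` endpoint and its `after`/`limit`/`restrict_sr` parameters are Reddit's public JSON listing API, but the helper names, the subreddit, and the keyword list here are made up for illustration, and the storage callback is left as a stub.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// listing mirrors the slice of Reddit's listing JSON we need:
// the pagination cursor and each post's raw JSON blob.
type listing struct {
	Data struct {
		After    string `json:"after"`
		Children []struct {
			Data json.RawMessage `json:"data"` // kept verbatim for storage
		} `json:"children"`
	} `json:"data"`
}

// searchURL builds one page of a subreddit keyword search.
// restrict_sr=1 keeps results inside the subreddit; after is the
// cursor returned by the previous page (empty for the first page).
func searchURL(sub, keyword, after string) string {
	q := url.Values{}
	q.Set("q", keyword)
	q.Set("restrict_sr", "1")
	q.Set("limit", "100")
	if after != "" {
		q.Set("after", after)
	}
	return fmt.Sprintf("https://www.reddit.com/r/%s/search.json?%s", sub, q.Encode())
}

// crawl pages through one subreddit/keyword pair until Reddit stops
// returning a cursor -- in practice after ~1,000 entries (10 pages of 100).
func crawl(sub, keyword string, store func(json.RawMessage)) error {
	after := ""
	for page := 0; page < 10; page++ { // hard cap mirrors the API limit
		req, err := http.NewRequest("GET", searchURL(sub, keyword, after), nil)
		if err != nil {
			return err
		}
		req.Header.Set("User-Agent", "archive-sketch/0.1") // Reddit rejects blank UAs
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var l listing
		err = json.NewDecoder(resp.Body).Decode(&l)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, c := range l.Data.Children {
			store(c.Data) // in a real scraper: append raw JSON to disk
		}
		if l.Data.After == "" {
			return nil // listing exhausted before the cap
		}
		after = l.Data.After
	}
	return nil
}

func main() {
	// In real use you'd loop a keyword list over each subreddit, e.g.:
	//   for _, kw := range keywords { crawl("netsec", kw, saveRawJSON) }
	fmt.Println(searchURL("netsec", "exploit", ""))
}
```

Storing the `data` field as `json.RawMessage` is what keeps the archive small: the post JSON is written byte-for-byte rather than being re-rendered from HTML.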
Pushshift was the Reddit archive, but apparently recent agreements with Reddit may have changed that.
Anyone else creating a Reddit archive will likely get a C&D.
It's next on my list after I finish the MySpace archive.
Seriously, why would anybody do this? Reddit has such a low signal-to-noise ratio that it would be a waste of resources. There may be value in keeping an archive of some individual subreddits, but not the main bulk of Reddit itself.
Yes.
There is the Pushshift dataset covering posts and comments through 2022 [1].
And the ArchiveTeam began crawling Reddit some time ago as well [2].
[1] https://old.reddit.com/r/pushshift/comments/10bwxke/updated_...
[2] https://news.ycombinator.com/item?id=36254172