Hacker News

Ask HN: Best way to keep the raw HTML of scraped pages?

by vitorbaptistaa on 11/11/2022, 4:41:23 PM with 9 comments

by mdaniel on 11/11/2022, 5:53:39 PM
If you weren't already aware, Scrapy has strong support for this via its HttpCacheMiddleware; you can choose whether to have it actually behave like a cache, returning already-scraped content on a match, or merely act as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....
Their OOtB storage does what the sibling comment says about sha1-ing the request and then sharding the output filename by the first two characters: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
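For anyone who wants to try this, a minimal sketch of the relevant `settings.py` entries follows. The setting names are Scrapy's documented `HTTPCACHE_*` options; the values are illustrative assumptions, not a recommendation from the comment above.

```python
# settings.py fragment (a sketch; values are assumptions for illustration)
HTTPCACHE_ENABLED = True

# A relative path is resolved inside the project's .scrapy data directory.
HTTPCACHE_DIR = "httpcache"

# Filesystem backend: one directory per cached request.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# DummyPolicy stores every response and replays it on a repeat request.
# RFC2616Policy is the alternative that honors HTTP caching headers.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_GZIP = True          # gzip the cached data on disk
```

With `DummyPolicy` a rerun of the spider reads pages back from disk instead of refetching them, which is handy when only the parsing code has changed.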
by PaulHoule on 11/11/2022, 5:05:32 PM
Content addressable storage. Generate names with SHA-3, split off bits of the names into directories like
```
   name[0:2]/name[0:4]/name[0:6]/name
```
to keep any of the directories from getting too big (even if the filesystem can handle huge directories, various tools you use with it might not). Keep a list of where the files came from, plus other metadata, in a database so you can find things.
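A rough sketch of that layout in Python, assuming SHA3-256 over the page bytes as the name. The helper names `shard_path` and `store_page` are made up for illustration, and the JSON-lines index is only a stand-in for the database mentioned above.

```python
import hashlib
import json
from pathlib import Path


def shard_path(name: str) -> Path:
    """Build name[0:2]/name[0:4]/name[0:6]/name from a hex digest."""
    return Path(name[0:2]) / name[0:4] / name[0:6] / name


def store_page(root: Path, url: str, html: bytes) -> Path:
    """Write html under its SHA-3 digest and record where it came from."""
    name = hashlib.sha3_256(html).hexdigest()
    path = root / shard_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # identical content is stored only once
        path.write_bytes(html)
    # Stand-in for the metadata database: one JSON line per fetch.
    with (root / "index.jsonl").open("a", encoding="utf-8") as index:
        index.write(json.dumps({"url": url, "sha3_256": name}) + "\n")
    return path
```

Because the name is derived from the content, two URLs that return byte-identical HTML share one file, while the index still records both fetches.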
by placidpanda on 11/11/2022, 6:00:35 PM
When doing this in the past, I settled on an SQLite database with one table that stores the compressed HTML (gzip or lzma) along with other columns (id/date/url/domain/status/etc.).
This also made it easy to alert when something broke (query the table for count(*) where status=error) and to rerun the parser for failures.
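Something along these lines, as a sketch using only the standard library. The table name `pages`, the column names, and the helper functions are assumptions for illustration, and gzip is picked arbitrarily over lzma.

```python
import gzip
import sqlite3
from datetime import datetime, timezone
from urllib.parse import urlsplit

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id         INTEGER PRIMARY KEY,
    fetched_at TEXT NOT NULL,
    url        TEXT NOT NULL,
    domain     TEXT NOT NULL,
    status     TEXT NOT NULL,
    html_gz    BLOB
)
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_page(conn: sqlite3.Connection, url: str, html: str, status: str = "ok") -> None:
    """Insert one fetch, storing the HTML gzip-compressed."""
    conn.execute(
        "INSERT INTO pages (fetched_at, url, domain, status, html_gz) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            url,
            urlsplit(url).netloc,
            status,
            gzip.compress(html.encode("utf-8")),
        ),
    )
    conn.commit()


def load_html(conn: sqlite3.Connection, url: str):
    """Return the most recently stored HTML for url, or None if absent."""
    row = conn.execute(
        "SELECT html_gz FROM pages WHERE url = ? ORDER BY id DESC LIMIT 1",
        (url,),
    ).fetchone()
    return gzip.decompress(row[0]).decode("utf-8") if row else None


def count_errors(conn: sqlite3.Connection) -> int:
    """The alerting query: how many rows are marked as errors."""
    return conn.execute(
        "SELECT count(*) FROM pages WHERE status = 'error'"
    ).fetchone()[0]
```

Rerunning the parser for failures then amounts to selecting the rows where `status = 'error'` and feeding their stored HTML back through it.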
by compressedgas on 11/11/2022, 4:57:46 PM
WARC.
by sbricks on 11/11/2022, 4:50:33 PM
I'd just apply an intelligent file naming strategy based on timestamps and URLs. Keep in mind that a folder should not contain more than 1000 files or subfolders; otherwise it's slow to list.
by nf-x on 11/11/2022, 4:55:03 PM
Did you try using some of the cheap cloud storage, like AWS S3?