Content addressable storage. Generate names with SHA-3, split off bits of the names into directories like
name[0:2]/name[0:4]/name[0:6]/name
to keep any one directory from getting too big (even if the filesystem can handle huge directories, various tools you use with it might not). Keep a list of where the files came from and other metadata in a database so you can find things. When doing this in the past, I settled on an sqlite database with one table that stores the compressed HTML (gzip or lzma) along with other columns (id/date/url/domain/status/etc.).
This also made it easy to alert when something broke (query the table for count(*) where status=error) and to rerun the parser for the failures.
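If a concrete sketch helps, this is roughly what that looks like in Python; the table name, columns, and helper names below are illustrative rather than the original schema, and it combines the sharded file layout with the one-table sqlite approach for clarity:

    import gzip
    import hashlib
    import sqlite3
    from pathlib import Path

    ROOT = Path("store")

    def shard_path(digest: str) -> Path:
        # name[0:2]/name[0:4]/name[0:6]/name keeps any single directory small
        return ROOT / digest[0:2] / digest[0:4] / digest[0:6] / digest

    db = sqlite3.connect("pages.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, sha3 TEXT, date TEXT, url TEXT, "
        "domain TEXT, status TEXT, body_gz BLOB)"
    )

    def save(url: str, domain: str, html: bytes) -> str:
        digest = hashlib.sha3_256(html).hexdigest()
        path = shard_path(digest)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(html)  # content-addressed copy on disk
        db.execute(
            "INSERT INTO pages (sha3, date, url, domain, status, body_gz) "
            "VALUES (?, datetime('now'), ?, ?, 'ok', ?)",
            (digest, url, domain, gzip.compress(html)),
        )
        db.commit()
        return digest

    # Alerting / reruns: count failures and pick them up again
    failed = db.execute("SELECT count(*) FROM pages WHERE status = 'error'").fetchone()[0]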
WARC.
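For context, WARC is the Web ARChive format the Internet Archive uses for crawl data. A minimal sketch of writing one record, assuming the warcio library (my choice of tool, not something named here):

    from io import BytesIO
    import requests
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    url = "https://example.com/"
    resp = requests.get(url)

    with open("crawl.warc.gz", "wb") as fh:
        writer = WARCWriter(fh, gzip=True)
        http_headers = StatusAndHeaders("200 OK", list(resp.headers.items()), protocol="HTTP/1.1")
        record = writer.create_warc_record(
            url, "response", payload=BytesIO(resp.content), http_headers=http_headers
        )
        writer.write_record(record)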
I'd just apply an intelligent file-naming strategy based on timestamps and URLs. Keep in mind that a folder should not contain more than about 1000 files or other folders, otherwise it's slow to list.
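Something like the following, where date-based directories keep folder sizes bounded (the layout here is just one possible scheme, not a prescription):

    import re
    from datetime import datetime, timezone
    from pathlib import Path

    def page_path(url: str, root: Path = Path("pages")) -> Path:
        now = datetime.now(timezone.utc)
        slug = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:150]  # URL made filesystem-safe
        # One directory per hour stays under ~1000 entries at modest crawl
        # rates; add minutes to the path if you fetch faster than that.
        return root / now.strftime("%Y/%m/%d/%H") / f"{now:%H%M%S}_{slug}.html"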
Did you try using one of the cheap cloud storage options, like AWS S3?
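If you do, uploading is only a few lines with boto3 (the bucket name and key scheme below are placeholders):

    import gzip
    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def upload(html: bytes, bucket: str = "my-scrape-bucket") -> str:
        digest = hashlib.sha256(html).hexdigest()
        key = f"html/{digest[:2]}/{digest}.html.gz"  # sharded, content-addressed key
        s3.put_object(Bucket=bucket, Key=key, Body=gzip.compress(html),
                      ContentType="text/html", ContentEncoding="gzip")
        return key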
If you weren't already aware, Scrapy has strong support for this via its HTTPCache middleware; you can choose whether to have it actually behave like a cache, returning already-scraped content when a request matches, or merely act as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....
Their out-of-the-box storage does essentially what the sibling comment describes: SHA-1-ing the request and then sharding the output path by the first two characters of the hash: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
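Turning it on is just a few settings in settings.py (values here are illustrative):

    # settings.py
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"    # lives under the project's .scrapy data dir
    HTTPCACHE_EXPIRATION_SECS = 0  # 0 = cached responses never expire
    HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
    # DummyPolicy stores every response and always serves it back when present;
    # RFC2616Policy honours HTTP caching headers and acts as a pass-through cache.
    HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"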