A distributed POSIX file system built on top of Redis and S3

  • I advocate for using Route 53 as a database and even I think this is terrifying.

  • We started building JuiceFS in 2016 and released it as a SaaS solution in 2017. After years of improvements, we recently released the core of JuiceFS; hopefully you will find it useful.

    I'm the founder of JuiceFS and would be happy to answer any questions here.

  • Although the repo looks new, the founder said they've been building it for 4 years.

    https://news.ycombinator.com/item?id=25724925

  • We were doing this at Avere in 2015. The system built a POSIX filesystem out of S3 objects on the backend (including metadata) and then served it over NFS or SMB from a cluster of cache nodes. Keeping metadata and data in separate data stores with different consistency models is a disaster waiting to happen - ask anyone who has run Lustre. Having fast caches with SSDs was the key to getting any kind of decent performance out of S3. The fun part was mounting a filesystem and running df and seeing an exabyte of capacity. They were acquired by Microsoft in 2018 and integrated into Azure.

  • How POSIX-compatible is it exactly? There are a lot of niche features that tend to break on network filesystems that aren't fully compliant. Do unlinked files remain accessible (the dreaded ESTALE on some NFS implementations)? mmap? Atomic rename? Atomic append? Range locks? What's the consistency model?

    Some of those things don't appear to be covered by pjdfstest.
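
    For reference, the unlinked-file case can be checked with something like this minimal Go sketch (the /mnt/jfs path is just a placeholder for wherever the filesystem is mounted): create a file, unlink it while the handle is still open, then read it back through the same handle.

      package main

      import (
          "fmt"
          "os"
      )

      func main() {
          // Placeholder path; point it at the mount under test.
          f, err := os.Create("/mnt/jfs/scratch")
          if err != nil {
              panic(err)
          }
          if _, err := f.WriteString("still here?"); err != nil {
              panic(err)
          }
          // Remove the directory entry while the file is still open.
          if err := os.Remove("/mnt/jfs/scratch"); err != nil {
              panic(err)
          }
          // On a compliant filesystem the open handle keeps working;
          // on some NFS implementations this is where ESTALE shows up.
          if _, err := f.Seek(0, 0); err != nil {
              panic(err)
          }
          buf := make([]byte, 32)
          n, _ := f.Read(buf)
          fmt.Println(string(buf[:n]))
          f.Close()
      }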

  • Am I the only one who always feels uneasy when using NFS-like filesystems? In my experience, way too much software has been built without any kind of fault tolerance around file system access, and no matter how good your network filesystem is, it can still cause havoc and all sorts of data loss.

    I've seen so many disasters caused by software assuming a file can't just vanish into thin air, something that can very much happen when your FS is running on top of an arbitrary network connection. Hiding away such a fundamental detail in order to provide a file-like API tends to instill all sorts of bad ideas in people (NFS via Wi-Fi? Why not!).

  • This is neat! I am quite a fan of all the Go-based file systems that are springing up. Question: what are the main points of comparison between JuiceFS and SeaweedFS?

    Here is a compendium for those interested:

    https://github.com/gostor/awesome-go-storage

  • I think the major problem is latency. Try sshfs (a filesystem over SSH) and see what I mean. Don't get me wrong, I use and like sshfs for quick data transfers, but it's just not good enough to run your application on.

    For a stable POSIX filesystem in production, latency is key. Oftentimes in a datacenter, 10GbE is recommended for network storage solutions, not because of bandwidth (which is also important) but because of the roughly 10x lower latency of a 10GbE NIC. Most applications simply expect response times of microseconds, or a few milliseconds at most, from a POSIX filesystem; they cannot run on something much slower without modifying the codebase.

    But if you had to rewrite your application anyway, you might as well use plain S3 without the FS layer; that is easier to operate in the long run.
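
    To put rough numbers on the latency point: at 100 µs per metadata operation, walking a million files takes under two minutes; at 5 ms each it takes well over an hour. Something like this quick Go sketch will show the per-entry cost on any given mount (the directory path is just a placeholder):

      package main

      import (
          "fmt"
          "os"
          "path/filepath"
          "time"
      )

      func main() {
          // Placeholder path; point it at the mount under test.
          root := "/mnt/somefs/testdir"
          start := time.Now()
          count := 0
          // Every directory entry costs at least one metadata round trip.
          filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
              if err == nil {
                  count++
              }
              return nil
          })
          if count > 0 {
              elapsed := time.Since(start)
              fmt.Printf("%d entries in %v (%v per entry)\n",
                  count, elapsed, elapsed/time.Duration(count))
          }
      }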

  • this is really cool! using a fast redis for metadata means that suddenly s3 style blob stores become feasible as real networked filesystems.

    nice!
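
    roughly the shape i have in mind, as a toy illustration (not the project's actual format, just the general split): every metadata question goes to a fast key/value store, and the blob store only ever serves dumb immutable chunks.

      package main

      import "fmt"

      // metaStore stands in for Redis: path -> ordered list of chunk keys.
      type metaStore map[string][]string

      // blobStore stands in for S3: chunk key -> chunk bytes.
      type blobStore map[string][]byte

      // readFile resolves a path with one metadata lookup, then
      // concatenates the chunks fetched from the blob store.
      func readFile(meta metaStore, blobs blobStore, path string) ([]byte, error) {
          chunks, ok := meta[path]
          if !ok {
              return nil, fmt.Errorf("no such file: %s", path)
          }
          var data []byte
          for _, key := range chunks {
              data = append(data, blobs[key]...)
          }
          return data, nil
      }

      func main() {
          meta := metaStore{"/docs/hello.txt": {"chunk-0001", "chunk-0002"}}
          blobs := blobStore{"chunk-0001": []byte("hello, "), "chunk-0002": []byte("world")}
          data, err := readFile(meta, blobs, "/docs/hello.txt")
          fmt.Println(string(data), err)
      }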

  • Is metadata also replicated in S3?

    Otherwise I don't understand how the metadata can persist across a reboot, since AFAIK Redis cannot dump and reload its state.

  • I appreciate the effort to make JuiceFS POSIX-compliant first and then compatible with things like Kubernetes down the road.

  • What are the costs / tradeoffs of this (vs the normal application-layer object storage paradigm)?

  • How do you deal with failures? What happens if the Redis availability zone disappears, for example? Am I manually responsible for recovery and backups in that case, or do you use Redis as a cache that can be recovered from S3?

  • I've been using s3ql for a while, but this looks great. After testing it for a few hours, the performance seems solid. I'm glad it has a data cache feature like s3ql.

  • How does this compare with SeaweedFS? That also has FUSE support, can store metadata and small data in Redis, and provides its own S3 API, so there's one less moving part.

  • Does it pass the generic xfstests?

  • This is pretty cool. How does the storage format work? Would I be able to download storage chunks from s3 and somehow recover my files from chunks?

  • This is great. I'm considering using this kind of tech to build my own personal cloud-based NAS.

  • Wouldn't an NFS solution have this kind of caching and durability built in? Without running actual “Jepsen tests” (almost a generic term at this point), how would this improve my life versus buying a vendor NFS solution or rolling my own?

  • Interesting. I'd guess list operations are significantly faster with metadata in a cache, even a simple hash map like Redis? Listing can be a real slowdown in S3 with large numbers of files.

  • FS metadata stored in Redis? What will happen if Redis is down?

  • How much does it cost to run on S3? For example, how many GetObject and other non-free API calls does it make? And does it store data using S3 Intelligent-Tiering?

  • I've still got PTSD from NFS, so I would want to see some really abusive testing before believing this is more reliable than a vendor NFS SAN.

  • So... a file system on top of a virtualized file system hosted on someone else's computer across the land, with "Outstanding Performance: The latency can be as low as a few milliseconds" ... Millisecond disk access is outstanding?

    I mean, amazing. And also, maybe, you know... use a file system.

  • Before you use it in your project, make sure to have a look at the code: the codebase is pretty much comment-free, with very few tests. Other than the marketing term "POSIX file system", there is no proof of that claim.

    I am also not sure how it is a "distributed" file system, given that its storage is handled entirely by S3. Should I call my backup program that backs up my data to S3 every night a "distributed system"? When running on top of Redis, the docs explicitly mention that Redis Cluster is not supported. I haven't used Redis in many years; did I miss something here? A "distributed" file system built on top of a single instance of Redis doesn't sound very "distributed" to me.

  • Would there be any way to mount JuiceFS from AWS Lambda or Fargate?

  • Can I install redis on it?

  • It's AGPL :(

  • probably one of the few cases where agpl is not that bad.