Time, Clocks, and the Ordering of Events in a Distributed System by Leslie Lamport is one of the best papers I've ever read. Not only has it clearly stood the test of time, but it sets the stage for deeper thinking on many of the issues endemic to distributed systems.
My former manager recommended it to me when I first started working in distributed systems and I found that it unlocked a huge variety of topics despite its simplicity. (Thanks Steve!)
https://www.microsoft.com/en-us/research/publication/time-cl...
I thoroughly enjoyed "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" from Martin Kleppmann.
It's a great book that covers pretty much all of the commonly used strategies for scaling data-intensive applications. It doesn't go incredibly deep on any of them, but it gives you a great overview of the entire space. For each component, there are usually references to places where you can read and study more.
Joe Armstrong's (co-creator of Erlang) Dissertation is incredible.
Making reliable distributed systems in the presence of software errors
I just followed the course on Distributed Systems by Maarten van Steen at the University of Twente. It was brilliant, and he has written a book on Distributed Systems. It is freely available via https://www.distributed-systems.net/index.php/books/ds3/
I am surprised no one has recommended Leslie Lamport's writings [0].
His paper, "Time, Clocks, and the Ordering of Events in a Distributed System" is still considered a serious read after 40 years.
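For anyone who wants a feel for the core idea before reading the paper, here is a minimal sketch of a logical clock in Python. The class and method names (`LamportClock`, `tick`, `receive`) are my own for illustration and are not taken from the paper; it only shows the counter rule, not the total-ordering tie-break by process id.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before order."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, message_time):
        # Message receipt: move past both the local time and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time


# Hypothetical two-process exchange.
a, b = LamportClock(), LamportClock()
sent_at = a.tick()                # process a sends a message stamped with its clock
received_at = b.receive(sent_at)  # process b receives it and advances its own clock
```

The receive timestamp always ends up larger than the send timestamp, which is the property the rest of the paper builds on.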
I can highly recommend Lindsey Kuper's 2020 introductory lectures on distsys: 26h of quality content!
https://www.youtube.com/playlist?list=PLNPUF5QyWU8O0Wd8QDh9K...
Dijkstra's "Self-Stabilizing Systems in Spite of Distributed Control" is very "Dijkstra":
- Concise
- Approachable
- Entertaining
- Insightful
- Timeless
http://homepage.divms.uiowa.edu/~ghosh/ssDijkstra.pdf
Enjoy :)
A Distributed Systems Reading List - https://dancres.github.io/Pages/
A small selection of papers that I find useful (also check the Wikipedia articles for a quick overview):
Communicating Sequential Processes "CSP" by Tony Hoare[0] has a strong influence on Go and Clojure. He also published/contributed to other interesting and influential books and papers.
Making reliable distributed systems in the presence of software errors by Joe Armstrong[1] (Erlang, BEAM). An implementation of the actor model and functional programming to optimize for reliability.
Conflict-free Replicated Data Types by Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski, "CRDTs" [2]. They enable strong eventual consistency, which is typically useful (and implemented) for databases, p2p (chat) applications and other distributed systems.
[0] https://www.cs.cmu.edu/~crary/819-f09/Hoare78.pdf
[1] https://www.cs.otago.ac.nz/coursework/cosc461/armstrong_thes...
[2] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf
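To make "strong eventual consistency" a bit more concrete, here is a minimal sketch of a grow-only counter, one of the simplest state-based CRDTs. The names (`GCounter`, `replica_id`, `merge`) are mine for illustration and this is not code from the paper; it assumes each replica has a unique id and only ever increments its own slot.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        # A replica only ever increases its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        # The counter's value is the sum over all replicas' slots.
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two hypothetical replicas update independently, then exchange state.
left, right = GCounter("left"), GCounter("right")
left.increment(2)
right.increment(3)
left.merge(right)
right.merge(left)
```

After the exchange both replicas report the same value, and merging the same state again changes nothing.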
Books:
Introduction to Reliable and Secure Distributed Programming (https://www.amazon.de/-/en/Christian-Cachin/dp/3642152597).
I took a class with Luis Rodrigues (one of the authors), the book introduces the fundamentals of distributed systems. For example, you would build leader election from first principles.
I've always been a fan of Distributed Systems for Fun and Profit: http://book.mixu.net/distsys/single-page.html
This was the first article that really made it all click for me
Distributed Systems 3rd edition (2017), you can get it free at: https://www.distributed-systems.net/index.php/books/ds3/
This thread has piqued my interest. I am curious if the people interested in distributed systems are closer to hobbyists or professionals using the ideas directly in their work.
Distributed systems fans of HN, why are you reading about distributed systems?
For the "ideas that stood the test of time" I'm keeping this list https://nvartolomei.com/dist-sys-classics/
Joe Armstrong, Erlang, software for a concurrent world. It passes the first criterion: time tested. Not so much the second of having been previously impossible, because Erlang could and does run on yesterday's hardware.
Erlang isn't theoretical. It's practical engineering. It works because message passing is what distributed systems have to do and at scale portions of a distributed system will become unavailable.
There are very specific problems that require more detailed engineering like Lamport Clocks and Raft Consensus Protocol. But not the general case. The general case is "being good enough" as is the nature of engineering.
I'm surprised the Raft paper has not been mentioned and I'm happy to share it.
In Search of an Understandable Consensus Algorithm
https://raft.github.io/raft.pdf
I think Raft has stood the test of time so far. A very popular implementation of Raft is etcd, which is used as Kubernetes' backing store for all cluster data.[0]
[0] https://kubernetes.io/docs/concepts/overview/components/#etc...
Lamport's work linked earlier is great. I'd also point to Maurice Herlihy's (http://cs.brown.edu/~mph/) and Michel Raynal's (https://team.inria.fr/wide/team/michel-raynal/) work. They have both published well-written books on the topic, and their bibliographies link to many relevant references from others.
Go back to the beginning!
Paul Baran's research on Distributed Communications that led to the Internet:
"Paul Baran and the Origins of the Internet" - https://www.rand.org/about/history/baran.html
"On Distributed Communications" - 1964 https://www.rand.org/pubs/research_memoranda/RM3767.html
https://lamport.azurewebsites.net/tla/book.html
imho the most useful book you can read.
One of my favorites:
Not a book or a paper, but MIT has published their recent Distributed Systems lectures[0] on YouTube. I haven't finished them yet, but they touch on papers that have stood the test of time.
[0] https://www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39...
"Understanding Distributed Systems" tries to bring together theoretical aspects on the topic (like consensus and consistency models) with practical ones, such as resiliency mechanisms, asynchronous messaging, and observability.
I've learned distributed systems from Andrew S. Tanenbaum's books [1]:
- Distributed Operating Systems
- Distributed Systems: Principles and Paradigms
Shameless plug for a list I compiled that is specific to Paxos:
https://vadosware.io/post/paxosmon-gotta-concensus-them-all/
I have not checked the links therein for quite some time, but here’s a list of lists on the subject.
In addition to some of the other recommendations here, I really enjoyed "Building on Quicksand" by Pat Helland and Dave Campbell. A lot of eventual consistency made more sense to me after I read it.
Jeff Dean's The Tail at Scale is fantastic: https://research.google/pubs/pub40801/
In addition, learning how various internet protocols (esp. routing) are designed and how they evolved historically would be a good idea. Interconnections by Radia Perlman would be a start.
Not papers or books, but I'd recommend you look at the blog articles and such that Basho published about their design of Riak. In particular, they wrote a great article explaining vector clocks as opposed to other synchronizing mechanisms, and they wrote several good articles on CRDTs.
https://riak.com/category/technical/ - Riak blog
https://www.allthingsdistributed.com/files/amazon-dynamo-sos... - Dynamo paper that Riak was in part based on
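As a rough illustration of what vector clocks buy you over a single timestamp, here is a minimal sketch in Python. It is not Riak's or Dynamo's implementation; the function names (`vc_merge`, `vc_compare`) and the dict-of-counters representation are assumptions of mine.

```python
def vc_merge(a, b):
    """Combine two vector clocks by taking the per-node maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def vc_compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # a happened before b
    if b_le_a:
        return "after"    # b happened before a
    return "concurrent"   # neither dominates: a conflict the application must resolve


# Hypothetical versions of one key written through nodes "x" and "y".
v1 = {"x": 1}
v2 = {"x": 2}            # a later write through x
v3 = {"x": 1, "y": 1}    # a write through y that only saw v1
```

Here `vc_compare(v2, v3)` reports a concurrent pair, which a single wall-clock timestamp would silently order one way or the other.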
A previous thread on this topic from 2017:
I recently learned about Distributed Rate limiting. I do not have a definitive source for it but it's something you can explore.
The best and most accessible book on theory is probably Introduction to Reliable and Secure Distributed Programming by Cachin, Guerraoui, and Rodrigues.
Release It! by Michael Nygard is a good book on resilience engineering for distributed systems.
Surprised no one has mentioned the Google Spanner paper:
https://static.googleusercontent.com/media/research.google.c...
I would be interested in a principles-driven approach with simple implementations in, say, Python, rather than "Kubernetes this, containers that".
I have many recommendations of different kinds:
## Blogs:
- http://muratbuffalo.blogspot.com/
- https://bartoszsypytkowski.com/
- https://decentralizedthoughts.github.io/
- https://www.the-paper-trail.org/
- https://blog.acolyer.org/
- https://pathelland.substack.com/
## Other web resources
- https://aws.amazon.com/builders-library/ - set of resources from Amazon about building distributed systems
- https://www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_H... - lecture series from Cambridge
## Books
- https://www.cl.cam.ac.uk/teaching/1213/PrincComm/mfcn.pdf - A great book on the maths of networking (probability, queuing theory etc...)