To meet rising expectations of scalability, robustness, and flexibility, large-scale distributed systems increasingly adopt sophisticated distributed architectures that require enforcing complex distributed structural invariants. Unfortunately, maintaining these structural invariants at scale is particularly time-consuming and error-prone, as developers must take into account asynchronous failures, loosely coordinated sub-systems, and network delays. To address this problem, we propose PLEIADES, a new platform to construct and enforce large-scale distributed structural invariants under aggressive conditions. PLEIADES combines the resilience of self-organizing overlays with the expressiveness of an assembly-based design strategy. The result is a highly survivable framework that is able to dynamically maintain arbitrarily complex distributed structures under aggressive crash failures. In particular, our evaluation shows that PLEIADES is able to restore the overall structure of a 25,600-node system in 11 asynchronous rounds after half of the nodes have crashed.
This talk is an extended version of the results discussed in the DSN 2018 paper of the same name.