logo资料库

Fault-Tolerant Message-Passing Distributed Systems. An Algorithm....pdf

第1页 / 共465页
第2页 / 共465页
第3页 / 共465页
第4页 / 共465页
第5页 / 共465页
第6页 / 共465页
第7页 / 共465页
第8页 / 共465页
资料共465页,剩余部分请下载后查看
Preface
Contents
Notation
Figures & Algorithms
Tables
1 Definitions & Introductory Examples
1.1 A Few Definitions Related to Distributed Computing
1.2 Example 1: Common Decision Despite Message Losses
1.2.1 The Problem
1.2.2 Trying to Solve the Problem: Attempt 1
1.2.3 Trying to Solve the Problem: Attempt 2
1.2.4 An Impossibility Result
1.2.5 A Coordination Problem
1.3 Example 2: Computing a Global Function Despite a Message Adversary
1.3.1 The Problem
1.3.2 The Notion of a Message Adversary
1.3.3 The TREE-AD Message Adversary
1.3.4 From Message Adversary to Process Mobility
1.4 Main Distributed Computing Models Used in This Book
1.5 Distributed Computing Versus Parallel Computing
1.6 Summary
1.7 Bibliographic Notes
1.8 Exercises and Problems
--- Reliable Broadcast Communication Abstraction
2 In Presence of Process Crash Failures
2.1 Uniform Reliable Broadcast
2.1.1 From Best Effort to Guaranteed Reliability
2.1.2 Uniform Reliable Broadcast (URB-broadcast)
2.1.3 Building the URB-broadcast Abstraction in CAMPn,t[∅]
2.2 Adding Quality of Service
2.2.1 “First In, First Out” (FIFO) Message Delivery
2.2.2 “Causal Order” (CO) Message Delivery
2.2.3 From FIFO-broadcast to CO-broadcast
2.2.4 From URB-broadcast to CO-broadcast: Capturing Causal Past in a Vector
2.2.5 The Total Order Broadcast Abstraction Requires More
2.3 Summary
2.4 Bibliographic Notes
2.5 Exercises and Problems
3 In Presence of Process Crashes & Unreliable Channels
3.1 A System Model with Unreliable Channels
3.1.1 Fairness Notions for Channels
3.1.2 Fair Channel (FC) and Fair Lossy Channel
3.1.3 Reliable Channel in the Presence of Process Crashes
3.1.4 System Model
3.2 URB-broadcast in CAMPn,t[- FC]
3.2.1 URB-broadcast in CAMPn,t[- FC, t < n/2]
3.2.2 An Impossibility Result
3.3 Failure Detectors: an Approach to Circumvent Impossibilities
3.3.1 The Concept of a Failure Detector
3.3.2 Formal Definitions
3.4 URB-broadcast in CAMPn,t[- FC] Enriched with a Failure Detector
3.4.1 Definition of the Failure Detector Class Θ
3.4.2 Solving URB-broadcast in CAMPn,t[- FC, Θ]
3.4.3 Building a Failure Detector Θ in CAMPn,t[- FC, t < n/2]
3.4.4 The Fundamental Added Value Supplied by a Failure Detector
3.5 Quiescent Uniform Reliable Broadcast
3.5.1 The Quiescence Property
3.5.2 Quiescent URB-broadcast Based on a Perfect Failure Detector
3.5.3 The Class HB of Heartbeat Failure Detectors
3.5.4 Quiescent URB-broadcast in CAMPn,t[- FC, Θ,HB]
3.6 Summary
3.7 Bibliographic Notes
3.8 Exercises and Problems
4 In Presence of Byzantine Processes
4.1 Byzantine Processes and Properties of the Model BAMPn,t[t < n/3]
4.2 The No-Duplicity Broadcast Abstraction
4.2.1 Definition
4.2.2 An Impossibility Result
4.2.3 A No-Duplicity Broadcast Algorithm
4.3 The Byzantine Reliable Broadcast Abstraction
4.4 An Optimal Byzantine Reliable Broadcast Algorithm
4.4.1 A Byzantine Reliable Broadcast Algorithm for BAMPn,t[t < n/3]
4.4.2 Correctness Proof
4.4.3 Benefiting from Message Asynchrony
4.5 Time and Message-Efficient Byzantine Reliable Broadcast
4.5.1 A Message-Efficient Byzantine Reliable Broadcast Algorithm
4.5.2 Correctness Proof
4.6 Summary
4.7 Bibliographic Notes
4.8 Exercises and Problems
--- R/W Register Communication Abstraction
5 R/W Register Abstraction
5.1 The Read/Write Register Abstraction
5.1.1 Concurrent Objects and Registers
5.1.2 The Notion of a Regular Register
5.1.3 Registers Defined from a Sequential Specification
5.2 A Formal Approach to Atomicity and Sequential Consistency
5.2.1 Processes, Operations, and Events
5.2.2 Histories
5.2.3 A Formal Definition of Atomicity
5.2.4 A Formal Definition of Sequential Consistency
5.3 Composability of Consistency Conditions
5.3.1 What Is Composability?
5.3.2 Atomicity Is Composable
5.3.3 Sequential Consistency Is Not Composable
5.4 Bounds on the Implementation of Strong Consistency Conditions
5.4.1 Upper Bound on t for Atomicity
5.4.2 Upper Bound on t for Sequential Consistency
5.4.3 Lower Bounds on the Durations of Read andWrite Operations
5.5 Summary
5.6 Bibliographic Notes
5.7 Exercises and Problems
6 R/W Registers despite Asynchrony & t < n/2
6.1 A Structural View
6.2 Building an SWMR Regular Read/Write Register in CAMPn,t[t < n/2]
6.2.1 Problem Specification
6.2.2 Implementing an SWMR Regular Register in CAMPn,t[t < n/2]
6.2.3 Proof of the SWMR Regular Register Construction
6.3 From an SWMR Regular Register to an SWMR Atomic Register
6.3.1 Why the Previous Algorithm Does Not Ensure Atomicity
6.3.2 From Regularity to Atomicity
6.4 From SWMR Atomic Register to MWMR Atomic Register
6.4.1 Replacing Sequence Numbers by Timestamps
6.4.2 Construction of an MWMR Atomic Register
6.4.3 Proof of the MWMR Atomic Register Construction
6.5 Implementing Sequentially Consistent Registers
6.5.1 How to Address the Non-composability of Sequential Consistency
6.5.2 Algorithms Based on a Total Order Broadcast Abstraction
6.5.3 A TO-broadcast-based Algorithm with Local (Fast) Read Operations
6.5.4 A TO-broadcast-based Algorithm with Local (Fast) Write Operations
6.5.5 An Algorithm Based on Logical Time
6.5.6 Proof of the Logical Time-based Algorithm
6.6 Summary
6.7 Bibliographic Notes
6.8 Exercises and Problems
7 Circumventing t < n/2 R/W Register Impossibility - Failure Detector Approach
7.1 The Class Σ of Quorum Failure Detectors
7.1.1 Definition of the Class of Quorum Failure Detectors
7.1.2 Implementing a Failure Detector Σ When t < n/2
7.1.3 A Σ-based Construction of an SWSR Atomic Register
7.2 Σ Is theWeakest Failure Detector to Build an Atomic Register
7.2.1 What Does “Weakest Failure Detector Class” Mean
7.2.2 The Extraction Algorithm
7.2.3 Correctness of the Extraction Algorithm
7.3 Comparing the Failure Detectors Classes Θ and Σ
7.4 Atomic Register Abstraction vs URB-broadcast Abstraction
7.4.1 From Atomic Registers to URB-broadcast
7.4.2 Atomic Registers Are Strictly Stronger than URB-broadcast
7.5 Summary
7.6 Bibliographic Notes
7.7 Exercise and Problem
8 Broadcast Abstraction suited to Family of R/W Implementable Objects
8.1 The SCD-broadcast Communication Abstraction
8.1.1 Definition
8.1.2 Implementing SCD-broadcast in CAMPn,t[t < n/2]
8.1.3 Cost and Proof of the Algorithm
8.1.4 An SCD-broadcast-based Communication Pattern
8.2 From SCD-broadcast to an MWMR Register
8.2.1 Building an MWMR Atomic Register in CAMPn,t[SCD-broadcast]
8.2.2 Cost and Proof of Correctness
8.2.3 From Atomicity to Sequential Consistency
8.2.4 From MWMR Registers to an Atomic Snapshot Object
8.3 From SCD-broadcast to an Atomic Counter
8.3.1 Counter Object
8.3.2 Implementation of an Atomic Counter Object
8.3.3 Implementation of a Sequentially Consistent Counter Object
8.4 From SCD-broadcast to Lattice Agreement
8.4.1 The Lattice Agreement Task
8.4.2 Lattice Agreement from SCD-broadcast
8.5 From SWMR Atomic Registers to SCD-broadcast
8.5.1 From Snapshot to SCD-broadcast
8.5.2 Proof of the Algorithm
8.6 Summary
8.7 Bibliographic Notes
8.8 Exercises and Problems
9 Atomic R/W Registers in Presence of Byzantine Processes
9.1 Atomic Read/Write Registers in the Presence of Byzantine Processes
9.1.1 Why SWMR (and Not MWMR) Atomic Registers?
9.1.2 Reminder on Possible Behaviors of a Byzantine Process
9.1.3 SWMR Atomic Registers Despite Byzantine Processes: Definition
9.2 An Impossibility Result
9.3 Reminder on Byzantine Reliable Broadcast
9.3.1 Specification of Multi-shot Reliable Broadcast
9.3.2 An Algorithm for Multi-shot Byzantine Reliable Broadcast
9.4 Construction of SWMR Atomic Registers in BAMPn,t[t < n/3]
9.4.1 Description of the Algorithm
9.4.2 Comparison with the Crash Failure Model
9.5 Proof of the Algorithm
9.5.1 Preliminary Lemmas
9.5.2 Proof of the Termination Properties
9.5.3 Proof of the Consistency (Atomicity) Properties
9.5.4 Piecing Together the Lemmas
9.6 Building Objects on Top of SWMR Byzantine Registers
9.6.1 One-shot Write-snapshot Object
9.6.2 Correct-only Agreement Object
9.7 Summary
9.8 Bibliographic Notes
9.9 Exercises and Problems
--- Agreement in Sync Systems
10 Consensus & Interactive Consistency in Sync Systems prone to Process Crash Failures
10.1 Consensus in the Crash Failure Model
10.1.1 Definition
10.1.2 A Simple (Unfair) Consensus Algorithm
10.1.3 A Simple (Fair) Consensus Algorithm
10.2 Interactive Consistency (Vector Consensus)
10.2.1 Definition
10.2.2 A Simple Example of Use: Build Atomic Rounds
10.2.3 An Interactive Consistency Algorithm
10.2.4 Proof of the Algorithm
10.2.5 A Convergence Point of View
10.3 Lower Bound on the Number of Rounds
10.3.1 Preliminary Assumptions and Definitions
10.3.2 The (t + 1) Lower Bound
10.3.3 Proof of the Lemmas
10.4 Summary
10.5 Bibliographic Notes
10.6 Exercises and Problems
11 Expediting Decision in Sync Systems prone to Process Crash Failures
11.1 Early Deciding and Stopping Interactive Consistency
11.1.1 Early Deciding vs Early Stopping
11.1.2 An Early Decision Predicate
11.1.3 An Early Deciding and Stopping Algorithm
11.1.4 Correctness Proof
11.1.5 On Early Decision Predicates
11.1.6 Early Deciding and Stopping Consensus
11.2 An Unbeatable Binary Consensus Algorithm
11.2.1 A Knowledge-Based Unbeatable Predicate
11.2.2 PREF0() with Respect to DIFF()
11.2.3 An Algorithm Based on the Predicate PREF0(): CGM
11.2.4 On the Unbeatability of the Predicate PREF0()
11.3 The Synchronous Condition-based Approach
11.3.1 The Condition-based Approach in Synchronous Systems
11.3.2 Legality and Maximality of a Condition
11.3.3 Hierarchy of Legal Conditions
11.3.4 Local View of an Input Vector
11.3.5 A Synchronous Condition-based Consensus Algorithm
11.3.6 Proof of the Algorithm
11.4 Using a Global Clock and a Fast Failure Detector
11.4.1 Fast Perfect Failure Detectors
11.4.2 Enriching the Synchronous Model to Benefit from a Fast Failure Detector
11.4.3 A Simple Consensus Algorithm Based on a Fast Failure Detector
11.4.4 An Early Deciding and Stopping Algorithm
11.5 Summary
11.6 Bibliographic Notes
11.7 Exercises and Problems
12 Consensus Variants - Simultaneous Consensus & k-Set Agreement
12.1 Simultaneous Consensus: Definition and Its Difficulty
12.1.1 Definition of Simultaneous Consensus
12.1.2 Difficulty Early Deciding Before (t + 1) Rounds
12.1.3 Failure Pattern, Failure Discovery, and Waste
12.1.4 A Clean Round and the Horizon of a Round
12.2 An Optimal Simultaneous Consensus Algorithm
12.2.1 An Optimal Algorithm
12.2.2 Proof of the Algorithm
12.3 The k-Set Agreement Abstraction
12.3.1 Definition
12.3.2 A Simple Algorithm
12.4 Early Deciding and Stopping k-Set Agreement
12.4.1 An Early Deciding and Stopping Algorithm
12.4.2 Proof of the Algorithm
12.5 Summary
12.6 Bibliographic Notes
12.7 Exercises and Problems
13 Non-Blocking Atomic Commitment in Presence of Process Crash Failures
13.1 The Non-blocking Atomic Commitment (NBAC) Abstraction
13.1.1 Definition of Non-blocking Atomic Commitment
13.1.2 A Simple Non-blocking Atomic Commitment Algorithm
13.2 Fast Commit and Fast Abort
13.2.1 Looking for Efficient Algorithms
13.2.2 An Impossibility Result
13.3 Weak Fast Commit and Weak Fast Abort
13.4 Fast Commit and Weak Fast Abort Are Compatible
13.4.1 A Fast Commit and Weak Fast Abort Algorithm
13.4.2 Proof of the Algorithm
13.5 Other Non-blocking Atomic Commitment Algorithms
13.5.1 Fast Abort andWeak Fast Commit
13.5.2 The Case t ≤ 2 (System Model CSMPn,t[1 ≤ t < 3 ≤ n])
13.6 Summary
13.7 Bibliographic Notes
13.8 Exercises and Problems
14 Consensus in Sync Systems prone to Byzantine Process Failures
14.1 Agreement Despite Byzantine Processes
14.1.1 On the Agreement and Validity Properties
14.1.2 A Consensus Definition for the Byzantine Failure Model
14.1.3 An Interactive Consistency Definition for the Byzantine Failure Model
14.1.4 The Byzantine General Agreement Abstraction
14.2 Interactive Consistency for Four Processes Despite One Byzantine Process
14.2.1 An Algorithm for n = 4 and t = 1
14.2.2 Proof of the Algorithm
14.3 An Upper Bound on the Number of Byzantine Processes
14.4 A Byzantine Consensus Algorithm for BSMPn,t[t < n/3]
14.4.1 Base Data Structure: a Tree
14.4.2 EIG Algorithm
14.4.3 Example of an Execution
14.4.4 Proof of the EIG Algorithm
14.5 A Simple Consensus Algorithm with Constant Message Size
14.5.1 Features of the Algorithm
14.5.2 Presentation of the Algorithm
14.5.3 Proof and Properties of the Algorithm
14.6 From Binary to Multivalued Byzantine Consensus
14.6.1 Motivation
14.6.2 A Reduction Algorithm
14.6.3 Proof of the Multivalued to Binary Reduction
14.6.4 An Interesting Property of the Construction
14.7 Enriching the Synchronous Model with Message Authentication
14.7.1 Synchronous Model with Signed Messages
14.7.2 The Gain Obtained from Signatures
14.7.3 A Synchronous Signature-Based Consensus Algorithm
14.7.4 Proof of the Algorithm
14.8 Summary
14.9 Bibliographic Notes
14.10 Exercises and Problems
--- Agreement in Async Systems
15 Implementable Agreement Abstractions despite Asynchrony & Minority of Process Crashes
15.1 The Renaming Agreement Abstraction
15.1.1 Definition
15.1.2 A Fundamental Result
15.1.3 The Stacking Approach
15.1.4 A Snapshot-based Implementation of Renaming
15.1.5 Proof of the Algorithm
15.2 The Approximate Agreement Abstraction
15.2.1 Definition
15.2.2 A Read/Write-based Implementation of Approximate Agreement
15.2.3 Proof of the Algorithm
15.3 The Safe Agreement Abstraction
15.3.1 Definition
15.3.2 A Direct Implementation of Safe Agreement in CAMPn,t[t < n/2]
15.3.3 Proof of the Algorithm
15.4 Summary
15.5 Bibliographic Notes
15.6 Exercises and Problems
16 Consensus - Power & Implementability Limit in Crash-prone Async Systems
16.1 The Total Order Broadcast Communication Abstraction
16.1.1 Total Order Broadcast: Definition
16.1.2 A Map of Communication Abstractions
16.2 From Consensus to TO-broadcast
16.2.1 Structure of the Construction
16.2.2 Description of the Algorithm
16.2.3 Proof of the Algorithm
16.3 Consensus and TO-broadcast Are Equivalent
16.4 The State Machine Approach
16.4.1 State Machine Replication
16.4.2 Sequentially-Defined Abstractions (Objects)
16.5 A Simple Consensus-based Universal Construction
16.6 Agreement vs Mutual Exclusion
16.7 Ledger Object
16.7.1 Definition
16.7.2 Implementation of a Ledger in CAMPn,t[TO-broadcast]
16.8 Consensus Impossibility in the Presence of Crashes and Asynchrony
16.8.1 The Intuition That Underlies the Impossibility
16.8.2 Refining the Definition of CAMPn,t[∅]
16.8.3 Notion of Valence of a Global State
16.8.4 Consensus Is Impossible in CAMPn,1[∅]
16.9 The Frontier Between Read/Write Registers and Consensus
16.9.1 The Main Question
16.9.2 The Notion of Consensus Number in Read/Write Systems
16.9.3 An Illustration of Herlihy’s Hierarchy
16.9.4 The Consensus Number of a Ledger
16.10 Summary
16.11 Bibliographic Notes
16.12 Exercises and Problems
17 Implementing Consensus in Enriched Crash-prone Async Systems
17.1 Enriching an Asynchronous System to Implement Consensus
17.2 A Message Scheduling Assumption
17.2.1 Message Scheduling (MS) Assumption
17.2.2 A Binary Consensus Algorithm
17.2.3 Proof of the Algorithm
17.2.4 Additional Properties
17.3 Enriching CAMPn,t[Ø] with a Perpetual Failure Detector
17.3.1 Enriching CAMPn,t[∅] with a Perfect Failure Detector
17.4 Enriching CAMPn,t[t < n/2] with an Eventual Leader
17.4.1 TheWeakest Failure Detector to Implement Consensus
17.4.2 Implementing Consensus in CAMPn,t[t < n/2, Ω]
17.4.3 Proof of the Algorithm
17.4.4 Consensus Versus Eventual Leader Failure Detector
17.4.5 Notions of Indulgence and Zero-degradation
17.4.6 Saving Broadcast Instances
17.5 Enriching CAMPn,t[t < n/2] with Randomization
17.5.1 Asynchronous Randomized Models
17.5.2 Randomized Consensus
17.5.3 Randomized Binary Consensus in CAMPn,t[t < n/2, LC]
17.5.4 Randomized Binary Consensus in CAMPn,t[t < n/2, CC]
17.6 Enriching CAMPn,t[t < n/2] with a Hybrid Approach
17.6.1 The Hybrid Approach: Failure Detector and Randomization
17.6.2 A Hybrid Binary Consensus Algorithm
17.7 A Paxos-inspired Consensus Algorithm
17.7.1 The Alpha Communication Abstraction
17.7.2 Consensus Algorithm
17.7.3 An Implementation of Alpha in CAMPn,t[t < n/2]
17.8 From Binary to Multivalued Consensus
17.8.1 A Reduction Algorithm
17.8.2 Proof of the Reduction Algorithm
17.9 Consensus in One Communication Step
17.9.1 Aim and Model Assumption on t
17.9.2 A One Communication Step Algorithm
17.9.3 Proof of the Early Deciding Algorithm
17.10 Summary
17.11 Bibliographic Notes
17.12 Exercises and Problems
18 Implementing Oracles in Async Systems prone to Process Crash Failures
18.1 The Two Facets of Failure Detectors
18.1.1 The Programming Point of View: Modular Building Block
18.1.2 The Computability Point of View: Abstraction Ranking
18.2 Ω in CAMPn,t[∅]: a Direct Impossibility Proof
18.3 Constructing a Perfect Failure Detector (Class P)
18.3.1 Reminder: Definition of the Class P of Perfect Failure Detectors
18.3.2 Use of an Underlying Synchronous System
18.3.3 Applications Generating a Fair Communication Pattern
18.3.4 The Theta Assumption
18.4 Constructing an Eventually Perfect Failure Detector (Class P)
18.4.1 Reminder: Definition of an Eventually Perfect Failure Detector
18.4.2 From Perpetual to Eventual Properties
18.4.3 Eventually Synchronous Systems
18.5 On the Efficient Monitoring of a Process by Another Process
18.5.1 Motivation and System Model
18.5.2 A Monitoring Algorithm
18.6 An Adaptive Monitoring-based Algorithm Building P
18.6.1 Motivation and Model
18.6.2 A Monitoring-Based Adaptive Algorithm for the Failure Detector Class P
18.6.3 Proof the Algorithm
18.7 From the t-Source Assumption to an Ω Eventual Leader
18.7.1 The t-Source Assumption and the Model CAMPn,t[t-SOURCE]
18.7.2 Electing an Eventual Leader in CAMPn,t[t-SOURCE]
18.7.3 Proof of the Algorithm
18.8 Electing an Eventual Leader in CAMPn,t[ t-MS PAT]
18.8.1 A Query/Response Pattern
18.8.2 Electing an Eventual Leader in CAMPn,t[t-MS PAT]
18.8.3 Proof of the Algorithm
18.9 Building Ω in a Hybrid Model
18.10 Construction of a Biased Common Coin from Local Coins
18.10.1 Definition of a Biased Common Coin
18.10.2 The CORE Communication Abstraction
18.10.3 Construction of a Common Coin with a Constant Bias
18.10.4 On the Use of a Biased Common Coin
18.11 Summary
18.12 Bibliographic notes
18.13 Exercises and Problems
19 Implementing Consensus in Enriched Byzantine Async Systems
19.1 Definition Reminder and Two Observations
19.1.1 Definition of Byzantine Consensus (Reminder)
19.1.2 Why Not to Use an Eventual Leader
19.1.3 On the Weakest Synchrony Assumption for Byzantine Consensus
19.2 Binary Byzantine Consensus from a Message Scheduling Assumption
19.2.1 A Message Scheduling Assumption
19.2.2 A Binary Byzantine Consensus Algorithm
19.2.3 Proof of the Algorithm
19.2.4 Additional Properties
19.3 An Optimal Randomized Binary Byzantine Consensus Algorithm
19.3.1 The Binary-Value Broadcast Abstraction
19.3.2 A Binary Randomized Consensus Algorithm
19.3.3 Proof of the BV-Based Binary Byzantine Consensus Algorithm
19.3.4 From Decision to Decision and Termination
19.4 From Binary to Multivalued Byzantine Consensus
19.4.1 A Reduction Algorithm
19.4.2 Proof of the Reduction Algorithm
19.5 From Binary to No-intrusion Multivalued Byzantine Consensus
19.5.1 The Validated Byzantine Broadcast Abstraction
19.5.2 An Algorithm Implementing VBB-broadcast
19.5.3 Proof of the VBB-broadcast Algorithm
19.5.4 A VBB-Based Multivalued to Binary Byzantine Consensus Reduction
19.5.5 Proof of the VBB-Based Reduction Algorithm
19.6 Summary
19.7 Appendix: Proof-of-Work (PoW) Seen as Eventual Byzantine Agreement
19.8 Bibliographic Notes
19.9 Exercises and Problems
20 Quorum, Signatures & Overlays
20.1 Quorum Systems
20.1.1 Definitions
20.1.2 Examples of Use of a Quorum System
20.1.3 A Few Classical Quorums
20.1.4 Quorum Composition
20.2 Digital Signatures
20.2.1 Cipher, Keys, and Signatures
20.2.2 How to Build a Secret Key: Diffie-Hellman’s Algorithm
20.2.3 How to Build a Public Key: Rivest-Shamir-Adleman’s (RSA) Algorithm
20.2.4 How to Share a Secret: Shamir’s Algorithm
20.3 Overlay Networks
20.3.1 On Regular Graphs
20.3.2 Hypercube
20.3.3 de Bruijn Graphs
20.3.4 Kautz Graphs
20.3.5 Undirected de Bruijn and Kautz Graphs
20.4 Bibliographic Notes
Afterword
Biblio
Index
Michel Raynal Fault-Tolerant Message-Passing Distributed Systems An Algorithmic Approach
Michel Raynal IRISA-ISTIC Université de Rennes 1 Institut Universitaire de France Rennes, France Parts of this work are based on the books “Fault-Tolerant Agreement in Synchronous Message- Passing Systems” and “Communication and Agreement Abstractions for Fault-Tolerant Asynchro- nous Distributed Systems”, author Michel Raynal, © 2010 Morgan & Claypool Publishers (www. morganclaypool.com). Used with permission. ISBN 978-3-319-94140-0 https://doi.org/10.1007/978-3-319-94141-7 Library of Congress Control Number: 2018953101 © Springer Nature Switzerland AG 2018 ISBN 978-3-319-94141-7 (eBook)
Preface La recherche du temps perdu passait par leWeb. [...] La m´emoire ´etait devenue in´epuisable, mais la profondeur du temps [...] avait disparu. On ´etait dans un pr´esent infini. In Les ann´ees (2008), Annie Ernaux (1940) Sed nos immensum spatiis confecimus aequor, Et iam tempus equum fumentia solvere colla.1 In Georgica, Liber II, 541-542, Publius Virgilius (70 BC–19 BC) Je suis arriv´e au jour o`u je ne me souviens plus quand j’ai cess´e d’ˆetre immortel. In Livro de Cr´onicas, Ant´onio Lobo Antunes (1942) C’est une chose ´etrange `a la fin que le monde Un jour je m’en irai sans en avoir tout dit. In Les yeux et la m´emoire (1954), chant II, Louis Aragon (1897–1982) Tout garder, c’est tout d´etruire. Jacques Derrida (1930–2004) 1French: Mais j’ai d´ej`a fourni une vaste carri`ere, il est temps de d´eteler les chevaux tout fumants. English: But now I have traveled a very long way, and the time has come to unyoke my steaming horses.
What is distributed computing? Distributed computing was born in the late 1970s when researchers and practitioners started taking into account the intrinsic characteristic of physically distributed sys- tems. The field then emerged as a specialized research area distinct from networking, operating sys- tems, and parallel computing. Distributed computing arises when one has to solve a problem in terms of distributed entities (usually called processors, nodes, processes, actors, agents, sensors, peers, etc.) such that each entity has only a partial knowledge of the many parameters involved in the problem that has to be solved. While parallel computing and real-time computing can be characterized, respectively, by the terms efficiency and on-time computing, distributed computing can be characterized by the term uncertainty. This uncertainty is created by asynchrony, multiplicity of control flows, absence of shared memory and global time, failure, dynamicity, mobility, etc. Mastering one form or another of uncertainty is pervasive in all distributed computing problems. A main difficulty in designing distributed algorithms comes from the fact that no entity cooperating in the achievement of a common goal can have an instantaneous knowledge of the current state of the other entities, it can only know their past local states. Although distributed algorithms are often made up of a few lines, their behavior can be difficult to understand and their properties hard to state and prove. Hence, distributed computing is not only a fundamental topic but also a challenging topic where simplicity, elegance, and beauty are first-class citizens. Why this book? In the book “Distributed algorithms for message-passing systems” (Springer, 2013), I addressed distributed computing in failure-free message-passing systems, where the computing enti- ties (processes) have to cooperate in the presence of asynchrony. Differently, in my book “Concurrent programming: algorithms, principles and foundations” (Springer, 2013), I addressed distributed com- puting where the computing entities (processes) communicate through a read/write shared memory (e.g., multicore), and the main adversary lies in the net effect of asynchrony and process crashes (unexpected definitive stops). The present book considers synchronous and asynchronous message-passing systems, where pro- cesses can commit crash failures, or Byzantine failures (arbitrary behavior). Its aim is to present in a comprehensive way basic notions, concepts and algorithms in the context of these systems. The main difficulty comes from the uncertainty created by the adversaries managing the environment (mainly asynchrony and failures), which, by its very nature, is not under the control of the system. A quick look at the content of the book The book is composed of four parts, the first two are on communication abstractions, the other two on agreement abstractions. Those are the most important abstractions distributed applications rely on in asynchronous and synchronous message-passing sys- tems where processes may crash, or commit Byzantine failures. The book addresses what can be done and what cannot be done in the presence of such adversaries. It consequently presents both impossi- bility results and distributed algorithms. All impossibility results are proved, and all algorithms are described in a simple algorithmic notation and proved correct. • Parts on communication abstractions. – Part I is on the reliable broadcast abstraction.
– Part II is on the construction of read/write registers. • Parts on agreement. – Part III is on agreement in synchronous systems. – Part IV is on agreement in asynchronous systems. On the presentation style When known, the names of the authors of a theorem, or of an algorithm, are indicated together with the date of the associated publication. Moreover, each chapter has a bib- liographical section, where a short historical perspective and references related to that chapter are given. Each chapter terminates with a few exercises and problems, whose solutions can be found in the article cited at the end of the corresponding exercise/problem. From a vocabulary point of view, the following terms are used: an object implements an abstrac- tion, defined by a set of properties, which allows a problem to be solved. Moreover, each algorithm is first presented intuitively with words, and then proved correct. Understanding an algorithm is a two-step process: • First have a good intuition of its underlying principles, and its possible behaviors. This is nec- • Then prove the algorithm is correct in the model it was designed for. The proof consists in a essary, but remains informal. logical reasoning, based on the properties provided by (i) the underlying model, and (ii) the statements (code) of the algorithm. More precisely, each property defining the abstraction the algorithm is assumed to implement must be satisfied in all its executions. Only when these two steps have been done, can we say that we understand the algorithm. Audience This book has been written primarily for people who are not familiar with the topic and the concepts that are presented. These include mainly: • Senior-level undergraduate students and graduate students in informatics or computing engineer- ing, who are interested in the principles and algorithmic foundations of fault-tolerant distributed computing. • Practitioners and engineers who want to be aware of the state-of-the-art concepts, basic princi- ples, mechanisms, and techniques encountered in fault-tolerant distributed computing. Prerequisites for this book include undergraduate courses on algorithms, basic knowledge on operat- ing systems, and notions on concurrency in failure-free distributed computing. One-semester courses, based on this book, are suggested in the section titled “How to Use This Book” in the Afterword. Origin of the book and acknowledgments This book has two complementary origins: • The first is a set of lectures for undergraduate and graduate courses on distributed computing I gave at the University of Rennes (France), the Hong Kong Polytechnic University, and, as an invited professor, at several universities all over the world. Hence, I want to thank the numerous students for their questions that, in one way or another, contributed to this book. • The second is the two monographs I wrote in 2010, on fault-tolerant distributed computing, titled “Communication and agreement abstractions for fault-tolerant asynchronous distributed
systems”, and “Fault-tolerant agreement in synchronous distributed systems”. Parts of them appear in this book, after having been revised, corrected, and improved. Hence, I want to thank Morgan & Claypool, and more particularly Diane Cerra, for their per- mission to reuse parts of this work. I also want to thank my colleagues (in no particular order) A. Most´efaoui, D. Imbs, S. Rajsbaum, V. Gramoli, C. Delporte, H. Fauconnier, F. Ta¨ıani, M. Perrin, A. Casta˜neda, M. Larrea, and Z. Bouzid, with whom I collaborated in the recent past years. I also thank the Polytechnic University of Hong Kong (PolyU), and more particularly Professor Jiannong Cao, for hosting me while I was writing parts of this book. My thanks also to Ronan Nugent (Springer) for his support and his help in putting it all together. Last but not least (and maybe most importantly), I thank all the researchers whose results are pre- sented in this book. Without their work, this book would not exist. (Finally, since I typeset the entire text myself – LATEX2 for the text and xfig for figures – any typesetting or technical errors that remain are my responsibility.) Professor Michel Raynal Academia Europaea Institut Universitaire de France Professor IRISA-ISTIC, Universit´e de Rennes 1, France Chair Professor, Hong Kong Polytechnic University June–December 2017 Rennes, Saint-Gr´egoire, Douelle, Saint-Philibert, Hong Kong, Vienna (DISC’17), Washington D.C. (PODC’17), Mexico City (UNAM)
Contents I Introductory Chapter 1 . . . . . . . . . . . . . 1 A Few Definitions and Two Introductory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 7 7 9 9 . . . . . . . . . . . . . . . . . . . . . 10 . . . . . . . . . . . . . . . . . . . . . 11 1.1 A Few Definitions Related to Distributed Computing . . . 1.2 Example 1: Common Decision Despite Message Losses . 1.2.1 The Problem . 1.2.2 Trying to Solve the Problem: Attempt 1 . . . . . . . . . . 1.2.3 Trying to Solve the Problem: Attempt 2 . . . . . . . . . . . . . . . . . 1.2.4 An Impossibility Result . 1.2.5 A Coordination Problem . Example 2: Computing a Global Function Despite a Message Adversary . . . . . . . . . . 1.3.1 The Problem . . 1.3.2 The Notion of a Message Adversary . 1.3.3 The TREE-AD Message Adversary . 1.3.4 . . . . . . . . . . . . . . . . . . . . . . From Message Adversary to Process Mobility . 1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 . . . . . . . . . . . . . . . 11 . . . . . . . . . . . . . 12 . . . . . . . . . . . . . 13 . . . . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . 16 . . . . . . . . . . . . . . . . . . . 17 . . . . . . . . . . . . . . . . . . . 18 . . . . . . . . . . . . . . . . . . . 18 . . . . . . . . . . . . . . . . . . 19 1.4 Main Distributed Computing Models Used in This Book . 1.5 Distributed Computing Versus Parallel Computing . . 1.6 . 1.7 Bibliographic Notes . . 1.8 . . Exercises and Problems . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II The Reliable Broadcast Communication Abstraction 21 2 Reliable Broadcast in the Presence of Process Crash Failures 2.1 Uniform Reliable Broadcast . . . . . . . . . . . . . . . . . . . . . 23 . . . . . . . . . . . 23 . . . . . . . . . . . . . . . . 23 . . . . . . . . . . . . . . . . 24 2.1.1 From Best Effort to Guaranteed Reliability . . . 2.1.2 Uniform Reliable Broadcast (URB-broadcast) . . 2.1.3 Building the URB-broadcast Abstraction in CAMP n,t[∅] . . . . . . . . . . . . 25 . . . . . . . . . . . . . . . . 2.2 Adding Quality of Service . 2.2.1 2.2.2 2.2.3 2.2.4 2.2.5 The Total Order Broadcast Abstraction Requires More . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . 27 “First In, First Out” (FIFO) Message Delivery . . . . . . . . . . . . . . . . . . 27 “Causal Order” (CO) Message Delivery . . . . . . . . . . . . . . . . . . . . . 29 From FIFO-broadcast to CO-broadcast . . . . . . . . . . . . . . . . . . . . . 31 From URB-broadcast to CO-broadcast: Capturing Causal Past in a Vector . . . 34 . . . 38 . . . . . . . . . 39 . . . . . . . . . . . . . . . . 39 . . . . . . . . . . . . . . . . . . . 39 . . Exercises and Problems . . . . . . . . . . . . . . . . . . . . 2.3 2.4 Bibliographic Notes . 2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 URB-broadcast in CAMP n,t[- FC] System Model . 3.4 URB-broadcast in CAMP n,t[- FC] Enriched with a Failure Detector Formal Definitions . 3 Reliable Broadcast in the Presence of Process Crashes and Unreliable Channels . . 3.1 A System Model with Unreliable Channels . . . Fairness Notions for Channels . . Fair Channel (FC) and Fair Lossy Channel 3.1.1 3.1.2 3.1.3 Reliable Channel in the Presence of Process Crashes . . . 3.1.4 . . 3.2.1 URB-broadcast in CAMP n,t[- FC, t < n/2] . . 3.2.2 An Impossibility Result . Failure Detectors: an Approach to Circumvent Impossibilities . 3.3.1 The Concept of a Failure Detector . 3.3.2 . 41 . 41 . . . 41 . . . . . . . . . . . . . . . . . . . 42 . 43 . . . . 44 . . . 44 . . 45 . . . . . 46 . 47 . . . . . . . . . . . . . . . . . . . . . . . . 47 . . . 48 . . . . . . . . . 49 3.4.1 Definition of the Failure Detector Class Θ . . . . . . . . . . . . . . . . . . . . 49 . 50 . 3.4.2 3.4.3 Building a Failure Detector Θ in CAMP n,t[- FC, t < n/2] . 50 3.4.4 The Fundamental Added Value Supplied by a Failure Detector . . . . . . . . . 51 . 51 3.5.1 The Quiescence Property . . 51 3.5.2 Quiescent URB-broadcast Based on a Perfect Failure Detector . . . . . . . . . 52 3.5.3 The Class HB of Heartbeat Failure Detectors . . 54 . 3.5.4 Quiescent URB-broadcast in CAMP n,t[- FC, Θ, HB ] . 56 . 58 . . Summary . . 58 . . . . . 59 . . . Exercises and Problems . Solving URB-broadcast in CAMP n,t[- FC, Θ] 3.5 Quiescent Uniform Reliable Broadcast . 3.6 3.7 Bibliographic Notes . 3.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Reliable Broadcast in the Presence of Byzantine Processes . . . . . . . . . . . . 4.1 Byzantine Processes and Properties of the Model BAMP n,t[t < n/3] . . 4.2 . . . . 61 . 61 . . The No-Duplicity Broadcast Abstraction . . 62 . . . . 4.2.1 Definition . . . . 62 . . 4.2.2 An Impossibility Result . . . . . 63 . 4.2.3 A No-Duplicity Broadcast Algorithm . . . . . . . . . . . . . . . . . . . . . . 63 4.3 The Byzantine Reliable Broadcast Abstraction . . . . . . . . . . . . . . . . . . . . . . 65 4.4 An Optimal Byzantine Reliable Broadcast Algorithm . . . . . . . . . . . . . . . . . . 66 . . . . . . 66 . 67 . . . . . . . . . . . . . . . . . . . . . . 68 . . 69 . . . . . . . . 70 . 70 . . . . . 72 . . . 73 . . . Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4.1 A Byzantine Reliable Broadcast Algorithm for BAMP n,t[t < n/3] . 4.4.2 Correctness Proof . . 4.4.3 Benefiting from Message Asynchrony . Time and Message-Efficient Byzantine Reliable Broadcast . . 4.5.1 A Message-Efficient Byzantine Reliable Broadcast Algorithm . . 4.5.2 Correctness Proof . Summary . . . . . 4.6 . 4.7 Bibliographic Notes . 4.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III The Read/Write Register Communication Abstraction 75 5 The Read/Write Register Abstraction 77 The Read/Write Register Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1.1 Concurrent Objects and Registers . . . . . . . . . . . . . . . . . . . . . 77 5.1 . . .
分享到:
收藏