
Fault-Tolerance Techniques for High-Performance Computing.

By: Herault, Thomas.
Contributor(s): Robert, Yves.
Material type: Text
Series: eBooks on Demand. Computer Communications and Networks Ser.
Publisher: Cham : Springer, 2015
Copyright date: ©2015
Description: 1 online resource (325 pages)
Content type: text
Media type: computer
Carrier type: online resource
ISBN: 9783319209432
Subject(s): Electronic data processing
Genre/Form: Electronic books
Additional physical formats: Print version: Fault-Tolerance Techniques for High-Performance Computing
DDC classification: 004.24
LOC classification: QA75.5-76.95
Contents:
Intro -- Preface -- Objective -- Thanks -- Contents -- Contributors -- Part I General Overview -- 1 Fault Tolerance Techniques for High-Performance Computing -- 1.1 Introduction -- 1.1.1 Resilience at Scale -- 1.1.2 Faults and Failures -- 1.2 Checkpoint and Rollback Recovery -- 1.2.1 Process Checkpointing -- 1.2.2 Coordinated Checkpointing -- 1.2.3 Uncoordinated Checkpointing -- 1.2.4 Hierarchical Checkpointing -- 1.3 Probabilistic Models for Checkpointing -- 1.3.1 Checkpointing with a Single Resource -- 1.3.2 Coordinated Checkpointing -- 1.3.3 Hierarchical Checkpointing -- 1.3.4 In-Memory Checkpointing -- 1.4 Probabilistic Models for Advanced Methods -- 1.4.1 Fault Prediction -- 1.4.2 Replication -- 1.5 Application-Specific Fault Tolerance Techniques -- 1.5.1 Fault-Tolerant Middleware -- 1.5.2 ABFT for Dense Matrix Factorization -- 1.5.3 Composite Approach: ABFT and Checkpointing -- 1.6 Silent Errors -- 1.6.1 Motivation -- 1.6.2 Other Approaches -- 1.6.3 Optimal Pattern -- 1.7 Conclusion -- References -- Part II Technical Contributions -- 2 Errors and Faults -- 2.1 Introduction -- 2.2 Definitions -- 2.3 Detection -- 2.4 Observations -- 2.4.1 Location Propagation -- 2.4.2 Failure Statistics -- 2.4.3 Additional Information -- 2.4.4 Silent Errors -- 2.5 Modeling -- 2.5.1 Randomness Testing -- 2.5.2 Fitting Distributions -- 2.5.3 Including Prediction -- 2.5.4 Per Component Failure Distribution -- 2.6 Prediction -- 2.6.1 Long-Term Prediction -- 2.6.2 Short-Term Prediction -- 2.6.3 Checkpointing Challenges -- References -- 3 Fault-Tolerant MPI -- 3.1 Introduction -- 3.2 Automatic Uncoordinated Fault Tolerance in MPI -- 3.2.1 Rollback Recovery Execution Model -- 3.2.2 Building a Consistent Recovery Set -- 3.2.3 Short Survey of Related Works -- 3.3 Message Logging and Zero-Copy MPI Communication -- 3.3.1 Understanding Non-blocking MPI Communication.
3.3.2 A Split Model for Matching and Delivery Events -- 3.3.3 A Generic Framework for Message Logging in Open MPI -- 3.3.4 Pessimistic Message Logging Implementation -- 3.3.5 Performance of MPI Message Logging -- 3.3.6 Concluding Remarks -- 3.4 Comparing Event Logging Strategies -- 3.4.1 Active Optimistic Message Logging -- 3.4.2 Optimistic Versus Pessimistic: Experimental Evaluation -- 3.4.3 Concluding Remarks -- 3.5 Optimizing Sender-Based Message Logging -- 3.5.1 Strategies for Sender-Based Copies -- 3.5.2 Backend Storages -- 3.5.3 Copy Methods -- 3.5.4 Sender-Based Copy: Experimental Evaluation -- 3.5.5 Concluding Remarks -- 3.6 Correlated Sets Coordination to Decrease Message Logging -- 3.6.1 Background -- 3.6.2 Correlated Set Coordinated Message Logging -- 3.6.3 Experimental Evaluation -- 3.6.4 Discussion on Process Grouping -- 3.6.5 Concluding Remarks -- 3.7 Supporting User-Level Recovery with Standard MPI -- 3.7.1 The Checkpoint-on-Failure Protocol -- 3.7.2 Implementation Issues -- 3.7.3 Example: The QR Factorization -- 3.7.4 In-Memory CoF Protocol -- 3.7.5 Performance Discussion -- 3.7.6 Concluding Remarks -- 3.8 User-Level Fault Tolerance with Extended MPI -- 3.8.1 Communication Substrate Recovery Background -- 3.8.2 Establishing a Flexible Feature Set -- 3.8.3 The Implementor's Perspective -- 3.8.4 The Users' Perspective -- 3.8.5 The User-Level Failure Mitigation API -- 3.8.6 Performance Assessment -- 3.8.7 Concluding Remarks -- References -- 4 Using Replication for Resilience on Exascale Systems -- 4.1 Introduction -- 4.2 Related Work -- 4.3 Models and Assumptions -- 4.4 Group Replication -- 4.4.1 Exponential Failures -- 4.4.2 General Failures -- 4.4.3 Simulation Methodology -- 4.4.4 Simulation Results -- 4.5 Process Replication -- 4.5.1 Theoretical Results -- 4.5.2 Empirical Evaluation -- 4.6 Conclusion -- References.
5 Energy-Aware Checkpointing Strategies -- 5.1 Introduction -- 5.2 Optimal Checkpointing Period: Time versus Energy -- 5.2.1 Model -- 5.2.2 Optimal Checkpointing Period -- 5.2.3 Experiments -- 5.2.4 Summary -- 5.3 Energy-Aware Fault-Tolerant Protocols for HPC Applications: A Methodology Based on Energy Estimation -- 5.3.1 Identifying Operations in Fault-Tolerant Protocols -- 5.3.2 Energy Calibration Methodology -- 5.3.3 Energy Estimation Methodology -- 5.3.4 Validation of the Estimations -- 5.3.5 Energy-Aware Choice of Checkpointing Protocols -- 5.3.6 Summary -- 5.4 Conclusion -- References -- Index.
Item type: Electronic Book
Current location: UT Tyler Online (Online)
Call number: QA75.5-76.95
URL: https://ebookcentral.proquest.com/lib/uttyler/detail.action?docID=3563307
Status: Available
Barcode: EBC3563307


Description based on publisher supplied metadata and other sources.
