Fault-Tolerance Techniques for High-Performance Computing
Computer Communications and Networks

    • 87,99 €

Publisher Description

This timely text/reference presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC).

The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Topics and features:
    • Includes self-contained contributions from an international selection of preeminent experts
    • Provides a survey of resilience methods and performance models
    • Examines the various sources for errors and faults in large-scale systems, detailing their characteristics, with a focus on modeling, detection and prediction
    • Reviews the spectrum of techniques that can be applied to design a fault-tolerant message passing interface
    • Investigates different approaches to replication, comparing these to the traditional checkpoint-recovery approach
    • Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems, proposing a methodology to estimate such energy consumption

This authoritative volume is essential reading for all researchers and graduate students involved in high-performance computing.

Dr. Thomas Herault is a Research Scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, TN, USA. Dr. Yves Robert is a Professor in the Laboratory of Parallel Computing at the Ecole Normale Supérieure de Lyon, France, and a Visiting Research Scholar in the ICL.

Genre: Computers and Internet
Released: July 1, 2015
Language: English (EN)
Length: 329 pages
Publisher: Springer International Publishing
Provider: Springer Science & Business Media LLC
Size: 6.4 MB

Dependable Computer Systems (2011)
Reliable and Autonomous Computational Science (2011)
Resilience Assessment and Evaluation of Computing Systems (2012)
Distributed Embedded Systems: Design, Middleware and Resources (2008)
Safety of Computer Architectures (2013)
Fault-Tolerance Techniques for Spacecraft Control Computers (2017)
Asynchronous Many-Task Systems and Applications (2025)
Recent Advances in Parallel Virtual Machine and Message Passing Interface (2007)
Guide to Security Assurance for Cloud Computing (2016)
The Internet of Things in the Industrial Sector (2019)
Guide to Computing Fundamentals in Cyber-Physical Systems (2016)
Resilient Routing in Communication Networks (2024)
Big Data Platforms and Applications (2021)
6G Mobile Wireless Networks (2021)