Performance analysis and tuning for general purpose graphics processing units (GPGPU) / Hyesoon Kim ... [et al.].

Contributor(s): Kim, Hyesoon
Material type: Text
Series: Morgan & Claypool Publishers. Synthesis digital library of engineering and computer science ; Synthesis lectures in computer architecture ; # 20
Publisher: San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, c2012
Description: 1 electronic text (xi, 84 p.) : ill., digital file
ISBN: 9781608459551 (electronic bk.)
Subject(s): Graphics processing units | Parallel processing (Electronic computers)
Additional physical formats: Print version: Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
DDC classification: 006.6869
LOC classification: T385 | .P472 2012
Also available in print.
Contents:
1. GPU design, programming, and trends -- 1.1 A brief history of GPU -- 1.2 A brief overview of a GPU system -- 1.2.1 An overview of GPU architecture -- 1.3 A GPGPU programming model: CUDA -- 1.3.1 Kernels -- 1.3.2 Thread hierarchy in CUDA -- 1.3.3 Memory hierarchy -- 1.3.4 SIMT execution -- 1.3.5 CUDA language extensions -- 1.3.6 Vector addition example -- 1.3.7 PTX -- 1.3.8 Consistency model and special memory operations -- 1.3.9 IEEE floating-point support -- 1.3.10 Execution model of OpenCL -- 1.4 GPU architecture -- 1.4.1 GPU pipeline -- 1.4.2 Handling branch instructions -- 1.4.3 GPU memory systems -- 1.5 Other GPU architectures -- 1.5.1 The Fermi architecture -- 1.5.2 The AMD architecture -- 1.5.3 Many integrated core architecture -- 1.5.4 Combining CPUs and GPUs on the same die --
2. Performance principles -- 2.1 Theory: algorithm design models overview -- 2.2 Characterizing parallelism: the work-depth model -- 2.3 Characterizing I/O behavior: the external memory model -- 2.4 Combined analyses of parallelism and I/O-efficiency -- 2.5 Abstract and concrete measures -- 2.6 Summary --
3. From principles to practice: analysis and tuning -- 3.1 The computational problem: particle interactions -- 3.2 An optimal approximation: the fast multipole method -- 3.3 Designing a parallel and I/O-efficient algorithm -- 3.4 A baseline implementation -- 3.5 Setting an optimization goal -- 3.5.1 Identifying candidate optimizations -- 3.5.2 Exploring the optimization space -- 3.5.3 Summary --
4. Using detailed performance analysis to guide optimization -- 4.1 Instruction-level analysis and tuning -- 4.1.1 Execution time modeling -- 4.1.2 Applying the model to FMM -- 4.1.3 Performance optimization guide -- 4.2 Other performance modeling techniques and tools -- 4.2.1 Limited performance visibility -- 4.2.2 Work flow graphs -- 4.2.3 Stochastic memory hierarchy model -- 4.2.4 Roofline model -- 4.2.5 Profiling and performance analysis of CUDA workloads using Ocelot [33] -- 4.2.6 Other GPGPU performance modeling techniques -- 4.2.7 Performance analysis tools for OpenCL --
Bibliography -- Authors' biographies.
Abstract: General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects hints for understanding the algorithmic aspects of GPGPUs. We also provide detailed performance analysis and guide optimizations, from high-level algorithms down to low-level, instruction-level optimizations. As a case study, we use the n-body particle simulation method known as the fast multipole method (FMM). We also briefly survey the state of the art in GPU performance analysis tools and techniques.
Item type: Electronic Book
Current location: UT Tyler Online
Call number: T385 .P472 2012
URL: https://ezproxy.uttyler.edu/login?url=http://dx.doi.org/10.2200/S00451ED1V01Y201209CAC020
Status: Available
Barcode: 201209CAC020

Mode of access: World Wide Web.

System requirements: Adobe Acrobat Reader.

Part of: Synthesis digital library of engineering and computer science.

Title from PDF t.p. (viewed on December 10, 2012).

Series from website.

Includes bibliographical references (p. 71-81).

Abstract freely available; full-text restricted to subscribers or individual document purchasers.

Compendex

INSPEC

Google scholar

Google book search
