Professional CUDA C Programming.

By: Cheng, John
Contributor(s): Grossman, Max | McKercher, Ty
Material type: Text
Series: eBooks on Demand
Publisher: Somerset : John Wiley & Sons, Incorporated, 2014
Copyright date: ©2014
Edition: 1st ed.
Description: 1 online resource (527 pages)
Content type: text
Media type: computer
Carrier type: online resource
ISBN: 9781118739273
Subject(s): Computer architecture. ; Multiprocessors. ; Parallel processing (Electronic computers)
Genre/Form: Electronic books.
Additional physical formats: Print version: Professional CUDA C Programming
DDC classification: 005.275
LOC classification: QA76.9.A73 .C446 2014
Online resources: https://ebookcentral.proquest.com/lib/uttyler/detail.action?docID=1776323
Contents:
Cover -- Title Page -- Copyright -- Contents
Chapter 1 Heterogeneous Parallel Computing with CUDA: Parallel Computing -- Sequential and Parallel Programming -- Parallelism -- Computer Architecture -- Heterogeneous Computing -- Heterogeneous Architecture -- Paradigm of Heterogeneous Computing -- CUDA: A Platform for Heterogeneous Computing -- Hello World from GPU -- Is CUDA C Programming Difficult? -- Summary
Chapter 2 CUDA Programming Model: Introducing the CUDA Programming Model -- CUDA Programming Structure -- Managing Memory -- Organizing Threads -- Launching a CUDA Kernel -- Writing Your Kernel -- Verifying Your Kernel -- Handling Errors -- Compiling and Executing -- Timing Your Kernel -- Timing with CPU Timer -- Timing with nvprof -- Organizing Parallel Threads -- Indexing Matrices with Blocks and Threads -- Summing Matrices with a 2D Grid and 2D Blocks -- Summing Matrices with a 1D Grid and 1D Blocks -- Summing Matrices with a 2D Grid and 1D Blocks -- Managing Devices -- Using the Runtime API to Query GPU Information -- Determining the Best GPU -- Using nvidia-smi to Query GPU Information -- Setting Devices at Runtime -- Summary
Chapter 3 CUDA Execution Model: Introducing the CUDA Execution Model -- GPU Architecture Overview -- The Fermi Architecture -- The Kepler Architecture -- Profile-Driven Optimization -- Understanding the Nature of Warp Execution -- Warps and Thread Blocks -- Warp Divergence -- Resource Partitioning -- Latency Hiding -- Occupancy -- Synchronization -- Scalability -- Exposing Parallelism -- Checking Active Warps with nvprof -- Checking Memory Operations with nvprof -- Exposing More Parallelism -- Avoiding Branch Divergence -- The Parallel Reduction Problem -- Divergence in Parallel Reduction -- Improving Divergence in Parallel Reduction -- Reducing with Interleaved Pairs -- Unrolling Loops -- Reducing with Unrolling -- Reducing with Unrolled Warps -- Reducing with Complete Unrolling -- Reducing with Template Functions -- Dynamic Parallelism -- Nested Execution -- Nested Hello World on the GPU -- Nested Reduction -- Summary
Chapter 4 Global Memory: Introducing the CUDA Memory Model -- Benefits of a Memory Hierarchy -- CUDA Memory Model -- Memory Management -- Memory Allocation and Deallocation -- Memory Transfer -- Pinned Memory -- Zero-Copy Memory -- Unified Virtual Addressing -- Unified Memory -- Memory Access Patterns -- Aligned and Coalesced Access -- Global Memory Reads -- Global Memory Writes -- Array of Structures versus Structure of Arrays -- Performance Tuning -- What Bandwidth Can a Kernel Achieve? -- Memory Bandwidth -- Matrix Transpose Problem -- Matrix Addition with Unified Memory -- Summary
Chapter 5 Shared Memory and Constant Memory: Introducing CUDA Shared Memory -- Shared Memory -- Shared Memory Allocation -- Shared Memory Banks and Access Mode -- Configuring the Amount of Shared Memory -- Synchronization -- Checking the Data Layout of Shared Memory -- Square Shared Memory -- Rectangular Shared Memory -- Reducing Global Memory Access -- Parallel Reduction with Shared Memory -- Parallel Reduction with Unrolling -- Parallel Reduction with Dynamic Shared Memory -- Effective Bandwidth -- Coalescing Global Memory Accesses -- Baseline Transpose Kernel -- Matrix Transpose with Shared Memory -- Matrix Transpose with Padded Shared Memory -- Matrix Transpose with Unrolling -- Exposing More Parallelism -- Constant Memory -- Implementing a 1D Stencil with Constant Memory -- Comparing with the Read-Only Cache -- The Warp Shuffle Instruction -- Variants of the Warp Shuffle Instruction -- Sharing Data within a Warp -- Parallel Reduction Using the Warp Shuffle Instruction -- Summary
Chapter 6 Streams and Concurrency: Introducing Streams and Events -- CUDA Streams -- Stream Scheduling -- Stream Priorities -- CUDA Events -- Stream Synchronization -- Concurrent Kernel Execution -- Concurrent Kernels in Non-NULL Streams -- False Dependencies on Fermi GPUs -- Dispatching Operations with OpenMP -- Adjusting Stream Behavior Using Environment Variables -- Concurrency-Limiting GPU Resources -- Blocking Behavior of the Default Stream -- Creating Inter-Stream Dependencies -- Overlapping Kernel Execution and Data Transfer -- Overlap Using Depth-First Scheduling -- Overlap Using Breadth-First Scheduling -- Overlapping GPU and CPU Execution -- Stream Callbacks -- Summary
Chapter 7 Tuning Instruction-Level Primitives: Introducing CUDA Instructions -- Floating-Point Instructions -- Intrinsic and Standard Functions -- Atomic Instructions -- Optimizing Instructions for Your Application -- Single-Precision vs. Double-Precision -- Standard vs. Intrinsic Functions -- Understanding Atomic Instructions -- Bringing It All Together -- Summary
Chapter 8 GPU-Accelerated CUDA Libraries and OpenACC: Introducing the CUDA Libraries -- Supported Domains for CUDA Libraries -- A Common Library Workflow -- The cuSPARSE Library -- cuSPARSE Data Storage Formats -- Formatting Conversion with cuSPARSE -- Demonstrating cuSPARSE -- Important Topics in cuSPARSE Development -- cuSPARSE Summary -- The cuBLAS Library -- Managing cuBLAS Data -- Demonstrating cuBLAS -- Important Topics in cuBLAS Development -- cuBLAS Summary -- The cuFFT Library -- Using the cuFFT API -- Demonstrating cuFFT -- cuFFT Summary -- The cuRAND Library -- Choosing Pseudo- or Quasi-Random Numbers -- Overview of the cuRAND Library -- Demonstrating cuRAND -- Important Topics in cuRAND Development -- CUDA Library Features Introduced in CUDA 6 -- Drop-In CUDA Libraries -- Multi-GPU Libraries -- A Survey of CUDA Library Performance -- cuSPARSE versus MKL -- cuBLAS versus MKL BLAS -- cuFFT versus FFTW versus MKL -- CUDA Library Performance Summary -- Using OpenACC -- Using OpenACC Compute Directives -- Using OpenACC Data Directives -- The OpenACC Runtime API -- Combining OpenACC and the CUDA Libraries -- Summary of OpenACC -- Summary
Chapter 9 Multi-GPU Programming: Moving to Multiple GPUs -- Executing on Multiple GPUs -- Peer-to-Peer Communication -- Synchronizing across Multi-GPUs -- Subdividing Computation across Multiple GPUs -- Allocating Memory on Multiple Devices -- Distributing Work from a Single Host Thread -- Compiling and Executing -- Peer-to-Peer Communication on Multiple GPUs -- Enabling Peer-to-Peer Access -- Peer-to-Peer Memory Copy -- Peer-to-Peer Memory Access with Unified Virtual Addressing -- Finite Difference on Multi-GPU -- Stencil Calculation for 2D Wave Equation -- Typical Patterns for Multi-GPU Programs -- 2D Stencil Computation with Multiple GPUs -- Overlapping Computation and Communication -- Compiling and Executing -- Scaling Applications across GPU Clusters -- CPU-to-CPU Data Transfer -- GPU-to-GPU Data Transfer Using Traditional MPI -- GPU-to-GPU Data Transfer with CUDA-Aware MPI -- Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI -- Adjusting Message Chunk Size -- GPU-to-GPU Data Transfer with GPUDirect RDMA -- Summary
Chapter 10 Implementation Considerations: The CUDA C Development Process -- APOD Development Cycle -- Optimization Opportunities -- CUDA Code Compilation -- CUDA Error Handling -- Profile-Driven Optimization -- Finding Optimization Opportunities Using nvprof -- Guiding Optimization Using nvvp -- NVIDIA Tools Extension -- CUDA Debugging -- Kernel Debugging -- Memory Debugging -- Debugging Summary -- A Case Study in Porting C Programs to CUDA C -- Assessing crypt -- Parallelizing crypt -- Optimizing crypt -- Deploying crypt -- Summary of Porting crypt -- Summary
Appendix: Suggested Readings -- Index -- Advertisement -- EULA.
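
Chapter 1 in the contents above includes a "Hello World from GPU" example. As a flavor of what that chapter covers, here is a minimal sketch of such a program; this is an illustration in the spirit of the book, not its exact listing. Each launched GPU thread prints a greeting via device-side printf:

    #include <cstdio>

    // Kernel: code that runs on the GPU. Every launched thread
    // executes this function and prints its own thread index.
    __global__ void helloFromGPU(void)
    {
        printf("Hello World from GPU thread %d!\n", threadIdx.x);
    }

    int main(void)
    {
        printf("Hello World from CPU!\n");

        // Launch the kernel on 1 block of 10 threads; <<<grid, block>>>
        // is the CUDA kernel-launch configuration syntax.
        helloFromGPU<<<1, 10>>>();

        // Kernel launches are asynchronous; wait for the GPU to finish
        // before the host process exits.
        cudaDeviceSynchronize();
        return 0;
    }

Compiled with nvcc (for example, nvcc hello.cu -o hello), this prints one line from the CPU followed by ten lines, one per GPU thread.
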
Summary: Break into the powerful world of parallel GPU programming with this down-to-earth, practical guide. Designed for professionals across multiple industrial sectors, Professional CUDA C Programming presents the fundamentals of CUDA -- a parallel computing platform and programming model designed to ease GPU programming -- in an easy-to-follow format, and teaches readers how to think in parallel and implement parallel algorithms on GPUs. Each chapter covers a specific topic and includes workable examples that demonstrate the development process, allowing readers to explore both the "hard" and "soft" aspects of GPU programming. Computing architectures are experiencing a fundamental shift toward scalable parallel computing, motivated by application requirements in industry and science. This book demonstrates the challenges of efficiently utilizing compute resources at peak performance and presents modern techniques for tackling these challenges, while remaining accessible to professionals who are not necessarily parallel programming experts. The CUDA programming model and tools empower developers to write high-performance applications on a scalable, parallel computing platform: the GPU. However, CUDA itself can be difficult to learn without extensive programming experience. Recognized CUDA authorities John Cheng, Max Grossman, and Ty McKercher guide readers through essential GPU programming skills and best practices, including: the CUDA programming model; the GPU execution model; the GPU memory model; streams, events, and concurrency; multi-GPU programming; CUDA domain-specific libraries; and profiling and performance tuning. The book makes complex CUDA concepts easy to understand for anyone with basic software development knowledge, with examples designed to be both readable and high-performance. For the professional seeking entrance to parallel computing and the high-performance computing community, Professional CUDA C Programming is an invaluable resource, with the most current information available on the market.
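
The summary lists the CUDA programming model among the covered topics. The core of that model is a host-driven sequence: allocate device memory, copy inputs to the GPU, launch a kernel, and copy results back. Below is a minimal vector-addition sketch of that sequence; the vecAdd kernel and CHECK macro are hypothetical names for illustration, not the book's code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative error-checking macro in the spirit of the book's
    // "Handling Errors" section; the name CHECK is an assumption here.
    #define CHECK(call)                                                \
        do {                                                           \
            cudaError_t err_ = (call);                                 \
            if (err_ != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                        cudaGetErrorString(err_), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                    \
            }                                                          \
        } while (0)

    // Kernel: each thread computes one element of c = a + b.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // bounds check for the last block
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Allocate and initialize host arrays.
        float *hA = (float *)malloc(bytes);
        float *hB = (float *)malloc(bytes);
        float *hC = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

        // Allocate device (GPU) memory.
        float *dA, *dB, *dC;
        CHECK(cudaMalloc((void **)&dA, bytes));
        CHECK(cudaMalloc((void **)&dB, bytes));
        CHECK(cudaMalloc((void **)&dC, bytes));

        // Copy inputs to the device, launch the kernel, copy results back.
        CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));
        int block = 256;
        int grid  = (n + block - 1) / block;  // round up to cover every element
        vecAdd<<<grid, block>>>(dA, dB, dC, n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));

        printf("c[0] = %f\n", hC[0]);         // expect 3.000000

        // Release device and host memory.
        CHECK(cudaFree(dA)); CHECK(cudaFree(dB)); CHECK(cudaFree(dC));
        free(hA); free(hB); free(hC);
        return 0;
    }

The grid size is rounded up so every element gets a thread, and the in-kernel bounds check keeps the surplus threads in the final block from writing out of range.
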
Item type: Electronic Book
Current location: UT Tyler Online
Call number: QA76.9.A73 .C446 2014
URL: https://ebookcentral.proquest.com/lib/uttyler/detail.action?docID=1776323
Status: Available
Barcode: EBC1776323


Description based on publisher-supplied metadata and other sources.

Author notes provided by Syndetics

John Cheng, PhD, is a research scientist at BGP International in Houston. He has developed seismic imaging products with GPU technology and many high-performance parallel production applications on heterogeneous computing platforms.

Max Grossman is an expert in GPU computing with experience applying CUDA to problems in medical imaging, machine learning, geophysics, and more.

Ty McKercher has been helping customers adopt GPU acceleration technologies at NVIDIA, where he has worked since 2008.
