June 13, 2024

NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance

admin

The latest release of the NVIDIA cuBLAS library, version 12.5, adds functionality and improves performance for deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include new Grouped GEMM APIs, faster matrix multiplication (matmul) on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and improved performance tuning options.

Grouped GEMM APIs

The newly introduced Grouped GEMM APIs generalize the batched APIs: matrix multiplications with different matrix sizes, transpositions, and scaling factors can be grouped and executed in one kernel launch. NVIDIA reports a 1.2x speedup over looping over individual cuBLAS GEMM calls in the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.

Two new sets of APIs support Grouped GEMM:

cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

Examples of both APIs can be found in the NVIDIA/CUDALibrarySamples GitHub repository.
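To make the grouping semantics concrete, here is a minimal CPU reference sketch in plain Python. It is not the cuBLAS API or its implementation: the `Group` class and the `gemm` and `grouped_gemm` functions are hypothetical names introduced for illustration, matrices are row-major nested lists, and the loop runs problems one at a time where cuBLAS would issue a single kernel launch. The sketch assumes that all problems within a group share one shape, transposition setting, and pair of scaling factors, while different groups may differ.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical container: every problem in a group shares these settings."""
    trans_a: bool
    trans_b: bool
    m: int
    n: int
    k: int
    alpha: float
    beta: float
    problems: list  # list of (A, B, C) tuples of row-major nested lists


def transpose(mat):
    return [list(row) for row in zip(*mat)]


def gemm(trans_a, trans_b, m, n, k, alpha, a, b, beta, c):
    """Return alpha * op(A) @ op(B) + beta * C for one m x n x k problem."""
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    assert len(op_a) == m and len(op_a[0]) == k
    assert len(op_b) == k and len(op_b[0]) == n
    return [
        [alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def grouped_gemm(groups):
    """Run every problem of every group, in order (sequentially, on the CPU)."""
    results = []
    for g in groups:
        for a, b, c in g.problems:
            results.append(
                gemm(g.trans_a, g.trans_b, g.m, g.n, g.k, g.alpha, a, b, g.beta, c)
            )
    return results


groups = [
    # Group 0: two 2x2x2 problems, no transposition, alpha=1, beta=0.
    Group(False, False, 2, 2, 2, 1.0, 0.0, [
        ([[1, 2], [3, 4]], [[1, 0], [0, 1]], [[0, 0], [0, 0]]),
        ([[1, 1], [1, 1]], [[2, 0], [0, 2]], [[0, 0], [0, 0]]),
    ]),
    # Group 1: one 1x3x2 problem with A transposed, alpha=2, beta=1.
    Group(True, False, 1, 3, 2, 2.0, 1.0, [
        ([[1], [2]], [[1, 2, 3], [4, 5, 6]], [[1, 1, 1]]),
    ]),
]
outputs = grouped_gemm(groups)
```

A plain batched GEMM corresponds to the special case of a single group. The grouped form lets the two groups above, which differ in shape, transposition, and scaling, be submitted together.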

Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs

Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.
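The article does not spell out how the repeat counts enter the speedup figure. One plausible reading, sketched below as an assumption, is that each GEMM's time is multiplied by the number of times it occurs in the workload before the totals are compared. The function name `weighted_speedup` and all timings are hypothetical and are not NVIDIA measurements.

```python
def weighted_speedup(gemms):
    """gemms: list of (repeat_count, baseline_ms, new_ms) per distinct GEMM shape."""
    baseline_total = sum(count * base for count, base, _ in gemms)
    new_total = sum(count * new for count, _, new in gemms)
    return baseline_total / new_total


# Hypothetical workload: made-up numbers for illustration only.
workload = [
    (80, 4.0, 2.0),  # occurs 80 times, 2x faster on the new GPU
    (20, 8.0, 2.0),  # occurs 20 times, 4x faster on the new GPU
]
speedup = weighted_speedup(workload)  # (320 + 160) / (160 + 40)
```

Under this weighting, a GEMM that repeats often pulls the overall figure toward its own speedup: here the result sits closer to 2x than to 4x because the 2x case accounts for most of the runtime.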

*Figure 1. Speedup of the GEMM-only fraction of e2e workloads*

Library Performance and Benchmarking

Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmuls. This system is trained on actual timing data from a wide range of problems and configurations.

*Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families*

For advanced users, the cublasLtMatmulAlgoGetHeuristic API returns candidate configurations that can be timed to find a faster implementation than the default choice. Examples of auto-tuning in cuBLAS can be found in the NVIDIA/CUDALibrarySamples repository.

*Figure 4. An example of auto-tuning in cuBLAS*
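The general shape of such an auto-tuning loop can be sketched in plain Python: run each candidate a few times, record its average time, and keep the fastest. This is an illustration of the idea only, not the cuBLASLt workflow. The `autotune` function and the two CPU matmul candidates are hypothetical stand-ins for the configurations a heuristic query would return, and a GPU version would need device-side timing and synchronization instead of `time.perf_counter`.

```python
import time


def matmul_ijk(a, b):
    """Candidate 1: index-based triple loop."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matmul_transposed_b(a, b):
    """Candidate 2: transpose B once, then take row-by-row dot products."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def autotune(candidates, run_args, warmup=1, repeats=5):
    """Time each candidate and return (name_of_fastest, average_seconds_by_name)."""
    timings = {}
    for name, fn in candidates.items():
        for _ in range(warmup):  # warm-up runs are not timed
            fn(*run_args)
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*run_args)
        timings[name] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    return best, timings


a = [[float(i + j) for j in range(16)] for i in range(16)]
b = [[float(i - j) for j in range(16)] for i in range(16)]
candidates = {"ijk": matmul_ijk, "transposed_b": matmul_transposed_b}
best, timings = autotune(candidates, (a, b))
```

Which candidate wins depends on the machine and the problem size, which is the reason for measuring rather than assuming. The winner is only valid for the shapes that were timed, so a tuned choice is typically stored per problem configuration.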

Better Functionality and Performance in cuBLASLt

Since cuBLAS 12.0, numerous enhancements have been introduced:

Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
Additional fused epilogues on NVIDIA Hopper and Ampere.
Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
Removal of M, N, and batch size limitations of the cuBLASLt matmul API.
Improved performance of the heuristics cache for workloads with a high eviction rate.
cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.

For more information on cuBLAS, see the documentation and samples.
