Fast inference engine for Transformer models
Updated Aug 20, 2024 - C++
🎉 CUDA/C++ notes / hand-written CUDA kernels for large models / tech blog, updated irregularly: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Tuned OpenCL BLAS
BLISlab: A Sandbox for Optimizing GEMM
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
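For context on what these kernels compute: SGEMM is the single-precision operation C = alpha * A * B + beta * C. A naive C reference (a minimal sketch, not an optimized kernel; the function name `sgemm_ref` is illustrative, not from any of the listed projects) looks like:

```c
#include <stddef.h>

/* Naive reference SGEMM: C = alpha * A * B + beta * C.
 * A is M x K, B is K x N, C is M x N, all row-major.
 * Optimized kernels (tiling, vectorization, tensor cores)
 * implement this same semantics, just much faster. */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

The O(M*N*K) triple loop is the baseline the optimized repositories above are benchmarked against.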
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores via the WMMA API and MMA PTX instructions.
Stretching GPU performance for GEMMs and tensor contractions.
Fast multi-threaded matrix multiplication in C
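On CPUs, the usual starting point for multi-threaded matrix multiplication is parallelizing the outer loop over rows, since each thread then writes a disjoint block of C and needs no synchronization. A minimal OpenMP sketch (the function name `matmul_omp` is hypothetical, not from the repository above):

```c
#include <stddef.h>

/* Multi-threaded matrix multiply: C = A * B, row-major.
 * A is M x K, B is K x N, C is M x N.
 * Each thread computes a disjoint set of rows of C, so the
 * loop body is race-free without locks. Compile with -fopenmp;
 * without it the pragma is ignored and the code runs serially. */
void matmul_omp(size_t M, size_t N, size_t K,
                const float *A, const float *B, float *C)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```

Production libraries go well beyond this, adding cache blocking, packing, and SIMD microkernels on top of the threading.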
DBCSR: Distributed Block Compressed Sparse Row matrix library
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond that of a traditional BLAS library.
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Serial and parallel implementations of matrix multiplication
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.