paper · arXiv · Trust 82 · Primary · Published 4d ago · Live · 3d ago

GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study

We present a comparative study of CUDA optimization strategies applied to forward and backward propagation in a shallow neural network. Three stacked optimizations are evaluated: (1) tiled shared memory with bank-conflict elimination via +1-column padding, (2) pre-transposed weight matrices for coalesced global memory access, and (3) a fused MatMul+ReLU kernel that eliminates intermediate global-memory round-trips. Experiments on an NVIDIA Tesla T4 (CUDA 13.0) across three dataset sizes show that the fully optimized implementation achieves a 1.41x speedup over the baseline CUDA version on the
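The three optimizations named in the abstract can be combined in a single kernel. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names fused_matmul_relu, run_fused_example, and TILE, the 32x32 tile size, the matrix shapes, and the row-major layout with weights stored as a K x N matrix are all hypothetical choices made for illustration. In this particular access pattern the shared-memory reads are already broadcast or consecutive within a warp, so the +1 padding is shown mainly to illustrate the declaration; it matters most when a warp walks down a column of a tile.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // hypothetical tile size: one warp per tile row

// Computes Y = ReLU(X * Wt) with X (M x K), Wt (K x N), Y (M x N), all row-major.
// Wt is assumed to be the weight matrix stored pre-transposed, so that threads
// with consecutive `col` read consecutive addresses (coalesced global loads).
__global__ void fused_matmul_relu(const float* X, const float* Wt, float* Y,
                                  int M, int K, int N) {
    // +1 column of padding shifts each row by one bank, so column-wise
    // accesses to a tile would not all land in the same shared-memory bank.
    __shared__ float Xs[TILE][TILE + 1];
    __shared__ float Ws[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int xk = t * TILE + threadIdx.x;  // column of X loaded by this thread
        int wk = t * TILE + threadIdx.y;  // row of Wt loaded by this thread
        // Out-of-range elements are zero-filled so partial tiles stay correct.
        Xs[threadIdx.y][threadIdx.x] = (row < M && xk < K) ? X[row * K + xk] : 0.0f;
        Ws[threadIdx.y][threadIdx.x] = (wk < K && col < N) ? Wt[wk * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += Xs[threadIdx.y][k] * Ws[k][threadIdx.x];
        __syncthreads();
    }
    // Fusion: ReLU is applied in registers before the only global store,
    // so the pre-activation matrix is never written to global memory.
    if (row < M && col < N)
        Y[row * N + col] = fmaxf(acc, 0.0f);
}

// Hypothetical host helper: runs the kernel on small synthetic data and returns
// the maximum absolute error against a CPU reference (-1 on a CUDA error).
float run_fused_example(int M = 70, int K = 45, int N = 50) {
    std::vector<float> X(M * K), Wt(K * N), Y(M * N), ref(M * N);
    for (int i = 0; i < M * K; ++i) X[i] = ((i * 7) % 13 - 6) * 0.1f;
    for (int i = 0; i < K * N; ++i) Wt[i] = ((i * 5) % 11 - 5) * 0.1f;
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float s = 0.0f;
            for (int k = 0; k < K; ++k) s += X[m * K + k] * Wt[k * N + n];
            ref[m * N + n] = s > 0.0f ? s : 0.0f;
        }

    float *dX = nullptr, *dW = nullptr, *dY = nullptr;
    if (cudaMalloc(&dX, X.size() * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dW, Wt.size() * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dY, Y.size() * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemcpy(dX, X.data(), X.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, Wt.data(), Wt.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    fused_matmul_relu<<<grid, block>>>(dX, dW, dY, M, K, N);
    // The blocking device-to-host copy also waits for the kernel to finish.
    cudaError_t status =
        cudaMemcpy(Y.data(), dY, Y.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dW); cudaFree(dY);
    if (status != cudaSuccess) return -1.0f;

    float err = 0.0f;
    for (size_t i = 0; i < Y.size(); ++i) err = fmaxf(err, fabsf(Y[i] - ref[i]));
    return err;
}
```

Calling run_fused_example() from a small main and checking that the returned error is close to zero is one way to confirm the fused kernel matches an unfused CPU reference. The default sizes are deliberately not multiples of the tile size, so the zero-filled boundary tiles are exercised as well.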

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Covers

news · Going from single GPU to dual GPU is nice but not in the way I expected
news · Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch
news · Ubuntu, CUDA, llama.cpp, nvcc versioning

Covers (incoming)

news · How NVIDIA’s Inference Software Stack Powers the Lowest Token Cost

Implements (incoming)

repo · NexusGPU/tensor-fusion
repo · NVIDIA/cuvs
repo · pytorch/pytorch
repo · NVIDIA/physicsnemo
repo · NVIDIA/TransformerEngine
repo · uccl-project/uccl
repo · NVIDIA/raft
repo · BBuf/how-to-optim-algorithm-in-cuda


Topics

cs.LG