Are Your GPU Atomics Secretly Contending?

  • Authors:

    Peter Maucher, Nick Djerfi, Lennard Kittner, Lukas Werling, Frank Bellosa

  • Source:

    PLOS ’25, 13th Workshop on Programming Languages and Operating Systems (October 13–16, 2025, Seoul, Republic of Korea)

  • Date: October 13, 2025
  • Abstract
    GPU applications use atomic operations to coordinate data access in highly parallel code. However, relying on previous experiences and due to limited documentation, programmers resort to guidelines instead of concrete metrics to limit potential performance influences.
    In this paper, we introduce a GPU memory-subsystem microbenchmark suite for analyzing GPU atomic operations.
    Based on the benchmark results, we discuss two particular guidelines, namely: “use only one thread per warp to access an atomic” and “place two atomic variables on different cache lines to avoid contention.” We demonstrate where these guidelines are effective and where actual hardware behavior diverges.
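
    To make the two guidelines concrete, the following is a minimal CUDA sketch of what they usually look like in application code. It is not taken from the paper's benchmark suite: the kernel and helper names, the block size of 256, and the 128-byte line size are all assumptions chosen for illustration, and the sketch makes no claim about which variant is faster on a given GPU.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed cache-line size in bytes; the real value depends on the hardware.
constexpr int kAssumedLineBytes = 128;

// Guideline 2: two atomic counters forced onto separate (assumed) cache lines.
struct PaddedCounters {
    alignas(kAssumedLineBytes) unsigned int a;
    alignas(kAssumedLineBytes) unsigned int b;
};
static_assert(offsetof(PaddedCounters, b) >= kAssumedLineBytes,
              "counters must not share an assumed cache line");

// Baseline: every in-range thread issues its own atomic on the same address.
__global__ void add_per_thread(unsigned int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(counter, 1u);
}

// Guideline 1: one thread per warp issues the atomic for all in-range lanes.
// Assumes 1-D blocks whose size is a multiple of warpSize.
__global__ void add_per_warp(unsigned int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All lanes vote, including out-of-range ones, so the full mask is valid.
    unsigned int voters = __ballot_sync(0xffffffffu, i < n);
    int lane = threadIdx.x % warpSize;
    // __ffs returns 0 for an empty mask, so no lane matches in that case.
    if (lane == __ffs(voters) - 1)
        atomicAdd(counter, static_cast<unsigned int>(__popc(voters)));
}

// Hypothetical host helper: runs one variant and returns the final count.
// Error checking is omitted to keep the sketch short.
unsigned int run_count(bool per_warp, int n) {
    unsigned int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (per_warp) add_per_warp<<<blocks, threads>>>(d_counter, n);
    else          add_per_thread<<<blocks, threads>>>(d_counter, n);
    unsigned int result = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_counter, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return result;
}

int main() {
    const int n = 1000;
    std::printf("per-thread: %u\n", run_count(false, n));
    std::printf("per-warp:   %u\n", run_count(true, n));
    std::printf("offset of b: %zu bytes\n", offsetof(PaddedCounters, b));
    return 0;
}
```

    Both kernels compute the same count; they differ only in how many atomic operations reach the memory subsystem. Whether that difference, or the padding in PaddedCounters, actually pays off on a particular device is the kind of question the microbenchmarks are meant to answer.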

    BibTeX:

        @inproceedings{10.1145/3764860.3768338,

        author = {Maucher, Peter and Djerfi, Nick and Kittner, Lennard and Werling, Lukas and Bellosa, Frank},

        title = {Are Your GPU Atomics Secretly Contending?},

        year = {2025},

        isbn = {979-8-4007-2225-7/25/10},

        publisher = {Association for Computing Machinery},

        address = {New York, NY, USA},

        url = {https://doi.org/10.1145/3764860.3768338},

        doi = {10.1145/3764860.3768338},

        abstract = {GPU applications use atomic operations to coordinate data access in highly parallel code. However, relying on previous experiences and due to limited documentation, programmers resort to guidelines instead of concrete metrics to limit potential performance influences. 
    In this paper, we introduce a GPU memory-subsystem microbenchmark suite for analyzing GPU atomic operations. Based on the benchmark results, we discuss two particular guidelines, namely: “use only one thread per warp to access an atomic” and “place two atomic variables on different cache lines to avoid contention.” We demonstrate where these guidelines are effective and where actual hardware behavior diverges.},

        booktitle = {PLOS '25: Proceedings of the 13th Workshop on Programming Languages and Operating Systems},

        pages = {84--92},

        numpages = {9},

        keywords = {GPU, Atomic Operations, Atomic Contention, Synchronization, Microbenchmarks},

        location = {Seoul, Republic of Korea},

        series = {PLOS ’25}

        }