Many hardware accelerator architectures use DMA units to transfer data, and these transfers may be constrained by the fixed width of a DMA transfer. Automatic loop tilers currently do not take this limitation into account. We present a compiler pass, implemented in MLIR, that applies polyhedral analysis to the memory access patterns in a loop nest and constrains the candidate tile sizes based on the DMA chunk width. This allows the compiler to tile loops effectively for these architectures.
Presented at the Languages, Compilers, Tools and Theory of Embedded Systems (LCTES) 2023 conference.
Direct Memory Access (DMA) is often used within hardware accelerators to transfer coarse-grained data to and from the host. DMA transfers are often significantly faster than scalar loads and stores. This work extends the Affine Loop Tiling pass in MLIR with a polyhedral analysis that determines a set of tile sizes ensuring that all data copies can be implemented as DMA operations.
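To make the tile-size constraint concrete, the following is a minimal sketch and not the pass itself. It assumes a simplified rule: along the unit-stride dimension, a tile row must occupy a whole number of DMA chunks, and the tile must evenly divide the loop extent. The function name and parameters are hypothetical, and the brute-force search stands in for the polyhedral analysis described above.

```python
def dma_compatible_tile_sizes(extent, elem_bytes, dma_chunk_bytes):
    """Return tile sizes along a contiguous dimension that are DMA-friendly.

    Assumed rule (illustrative only): a tile is acceptable when it evenly
    divides the loop extent and its row width in bytes is a multiple of
    the DMA chunk width.
    """
    sizes = []
    for tile in range(1, extent + 1):
        if extent % tile != 0:
            continue  # tile would leave a partial tile at the boundary
        if (tile * elem_bytes) % dma_chunk_bytes == 0:
            sizes.append(tile)  # row is a whole number of DMA chunks
    return sizes


if __name__ == "__main__":
    # 64 four-byte elements, 64-byte DMA chunks
    print(dma_compatible_tile_sizes(64, 4, 64))  # prints [16, 32, 64]
```

An empty result under this rule would mean that no tile size lets every copy map onto whole DMA chunks, in which case a compiler would presumably have to fall back to scalar loads and stores for that loop nest.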
This work is currently under review as a patch to the upstream LLVM project. The merge request can be found here.