FPGA Acceleration of large 3-D stencils

Computing large 3-D stencils over large grids is a formidable challenge for any computing platform. In this work we achieve the maximum throughput attainable under the off-chip DRAM bandwidth constraint.

It was a grueling task, but we implemented a fully asynchronous data-processing pipeline with five near-perfectly balanced stages: tiling the grid on the host, streaming grid tiles to the FPGA, performing multiple (fused) iterations on the FPGA with the maximum parallelism that on-chip BRAM capacity allows, streaming the results back to the host, and untiling and applying boundary conditions (using convolutional perfectly matched layers) on the host. All five stages execute with complete overlap.
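
To illustrate how tiling interacts with fused iterations, here is a minimal sketch in Python. It is not the project's code: it uses a 1-D grid instead of a 3-D one, a hypothetical 3-point averaging stencil, and fixed-value ends instead of perfectly matched layers. The names RADIUS, FUSED, HALO, step, iterate, and tiled_iterate are all assumptions made for illustration. The point it shows is that a tile needs a halo of RADIUS * FUSED cells on each side, because every fused iteration invalidates RADIUS more cells at each tile edge.

```python
# Illustrative sketch only: a 1-D stand-in for the 3-D tiling scheme.
# All names, sizes, and the stencil itself are assumptions.

RADIUS = 1              # stencil radius (assumed)
FUSED = 3               # iterations fused per pass over a tile (assumed)
HALO = RADIUS * FUSED   # extra cells needed on each side of a tile


def step(cells):
    """Apply one averaging iteration; RADIUS cells at each end stay fixed."""
    out = list(cells)
    width = 2 * RADIUS + 1
    for i in range(RADIUS, len(cells) - RADIUS):
        out[i] = sum(cells[i - RADIUS:i + RADIUS + 1]) / width
    return out


def iterate(cells, n):
    """Apply n iterations to the whole array (untiled reference)."""
    for _ in range(n):
        cells = step(cells)
    return cells


def tiled_iterate(grid, tile):
    """Tile the grid, run FUSED iterations per tile, and untile the results."""
    result = list(grid)
    for start in range(0, len(grid), tile):
        stop = min(start + tile, len(grid))
        lo = max(start - HALO, 0)           # clamp the halo at the grid ends
        hi = min(stop + HALO, len(grid))
        block = iterate(grid[lo:hi], FUSED)  # stand-in for the FPGA kernel
        # Keep only the tile interior; halo cells hold stale values.
        result[start:stop] = block[start - lo:stop - lo]
    return result


grid = [float((i * 7) % 11) for i in range(40)]
tiled = tiled_iterate(grid, tile=7)
reference = iterate(grid, FUSED)
print(all(abs(a - b) < 1e-12 for a, b in zip(tiled, reference)))
```

In this sketch each tile is processed independently of the others, which is the property that lets tiling, transfer, compute, and untiling be overlapped as separate pipeline stages. The cost is redundant work on the halo cells, so a larger FUSED value trades extra computation and on-chip storage for less off-chip traffic.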