An MPI-CUDA Implementation for Massively Parallel Incompressible ... 2.4 TeraFLOPS on the
64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our
results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD)
simulations.
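
As a quick sanity check on the figures quoted above, the following minimal sketch derives the per-node and per-GPU quantities implied by the stated totals. The function name and dictionary keys are hypothetical, and the per-GPU throughput is only an arithmetic average of the aggregate figure, not a measured value.

```python
def cluster_breakdown(nodes, gpus, processing_elements, total_flops):
    """Derive per-unit averages from the aggregate cluster figures.

    All inputs are the totals quoted in the text; the outputs are plain
    averages and assume the resources are spread evenly across GPUs.
    """
    return {
        "gpus_per_node": gpus / nodes,
        "elements_per_gpu": processing_elements / gpus,
        "gflops_per_gpu": total_flops / gpus / 1e9,
    }


# Totals as stated: 64 nodes, 128 GPUs, 30,720 processing elements, 2.4 TeraFLOPS.
summary = cluster_breakdown(
    nodes=64,
    gpus=128,
    processing_elements=30_720,
    total_flops=2.4e12,
)

for key, value in summary.items():
    print(f"{key}: {value:g}")
```

Under these assumptions the totals correspond to 2 GPUs per node, 240 processing elements per GPU, and an average of about 18.75 GFLOPS per GPU.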