Existing parallel programming practices rely on using as many threads as the hardware supports to hide memory access latencies. However this approach is not suitable on the accelerator platform architecture, and in a worst-case scenario can result in resource contention, thereby causing degraded performance. Delivering optimal parallel performance requires a balanced understanding of the platform and orchestrating the workload accordingly.
Due to the complexity of developing platform-specific solutions, the HPC group has developed techniques and frameworks to address common use-cases such as stream computing, and map-reduce on accelerators platforms. Solutions to platform-specific issues such as thread scheduling and memory hierarchies are abstracted within the framework to allow users to benefit from the power of these platforms without having to know the architectural specifics.
Many existing applications are targeted at conventional single threaded architectures, or rely on trivial application of parallel toolkits that may not be suited to the application. Using its experience in developing parallel programming solutions, the HPC group is able to profile an application, isolate bottlenecks and refactor the core logic to enable a more efficient execution on parallel and heterogeneous platforms. For example, stencil operations and sparse matrix-vector multiplication are examples of commonly used kernels in many scientific applications. They are also usually the bottleneck in these applications. Through careful tuning to the underlying architecture and applying various bandwidth optimizations, these kernels can be made to perform faster than the original, thereby improving the turn-around time of the applications.
Deep learning networks have historically required massive amounts of computational resources. This arises from the need to model the large number of interconnections between neurons in the multiple layers of non-trivial neural-networks. Researchers are thus faced with the tradeoff of setting up a large cluster of distributed servers and waiting weeks for the training process to converge, or reducing the size of the training data. By exploiting the inherent parallelism within GPU accelerators, the HPC team aims to build on its existing frameworks to develop easy-to-use GPU-optimized machine learning libraries that bring the power of a large server cluster to within easy reach of more users.