• Design a fast-parallel low-level kernel of the Complex Matrix Multiplication algorithm based on modulo-scheduling, software pipelining and loop unrolling techniques.
• Suggest a novel approach of implementing the Complex Matrix Multiplication algorithm based on the fast-parallel kernel and the miss-pipelining technique.
• Introduce an ultra-optimized parallel implementation approach based on the fast-parallel kernel and the internal direct memory access data transfer technique.
• Accelerate the beamforming and Doppler Filter Bank algorithms to meet tight real-time constraints of radar applications.
The Complex Matrix Multiplication (CMM) algorithm is known to demand high computing performance and to present exceptional challenges in real-life applications. Recent Very Long Instruction Word (VLIW) based Digital Signal Processors (DSPs) have demonstrated high computing capabilities at very low power consumption. In this work, we propose three ultra-fast, parallel and efficient VLIW implementation approaches of the CMM algorithm, which can be used to meet the tighter real-time constraints of several signal and image processing applications such as radar. A novel parallel kernel, a task mapping strategy and low-level optimization techniques are suggested to fit a set of modern VLIW architectures. Additionally, an original memory access management technique is adopted to accelerate the algorithm by avoiding cache misses and bank conflicts. The experimental results showed the effectiveness of the proposed approaches: a peak performance of 15.89 GFLOPS was achieved on one C66x DSP core with a core utilization of 99%, and speedups of about 1.61, 3 and 10 were obtained over the state-of-the-art, the most optimized vendor and the conventional approaches, respectively.
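For orientation, the computation that the abstract refers to can be sketched in portable C. This is a minimal reference sketch, not the authors' VLIW kernel: the names `cfloat`, `cmm_reference`, `cmm_unrolled2` and `cmm_flops` are hypothetical, the row-major interleaved layout is an assumption, and the flop count assumes the common convention of 8 real operations per complex multiply-accumulate, which the abstract does not confirm. The second function only illustrates loop unrolling with two independent accumulators, the kind of transformation that gives a software-pipelining compiler more parallel work per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Single-precision complex value stored as interleaved (re, im).
 * Hypothetical type, introduced for this sketch only. */
typedef struct { float re, im; } cfloat;

/* Reference C = A * B, with A (m x k), B (k x n), C (m x n), row-major. */
void cmm_reference(const cfloat *a, const cfloat *b, cfloat *c,
                   size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc_re = 0.0f, acc_im = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                cfloat x = a[i * k + p];
                cfloat y = b[p * n + j];
                /* One complex multiply-accumulate:
                 * 4 multiplies + 4 additions/subtractions. */
                acc_re += x.re * y.re - x.im * y.im;
                acc_im += x.re * y.im + x.im * y.re;
            }
            c[i * n + j].re = acc_re;
            c[i * n + j].im = acc_im;
        }
    }
}

/* Same product with the inner loop unrolled by two. The two accumulator
 * pairs carry no dependence on each other, so their updates can overlap. */
void cmm_unrolled2(const cfloat *a, const cfloat *b, cfloat *c,
                   size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float re0 = 0.0f, im0 = 0.0f, re1 = 0.0f, im1 = 0.0f;
            size_t p = 0;
            for (; p + 1 < k; p += 2) {
                cfloat x0 = a[i * k + p],     y0 = b[p * n + j];
                cfloat x1 = a[i * k + p + 1], y1 = b[(p + 1) * n + j];
                re0 += x0.re * y0.re - x0.im * y0.im;
                im0 += x0.re * y0.im + x0.im * y0.re;
                re1 += x1.re * y1.re - x1.im * y1.im;
                im1 += x1.re * y1.im + x1.im * y1.re;
            }
            if (p < k) { /* tail iteration when k is odd */
                cfloat x = a[i * k + p], y = b[p * n + j];
                re0 += x.re * y.re - x.im * y.im;
                im0 += x.re * y.im + x.im * y.re;
            }
            c[i * n + j].re = re0 + re1;
            c[i * n + j].im = im0 + im1;
        }
    }
}

/* Real floating-point operations for an (m x k) by (k x n) product,
 * assuming 8 flops per complex multiply-accumulate. */
double cmm_flops(size_t m, size_t k, size_t n)
{
    return 8.0 * (double)m * (double)k * (double)n;
}
```

Dividing `cmm_flops(m, k, n)` by a measured run time in seconds, then by 1e9, gives a GFLOPS figure under this counting convention. Because the unrolled variant sums in a different order, its results can differ from the reference in the last bits for general floating-point inputs.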