Details

Author(s) / Contributors
Title
Ultra-fast and efficient implementation schemes of complex matrix multiplication algorithm for VLIW architectures
Is part of
  • Computers & electrical engineering, 2022-09, Vol.102, p.108294, Article 108294
Place / Publisher
Elsevier Ltd
Year of publication
2022
Source
Alma/SFX Local Collection
Descriptions/Notes
  • Design a fast-parallel low-level kernel of the Complex Matrix Multiplication algorithm based on modulo scheduling, software pipelining and loop unrolling techniques.
  • Suggest a novel approach to implementing the Complex Matrix Multiplication algorithm based on the fast-parallel kernel and the miss-pipelining technique.
  • Introduce an ultra-optimized parallel implementation approach based on the fast-parallel kernel and the internal direct memory access data transfer technique.
  • Accelerate the beamforming and Doppler Filter Bank algorithms to meet tight real-time constraints of radar applications.

  The Complex Matrix Multiplication (CMM) algorithm is known to require high computing performance and to present exceptional challenges in real-life applications. Recent advances in Very Long Instruction Word (VLIW) based Digital Signal Processors (DSPs) have demonstrated high computing capabilities with very low power consumption. In this work, we propose three ultra-fast, parallel and efficient VLIW implementation approaches for the CMM algorithm, which can be used to meet the tighter real-time constraints of several signal and image processing applications such as radar. A novel parallel kernel, a task mapping strategy and low-level optimization techniques are suggested to fit a set of modern VLIW architectures. Additionally, an original memory access management technique is adopted to accelerate the algorithm by avoiding cache misses and bank conflicts. The experimental results show the effectiveness of the proposed approaches: a peak performance of 15.89 GFLOPS was achieved on one C66x DSP core with a core utilization of 99%, with speedups of about 1.61, 3 and 10 over the state-of-the-art, the most optimized vendor and the conventional approaches, respectively.
Language
English
Identifiers
ISSN: 0045-7906
eISSN: 1879-0755
DOI: 10.1016/j.compeleceng.2022.108294
Title ID: cdi_crossref_primary_10_1016_j_compeleceng_2022_108294
