Huthmann, Jens Christoph (2017)
An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code.
Technische Universität Darmstadt
Dissertation, first publication
Abstract
The performance improvement of conventional processors has begun to stagnate in recent years. Because of this, researchers are looking for new ways to improve the performance of computing systems. Heterogeneous systems have turned out to be a powerful option. In the context of this thesis, a heterogeneous system consists of a software-programmable processor and an FPGA-based configurable hardware accelerator. By using an accelerator specifically tailored to a particular application, heterogeneous systems can achieve higher performance than conventional processors.
Due to their increased complexity, developing applications for heterogeneous systems is more complicated than for conventional systems based on a software-programmable processor. Different languages have to be used to program the software and hardware parts, and additional specialised hardware knowledge is required. Both factors increase the development cost.
This work presents the compiler framework Nymble, which allows a heterogeneous system to be programmed with only a single high-level language. In the high-level language, the developer only has to select which parts of the application should be executed in hardware. Nymble then generates a program for the software processor, the configuration of the hardware, and all interfaces between software and hardware.
All heterogeneous systems supported by Nymble have in common that the software and hardware parts of an application have access to a shared memory. As this memory is external RAM with high access latency, it is necessary to insert a cache between the memory and the hardware. With this cache, the latency of a memory access can vary between very short and very long, depending on whether the data is available in the cache.
To hide long latencies, this thesis presents a novel execution model which allows the simultaneous execution of multiple threads in a single accelerator. Additionally, the model enables threads to be dynamically reordered at specific points in the common accelerator pipeline. This capability is used to let other (non-waiting) threads overtake a thread which is waiting for a memory access. Thus, these other threads can execute their calculations independently of the waiting thread to bridge the latency of memory accesses.
Previous work uses execution models which allow only a single thread to be active in the accelerator. When a memory access has a long latency, the thread is swapped out for another, non-waiting thread. This hardware design often causes many resources to lie idle for a significant amount of time.
In contrast, the novel execution model presented here dynamically spreads multiple threads over the pipeline, which results in a higher utilisation of the resources. Furthermore, the simultaneous execution of multiple threads can achieve throughput similar to that of multiple copies of a single-threaded accelerator running in parallel.
The new execution model makes it possible to combine the improved throughput of multiple copies with the increased efficiency of simultaneous threads in a single accelerator. Thread reordering allows the new model to be used effectively with a cached shared memory.
In a comparison between four copies of a single-threaded accelerator and a multi-threaded accelerator with four threads (both created by Nymble), a resource efficiency of up to a factor of 2.6x can be achieved. At the same time, four simultaneous threads can be up to 4x as fast as four threads executed consecutively on a single accelerator. Compared to other, more optimised compilers, Nymble can still achieve up to 2x faster runtime with 1.5x resource efficiency.
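The latency-hiding idea described in the abstract can be illustrated with a toy cycle count. The sketch below is only an assumption-laden model, not Nymble's hardware or the thesis's evaluation method: the function names `serial_cycles` and `interleaved_cycles`, the single shared issue slot, the fixed lowest-index-first priority, and the latency values are all hypothetical. It compares threads executed consecutively with threads that may overtake one another while a memory access is outstanding.

```python
# Toy cycle-count model (illustrative assumption, not Nymble's actual design).
# Each thread is a list of operation latencies in cycles: 1 models a compute
# step or cache hit, a larger value models a cache miss.

def serial_cycles(threads):
    """Threads run one after another, so every latency is fully exposed."""
    return sum(sum(ops) for ops in threads)


def interleaved_cycles(threads):
    """One shared issue slot per cycle; a waiting thread can be overtaken."""
    count = len(threads)
    next_op = [0] * count    # index of the next operation of each thread
    ready_at = [0] * count   # cycle at which each thread may issue again
    cycle = 0
    finish = 0
    while any(next_op[t] < len(threads[t]) for t in range(count)):
        for t in range(count):
            if next_op[t] < len(threads[t]) and ready_at[t] <= cycle:
                latency = threads[t][next_op[t]]
                ready_at[t] = cycle + latency   # thread waits for this result
                finish = max(finish, ready_at[t])
                next_op[t] += 1
                break                           # at most one issue per cycle
        cycle += 1
    return finish


# Four identical threads: compute, one cache miss of 10 cycles, compute.
threads = [[1, 10, 1] for _ in range(4)]
print(serial_cycles(threads), interleaved_cycles(threads))  # prints "48 18"
```

In this toy setting the misses of the four threads overlap, so most of the waiting time is hidden. The resulting numbers are a property of the made-up latencies only and are unrelated to the measurements reported in the thesis.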
Item type: | Dissertation | ||||
---|---|---|---|---|---|
Published: | 2017 | ||||
Author(s): | Huthmann, Jens Christoph | ||||
Type of entry: | First publication | ||||
Title: | An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code | ||||
Language: | English | ||||
Referees: | Koch, Prof. Dr. Andreas ; Berekovic, Prof. Dr. Mladen | ||||
Year of publication: | 2017 | ||||
Place: | Darmstadt | ||||
Date of oral examination: | 21 August 2017 | ||||
URL / URN: | http://tuprints.ulb.tu-darmstadt.de/6776 | ||||
URN: | urn:nbn:de:tuda-tuprints-67767 | ||||
Dewey Decimal Classification (DDC) subject group: | 000 Generalities, computer science, information science > 004 Computer science; 600 Technology, medicine, applied sciences > 620 Engineering and mechanical engineering |
||||
Department(s)/field(s): | 20 Department of Computer Science; 20 Department of Computer Science > Embedded Systems and Applications (Eingebettete Systeme und ihre Anwendungen) |
||||
Date deposited: | 03 Dec 2017 20:55 | ||||
Last modified: | 03 Dec 2017 20:55 | ||||