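Research and Implementation of Programming Models for the Accelerator Core of a Mathematical Processor (代数处理器加速核编程模型研究和实现)
Author: Yang Lei (杨磊)
Degree type: Doctor of Engineering
Advisor: Wang Donglin (王东琳)
2018-05-24
Degree-granting institution: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Keywords: heterogeneous system accelerator core; instruction-level parallelism; finite state machine; programming model; compiler design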
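Abstract: In recent years, heterogeneous multicore processor architectures have become a popular research direction and are widely applied in scientific computing, graphics and image processing, communications, and other fields. Compared with traditional single-core processors and symmetric multicore processors, the accelerator cores in heterogeneous multicore systems are usually structured and optimized for specific application domains, and therefore offer higher computing performance and/or lower power consumption. However, while accelerator cores bring high performance and low power consumption, they also pose major challenges for programming methods. Accelerator cores are generally designed for computing applications in particular domains, so their architectures differ fundamentally from those of general-purpose processors. Vector structures, instruction-level parallelism, and dataflow structures are often used in accelerator cores. These features prevent traditional programming methods from exploiting the low-level hardware optimizations effectively, so new programming methods are needed to raise hardware resource utilization. In addition, programmers must spend considerable effort adapting to these specialized structures, and developing high-performance accelerator programs that fully exploit the hardware is very difficult.
Starting from the structural characteristics of the accelerator core of the mathematical processor MaPU (Mathematical Processing Unit), this thesis studies programming models at different levels and attempts to implement their compilers. The main work and contributions are summarized as follows:
1. A method for reconfiguring hardware state machines is proposed and refined
Describing and controlling processor behavior with state machines is a fairly common approach. A VLIW (Very Long Instruction Word) instruction-level parallel processor contains multiple mutually independent functional units that can be controlled by concurrent instructions. Each functional unit has its own state machine, and the processor as a whole forms a compound state machine. Because the compound state machine is structurally complex, it is difficult to design a circuit that describes it directly. If an independent state machine can be designed for each functional unit, the design difficulty drops and designing the state machine control circuit becomes more feasible. To make it possible to design the processor's overall state machine by way of independent sub-state machines, we propose a hardware state machine reconfiguration method for instruction-level parallel processors. The method comprises several main sub-methods: how to determine whether sub-state machines are reconfigurable, how to apply equivalence transformations to state machines, and how to merge independent sub-state machines into a compound state machine. State machine reconfiguration is an effective way to configure the hardware state machines of processors with a VLIW-style multi-issue structure, and it also provides the theoretical basis for the low-level language programming model of the mathematical processor accelerator core introduced below.
2. A low-level language programming model for the mathematical processor based on horizontal macros is proposed
To develop high-performance algebraic algorithm libraries that fully exploit the low-level hardware structure, we designed a cycle-accurate programming model for the mathematical processor based on horizontal macro splicing. Building on the principle of state machine reconfiguration, the model uses horizontal macros to describe the state machines of independent functional units and then splices the horizontal macros to generate VLIW-style code for the processor. This work presents a design for a low-level programming model for a custom processor. The proposed macro-instruction programming model suits coarse-grained reconfigurable accelerator cores with explicit instruction-level concurrency, and it offers a reference for designing programming models for processors with similar structures. Algebraic computing programs developed with the macro-instruction programming model achieve high hardware resource utilization and thus computing performance as high as possible. Moreover, compared with purely hand-written assembly, the model saves a large amount of programming effort, which shows it to be an effective solution for developing efficient low-level algorithm libraries.
3. A domain-specific high-level language programming model for the mathematical processor accelerator core is proposed and implemented
Building on the low-level programming model, we studied and proposed a domain-specific high-level language programming model for the mathematical processor accelerator core. The model is designed around the low-level structure of the processor, targeting instruction-level multi-issue and the hardware pipeline. The overall design has three components: a domain-specific language (DSL), a compiler, and a runtime library. The language design reflects the structure of the accelerator core and aims to provide a higher-level abstraction of the underlying hardware. The compiler is designed jointly around the language and the hardware structure; it implements a lexical analyzer, a syntax analyzer, an abstract syntax tree, and a directed acyclic graph of instructions, and it can generate low-level hardware pipeline code suited to the processor structure. The runtime library provides, at the high-level language layer, the data types and built-in function interfaces needed to call the low-level high-performance computing libraries, and it adds commonly used helper functions for the specific computing system. This work designs a programming model for the mathematical processor accelerator core and provides a corresponding implementation, offering a new attempt at, and solution for, high-level language programming of the accelerator core.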
Other abstract

In recent years, heterogeneous processor architectures have become an increasingly popular research topic. They are widely used in areas such as supercomputing, image processing, and communications. Compared with traditional single-core or symmetric multicore processors, heterogeneous computing systems can achieve higher computing performance and/or lower energy consumption, because their accelerator cores are specially designed and optimized for particular applications. However, while accelerators bring high computing performance and low power consumption, they also pose major challenges for programming methods. Accelerators are generally designed for specific domains, and their structure differs greatly from that of general-purpose processors. They often adopt features such as vector structures, instruction-level parallelism (ILP), and dataflow mechanisms. With these features, traditional programming methods cannot exploit the hardware's low-level optimizations effectively, so new programming methods are needed to raise hardware utilization. At the same time, programmers must work hard to adapt their programs to these specialized structures, which makes developing high-performance accelerator programs very difficult.

Based on the structure of the accelerator core of the mathematical processor MaPU (Mathematical Processing Unit), this thesis studies programming models at different levels and attempts to implement the corresponding compilers. The main contributions are summarized as follows:

1. A hardware finite state machine reconfiguration method

It is common to use finite state machines (FSMs) to describe and control processors. A processor that adopts instruction-level parallelism in the very long instruction word (VLIW) style contains many independent functional units that are controlled by parallel instructions. Each functional unit has its own FSM, and the processor as a whole forms a large compound FSM. Designing the compound FSM for the entire processor directly is very complex. Designing an FSM for each functional unit individually is much easier, and it improves the programmability of the processor. To make it possible to obtain the configuration of the whole processor from individually designed state machines, we propose a hardware FSM reconfiguration method for ILP processors. The method includes sub-methods for examining whether individual FSMs can be reconfigured, transforming state machines into equivalent forms, and merging individual FSMs into a compound state machine. The FSM reconfiguration method is an effective way to configure the state machines of VLIW-style ILP processors, and it is the theoretical basis for the low-level programming model introduced in the following chapters.
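
The merging step can be illustrated with a small sketch. It is not the thesis's method, which is not detailed in this abstract; it is the textbook synchronous product of independent FSMs, where a compound state is a tuple of per-unit states and a compound instruction is a tuple of per-unit operations. The unit names (`alu`, `mem`) and the operations are hypothetical.

```python
from itertools import product

def merge_fsms(fsms):
    """Synchronous product of independent FSMs (illustrative assumption).

    Each FSM is {"start": state, "delta": {(state, op): next_state}}.
    Only compound states reachable from the compound start state are built.
    """
    start = tuple(f["start"] for f in fsms)
    delta, seen, frontier = {}, {start}, [start]
    while frontier:
        state = frontier.pop()
        # For each unit, list the (op, next_state) pairs legal in its sub-state.
        choices = [
            [(op, nxt) for (s, op), nxt in f["delta"].items() if s == sub]
            for f, sub in zip(fsms, state)
        ]
        # One op per unit, issued in the same cycle, forms one VLIW-style word.
        for combo in product(*choices):
            word = tuple(op for op, _ in combo)
            nxt = tuple(n for _, n in combo)
            delta[(state, word)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {"start": start, "states": seen, "delta": delta}

# Hypothetical functional units with made-up operations.
alu = {"start": "idle", "delta": {("idle", "add"): "busy",
                                  ("idle", "nop"): "idle",
                                  ("busy", "nop"): "idle"}}
mem = {"start": "idle", "delta": {("idle", "load"): "wait",
                                  ("idle", "nop"): "idle",
                                  ("wait", "nop"): "idle"}}
compound = merge_fsms([alu, mem])
```

Even in this toy case the two 2-state units produce a 4-state compound machine, which hints at why designing the compound FSM directly becomes hard as the number of units grows.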

2. A low-level ‘horizontal macro programming model’ for the mathematical processor

We propose a cycle-accurate programming model based on horizontal macros, which exploits hardware features to develop high-performance arithmetic libraries. The programming model is based on FSM reconfiguration: individual state machines are represented by horizontal macros, and the compiler combines all the horizontal macros to generate very long instruction word (VLIW) style code. In this thesis we demonstrate the design of the programming model for the target processor. The proposed horizontal macro programming model is suitable for coarse-grained reconfigurable processors with ILP features, and it can serve as a reference for processors with similar structures. With this programming model, programmers can take full advantage of hardware resources when developing arithmetic programs and so achieve computing performance as high as possible. Moreover, the programming model saves a great deal of programming work compared with purely hand-written assembly, which indicates that it is an effective solution for developing low-level, high-performance arithmetic libraries.
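
As a rough illustration of horizontal macro splicing, the sketch below places per-unit operation sequences side by side, cycle by cycle, to form VLIW-style bundles. The macro representation (unit name, start cycle, list of operations), the unit names, and the mnemonics are all assumptions made for this example and are not MaPU syntax.

```python
def splice_macros(macros, nop="nop"):
    """Splice per-unit macros into cycle-by-cycle instruction bundles.

    macros: list of (unit_name, start_cycle, [op for each cycle]).
    Assumes at most one macro per functional unit (hypothetical restriction).
    """
    units = [name for name, _, _ in macros]
    if len(set(units)) != len(units):
        raise ValueError("this sketch supports one macro per functional unit")
    total_cycles = max(start + len(ops) for _, start, ops in macros)
    bundles = []
    for cycle in range(total_cycles):
        bundle = {}
        for name, start, ops in macros:
            i = cycle - start
            # Outside the macro's active window the unit is padded with a nop.
            bundle[name] = ops[i] if 0 <= i < len(ops) else nop
        bundles.append(bundle)
    return bundles

# A made-up software pipeline: load, then multiply-accumulate, then store.
bundles = splice_macros([
    ("load",  0, ["ld r0", "ld r1"]),
    ("mac",   1, ["mac r0", "mac r1"]),
    ("store", 2, ["st r2"]),
])
for cycle, bundle in enumerate(bundles):
    print(cycle, " | ".join(bundle[u] for u in ("load", "mac", "store")))
```

Each macro is written and reasoned about in isolation, and the cycle offsets decide how the units overlap, which is where the saving over hand-written full-width assembly would come from.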

3. A high-level programming model for the accelerator of the mathematical processor

Based on the low-level programming model, we propose a domain-specific high-level programming model for the accelerator of the mathematical processor. The programming model is designed around the unique structure of the target processor, in particular its ILP and hardware pipeline. It includes three parts: a domain-specific language (DSL), a compiler, and runtime libraries. The DSL is designed according to the structure of the accelerator and aims to provide a suitable high-level abstraction of the hardware. The compiler is built according to the language and the processor features; it implements a lexical analyzer, a syntax analyzer, an abstract syntax tree, and a directed acyclic graph of instructions in order to compile custom source code into hardware pipeline code. The runtime libraries provide built-in functions for calling the high-performance arithmetic libraries, and they also integrate some frequently used functions for the specific computing platform. By designing and implementing this programming model, we attempt to offer a new solution to the programming problems of mathematical processors.
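
To show how a directed acyclic graph of instructions can drive code generation for a multi-issue machine, here is a minimal sketch of as-soon-as-possible scheduling. It assumes unit latency and unlimited functional units, uses Python's standard `graphlib` module (Python 3.9+), and invents the instruction strings; the thesis's compiler is not described at this level in the abstract.

```python
from graphlib import TopologicalSorter

def asap_schedule(deps):
    """Group instructions into issue cycles from a dependency DAG.

    deps: {instruction: set of instructions it depends on}.
    Assumes unit latency and no resource limits (simplifying assumptions).
    """
    level = {}
    # static_order() yields every instruction after all of its predecessors.
    for instr in TopologicalSorter(deps).static_order():
        preds = deps.get(instr, ())
        level[instr] = 1 + max((level[p] for p in preds), default=-1)
    n_cycles = max(level.values()) + 1 if level else 0
    cycles = [[] for _ in range(n_cycles)]
    for instr, lv in level.items():
        cycles[lv].append(instr)
    return [sorted(c) for c in cycles]

# Hypothetical instruction DAG for the expression (a * b) + a.
deps = {
    "t0 = load a": set(),
    "t1 = load b": set(),
    "t2 = mul t0, t1": {"t0 = load a", "t1 = load b"},
    "t3 = add t2, t0": {"t2 = mul t0, t1", "t0 = load a"},
}
schedule = asap_schedule(deps)
```

Instructions that land in the same cycle have no dependencies on each other and could be issued together; a real backend would also have to respect functional-unit counts and pipeline latencies.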

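Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/21049
Collection: Graduates / Doctoral Dissertations
Author affiliation: Institute of Automation, Chinese Academy of Sciences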
Recommended citation (GB/T 7714):
杨磊. 代数处理器加速核编程模型研究和实现[D]. 北京. 中国科学院研究生院,2018.
Files in this item
File name/size: CASIA Thesis yangl-最 (5414KB)
Document type: Thesis
Access: Not yet open
License: CC BY-NC-SA