CASIA OpenIR > Graduates > Doctoral Dissertations
Research and Design of On-Chip Multi-Core Shared-Memory Architecture (片内多核共享存储体系结构研究与设计)
孟洪宇
Subtype: Doctoral
Thesis Advisor: 王东琳
2019-05
Degree Grantor: Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Computer Applied Technology
Keywords: multi-core shared memory; interconnect protocol; interconnect architecture; task scheduling; design exploration
Abstract

With the rapid progress of semiconductor technology and integrated circuits, the processor, an important branch of the integrated-circuit field, has seen enormous performance gains. However, as information processing grows explosively, the demands on processor performance keep rising and diversifying, and traditional single-core processors face severe challenges. The performance gap between computation and memory, the difficulty of designing instruction-level parallelism, and high power consumption all constrain the development and application of traditional processors. As an effective answer to the "frequency wall" and "power wall", multi-core architectures have gradually been adopted in modern processor design. But as more resources are integrated on chip, the scale of the architecture and the difficulty of designing the on-chip multi-core structure both grow, and the "bandwidth wall" problem emerges. The memory architecture is the heart of current multi-core processor design, and its design determines the overall performance of the cores and of the chip system.

Memory system design currently faces two main difficulties. The first is how to design a high-performance memory system that accelerates data supply and increases parallel data transfer, so that it meets the cores' data-supply requirements and keeps them running at high peak-performance utilization. The second is how to design an efficient memory system whose throughput and buffering capability are fully exploited through reasonable scheduling, so that it satisfies the cores' data demands at minimum hardware cost. Targeting these two difficulties, and considering the impact of the on-chip interconnect on memory-bank throughput, this thesis conducts in-depth research from two angles: interconnect design and shared-memory design. The proposed interconnect protocol and interconnect architecture enable high-performance memory system design, and the proposed design-exploration method enables efficient memory system design. The main research contents and contributions are summarized as follows:

1. Targeting the parallelism and continuity requirements of memory-system transfers, this thesis proposes a memory-bank-oriented interconnect protocol, the AEC protocol, so that the memory system's throughput is not limited at the protocol level. Specifically, AEC merges write data and write address into a single transfer channel and pairs each address signal with exactly one data signal, which both makes efficient use of the transfer signals and reduces the hardware cost of protocol conversion at the memory banks. Drawing on the AXI protocol, AEC gives each transfer channel its own independent handshake signals to support parallel multi-channel, multi-port transfers, and it supports outstanding and out-of-order transactions to keep transfers continuous. In addition, so that AEC-based interconnect buses can scale to multi-level interconnection, multi-level cascading, and multiple design topologies, the protocol defines the function and usage of the ID signals in each channel and introduces point-to-point streaming transfers. Hardware design experiments show that a crossbar interconnect based on the AEC protocol has lower hardware cost than an AXI crossbar; in a 28 nm process, a 16-channel shared-memory system built on the AEC crossbar reaches a peak throughput of 1.6 Tbps.
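The merged write channel and per-channel handshake described above can be illustrated with a small behavioral model. The following Python sketch is a hypothetical simplification (the record does not publish AEC's signal-level details): each beat carries one (address, data) pair, a beat completes only when both valid and ready are high, and channels handshake independently, so back-pressure on one channel never blocks another.

```python
# Behavioral sketch of an AEC-style merged write channel (hypothetical
# simplification, not the thesis's actual signal definition): each beat
# carries one (address, data) pair, and a beat completes only when both
# `valid` (master side) and `ready` (slave side) are high in that cycle.

class WriteChannel:
    def __init__(self):
        self.queue = []        # pending (address, data) beats from the master
        self.completed = []    # beats accepted by the memory bank

    def push(self, address, data):
        """Master issues one beat: address and data travel together."""
        self.queue.append((address, data))

    def cycle(self, ready):
        """One clock cycle: a transfer completes iff valid AND ready."""
        valid = bool(self.queue)
        if valid and ready:
            self.completed.append(self.queue.pop(0))

# Two independent channels; channel 1 is back-pressured for two cycles.
ch0, ch1 = WriteChannel(), WriteChannel()
for i in range(4):
    ch0.push(0x100 + i, i)
    ch1.push(0x200 + i, 10 + i)

for t in range(4):
    ch0.cycle(ready=True)          # channel 0 drains one beat per cycle
    ch1.cycle(ready=(t >= 2))      # channel 1 stalled while t < 2

print(len(ch0.completed))  # 4: channel 0 unaffected by the stall on channel 1
print(len(ch1.completed))  # 2: only the cycles where ready was high
```

The key property the sketch shows is independence: the stall on channel 1 costs it beats, but channel 0 still completes one beat per cycle.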

2. Targeting the highly parallel transfer requirements of distributed memory systems, this thesis proposes a fully connected interconnect architecture for multi-channel memory systems, FMN, which ensures that the memory system's parallelism is used effectively. Specifically, FMN adopts a mesh topology with a distributed node layout, making the interconnect easy to realize in physical design; multiple transfer channels between adjacent nodes guarantee maximum bisection bandwidth and highly parallel transfers. Based on the multi-channel definition, the FMN nodes are structurally optimized to reach minimum hardware cost. Simulations show that FMN achieves throughput consistent with a crossbar. Hardware design experiments show that, in a 28 nm process, a distributed shared-memory system based on a 64-node FMN reaches a peak throughput of 11.2 Tbps.
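The bisection-bandwidth claim above follows from simple mesh arithmetic, sketched below. All parameters in the example (channel count, channel width, clock frequency) are illustrative assumptions, not figures taken from the thesis.

```python
# Back-of-the-envelope bisection bandwidth for an n x n mesh in which each
# adjacent-node link carries `channels` parallel channels. A vertical cut
# through the middle of the mesh crosses one link per row, i.e. n links, so
# bisection BW = n * channels * width * frequency.
# Parameters below are illustrative, not the thesis's design point.

def mesh_bisection_bw_gbps(n, channels, width_bits, freq_ghz):
    """Bisection bandwidth in Gbit/s of an n x n mesh with multi-channel links."""
    links_cut = n                     # the bisection cut crosses n links
    return links_cut * channels * width_bits * freq_ghz

# Example: 8 x 8 mesh (64 nodes), 2 channels per link, 64-bit channels at 1 GHz.
bw = mesh_bisection_bw_gbps(n=8, channels=2, width_bits=64, freq_ghz=1.0)
print(bw)  # 1024.0 Gbit/s across the bisection
```

Doubling the channels per link doubles the bisection bandwidth without changing the topology, which is the lever the multi-channel FMN design pulls.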

3. For efficient memory system design, this thesis proposes a task-scheduling-based design-exploration method that guides the choice of the number, capacity, and bandwidth of memory banks. Specifically, for multi-core shared-memory architectures with SPM storage, a homogeneous multi-core task-scheduling algorithm, HoEFT, is proposed. The algorithm introduces memory-system modeling and a data-movement sub-algorithm to simulate multi-core data transfers; it partitions tasks and data according to memory-bank capacity to maximize local computing efficiency; and, built on list scheduling, it schedules tasks by task rank and earliest finish time. Experiments show that, on the DGEMM target application, HoEFT achieves higher core utilization than manual scheduling and than a cache-based memory. To give concrete design guidance, the thesis applies the task-scheduling-based exploration method to a multi-core shared-memory system with DGEMM as the target application, summarizes the data-transfer and task-execution patterns, and derives the constraint relationships between memory-system design parameters and core design parameters, yielding design guidance for the whole system.
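The rank-then-earliest-finish-time scheduling loop described above can be sketched as a small list scheduler. This is a hypothetical simplification in the spirit of HoEFT (homogeneous cores, unit costs given directly, no memory modeling or data-movement sub-algorithm); the DAG and costs are invented for illustration.

```python
# List scheduling on homogeneous cores, in the spirit of HoEFT (hypothetical
# simplification: no SPM capacity modeling, no communication costs). Tasks are
# ordered by upward rank, then each task is placed on the core that gives it
# the earliest finish time (EFT).

def upward_rank(task, cost, succ, memo):
    """rank(t) = cost(t) + max rank over successors; fixes the list order."""
    if task not in memo:
        memo[task] = cost[task] + max(
            (upward_rank(s, cost, succ, memo) for s in succ[task]), default=0)
    return memo[task]

def schedule(tasks, cost, succ, pred, n_cores):
    memo = {}
    order = sorted(tasks, key=lambda t: -upward_rank(t, cost, succ, memo))
    core_free = [0.0] * n_cores   # cycle at which each core becomes idle
    finish, placement = {}, {}
    for t in order:
        ready = max((finish[p] for p in pred[t]), default=0.0)
        # choose the core that lets this task start (hence finish) earliest
        best = min(range(n_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[best], ready)
        finish[t] = start + cost[t]
        core_free[best] = finish[t]
        placement[t] = best
    return finish, placement

# Tiny diamond DAG: A -> {B, C} -> D, scheduled on 2 homogeneous cores.
tasks = ["A", "B", "C", "D"]
cost = {"A": 2, "B": 3, "C": 3, "D": 1}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
finish, placement = schedule(tasks, cost, succ, pred, n_cores=2)
print(finish["D"])  # 6: B and C overlap on separate cores after A finishes
```

Because B and C have equal rank and no dependence on each other, the EFT step spreads them across both cores, giving a makespan of 2 + 3 + 1 = 6 instead of the serial 2 + 3 + 3 + 1 = 9.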


Pages: 121
Language: Chinese
Document Type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23888
Collection: 毕业生_博士学位论文 (Graduates / Doctoral Dissertations)
Recommended Citation
GB/T 7714
孟洪宇. 片内多核共享存储体系结构研究与设计[D]. 中国科学院自动化研究所, 2019.
Files in This Item:
File Name/Size: 孟洪宇201618014629105.p (6898 KB) · DocType: dissertation · Access: not yet open · License: CC BY-NC-SA
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.