Energy-Efficient Optimized Design of a Digital Signal Processor Core for 5G Communication

	Energy-Efficient Optimized Design of a Digital Signal Processor Core for 5G Communication
	Guo Yang (郭阳)
	2019-06-24
Pages	157
Degree type	Doctoral
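Chinese abstract (English translation)	The 5G mobile communication system is the new generation of mobile communication system proposed to meet future mobile communication demands. According to the IMT2020 white paper, a 5G system must reach a peak rate of 10 Gbps while guaranteeing a minimum user rate of 100 Mbps. Processing massive data quickly and efficiently while keeping hardware overhead low is therefore the primary problem facing future 5G mobile communication systems. The Mathematic Processing Unit (MaPU) is a general-purpose digital signal processor (DSP) developed independently by the National ASIC Design Engineering Technology Research Center of the Institute of Automation, Chinese Academy of Sciences. Its core, the Algebraic Processing Engine (APE), has a 512-bit single instruction multiple data (SIMD) data width and a very long instruction word (VLIW) architecture with 14 instruction slots. These features give MaPU a leading performance-to-power ratio on compute-intensive operators such as the fast Fourier transform (FFT) and finite impulse response (FIR) filtering. However, as a general-purpose processor aimed at supercomputing, communications, multimedia and other fields, MaPU is poorly adapted to 5G communication in several respects and can be optimized further. The aim is to keep exploiting the features and advantages above while fitting communication signal processing more closely, and to handle the highly real-time, massive data of 5G with low hardware overhead and stable operation. This work targets the hardware overhead caused by processing the highly real-time, massive data of 5G communication applications. Building on the original architecture of the MaPU core, it presents an energy-efficient optimized design of the core of the Universal Communication Processor (UCP), a digital signal processor for 5G communication. The core is optimized for high performance and low power in three aspects: algorithm implementation, functional unit architecture design, and storage system design. The main work and contributions of the thesis are summarized as follows:
1. A SIMD parallelization scheme is designed for the encoding algorithm of Polar codes, the control-channel coding scheme newly introduced in the 5G mobile communication standard. Based on a thorough analysis of the algorithm structure, the data dependencies of the algorithm are decomposed and the parallelism among data operations is extracted. A parallel implementation scheme is designed to match the processor's SIMD architecture, and dedicated acceleration instructions are extracted. The algorithm is implemented on a hardware simulator and its performance is evaluated. Its encoding throughput is about 300 times that of existing 5G-compliant accelerators. The register transfer level (RTL) design is completed in a hardware description language and its overhead is evaluated with a synthesis tool; the new instructions add little hardware overhead.
2. Based on the algorithmic requirements of 5G mobile communication baseband processing, the number of functional units in the core and their interconnect architecture are designed, together with the hardware logic of the arithmetic logic unit and the multiply-accumulate unit. Hardware resources are allocated with heavy resource reuse and highly parallel data processing in mind. Because the functional units are among the most complex combinational logic in the processor core, constraining the delay of the functional-unit execution stage raises the clock frequency of the whole core. Heavy reuse of hardware resources reduces area and power overhead. 5G-specific acceleration instructions are also designed. With the functional units integrated into the core, the clock frequency is evaluated to reach 1.4 GHz in a 16-nm process. The evaluated processing performance on 5G communication algorithms leads that of other communication processors in the industry.
3. To address the massive data in 5G mobile communication baseband processing and the functional-unit interconnect and instruction storage overhead brought by the VLIW architecture, the storage system of the core is designed. It comprises a distributed register file, an instruction memory and a data memory, and its performance-to-power ratio is evaluated with 5G-related algorithms. Distributed register file architecture based on dynamic scheduling: through write-port binding and register redirection, the register file separates the physical entity of a register from its logical meaning, which simplifies the bypass network. A dynamic scheduling mechanism is then designed to resolve data scheduling conflicts. Experiments show that this architecture reduces the power of the bypass network. Instruction memory: VLIW instructions are compressed separately along the vertical and horizontal dimensions of a VLIW program. The large amount of space that held no-operation (NOP) instructions is removed, and the NOP instructions are restored when instructions are dispatched to the functional units so that the program still runs correctly. This reduces the area and power that the instruction memory would otherwise spend on storing NOP instructions. Data memory: an interval access mode and a discrete access mode are designed. These two new access modes give the multi-granularity parallel memory more flexible access, reduce memory access latency, and improve the execution efficiency and throughput of the algorithms.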
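As a rough illustration of why Polar encoding suits a wide SIMD datapath (contribution 1 above), the sketch below computes the standard polar transform x = u·F^(⊗n) over GF(2) in log2(N) XOR stages, first bit by bit and then on bits packed into one word, so that each stage becomes a single shift, mask and XOR. This is the textbook transform only. It is not the thesis's parallelization scheme or its dedicated instruction, and frozen-bit selection, CRC attachment and rate matching are omitted. All function names are hypothetical.

```python
def polar_transform(u):
    """Reference version: log2(N) butterfly stages of XORs over a bit list."""
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = list(u)
    step = 1
    while step < n:
        for base in range(0, n, 2 * step):
            for j in range(base, base + step):
                x[j] ^= x[j + step]  # upper half of each block absorbs the lower half
        step *= 2
    return x


def polar_transform_packed(word, n):
    """Same transform with n bits packed into one integer (bit j holds u[j]).

    Each stage updates all positions at once, which is the data-parallel
    structure a SIMD unit can exploit.
    """
    step = 1
    while step < n:
        # Select positions j whose index has the `step` bit clear.
        mask = sum(1 << j for j in range(n) if not j & step)
        word ^= (word >> step) & mask
        step *= 2
    return word


def pack(bits):
    """Pack a bit list into an integer, bit j = bits[j]."""
    return sum(b << j for j, b in enumerate(bits))


def unpack(word, n):
    """Unpack the low n bits of an integer into a bit list."""
    return [(word >> j) & 1 for j in range(n)]
```

For example, `polar_transform([1, 0, 1, 1])` returns `[1, 1, 0, 1]`, and applying the transform twice returns the input, since F·F is the identity over GF(2).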
English abstract	The 5th Generation (5G) mobile communication system is a new-generation communication system proposed to meet future mobile communication demands. According to the IMT2020 white paper, the peak throughput of 5G systems will reach 10 Gbps and the minimum user throughput will reach 100 Mbps. Processing massive data quickly and efficiently with low hardware overhead therefore becomes the primary problem for future 5G mobile communication systems. MaPU is a general-purpose digital signal processor independently designed by the National ASIC Design Engineering Technology Research Center of the Institute of Automation, Chinese Academy of Sciences. Its core, APE, has a 512-bit SIMD width and a 14-slot VLIW architecture. These features give MaPU a leading performance-to-power ratio when processing compute-intensive operators such as FFT and FIR. However, MaPU is a general-purpose processor for supercomputing, communications, multimedia and many other fields, and it is poorly adapted to 5G communication in several respects, so it can be optimized further. While making full use of the features and advantages above, our design is better suited to signal processing in the communication field and is stable with low hardware overhead. This thesis addresses the hardware overhead caused by processing the highly real-time, massive data of 5G communication applications. Based on the original architecture of the MaPU core, the core of UCP is given an energy-efficient optimized design. The core is optimized for high performance and low power in three aspects: algorithm implementation, functional unit architecture design, and storage system design. The main work and innovations of the thesis are summarized as follows:
1. A SIMD parallelization scheme is designed for the encoding algorithm of Polar codes, the control-channel coding method of the 5G mobile communication standard. Based on a thorough analysis of the algorithm, its data dependences are decomposed and the parallelism between operations is extracted. A parallelization scheme is designed to match the SIMD architecture, and dedicated acceleration instructions are extracted. The algorithm is implemented on a hardware simulator and its performance is evaluated. Its throughput is approximately 300 times that of existing 5G-compliant accelerators. The RTL design is completed in a hardware description language and its overhead is evaluated with a synthesis tool. The new instructions add little hardware overhead.
2. Based on the algorithmic requirements of 5G mobile communication baseband processing, the functional units in the core and their interconnection are designed, together with the hardware logic of the ALU and the MAC. Hardware resources are allocated with resource reuse and highly parallel data processing in mind. Because the functional units are among the most complex combinational logic in the processor core, the frequency of the core is raised by constraining the delay of the functional units' execution stage. Heavy reuse of hardware resources reduces area and power overhead. 5G-dedicated acceleration instructions are also designed. With the functional units embedded in the core, its frequency is evaluated to reach 1.4 GHz in a 16-nm process. The processing performance on 5G communication algorithms is ahead of other communication processors in the industry.
3. To address the massive data in 5G mobile communication baseband processing and the functional-unit interconnection and instruction storage overhead caused by the VLIW architecture, the storage system is designed, including a distributed register file, an instruction memory and a data memory. Its performance-to-power ratio is evaluated with 5G communication algorithms. Distributed register file architecture based on dynamic scheduling: through write-port binding and register redirection, the register file separates the physical entity of a register from its logical meaning, which simplifies the bypass network. A dynamic scheduling scheme is designed to resolve data scheduling conflicts. This architecture reduces the power of the bypass network. Instruction memory: the VLIW instructions are compressed separately along the vertical and horizontal dimensions of the VLIW program. A large amount of space for storing NOP instructions is removed, and the NOP instructions are restored when instructions are dispatched to the functional units. This reduces the area and power that the instruction memory spends on storing NOP instructions. Data memory: an interval access mode and a discrete access mode are designed. The two new access modes improve the execution efficiency of the algorithms and increase their throughput.
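One common way to realize the horizontal and vertical NOP removal described for the instruction memory is sketched below. It assumes a per-bundle presence mask (horizontal) and a repeat count for runs of all-NOP bundles (vertical). The encoding and all names are assumptions for illustration. The abstract does not specify the thesis's actual compression format, and only the slot count of 14 is taken from it.

```python
NOP = None   # placeholder for an empty slot (assumption)
SLOTS = 14   # VLIW slot count stated in the abstract


def compress(program):
    """Compress a list of 14-slot bundles into (mask, ops, repeat) entries."""
    out = []
    for bundle in program:
        if len(bundle) != SLOTS:
            raise ValueError("each bundle must have exactly 14 slots")
        # Horizontal: bit i of the mask is set when slot i holds a real operation.
        mask = sum(1 << i for i, op in enumerate(bundle) if op is not NOP)
        ops = [op for op in bundle if op is not NOP]
        if mask == 0 and out and out[-1][0] == 0:
            # Vertical: fold consecutive all-NOP bundles into one counted entry.
            out[-1] = (0, [], out[-1][2] + 1)
        else:
            out.append((mask, ops, 1))
    return out


def decompress(entries):
    """Restore the full bundles, re-inserting NOPs at dispatch time."""
    program = []
    for mask, ops, repeat in entries:
        it = iter(ops)
        bundle = [next(it) if (mask >> i) & 1 else NOP for i in range(SLOTS)]
        for _ in range(repeat):
            program.append(list(bundle))
    return program
```

In this sketch, a four-bundle program with two empty bundles in the middle is stored as three entries, and `decompress(compress(program))` reproduces the original bundles.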
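Keywords	5G communication; Polar codes; functional units; register file; storage system
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/23917
Collection	Graduates_Doctoral theses
Recommended citation (GB/T 7714)	Guo Yang. Energy-Efficient Optimized Design of a Digital Signal Processor Core for 5G Communication [面向 5G 通信数字信号处理器内核的高能效优化设计][D]. Institute of Automation, University of Chinese Academy of Sciences, 2019.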

Files in this item
File name/size	Document type	Version type	Access type	License
郭阳.pdf (4064 KB)	Thesis		Restricted access	CC BY-NC-SA