System on Chip Notes

1. Intro

The success of CMOS* as the most widely used semiconductor* technology is the result of continuously shrinking the key feature size parameters (channel length L_min, transistor width w, and oxide thickness t_ox) of the MOSFET transistors.


Moore’s Law: The number of transistors per chip continues to double every 18–24 months. Note that doubling the number of transistors per area implies shrinking both L_min and the transistor width w by a factor of √2.


CMOS power dissipation density, i.e. power per area, is proportional to the number of transistor devices per area, the switched gate-substrate capacitance per device, the device operation frequency, and the square of the supply voltage.


Reducing L_min, w, and t_ox of transistors in the next CMOS generation by a factor of √2 lowers the gate-substrate capacitance C by √2 as well. Thus, implementing the same circuitry (N, f = const.) lowers power dissipation by roughly 30% when the supply voltage remains constant. However, as the number of devices per area increases by a factor of 2, the chip power dissipation density increases by a factor of √2, i.e. by roughly 40% per CMOS generation.

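This scaling argument can be checked with a few lines of Python, assuming the simple model P ∝ N · C · f · V² that the text uses implicitly:

```python
# Dynamic power density model: P proportional to N * C * f * V^2 (per area).
# One CMOS generation scales L_min, w, and t_ox each by 1/sqrt(2).
import math

s = 1 / math.sqrt(2)      # linear shrink factor per generation

# Gate capacitance C ~ W * L / t_ox  ->  scales by s * s / s = s.
c_scale = s * s / s

# Same circuitry (N, f const.), same V: power follows C only.
same_circuit = c_scale            # ~0.71 -> roughly 30% lower power

# Power density: device count per area doubles, C shrinks by s.
density = 2 * c_scale             # ~1.41 -> roughly 40% higher density

print(f"same circuit: {same_circuit:.2f}, density: {density:.2f}")
```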

There are various reasons why CMOS has become dominant. First, the growth of silicon dioxide (SiO2) on the silicon surface is well controllable, and the fabrication of MOS transistors is highly integrable and easy to design. A second benefit of CMOS circuits results from their electrical behavior: due to the use of complementary transistors, they offer low power dissipation, high noise immunity, and easy cascadability of logic gates.


The source is defined as the source of carriers, so the source of an nMOS is at the lower potential, while the source of a pMOS is at the higher potential. The gate-source voltage V_gs of an nMOS must be positive to generate a conductive channel, whereas V_gs must be negative for a pMOS.


The propagation delay time of CMOS circuits is approximately t_p ∝ (C_load · V_dd · t_ox · L_p) / (μ_p · ε_ox · W_p · (V_dd − |V_tp|)²). In order to obtain higher-speed circuits, the capacitive load C_load, oxide thickness t_ox, channel length L_p, and absolute threshold voltage |V_tp| have to be decreased, whereas carrier mobility μ_p, channel width W_p, relative oxide permittivity ε_ox, and supply voltage V_dd have to be increased. Increasing V_dd lowers the delay because the (V_dd − |V_tp|)² term in the denominator grows faster than the V_dd factor in the numerator. This is valid for minimizing the delay time of a circuit only; if we want to optimize area and power consumption at the same time, there are conflicts.


Before studying low-power design, we must first understand the sources of CMOS power dissipation.

Dynamic power* is mainly related to the functionality of the circuit; it is signal-edge dependent and consists of a capacitive part and a short-circuit part. Static power is related to parasitic effects such as sub-threshold currents, leakage currents, and gate tunneling currents; it is signal-level dependent.


Dynamic Capacitive Power: With a switching activity of α_01, the capacitive power dissipation of CMOS is P_cap = α_01 · C_load · V_dd² · f_clk.

Dynamic Short Circuit Power: Assuming symmetric thresholds (V_tn = |V_tp| = V_t) and a linear input ramp with rise time t_r, both transistors conduct while V_t < V_in < V_dd − V_t, so the short-circuit time is roughly t_sc ≈ ((V_dd − 2·V_t) / V_dd) · t_r.
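A small numeric sketch of the capacitive term; all parameter values below are invented for illustration:

```python
# Dynamic capacitive power: P_cap = alpha_01 * C_load * Vdd^2 * f_clk
alpha_01 = 0.1      # switching activity (fraction of 0->1 edges per cycle)
c_load   = 10e-15   # 10 fF load capacitance
vdd      = 1.0      # supply voltage in volts
f_clk    = 1e9      # 1 GHz clock

p_cap = alpha_01 * c_load * vdd ** 2 * f_clk
print(f"P_cap per gate = {p_cap * 1e6:.1f} uW")   # 1.0 uW
```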

Static Sub-threshold Currents: An ideal MOS transistor should be completely switched off as long as the gate-source voltage is below the threshold voltage level. But in a real transistor, there will be sub-threshold currents.


Static Diode Leakage / Gate Current: The gate oxide of a MOS transistor is not a perfect insulator. There is some marginal resistance, some ionic conduction related to trapped ions inside the oxide, and some tunneling through the oxide.

If everything is done right, there will never be a conducting path between Vdd and GND.
All logic functions can be expressed using NAND and NOR
  • semiconductor

Semiconductors are materials that have an electrical conductivity that is intermediate between that of a conductor and an insulator. They are generally made from elements such as silicon or germanium, and their conductivity can be controlled by adding impurities (doping) to the material.

  • CMOS

CMOS, or Complementary Metal-Oxide-Semiconductor, is a type of technology used to create semiconductor devices, such as transistors. By applying a voltage to the gate of the transistor, it is possible to control the flow of current between source and drain, which allows the transistor to function as an amplifier, switch, etc.

  • Dynamic power dissipation

Dynamic power dissipation is the power consumed by a CMOS circuit when it is actively switching, or changing states. This power is associated with the movement of charges within the circuit and is proportional to the switching frequency of the circuit.

In a CMOS (complementary metal-oxide-semiconductor) circuit, dynamic capacitive power dissipation is caused by the charging and discharging of the parasitic capacitances present within the circuit. It is proportional to the switching frequency of the circuit and the total capacitance of the circuit. As a result, it can be a significant contributor to the overall power consumption of a high-speed CMOS circuit.

  • Static power dissipation

Static power dissipation is the power consumed by a CMOS circuit when it is not actively switching, or when it is in a static state. This power is associated with the leakage of current through the transistors in the circuit and is independent of the switching frequency.


2. SoC Logic Design Recap

As a consequence of DeMorgan’s rule, all logic functions can be expressed by combinations of either NAND or NOR gates. Furthermore, Boolean equations can be easily converted to static CMOS circuits.


In general, a generic model can be used to convert Boolean equations into static CMOS circuits. Each AND function generates serially connected nMOS transistors on a path from the output to GND, complemented by parallel connected pMOS transistors on a path from the output to VDD. Each OR function generates parallel nMOS, complemented by serial pMOS transistors, respectively. Finally, the output is always inverted, due to the switching properties of MOS transistors.

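The universality of NAND can be illustrated with a quick sketch (the gate helper names are my own):

```python
def nand(a: int, b: int) -> int:
    """2-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

# Every basic gate can be built from NAND alone:
def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))   # De Morgan

# Exhaustive check against Python's bitwise operators.
for a in (0, 1):
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
```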

Since storage elements such as registers are also important building blocks of logic circuits, we now examine the internals of a register in more detail.

The basic CMOS storage element consists of a loop of two inverters. Connected together, they form a stable circuit that can stick to either "1" or "0" at the corresponding nodes. The circuit has only outputs, but no inputs; in order to set a specific logic value, we need to open the loop. This can be done by substituting the inverters with NAND gates. A control input x determines the functionality of the NAND gate: for x = 1, the NAND gate operates like an inverter; for x = 0, the output of the NAND gate switches to "1", thus setting "1" into the loop.

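The NAND loop described above can be simulated in a few lines (a sketch; the signal names set_n/reset_n are my own, corresponding to a control input x on each side of the loop):

```python
def nand(a, b):
    return 0 if (a and b) else 1

def settle(set_n, reset_n, q, q_n):
    """Iterate the cross-coupled NAND feedback loop until it stabilizes."""
    for _ in range(4):
        q, q_n = nand(set_n, q_n), nand(reset_n, q)
    return q, q_n

q, q_n = settle(0, 1, 0, 1)    # x = 0 on the set side forces a "1" into the loop
assert (q, q_n) == (1, 0)
q, q_n = settle(1, 1, q, q_n)  # both controls at 1: NANDs act as inverters, value held
assert (q, q_n) == (1, 0)
```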

In contrast to the level-controlled latch, a flip-flop is clock edge-controlled. A flip-flop consists of two serially connected latches, a Master and a Slave.
A clock signal is used to control the enable inputs of both latches.

When the clock signal is “0”, e = 1 for the first latch, the value of input D is set into the Master latch. At the same time, the Slave latch is locked to the previous value of Q.
When the clock signal switches to “1”, e = 1 for the second latch, the Slave latch is set to the current value of the Master latch. Any further change at the input D does not affect the Slave latch, as the Master latch is locked in this state.

Overall, the flip-flop is set to the current value of the input D at the positive clock edge, whereas for all other times, the output Q of the flip-flop is locked.


For each flip-flop, three characteristic parameters are specified: the setup time t_setup, the hold time t_hold, and the clock-to-output delay t_c2q. The first two parameters, t_setup and t_hold, impose restrictions on the input signal of the flip-flop: the input signal D must be stable* for the setup time before the clock edge and for the hold time after the clock edge. This is required in order to guarantee correct setting of the flip-flop at the clock edge and to avoid metastability. The third parameter, t_c2q, specifies the delay after the clock edge until the valid data becomes visible at the output.


Finite state machines are widely used in all kinds of sequential controllers and reactive systems.

Finite State Machines (FSMs) consist of a register bank, input logic, output logic, and a feedback loop. The input logic f(x, u) combines the primary inputs x with the current-state vector u = [Q_1 … Q_n] (the register outputs) to generate the next-state vector v = [D_1 … D_n] (the register inputs). The output logic g(x, u) (Mealy) or g(u) (Moore) generates the output vector y. The clock signal switches the register bank from the current state to the next state.

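The FSM model can be sketched as follows, using a 1-bit toggle machine invented purely for illustration:

```python
def f(x, u):
    """Input logic: next state v from primary input x and current state u."""
    return u ^ x          # toggle the 1-bit state when x == 1

def g(u):
    """Moore-style output logic: output depends on the current state only."""
    return u

u = 0                     # register bank holds the current state
outputs = []
for x in [1, 0, 1, 1]:    # primary input sequence
    outputs.append(g(u))  # output before the clock edge
    u = f(x, u)           # clock edge: register takes the next state v

print(outputs)            # [0, 1, 1, 0]
```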

Communicating FSMs are sequences of combinatorial logic and registers. The maximum clock frequency is limited by the propagation time through the combinatorial logic, the setup time, and the clock-to-output delay of the registers: f_max = 1 / (t_c2q + t_logic,max + t_setup).

T_clk determines how long after the preceding register's output the result of the current round of computation is available at the output of the following register. Note that one register can influence several downstream registers, and a register can be influenced by several upstream registers.
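A numeric sketch of this timing constraint; the delay values are invented for illustration:

```python
# Maximum clock frequency of communicating FSMs:
#   T_clk >= t_c2q + t_logic_max + t_setup,   f_max = 1 / T_clk_min
t_c2q       = 0.2e-9   # clock-to-output delay of the registers (200 ps)
t_logic_max = 1.5e-9   # longest combinatorial path between registers
t_setup     = 0.3e-9   # setup time of the receiving register

t_clk_min = t_c2q + t_logic_max + t_setup
f_max = 1 / t_clk_min
print(f"f_max = {f_max / 1e6:.0f} MHz")   # 500 MHz
```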
  • Stable Data Input of FF

The data input is latched (or captured) at the moment the clock edge occurs, which is typically the rising edge of the clock pulse. The flip-flop then holds the captured data in its internal memory until the next clock edge occurs, at which point the process repeats. It is important to ensure that the data input remains stable before and after the clock edge occurs in order to ensure the correct operation of the flip-flop.

3. SoC Paradigm*

Boolean Algebra enables a formal mathematical treatment of logic circuitry expressions and transformation/reduction into simpler expressions – the beginning of logic term minimization. At the beginning of the 90s, the trend resulted in more abstract building blocks in IC design.


To keep up with Moore’s law, the entire circuit can be represented as a SoC platform. SoC is an integrated system design paradigm where large portions of a chip are assembled from already existing function blocks maintained in so-called Core or Macro Libraries. Examples of SoC function blocks are: microprocessor cores (ARM, MIPS, PowerPC), embedded SRAM, on-chip busses (AMBA, CoreConnect, OCP), network interfaces (10/100 Ethernet, Gigabit Ethernet, SONET/SDH), system interfaces (PCI, Rapid I/O), and standard peripherals (UART, I2C, GPIO).


Why not implement all functionality in software running on multiple microprocessors? Why not implement everything in hardware? Computational Density and Functional Diversity are the main motivations for choosing the particular implementation technology that best matches the flexibility/performance requirements of a SoC function.

The most flexible alternative to implement a certain function is by means of a general-purpose processor (CPU). The functionality of the CPU is solely determined by the instruction sequences in program memory. At the other extreme of the flexibility dimension is custom IC or custom ASIC technology. Once a certain function has been designed and manufactured in custom IC technology, it is rock solid and cannot be changed without re-design and re-manufacturing.
However, when comparing the two implementations (CPU vs. custom IC) of one and the same function from a chip area and power consumption perspective, custom IC is up to a factor of 10,000 (in area) and 1,000,000 (in power) more efficient than a CPU.


Computational density (CD) is defined as computations per unit area and time. Functional diversity (FD) can be interpreted either as the number of instructions per compute element, stored locally to the compute unit, or as an empirical metric expressing the flexibility of a specific implementation technique.


The fundamental difference between software and hardware implementations is the degree of parallelism in executing the target function. Hardware/software partitioning is a major design challenge.

The range of hardware implementation techniques spans from standard software programmable microprocessor/DSP cores, to Application Specific Instruction Processors (ASIP), to Field Programmable Gate arrays (FPGA)*, to Application Specific Integrated Circuits (ASIC) and (full) custom IC.


The difference between Standard Cell and Macro Cell SoC (System on Chip) solutions is the complexity of the individual cell. SoC Macro Cells may represent entire CPU cores.


Comparison of hardware implementation approaches:

Gate array architecture consists of an array of prefabricated transistors. These transistors are supplied with VDD and GND connections. Any circuit can be implemented using these transistors by depositing wire interconnects on top of the array.


In Standard Cell ASIC there are no prefabricated transistors, but a library of pre-developed logic cells. Any circuit can be implemented using these cells.


ASIC chip

Full Custom Design


In contrast to ASIC chips or full-custom ICs, the functionality of programmable logic devices (PLDs) can be modified after chip fabrication, i.e. reprogrammed via fuse/antifuse technology.

The basic ingredients of an FPGA (Field Programmable Gate Array) are configurable logic blocks (CLBs), configurable routing resources, and I/O pads. Each CLB contains multiple look-up tables which are configured by the program data.

There are two general ways of implementing Boolean logic. According to the truth table, a Boolean equation could be implemented either by using logic gates (systematic complementary CMOS logic design), or by a Look Up Table (LUT). With a LUT, the input variables x1 to x3 are used as addresses for a memory, whereas the values of the output variable y are stored in the memory cells. For simple combinatorial circuits, the gate delay of a circuit may be lower than the access delay of a LUT, but with the LUT you have the possibility of programming any Boolean combination of the input variables into it. Therefore LUTs are widely used for FPGAs.

Next to LUTs for combinatorial logic functions, FPGAs contain programmable interconnect resources. Each output signal of a configurable logic block can be switched on one of multiple signal lines of a routing channel. Individual lines inside the routing channels can be connected with other lines at cross section points. Conventional MOS transistors are used as switches. The state of each switch is stored in a memory cell close to the switch.

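The LUT idea can be sketched directly: the truth-table values of y are precomputed into a small memory indexed by the input bits (the example function is invented for illustration):

```python
# Implement y = (x1 AND x2) OR x3 as a 3-input look-up table.
def build_lut(fn, n_inputs=3):
    """Precompute the truth table; the input bits form the memory address."""
    return [fn((addr >> 0) & 1, (addr >> 1) & 1, (addr >> 2) & 1)
            for addr in range(2 ** n_inputs)]

lut = build_lut(lambda x1, x2, x3: (x1 & x2) | x3)

def lut_eval(x1, x2, x3):
    return lut[(x3 << 2) | (x2 << 1) | x1]   # inputs used as the address

assert lut_eval(1, 1, 0) == 1
assert lut_eval(0, 1, 0) == 0
assert lut_eval(0, 0, 1) == 1
```

Reprogramming the FPGA amounts to writing different truth-table values into the same memory cells; the surrounding wiring stays untouched.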

For the classical system board designer, a processor is a packaged component. For today's system-on-chip designer, a microprocessor is a virtual component, i.e. a pre-designed building block that can be used in a chip design.

Virtual components (VCs) are available as soft VC, firm VC, and hard VC. A soft VC consists of synthesizable code in a hardware description language, e.g. VHDL or Verilog. The architecture of the soft VC can be modified by the SoC designer, and it can easily be transferred to the newest technology generation by logic synthesis tools. In contrast, the hard VC is an optimized, technology-dependent macro with a fixed layout (placement and wiring), which cannot be modified by the SoC designer and which requires significant design effort to be transferred to a newer technology. The benefit of hard VCs is their higher speed/area/power optimization in their target technology, compared to a soft VC.


The multicore design principle is a new paradigm in platform-based SoC design.

Case 1: T_app unchanged; f decreases by a factor of n.
Case 2: T_app decreases by a factor of k; f changes by a factor of k/n per core.

Case 1: Assuming that the application can be perfectly parallelized and distributed over n cores, i.e. T_app remains unchanged, the operating frequency and supply voltage of the cores can be scaled by 1/n without changing the execution time of the application. In this case, the dynamic power consumption of the multicore processor is reduced to 1/n² of that of the single-core processor.

Case 2: If we want to increase the application performance on the multicore processor by a factor of k, i.e. T_app decreases by a factor of k, the dynamic power consumption is still lower than that of a single core by a factor of k³/n². As long as k³/n² is lower than 1, the multicore processor will be more efficient in terms of dynamic power consumption than a single-core processor (but not in terms of static power consumption, since there are more core instances).

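Both cases can be checked numerically, assuming (as above) that dynamic power follows P_dyn ∝ f · V² and that the supply voltage is scaled proportionally to the frequency:

```python
def dyn_power_factor(n, k):
    """Dynamic power of n cores, each at frequency (and voltage) k/n,
    relative to one core at full frequency and voltage:
    n * (k/n)^3 = k^3 / n^2."""
    return n * (k / n) ** 3

# Case 1: same performance (k = 1) on n = 4 cores.
print(dyn_power_factor(4, 1))   # 0.0625 -> 1/n^2
# Case 2: k = 2x performance on n = 4 cores.
print(dyn_power_factor(4, 2))   # 0.5    -> k^3/n^2, still below 1
```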
  • SoC Paradigm

The system on chip (SoC) paradigm is a design approach in which all or most of the components of a computer or electronic system are integrated onto a single chip. In an SoC design, the various components of the system, such as the microprocessor, memory, input/output (I/O) interfaces, and other peripherals, are all combined onto a single piece of silicon.

  • FPGA

A field-programmable gate array (FPGA) is a type of programmable logic device (PLD) that can be used to implement digital circuits. It is called a “field-programmable” device because it can be programmed by the user after it has been manufactured, allowing the user to customize the device for a specific application.

An FPGA consists of an array of configurable logic blocks (CLBs) and interconnect resources that can be used to implement a wide variety of digital circuits. The CLBs and interconnect resources can be programmed by the user to perform specified logic functions. Each CLB consists of a number of programmable logic elements (PLEs), which can be configured to perform a specific logic function. The PLEs are typically implemented using programmable function blocks (PFBs) and look-up tables (LUTs).

The number and complexity of the CLBs in an FPGA determine the overall capacity and performance of the device.


4. Processor Architecture

Instruction Set Architecture (ISA) is the interface between a computer’s software and hardware, which defines the set of instructions that a computer’s processor can execute. It specifies the types of instructions that can be used and the format of those instructions, as well as the memory and input/output operations that can be performed by the processor.


We can differentiate the processors by their instruction complexity (e.g. RISC or CISC), type of instruction-level parallelism (dynamically scheduled superscalars or statically scheduled VLIW) as well as by application-specific areas of their employment. The performance of processors can be significantly improved by exploiting instruction-level parallelism (ILP).


The high-level code is transformed by a compiler into an ISA-specific machine code (also called binary or object code). Alternatively, the target code can be written in an assembly language by specifying the program functionality using target ISA instructions.

On the hardware side, the actual processor decodes instructions and generates control signals that are necessary for instructions’ execution. The control signal specification is ISA and processor-dependent.


An Instruction Set Architecture (ISA) typically includes several key parameters that define the capabilities and functionality of a processor. These parameters can include: (1) Data Types: types of data that the processor can handle, such as integers, floating-point numbers, and memory addresses. (2) Instructions: the set of instructions that the processor can execute, such as arithmetic and logic operations, branching, and memory access. (3) Registers: the number and size of registers that the processor has, which are used to temporarily store data and perform calculations. (4) Memory Models: the memory models that the processor supports, such as a flat or segmented memory model, and how memory is accessed and managed by the processor.

For example, the MIPS instruction set consists of different instruction groups. Arithmetic instructions perform arithmetic operations on the registers, e.g. addition and subtraction. Load/store instructions transfer data between the registers and main memory. Jump and branch instructions change the sequential execution flow of the target program; they are used to construct loops or when program execution must be conditional.

The register file consists of a fixed number of architecture registers. The ISA specifies how the corresponding registers can be used and what kind of information they contain at a certain moment of execution.

The accessible memory region is defined by the address space.


Although current processors are much more complex, we can study the principles using the simplified block diagram of a basic processor. This type of microprocessor is called a RISC (Reduced Instruction Set Computer) architecture. All instructions (instruction I/O) operate only on the registers and the accumulator; memory accesses (data I/O) must use special load/store instructions.

A program is executed in the following sequence:
Instruction fetch (IF): The program counter is incremented and the next instruction is fetched into instruction register (IR). If the instruction was present in the instruction cache, IF takes only 1 clock cycle; otherwise the instruction must be loaded from the main memory, taking more cycles.
Instruction decode (ID): The processor decodes the instruction in IR, and a set of control signals is generated. Then the processor retrieves the operands from registers specified in the instruction.
Execution (EX): In this stage, computational instructions are executed in the ALU and their result is stored in the accumulator. For load/store instructions, the effective memory address is calculated.
Memory (MEM): If the current instruction is load/store, the content of a register in the register block is read from or written to the main memory.
Write back (WB): The result of computational instructions or the data retrieved by load instructions is written back into the register block.


If we assume that each stage takes one clock cycle, executing one instruction takes 5 cycles, and the overall CPI (cycles per instruction) of this processor is 5. In reality, a processor does not always have to wait for one instruction to complete before starting the next: execution efficiency can be improved by exploiting the potential parallelism between instructions (ILP, instruction-level parallelism).

IF is then performed every cycle, fetching the next instruction. Executing a single instruction still takes 5 cycles, but the overall CPI gradually approaches 1.

To enable pipelining in hardware, the result of each pipeline operation has to be stored in the intermediate registers at each clock cycle. The period of the clock signal (or its maximum frequency) is defined by the longest pipeline stage. The total instruction rate is typically expressed in MIPS (Millions of Instructions Per Second) and can be determined by dividing the clock frequency of a processor by its CPI value (f / CPI).

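A numeric sketch of the instruction-rate formula; the clock frequency is an invented example value:

```python
def mips(f_clk_hz, cpi):
    """Instruction rate in MIPS: clock frequency divided by cycles per instruction."""
    return f_clk_hz / cpi / 1e6

print(mips(500e6, 5))   # 100.0 -> non-pipelined, CPI = 5
print(mips(500e6, 1))   # 500.0 -> ideal pipeline, CPI -> 1
```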

Prerequisites for efficient pipelining are the regularity of the individual instruction stages and a small, uniform set of instructions and addressing modes. However, the efficiency of instruction pipelining is limited by structural hazards, data hazards, and control hazards.

Structural Hazards: Structural hazards occur if a resource conflict exists between instructions in the pipeline. Assume that the processor has only one memory port that is used both for fetching instructions and data load/store operations, then IF and MEM cannot be done in one cycle.

Data Hazards: In a pipeline, a data hazard arises if a result of an operation is required before it has been calculated.

The pipeline has to be stalled until the first instruction writes the right value into register r3.

Control Hazards: If a branch operation causes the program counter to jump to another location, the instructions in the pipeline following the branch have to be flushed and the pipeline has to be filled again starting from the correct instruction.

The overall performance loss due to control hazards is typically even greater than the loss due to data hazards. To cope with this, current processors employ branch prediction in order to predict the right instruction after branches.

Branch History Table (BHT):

In a 1-bit branch predictor, we assume that the next outcome of a branch is likely to be the same as the previous outcome stored in the table. The last outcome of a branch (taken or not taken) is stored in a 1-bit branch history table. The table is indexed by the last x bits of the branch address and thus contains 2^x entries in total. In loops, the 1-bit branch predictor always predicts incorrectly twice: at the first loop iteration (because the branch was not taken before), and at the last loop iteration when we exit the loop (because the branch was previously taken).

Branch prediction accuracy can be further improved using a 2-bit branch predictor, where two bits encode the state of each branch in the branch history table. A branch is predicted as taken only if it has previously been taken twice in a row; the same holds for predicting it as not taken.

In general, by observing the last m outcomes of former branches, we get 2^m possible branch predictors for each branch in the history table. We can thus define an (m, n)-predictor, in which the last m branches are analyzed to select one of the 2^m n-bit predictors for the current branch.

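A 2-bit saturating-counter predictor for a single branch can be sketched as follows (the branch-history-table indexing is omitted for brevity; the outcome sequence is an invented example):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
    Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self):
        self.state = 0                 # start at strongly not-taken

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True]   # e.g. a loop branch
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes), "predicted correctly")   # 2 of 5
```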

A superscalar processor uses multiple execution units within the processor, each of which can execute a different instruction at the same time. To effectively utilize the multiple execution units in a superscalar processor, the processor must be able to identify independent instructions that can be executed concurrently. This is typically done using a combination of static analysis (analyzing the code at compile time) and dynamic analysis (analyzing the code as it is being executed).

Very Long Instruction Word (VLIW) processors use multiple execution units to improve performance by taking advantage of instruction-level parallelism (ILP) in a program, like superscalar processors.

One key difference between VLIW processors and superscalar processors is that VLIW processors rely on static analysis (analyzing the code at compile time) to identify independent instructions that can be executed concurrently, while superscalar processors also use dynamic analysis (analyzing the code as it is being executed). This means that VLIW processors must be specifically designed and optimized for the types of programs they will be running, while superscalar processors can adapt to a wider range of programs.


There are several ways to measure the performance of a CPU; ultimately, we are interested in the CPU time of a program, task, or function.

Cycles per instruction (CPI) is a measure of the number of clock cycles that a CPU requires to execute a single instruction. It is used to evaluate the performance of a CPU and to compare the performance of different processors.

A lower CPI indicates that a CPU can execute instructions more quickly and efficiently. Factors that can affect CPI include the complexity of the instruction set, the number of clock cycles required to execute each instruction, and the amount of time required to access data from memory.

It is important to note that CPI alone does not reflect the overall performance of a CPU. A high clock speed or a high number of instructions per clock (IPC) can offset a high CPI, and other factors such as cache size and memory bandwidth can also affect overall performance. Thus it’s used in conjunction with other metrics to get a complete picture of a CPU’s performance.


We can decompose CPI_MEM into the CPI of instruction accesses and the CPI of data accesses. In a system with cache memory, if the data/instructions are in the cache, the access time is 1 cycle; otherwise, there is a penalty (i.e. an increased number of cycles).

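A numeric sketch of this decomposition; the miss rates, penalty, and access counts are invented for illustration:

```python
# CPI contribution of memory accesses =
#   accesses per instruction * miss rate * miss penalty (in cycles)
def cpi_mem(miss_rate, penalty_cycles, accesses_per_instr=1.0):
    return accesses_per_instr * miss_rate * penalty_cycles

cpi_instr = cpi_mem(0.02, 50)            # instruction fetches: 2% miss rate
cpi_data  = cpi_mem(0.05, 50, 0.3)       # data accesses: 0.3 per instruction
cpi_total = 1.0 + cpi_instr + cpi_data   # base CPI of 1 plus memory stalls
print(cpi_total)                         # 2.75
```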

When a block of data is retrieved from main memory and stored in the cache, it is placed in a specific location called a cache line. The cache line is the smallest unit of data that can be transferred between the cache and main memory.

缓存只包含主存的一小部分,且两者之间有不同的映射方式,如直接映射、集合关联和完全关联等。

In a cache, an address is divided into three parts: offset, index, and tag.

Offset: The offset is the least significant bits of the address, and it is used to identify a specific byte within a cache line. The size of the offset depends on the size of the cache line. For example, if the cache line size is 64 bytes, the offset would be 6 bits (2^6=64).

Index: The index is the next most significant bits of the address, and it is used to identify a specific cache line within the cache. The size of the index depends on the number of cache lines in the cache. For example, if the cache has 64 lines, the index would be 6 bits (2^6=64).

Tag: The tag is the most significant bits of the address, and it is used to identify a specific block of memory within main memory. The tag is used to compare the memory address being accessed to the addresses stored in the cache. If the tag matches, then the data is likely to be in the cache, and the index and offset are used to locate it.

在缓存中,一个地址被分为三个部分:偏移量、索引和标签。

偏移量: 地址的最低有效位,用于识别缓存行中的一个特定字节。偏移量的大小取决于缓存行的大小。例如,如果缓存行的大小是64字节,偏移量就是6位(2^6 = 64)。

索引: 偏移量之上的地址位,用于识别缓存中的特定缓存行。索引的大小取决于缓存中缓存行的数量。例如,如果缓存有64行,索引就是6位(2^6 = 64)。

标签: 地址的最高有效位,用于识别主内存中的特定内存块。标签用于比较被访问的内存地址和存储在高速缓存中的地址。如果标签匹配,那么数据就可能在高速缓存中,索引和偏移量被用来定位它。
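The tag/index/offset split can be sketched with a few bit operations (a hypothetical helper, using the 64-byte-line, 64-line example from the text):

```python
def split_address(addr: int, offset_bits: int = 6, index_bits: int = 6):
    """Split a byte address into (tag, index, offset) cache fields."""
    offset = addr & ((1 << offset_bits) - 1)                 # byte within the line
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # line within the cache
    tag = addr >> (offset_bits + index_bits)                 # identifies the memory block
    return tag, index, offset

print(split_address(0x12345))  # (18, 13, 5): tag 18, cache line 13, byte 5
```

Reassembling `(tag << 12) | (index << 6) | offset` recovers the original address, confirming the three fields together cover every address bit.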

Direct-mapped cache: The cache is divided into a fixed number of lines, and each line has a unique address. Each block of main memory is mapped to a specific location in the cache.

Set-associative cache: The cache is divided into a fixed number of sets, and each set contains a fixed number of lines. Each block of main memory can be mapped to one of several locations in the cache.

Fully associative cache: In this type of cache organization, each block of main memory can be mapped to any location in the cache. The cache is not divided into sets, and each block of memory is compared to all the tags in the cache. To determine if a block of memory is in the cache, the address is divided into two parts: the offset and the tag. The tag is used to identify a specific block of memory, and the offset is used to identify a specific byte within a cache line.

直接映射的高速缓存: 缓存被分为固定数量的行,每行都有一个唯一的地址。主内存的每个块都被映射到高速缓存中的一个特定位置。

集合关联高速缓存: 缓存被分为固定数量的组,每组包含固定数量的行。每个主内存块可以被映射到高速缓存中的几个位置之一。

完全关联的高速缓存: 在这种类型的高速缓存组织中,主内存的每个块都可以被映射到高速缓存的任何位置。缓存不被分成组,每个内存块都要与缓存中的所有标签进行比较。为了确定一个内存块是否在高速缓存中,地址被分为两部分:偏移量和标签。标签用于识别特定的内存块,而偏移量则用于识别高速缓存行中的特定字节。

In the set-associative cache, when a cache line set is full and a new block of memory needs to be added, one cache line will be replaced. The most commonly used cache replacement policies are:

Least Recently Used (LRU): This policy replaces the cache line that has not been accessed for the longest period of time. It works by maintaining a linked list of the cache lines in order of the time they were last accessed, with the most recently accessed line at the front of the list and the least recently accessed line at the back of the list. When a new block of memory needs to be added to the cache, the least recently accessed line is removed from the list and replaced with the new data.

Least Frequently Used (LFU): This policy replaces the cache line that has been accessed the least number of times. It works by maintaining a counter for each cache line that keeps track of how many times the line has been accessed. When a new block of memory needs to be added to the cache, the line with the lowest access count is removed from the cache and replaced with the new data.

First In First Out (FIFO): This policy replaces the oldest block in the cache. It works by maintaining a queue of the cache lines in the order they were added, with the oldest block at the head of the queue and the newest block at the tail of the queue. When a new block of memory needs to be added to the cache, the oldest block is removed from the queue and replaced with the new data.

Random: This policy replaces a random block in the cache. It works by randomly selecting a cache line to be replaced when a new block of memory needs to be added to the cache.

在集合关联的高速缓存中,当一个缓存行集合已满而需要加入新的内存块时,其中一个缓存行将被替换。最常用的缓存替换策略有:

最近最少使用(LRU): 这个策略替换最长时间没有被访问过的缓存行。它的工作原理是按照最后访问时间的顺序维护一个缓存行的链表,最近访问的行在链表的前面,最久未访问的行在链表的后面。当一个新的内存块需要被添加到缓存中时,最久未被访问的行将从链表中删除,并被新的数据所取代。

最不经常使用(LFU): 这个策略取代了被访问次数最少的缓存行。它的工作原理是为每个高速缓存行维护一个计数器,记录该行被访问的次数。当一个新的内存块需要被添加到高速缓存中时,访问次数最少的一行将从高速缓存中被移除,并被新的数据所取代。

先入先出(FIFO): 这个策略取代了高速缓存中最老的块。它的工作原理是按照添加的顺序维持一个缓存行的队列,最老的块在队列的头,最新的块在队列的尾。当一个新的内存块需要被添加到缓存中时,最旧的块会从队列中被移除,并被新的数据所取代。

随机: 这个策略在高速缓存中替换一个随机块。它的工作原理是,当有新的内存块需要添加到高速缓存中时,随机选一个高速缓存行来进行替换。
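The LRU policy described above can be sketched in a few lines (a simplified model of a single cache set, tracking tags only, no data):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement (tags only, no stored data)."""
    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # tags, ordered least- to most-recently used

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss insert the tag, evicting the LRU line."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # now most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[tag] = None
        return False

s = LRUSet(ways=2)
s.access('A'); s.access('B')
s.access('A')          # hit: 'A' becomes most recently used
s.access('C')          # miss in a full set: evicts 'B', the LRU line
print(list(s.lines))   # ['A', 'C']
```

Note the contrast with FIFO: a FIFO policy would have evicted 'A' (inserted first), while LRU keeps it because it was touched more recently.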

Cache write strategies determine how data is written to the cache when a block of memory is updated or modified. Some common cache write strategies are:

Write-Through: In this strategy, data is written to both the cache and the main memory at the same time. This ensures that the data in the cache is always consistent with the data in the main memory. However, this strategy can result in increased write traffic to the main memory, which can slow down the system.

Write-Back: In this strategy, data is first written to the cache, and then written to the main memory at a later time. This can improve system performance by reducing the number of writes to the main memory. However, it also increases the chances of cache data becoming inconsistent with main memory data in case of system crashes or power failures.

缓存写入策略决定了当一个内存块被更新或修改时,数据如何被写入缓存。一些常见的高速缓存写入策略有:

Write-Through:在这种策略中,数据被同时写入高速缓存和主内存。这确保了高速缓存中的数据与主存中的数据始终是一致的。然而,这种策略可能会导致对主存的写入流量增加,从而降低系统的速度。

Write-Back:在这个策略中,数据首先被写入高速缓存,然后在稍后的时间写入主内存。这可以通过减少对主内存的写入次数来提高系统性能。然而,在系统崩溃或断电的情况下,它也增加了缓存数据与主内存数据不一致的机会。

In a write-back cache, a “dirty bit” is used to keep track of whether a block of memory in the cache has been modified (written to) or not.

When a block of memory is first brought into the cache, the dirty bit is set to “not dirty” (or “clean”). If a write operation is then performed on the block of memory in the cache, the dirty bit is set to “dirty” to indicate that the block of memory has been modified and the copy in the main memory is no longer up-to-date.

When the cache replacement algorithm decides that a dirty block needs to be evicted, the data in that block is written back to main memory first, so as to maintain the consistency between the main memory and the cache.

The use of dirty bits allows the write-back cache to avoid writing back all blocks to main memory, thus reducing the number of writes to main memory and improving performance.

在回写式高速缓存中,一个“脏位”用于跟踪高速缓存中的一个内存块是否被修改过(写入)。
当一个内存块第一次被带入高速缓存时,脏位被设置为“不脏”(或“干净”)。如果随后对缓存中的内存块进行了写操作,脏位被设置为“脏”,表示该内存块已被修改,主内存中的拷贝不再是最新的。

当缓存替换算法决定一个脏块需要被驱逐时,该块中的数据首先被写回主内存,以保持主内存和缓存之间的一致性。
脏位的使用使回写缓存避免将所有块写回主内存,从而减少对主内存的写入次数,提高性能。
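A minimal sketch of the dirty-bit mechanism (hypothetical names; main memory here is just a dict keyed by tag):

```python
class WriteBackLine:
    """One write-back cache line carrying a dirty bit (simplified sketch)."""
    def __init__(self, tag, data):
        self.tag, self.data = tag, data
        self.dirty = False                  # freshly loaded lines are clean

    def write(self, data):
        self.data = data
        self.dirty = True                   # the main-memory copy is now stale

    def evict(self, memory):
        if self.dirty:                      # only modified lines are written back
            memory[self.tag] = self.data
        self.dirty = False

memory = {0: 'old'}
line = WriteBackLine(0, 'old')
line.write('new')       # cache holds 'new'; main memory still holds 'old'
line.evict(memory)      # dirty bit is set, so the line is written back
print(memory[0])        # 'new'
```

A clean line is simply discarded on eviction, which is exactly the write-traffic saving the text describes.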

5. Memory

classification of memory

寄存器和RAM都是由晶体管构成的。然而,寄存器通常由少量的晶体管组成,与CPU紧密结合,用于临时存储被CPU高频使用和处理的数据。它们被设计成小而快,因此数据可以被快速存储和检索。寄存器允许同时访问其存储的全部信息。

另一方面,RAM通常由更多的晶体管组成,通过一个内存控制器与CPU相连,用于存储正在被操作系统和应用程序高频使用的数据。内存控制器作为CPU和RAM之间的中介,管理数据传输并控制对内存的访问。RAM只允许访问存储信息的一小部分。RAM的物理结构可以根据RAM的类型(如DDR、SDRAM等)而有所不同,但它通常比寄存器的结构更大、更复杂。


characteristics of memory
  • SRAM(静态随机存取存储器)& DRAM(动态随机存取存储器)

刷新: DRAM需要定期刷新其内容,而SRAM不需要。

单元设计: DRAM在电容器中以电荷的形式存储数据,而SRAM在触发器电路中以二进制状态存储数据。

访问时间: SRAM的访问时间比DRAM快,因为它不需要刷新周期。

耗电量: 由于SRAM的单元设计和不需要刷新,它比DRAM消耗更少的能量。

成本: 由于DRAM的单元设计比较简单,因此每比特的存储成本比SRAM低。

容量: DRAM的容量比SRAM大得多,这使得它成为大多数计算机系统中主存储器的首选。

总之,SRAM比DRAM更快、功耗更低,但它也更昂贵,而且通常容量较小。DRAM速度较慢,但价格更低、容量更大,这使它成为大多数计算机系统中主存储器的首选。

  • DDR SRAM

通过使用双数据速率(DDR)架构来实现高速数据传输,其中数据在时钟信号的上升沿和下降沿上传输。与传统SRAM相比,这允许更高的数据传输率,传统SRAM只在时钟信号的上升沿传输数据。

We first clarify some basic definitions in the context of memory:

Access bandwidth [bits/s]: Amount of data transported into or out of a memory array (or memory interface) per unit of time.

Latency: Delay or time elapsed between the request and actual delivery of data.

Cycle time: Minimum time period between two consecutive read or write accesses to memory.

Asynchronous memory, also known as asynchronous DRAM, does not operate in sync with the clock speed of the computer’s processor. Instead, it operates on its own clock, which can be slower or faster than the processor’s clock. This results in longer access times, but it also means that asynchronous memory can be manufactured using simpler and less expensive technology.

Synchronous memory operates in sync with the clock speed of the computer’s processor. This results in faster access times, as the memory and processor can work in tandem to quickly transfer data. However, synchronous memory is typically more expensive to manufacture, as it requires more advanced technology to synchronize its operation with the processor’s clock.

访问带宽: 每单位时间内传入或传出内存阵列(或内存接口)的数据量。
延迟: 从请求数据到实际交付数据之间所经过的时间。
周期时间: 两次连续的读或写访问之间的最小时间间隔。
异步内存: 也被称为异步DRAM,不与计算机处理器的时钟速度同步运行。相反,它在自己的时钟上运行,其速度可能比处理器的时钟慢或快。这导致了更长的访问时间,但它也意味着异步内存可以使用更简单、更便宜的技术来制造。
同步存储器的运行与计算机处理器的时钟速度同步。这导致更快的访问时间,因为内存和处理器可以协同工作,快速传输数据。然而,同步存储器的制造成本通常更高,因为它需要更先进的技术来使其运行与处理器的时钟同步。

所有的存储器都有一个共同点,即它们是以二维阵列结构组织的。存储的信息不是按单个位访问的,而是以所谓的字为单位,由M位组成。M是一个可变的数字,通常与相应的微处理器架构的数据路径宽度相匹配(8位、16位、32位、64位)。

存储器阵列的内容可以通过一个共享的、双向的(输入/输出)数据总线访问,该总线为M位宽。由于实际原因(为了限制外部需要的控制信号的数量),我们使用了一个地址解码器。用L个地址信号可以从2^L个字中选择一个。

在实践中,内存阵列的尺寸是这样的:宽度和高度大致相等。这意味着每一行包含多个字。因此,地址解码器被分成一个列解码器(K位)和一个行解码器(L-K位),前者从2^K个字中选择一个,后者从2^(L-K)个字行中选择一个。列解码器和存储器阵列之间设置了感应放大器。

sequence of memory access

Burst access modes allow for reading/writing more than a single data word from/to memory. In order to store/retrieve larger chunks of information to/from consecutive memory locations, it is sufficient to increment the column decoder address lines (while keeping the row decoder lines fixed). The maximum burst size (i.e. number of words that can be accessed during one burst command) equals the number of words in one word line (= 2^K).

突发访问模式允许从/向存储器读/写超过一个数据字。为了向/从连续的存储器位置存储/检索更大的信息块,只需递增列解码器的地址线(同时保持行解码器线固定)。最大的突发大小(即在一个突发命令中可以访问的字数)等于一个字行中的字数(= 2^K)。
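The burst address sequence can be sketched as follows (a hypothetical helper; it assumes the column address wraps around within the word line, and uses K = 3 column bits for the example):

```python
def burst_addresses(row: int, start_col: int, length: int, k_bits: int):
    """Word addresses covered by one burst: the row stays fixed while the
    column address is incremented (wrapping within the 2**k_bits word line)."""
    words_per_line = 2 ** k_bits
    assert length <= words_per_line       # a burst cannot exceed one word line
    return [(row << k_bits) | ((start_col + i) % words_per_line)
            for i in range(length)]

print(burst_addresses(row=5, start_col=6, length=4, k_bits=3))
# [46, 47, 40, 41] -- the column wraps while the row part (5 << 3 = 40) stays fixed
```

Only the low K bits change across the burst, which is why no new row decode is needed.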

Hierarchical Memory Architecture
An assembly consisting of a memory cell matrix, a row decoder, and a column decoder is a memory block or page. Memory blocks can be cascaded horizontally, further partitioning the address bits into block address, column address, and row addresses.

分层内存结构:
由一个存储单元矩阵、一个行解码器和一个列解码器组成的组件是一个存储块或页。存储块可以水平级联,进一步将地址位划分为块地址、列地址和行地址。

我们现在研究实际的存储单元:DRAM单元。DRAM在存储密度方面表现出色,因为它需要最少的基本CMOS器件来实现存储单元,即一个CMOS晶体管加一个电容。

DRAM cell

The transistor acts as a switch which is controlled by the word line WL. The bit information is stored in the storage capacitor CS. BL is precharged to VDD/2.

When a logic “1” (VDD) is to be written to a particular memory cell, the corresponding BL and WL lines are driven with VDD. As a consequence, CS is charged to “1”, i.e., VDD-Vt (if there was already a “1” stored on CS, the logic level is refreshed). Similarly, storing a “0” is realized by driving BL to GND, which discharges CS.

When a particular memory cell is read, the corresponding BL and WL lines are driven with VDD. In case a “1” was stored, the voltage of CS is VDD-Vt, which is greater than the voltage of CBL (VDD/2); the charge redistribution between CS and the bit line capacitor CBL raises the voltage on BL. As CS is much smaller than CBL and the total charge Q = QS+QBL = CS·VS + CBL·VBL remains unchanged, this voltage swing is small compared to VDD. However, it is big enough to be sensed by the sense amplifier, which drives BL to VDD and recharges CS. Hence, during a DRAM read, the stored “1” of the memory cell is re-written.

晶体管充当一个开关,由字线WL控制。位信息存储在存储电容CS中,BL预充电到VDD/2。
当一个逻辑“1”(VDD)被写入一个特定的存储单元时,相应的BL和WL线被VDD驱动。因此,CS被充电到“1”,即VDD-Vt(如果CS上已经存储了一个“1”,则逻辑电平被刷新)。同样,存储“0”是通过将BL驱动到GND来实现的,使CS放电。
当某一存储单元被读取时,相应的BL和WL线被VDD驱动。在存储“1”的情况下,CS的电压为VDD-Vt,大于CBL的电压VDD/2,CS和位线电容CBL之间的电荷重新分配提高了BL的电压。由于CS比CBL小得多,且总电荷Q = QS+QBL = CS·VS + CBL·VBL保持不变,这个电压波动与VDD相比很小。然而,它大到足以被感应放大器感应到,放大器驱动BL到VDD并为CS重新充电。因此,在DRAM读取期间,存储的“1”被重新写入。
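The charge redistribution can be checked numerically. Assuming illustrative values (VDD = 1.8 V, Vt = 0.4 V, CS = 30 fF, CBL = 300 fF, none of which come from the text), conservation of charge gives the bit line voltage after WL is activated:

```python
def bitline_voltage(v_cell, c_cell, v_bl, c_bl):
    # Q = CS*VS + CBL*VBL is conserved, so the shared voltage is Q / (CS + CBL)
    return (c_cell * v_cell + c_bl * v_bl) / (c_cell + c_bl)

vdd, vt = 1.8, 0.4
cs, cbl = 30e-15, 300e-15                         # 30 fF cell, 300 fF bit line
v = bitline_voltage(vdd - vt, cs, vdd / 2, cbl)   # reading a stored "1"
print(round(v, 3), round(v - vdd / 2, 3))         # 0.945 V, a swing of only ~0.045 V
```

The swing is only a few percent of VDD, which is why the sense amplifier is needed to restore the full logic level.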

CS may lose the stored content due to charge leakage. That’s why the stored content in DynamicRAM memories has to be refreshed periodically when the time between consecutive memory accesses exceeds certain intervals.

由于电荷泄漏,CS可能会丢失存储的内容。这就是为什么当连续访问存储器的时间超过一定间隔时,必须定期刷新DynamicRAM存储器中的存储内容。

Trench DRAM Cell implements the conducting electrodes of the capacitor along the walls of a deep and narrow trench cut into the Si substrate, which increases the storage capacity CS, thus the information stored is more robust.

沟槽式DRAM单元将电容器的导电电极沿着切入硅衬底的深而窄的沟槽壁实现,这增加了存储容量CS,从而使存储的信息更加稳固。

在DRAM阵列中,每个位线BL有一个感应放大器。从逻辑电平的角度来看,VDD的10%到90%是一个禁止的范围,电压必须被放大到VDD或GND,这是感应放大器在DRAM存储单元中的主要功能。此外,一旦发现有朝向VDD或GND的“趋势”,感应放大器会立即加速信号的变化。

DRAM cell sense amplifier

The sense amplifier works as follows:

Activation of the word line WL connects the storage capacitor CS to BL. If CS was charged to VDD (Reading a “1”), the voltage on BL will slightly increase, which makes T4 conduct and the voltage on T1 decrease. Thus, T1 conducts, BL is connected to VDD and the originally stored logic level on CS is refreshed.

Reading a “0” from the DRAM cell works similarly.

字线WL的激活将存储电容CS连接到BL。如果CS被充电到VDD(读“1”),BL上的电压将略有增加,这使得T4导通,T1上的电压下降。因此,T1导通,BL被连接到VDD,CS上原来存储的逻辑电平被刷新。
从DRAM单元读取“0”的工作原理类似。

6-晶体管SRAM单元由一对交叉耦合的反相器组成,这是最简单的静态寄存器元件。

6-transistor SRAM cell

In SRAM, a sense amplifier is not strictly necessary, but we use it for performance reasons.

In contrast to the 1 transistor DRAM cell, the SRAM cell needs no periodic refreshing and keeps the stored bit value unless it is disconnected from the power supply.

在SRAM中,感应放大器不是必须的,但我们出于性能原因使用它。
与1个晶体管的DRAM单元相比,SRAM单元不需要定期刷新,并能保持存储的比特值,除非它与电源断开连接。

只读存储器(ROM)单元由一个p-n二极管和一个位于存储器矩阵的字线WL和位线BL交叉点的微小金属保险丝组成。

ROM

An open fuse represents a logic “0”, whereas a closed fuse represents a logic “1”.

The diode prevents reverse currents from flowing from BL to WL and thus from impacting the logic values on other BLs. The resistor between BL and GND is mandatory to discharge CBL after each access, thus ensuring a proper “0” level.

保险丝断开代表逻辑“0”,而保险丝闭合代表逻辑“1”。
二极管防止反向电流从BL流向WL而影响其他位线上的逻辑值。BL和GND之间的电阻是必要的,以便在每次访问后对CBL放电,从而确保一个适当的逻辑“0”电平。

浮动栅极晶体管单元

When programming the floating gate transistor cell, a high programming voltage (e.g. four times higher than VDD) is applied to both the control gate and the drain (bit line), making electrons able to tunnel through the first oxide layer onto the floating gate.
Removing the programming voltages leaves negative charge trapped on the floating gate. When VDD is now applied to the control gate (word line), the effective floating-gate-to-substrate voltage is not large enough to establish a conducting channel. The negative voltage on the floating gate results in a higher threshold voltage Vt.

当对浮动栅极晶体管单元进行编程时,一个高的编程电压(例如VDD的四倍)被施加到控制栅极和漏极(位线),使得电子能够通过第一氧化层隧道到浮动栅极上。
移除编程电压后,浮动栅极上仍有负电荷被捕获。此时对控制栅极(字线)施加VDD,有效的浮动栅极到衬底的电压不足以建立一个导电通道。浮动栅极上的负电压导致了更高的阈值电压Vt。

闪存单元

In contrast to EPROM, EEPROM and flash memory cells are electrically erasable. Erasing a stored bit value means removing the trapped charges from the floating gate. This can be done by making the source electrode float (disconnecting it from GND), connecting the drain electrode to a high voltage, and the control gate to GND. Thus, electrons on the floating gate are attracted through the thin oxide layer to the drain. This effect is called “Fowler-Nordheim” tunneling.

与EPROM不同,EEPROM和闪存单元是可以电擦除的。擦除一个存储的位值意味着从浮动栅极上清除被捕获的电荷。这可以通过使源极浮动(与GND断开)、将漏极连接到一个高电压、并将控制栅极连接到GND来实现。因此,浮动栅极上的电子通过薄氧化层被吸引到漏极。这种效应被称为“Fowler-Nordheim”隧道效应。

计算机系统的内存层次结构

CPU caches are closely integrated with the processor core and operate at the same cycle times as the CPU data path pipelines.

The next faster (and bigger) data repository is on-chip (or off-chip) SRAM. However, in order to access SRAM, the CPU request already has to traverse the CPU bus, introducing additional latency.

For larger quantities of data, external SDRAM (or variants) are the next choice. The single transistor DRAM cell achieves larger storage densities than the six transistor SRAM cell, but introduces longer access times due to the more complex access mechanism and the need to interleave data accesses with periodic refresh cycles.

SDRAM memory acts as a “mirror space” for data residing on the hard disk. DMA (direct memory access) controllers shuffle data from external disk drives via system interfaces (e.g. PCI, SCSI) into the SDRAM without requiring CPU attention.

CPU缓存与处理器核心紧密结合,并以与CPU数据路径管道相同的周期时间运行。
下一个更快的(和更大的)数据存储库是片上(或片外)SRAM。然而,为了访问SRAM,CPU的请求已经必须穿越CPU总线,引入额外的延迟。
对于更大数量的数据,外部SDRAM(或变体)是下一个选择。 单晶体管DRAM单元比六晶体管SRAM单元实现了更大的存储密度,但由于更复杂的访问机制以及需要将数据访问与定期刷新周期交错进行,因此引入了更长的访问时间。
SDRAM存储器充当了驻留在硬盘上的数据的“镜像空间”。DMA(直接内存访问)控制器通过系统接口(如PCI、SCSI)将数据从外部磁盘驱动器传输到SDRAM,而不需要CPU的参与。

我们现在更详细地研究实现一个最小的内存子系统到标准SDRAM内存芯片所需的协议、接口和构建模块。

The memory controller translates the linear addresses for read and write used by the CPU into two-dimensional (row, column) addresses and corresponding control signals for SDRAM internal use. In general, memory controllers hide features and requirements of a specific memory technology.
If necessary, the memory controller can stall the CPU until the SDRAM is again able to accept subsequent read or write accesses.
Three buses are distinguished for data (b), control signals (c), and addresses (a). In some systems, combinations of these signals are multiplexed onto a single bus.

内存控制器将CPU使用的线性读写地址转换为二维(行、列)地址和相应的控制信号,供SDRAM内部使用。一般来说,内存控制器隐藏了特定内存技术的特点和要求。
如果有必要,内存控制器可以使CPU停顿,直到SDRAM再次能够接受后续的读或写访问。
三条总线被区分为数据(b)、控制信号(c)和地址(a)。 在一些系统中,这些信号的组合被复用到一条总线上。
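The address translation performed by the memory controller can be sketched as follows (a hypothetical helper; real controllers also handle banks, timing, and refresh):

```python
def linear_to_row_col(addr: int, col_bits: int):
    """Map the CPU's linear word address onto an SDRAM (row, column) pair."""
    col = addr & ((1 << col_bits) - 1)   # low bits select the column
    row = addr >> col_bits               # remaining bits select the row
    return row, col

print(linear_to_row_col(0x1234, col_bits=8))  # (18, 52) = (0x12, 0x34)
```

Putting the column in the low address bits means consecutive linear addresses land in the same row, so they can be served by a single burst without a new row activation.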
peak data bandwidth
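As a rough sketch of how peak data bandwidth follows from the bus parameters (the numbers below are illustrative, not from the text): it is the product of the data bus width in bytes, the clock rate, and the number of transfers per clock cycle (2 for DDR).

```python
def peak_bandwidth(bus_width_bits: int, clock_hz: int, transfers_per_cycle: int) -> int:
    # peak bytes/s = (bus width in bytes) x clock rate x transfers per cycle
    return bus_width_bits // 8 * clock_hz * transfers_per_cycle

# assumed example: 64-bit bus at 200 MHz with DDR signaling
print(peak_bandwidth(64, 200_000_000, 2))  # 3200000000 bytes/s = 3.2 GB/s
```

Sustained bandwidth is lower than this peak figure because of refresh cycles, row activations, and bus turnaround overhead.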
