Superscalar techniques: a summary

Instruction retire and cancel

Modern processors execute many more instructions than the program flow actually needs. This is called "speculative execution".

The instructions that are then "proven" to be needed by the program flow are "retired" (that is, the ones on the branch that was predicted correctly).

On a four-issue machine, a Retiring rate of 50% corresponds to an IPC of 2 (4 issue slots per cycle × 0.5).
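The arithmetic behind that figure can be sketched in a few lines. This is only a rough model: it assumes the Retiring rate is the share of issue slots that end up holding retired work, and that one slot equals one instruction (real cores count micro-ops). The function name `ipc_from_retiring` is made up for illustration.

```python
def ipc_from_retiring(issue_width: int, retiring_fraction: float) -> float:
    """Estimate IPC as (issue slots per cycle) x (fraction of slots that retire).

    Simplifying assumption: one issue slot corresponds to one instruction.
    """
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    if not 0.0 <= retiring_fraction <= 1.0:
        raise ValueError("retiring_fraction must be in [0, 1]")
    return issue_width * retiring_fraction


if __name__ == "__main__":
    # Four-issue machine, half of the slots retire useful work.
    print(ipc_from_retiring(4, 0.5))  # 2.0
```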

Superscalar / pipelining / multiple issue

A form of parallel computing that implements instruction-level parallelism within a single processor core.

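First, it helps to understand the difference between a single pipeline and multiple pipelines. Without pipelining, instruction execution is not subdivided: from the moment one instruction is fetched until it completes, the second instruction can only wait, and it starts only after the first one has finished.

Pipelining breaks the whole process into several steps: fetch, decode, execute, memory access, write-back. When the first instruction reaches a given step, the instructions behind it do not have to wait idly; they can already go through the earlier steps. Pipelining therefore greatly improves throughput: without it, the hardware unit for each step sits idle four fifths of the time, whereas with it every unit can be kept busy around the clock.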
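To put numbers on the throughput claim, here is a small idealized cycle count. It assumes every stage takes exactly one cycle and that there are no stalls or hazards, which real pipelines do not guarantee; the function names are invented for this sketch.

```python
STAGES = ["fetch", "decode", "execute", "memory access", "write-back"]


def cycles_unpipelined(num_instructions: int, num_stages: int = len(STAGES)) -> int:
    """Each instruction runs all stages before the next one may start."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int = len(STAGES)) -> int:
    """The first instruction fills the pipeline; after that, one finishes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def speedup(num_instructions: int, num_stages: int = len(STAGES)) -> float:
    """Ratio of unpipelined to pipelined cycle counts (ideal case only)."""
    return cycles_unpipelined(num_instructions, num_stages) / cycles_pipelined(
        num_instructions, num_stages
    )


if __name__ == "__main__":
    n = 1000
    print(cycles_unpipelined(n))  # 5000
    print(cycles_pipelined(n))    # 1004
    print(round(speedup(n), 2))   # 4.98
```

For long instruction streams the ideal speedup approaches the number of stages, which is why the depth of the pipeline matters.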

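So, in theory, the performance gain from pipelining is closely tied to the number of stages in the pipeline (that is, how finely we subdivide the work).

Pipelining slightly increases the latency of an individual instruction (because instructions can get stuck behind one another).

Likewise, having multiple pipelines simply means that several dedicated, non-pipelined execution units have each been split into a pipeline.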

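Can the stages of multiple pipelines be interleaved?

That is, can an instruction be fetched on pipeline 1 and decoded on pipeline 2? Probably not. I have no evidence either way, but it seems very unlikely, since I could not find anything about it.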

Intel Haswell and AMD Piledriver have 4 pipelines per core

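Superscalar means that the CPU has more than one pipeline and can complete more than one instruction per clock cycle. A single pipeline presumably cannot be superscalar: with fetch -> decode -> address generation -> operand fetch -> execute -> write-back, each stage takes one clock cycle and holds one instruction at a time, so it cannot complete more than one instruction per clock cycle.

A superscalar CPU generally has multiple pipelines that can work in parallel. In a single-pipeline design, instructions overlap in execution but are still processed sequentially, and only one instruction can be issued or retired per cycle.

Multiple issue: multiple issue simply means that there are multiple pipelines.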

Branch prediction / speculative execution

Branch prediction answers the question "which path?", so that the next instruction can be fetched, and that is all. Speculative execution goes one step further and computes the results along the selected path.

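After a branch is predicted, the CPU speculatively executes the instructions on the predicted path, which may turn out to be the wrong one. The results of these instructions, however:

are not written back to the general-purpose registers;
are not written back to memory;
do not update the status flags (CF/ZF, etc.).

They exist in only two places:

the reorder buffer (ROB);
the reservation stations / register rename table.

Only instructions that are "confirmed correct" are committed (Commit/Retire).

Branch predicted correctly: commit in bulk, writing to registers / memory;
Branch mispredicted: flush the speculative entries from the ROB and throw them all away.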
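The commit-or-discard behaviour above can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real core is built: it assumes a single outstanding branch, treats everything in the buffer as younger than that branch, and models only register writes. The class `SpeculativeCore` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeCore:
    # Architectural state visible to the program.
    registers: dict = field(default_factory=dict)
    # Buffered (destination, value) results, kept in program order.
    rob: list = field(default_factory=list)

    def execute_speculatively(self, dest: str, value: int) -> None:
        # The result is only buffered; architectural registers stay untouched.
        self.rob.append((dest, value))

    def resolve_branch(self, prediction_correct: bool) -> None:
        if prediction_correct:
            # Commit buffered results in program order.
            for dest, value in self.rob:
                self.registers[dest] = value
        # Correct or not, the buffered entries leave the buffer.
        self.rob.clear()


if __name__ == "__main__":
    core = SpeculativeCore(registers={"rax": 1})
    core.execute_speculatively("rax", 42)
    print(core.registers)  # {'rax': 1}: nothing visible yet
    core.resolve_branch(prediction_correct=False)
    print(core.registers)  # {'rax': 1}: wrong-path result discarded
```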

Speculative execution can work WITHOUT branch prediction:

More specifically, consider an example where the program’s control flow depends on an uncached value located in external physical memory. As this memory is much slower than the CPU, it often takes several hundred clock cycles before the value becomes known. Rather than wasting these cycles by idling, the CPU attempts to guess the direction of control flow, saves a checkpoint of its register state, and proceeds to speculatively execute the program on the guessed path. When the value eventually arrives from memory, the CPU checks the correctness of its initial guess. If the guess was wrong, the CPU discards the incorrect speculative execution by reverting the register state back to the stored checkpoint, resulting in performance comparable to idling. However, if the guess was correct, the speculative execution results are committed, yielding a significant performance gain as useful work was accomplished during the delay.

From: Spectre Attacks: Exploiting Speculative Execution

cpu architecture - difference between speculation and prediction - Stack Overflow

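Out-of-order execution

In-order issue -> out-of-order execution -> in-order commit: only the execution stage in the middle needs to be out of order.

Execution only really goes "out of order" in the CPU back end (the execution stage); the front end (fetch, decode, rename) still processes instructions strictly in program order.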

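ROB (Re-Order Buffer)

Note that the ROB is not specific to branch prediction: every instruction, whether ordinary arithmetic, a load, a store, or a jump, goes into the ROB.

Whatever the instruction, it takes a slot in the ROB as soon as it enters the back end. The order in the ROB is program order; nothing is reordered inside the ROB itself. This shows in the implementation too: the ROB is essentially a circular queue that follows first-in, first-out (FIFO) rules. The "O" (order) here refers to the order of commit, that is, Retire.

The step after the ROB is Retire: instructions leave the ROB in order, and that is retirement. Because the structure is a FIFO, there is an implied constraint: every instruction can be held up by the one at the head. If that instruction cannot commit, the instructions behind it cannot commit either, even if they are ready.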
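The head-of-line blocking described above can be seen in a minimal FIFO model. This sketch assumes each entry carries a "done" flag that is set when the instruction has finished executing, and uses `collections.deque` as a stand-in for a hardware circular queue. The class `ReorderBuffer` and its method names are invented for illustration.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()  # each entry is [name, done], in program order

    def allocate(self, name: str) -> bool:
        """Take a slot at the tail; returns False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append([name, False])
        return True

    def mark_done(self, name: str) -> None:
        """Record that an instruction finished executing (in any order)."""
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = True
                return
        raise KeyError(name)

    def retire(self) -> list:
        """Retire from the head only, stopping at the first unfinished entry."""
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer(capacity=4)
    for name in ["load", "add", "mul"]:
        rob.allocate(name)
    rob.mark_done("add")
    rob.mark_done("mul")
    print(rob.retire())  # []: the unfinished load at the head blocks everything
    rob.mark_done("load")
    print(rob.retire())  # ['load', 'add', 'mul']
```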

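How is it decided whether an instruction in the ROB can commit? It has to be at the head of the ROB and have finished executing; when it gets to execute is decided by the reservation station.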

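Reservation station / register renaming

When an instruction enters the ROB, it also enters the reservation station (RS). The RS ignores the order in which instructions arrive; it only watches whether their operands are ready, and whichever instruction has its operands ready first executes first.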
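The "whoever has its operands first goes first" rule can be sketched as a simple sort. This toy model assumes the cycle at which each operand becomes available is already known, that there are unlimited execution units, and that ties are broken by program order. The function `execution_order` and the operand names are hypothetical.

```python
def execution_order(instructions, operand_ready_cycle):
    """Return instruction names in the order they would start executing.

    instructions: list of (name, [source operands]) in program order.
    operand_ready_cycle: dict mapping operand name -> cycle it becomes available.
    """
    waiting = []
    for position, (name, sources) in enumerate(instructions):
        # An instruction is ready once its slowest operand has arrived.
        ready = max((operand_ready_cycle[s] for s in sources), default=0)
        waiting.append((ready, position, name))
    # Earliest-ready first; program order breaks ties.
    return [name for _, _, name in sorted(waiting)]


if __name__ == "__main__":
    program = [
        ("use_loaded_value", ["r1"]),  # r1 comes from a slow load
        ("add", ["r2", "r3"]),         # both operands arrive early
    ]
    ready = {"r1": 100, "r2": 1, "r3": 2}
    print(execution_order(program, ready))  # ['add', 'use_loaded_value']
```

The later instruction runs first here because its operands arrive earlier, while retirement would still follow program order.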

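Register renaming is also provided by the reservation stations (this holds for the classic Tomasulo scheme; many modern cores use a separate rename table instead).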