PCIE基础概念

PCIE 相比于 PCI 的变化:

  • 并改串;可以补充一下原因;
  • The host bridge is replaced by the upstream port of a four-port fan-out switch

不同的 PCIe 设备之间理论上是可以互相通信的。

好大一棵树 - PCIe Tree

x1, x2, x4, x8, x16 等到都是以 lane 为单位的,lane 越多表示带宽越大。

PCIe Lane & PCIe Gen

Gen 决定每个 Lane 的速度,而 Lane 的数量决定总通道数。总带宽是两者的乘积。

当低 PCIe Lane 数量的设备插入高 PCIe Lane 数量的主板时,实际使用的 Lane 数量由设备与主板的协商结果决定。

IDE (Integrity and Data Encryption)

TLP 层的机密性、完整性和防重放性协议。

PCIe 接口芯片驱动

在 linux 下 PCI/PCIe 接口芯片的驱动框架已由 linux kernel 实现。

PCI 设备驱动程序从总线 0 开始查询 PCI 系统并且定位系统中所有的 PCI 设备和 PCI 桥。它建立一个 可以用来描述这个 PCI 系统拓朴层次的数据结构链表。并且对所有的发现的 PCI 桥编号。

PCIe 设备发现流程

DFS 的方式分配总线号,设备号和功能号?BDF。

PCIe 协议到底是谁和谁之间的协议?

PCIe 是一套总线标准。

总线上的设备之间的协议。和 TCP 一样,采用分层协议,下层协议为上层服务,所以 PCIe 不是一个协议,而是一套协议,一个协议栈,只不过里面每一个子协议一般没有替代品,所以合并起来叫做 PCIe 协议:

  • 事务层的主要职责是创建(发送)或者解析(接收)TLP (Transaction Layer packet),流量控制,QoS,事务排序等。
  • 数据链路层的主要职责是创建(发送)或者解析(接收)DLLP(Data Link Layer packet),Ack/Nak 协议(链路层检错和纠错),流控,电源管理等。
  • 物理层的主要职责是处理所有的 Packet 数据物理传输,发送端数据分发到各个 Lane 传输(stripe),接收端把各个 Lane 上的数据汇总起来(De-stripe),每个 Lane 上加扰(Scramble,目的是让 0 和 1 分布均匀,去除信道的电磁干扰 EMI)去扰(De-scramble),以及 8/10 或者 128/130 编码解码,等等。

PCIe 和 MMIO 的关系

VirtIO 的 transport layer 可以是 PCIe,也可以是 MMIO。这两者的关系是什么呢,是并列还是一个包含另一个,是另一个超集的关系?

后者的关系,MMIO 是 PCIe 规范的一部分。

PCIe ACS (Access Control Services)

默认情况下,PCIe 设备可以通过根复合体(Root Complex)直接与其他设备通信(称为 P2P 通信)。这种通信可能会绕过主机的内存管理单元(IOMMU),从而带来安全隐患。

ACS 可以限制或禁止 P2P 访问,确保所有设备之间的通信都经过根复合体和 IOMMU,从而实现更严格的访问控制。

Without ACS, it is possible for a PCIe Endpoint to either accidentally or intentionally (maliciously) write to an invalid/illegal area on a peer endpoint, potentially causing problems.

启用 or 禁用 ACS:

#!/bin/bash

set -x

[[ "x$1" == "xenable" ]] && {
#	for bdf in (lspci | grep LSI | awk '{ print $1 }' | xargs) ; do sudo setpci -s bdf ECAP_ACS+6.w=5d ; done
	acshelper -s 005d
}

[[ "x$1" == "xdisable" ]] && {
#	for bdf in (lspci | grep LSI | awk '{ print $1 }' | xargs) ; do sudo setpci -s bdf ECAP_ACS+6.w=1d ; done
	acshelper -s 001d
}

PCI / PCIE Topology

The Conventional PCI bus (henceforward PCI) is a designed around the bus topology: a shared bus is used to connect all the devices. To create more complex hierarchies some devices can operate as bridge: a bridge connects a PCI bus to another, secondary, bus. The secondary bus can be another PCI bus (the device is called a PCI-to-PCI bridge, henceforward P2P) or a bus of a different type (e.g. PCI-to-ISA bridge).

The PCI Express bus (henceforward PCIe) is designed around a point-to-point topology: a device is connected only to another device.

  • While the basic components of the PCI bus were devices and bridges,
  • the basic components of the PCIe are devices and switches.

From the software perspective, nothing is changed (but for new features added) and the bus is enumerated the same way: with devices and bridges.

The PCIe switch is the basic glue between devices, it has n downstream ports. Internally the switch has a PCI bus segment, for each port a virtual P2P bridge is created in the internal bus segment.

PCIe Endpoint

就是 device。

更好地理解 lspci 的输出

lspci -t
-[0000:00]---00.0-[01-06]----00.0-[02-06]--+-00.0-[03]--+-00.0
                                           |            \-00.1
                                           \-03.0-[04-06]----00.0-[05-06]----00.0-[06]----00.0

输出难以理解的根源在于 PCI bridge 这个可以创建出总线的东西,也是一个 PCI 设备

把一个垂直的线当作是一个 bus,不要把横向的当作 bus。 如下示例,垂直的是一个 bus,上面挂了很多设备。

00.0-[01-06]00.0-[01] 这两种形式其实是同一个意思,都表示这是一个 PCI Bridge 设备,前者表示这个 bridge 的子树所囊括的总线范围,后者表示仅仅一条总线。这种表示方式可能是为了更加方便寻找一个设备在拓扑中的位置。

00.0-[01-06] 表示 00.0 这个 bridge 设备创建出的 PCI 子树共有六条总线(而不是直接创建出来的总线)。

lspci 不会列出 pcie switch,只会列出 bridge,因为 PCIe Switch 的本质是多端口 bridge 的集合,在 PCI 拓扑中,Switch 被拆解为多个虚拟的 PCI-to-PCI 桥接器(P2P Bridge)(参考 PCIe Switch^)。

PCIe switch 在 lspci -t 中一般输出为什么?

-+-[0000:00]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-13.0-[03-04]----00.0-[04]----00.0
 |           +-14.0
 |           +-14.2
 |           +-14.4
 |           +-15.0
 |           +-16.0
 |           +-16.1
 |           +-16.4
 |           +-1f.0
 |           +-1f.4
 |           \-1f.5
 +-[0000:12]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           \-01.0-[13-33]----00.0-[14-26]----00.0-[15-26]----00.0-[16-26]--+-00.0-[17]--
 |                                                                           +-01.0-[18]--
 |                                                                           +-02.0-[19]--
 |                                                                           +-03.0-[1a]--
 |                                                                           +-04.0-[1b]--
 |                                                                           +-05.0-[1c]--
 |                                                                           +-06.0-[1d]--
 |                                                                           +-07.0-[1e]--
 |                                                                           +-08.0-[1f]----00.0
 |                                                                           +-09.0-[20]--
 |                                                                           +-0a.0-[21]--
 |                                                                           +-0b.0-[22]--
 |                                                                           +-0c.0-[23]--
 |                                                                           +-0d.0-[24]--
 |                                                                           +-0e.0-[25]--
 |                                                                           \-0f.0-[26]--
 +-[0000:39]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:40]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0-[41]----00.0
 |           \-03.0-[42]----00.0
 +-[0000:48]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:4f]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:56]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           +-02.0
 |           +-03.0
 |           \-03.1
 +-[0000:57]---00.0
 +-[0000:59]---00.0
 +-[0000:60]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:6a]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:74]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:7e]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.3
 |           +-00.5
 |           +-01.0
 |           +-01.1
 |           +-01.2
 |           +-02.0
 |           +-02.1
 |           +-02.2
 |           +-03.0
 |           +-03.1
 |           +-03.2
 |           +-05.0
 |           +-05.1
 |           +-05.2
 |           +-06.0
 |           +-06.1
 |           +-06.2
 |           +-07.0
 |           +-07.1
 |           +-07.2
 |           +-08.0
 |           +-08.1
 |           +-08.2
 |           +-0c.0
 |           +-0d.0
 |           +-0e.0
 |           +-0f.0
 |           +-1a.0
 |           +-1b.0
 |           +-1c.0
 |           \-1d.0
 +-[0000:7f]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.3
 |           +-00.4
 |           +-00.5
 |           +-00.6
 |           +-00.7
 |           +-01.0
 |           +-01.1
 |           +-01.2
 |           +-01.3
 |           +-01.4
 |           +-01.5
 |           +-01.6
 |           +-01.7
 |           +-02.0
 |           +-02.1
 |           +-02.2
 |           +-02.3
 |           +-02.4
 |           +-02.5
 |           +-02.6
 |           +-02.7
 |           +-03.0
 |           +-03.1
 |           +-03.2
 |           +-03.3
 |           +-03.4
 |           +-03.5
 |           +-03.6
 |           +-03.7
 |           +-04.0
 |           +-04.1
 |           +-04.2
 |           +-04.3
 |           +-04.4
 |           +-04.5
 |           +-04.6
 |           +-04.7
 |           +-05.0
 |           +-05.1
 |           +-05.2
 |           +-05.3
 |           +-05.4
 |           +-05.5
 |           +-05.6
 |           +-05.7
 |           +-06.0
 |           +-06.1
 |           +-06.2
 |           +-06.3
 |           +-0a.0
 |           +-0a.1
 |           +-0a.2
 |           +-0a.3
 |           +-0a.4
 |           +-0a.5
 |           +-0a.6
 |           +-0a.7
 |           +-0b.0
 |           +-0b.1
 |           +-0b.2
 |           +-0b.3
 |           +-0b.4
 |           +-0b.5
 |           +-0b.6
 |           +-0b.7
 |           +-0c.0
 |           +-0c.1
 |           +-0c.2
 |           +-0c.3
 |           +-0c.4
 |           +-0c.5
 |           +-0c.6
 |           +-0c.7
 |           +-0d.0
 |           +-0d.1
 |           +-0d.2
 |           +-0d.3
 |           +-0d.4
 |           +-0d.5
 |           +-0d.6
 |           +-0d.7
 |           +-0e.0
 |           +-0e.1
 |           +-0e.2
 |           +-0e.3
 |           +-0e.4
 |           +-0e.5
 |           +-0e.6
 |           +-0e.7
 |           +-0f.0
 |           +-0f.1
 |           +-0f.2
 |           +-0f.3
 |           +-0f.4
 |           +-0f.5
 |           +-0f.6
 |           +-0f.7
 |           +-10.0
 |           +-10.1
 |           +-10.2
 |           +-10.3
 |           +-1d.0
 |           +-1d.1
 |           +-1e.0
 |           +-1e.1
 |           +-1e.2
 |           +-1e.3
 |           +-1e.4
 |           +-1e.5
 |           +-1e.6
 |           \-1e.7
 +-[0000:80]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:90]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:97]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:9e]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           \-01.0-[9f-bf]----00.0-[a0-b2]----00.0-[a1-b2]----00.0-[a2-b2]--+-00.0-[a3]----00.0
 |                                                                           +-01.0-[a4]--
 |                                                                           +-02.0-[a5]--
 |                                                                           +-03.0-[a6]--
 |                                                                           +-04.0-[a7]--
 |                                                                           +-05.0-[a8]--
 |                                                                           +-06.0-[a9]--
 |                                                                           +-07.0-[aa]--
 |                                                                           +-08.0-[ab]--
 |                                                                           +-09.0-[ac]--
 |                                                                           +-0a.0-[ad]--
 |                                                                           +-0b.0-[ae]--
 |                                                                           +-0c.0-[af]--
 |                                                                           +-0d.0-[b0]--
 |                                                                           +-0e.0-[b1]--
 |                                                                           \-0f.0-[b2]--
 +-[0000:c5]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:cc]-+-00.0
 |           +-00.1
 |           +-00.2
 |           \-00.4
 +-[0000:d3]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           +-02.0
 |           +-03.0
 |           \-03.1
 +-[0000:d4]---00.0
 +-[0000:d6]---00.0
 +-[0000:dd]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:e7]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:f1]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.4
 |           +-01.0
 |           \-02.0
 +-[0000:fe]-+-00.0
 |           +-00.1
 |           +-00.2
 |           +-00.3
 |           +-00.5
 |           +-01.0
 |           +-01.1
 |           +-01.2
 |           +-02.0
 |           +-02.1
 |           +-02.2
 |           +-03.0
 |           +-03.1
 |           +-03.2
 |           +-05.0
 |           +-05.1
 |           +-05.2
 |           +-06.0
 |           +-06.1
 |           +-06.2
 |           +-07.0
 |           +-07.1
 |           +-07.2
 |           +-08.0
 |           +-08.1
 |           +-08.2
 |           +-0c.0
 |           +-0d.0
 |           +-0e.0
 |           +-0f.0
 |           +-1a.0
 |           +-1b.0
 |           +-1c.0
 |           \-1d.0
 \-[0000:ff]-+-00.0
             +-00.1
             +-00.2
             +-00.3
             +-00.4
             +-00.5
             +-00.6
             +-00.7
             +-01.0
             +-01.1
             +-01.2
             +-01.3
             +-01.4
             +-01.5
             +-01.6
             +-01.7
             +-02.0
             +-02.1
             +-02.2
             +-02.3
             +-02.4
             +-02.5
             +-02.6
             +-02.7
             +-03.0
             +-03.1
             +-03.2
             +-03.3
             +-03.4
             +-03.5
             +-03.6
             +-03.7
             +-04.0
             +-04.1
             +-04.2
             +-04.3
             +-04.4
             +-04.5
             +-04.6
             +-04.7
             +-05.0
             +-05.1
             +-05.2
             +-05.3
             +-05.4
             +-05.5
             +-05.6
             +-05.7
             +-06.0
             +-06.1
             +-06.2
             +-06.3
             +-0a.0
             +-0a.1
             +-0a.2
             +-0a.3
             +-0a.4
             +-0a.5
             +-0a.6
             +-0a.7
             +-0b.0
             +-0b.1
             +-0b.2
             +-0b.3
             +-0b.4
             +-0b.5
             +-0b.6
             +-0b.7
             +-0c.0
             +-0c.1
             +-0c.2
             +-0c.3
             +-0c.4
             +-0c.5
             +-0c.6
             +-0c.7
             +-0d.0
             +-0d.1
             +-0d.2
             +-0d.3
             +-0d.4
             +-0d.5
             +-0d.6
             +-0d.7
             +-0e.0
             +-0e.1
             +-0e.2
             +-0e.3
             +-0e.4
             +-0e.5
             +-0e.6
             +-0e.7
             +-0f.0
             +-0f.1
             +-0f.2
             +-0f.3
             +-0f.4
             +-0f.5
             +-0f.6
             +-0f.7
             +-10.0
             +-10.1
             +-10.2
             +-10.3
             +-1d.0
             +-1d.1
             +-1e.0
             +-1e.1
             +-1e.2
             +-1e.3
             +-1e.4
             +-1e.5
             +-1e.6
             \-1e.7

PCIe Switch

只有 PCIe 有 switch。PCI 没有 switch。

一个标准 Switch 具有一个上游端口和多个下游端口。上游端口与 RC 或者其他 Switch 的下游端口相连。

和 PCIe Domain 的关系

和计算机网络中 switch 和 bridge 的关系一样,switch 可以隔离广播域(分割不同 PCI Domain),而 PCI 的 bridge 不能(还是同一个 PCI Domain)。

In some advanced configurations, PCI switches might be used to isolate traffic between specific groups of devices, essentially creating sub-domains within a larger domain. However, this requires specific switch features and careful configuration, and it's not a standard functionality for all PCI switches.

PCI Bus

Bus 貌似只是 PCI 里的概念,PCIe 里没有 bus 的概念因为都是 switch。

If you have 3 PCI domains you either have 3 Host Bridges or 3 PCIe root ports.

cpu - How is PCI segment(domain) related to multiple Host Bridges(or Root Bridges)? - Stack Overflow

PCI Bus 也是一种 PCI Device,至少在 QEMU 的 code 中是这样的:

static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
                                           PCIBus **aliased_pbus,
                                           PCIBus **piommu_bus,
                                           uint8_t *aliased_pdevfn)
{
    //...
    while (iommu_bus && (!iommu_bus->iommu_ops ||
           !iommu_bus->iommu_ops->get_address_space) &&
           iommu_bus->parent_dev) {
        // 把 iommu_bus->parent_dev 看作是一个 PCI bus
        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
        //...
        if (!pci_bus_is_express(iommu_bus)) {
            // 把 iommu_bus->parent_dev 看作是一个 PCI device
            PCIDevice *parent = iommu_bus->parent_dev;
            //...
    }
    //...
}

PCI Segment / PCI Domain

Each PCI domain/PCI segment supports up to 256 buses. An OS can assign the bus numbers of each PCI domain as it please. Unfortunately, the word PCI domain has also a meaning in the Linux kernel, it is used to number each Host Bridge.

The two sides (upstream and downstream) of a PCI bridge can be considered in the same PCI domain, assuming it's configured that way. 也就是说网桥的两端其实都是同一个 PCI Domain。Bridge 的作用其实就是 Used to extend the reach of a single PCI domain。

我们可以用多个 RC 或者 PCIe switch^ 来隔离 PCI Domain。

Multiple root complexes are the key to achieving multiple PCI domains in a system.

PCI Domain 和 PCI Segment 是同一个东西。 PCI domain and PCI segment are basically two terms for the same concept - a collection of PCI buses that share a common configuration space.

This identifies a group of devices that can communicate directly with each other on the PCIe bus. It's essentially a traffic isolation zone. In most systems, there's only one PCI segment (often referred to as segment 0) and it might not even be explicitly shown.

A PCIe link is regarded as a PCI bus segment; this checks with the fact that the switch has a P2P bridge for each downstream port (in total there are 1 + n PCI bus segment for a switch). A switch has one more port: the upstream port. So a switch takes 1 + N + 1 PCI segment bus:

In PCIe, a single connection between devices (like a switch and a device) can be considered a segment. This is because each connection acts like a separate "island" for communication.

The PCR is basically a switch with an important twist: each one of its ports establishes a new PCI domain.

Long story short: same word, different meanings.

PCI Segment Group

应该只是 PCIe 里的概念。A PCI Segment Group can have up to 256 PCI bus segments. In most systems there is only one PCI Segment Group (PCI Segment Group number 0).

This enables a single computer to have a much larger number of devices if needed (theoretically up to 16 million). However, having multiple segment groups is uncommon in most desktop and server systems.

PCI Host Bridge (PHB) / PCI Root Bridge

The PCI host bridge provides an interconnect between the processor and peripheral components.

PCI bridge 只能有两个端(upstream, downstream),但是 PCI host bridge 可以有多个。The host bridge serves as a central hub with multiple connections,而且 PCI host bridge 其实也可以隔离处理器系统的存储器域与 PCI 总线域,从这点上来看,和 PCIe switch 还是挺像的。

一个 CPU 可以有多个 PHB 吗?

一个 CPU 可以拥有多个 PHB(PCI Host Bridge),因为一个 CPU 有一个 RC,一个 RC 可以有多个 PHB。

PHB 和其他 PCI switch 的关系是什么?为什么要区分开来?

i440FX 就是这样的一个 host bridge。

Is PCI Host Bridge also a PCI device?

它不是一个 PCI device。It is a bridge (conceptually a Host-to-PCI bridge) that lets the CPU performs PCI transactions. A CPU can have more than a Host Bridge, nothing prevents it, though it's very rare. More likely, a system can have more than one CPU and with a Host Bridge integrated into each of them, a system can have more than one Host Bridge.

For PCI, each Host Bridge establishes a PCI domain. The main characteristic of a PCI domain is that it is isolated from other PCI domains: a transaction is not required to be routable between domains.

对于 Root Bridge:

The term Root Bridge doesn't exist in the specification.

cpu - How is PCI segment(domain) related to multiple Host Bridges(or Root Bridges)? - Stack Overflow

PCI Root Bus / Primary/Secondary/Subordinate Bus

A PCI root bus, also known as a PCI host bridge (PCI) or Root Complex (PCIe). The PCI root bus remains the central connection point for the CPU, memory, and the first level of PCI buses. It's still directly connected to the CPU.

  • The bus connected to the upstream side of a bridge is referred to as its primary bus, it's essentially the "incoming" bus from which the bridge receives data packets.
  • while the bus connected to its downstream side is referred to as its secondary bus. It's essentially the "outgoing" bus where the bridge sends data packets after processing them.
  • The highest numbered PCI bus that can be reached by traversing downstream from a specific PCI bridge. It essentially defines the range of buses "below" the bridge in the hierarchy. Imagine a branching network of PCI buses. The subordinate bus defines the farthest point (highest bus number) you could reach by following these branches downstream from a particular bridge.

对于 Root Bus:

i440fx_pcihost_realize
    PCIHostState *phb = PCI_HOST_BRIDGE(dev);
    b = pci_root_bus_new
        pci_root_bus_internal_init
    phb->bus = b;

static inline bool pci_bus_is_root(PCIBus *bus)
{
    return !!(bus->flags & PCI_BUS_IS_ROOT);
}

/* This bus is the root of a PCI domain */
PCI_BUS_IS_ROOT                                         = 0x0001,

cpu - How is PCI segment(domain) related to multiple Host Bridges(or Root Bridges)? - Stack Overflow

struct PCIHostState QEMU

是不是每一个 machine type(比如 i440FX)都有这样的一个 host state 呀。

有一个全局变量叫做 pci_host_bridges,是一个关于 PCIHostState 的 list。会在:

pci_root_bus_internal_init
pci_host_bus_register
struct PCIHostState {
    SysBusDevice busdev;

    MemoryRegion conf_mem;
    MemoryRegion data_mem;
    MemoryRegion mmcfg;
    uint32_t config_reg;
    bool mig_enabled;
    // root bus
    PCIBus *bus;
    bool bypass_iommu;

    QLIST_ENTRY(PCIHostState) next;
};

PXB (PCI Expander Bridge)

PXB is a "light-weight" host bridge in the same PCI domain
as the main host bridge whose purpose is to enable
the main host bridge to support multiple PCI root buses.
It is implemented only for i440fx and can be placed only
on bus 0 (pci.0).

As opposed to PCI-2-PCI bridge's secondary bus, PXB's bus
is a primary bus and can be associated with a NUMA node
(different from the main host bridge) allowing the guest OS
to recognize the proximity of a pass-through device to
other resources as RAM and CPUs.

qemu/docs/pci_expander_bridge.txt at master · qemu/qemu

PCI-PCI Bridge

就像网桥一样,Connect 2 PCI buses.

两个 port 的地位并不是对等的,Upstream port 和 Downstream port 之间是有区分的。

PCIe Root Port / PCIe Switch Port

PCI Express Port 是一种 PCI-PCI 桥接结构。

There are two types of PCI Express Port: the Root Port and the Switch Port. 其实很简单,Root Port 就是 Root Complex 上的 port,而 Switch Port 就是 switch 上的 port。

  • The Root Port originates a PCI Express link from a PCI Express Root Complex and
  • The Switch Port connects PCI Express links to internal logical PCI buses.

The PCI Express Root Port is a port on the root complex. 每个 root port 都有一个唯一的编号,用于区分不同的 PCIe 总线。

The Switch Port, which has its secondary bus representing the switch's internal routing logic, is called the switch's Upstream Port. The switch's Downstream Port is bridging from switch's internal routing bus to a bus representing the downstream PCI Express link from the PCI Express Switch. 一图胜千言:

https://www.kernel.org/doc/Documentation/PCI/PCIEBUS-HOWTO.txt

What's the difference between using multiple lanes with the original PCI's parallel interface?

If this were a bus like the original PCI, you would need to be careful about aligning all of the signals properly, so there was essentially no skew between then. Good luck getting that to work at GHz speeds.

PCIe allows a little more independence among the lanes. This lets it shoot for high bit rates (e.g. 8 GT/sec for Gen3) without having to try to align all of the signals on a wide bus to within a fraction of a clock cycle. When you send data toward a PCIe device, it gets chopped up into symbols, and those symbols are transmitted separately across the multiple lanes.

The lanes aren't tightly synchronized like a bus. Rather, they're somewhat loosely synchronized. PCIe relies on logic outside the serial links to realign the streams of symbols at the receiving end, so they can be reassembled into the original request stream.

www.quora.com

PCIe in northbridge or southbridge?

The southbridge is capable of supporting PCIe in some cases, but in maximum cases it is supported by the northbridge.

CPU 直连的总线有两个,一个是内存的总线,一个是 PCIE 总线,通常是 PCIE x16,一个显卡就占了。

m2 的固态硬盘也是 PCIe 的总线,那么会不会和显卡抢占带宽呢?答案是不会,因为固态硬盘用的就是南桥的 PCIe。

AMD 提供了 20 条直连 PCIe 通道,16 给显卡,剩下的 4 分给了一个 m2。所以的 AMD 5600X + B550M 为什么会有两个性能不一样的 m2 插槽呢?原因就是这个。

【硬件科普】电脑主板右下角的散热片下面究竟隐藏着什么?详解主板南桥芯片组的功能和作用_哔哩哔哩_bilibili

BDF

一共 2 bytes,8 bit for bus, 5 for devices and 3 for functions. So there could be 256 buses, each with up to 32 devices, each supporting 8 functions in a device PCIe tree.

BDF 是一个软件(比如说 kernel)配置的号还是硬件拓扑就决定了,和 OS 无关?Windows 和 Linux 下都有 BDF 的概念。是一个 PCIe 硬件上的概念,因为 PCIe switch 也需要通过 BDF 来路由到对应的设备。

BDF (or B/D/F) stands for Bus, Device, Function. It it a 16 bit number that locates a device's function.

  • PCI Bus number in hexadecimal, often padded using a leading zeros to two or four digits (8 bits)
  • A colon (:)
  • PCI Device number in hexadecimal, often padded using a leading zero to two digits. (5 bits)
  • A point (.)
  • PCI Function number in hexadecimal (3 bits)

For example 00:1f.2.

和 PCI 总线一样,PCIe 总线中的每一个功能(Function)都有一个唯一的标识符与之对应。这个标识符就是 BDF(Bus,Device,Function),PCIe 的配置软件(即 Root 的应用层,一般是 PC)应当有能力识别整个 PCIe 总线系统的拓扑逻辑,以及其中的每一条总线(Bus),每一个设备(Device)和每一项功能(Function)。

所以 PCIe 中也有功能的概念,并且是最小的粒度。

每一个 PCIe 设备可以只有一个功能(Function),即 Fun0。也可以拥有最多 8 个功能,即多功能设备(Multi-Fun),因为功能标识用了 3 位。

每个设备必须要有功能 0(Fun0),其他的 7 个功能(Fun1~Fun7)都是可选的。

PCI-bridge 也有自己的 BDF。

PASID (Process Address Space Identifier)

Remapping hardware treats inbound memory requests from root-complex integrated devices and PCI Express* attached discrete devices into two categories:

  • Requests without address-space-identifier: These are the normal memory requests from endpoint devices. These requests typically specify the type of access (read/write/atomics), targeted DMA address/size, and source-id of the device originating the request (e.g., Bus/Dev/Function).
  • Requests with address-space-identifier: These are memory requests with additional information identifying the targeted address space from endpoint devices. Beyond attributes in normal requests, these requests specify the targeted Process Address Space Identifier (PASID), and Privileged-mode-Requested (PR) flag (to distinguish user versus supervisor access). For details, refer to the PASID Extended Capability Structure in the PCI Express specification.

Bus Master Enable (BME)

BME means "Bus Master Enable" and it is the Bit 2 in Command Register(offset 0x4) in PCI Config space.

This bit specifies if a function is capable of issuing Memory and IO Read/Write requests.(这和 DMA 有什么区别,其实就是允许 DMA 了)。

A device needs to have Bus Master Enable (BME) set in its configuration to initiate DMA transfers.

BIOS sets this bit blindly:

  • If the device supports bus master access, the bit becomes 1,
  • otherwise, the write to this bit has no effect and the bit remains 0.

深入PCI与PCIe之一:硬件篇 - 知乎

深入PCI与PCIe之二:软件篇 - 知乎

MMIO / PMIO

Each I/O device monitors the CPU's address bus and responds to any CPU access of an address assigned to that device, connecting the data bus to the desired device's hardware register.

Port-mapped I/O (PMIO) often uses a special class of CPU instructions designed for performing I/O, such as the in and out. I/O devices have a separate address space from general memory.

Memory-mapped I/O (MMIO) uses the same physical address space to address both memory and I/O devices. So a memory address may refer to either RAM or I/O device. Thus, the CPU instructions used to access the memory can also be used for accessing devices.

Memory-mapped I/O and port-mapped I/O - Wikipedia

Advantage:

  • By discarding the extra complexity that port I/O brings, a CPU requires less internal logic and is thus cheaper, faster, easier to build, consumes less power and can be physically smaller;
  • instructions that perform an ALU operation directly on a memory operand (loading an operand from a memory location, storing the result to a memory location, or both) can be used with I/O device registers as well.

As you look at this, the crucial thing to keep in mind is that the CPU doesn’t really know anything about what it’s connected to. It talks to the outside world through its pins but it doesn’t care what that outside world is.

When the northbridge receives a physical memory request it decides where to route it: should it go to RAM? Video card maybe? This routing is decided via the memory address map. For each region, the memory map knows the device that owns that region. (MMIO 的这段映射程序是存在于 Northbridge 上面)

This memory layout is permanent, but user programs do not see it directly - instead, they run into their own virtual address space to which the kernel can decide to map, wherever it wants, physical memory and IO ranges.

MMIO vs DMA

这两者都是 PCIe 里的两种传输方式(所以把它们放到了 PCIe 的这篇文章里)。

PCIe Part 2 - All About Memory: MMIO, DMA, TLPs, and more! – Reversing Engineering for the Soul 这篇文章讲 MMIO 和 DMA 的关系挺好的。

MMIO allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.

DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.

MMIO 是从 driver 角度访问设备内存,可以是写(从内存到设备内存)也可以是读(从设备内存到内存),MMIO 其实本质上还是 CPU 在指导数据的拷贝,因为还是普通操作内存的指令 mov 等等,Hardware optimizations allow this method to achieve a throughput that is quite a bit faster than config space accesses. However, its speed still pales in comparison to the bulk transfer speed of DMA. 教科书上为了说明 DMA 好处用来拉踩 CPU 直接搬运效率低的,应该就是 PIO & MMIO。因为 CPU 需要执行 MOV 指令,这样会占用到 CPU 自己的寄存器空间,同时也没有办法一次性移动很多内容(受限于寄存器大小),需要很多 MOV 指令才可以

DMA 是从设备的角度访问内存,可以是从内存写到设备内存,也可以是从设备内存写到内存,DMA 没有特别的指令,只是 program 一下 dma controller 里的相关控制位就行了。But unlike MMIO, DMA is initiated by the device itself, not the host CPU. The host CPU will tell the device over MMIO where the DMA should go and the device itself is responsible for starting and finishing the DMA transfer.

没有 PCIe 协议之前,有 MMIO 和 DMA 吗?

  • ISA 就已经有 MMIO 了。
  • ISA 也有 DMA。

PCIe 的 MMIO/DMA 相比于 ISA 的 MMIO/DMA 优势是什么?

最重要的,我觉得是带宽,毕竟都是 DMA,底层传输物理层实现是不一样的,速度肯定有区别。

linux - What is the difference between DMA and memory-mapped IO? - Stack Overflow

Whose job to execute MMIO map?

Address decoder - Wikipedia

An address decoder has two or more inputs for address bits and one or more outputs for device selection signals.

When the address for a particular device appears on the address inputs, the decoder asserts the selection output for that device.

  • A dedicated, single-output address decoder may be incorporated into each device on an address bus,
  • or a single address decoder may serve multiple devices.

For PCI:

  • Memory BARs: Bit 31-4 is the base address.
  • I/O BARs: Bit 31-2 is the base address.

Whose job to establish MMIO map? BIOS or OS?

In sum, the BIOS:

  • Discovers the device's BARs.
  • For each BAR, asks the BAR how much address space it needs.
  • Assigns the BAR a block of address space.
  • Programs each BAR with the base addresses (this also configure the address decoder).

How is memory mapped to certain hardware? How is MMIO accomplished exactly? - Super User

Can MMIO mapping be updated in runtime?

MMIO process

In sum, the BIOS:

  • Discovers the device's BARs (base address register). Each device can have up to six BARs.
  • For each BAR, asks the BAR how much address space it needs.
  • Assigns the BAR a block of address space.
  • Programs the BAR with the base address of the chosen block of memory address space.

You can, of course, find the process detailed on Wikipedia's PCI configuration space - Wikipedia page.

How is memory mapped to certain hardware? How is MMIO accomplished exactly? - Super User

Motherboard Chipsets and the Memory Map | Many But Finite

mmap (memory-mapped file I/O), not to be confused with memory-mapped I/O

Root Complex

只有在 PCIe 当中存在。通常位于 CPU 芯片上,负责连接 CPU 与其他设备,如 GPU、网卡、存储控制器等。所以说一个 NUMA Node 只有一个 Root Complex,因为一个 NUMA Node 有一个 CPU Package。lscpi -s 18:00.0 -vv 可以看到 PCI 设备 18:00.0 属于哪一个 NUMA node。

There is only one root complex per socket, not one per core.

CPU is not connect to Memory directly! They are connected through Root Complex.

Though the following topology, we cannot regard memory as endpoint and Root Complex also doesn't connect memory using PCIe protocol, the mechanism that the root complex uses to send the data to memory is highly implementation specific.

linux device driver - How PCIE Root complex moves DMA transaction from PCIe endpoint to Host memory - Stack Overflow

pci e - Is it possible to connect a pciE slot to one of many Root Complexes - Stack Overflow

How does Root Complex handle the MMIO request from device side (not from CPU side)?

Configuration space

For PCI: PCI LOCAL BUS SPECIFICATION, REV. 3.0

  • 6 Configuration Space

For PCIe: PCIE 6.0 SPEC:

  • Figure 7-3 PCI Express Configuration Space Layout
  • Figure 7-4 Common Configuration Space Header

PCI devices have a set of registers referred to as configuration space.

  • PCI configuration space length: 256 bytes (The first 64 bytes of configuration space are standardized; the remainder (192 bytes) are available for vendor-defined purposes. In order to allow more parts of configuration space to be standardized without conflicting with existing uses, there can be a list of capabilities defined within the remaining 192 bytes of PCI configuration space. Each capability has one byte that describes which capability it is, and one byte to point to the next capability.)
  • PCIe configuration space length: 4096 bytes (with the first 256 bytes for PCI and the rest for PCIe extended capabilities.)

BARs are a set of those registers, though not all. (像 Device ID,Vendor ID,Class Code 和 Revision ID,是只读寄存器,不是 BAR)。

不是所有的 device 的 configuration space 的 layout 是一样的,下面是 configuration space 的 layout:

  • PCI Header (first 64 bytes):
    • The first 16 Bytes (four words) are defined the same for all types of PCI devices: Device ID, Vendor ID, Status, Command, Class Code, Revision ID, BIST, Header Type, Latency Timer, Cache Line Size.
    • The remaining words can have different layouts depending on the base function that the device supports. The Header Type Register values determine the different layouts of remaining 48 bytes (64-16) of the header, depending on the function of the device. (BARs 就是在这里面,可能不同的设备类型 BAR 是不一样的)
      • Type 1 headers for Root Complex, switches, and bridges.
      • Type 0 for endpoints.
  • List of capabilities (192 bytes).(上面说的 PCI Header 里有一个寄存器叫做 Capabilities Pointer,which points (an offset into this function's configuration space,不是指向内存) to a linked list of new capabilities implemented by the device. Used if bit 4 of the status register is set to 1.) Capabilities Pointer 存放 Capabilities 结构链表的头指针。在一个 PCIe 设备中,可能含有多个 Capability 结构,这些寄存器组成一个链表。 (为什么不直接弄成连续的呢?)
  • PCIe extended capabilities (256 - 4096 bytes). The only standardized is the first four bytes at 0x100 (256 bytes) which are the start of an extended capability list. Extended capabilities are very much like normal capabilities except that they can refer to any byte in the extended configuration space (by using 12 bits instead of eight), have a four-bit version number and a 16-bit capability ID. Extended capability IDs overlap with normal capability IDs, but there is no chance of confusion as they are in separate lists.

PC 机想操作 PCIE 设备的内存空间,只能通过 BAR 空间来进行操作。PC 机只能看见 BAR 空间。

CPU 如果想访问某个设备的空间,RC 会代办。比如:

  • 如果 CPU 想读外设的数据,RC 通过 TLP 把数据从 PCIe 外设读到内存,然后 CPU 从内存读数据;
  • 如果 CPU 要往外设写数据,则先把数据在内存中准备好,然后 RC 通过 TLP 写入到 PCIe 设备。

具体实现就是上电的时候,系统把 PCIe 设备开放的空间映射到内存空间,CPU 要访问该设备空间,只需访问对应的内存空间。RC 检查该内存地址,如果发现该内存空间地址是某个设备空间的映射,就会触发其产生 TLP,去访问对应的 PCIe 设备。

The first 64 bytes of configuration space are standardized:

undefined

Configuration Space (4096 bytes):

  • First 64 bytes:
    • BARs
  • 192 bytes:
    • Capabilities Pointer (Offset 34h, point to a linked list of capabilities) (MSI, MSI-X, etc.)
  • Rest (PCIe extended capabilities)

PCI configuration space - Wikipedia

老男孩读PCIe之六:配置和地址空间

PCI Vendor ID & PCI Device ID

#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4

VirtIO 用的 vendor id 好像和 REDHAT 是同一个?

The 16-bit device ID is then assigned by the vendor. 表示这个厂商的某一个 device 的型号。

PCI status (16bit)

属于 PCI configuration space header 的一部分。

// QEMU
#define PCI_STATUS		0x06	/* 16 bits */
#define  PCI_STATUS_IMM_READY	0x01	/* Immediate Readiness */
#define  PCI_STATUS_INTERRUPT	0x08	/* Interrupt status */
#define  PCI_STATUS_CAP_LIST	0x10	/* Support Capability List */
#define  PCI_STATUS_66MHZ	0x20	/* Support 66 MHz PCI 2.1 bus */
#define  PCI_STATUS_UDF		0x40	/* Support User Definable Features [obsolete] */
#define  PCI_STATUS_FAST_BACK	0x80	/* Accept fast-back to back */
#define  PCI_STATUS_PARITY	0x100	/* Detected parity error */
#define  PCI_STATUS_DEVSEL_MASK	0x600	/* DEVSEL timing */
#define  PCI_STATUS_DEVSEL_FAST		0x000
#define  PCI_STATUS_DEVSEL_MEDIUM	0x200
#define  PCI_STATUS_DEVSEL_SLOW		0x400
#define  PCI_STATUS_SIG_TARGET_ABORT	0x800 /* Set on target abort */
#define  PCI_STATUS_REC_TARGET_ABORT	0x1000 /* Master ack of " */
#define  PCI_STATUS_REC_MASTER_ABORT	0x2000 /* Set on master abort */
#define  PCI_STATUS_SIG_SYSTEM_ERROR	0x4000 /* Set when we drive SERR */
#define  PCI_STATUS_DETECTED_PARITY	0x8000 /* Set on parity error */

PCI_STATUS_INTERRUPT

This read-only bit reflects the state of the interrupt in the device/function. Only when the Interrupt Disable bit in the command register is a 0 and this Interrupt Status bit is a 1, will the device’s/function’s INTx# signal be asserted. Setting the Interrupt Disable bit to a 1 has no effect on the state of this bit.

BAR (Base Address Registers) / Type 0 Configuration Space Header / Type 1 Configuration Space Header

// PCI Spec
7. Software Initialization and Configuration
    7.5 PCI and PCIe Capabilities Required by the Base Spec for all Ports
        7.5.1 PCI-Compatible Configuration Registers
            7.5.1.2 Type 0 Configuration Space Header
            7.5.1.3 Type 1 Configuration Space Header

Type 0 是给 endpoints 用的,Type 1 是给 root port 用的。

Command register in BAR

Command register 并不是 BAR 的一部分,这意味着这是一个 general 的 register,Type 0 和 Type 1 都有这个 register。

The Command register contains a bitmask of features that can be individually enabled and disabled.

The PCI Express (PCIe) command register is not read-only. The CPU, through the operating system or device drivers, can set and clear specific bits within the register to control the functionality of a PCIe device.

PCI capability list

PCI Local Bus Specification Revision 3.0

  • 6.7. Capabilities List

尽管 PCI capability list 里的 capability 以及每一个 capability 的大小都是可以自定义的。但是仍旧需要遵循一个标准。

因为 256 bytes 的 configuration space 的前 64 bytes 是定义好的,后 192 bytes 可以是 device 自己实现的结构,也可能是一个 Capabilities List (PCI Status Register Capabilities List bit (bit 4) is set)。

Each capability in the list consists of

  • an 8-bit ID field assigned by the PCI SIG,
  • an 8 bit pointer in configuration space to the next capability,
  • and some number of additional registers immediately following the pointer to implement that capability.

Each capability must be DWORD aligned.(这并不意味着每一个 capability 的大小必须是相等的,只需要起始的地方是 align 的就行),比如从下面的图我们可以看出来:

  • Capabilities entries 不一定是按顺序放的,也不一定是挨着放的;
  • Capabilities entries 的大小不一定要相等;
  • Capabilities entries 的大小不一定是 WORD 的整数倍。
//                                                 ┌───────────────┐
//                                                 │               │
//                          Capabilities Pointer   │      A4h      │  34h
//                                                 │               │
//                                                 └─────────────┬─┘
//                                                               │
//                                                               │
//                                                               │
// ┌───────────────────────────────┬───────────────┬─────────────┼─┐
// │                               │               │             │ │
// │         Capability Y          │      E0h      │ ID for Y    │ │  5Ch
// │                               │               │             │ │
// └───────────────────────────────┴────┬──────────┴────────▲────┼─┘
//                                      │                   │    │
//                                      │    ┌──────────────┘    │
//                                      │    │                   │
// ┌───────────────────────────────┬────┼────┴─────┬─────────────▼─┐
// │                               │    │          │               │
// │         Capability X          │    │ 5Ch      │ ID for X      │  A4h
// │                               │    │          │               │
// │                               └────┼──────────┴───────────────┤
// │                                    │                          │
// │                                    │                          │
// │                                    │                          │
// │                                    │                          │
// └────────────────────────────────────┼──────────────────────────┘
//                                      │
//                                      └──────────────────┐
//                                                         │
// ┌───────────────────────────────┬───────────────┬───────▼───────┐
// │                               │               │               │
// │          Capability Z         │      00h      │   ID for Z    │  E0h
// │                               │               │               │
// │                               └───────────────┴───────────────┤
// │                                                               │
// │                                                               │
// │                                                               │
// │                                                               │
// └───────────────────────────────────────────────┐               │
//                                                 │               │
//                                                 │               │
//                                                 │               │
//                                                 └───────────────┘

PCIe extended capabilities

Extended capabilities are optional features that go beyond the core PCIe specification.

Each extended capability has its own capability header specifying its ID, version, and next capability pointer. They reside within the device's 4KB configuration space, accessed through PCIe configuration transactions.

PCIe extended capabilities (256 - 4096 bytes). The only standardized is the first four bytes at 0x100 (256 bytes) which are the start of an extended capability list.

Extended capabilities are very much like normal capabilities except that they can refer to any byte in the extended configuration space (by using 12 (4K bytes) bits instead of 8 (256 bytes)), have a four-bit version number and a 16-bit capability ID. 这么看来引入 Extended 的好处就是能够扩展 capabilities list 的空间?

Extended capability IDs overlap with normal capability IDs, but there is no chance of confusion as they are in separate lists.(尽管取值空间有重合,但是因为是不同的 list 所以对于一个数字不会有歧义。)

TLP

TLP Type

  • MR: Memory Read Request
  • MW: Memory Write Request
  • IOR: I / O Read Request
  • IOW: I / O Write Request
  • Msg: Message Request

PCIe QEMU Code

struct PCIDevice QEMU

struct PCIDevice {
    AddressSpace bus_master_as;
    MemoryRegion bus_master_container_region;
    // 这是一个 alias MemoryRegion, see function pci_init_bus_master()
    // 和 Command register 里的 BME bit 有关系,当 BME bit 是置上的时候才会 enable
    // See memory_region_set_enabled(&s->bus_master_enable_region, pci_get_word(s->config + PCI_COMMAND) & PCI_COMMAND_MASTER);
    // 表示这个 Device issue memory I/O request 的时候,backend memory 应该是这一片 memory
    // 也就是 DMA 的 memory region,可以这么理解。
    MemoryRegion bus_master_enable_region;
    //...
};

struct PCIBus QEMU

struct PCIBus {
    BusState qbus;
    enum PCIBusFlags flags;
    const PCIIOMMUOps *iommu_ops;
    void *iommu_opaque;
    uint8_t devfn_min;
    uint32_t slot_reserved_mask;
    pci_set_irq_fn set_irq;
    pci_map_irq_fn map_irq;
    pci_route_irq_fn route_intx_to_irq;
    void *irq_opaque;
    // 一个 bus 上的 devices
    // pci_qdev_realize
    //     do_pci_register_device 
    //         bus->devices[devfn] = pci_dev;
    PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
    PCIDevice *parent_dev;
    MemoryRegion *address_space_mem;
    MemoryRegion *address_space_io;

    QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
    QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */

    /* The bus IRQ state is the logical OR of the connected devices.
       Keep a count of the number of devices with raised IRQs.  */
    int nirq;
    int *irq_count;

    Notifier machine_done;
};