I/O devices use neither virtual addresses nor physical addresses; they use a third kind of address: a "bus address".

Data transfer can be triggered in two ways:

  • Either the software asks for data (via a function such as read())
  • or the hardware asynchronously pushes data to the system.

A more concrete flow with DMA (third-party DMA) looks like this (a driver-side sketch follows the list):

  • Userspace issues a read() system call;
  • The kernel driver sets up the DMAC channel's registers (Address, Length, DMA Direction^);
  • The DMAC cooperates with the disk and writes the disk data into the memory region; the CPU is not involved in this at all (can the CPU be scheduled to other threads to do other work in the meantime, since this call is blocked? Yes, it can);
  • The DMAC signals the CPU that the data transfer is complete; the blocked process returns, and the CPU copies the data from the kernel buffer to the user buffer;
  • The user process switches from kernel mode back to user mode and is unblocked.
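Below is a minimal driver-side sketch of that flow. It assumes a hypothetical per-device structure and a hypothetical my_hw_start_dma() helper that programs the DMAC channel; it is illustrative only, not taken from a real driver:

#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/dma-mapping.h>
#include <linux/uaccess.h>

/* Hypothetical per-device state; dev->done is init_completion()'d at probe time. */
struct my_dev {
    void *kbuf;               /* kernel bounce buffer that the DMAC fills */
    dma_addr_t kbuf_dma;      /* bus address of that buffer */
    struct completion done;   /* completed by the DMA-done interrupt */
};

/* Hypothetical helper: programs the DMAC channel (address, length, direction). */
void my_hw_start_dma(struct my_dev *dev, dma_addr_t addr, size_t len);

static ssize_t my_read(struct file *filp, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
    struct my_dev *dev = filp->private_data;

    reinit_completion(&dev->done);
    my_hw_start_dma(dev, dev->kbuf_dma, count);      /* step 2: set up the DMAC */

    /* Step 3: this process sleeps; the CPU is free to run other threads. */
    if (wait_for_completion_interruptible(&dev->done))
        return -ERESTARTSYS;

    /* Step 4: the CPU copies the data from the kernel buffer to the user buffer. */
    if (copy_to_user(ubuf, dev->kbuf, count))
        return -EFAULT;
    return count;                                     /* step 5: back to userspace */
}

/* The DMAC's "transfer finished" interrupt ends up here and unblocks the reader. */
static irqreturn_t my_dma_irq(int irq, void *data)
{
    struct my_dev *dev = data;

    complete(&dev->done);
    return IRQ_HANDLED;
}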

DMA programming in the kernel

Different drivers use this common DMA framework.

kernel.org/doc/Documentation/DMA-API.txt
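As a hedged illustration of that common API (see DMA-API.txt above), a streaming mapping for a single device-to-memory transfer might look like this; the my_hw_*() helpers are hypothetical stand-ins for the device-specific parts:

#include <linux/dma-mapping.h>

/* Hypothetical device-specific helpers. */
void my_hw_program_and_start(dma_addr_t dma, size_t len);
void my_hw_wait_until_done(void);

/* Map an existing kernel buffer, hand its DMA (bus) address to the device,
 * and unmap it again once the transfer has finished. */
static int my_rx_one_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    if (dma_mapping_error(dev, dma))
        return -ENOMEM;

    my_hw_program_and_start(dma, len);
    my_hw_wait_until_done();

    /* Only after unmapping may the CPU safely look at the buffer contents. */
    dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
    return 0;
}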

A simple PCI DMA example

The actual form of DMA operations on the PCI bus is very dependent on the device being driven. Thus, this example does not apply to any real device; instead, it is part of a hypothetical driver called dad (DMA Acquisition Device). A driver for this device might define a transfer function like this. The following is the code in the driver:

int dad_transfer(struct dad_dev *dev, bool write, void *buffer,  size_t count)
{
    dma_addr_t bus_addr;
    unsigned long flags;

    // Map the buffer for DMA
    bus_addr = pci_map_single(dev->pci_dev, buffer, count, dev->dma_dir);
    //...

    // Set up the device
    // Set the device's command and some of its registers, such as addr and len
    writeb(dev->registers.command, DAD_CMD_DISABLEDMA);
    writeb(dev->registers.command, write ? DAD_CMD_WR : DAD_CMD_RD);
    writel(dev->registers.addr, cpu_to_le32(bus_addr));
    writel(dev->registers.len, cpu_to_le32(count));

    // Tell the device to start the DMA by writing its command register
    writeb(dev->registers.command, DAD_CMD_ENABLEDMA);
    return 0;
}

This function maps the buffer to be transferred and starts the device operation. The other half of the job must be done in the interrupt service routine, which would look something like this. This is the code in the interrupt handler:

void dad_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct dad_dev *dev = (struct dad_dev *) dev_id;

    /* Make sure it's really our device interrupting */

    /* Unmap the DMA buffer */
    pci_unmap_single(dev->pci_dev, dev->dma_addr, dev->dma_size, dev->dma_dir);

    /* Only now is it safe to access the buffer, copy to user, etc. */
    ...
}

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Requesting a DMA channel in a driver:

request_dma

The channel argument is a number between 0 and 7 or, more precisely, a positive number less than MAX_DMA_CHANNELS. On the PC, MAX_DMA_CHANNELS is defined as 8, to match the hardware. The name argument is a string identifying the device. The specified name appears in the file /proc/dma, which can be read by user programs.
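A hedged sketch of claiming and releasing such a channel in a driver's open/release methods (the channel number below is arbitrary; real drivers usually get it from configuration):

#include <linux/fs.h>
#include <asm/dma.h>        /* request_dma(), free_dma() */

#define MY_DMA_CHANNEL 3    /* example only: must be < MAX_DMA_CHANNELS */

static int my_open(struct inode *inode, struct file *filp)
{
    /* The string "my_device" appears in /proc/dma while the channel is held. */
    int err = request_dma(MY_DMA_CHANNEL, "my_device");

    if (err)
        return err;
    return 0;
}

static int my_release(struct inode *inode, struct file *filp)
{
    free_dma(MY_DMA_CHANNEL);
    return 0;
}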

Your sound card and your analog I/O interface can share the DMA channel as long as they are not used at the same time.

Another example of driver code that uses DMA to perform a MEM-TO-MEM transfer

#include <linux/module.h>
#include <linux/init.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Meta Information */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Johannes 4 GNU/Linux");
MODULE_DESCRIPTION("A simple DMA example for copying data from RAM to RAM");

void my_dma_transfer_completed(void *param) 
{
	struct completion *cmp = (struct completion *) param;
	complete(cmp);
}

/**
 * @brief This function is called, when the module is loaded into the kernel
 */
static int __init my_init(void) {
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *chan_desc;
	dma_cookie_t cookie;
	dma_addr_t src_addr, dst_addr;
	u8 *src_buf, *dst_buf;
	struct completion cmp;
	int status;

	printk("my_dma_memcpy - Init\n");

	dma_cap_zero(mask);
	dma_cap_set(DMA_SLAVE | DMA_PRIVATE, mask);
	chan = dma_request_channel(mask, NULL, NULL);
	if(!chan) {
		printk("my_dma_memcpy - Error requesting dma channel\n");
		return -ENODEV;
	}

	src_buf = dma_alloc_coherent(chan->device->dev, 1024, &src_addr, GFP_KERNEL);
	dst_buf = dma_alloc_coherent(chan->device->dev, 1024, &dst_addr, GFP_KERNEL);

	memset(src_buf, 0x12, 1024);
	memset(dst_buf, 0x0, 1024);

	printk("my_dma_memcpy - Before DMA Transfer: src_buf[0] = %x\n", src_buf[0]);
	printk("my_dma_memcpy - Before DMA Transfer: dst_buf[0] = %x\n", dst_buf[0]);

	chan_desc = dmaengine_prep_dma_memcpy(chan, dst_addr, src_addr, 1024, DMA_MEM_TO_MEM);
	if(!chan_desc) {
		printk("my_dma_memcpy - Error requesting dma channel\n");
		status = -1;
		goto free;
	}

	init_completion(&cmp);

	chan_desc->callback = my_dma_transfer_completed;
	chan_desc->callback_param = &cmp;

	cookie = dmaengine_submit(chan_desc);

	/* Fire the DMA transfer */
	dma_async_issue_pending(chan);

	if(wait_for_completion_timeout(&cmp, msecs_to_jiffies(3000)) <= 0) {
		printk("my_dma_memcpy - Timeout!\n");
		status = -1;
	}

	status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
	if(status == DMA_COMPLETE) {
		printk("my_dma_memcpy - DMA transfer has completed!\n");
		status = 0;
		printk("my_dma_memcpy - After DMA Transfer: src_buf[0] = %x\n", src_buf[0]);
		printk("my_dma_memcpy - After DMA Transfer: dst_buf[0] = %x\n", dst_buf[0]);
	} else {
		printk("my_dma_memcpy - Error on DMA transfer\n");
	}

	dmaengine_terminate_all(chan);
free:
	dma_free_coherent(chan->device->dev, 1024, src_buf, src_addr);
	dma_free_coherent(chan->device->dev, 1024, dst_buf, dst_addr);

	dma_release_channel(chan);
	return 0;
}

/**
 * @brief This function is called, when the module is removed from the kernel
 */
static void __exit my_exit(void) {
	printk("Goodbye, Kernel\n");
}

module_init(my_init);
module_exit(my_exit);

Linux_Driver_Tutorial/30_dma_memcpy at main · Johannes4Linux/Linux_Driver_Tutorial

Let's code a Linux Driver - 30 DMA (Direct Memory Access) Memcopy - YouTube

Asynchronous DMA

This happens, for example, with data acquisition devices that go on pushing data even if nobody is reading them. In this case, the driver should maintain a buffer so that a subsequent read call will return all the accumulated data to user space. The steps involved in this kind of transfer are slightly different:

  • The hardware raises an interrupt to announce that new data has arrived.
  • The interrupt handler allocates a buffer and tells the hardware where to transfer its data.
  • The peripheral device writes the data to the buffer and raises another interrupt when it’s done.
  • The handler dispatches the new data, wakes any relevant process, and takes care of housekeeping.

A variant of the asynchronous approach is often seen with network cards. These cards often expect to see a circular buffer (often called a DMA ring buffer) established in memory shared with the processor; each incoming packet is placed in the next available buffer in the ring, and an interrupt is signaled. The driver then passes the network packets to the rest of the kernel, and places a new DMA buffer in the ring.
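A heavily simplified sketch of that DMA ring-buffer scheme follows. All my_* names are hypothetical, and the error handling and NAPI machinery of real network drivers are omitted:

#include <linux/interrupt.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

#define RING_SIZE 16
#define BUF_SIZE  2048

/* One slot of a hypothetical RX DMA ring. */
struct rx_slot {
    void       *buf;    /* CPU address of the packet buffer */
    dma_addr_t  dma;    /* bus address written into the NIC's descriptor */
};

static struct rx_slot ring[RING_SIZE];
static unsigned int next_rx;            /* index the hardware fills next */

/* Hypothetical hardware helpers. */
bool my_hw_slot_is_full(unsigned int idx);
void my_hw_refill_slot(unsigned int idx, dma_addr_t dma);
void my_pass_packet_up(void *buf);

static irqreturn_t my_nic_rx_irq(int irq, void *data)
{
    struct device *dev = data;

    while (my_hw_slot_is_full(next_rx)) {
        struct rx_slot *slot = &ring[next_rx];

        /* Hand the filled buffer to the rest of the kernel... */
        dma_unmap_single(dev, slot->dma, BUF_SIZE, DMA_FROM_DEVICE);
        my_pass_packet_up(slot->buf);

        /* ...and place a fresh DMA buffer back into the ring. */
        slot->buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
        slot->dma = dma_map_single(dev, slot->buf, BUF_SIZE, DMA_FROM_DEVICE);
        my_hw_refill_slot(next_rx, slot->dma);

        next_rx = (next_rx + 1) % RING_SIZE;
    }
    return IRQ_HANDLED;
}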

Is asynchronous DMA simply the use case for first-party DMA?

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Can the CPU keep running while waiting for a DMA transfer to complete?

The answer is yes. There is a discussion of this question here: microprocessor - Does a CPU completely freeze when using a DMA? - Electrical Engineering Stack Exchange

  • "Operations like this do not delay the processor, but can be rescheduled to handle other tasks." (DMA (Direct Memory Access) Wiki - FPGAkey) The reason for rescheduling is that read() is generally synchronous: the user expects the next line of code to run only after the read completes, so the current process cannot continue and another process needs to be scheduled.

Note that the CPU can do other work while the data transfer is in progress: 在DMA控制传输的同时cpu真的还可以运行其他程序吗? - 其他 - 恩智浦技术社区. I think this is because the DMA acquires bus control by cycle stealing, so during the DMA the CPU still gets chances to master the bus and do its own work.

It is also mentioned here that the process is put to sleep: Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Is DMA initiated by the CPU or by the device? / first-party DMA / third-party DMA

First, there are two kinds of DMA:

  • Third-party DMA: the older scheme, and the only one that uses a DMAC. Here, to perform a DMA operation the CPU must first program the DMAC's registers.
  • First-party DMA: on the PCI/PCIe bus, every device can act as a bus master, so there is no longer a separate DMAC managing DMA as in the third-party scheme. When a device needs to access memory, it only needs to obtain the bus address corresponding to the memory address, request ownership of the bus, and issue read/write requests to the target address. To perform DMA, the device driver first allocates some memory and obtains the bus address of that memory, then sends a DMA descriptor containing that bus address to the device (via MMIO or PMIO); the PCIe device can then create PCIe Read/Write Transactions and communicate with the corresponding bus address over PCIe.

Either way, the CPU (the device driver) has to initiate the transfer; the difference is that one programs a DMAC, while the other programs the PCIe device directly via MMIO.
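A hedged sketch of that first-party (MMIO-programmed) flow for a hypothetical PCIe device; the register offsets inside BAR0 are invented:

#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

/* Invented register layout of a hypothetical DMA-capable PCIe device. */
#define REG_DMA_ADDR_LO 0x00
#define REG_DMA_ADDR_HI 0x04
#define REG_DMA_LEN     0x08
#define REG_DMA_START   0x0c

static int my_start_first_party_dma(struct pci_dev *pdev,
                                    void __iomem *bar0, size_t len)
{
    dma_addr_t bus_addr;
    void *buf;

    /* The driver allocates DMA-able memory and gets its bus address. */
    buf = dma_alloc_coherent(&pdev->dev, len, &bus_addr, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* Describe the transfer to the device via MMIO: bus address and length. */
    iowrite32(lower_32_bits(bus_addr), bar0 + REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(bus_addr), bar0 + REG_DMA_ADDR_HI);
    iowrite32(len, bar0 + REG_DMA_LEN);

    /* From here the device itself masters the bus and issues the
     * PCIe read/write transactions targeting bus_addr. */
    iowrite32(1, bar0 + REG_DMA_START);
    return 0;
}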

There are two claims. One says that DMA is generally initiated by the device: DMA(直接内存访问)实现数据传输的流程详解-百度开发者中心

But Wikipedia says the CPU initiates it first: "With DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in progress."

Can a DMA device skip being programmed in advance by the driver and directly interrupt the CPU's current execution flow to do DMA?

The CPU initializes the DMA controller with:

  • A count of the number of words to transfer, and
  • The memory address to use.

The CPU then commands the peripheral device to initiate a data transfer. The DMA controller then provides addresses and read/write control lines to the system memory. Each time a byte of data is ready to be transferred between the peripheral device and memory, the DMAC increments its internal address register until the full block of data is transferred. This is how the DMAC knows when it can notify the CPU.
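On a PC-style (third-party) DMAC, those two pieces of information are programmed roughly as below, using the kernel's ISA DMA helpers; the device-kick function at the end is hypothetical:

#include <asm/dma.h>

void my_hw_kick_device(void);    /* hypothetical: tells the peripheral to start */

static void my_program_dmac(unsigned int chan, dma_addr_t bus_addr,
                            unsigned int count, int to_memory)
{
    unsigned long flags = claim_dma_lock();

    disable_dma(chan);
    clear_dma_ff(chan);                 /* reset the DMAC's address flip-flop */
    set_dma_mode(chan, to_memory ? DMA_MODE_READ : DMA_MODE_WRITE);
    set_dma_addr(chan, bus_addr);       /* the memory address to use */
    set_dma_count(chan, count);         /* how many bytes to transfer */
    enable_dma(chan);

    release_dma_lock(flags);

    /* The peripheral is then told to start; the DMAC increments its internal
     * address register for each byte until the whole block has moved. */
    my_hw_kick_device();
}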

So from the above, it is actually the CPU that initiates it first; it is just that the CPU (the driver) tells the device to start writing. But what does "the device starts writing" mean? Is the data transfer actually triggered by the device or by the DMAC?

For a book covering drivers and DMA, see this: ,ch15.13676

Third-party DMA was the original form of DMA and needs a separate DMA controller to manage the transfers. It was used on the old ISA bus and the IBM PC. What is commonly used now is the PCIe bus, which no longer needs a separate DMA controller: every device can be a bus master and can issue DMA read/write requests on its own.

DMA is closely tied to the bus protocol used by the host; different bus protocols implement DMA in different ways.

Note, however, that the description above corresponds to the third-party DMA case. A device can also initiate DMA on its own; this is called first-party DMA^, also known as bus mastering, and it is how PCI DMA is designed.

DMA 介绍 | 简 悦 — this article explains it well.

How is the CPU notified when the DMA completes?

How does full virtualization emulate DMA access?

For example, an unmodified driver in the guest programs an external virtual PCIe device (the guest does not know it is virtual) via MMIO to trigger a DMA request. How is this part virtualized?

The MMIO access traps out; then, on the device side, the data is copied directly into the memory region specified by the MMIO, and after the copy finishes the vCPU is notified as before (for example, by injecting an interrupt?).

Also, when the guest accesses a large block of I/O via DMA, the QEMU emulator does not put the result into the I/O shared page; instead, it writes the result directly into the guest's memory via memory mapping, and then tells the guest through the KVM module that the DMA operation has completed.

The corresponding code in QEMU. The basic idea:

  • Intercept MMIO writes to the DMA controller's region;
  • Decode the guest's request and emulate the DMA directly with memmove;
  • Return to the guest.
kvm_cpu_exec
    case KVM_EXIT_MMIO:
        address_space_rw

static const MemoryRegionOps fw_cfg_dma_mem_ops = {
    .read = fw_cfg_dma_mem_read,
    .write = fw_cfg_dma_mem_write,
    .endianness = DEVICE_BIG_ENDIAN,
    .valid.accepts = fw_cfg_dma_mem_valid,
    .valid.max_access_size = 8,
    .impl.max_access_size = 8,
};

// Port I/O case: fw_cfg_dma_mem_ops is used
fw_cfg_io_realize
    if (FW_CFG(s)->dma_enabled)
        memory_region_init_io(../&fw_cfg_dma_mem_ops, FW_CFG(s)../);

// MMIO case: fw_cfg_dma_mem_ops is also used
fw_cfg_mem_realize
    if (FW_CFG(s)->dma_enabled)
        memory_region_init_io(../&fw_cfg_dma_mem_ops, FW_CFG(s)../);

// Reading the DMA region is simple: just return the corresponding content
static uint64_t fw_cfg_dma_mem_read(void *opaque, hwaddr addr, unsigned size)
{
    /* Return a signature value (and handle various read sizes) */
    return extract64(FW_CFG_DMA_SIGNATURE, (8 - addr - size) * 8, size * 8);
}

// Take a write to the DMA region as the example (reads are just queries): writing this memory region triggers a DMA request
fw_cfg_dma_mem_write
    fw_cfg_dma_transfer
        // direction is memory-to-device, i.e. a read from the device's point of view
        dma_memory_read
        // direction is device-to-memory, i.e. a write from the device's point of view
        dma_memory_write
            dma_memory_rw
                dma_memory_rw_relaxed
                    // writes go to s->dma_as
                    address_space_rw
                        address_space_write
                            flatview_write
                                // here it is: the device memory and the DMA memory are copied with a plain memmove
                                memmove

Looking at it this way, with DMA the I/O performance under full virtualization should not be that bad, should it? Moving data via MMIO needs many VM exits, but DMA seems to need only one and already achieves bulk memory operations, so is it no worse than VirtIO? No, because VirtIO is zero-copy, while this fully virtualized DMA still needs a copy; see below:

Even if QEMU uses DMA to write data frames directly into guest memory and then notifies the guest, the overhead of copying the data is still unavoidable. Because this DMA is emulated with memcpy, copies are still needed: the data is first copied from the real device into the virtio device inside QEMU, and then memcpy'd from the virtio device into the memory the guest has shared.

Perhaps this is because some drivers only use MMIO and do not make full use of DMA, or because some drivers have no way to use DMA at all.

Is DMA safe? What if a malicious device writes to memory it should not? / DMA Attack

This is a security concern known as a DMA attack.

An IOMMU can mitigate this kind of attack.

Bus Mastering

Bus mastering is a form of DMA; in fact, bus mastering is first-party DMA^.

The PCI way of doing DMA is in fact bus mastering, i.e., first-party DMA:

  • Bus Mastering on PCI: The PCI bus architecture itself supports bus mastering capabilities. This means devices connected to the PCI bus can be granted permission (Bus Master Enable or BME) to directly access the system bus and memory for data transfers.
  • Device Initiated Transfers: In PCI DMA, the device itself initiates the DMA transfer process. It sends a request to the DMA controller, specifying details like source and destination addresses, transfer size, and utilizes the PCI bus for data movement.
  • DMA Controller Involvement: While the device takes initiative, the DMA controller is still involved. The controller manages arbitration on the bus (ensuring no conflicts), performs error checking, and handles some aspects of the transfer process.

Most modern bus architectures, such as PCI, allow multiple devices to bus master because it significantly improves performance for general-purpose operating systems.

While bus mastering theoretically allows one peripheral device to directly communicate with another, in practice almost all peripherals master the bus exclusively to perform DMA to main memory.
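In a Linux PCI driver this capability has to be enabled explicitly, which is what sets the Bus Master Enable (BME) bit mentioned above; a minimal probe() sketch:

#include <linux/pci.h>

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err = pci_enable_device(pdev);

    if (err)
        return err;

    /* Sets the Bus Master Enable bit in the PCI command register so that
     * this device may initiate DMA (first-party DMA / bus mastering). */
    pci_set_master(pdev);
    return 0;
}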

In other words, contending with the CPU for the bus used to be the DMAC's job; now, without a DMAC, the device itself can grab the bus and write data into memory. Why did the DMAC contend for the bus before, and when it won the bus, how did it guarantee the device was actually sending or receiving data during that window? The answer is that the DMAC has a handshake protocol with the device (US6701405B1 - DMA handshake protocol - Google Patents): it communicates with the device to make it send or receive data. With bus mastering, the device can occupy the bus and write data into memory by itself, with no DMAC contending for the bus on its behalf.

I think this is orthogonal to asynchronous DMA. Asynchronous DMA is not initiated by the CPU (e.g., via read()); instead, the data arrives first, the ISR is notified, and the DMA region is then set up for the transfer. The subsequent DMA transfer itself, however, can be done in either first-party or third-party form.

Some real-time operating systems prohibit peripherals from becoming bus masters, because the scheduler can no longer arbitrate for the bus and hence cannot provide deterministic latency.

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

First-party DMA & Third-party DMA & PCI DMA

Third-party DMA is the standard form of DMA; it uses a DMA controller.

First-party DMA is bus mastering, which is the approach PCI DMA takes.

Standard DMA, also called third-party DMA, uses a DMA controller. (First-party DMA may also involve a controller, but it is only responsible for arbitration.)

Bus mastering is the capability of devices on the PCI bus (other than the system chipset, of course) to take control of the bus and perform transfers directly. PCI supports full device bus mastering, and provides bus arbitration facilities through the system chipset.

In first-party DMA, the device drives its own DMA bus cycles using a channel from the system's DMA engine. The ddi_dmae_1stparty(9F) function is used to configure this channel in a cascade mode so that the DMA engine will not interfere with the transfer. Modern IDE/ATA hard disks use first-party DMA transfers. The term "first party" means that the peripheral device itself does the work of transferring data to and from memory, with no external DMAC involved. This is also called bus mastering, because when such transfers are occurring the device becomes the "master of the bus".

Direct Memory Access (DMA) Modes and Bus Mastering DMA

Third-party DMA utilizes a system DMA engine resident on the main system board, which has several DMA channels available for use by devices. The device relies on the system's DMA engine to perform the data transfers between the device and memory. The driver uses DMA engine routines (see ddi_dmae(9F)) to initialize and program the DMA engine. For each DMA data transfer, the driver programs the DMAC and then gives the device a command to initiate the transfer in cooperation with that DMAC.

For a deeper understanding, see Bus mastering^.

What is a DMA Engine

There does not really seem to be a single fixed thing called a "DMA Engine"; I think it depends on the context:

  • If it refers to hardware, it means the hardware DMA controller;
  • If it refers to software, it means the software DMA engine in the kernel; the kernel has a header file for it, drivers/dma/dmaengine.h.

linux - What is the difference between DMA-Engine and DMA-Controller? - Stack Overflow

What is a DMA channel

A DMA channel is a part of the DMA controller.

DMA channels are virtual pathways within a DMA controller that manage Direct Memory Access (DMA) transfers between devices and system memory.

A single system typically has multiple devices capable of performing DMA transfers. DMA channels provide a way to manage these concurrent requests and prevent conflicts.

Each DMA channel on a DMAC typically has its own set of registers for configuration and control.

Bus address

For example, if a PCI device has a BAR, the kernel reads the bus address (A) from the BAR and converts it to a CPU physical address (B). The address B is stored in a struct resource and usually exposed via /proc/iomem. When a driver claims a device, it typically uses ioremap() to map physical address B at a virtual address (C). It can then use, e.g., ioread32(C), to access the device registers at bus address A.
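A hedged sketch of that A (bus) → B (physical) → C (virtual) chain for a PCI device; the BAR index and register offset are arbitrary:

#include <linux/pci.h>
#include <linux/io.h>

static int my_map_bar0(struct pci_dev *pdev)
{
    /* B: CPU physical address derived from the bus address in BAR 0. */
    resource_size_t phys = pci_resource_start(pdev, 0);
    resource_size_t len  = pci_resource_len(pdev, 0);

    /* C: kernel virtual address mapped onto B. */
    void __iomem *regs = ioremap(phys, len);

    if (!regs)
        return -ENOMEM;

    /* Accesses through C end up at the device registers at bus address A. */
    pr_info("first register: 0x%x\n", ioread32(regs + 0x0));

    iounmap(regs);
    return 0;
}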

As a matter of fact, the situation is slightly more complicated than that. DMA-based hardware uses bus, rather than physical, addresses. Although ISA and PCI addresses are simply physical addresses on the PC, this is not true for every platform.

Aka. DMA address.

Most of the 64bit platforms have special hardware that translates bus addresses (DMA addresses) …

Kernel source: Documentation/DMA-mapping.txt

DMA Mapping

DMA mapping is the conversion of virtually addressed memory into memory that is DMA-able through physical addresses (actually bus addresses).
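In the Linux DMA API this shows up as two flavors of mapping; a hedged sketch, where dev is the struct device that actually performs the DMA:

#include <linux/dma-mapping.h>

static int my_mapping_examples(struct device *dev, void *buf, size_t len)
{
    dma_addr_t coherent_dma, streaming_dma;
    void *vaddr;

    /* Coherent mapping: a long-lived buffer the CPU and device both see
     * consistently; coherent_dma is the bus address to give the device. */
    vaddr = dma_alloc_coherent(dev, len, &coherent_dma, GFP_KERNEL);
    if (!vaddr)
        return -ENOMEM;

    /* Streaming mapping: converts an existing kernel buffer for the duration
     * of one transfer; again, the dma_addr_t is what the device gets. */
    streaming_dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, streaming_dma)) {
        dma_free_coherent(dev, len, vaddr, coherent_dma);
        return -ENOMEM;
    }

    /* ... transfers would be started and awaited here ... */

    dma_unmap_single(dev, streaming_dma, len, DMA_TO_DEVICE);
    dma_free_coherent(dev, len, vaddr, coherent_dma);
    return 0;
}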

Where is the DMA hardware?

What's the difference between the "driver" and "the device"?

For example, a GPU has its own computing unit and it may also want to access

https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt

On-chip DMA

DMA Controller

It is itself a kind of peripheral.

It can be thought of as a controller that connects internal and external memory to every DMA-capable peripheral through a set of dedicated buses.

What is an arbiter

What is a DMA channel (on such a controller)

Each channel corresponds to DMA requests from different peripherals.

Although each channel can accept requests from multiple peripherals, only one can be active at any given time; it cannot serve several at once.

DMA Direction

/**
 * enum dma_transfer_direction - dma transfer mode and direction indicator
 * @DMA_MEM_TO_MEM: Async/Memcpy mode
 * @DMA_MEM_TO_DEV: Slave mode & From Memory to Device
 * @DMA_DEV_TO_MEM: Slave mode & From Device to Memory
 * @DMA_DEV_TO_DEV: Slave mode & From Device to Device
 */
enum dma_transfer_direction {
	DMA_MEM_TO_MEM,
	DMA_MEM_TO_DEV,
	DMA_DEV_TO_MEM,
	DMA_DEV_TO_DEV,
	DMA_TRANS_NONE,
};
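As a hedged example of where these directions are used, queuing a single device-to-memory (slave) transfer through the dmaengine API might look roughly like this; channel request and slave configuration are omitted:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

static int my_queue_rx(struct dma_chan *chan, dma_addr_t dst, size_t len,
                       dma_async_tx_callback done_cb, void *cb_param)
{
    struct dma_async_tx_descriptor *desc;
    dma_cookie_t cookie;

    desc = dmaengine_prep_slave_single(chan, dst, len,
                                       DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
    if (!desc)
        return -EIO;

    desc->callback = done_cb;          /* invoked when the transfer completes */
    desc->callback_param = cb_param;

    cookie = dmaengine_submit(desc);
    if (dma_submit_error(cookie))
        return -EIO;

    dma_async_issue_pending(chan);     /* actually start the queued transfer */
    return 0;
}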