I/O devices use neither virtual addresses nor physical addresses; they use a third kind of address: a "bus address".

Data transfer can be triggered in two ways:

  • Either the software asks for data (via a function such as read())
  • or the hardware asynchronously pushes data to the system.

A more concrete flow with DMA (third-party DMA) looks like this (a driver-side sketch follows the list):

  • Userspace issues a read() system call;
  • The kernel driver sets up the DMAC channel's registers (Address, Length, DMA Direction^);
  • The DMAC cooperates with the disk and writes the disk data into the memory region; the CPU is not involved in this at all (can the CPU be scheduled to other threads to do other work in the meantime, since this call is blocked? Yes, it can);
  • The DMAC signals the CPU that the data transfer is complete; the blocked process returns, and the CPU copies the data from the kernel buffer to the user buffer;
  • The user process switches from kernel mode back to user mode and is unblocked.
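Below is a minimal driver-side sketch of that flow. It assumes a hypothetical per-device structure and a hypothetical my_hw_start_dma() helper that programs the DMAC channel; it is illustrative only, not taken from a real driver:

#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/dma-mapping.h>
#include <linux/uaccess.h>

/* Hypothetical per-device state; dev->done is init_completion()'d at probe time. */
struct my_dev {
    void *kbuf;               /* kernel bounce buffer that the DMAC fills */
    dma_addr_t kbuf_dma;      /* bus address of that buffer */
    struct completion done;   /* completed by the DMA-done interrupt */
};

/* Hypothetical helper: programs the DMAC channel (address, length, direction). */
void my_hw_start_dma(struct my_dev *dev, dma_addr_t addr, size_t len);

static ssize_t my_read(struct file *filp, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
    struct my_dev *dev = filp->private_data;

    reinit_completion(&dev->done);
    my_hw_start_dma(dev, dev->kbuf_dma, count);      /* step 2: set up the DMAC */

    /* Step 3: this process sleeps; the CPU is free to run other threads. */
    if (wait_for_completion_interruptible(&dev->done))
        return -ERESTARTSYS;

    /* Step 4: the CPU copies the data from the kernel buffer to the user buffer. */
    if (copy_to_user(ubuf, dev->kbuf, count))
        return -EFAULT;
    return count;                                     /* step 5: back to userspace */
}

/* The DMAC's "transfer finished" interrupt ends up here and unblocks the reader. */
static irqreturn_t my_dma_irq(int irq, void *data)
{
    struct my_dev *dev = data;

    complete(&dev->done);
    return IRQ_HANDLED;
}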

DMA programming in the kernel

Different drivers use this common DMA framework.

kernel.org/doc/Documentation/DMA-API.txt
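As a hedged illustration of that common API (see DMA-API.txt above), a streaming mapping for a single device-to-memory transfer might look like this; the my_hw_*() helpers are hypothetical stand-ins for the device-specific parts:

#include <linux/dma-mapping.h>

/* Hypothetical device-specific helpers. */
void my_hw_program_and_start(dma_addr_t dma, size_t len);
void my_hw_wait_until_done(void);

/* Map an existing kernel buffer, hand its DMA (bus) address to the device,
 * and unmap it again once the transfer has finished. */
static int my_rx_one_buffer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    if (dma_mapping_error(dev, dma))
        return -ENOMEM;

    my_hw_program_and_start(dma, len);
    my_hw_wait_until_done();

    /* Only after unmapping may the CPU safely look at the buffer contents. */
    dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
    return 0;
}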

A simple PCI DMA example

The actual form of DMA operations on the PCI bus is very dependent on the device being driven. Thus, this example does not apply to any real device; instead, it is part of a hypothetical driver called dad (DMA Acquisition Device). A driver for this device might define a transfer function like this. The following is the code in the driver:

int dad_transfer(struct dad_dev *dev, bool write, void *buffer,  size_t count)
{
    dma_addr_t bus_addr;
    unsigned long flags;

    // Map the buffer for DMA
    bus_addr = pci_map_single(dev->pci_dev, buffer, count, dev->dma_dir);
    //...

    // Set up the device
    // Set the device's command and some of its registers, such as addr and len
    writeb(dev->registers.command, DAD_CMD_DISABLEDMA);
    writeb(dev->registers.command, write ? DAD_CMD_WR : DAD_CMD_RD);
    writel(dev->registers.addr, cpu_to_le32(bus_addr));
    writel(dev->registers.len, cpu_to_le32(count));

    // Tell the device to start the DMA by writing its command register
    writeb(dev->registers.command, DAD_CMD_ENABLEDMA);
    return 0;
}

This function maps the buffer to be transferred and starts the device operation. The other half of the job must be done in the interrupt service routine, which would look something like this. This is the code in the interrupt handler:

void dad_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct dad_dev *dev = (struct dad_dev *) dev_id;

    /* Make sure it's really our device interrupting */

    /* Unmap the DMA buffer */
    pci_unmap_single(dev->pci_dev, dev->dma_addr, dev->dma_size, dev->dma_dir);

    /* Only now is it safe to access the buffer, copy to user, etc. */
    ...
}

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Requesting a DMA channel in a driver:

request_dma

The channel argument is a number between 0 and 7 or, more precisely, a positive number less than MAX_DMA_CHANNELS. On the PC, MAX_DMA_CHANNELS is defined as 8, to match the hardware. The name argument is a string identifying the device. The specified name appears in the file /proc/dma, which can be read by user programs.
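A hedged sketch of claiming and releasing such a channel in a driver's open/release methods (the channel number below is arbitrary; real drivers usually get it from configuration):

#include <linux/fs.h>
#include <asm/dma.h>        /* request_dma(), free_dma() */

#define MY_DMA_CHANNEL 3    /* example only: must be < MAX_DMA_CHANNELS */

static int my_open(struct inode *inode, struct file *filp)
{
    /* The string "my_device" appears in /proc/dma while the channel is held. */
    int err = request_dma(MY_DMA_CHANNEL, "my_device");

    if (err)
        return err;
    return 0;
}

static int my_release(struct inode *inode, struct file *filp)
{
    free_dma(MY_DMA_CHANNEL);
    return 0;
}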

Your sound card and your analog I/O interface can share the DMA channel as long as they are not used at the same time.

Another example of driver code that uses DMA to perform a MEM-TO-MEM transfer

#include <linux/module.h>
#include <linux/init.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Meta Information */
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Johannes 4 GNU/Linux");
MODULE_DESCRIPTION("A simple DMA example for copying data from RAM to RAM");

void my_dma_transfer_completed(void *param) 
{
	struct completion *cmp = (struct completion *) param;
	complete(cmp);
}

/**
 * @brief This function is called, when the module is loaded into the kernel
 */
static int __init my_init(void) {
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *chan_desc;
	dma_cookie_t cookie;
	dma_addr_t src_addr, dst_addr;
	u8 *src_buf, *dst_buf;
	struct completion cmp;
	int status;

	printk("my_dma_memcpy - Init\n");

	dma_cap_zero(mask);
	dma_cap_set(DMA_SLAVE | DMA_PRIVATE, mask);
	chan = dma_request_channel(mask, NULL, NULL);
	if(!chan) {
		printk("my_dma_memcpy - Error requesting dma channel\n");
		return -ENODEV;
	}

	src_buf = dma_alloc_coherent(chan->device->dev, 1024, &src_addr, GFP_KERNEL);
	dst_buf = dma_alloc_coherent(chan->device->dev, 1024, &dst_addr, GFP_KERNEL);

	memset(src_buf, 0x12, 1024);
	memset(dst_buf, 0x0, 1024);

	printk("my_dma_memcpy - Before DMA Transfer: src_buf[0] = %x\n", src_buf[0]);
	printk("my_dma_memcpy - Before DMA Transfer: dst_buf[0] = %x\n", dst_buf[0]);

	chan_desc = dmaengine_prep_dma_memcpy(chan, dst_addr, src_addr, 1024, DMA_MEM_TO_MEM);
	if(!chan_desc) {
		printk("my_dma_memcpy - Error requesting dma channel\n");
		status = -1;
		goto free;
	}

	init_completion(&cmp);

	chan_desc->callback = my_dma_transfer_completed;
	chan_desc->callback_param = &cmp;

	cookie = dmaengine_submit(chan_desc);

	/* Fire the DMA transfer */
	dma_async_issue_pending(chan);

	if(wait_for_completion_timeout(&cmp, msecs_to_jiffies(3000)) <= 0) {
		printk("my_dma_memcpy - Timeout!\n");
		status = -1;
	}

	status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
	if(status == DMA_COMPLETE) {
		printk("my_dma_memcpy - DMA transfer has completed!\n");
		status = 0;
		printk("my_dma_memcpy - After DMA Transfer: src_buf[0] = %x\n", src_buf[0]);
		printk("my_dma_memcpy - After DMA Transfer: dst_buf[0] = %x\n", dst_buf[0]);
	} else {
		printk("my_dma_memcpy - Error on DMA transfer\n");
	}

	dmaengine_terminate_all(chan);
free:
	dma_free_coherent(chan->device->dev, 1024, src_buf, src_addr);
	dma_free_coherent(chan->device->dev, 1024, dst_buf, dst_addr);

	dma_release_channel(chan);
	return 0;
}

/**
 * @brief This function is called, when the module is removed from the kernel
 */
static void __exit my_exit(void) {
	printk("Goodbye, Kernel\n");
}

module_init(my_init);
module_exit(my_exit);

Linux_Driver_Tutorial/30_dma_memcpy at main · Johannes4Linux/Linux_Driver_Tutorial

Let's code a Linux Driver - 30 DMA (Direct Memory Access) Memcopy - YouTube

Asynchronous DMA

This happens, for example, with data acquisition devices that go on pushing data even if nobody is reading them. In this case, the driver should maintain a buffer so that a subsequent read call will return all the accumulated data to user space. The steps involved in this kind of transfer are slightly different:

  • The hardware raises an interrupt to announce that new data has arrived.
  • The interrupt handler allocates a buffer and tells the hardware where to transfer its data.
  • The peripheral device writes the data to the buffer and raises another interrupt when it’s done.
  • The handler dispatches the new data, wakes any relevant process, and takes care of housekeeping.

A variant of the asynchronous approach is often seen with network cards. These cards often expect to see a circular buffer (often called a DMA ring buffer) established in memory shared with the processor; each incoming packet is placed in the next available buffer in the ring, and an interrupt is signaled. The driver then passes the network packets to the rest of the kernel, and places a new DMA buffer in the ring.
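A heavily simplified sketch of that DMA ring-buffer scheme follows. All my_* names are hypothetical, and the error handling and NAPI machinery of real network drivers are omitted:

#include <linux/interrupt.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

#define RING_SIZE 16
#define BUF_SIZE  2048

/* One slot of a hypothetical RX DMA ring. */
struct rx_slot {
    void       *buf;    /* CPU address of the packet buffer */
    dma_addr_t  dma;    /* bus address written into the NIC's descriptor */
};

static struct rx_slot ring[RING_SIZE];
static unsigned int next_rx;            /* index the hardware fills next */

/* Hypothetical hardware helpers. */
bool my_hw_slot_is_full(unsigned int idx);
void my_hw_refill_slot(unsigned int idx, dma_addr_t dma);
void my_pass_packet_up(void *buf);

static irqreturn_t my_nic_rx_irq(int irq, void *data)
{
    struct device *dev = data;

    while (my_hw_slot_is_full(next_rx)) {
        struct rx_slot *slot = &ring[next_rx];

        /* Hand the filled buffer to the rest of the kernel... */
        dma_unmap_single(dev, slot->dma, BUF_SIZE, DMA_FROM_DEVICE);
        my_pass_packet_up(slot->buf);

        /* ...and place a fresh DMA buffer back into the ring. */
        slot->buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
        slot->dma = dma_map_single(dev, slot->buf, BUF_SIZE, DMA_FROM_DEVICE);
        my_hw_refill_slot(next_rx, slot->dma);

        next_rx = (next_rx + 1) % RING_SIZE;
    }
    return IRQ_HANDLED;
}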

Is asynchronous DMA simply the use case for first-party DMA?

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Can the CPU keep running while waiting for a DMA transfer to complete?

The answer is yes. There is a discussion of this question here: microprocessor - Does a CPU completely freeze when using a DMA? - Electrical Engineering Stack Exchange

  • "Operations like this do not delay the processor, but can be rescheduled to handle other tasks." (DMA (Direct Memory Access) Wiki - FPGAkey) The reason for rescheduling is that read() is generally synchronous: the user expects the next line of code to run only after the read completes, so the current process cannot continue and another process needs to be scheduled.

Note that the CPU can do other work while the data transfer is in progress: 在DMA控制传输的同时cpu真的还可以运行其他程序吗? - 其他 - 恩智浦技术社区. I think this is because the DMA acquires bus control by cycle stealing, so during the DMA the CPU still gets chances to master the bus and do its own work.

It is also mentioned here that the process is put to sleep: Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

Is DMA initiated by the CPU or by the device? / first-party DMA / third-party DMA

First, there are two kinds of DMA:

  • Third-party DMA: the older scheme, and the only one that uses a DMAC. Here, to perform a DMA operation the CPU must first program the DMAC's registers.
  • First-party DMA: on the PCI/PCIe bus, every device can act as a bus master, so there is no longer a separate DMAC managing DMA as in the third-party scheme. When a device needs to access memory, it only needs to obtain the bus address corresponding to the memory address, request ownership of the bus, and issue read/write requests to the target address. To perform DMA, the device driver first allocates some memory and obtains the bus address of that memory, then sends a DMA descriptor containing that bus address to the device (via MMIO or PMIO); the PCIe device can then create PCIe Read/Write Transactions and communicate with the corresponding bus address over PCIe.

Either way, the CPU (the device driver) has to initiate the transfer; the difference is that one programs a DMAC, while the other programs the PCIe device directly via MMIO.
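A hedged sketch of that first-party (MMIO-programmed) flow for a hypothetical PCIe device; the register offsets inside BAR0 are invented:

#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

/* Invented register layout of a hypothetical DMA-capable PCIe device. */
#define REG_DMA_ADDR_LO 0x00
#define REG_DMA_ADDR_HI 0x04
#define REG_DMA_LEN     0x08
#define REG_DMA_START   0x0c

static int my_start_first_party_dma(struct pci_dev *pdev,
                                    void __iomem *bar0, size_t len)
{
    dma_addr_t bus_addr;
    void *buf;

    /* The driver allocates DMA-able memory and gets its bus address. */
    buf = dma_alloc_coherent(&pdev->dev, len, &bus_addr, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* Describe the transfer to the device via MMIO: bus address and length. */
    iowrite32(lower_32_bits(bus_addr), bar0 + REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(bus_addr), bar0 + REG_DMA_ADDR_HI);
    iowrite32(len, bar0 + REG_DMA_LEN);

    /* From here the device itself masters the bus and issues the
     * PCIe read/write transactions targeting bus_addr. */
    iowrite32(1, bar0 + REG_DMA_START);
    return 0;
}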

There are two claims. One says that DMA is generally initiated by the device: DMA(直接内存访问)实现数据传输的流程详解-百度开发者中心

But Wikipedia says the CPU initiates it first: "With DMA, the CPU first initiates the transfer, then it does other operations while the transfer is in progress."

Can a DMA device skip being programmed in advance by the driver and directly interrupt the CPU's current execution flow to do DMA?

The CPU initializes the DMA controller with:

  • A count of the number of words to transfer, and
  • The memory address to use.

The CPU then commands the peripheral device to initiate a data transfer. The DMA controller then provides addresses and read/write control lines to the system memory. Each time a byte of data is ready to be transferred between the peripheral device and memory, the DMAC increments its internal address register until the full block of data is transferred. This is how the DMAC knows when it can notify the CPU.
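On a PC-style (third-party) DMAC, those two pieces of information are programmed roughly as below, using the kernel's ISA DMA helpers; the device-kick function at the end is hypothetical:

#include <asm/dma.h>

void my_hw_kick_device(void);    /* hypothetical: tells the peripheral to start */

static void my_program_dmac(unsigned int chan, dma_addr_t bus_addr,
                            unsigned int count, int to_memory)
{
    unsigned long flags = claim_dma_lock();

    disable_dma(chan);
    clear_dma_ff(chan);                 /* reset the DMAC's address flip-flop */
    set_dma_mode(chan, to_memory ? DMA_MODE_READ : DMA_MODE_WRITE);
    set_dma_addr(chan, bus_addr);       /* the memory address to use */
    set_dma_count(chan, count);         /* how many bytes to transfer */
    enable_dma(chan);

    release_dma_lock(flags);

    /* The peripheral is then told to start; the DMAC increments its internal
     * address register for each byte until the whole block has moved. */
    my_hw_kick_device();
}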

So from the above, it is actually the CPU that initiates it first; it is just that the CPU (the driver) tells the device to start writing. But what does "the device starts writing" mean? Is the data transfer actually triggered by the device or by the DMAC?

For a book covering drivers and DMA, see this: ,ch15.13676

Third-party DMA was the original form of DMA and needs a separate DMA controller to manage the transfers. It was used on the old ISA bus and the IBM PC. What is commonly used now is the PCIe bus, which no longer needs a separate DMA controller: every device can be a bus master and can issue DMA read/write requests on its own.

DMA is closely tied to the bus protocol used by the host; different bus protocols implement DMA in different ways.

Note, however, that the description above corresponds to the third-party DMA case. A device can also initiate DMA on its own; this is called first-party DMA^, also known as bus mastering, and it is how PCI DMA is designed.

DMA 介绍 | 简 悦 — this article explains it well.

How is the CPU notified when the DMA completes?

How does full virtualization emulate DMA access?

For example, an unmodified driver in the guest programs an external virtual PCIe device (the guest does not know it is virtual) via MMIO to trigger a DMA request. How is this part virtualized?

The MMIO access traps out; then, on the device side, the data is copied directly into the memory region specified by the MMIO, and after the copy finishes the vCPU is notified as before (for example, by injecting an interrupt?).

Also, when the guest accesses a large block of I/O via DMA, the QEMU emulator does not put the result into the I/O shared page; instead, it writes the result directly into the guest's memory via memory mapping, and then tells the guest through the KVM module that the DMA operation has completed.

The corresponding code in QEMU. The basic idea:

  • Intercept MMIO writes to the DMA controller's region;
  • Decode the guest's request and emulate the DMA directly with memmove;
  • Return to the guest.
kvm_cpu_exec
    case KVM_EXIT_MMIO:
        address_space_rw

static const MemoryRegionOps fw_cfg_dma_mem_ops = {
    .read = fw_cfg_dma_mem_read,
    .write = fw_cfg_dma_mem_write,
    .endianness = DEVICE_BIG_ENDIAN,
    .valid.accepts = fw_cfg_dma_mem_valid,
    .valid.max_access_size = 8,
    .impl.max_access_size = 8,
};

// Port I/O case: fw_cfg_dma_mem_ops is used
fw_cfg_io_realize
    if (FW_CFG(s)->dma_enabled)
        memory_region_init_io(../&fw_cfg_dma_mem_ops, FW_CFG(s)../);

// MMIO case: fw_cfg_dma_mem_ops is also used
fw_cfg_mem_realize
    if (FW_CFG(s)->dma_enabled)
        memory_region_init_io(../&fw_cfg_dma_mem_ops, FW_CFG(s)../);

// Reading the DMA region is simple: just return the corresponding content
static uint64_t fw_cfg_dma_mem_read(void *opaque, hwaddr addr, unsigned size)
{
    /* Return a signature value (and handle various read sizes) */
    return extract64(FW_CFG_DMA_SIGNATURE, (8 - addr - size) * 8, size * 8);
}

// Take a write to the DMA region as the example (reads are just queries): writing this memory region triggers a DMA request
fw_cfg_dma_mem_write
    fw_cfg_dma_transfer
        // direction is memory-to-device, i.e. a read from the device's point of view
        dma_memory_read
        // direction is device-to-memory, i.e. a write from the device's point of view
        dma_memory_write
            dma_memory_rw
                dma_memory_rw_relaxed
                    // writes go to s->dma_as
                    address_space_rw
                        address_space_write
                            flatview_write
                                // here it is: the device memory and the DMA memory are copied with a plain memmove
                                memmove

Looking at it this way, with DMA the I/O performance under full virtualization should not be that bad, should it? Moving data via MMIO needs many VM exits, but DMA seems to need only one and already achieves bulk memory operations, so is it no worse than VirtIO? No, because VirtIO is zero-copy, while this fully virtualized DMA still needs a copy; see below:

Even if QEMU uses DMA to write data frames directly into guest memory and then notifies the guest, the overhead of copying the data is still unavoidable. Because this DMA is emulated with memcpy, copies are still needed: the data is first copied from the real device into the virtio device inside QEMU, and then memcpy'd from the virtio device into the memory the guest has shared.

Perhaps this is because some drivers only use MMIO and do not make full use of DMA, or because some drivers have no way to use DMA at all.

Is DMA safe? What if a malicious device writes to memory it should not? / DMA Attack

This is a security concern known as a DMA attack.

An IOMMU can mitigate this kind of attack.

Bus Mastering

Bus mastering is a form of DMA; in fact, bus mastering is first-party DMA^.

The PCI way of doing DMA is in fact bus mastering, i.e., first-party DMA:

  • Bus Mastering on PCI: The PCI bus architecture itself supports bus mastering capabilities. This means devices connected to the PCI bus can be granted permission (Bus Master Enable or BME) to directly access the system bus and memory for data transfers.
  • Device Initiated Transfers: In PCI DMA, the device itself initiates the DMA transfer process. It sends a request to the DMA controller, specifying details like source and destination addresses, transfer size, and utilizes the PCI bus for data movement.
  • DMA Controller Involvement: While the device takes initiative, the DMA controller is still involved. The controller manages arbitration on the bus (ensuring no conflicts), performs error checking, and handles some aspects of the transfer process.

Most modern bus architectures, such as PCI, allow multiple devices to bus master because it significantly improves performance for general-purpose operating systems.

While bus mastering theoretically allows one peripheral device to directly communicate with another, in practice almost all peripherals master the bus exclusively to perform DMA to main memory.
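In a Linux PCI driver this capability has to be enabled explicitly, which is what sets the Bus Master Enable (BME) bit mentioned above; a minimal probe() sketch:

#include <linux/pci.h>

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err = pci_enable_device(pdev);

    if (err)
        return err;

    /* Sets the Bus Master Enable bit in the PCI command register so that
     * this device may initiate DMA (first-party DMA / bus mastering). */
    pci_set_master(pdev);
    return 0;
}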

In other words, contending with the CPU for the bus used to be the DMAC's job; now, without a DMAC, the device itself can grab the bus and write data into memory. Why did the DMAC contend for the bus before, and when it won the bus, how did it guarantee the device was actually sending or receiving data during that window? The answer is that the DMAC has a handshake protocol with the device (US6701405B1 - DMA handshake protocol - Google Patents): it communicates with the device to make it send or receive data. With bus mastering, the device can occupy the bus and write data into memory by itself, with no DMAC contending for the bus on its behalf.

I think this is orthogonal to asynchronous DMA. Asynchronous DMA is not initiated by the CPU (e.g., via read()); instead, the data arrives first, the ISR is notified, and the DMA region is then set up for the transfer. The subsequent DMA transfer itself, however, can be done in either first-party or third-party form.

Some real-time operating systems prohibit peripherals from becoming bus masters, because the scheduler can no longer arbitrate for the bus and hence cannot provide deterministic latency.

Direct Memory Access and Bus Mastering - Linux Device Drivers, Second Edition [Book]

First-party DMA & Third-party DMA & PCI DMA

Third-party DMA is the standard form of DMA; it uses a DMA controller.

First-party DMA is bus mastering, which is the approach PCI DMA takes.

Standard DMA, also called third-party DMA, uses a DMA controller. (First-party DMA may also involve a controller, but it is only responsible for arbitration.)

Bus mastering is the capability of devices on the PCI bus (other than the system chipset, of course) to take control of the bus and perform transfers directly. PCI supports full device bus mastering, and provides bus arbitration facilities through the system chipset.

In first-party DMA, the device drives its own DMA bus cycles using a channel from the system's DMA engine. The ddi_dmae_1stparty(9F) function is used to configure this channel in a cascade mode so that the DMA engine will not interfere with the transfer. Modern IDE/ATA hard disks use first-party DMA transfers. The term "first party" means that the peripheral device itself does the work of transferring data to and from memory, with no external DMAC involved. This is also called bus mastering, because when such transfers are occurring the device becomes the "master of the bus".

Direct Memory Access (DMA) Modes and Bus Mastering DMA

Third-party DMA utilizes a system DMA engine resident on the main system board, which has several DMA channels available for use by devices. The device relies on the system's DMA engine to perform the data transfers between the device and memory. The driver uses DMA engine routines (see ddi_dmae(9F)) to initialize and program the DMA engine. For each DMA data transfer, the driver programs the DMAC and then gives the device a command to initiate the transfer in cooperation with that DMAC.

For a deeper understanding, see Bus mastering^.

What is a DMA Engine

There does not really seem to be a single fixed thing called a "DMA Engine"; I think it depends on the context:

  • If it refers to hardware, it means the hardware DMA controller;
  • If it refers to software, it means the software DMA engine in the kernel; the kernel has a header file for it, drivers/dma/dmaengine.h.

linux - What is the difference between DMA-Engine and DMA-Controller? - Stack Overflow

What is a DMA channel

A DMA channel is a part of the DMA controller.

DMA channels are virtual pathways within a DMA controller that manage Direct Memory Access (DMA) transfers between devices and system memory.

A single system typically has multiple devices capable of performing DMA transfers. DMA channels provide a way to manage these concurrent requests and prevent conflicts.

Each DMA channel on a DMAC typically has its own set of registers for configuration and control.

Bus address

For example, if a PCI device has a BAR, the kernel reads the bus address (A) from the BAR and converts it to a CPU physical address (B). The address B is stored in a struct resource and usually exposed via /proc/iomem. When a driver claims a device, it typically uses ioremap() to map physical address B at a virtual address (C). It can then use, e.g., ioread32(C), to access the device registers at bus address A.
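A hedged sketch of that A (bus) → B (physical) → C (virtual) chain for a PCI device; the BAR index and register offset are arbitrary:

#include <linux/pci.h>
#include <linux/io.h>

static int my_map_bar0(struct pci_dev *pdev)
{
    /* B: CPU physical address derived from the bus address in BAR 0. */
    resource_size_t phys = pci_resource_start(pdev, 0);
    resource_size_t len  = pci_resource_len(pdev, 0);

    /* C: kernel virtual address mapped onto B. */
    void __iomem *regs = ioremap(phys, len);

    if (!regs)
        return -ENOMEM;

    /* Accesses through C end up at the device registers at bus address A. */
    pr_info("first register: 0x%x\n", ioread32(regs + 0x0));

    iounmap(regs);
    return 0;
}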

As a matter of fact, the situation is slightly more complicated than that. DMA-based hardware uses bus, rather than physical, addresses. Although ISA and PCI addresses are simply physical addresses on the PC, this is not true for every platform.

Aka. DMA address.

Most of the 64bit platforms have special hardware that translates bus addresses (DMA addresses) …

Kernel source: Documentation/DMA-mapping.txt

DMA Mapping

DMA mapping is the conversion of virtually addressed memory into memory that is DMA-able through physical addresses (actually bus addresses).
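In the Linux DMA API this shows up as two flavors of mapping; a hedged sketch, where dev is the struct device that actually performs the DMA:

#include <linux/dma-mapping.h>

static int my_mapping_examples(struct device *dev, void *buf, size_t len)
{
    dma_addr_t coherent_dma, streaming_dma;
    void *vaddr;

    /* Coherent mapping: a long-lived buffer the CPU and device both see
     * consistently; coherent_dma is the bus address to give the device. */
    vaddr = dma_alloc_coherent(dev, len, &coherent_dma, GFP_KERNEL);
    if (!vaddr)
        return -ENOMEM;

    /* Streaming mapping: converts an existing kernel buffer for the duration
     * of one transfer; again, the dma_addr_t is what the device gets. */
    streaming_dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, streaming_dma)) {
        dma_free_coherent(dev, len, vaddr, coherent_dma);
        return -ENOMEM;
    }

    /* ... transfers would be started and awaited here ... */

    dma_unmap_single(dev, streaming_dma, len, DMA_TO_DEVICE);
    dma_free_coherent(dev, len, vaddr, coherent_dma);
    return 0;
}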

Where is the DMA hardware?

What's the difference between the "driver" and "the device"?

For example, a GPU has its own computing unit and it may also want to access

https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt

On-chip DMA

DMA Controller

It is itself a kind of peripheral.

It can be thought of as a controller that connects internal and external memory to every DMA-capable peripheral through a set of dedicated buses.

What is an arbiter

What is a DMA channel (on such a controller)

Each channel corresponds to DMA requests from different peripherals.

Although each channel can accept requests from multiple peripherals, only one can be active at any given time; it cannot serve several at once.

DMA Direction

/**
 * enum dma_transfer_direction - dma transfer mode and direction indicator
 * @DMA_MEM_TO_MEM: Async/Memcpy mode
 * @DMA_MEM_TO_DEV: Slave mode & From Memory to Device
 * @DMA_DEV_TO_MEM: Slave mode & From Device to Memory
 * @DMA_DEV_TO_DEV: Slave mode & From Device to Device
 */
enum dma_transfer_direction {
	DMA_MEM_TO_MEM,
	DMA_MEM_TO_DEV,
	DMA_DEV_TO_MEM,
	DMA_DEV_TO_DEV,
	DMA_TRANS_NONE,
};
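As a hedged example of where these directions are used, queuing a single device-to-memory (slave) transfer through the dmaengine API might look roughly like this; channel request and slave configuration are omitted:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

static int my_queue_rx(struct dma_chan *chan, dma_addr_t dst, size_t len,
                       dma_async_tx_callback done_cb, void *cb_param)
{
    struct dma_async_tx_descriptor *desc;
    dma_cookie_t cookie;

    desc = dmaengine_prep_slave_single(chan, dst, len,
                                       DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
    if (!desc)
        return -EIO;

    desc->callback = done_cb;          /* invoked when the transfer completes */
    desc->callback_param = cb_param;

    cookie = dmaengine_submit(desc);
    if (dma_submit_error(cookie))
        return -EIO;

    dma_async_issue_pending(chan);     /* actually start the queued transfer */
    return 0;
}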