Actual memory used (not reserved) for a process

pmap -x <process pid>

user_mode() Kernel

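cs is the code segment register. If regs->cs & 3 is nonzero, the privilege level encoded in the low two bits of the saved cs is 1, 2, or 3, i.e. anything but 0, so the code segment is not a kernel one.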

 // user_mode(regs) determines whether a register set came from user
 // mode...
static __always_inline int user_mode(struct pt_regs *regs)
{
    //...
	return !!(regs->cs & 3);
}
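
The check above can be tried in user space. The following is a minimal sketch, not kernel code: selector_is_user() is a hypothetical helper, and the selector values used in the usage note (0x33 for user code, 0x10 for kernel code) are the usual x86-64 Linux values, taken here as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level. */
#define SELECTOR_RPL_MASK 0x3u

/*
 * Hypothetical helper mirroring the test in user_mode():
 * any ring other than 0 means "not kernel".
 */
static bool selector_is_user(uint16_t cs)
{
    return (cs & SELECTOR_RPL_MASK) != 0;
}
```

With these assumed values, selector_is_user(0x33) is true (ring 3) and selector_is_user(0x10) is false (ring 0).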

do_trap() Kernel

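do_trap() saves the hardware error code and the exception number into the current process descriptor, then sends the corresponding signal to the process.

The current process receives the signal. If the process was in user mode, the signal is handed to the process's own signal handler (if one exists); if it was in kernel mode, the kernel usually kills the process.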

static void do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
	long error_code, int sicode, void __user *addr)
{
	struct task_struct *tsk = current;

    // No signal needs to be sent
	if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
		return;

	show_signal(tsk, signr, "trap ", str, regs, error_code);

	if (!sicode)
		force_sig(signr);
	else
		force_sig_fault(signr, sicode, addr);
}

qemu_memfd_create() QEMU

It looks like a thin wrapper around the memfd_create() syscall.

int qemu_memfd_create(const char *name, size_t size, bool hugetlb,
                      uint64_t hugetlbsize, unsigned int seals, Error **errp)
{
    int htsize = hugetlbsize ? ctz64(hugetlbsize) : 0;
    //...
    htsize = htsize << MFD_HUGE_SHIFT;

    //...
    int mfd = -1;
    unsigned int flags = MFD_CLOEXEC;

    if (seals) {
        flags |= MFD_ALLOW_SEALING;
    }
    if (hugetlb) {
        flags |= MFD_HUGETLB;
        flags |= htsize;
    }
    // This is a syscall
    mfd = memfd_create(name, flags);
    // error handling...

    if (ftruncate(mfd, size) == -1) {
        error_setg_errno(errp, errno, "failed to resize memfd to %zu", size);
        goto err;
    }

    if (seals && fcntl(mfd, F_ADD_SEALS, seals) == -1) {
        error_setg_errno(errp, errno, "failed to add seals 0x%x", seals);
        goto err;
    }

    return mfd;
    //...
}
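
The same create / resize / seal sequence can be reproduced outside QEMU. This is a minimal sketch assuming Linux with glibc 2.27 or newer; create_fixed_size_memfd() is a hypothetical helper, not a QEMU function, and the chosen seals are only an illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): create an anonymous in-memory file
 * of a fixed size. Returns the fd on success, -1 on failure.
 */
static int create_fixed_size_memfd(const char *name, size_t size)
{
    assert(name != NULL);

    /* MFD_ALLOW_SEALING must be set at creation, or seals cannot be added. */
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd == -1)
        return -1;

    /* A new memfd has size 0; give it its real size. */
    if (ftruncate(fd, (off_t)size) == -1)
        goto err;

    /* Forbid any later resize, similar in spirit to the seals argument. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) == -1)
        goto err;

    return fd;

err:
    close(fd);
    return -1;
}
```

Once the seals are in place, a later ftruncate() on the fd is expected to fail, which is what makes the size trustworthy for whoever receives the fd.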

QEMU guest memory layout

info mtree

object_get_canonical_path_component() QEMU

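This function gives obj its name.

It walks all properties of the parent object; if a child property's opaque pointer is obj itself, it returns that property's name. The reasoning, as far as I can tell, is that each child is attached to its parent through a child property, so the property name doubles as the child's name.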

const char *object_get_canonical_path_component(const Object *obj)
{
    ObjectProperty *prop = NULL;
    GHashTableIter iter;

    if (obj->parent == NULL) {
        return NULL;
    }

    g_hash_table_iter_init(&iter, obj->parent->properties);
    while (g_hash_table_iter_next(&iter, NULL, (gpointer *)&prop)) {
        if (!object_property_is_child(prop)) {
            continue;
        }

        if (prop->opaque == obj) {
            return prop->name;
        }
    }
    // Not-found case...
}

SMRAM

The firmware preserves the state of the CPU in a region of RAM designated as "SMRAM".

Short for System Management RAM, SMRAM is a portion of the system's memory used by the processor to store code used with SMM.

mmap_reserve() / mmap_activate() QEMU

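mmap_reserve() comes first. On x86 the fd is effectively ignored, so this is the most classic use of mmap: obtaining memory.

So what the function does is simple: reserve size bytes of address space (PROT_NONE, not yet usable) and return a pointer to it.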

//...
static void *mmap_reserve(size_t size, int fd)
{
    // We do not share this mapping with other processes
    int flags = MAP_PRIVATE;

    //...ppc related
    // On x86 the fd is dropped and the reservation is anonymous
    fd = -1;
    flags |= MAP_ANONYMOUS;
    return mmap(0, size, PROT_NONE, flags, fd, 0);
}

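First, what does activate mean? mmap_activate() is called only once, and mmap_reserve() is called once before it. mmap_reserve() does not take alignment into account, so the caller asks it for more than size bytes, namely size + align. That way, wherever the reserved region happens to start, the aligned range we actually need, including its right end, is guaranteed to lie inside the reservation.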

|----|----|----|----|----|----|   // address space; the alignment is 4
   --|----|----|----|-            // to get 11 units we pass 11 + 4 = 15 to mmap_reserve
     |----|----|---               // mmap_activate maps only 11 units and keeps the start address aligned
static void *mmap_activate(void *ptr, size_t size, int fd,
                           uint32_t qemu_map_flags, off_t map_offset)
{
    // This part simply derives the real mmap flags
    // from the qemu_map_flags passed in
    const bool noreserve = qemu_map_flags & QEMU_MAP_NORESERVE;
    const bool readonly = qemu_map_flags & QEMU_MAP_READONLY;
    const bool shared = qemu_map_flags & QEMU_MAP_SHARED;
    const bool sync = qemu_map_flags & QEMU_MAP_SYNC;
    const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE);
    int map_sync_flags = 0;
    int flags = MAP_FIXED;
    void *activated_ptr;

    //...
    // Without an fd, use MAP_ANONYMOUS
    flags |= fd == -1 ? MAP_ANONYMOUS : 0;
    flags |= shared ? MAP_SHARED : MAP_PRIVATE;
    flags |= noreserve ? MAP_NORESERVE : 0;
    if (shared && sync) {
        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
    }

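    // mmap essentially just establishes a mapping. Mapping an fd ties this range to the file
    // behind the fd, starting at map_offset: a one-to-one mapping between the process's
    // virtual address range and the file contents.
    // If there is no fd (fd == -1), the mapping is anonymous and backed by zero-filled memory.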
    activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd, map_offset);
    // error handling...
    return activated_ptr;
}

mmap() / brk() / sbrk() / Syscall

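These syscalls look interchangeable, since both can be used to allocate memory. However:

  • mmap() works at page granularity and has to set up a new mapping in the process's address space, which makes it a relatively heavy operation;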
void *mmap(void addr[.length], size_t length, int prot, int flags, int fd, off_t offset);

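The flags argument has two parts. The first part consists of three flags, of which exactly one must be given:

  • MAP_SHARED: when one process modifies the contents, other processes see the change; if the mapping is backed by a file, the change is also carried through to the file.
  • MAP_SHARED_VALIDATE: the same as MAP_SHARED, but with some additional error checking.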
  • MAP_PRIVATE: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.

The second part consists of flags of which zero or more can be OR-ed in:

  • MAP_ANONYMOUS: The mapping is not backed by any file; its contents are initialized to zero. The fd argument is ignored. Anonymous mappings are simply large, zero-filled blocks of memory ready for use.
  • MAP_FIXED: the addr passed in must be suitably aligned, because the mapping is placed at exactly that address. If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded.

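Two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages. Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory. In other words, mmap() maps directly onto the page cache pages that back fd. The address passed to mmap is the virtual address to map, and it points straight at the page cache rather than at separately allocated physical memory. Otherwise things would be awkward: a store to the mapped address would first modify a private physical copy, and the data would then have to be written to the page cache/file.

The newly mapped range [addr, addr + length) is initialized from the file contents in [offset, offset + length).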
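
The difference between MAP_SHARED and MAP_PRIVATE described above can be observed directly. In this minimal sketch, write_through_mapping() is a hypothetical helper: it creates a temporary file containing the byte 'A', stores 'B' through a one-byte mapping created with the given flag, and then reads the file back with pread().

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the first byte of the file after 'B' has been
 * stored through a mapping created with `map_flag`, or -1 on error.
 */
static int write_through_mapping(int map_flag)
{
    FILE *f = tmpfile();            /* opened read/write, deleted on close */
    if (!f)
        return -1;

    int fd = fileno(f);
    int result = -1;

    if (write(fd, "A", 1) == 1) {
        char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flag, fd, 0);
        if (p != MAP_FAILED) {
            char c;

            p[0] = 'B';             /* store through the mapping */
            if (pread(fd, &c, 1, 0) == 1)
                result = c;         /* what the file itself now contains */
            munmap(p, 1);
        }
    }

    fclose(f);
    return result;
}
```

On Linux one would expect write_through_mapping(MAP_SHARED) to return 'B' (the store reached the page cache that read paths also use) and write_through_mapping(MAP_PRIVATE) to return 'A' (the store went to a private copy-on-write page).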

int brk(void *addr);

brk() sets the end of the data segment to the value specified by addr, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size.

void *sbrk(intptr_t increment);

sbrk() increments the program's data space by increment bytes. Calling sbrk() with an increment of 0 can be used to find the current location of the program break.

brk sets the upper limit of the data segment; sbrk increments it. In ancient Unixes malloc/free used sbrk. On modern systems things can be very different: for example, OS X does not use brk/sbrk to manage heap allocations but mmap; brk/sbrk exist but are just emulated in a small segment of memory. It is almost the same on Linux (the source code mentions the history of the transition from brk/sbrk to mmap).
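
A small sketch of how the program break moves, assuming Linux/glibc where sbrk() is still available; move_break() is a hypothetical helper and not something an application should use alongside malloc() in real code.

```c
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: move the program break by `increment` bytes and
 * return how far it actually moved, or -1 if sbrk() failed.
 */
static intptr_t move_break(intptr_t increment)
{
    /* sbrk(0) only reports the current break. */
    char *before = sbrk(0);

    if (sbrk(increment) == (void *)-1)
        return -1;

    char *after = sbrk(0);
    return (intptr_t)(after - before);
}
```

Calling move_break(4096) should report 4096: the break simply slides up, and no memory is touched until the program accesses the new range.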

c - Difference between brk and sbrk - Stack Overflow

Relationship with page faults

brk only extends the VMA. When the application touches the newly allocated region, a page fault occurs. The page fault routine searches for the VMA responsible for the address and then calls the page table handler.

mmap works the same way: it only sets up the mapping and does not populate the page table. Everything is on demand, and the page table is updated only when a page fault happens.

memory management - What happens in the kernel when the process accesses an address just allocated with brk/sbrk? - Stack Overflow

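Difference between brk and mmap

  • brk can only grow the program break right after the current one, so the VMA it manages is constrained.
  • mmap can create a mapping at any virtual address.

This is mmap's biggest advantage.

Chris's Wiki :: blog/unix/SbrkVersusMmap; focus on the part starting with: Then, in SunOS 4, …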

Top half / Bottom half

When an interrupt fires, the interrupt handler must run quickly and must not perform time-consuming work. If the handler has a lot to do, it should be split into two halves.

The top half is the interrupt handler itself. If there is little work to do, the top half alone is enough and no bottom half is needed. Otherwise the time-consuming part is deferred to the bottom half (for example a workqueue), which runs later.

[Mastering Workqueue in Linux Kernel Programming Full Tutorial](https://embetronicx.com/tutorials/linux/device-drivers/workqueue-in-linux-kernel/)

vmap() / kmap() Kernel

// start from "pages", "count" pages
// flag VM_FLUSH_RESET_PERMS: reset direct map and flush TLB on unmap, can't be freed in atomic context
void * vmap(struct page ** pages, unsigned int count, unsigned long flags, pgprot_t prot);

vmap() : used to map multiple physical pages into a contiguous kernel virtual address space for a long duration.

kmap() : used to map a single physical page for a short duration.

me_pagecache_clean() Kernel

Drops one page from the page cache during memory-failure handling. Note that it works on a page, not a folio.

/*
 * Clean (or cleaned) page cache page.
 */
static int me_pagecache_clean(struct page_state *ps, struct page *p)
{
	int ret;
	struct address_space *mapping;
	bool extra_pins;

	delete_from_lru_cache(p);

	/*
	 * For anonymous pages we're done the only reference left
	 * should be the one m_f() holds.
	 */
	if (PageAnon(p)) {
		ret = MF_RECOVERED;
		goto out;
	}

	/*
	 * Now truncate the page in the page cache. This is really
	 * more like a "temporary hole punch"
	 * Don't do this for block devices when someone else
	 * has a reference, because it could be file system metadata
	 * and that's not safe to truncate.
	 */
	mapping = page_mapping(p);
	if (!mapping) {
		/*
		 * Page has been teared down in the meanwhile
		 */
		ret = MF_FAILED;
		goto out;
	}

	/*
	 * The shmem page is kept in page cache instead of truncating
	 * so is expected to have an extra refcount after error-handling.
	 */
	extra_pins = shmem_mapping(mapping);

	/*
	 * Truncation is a bit tricky. Enable it per file system for now.
	 *
	 * Open: to take i_rwsem or not for this? Right now we don't.
	 */
	ret = truncate_error_page(p, page_to_pfn(p), mapping);
	if (has_extra_refcount(ps, p, extra_pins))
		ret = MF_FAILED;

out:
	unlock_page(p);

	return ret;
}

kunmap_local() / kmap_local_page() / kmap_local_folio() / Kernel

Unmap a page mapped via kmap_local_page().

Contrary to kmap() mappings, the local mappings are only valid in the context of the caller and cannot be handed to other contexts. This implies that users must be absolutely sure to keep the use of the return address local to the thread which mapped it.

High Memory Handling — The Linux Kernel documentation

Higher order pages / folio order

Higher order pages are groups of 2^n physically contiguous pages where n is the page order.

clear_page() Kernel

clear_page() can zero pages.

filemap_grab_folio() / __filemap_get_folio() Kernel

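A freshly allocated order-0 page should end up with a refcount of 3 (see the analysis below). After a while the reference taken for the LRU is dropped and the count becomes 2. For a folio, the count depends on the number of pages in the folio.

Looks up the page cache entry at page cache @mapping & @index. If no folio is found, a new folio is created.

The page cache caches the contents of a file on disk; this function returns the descriptor (struct folio) of the cached page at the given position.

  • If the page cache has it, increment the folio's ref_count by 1 and return it;
  • If the page cache does not have it, allocate a folio (ref_count + 1), add it to the page cache (ref_count + nr_pages), then add it to the LRU (ref_count + 1).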
static inline struct folio *filemap_grab_folio(struct address_space *mapping, pgoff_t index)
{
	return __filemap_get_folio(mapping, index, FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mapping_gfp_mask(mapping));
}

// __filemap_get_folio - Find and get a reference to a folio.
// Looks up the page cache entry at @mapping & @index.
// If this function returns a folio, it is returned with an increased refcount.
// ...
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp)
{
	struct folio *folio;
repeat:
    // The page cache is also implemented with an xarray; look up the page at index.
    // If it is found, its refcount is incremented by 1:
    // folio_try_get_rcu
    //     folio_ref_add(folio, 1);
	folio = filemap_get_entry(mapping, index);
    // A value entry (as opposed to a pointer) means the page cache does not hold this folio yet
	if (xa_is_value(folio))
		folio = NULL;
	if (!folio)
		goto no_page;
// -------------------- Reaching here means the folio is in the page cache --------------------
    // Code handling the lock (FGP_LOCK)...
    // ...
	if (fgp_flags & FGP_ACCESSED)
		folio_mark_accessed(folio);
	else if (fgp_flags & FGP_WRITE) {
		/* Clear idle flag for buffer write */
		if (folio_test_idle(folio))
			folio_clear_idle(folio);
	}

	if (fgp_flags & FGP_STABLE)
		folio_wait_stable(folio);
// -------------------- The page cache has no folio here, so create one --------------------
no_page:
    // This condition means the page cache holds no folio and one must be added now
	if (!folio && (fgp_flags & FGP_CREAT)) {
		unsigned order = FGF_GET_ORDER(fgp_flags);
		int err;

		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
			gfp |= __GFP_WRITE;
		if (fgp_flags & FGP_NOFS)
			gfp &= ~__GFP_FS;
		if (fgp_flags & FGP_NOWAIT) {
			gfp &= ~GFP_KERNEL;
			gfp |= GFP_NOWAIT | __GFP_NOWARN;
		}
		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
			fgp_flags |= FGP_LOCK;

		if (!mapping_large_folio_support(mapping))
			order = 0;
		if (order > MAX_PAGECACHE_ORDER)
			order = MAX_PAGECACHE_ORDER;
		/* If we're not aligned, allocate a smaller folio */
		if (index & ((1UL << order) - 1))
			order = __ffs(index);

		do {
			gfp_t alloc_gfp = gfp;

			if (order == 1)
				order = 0;
			if (order > 0)
				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
            // This call chain ends in set_page_count(), so the page's refcount becomes 1
            // folio_alloc
            //   folio = alloc_pages()
            //      __alloc_pages
            //        get_page_from_freelist    
            //          prep_new_page
            //            post_alloc_hook
            //              set_page_refcounted
            //                set_page_count(page, 1);
			folio = filemap_alloc_folio(alloc_gfp, order);
			if (!folio)
				continue;

			/* Init accessed so avoid atomic mark_page_accessed later */
			if (fgp_flags & FGP_ACCESSED)
				__folio_set_referenced(folio);

            // Add the folio we just allocated to the page cache.
            // The folio's refcount is raised once per page it contains,
            // and then once more because it is also put on the LRU.
            // filemap_add_folio
            //   __filemap_add_folio
            //     	folio_ref_add(folio, nr);
            //   folio_add_lru()
			err = filemap_add_folio(mapping, folio, index, gfp);
            // Important: on success we break out of the loop here, so
            // folio_put() below is not reached and the refcount is not decremented
			if (!err)
				break;
            // Drop one reference on the folio;
            // if it reaches 0 the whole folio is freed
			folio_put(folio);
			folio = NULL;
		} while (order-- > 0);

		if (err == -EEXIST)
			goto repeat;
		if (err)
			return ERR_PTR(err);
		/*
		 * filemap_add_folio locks the page, and for mmap
		 * we expect an unlocked page.
		 */
		if (folio && (fgp_flags & FGP_FOR_MMAP))
			folio_unlock(folio);
	}

    //...
	return folio;
}

fget(), fput() Kernel

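fget(): looks up the struct file for an fd and increments that struct file's f_count reference count. The argument is an fd.

fput(): the counterpart of fget(). It decrements f_count, and if f_count reaches 0 the corresponding struct file is destroyed. The argument is a struct file.

  1. Decrement the file's use count. This counter is the number of users sharing one open file description (for example a parent and a child process sharing a file descriptor). Two unrelated processes that each call open() on the same file do not share this counter; each has its own open file description.
  2. If f_count reaches 0 after the decrement, nobody needs this open file description any more, and the release/writeback work is carried out. If it is not 0, nothing is released or written back and the function returns immediately.
  3. If we are not in interrupt context and not in a kernel thread (why this check is made is still to be investigated), a task that runs ____fput is added to the current process's task work list. Functions on this list run just before the return to user mode. So it is reasonable to believe that ____fput performs the time-consuming release or writeback work, which is why close can take long to return.

As the call chain shows, fput, like close, ends up calling the release function.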

fput
    ____fput
        __fput
        	file->f_op->release(inode, file);
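
Point 1 above (who shares an open file description and who does not) can be observed from user space through the file offset, which lives in the shared struct file. This is a minimal sketch: offset_after_write() is a hypothetical helper, and the /tmp path is an assumption about the environment.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 3 bytes through one fd, then report the file
 * offset seen through a second fd obtained either with dup() (same open
 * file description) or with a separate open() (a new one). -1 on error.
 */
static long offset_after_write(int use_dup)
{
    char path[] = "/tmp/fdshare-XXXXXX";
    long off = -1;

    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);
    if (fd2 >= 0) {
        if (write(fd1, "abc", 3) == 3)
            off = (long)lseek(fd2, 0, SEEK_CUR);
        close(fd2);
    }

    close(fd1);
    unlink(path);
    return off;
}
```

With dup() the second fd reports offset 3, because both fds refer to one struct file whose reference count was simply raised; with a separate open() it reports 0, because that call created its own struct file.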

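Why must read locks and write locks be mutually exclusive?

That ensures that a reader will never see a partial update (a state where the writer has updated some parts of the data but not all of them. This state is usually inconsistent). For example, only half of a struct's fields have been written and the other half has not.

What if it is just a simple variable, such as an int, whose writes produce no intermediate state? Do readers and writers still need to exclude each other?

It may look as if the read lock and the write lock no longer need to exclude each other in this case. In fact, however, even a write to a long is not necessarily atomic (for example a 64-bit value on a 32-bit machine). If reads and writes may run concurrently, a read can observe the middle of a write, e.g. right after the first 32 bits were stored, which is neither the old value nor the new one. That is already true for a long, and what is read in practice is usually a more complex object with even more intermediate states. So the read lock is still necessary; you cannot just read whenever you like.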
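
The intermediate state can be made concrete with a little arithmetic. The sketch below does not use threads; torn_value() is a hypothetical helper that computes what a reader would see if a 64-bit store were performed as two 32-bit stores (low half first, an assumption made for illustration) and the read landed between them.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the value observed after only the low 32 bits of
 * `new_val` have been stored over `old_val`.
 */
static uint64_t torn_value(uint64_t old_val, uint64_t new_val)
{
    uint64_t old_high = old_val & 0xFFFFFFFF00000000ULL;
    uint64_t new_low  = new_val & 0x00000000FFFFFFFFULL;

    return old_high | new_low;
}
```

For example, going from 0x00000000FFFFFFFF to 0x0000000100000000 (an increment by one), the torn read yields 0, which is neither the old value nor the new one.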

Of course, for variables that really are atomic, such as bool (a byte): Boolean variables cannot cause assignment tearing like long for example, hence locking is not necessary.

c# - Locking a single bool variable when multithreading? - Stack Overflow

If that is so, why does C++ still have atomic<bool>?

Remember about memory barriers. Although it may be impossible to change bool partially, it is possible that multiprocessor system has this variable in multiple copies and one thread can see old value even after another thread has changed it to new. Atomic introduces memory barrier, so it becomes impossible. So this is a different problem.

c++ - When do I really need to use atomic<bool> instead of bool? - Stack Overflow

Memory banks

The term RAM Bank is not really a standard term, but people often use it to refer to memory modules: the rectangular strips you can buy in various capacities, with many golden contacts and a small notch in between that makes sure the module only fits the correct memory slot.

Bank, Module, and Bar are all terms people use. Module is the official name; bank and bar are commonly used by people who do not know the proper terminology.

So a RAM Bank (or RAM Module) is a circuit board with gold contacts and RAM chips, used in computers to temporarily hold data.

memory - What is a RAM Bank? How is it defined? - Super User

How to see banks:

lshw -class memory

linux - How do I determine the number of RAM slots in use? - Unix & Linux Stack Exchange

smp_wmb() / smp_rmb() / smp_mb() / wmb() / rmb() / mb() / barrier() Kernel

Read/write memory barrier.

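The compiler barrier barrier() involves no hardware instruction. It is the weakest kind of barrier and only constrains the compiler. The rule for deciding whether a hardware barrier instruction (lfence and friends) is needed: does the code, including its memory operations, truly run in parallel, i.e. can there be more than one CPU observing it? If you can guarantee that there is never more than one CPU observer (for example, every process running this code is pinned to the same CPU), you do not need instruction-level barriers at all, and the only remaining question is whether a compiler barrier is needed.

On x86, whether smp_rmb() and smp_wmb() emit lfence and sfence depends on the kernel version and configuration. In recent mainline kernels they reduce to barrier() and only stop the compiler from reordering, because the x86 memory model already keeps loads ordered with loads and stores ordered with stores.

rmb() and wmb() do use the lfence and sfence instructions (on 64-bit); they are likely to be needed when interacting with devices.

If every observer of the memory ordering is a CPU, use the smp_ variants. If the ordering involves both CPUs and hardware devices (for example a network device), use mb(). Unless we write drivers we rarely deal with devices directly, so the smp_ variants are the usual choice.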

  • mb(): A full system memory barrier. All memory operations before the mb() in the instruction stream will be committed before any operations after the mb() are committed. This ordering will be visible to all bus masters in the system. It will also ensure the order in which accesses from a single processor reaches slave devices.
  • rmb(): Like mb(), but only guarantees ordering between read accesses. That is, all read operations before an rmb() will be committed before any read operations after the rmb().
  • wmb(): Like mb(), but only guarantees ordering between write accesses. That is, all write operations before a wmb() will be committed before any write operations after the wmb().
  • smp_mb(): Similar to mb(), but only guarantees ordering between cores/processors within an SMP system. All memory accesses before the smp_mb() will be visible to all cores within the SMP system before any accesses after the smp_mb().
  • smp_rmb(): Like smp_mb(), but only guarantees ordering between read accesses.
  • smp_wmb(): Like smp_mb(), but only guarantees ordering between write accesses.
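
The typical pairing of a write barrier on the producer side with a read barrier on the consumer side can be sketched in user space with C11 fences. This is an analogy, not kernel code: publish() and try_consume() are hypothetical names, and the C11 release/acquire fences only play the role that smp_wmb()/smp_rmb() play in the kernel (they are somewhat stronger).

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data being handed over */
static atomic_bool ready;  /* flag that announces the data */

/* Producer: write the data first, then the flag. */
static void publish(int value)
{
    payload = value;
    /* Plays the role of smp_wmb(): the data store is ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: read the flag first, then the data. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Plays the role of smp_rmb(): the flag load is ordered before the data load. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Both sides are needed: a write barrier in the producer gives no guarantee unless the consumer also orders its two reads.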

This article also explains it clearly: Memory access ordering: Barriers and the Linux kernel - Architectures and Processors blog - Arm Community blogs - Arm Community

如何使用屏障指令 - 知乎

Kernel dmesg call trace

what is the number +0x71/0x420 meaning in call trace?

The first number is the offset inside the function, the second is the size of the function.
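
As an illustration, an entry of this shape can be split into its three parts with sscanf(). The struct and function names below are hypothetical, as is the symbol name used in the usage note.

```c
#include <stdio.h>

struct trace_entry {
    char symbol[64];        /* function name */
    unsigned long offset;   /* offset of the address inside the function */
    unsigned long size;     /* total size of the function */
};

/* Parse "symbol+0xOFFSET/0xSIZE". Returns 0 on success, -1 otherwise. */
static int parse_trace_entry(const char *text, struct trace_entry *out)
{
    /* %63[^+] reads the symbol up to '+'; %lx accepts the 0x prefix. */
    int n = sscanf(text, "%63[^+]+%lx/%lx",
                   out->symbol, &out->offset, &out->size);

    return n == 3 ? 0 : -1;
}
```

Parsing "example_func+0x71/0x420" yields symbol example_func, offset 0x71, and size 0x420, i.e. the address lies 0x71 bytes into a function that is 0x420 bytes long.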

kernel - what is the number +0x71/0x420 meaning in call trace? - Unix & Linux Stack Exchange

tdp_mmu_iter_cond_resched() KVM

Iterating over the TDP MMU is time-consuming, so this function lets the walker give up the CPU (and the MMU lock) when appropriate.

/*
 * Yield if the MMU lock is contended (rwlock_needbreak(&kvm->mmu_lock)) or 
 * this thread needs to return control to the scheduler (need_resched()).
 * (To yield means to give up the CPU so that another thread can run)
 *
 * If this function yields, iter->yielded is set and the caller must skip to
 * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
 * over the paging structures to allow the iterator to continue its traversal
 * from the paging structure root.
 *
 * Returns true if this function yielded.
 */
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							  struct tdp_iter *iter,
							  bool flush, bool shared)
{
	WARN_ON_ONCE(iter->yielded);

	/* Ensure forward progress has been made before yielding. */
        
	if (iter->next_last_level_gfn == iter->yielded_gfn)
		return false;

    // Returns non-zero if there is another task waiting on the rwlock.
	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		rcu_read_unlock();

		if (shared)
			cond_resched_rwlock_read(&kvm->mmu_lock);
		else
			cond_resched_rwlock_write(&kvm->mmu_lock);

		rcu_read_lock();

		WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

		iter->yielded = true;
	}

	return iter->yielded;
}

TD Preserving

/lib/firmware/intel-seam/libtdx.bin
/lib/firmware/intel-seam/libtdx.bin.sigstruct

echo update > /sys/devices/system/cpu/tdx/reload

kvm-nx-lpage-recovery

It is a kernel thread.

kvm_create_vm
    kvm_arch_post_init_vm
        kvm_mmu_post_init_vm
            if (nx_hugepage_mitigation_hard_disabled)
        		return 0;
            kvm_nx_huge_page_recovery_worker

This thread serves the NX huge page mitigation. It can be turned off when loading KVM:

sudo modprobe kvm nx_huge_pages=never

The current setting can be checked with:

cat /sys/module/kvm/parameters/nx_huge_pages

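rmmod: ERROR: Module kvm_intel is in use

I have not found a fix yet, but here is one lead: if the fd turns out to be held by the kvm-nx-lpage-recovery kernel thread, see that section for a way to keep the thread from being started.

sudo lsof | grep kvm # sometimes the holder is a kernel thread: kvm-nx-lpage-recovery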
ps -ef | grep  <pid>

BDRV / QEMU

bdrv.

KVM_BUG_ON() KVM

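Triggers a WARN_ON_ONCE, but only once: later failures no longer warn, so effectively only the first place that went wrong is reported. It also marks the VM as bugged via kvm_vm_bugged().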

#define KVM_BUG_ON(cond, kvm)					\
({								\
	bool __ret = !!(cond);					\
								\
	if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))		\
		kvm_vm_bugged(kvm);				\
	unlikely(__ret);					\
})

static inline void kvm_vm_bugged(struct kvm *kvm)
{
	kvm->vm_bugged = true;
	kvm_vm_dead(kvm);
}
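
The "report once, then stay dead" behaviour can be mimicked in user space. This is only a sketch under simplifying assumptions: struct toy_vm and toy_bug_on() are hypothetical, and a counter stands in for the kernel's warning machinery.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct kvm, reduced to what the sketch needs. */
struct toy_vm {
    bool bugged;    /* set once, never cleared */
    int warnings;   /* how many times we actually warned */
};

/*
 * Returns whether the condition was true, like KVM_BUG_ON(), but warns
 * and marks the VM only the first time a failure is seen.
 */
static bool toy_bug_on(bool cond, struct toy_vm *vm)
{
    if (cond && !vm->bugged) {
        vm->warnings++;
        fprintf(stderr, "toy_vm: bug detected, marking VM as bugged\n");
        vm->bugged = true;
    }
    return cond;
}
```

Callers can still branch on the return value every time, while the log only records the first failure, which is usually the one that matters for debugging.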

cpu_synchronize_all_states() / cpu_synchronize_state() / kvm_cpu_synchronize_state() / do_kvm_cpu_synchronize_state() / kvm_arch_get_registers() QEMU

Synchronize all vcpu states from KVM.

void cpu_synchronize_all_states(void)
{
    //...
    CPU_FOREACH(cpu) {
        cpu_synchronize_state(cpu);
    }
}

void cpu_synchronize_state(CPUState *cpu)
{
    if (cpus_accel->synchronize_state) {
        cpus_accel->synchronize_state(cpu);
    }
}

void kvm_cpu_synchronize_state(CPUState *cpu)
{
    if (!cpu->vcpu_dirty) {
        run_on_cpu(cpu, do_kvm_cpu_synchronize_state, RUN_ON_CPU_NULL);
    }
}

static void do_kvm_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
{
    if (!cpu->vcpu_dirty) {
        kvm_arch_get_registers(cpu);
        cpu->vcpu_dirty = true;
    }
}

int kvm_arch_get_registers(CPUState *cs)
{
    X86CPU *cpu = X86_CPU(cs);

    // While syncing, the vCPU must be stopped (or be the calling thread itself)
    assert(cpu_is_stopped(cs) || qemu_cpu_is_self(cs));

    // All of this state is fetched from KVM through ioctls.
    ret = kvm_get_vcpu_events(cpu);
    ret = kvm_get_mp_state(cpu);
    ret = kvm_getput_regs(cpu, 0);
    ret = kvm_get_xsave(cpu);
    ret = kvm_get_xcrs(cpu);
    ret = has_sregs2 ? kvm_get_sregs2(cpu) : kvm_get_sregs(cpu);
    ret = kvm_get_msrs(cpu);
    ret = kvm_get_apic(cpu);
    ret = kvm_get_debugregs(cpu);
    ret = kvm_get_nested_state(cpu);
    //...
    cpu_sync_bndcs_hflags(&cpu->env);
    return ret;
}

Change another ESP's grub information

Use lsblk to see all the hard drives.

sudo mkdir -p /mnt/efi
sudo lvmdiskscan
# mount the /boot partition and ESP partition
# sudo mount /dev/sdc2 /mnt/boot (optional)
sudo mount /dev/sdc1 /mnt/efi
vim /mnt/efi/EFI/centos/grub.cfg

Find the lines around terminal_output console, change the timeout=0 to timeout=5.

How to mount an LVM volume?

# Use the size to find the volume group (VG) and logical volume (LV) you want to mount
lvs
# Activate the volume group
vgchange -ay <VG>
# mount
mkdir -p /another
mount /dev/<VG>/<LV> /another # mount /dev/cs/home /another

EAGAIN / EWOULDBLOCK

For most systems, EAGAIN and EWOULDBLOCK will be the same. There are only a few systems in which they are different.
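
Portable code therefore usually checks for both values. The sketch below shows the check and one way to provoke the error, by reading from an empty non-blocking pipe; both function names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* True if the error means "try again later" on any system. */
static bool is_would_block(int err)
{
    return err == EAGAIN || err == EWOULDBLOCK;
}

/*
 * Read from an empty non-blocking pipe and return the resulting errno
 * (0 if the read unexpectedly succeeded, -1 if the setup failed).
 */
static int errno_from_empty_pipe(void)
{
    int fds[2];
    char byte;
    int err = -1;

    if (pipe(fds) == -1)
        return -1;

    int fl = fcntl(fds[0], F_GETFL);
    if (fl != -1 && fcntl(fds[0], F_SETFL, fl | O_NONBLOCK) != -1) {
        ssize_t n = read(fds[0], &byte, 1);
        err = (n == -1) ? errno : 0;
    }

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On systems where the two constants are equal the second comparison is redundant but harmless; on the few where they differ it is what keeps the code correct.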

sockets - Difference between 'EAGAIN' or 'EWOULDBLOCK' - Stack Overflow

shadow_*_mask / kvm_mmu_reset_all_pte_masks() KVM

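Why assign these values again instead of using the macros directly?

Because once EPT is involved, these global variables may take different values depending on whether EPT is enabled; see kvm_mmu_set_ept_masks().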

In file arch/x86/kvm/mmu/spte.c:

u64 __read_mostly shadow_host_writable_mask;
u64 __read_mostly shadow_mmu_writable_mask;
u64 __read_mostly shadow_nx_mask;
u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
u64 __read_mostly shadow_user_mask;
u64 __read_mostly shadow_accessed_mask;
u64 __read_mostly shadow_dirty_mask;
u64 __read_mostly shadow_mmio_value;
u64 __read_mostly shadow_mmio_mask;
u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_memtype_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;

u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;

u8 __read_mostly shadow_phys_bits;

kvm_mmu_reset_all_pte_masks:

vmx_init
    kvm_x86_vendor_init
        __kvm_x86_vendor_init
            kvm_mmu_vendor_module_init
                kvm_mmu_reset_all_pte_masks

void kvm_mmu_reset_all_pte_masks(void)
{
    //...
	shadow_nonpresent_or_rsvd_lower_gfn_mask =
		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
	shadow_user_mask	= PT_USER_MASK;
	shadow_accessed_mask	= PT_ACCESSED_MASK;
	shadow_dirty_mask	= PT_DIRTY_MASK;
	shadow_nx_mask		= PT64_NX_MASK;
	shadow_x_mask		= 0;
	shadow_present_mask	= PT_PRESENT_MASK;

	/*
	 * For shadow paging and NPT, KVM uses PAT entry '0' to encode WB
	 * memtype in the SPTEs, i.e. relies on host MTRRs to provide the
	 * correct memtype (WB is the "weakest" memtype).
	 */
	shadow_memtype_mask	= 0;
	shadow_acc_track_mask	= 0;
	shadow_me_mask		= 0;
	shadow_me_value		= 0;

	shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITABLE;
	shadow_mmu_writable_mask  = DEFAULT_SPTE_MMU_WRITABLE;
    //...
}

mmu_page_hash / KVM

struct kvm_arch {
    //...
    // hash table with 4K slots
	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
    //...
}

This hash table is updated in functions such as:

kvm_mmu_alloc_shadow_page
	hlist_add_head(&sp->hash_link, sp_list);

gfn_round_for_level() / KVM

The level argument can be one of:

enum pg_level {
	PG_LEVEL_NONE,
	PG_LEVEL_4K,
	PG_LEVEL_2M,
	PG_LEVEL_1G,
	PG_LEVEL_512G,
	PG_LEVEL_NUM
};
static inline gfn_t gfn_round_for_level(gfn_t gfn, int level)
{
    // KVM_PAGES_PER_HPAGE(level) is the number of 4K pages covered by one page at that level.
    // For PG_LEVEL_4K it is 1, and its negation is 0xffffffffffffffff.
    // For PG_LEVEL_2M it is 512, and its negation is 0xfffffffffffffe00.
    // And so on: ANDing with the negation clears the low bits of gfn.
	return gfn & -KVM_PAGES_PER_HPAGE(level);
}
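
The masking can be checked with concrete numbers. This is a user-space sketch with hypothetical names; it assumes x86 paging, where each level covers 9 more address bits than the one below (1, 512, 512 * 512 pages, ...).

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/* Level numbering follows enum pg_level: 1 = 4K, 2 = 2M, 3 = 1G. */
enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

/* Number of 4K pages covered by one page at `level`. */
static uint64_t pages_per_hpage(int level)
{
    return UINT64_C(1) << ((level - 1) * 9);
}

/* Round gfn down to the first gfn of the huge page that contains it. */
static gfn_t round_gfn_for_level(gfn_t gfn, int level)
{
    /* For a power of two n, -n is a mask with the low log2(n) bits cleared. */
    return gfn & -pages_per_hpage(level);
}
```

For gfn 0x12345, rounding for the 2M level gives 0x12200 (low 9 bits cleared), while rounding for the 4K level leaves the gfn unchanged.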

kvm_cpu_role / KVM

It contains two fields, base and ext: base is a kvm_mmu_page_role and ext is a kvm_mmu_extended_role, which carries information such as CR bits and LMA.

kvm_cpu_role was renamed from kvm_mmu_role (7a7ae8292391c4d53c4340e606bf48776c3449e7). It follows that kvm_cpu_role describes the role of the whole MMU, whereas kvm_mmu_page_role only describes the role of a single page in the SPT.

union kvm_cpu_role {
	u64 as_u64;
	struct {
		union kvm_mmu_page_role base;
		union kvm_mmu_extended_role ext;
	};
};

struct kvm_mmu {
    //...
	union kvm_cpu_role cpu_role;
    //...
}
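
One practical benefit of overlaying both roles with a single u64 is that two complete roles can be compared with one integer comparison. The sketch below illustrates the idea with toy types; the field names and widths are hypothetical and far smaller than the real unions, and it assumes a compiler with C11 anonymous struct support.

```c
#include <stdint.h>

/* Toy versions of the two 32-bit roles; the fields are hypothetical. */
union toy_page_role {
    uint32_t word;
    struct {
        uint32_t level : 4;
        uint32_t direct : 1;
    };
};

union toy_ext_role {
    uint32_t word;
    struct {
        uint32_t cr4_pse : 1;
        uint32_t efer_lma : 1;
    };
};

/* Both roles viewed either separately or as one 64-bit integer. */
union toy_cpu_role {
    uint64_t as_u64;
    struct {
        union toy_page_role base;
        union toy_ext_role ext;
    };
};

/* A change in any field of either role shows up in as_u64. */
static int toy_role_changed(union toy_cpu_role a, union toy_cpu_role b)
{
    return a.as_u64 != b.as_u64;
}
```

This only works if every role starts out fully zeroed, so that padding and unused bits never differ between two otherwise equal roles.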

Guest-paging verification

It is a VM-execution control bit.

Specifically, guest-paging verification may apply to an access using a GPA to the guest PSE that maps a page for the linear address.

That is, it applies when the page being accessed is one of the guest's page-table pages.

It applies if bit 57 (verify guest paging) is set in the EPT PSE that maps for that GPA.

Intel VT-rp - Part 2. paging-write and guest-paging verification | Satoshi’s notes

Exit qualification

This field contains additional information about the cause of VM exits due to the following: debug exceptions; page-fault exceptions; start-up IPIs (SIPIs); task switches; INVEPT; INVLPG; INVVPID; LGDT; LIDT; LLDT; LTR; SGDT; SIDT; SLDT; STR; VMCLEAR; VMPTRLD; VMPTRST; VMREAD; VMWRITE; VMXON; XRSTORS; XSAVES; control-register accesses; MOV DR; I/O instructions; and MWAIT. The format of the field depends on the cause of the VM exit. See Section 27.2.1 for details.

For a page-fault exception, the exit qualification contains the linear address that caused the page fault. (GVA)

struct vfsmount Kernel

A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).

struct vfsmount {
	struct dentry *mnt_root;	// root of the mounted tree
	struct super_block *mnt_sb;	// super_block 代表了整个文件系统
	int mnt_flags;
	struct mnt_idmap *mnt_idmap;
} __randomize_layout;

vfs_kern_mount() Kernel

kern_mount() Kernel

FNAME

#define PTTYPE_EPT 18 /* arbitrary */
#define PTTYPE PTTYPE_EPT
#include "paging_tmpl.h"
#undef PTTYPE

#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE

#define PTTYPE 32
#include "paging_tmpl.h"
#undef PTTYPE

The point of this design is to include the paging_tmpl.h header several times, each time with a different PTTYPE.

In the EPT case, FNAME(page_fault) is simply ept_page_fault.

#if PTTYPE == 64
	#define FNAME(name) paging##64_##name
#elif PTTYPE == 32
	#define FNAME(name) paging##32_##name
#elif PTTYPE == PTTYPE_EPT
	#define FNAME(name) ept_##name
#endif

The resulting names are:

paging64_page_fault
paging32_page_fault
ept_page_fault

struct inode_operations Kernel

// The i_op member lists the directory-related operations, for example
//  - the mkdir() syscall ends up calling inode->i_op->mkdir(), and
//  - the link() syscall ends up calling inode->i_op->link().
struct inode_operations {
	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t);
    //...
} ____cacheline_aligned;

struct inode / Kernel

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

    // The i_op member lists the directory-related operations, for example
    //  - the mkdir() syscall ends up calling inode->i_op->mkdir(), and
    //  - the link() syscall ends up calling inode->i_op->link().
	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	__i_ctime; /* use inode_*_ctime accessors! */
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	u8			i_write_hint;
	blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	unsigned long		i_state;
	struct rw_semaphore	i_rwsem;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */
	unsigned long		dirtied_time_when;

	struct hlist_node	i_hash;
	struct list_head	i_io_list;	/* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */

	/* foreign inode detection, see wbc_detach_inode() */
	int			i_wb_frn_winner;
	u16			i_wb_frn_avg_time;
	u16			i_wb_frn_history;
#endif
	struct list_head	i_lru;		/* inode LRU list */
	struct list_head	i_sb_list;
	struct list_head	i_wb_list;	/* backing dev writeback list */
	union {
		struct hlist_head	i_dentry;
		struct rcu_head		i_rcu;
	};
	atomic64_t		i_version;
	atomic64_t		i_sequence; /* see futex */
	atomic_t		i_count;
	atomic_t		i_dio_count;
	atomic_t		i_writecount;
#if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
	atomic_t		i_readcount; /* struct files open RO */
#endif
	union {
		const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
		void (*free_inode)(struct inode *);
	};
	struct file_lock_context	*i_flctx;
	struct address_space	i_data;
	struct list_head	i_devices;
	union {
		struct pipe_inode_info	*i_pipe;
		struct cdev		*i_cdev;
		char			*i_link;
		unsigned		i_dir_seq;
	};

	__u32			i_generation;

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct fsnotify_mark_connector __rcu	*i_fsnotify_marks;
#endif

#ifdef CONFIG_FS_ENCRYPTION
	struct fscrypt_info	*i_crypt_info;
#endif

#ifdef CONFIG_FS_VERITY
	struct fsverity_info	*i_verity_info;
#endif

    // Private data; it can hold any type. For example, KVM's debugfs stores a
    // struct kvm here, while gmem_fs stores its flags here.
	void			*i_private; /* fs or device private pointer */
} __randomize_layout;

Mprotect

mprotect is a syscall; libc also provides a corresponding wrapper function.

mprotect() changes the access protections for the calling process's memory pages containing any part of the address range in the interval [addr, addr+len-1].

If the calling process tries to access memory in a manner that violates the protections, then the kernel generates a SIGSEGV signal for the process.

So why would a process set protections on its own memory? What is the use case?

Security, debugging… for details, see: www.quora.com

See also the example in: mprotect(2) - Linux manual page
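As a minimal sketch of the behavior described above (not the man page's example): map one anonymous page, make it read-only with mprotect(), then attempt a write and catch the resulting SIGSEGV. The function name write_to_readonly_page_faults() and the handler on_segv() are made up for illustration; this assumes Linux, where the violation is reported as SIGSEGV.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

/* SIGSEGV handler: jump back to the point saved by sigsetjmp(). */
static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/*
 * Returns 1 if writing to a PROT_READ page raised SIGSEGV,
 * 0 if the write unexpectedly succeeded, -1 on a setup error.
 */
int write_to_readonly_page_faults(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa, old;
    int faulted;

    if (p == MAP_FAILED)
        return -1;

    p[0] = 'a';                       /* allowed: the page is still writable */

    /* Drop write permission on the whole page. */
    if (mprotect(p, page, PROT_READ) == -1) {
        munmap(p, page);
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) == -1) {
        munmap(p, page);
        return -1;
    }

    if (sigsetjmp(fault_jmp, 1) == 0) {
        *(volatile char *)p = 'b';    /* violates PROT_READ */
        faulted = 0;
    } else {
        faulted = 1;                  /* we got here via the signal handler */
    }

    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    munmap(p, page);
    return faulted;
}
```

A guard page around a buffer (debugging) or a write-protected JIT region (security) follows the same pattern: change the protection, and let the kernel report any access that violates it.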

mmu_invalidate_retry_hva() KVM

Called during page fault handling to check whether the fault being handled is stale, i.e. whether it raced with an invalidation.

static inline int mmu_invalidate_retry_hva(struct kvm *kvm, unsigned long mmu_seq, unsigned long hva)
{
    //...
    // The page fault happened while an invalidation was in progress,
    // and the faulting hva falls inside the range being invalidated,
    // so return 1 to tell the caller to abandon this page fault handling and retry.
	if (unlikely(kvm->mmu_invalidate_in_progress) &&
	    hva >= kvm->mmu_invalidate_range_start &&
	    hva < kvm->mmu_invalidate_range_end)
		return 1;
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return 1;
	return 0;
}
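To make the two checks concrete, here is a small user-space model, written as a sketch rather than kernel code. struct kvm_model, retry_hva(), invalidate_begin(), and invalidate_end() are hypothetical names; the begin/end helpers only approximate what the real invalidation path does (bump the in-progress count and record the range, then bump the sequence number when finished), and locking and memory barriers are left out.

```c
/* Hypothetical stand-in for the few struct kvm fields the check reads. */
struct kvm_model {
    unsigned long mmu_invalidate_seq;
    long mmu_invalidate_in_progress;
    unsigned long mmu_invalidate_range_start;
    unsigned long mmu_invalidate_range_end;
};

/* Same decision logic as the function above, minus locking and barriers. */
int retry_hva(const struct kvm_model *kvm, unsigned long mmu_seq,
              unsigned long hva)
{
    /* An invalidation is running and covers the faulting address. */
    if (kvm->mmu_invalidate_in_progress &&
        hva >= kvm->mmu_invalidate_range_start &&
        hva < kvm->mmu_invalidate_range_end)
        return 1;
    /* An invalidation completed since the fault handler sampled the seq. */
    if (kvm->mmu_invalidate_seq != mmu_seq)
        return 1;
    return 0;
}

/* Simplified model of an invalidation starting on [start, end). */
void invalidate_begin(struct kvm_model *kvm, unsigned long start,
                      unsigned long end)
{
    kvm->mmu_invalidate_in_progress++;
    kvm->mmu_invalidate_range_start = start;
    kvm->mmu_invalidate_range_end = end;
}

/* Simplified model of the invalidation finishing. */
void invalidate_end(struct kvm_model *kvm)
{
    kvm->mmu_invalidate_seq++;
    kvm->mmu_invalidate_in_progress--;
}
```

In this model, a fault handler that sampled the sequence number before invalidate_begin() sees a retry for addresses inside the range while the invalidation runs, and for every address once invalidate_end() has bumped the sequence number.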

Buffer cache

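The buffer cache is part of the block device driver layer. The page cache caches data by a file's logical pages, while the buffer cache caches by the file's physical blocks. The two are not mutually exclusive; they are merged and can coexist: a page of a file can be in the page cache and in the buffer cache at the same time, with only one copy in physical memory. The filesystem interface sits between the page cache and the buffer cache. It translates between the page cache's logical pages and the buffer cache's physical blocks, then hands the request to the unified block device I/O layer for scheduling. The relationship between a file's logical blocks and its physical blocks therefore shows up as the relationship between the page cache and the buffer cache.

A file's logical layout has to be mapped onto the actual physical disk; the filesystem performs this mapping.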

Reference: "linux内核中的address_space 结构解析" (an analysis of the address_space structure in the Linux kernel), CSDN blog

Swap cache / Linux

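Swap holds what does not fit in physical memory, so what is the swap cache for?

When swapping pages out to disk, Linux avoids writing a page unless it has to. A page that has been swapped out must be brought back into memory when a process accesses it again. As long as the page has not been written to while in memory, the copy on disk remains valid. Linux uses the swap cache to track these unmodified pages.

In this description, the swap cache is a list of PTEs, one per physical page in the system. A non-zero entry means the copy of that page in the swap file is unchanged. If the page is modified, its entry is removed from the swap cache.

The benefit of the swap cache is that it avoids unnecessary swap-out writes. When the kernel is about to swap a page out to disk, it first checks the swap cache. A non-zero entry means the page has not been modified and the copy on disk is still a valid backup (for example, the page was just swapped in, read but not written, and is now being swapped out again), so there is no need to write it back, which saves I/O.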
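The decision described above can be sketched as a toy model. This is not kernel code: the array swap_cache, the constant NR_PAGES, and the functions swap_in(), page_dirtied(), swap_out(), and swap_writes() are all hypothetical, and the "disk" is just a counter.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8                    /* hypothetical number of physical pages */

/* One slot per physical page; non-zero = valid, unmodified copy in swap. */
static unsigned long swap_cache[NR_PAGES];
static unsigned long disk_writes;     /* counts simulated writes to swap */

/* The page was read back from swap location 'entry' (must be non-zero). */
void swap_in(size_t page, unsigned long entry)
{
    swap_cache[page] = entry;
}

/* The page was written while in memory: the on-disk copy is now stale. */
void page_dirtied(size_t page)
{
    swap_cache[page] = 0;
}

/* Swap a page out; returns true only if a disk write was actually needed. */
bool swap_out(size_t page)
{
    if (swap_cache[page] != 0) {
        swap_cache[page] = 0;         /* page leaves memory; disk copy reused */
        return false;
    }
    disk_writes++;                    /* no valid copy on disk: must write */
    return true;
}

/* Number of simulated disk writes so far. */
unsigned long swap_writes(void)
{
    return disk_writes;
}
```

A page that is swapped in, only read, and swapped out again costs no write in this model; dirtying it in between forces one.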

Reference: "linux内核中的address_space 结构解析" (an analysis of the address_space structure in the Linux kernel), CSDN blog

Anonymous fd / anonymous handle / anonymous inode

Although it is called an anonymous fd, what is actually anonymous is the inode. "Anonymous" means it has no path (in kernel terms, no valid dentry).

At least in some contexts, an anonymous inode is an inode without an attached directory entry.

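In the Linux file model, a file handle corresponds to a struct file, which is associated with an inode. The file/dentry/inode trio must always be complete: even for an anonymous file (no path, no valid dentry), the struct file must still be bound to an inode and a dentry, even if they are fake or incomplete. This is why anon_inodefs exists. The kernel provides one shared inode, which spares every kernel module with this need from creating its own, avoids wasted memory, and removes redundant inode initialization code.

Anonymous fds are backed by a kernel filesystem called anon_inodefs (in fs/anon_inodes.c). It is extremely simple: the entire filesystem has a single inode, created when the filesystem is initialized. After that, every handle that needs an anonymous inode is simply associated with this one inode.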
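One way to observe the shared inode from user space is to create two anonymous fds and compare what fstat() reports for them. The sketch below uses eventfd() as the anonymous fd source, on the assumption that it is backed by the shared anon_inodefs inode on the running kernel; anon_fds_share_inode() is a made-up helper name. On such kernels the function is expected to return 1, but the result depends on the kernel, so treat it as an experiment rather than a guarantee.

```c
#include <sys/eventfd.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if two eventfd descriptors report the same (st_dev, st_ino),
 * 0 if they differ, -1 if creating or stat-ing them failed.
 */
int anon_fds_share_inode(void)
{
    int a = eventfd(0, 0);
    int b = eventfd(0, 0);
    struct stat sa, sb;
    int ret = -1;

    if (a >= 0 && b >= 0 && fstat(a, &sa) == 0 && fstat(b, &sb) == 0)
        ret = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return ret;
}
```

Another quick check, where /proc is available, is `ls -l /proc/<pid>/fd`, which shows such descriptors as links of the form anon_inode:[eventfd] instead of a filesystem path.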

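Question 1: why must a file be bound to an inode? Couldn't the pointer simply be left NULL?

My guess is that this keeps the file interface uniform, so that file remains a general concept.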

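Reference: "Linux fd 系列| “匿名句柄” 是一切皆文件背后功臣" (Linux fd series: the "anonymous handle" is the unsung hero behind everything-is-a-file), 知乎 (Zhihu)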

alloc_anon_inode() Kernel

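This function takes a superblock as its argument and creates an anonymous inode. It creates a new in-memory inode instance that is not fully functional and serves only as an anonymous placeholder. This kind of anonymous inode is not the shared one from anon_inodefs; it is an anonymous inode on a specific filesystem instance.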

HOST PAGE SIZE / TARGET PAGE SIZE / qemu_host_page_size / QEMU

qemu_host_page_size is always >= TARGET_PAGE_SIZE.

block->page_size, qemu_host_page_size, and TARGET_PAGE_SIZE are three different things.

As the code below shows, qemu_host_page_size is the larger of the host's real page size and TARGET_PAGE_SIZE (the maximum, not the minimum).

qemu_create_machine
    page_size_init
void page_size_init(void)
{
    // After this we can ensure that qemu_host_page_size >= TARGET_PAGE_SIZE
    if (qemu_host_page_size == 0) {
        // getpagesize(), a glibc function, returns the number of bytes in a
        // memory page.
        qemu_host_page_size = qemu_real_host_page_size();
    }
    if (qemu_host_page_size < TARGET_PAGE_SIZE) {
        qemu_host_page_size = TARGET_PAGE_SIZE;
    }
    qemu_host_page_mask = -(intptr_t)qemu_host_page_size;
}
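The clamping and mask computation in page_size_init() can be tried in isolation. The helpers below are a sketch with hypothetical names (effective_host_page_size(), page_mask(), page_align_down()); they are not QEMU functions, and the mask trick assumes the page size is a power of two.

```c
#include <stdint.h>

/* Mirrors the clamp in page_size_init(): take the larger of the two sizes. */
uintptr_t effective_host_page_size(uintptr_t real_host, uintptr_t target)
{
    return real_host < target ? target : real_host;
}

/*
 * Same expression as qemu_host_page_mask: negating a power-of-two size
 * yields a mask with all bits set above the page-offset bits.
 */
uintptr_t page_mask(uintptr_t page_size)
{
    return (uintptr_t)(-(intptr_t)page_size);
}

/* Round an address down to the start of its page. */
uintptr_t page_align_down(uintptr_t addr, uintptr_t page_size)
{
    return addr & page_mask(page_size);
}
```

For example, a 4 KiB host with a 16 KiB target yields an effective size of 16 KiB, while a 64 KiB host with a 4 KiB target keeps 64 KiB.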

Why HPS >= TPS?

QEMU debug thread

Use gdb to debug a QEMU segfault

// configure QEMU like this
./configure --enable-debug --disable-pie
gdb <program>

// Once inside gdb, load the debug info
>> file /home/lei/p/virtualization.hypervisors.server.vmm.qemu-next/build/qemu-system-x86_64
>> start
>> continue

No symbol table loaded. Use the "file" command.

Memfd / Sealed files

Sealed files [LWN.net]

The traditional shared-memory IPC approach has a problem: the processes must trust each other, otherwise TOCTTOU issues can arise. In the real world, where this kind of trust is often not present, careful users of shared memory must copy data before using it and be prepared for signals; that kind of programming is cumbersome, error-prone, and slow.

Sealed files are meant to solve this problem: The sealing concept allows one party in a shared-memory transaction to work with a memory segment in the knowledge that it cannot be changed by another process at an inopportune time.

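Usage:

  1. When using shmfs for IPC, the usual approach is creating a file on an shmfs filesystem and mapping it into memory (this approach also yields a file descriptor). The process can seal it with a call to fcntl() using the new SHMEM_SET_SEALS command (the name used in the LWN article; the merged interface uses F_ADD_SEALS).
  2. When not using shmfs, a new syscall was introduced: memfd_create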

memfd_create() Syscall

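This call will create a file (not visible in the filesystem) that is suitable for sealing; it will be associated with the given name and be size bytes in length. It will return a fd associated with the newly created file. Note that this describes the original proposal: the merged syscall takes no size argument, so the file starts empty and is sized afterwards with ftruncate().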

The returned file descriptor can be passed to mmap(), of course.

Here is how the official documentation describes it:

However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released.

Some examples:

// The name is arbitrary; anything will do
int memfd_create(const char *name, unsigned int flags);
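A minimal sketch of the whole flow, assuming Linux with a glibc that exposes memfd_create() (2.27 or later): create the memfd with MFD_ALLOW_SEALING, size it with ftruncate(), write some data, add seals with F_ADD_SEALS, and confirm that a later write is rejected. The function name memfd_seal_demo(), the name "demo", and the 4096-byte size are arbitrary choices for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the memfd was created, sized, sealed, and a subsequent
 * write() was rejected with EPERM; returns -1 otherwise.
 */
int memfd_seal_demo(void)
{
    int ret = -1;
    /* Sealing must be requested at creation time. */
    int fd = memfd_create("demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd == -1)
        return -1;

    /* The file starts empty; give it a size before using it. */
    if (ftruncate(fd, 4096) == -1)
        goto out;

    /* Writing is still allowed before the seals are added. */
    if (write(fd, "hello", 5) != 5)
        goto out;

    /* Forbid resizing and any further modification of the contents. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) == -1)
        goto out;

    /* With F_SEAL_WRITE in place, write() should fail with EPERM. */
    if (write(fd, "x", 1) == -1 && errno == EPERM)
        ret = 0;

out:
    close(fd);
    return ret;
}
```

A receiver of such an fd can call fcntl(fd, F_GET_SEALS) to verify the seals before mapping it, which is what removes the need to trust the sender.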

kvm_queue_exception() And variants

kvm_queue_exception() KVM

kvm_queue_exception_p() KVM