struct page / struct mem_section Kernel

Every physical page needs a struct page to describe it, so to keep the cost down the structure makes heavy use of C unions to reduce its size.

Memory is divided into many mem_sections, each covering 128MB (the x86-64 value). Every mem_section holds an array containing all of its struct pages.

struct mem_section {
	/*
	 * This is, logically, a pointer to an array of struct
	 * pages...
	 */
	unsigned long section_mem_map;
    //...
};
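The section arithmetic can be sketched in user space as below. This is a minimal model, assuming 4 KiB pages and 128 MiB sections (the x86-64 values); pfn_to_section_offset() is a hypothetical helper added for illustration and is not a kernel function.

```c
#include <assert.h>

/* Assumed x86-64 values: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)

/* 128 MiB / 4 KiB = 32768 struct pages per section. */
static_assert(PAGES_PER_SECTION == 32768UL, "unexpected section size");

/* Section that a page frame number falls into. */
unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* First page frame number covered by a section. */
unsigned long section_nr_to_pfn(unsigned long sec)
{
	return sec << PFN_SECTION_SHIFT;
}

/* Hypothetical helper: index of the page inside its section's array. */
unsigned long pfn_to_section_offset(unsigned long pfn)
{
	return pfn & (PAGES_PER_SECTION - 1);
}
```

For example, PFN 32769 lands in section 1 at index 1 of that section's struct page array.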

page_to_pfn() / pfn_to_page() Kernel

#define vmemmap ((struct page *)VMEMMAP_START)
#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)

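Every physical page is described by a struct page in the kernel. These struct pages are stored in the vmemmap array in the order of the physical addresses of the pages they describe. So the offset of a page's struct page within the vmemmap array is determined by which physical page it is, i.e. by its PFN.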
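The pointer arithmetic behind these macros can be tried out in user space. The sketch below is only a model: struct page is a one-field stand-in, and fake_vmemmap and NR_PAGES are hypothetical names playing the role of the region at VMEMMAP_START.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real one is much larger. */
struct page {
	unsigned long flags;
};

#define NR_PAGES 1024UL

/* Hypothetical backing array playing the role of the vmemmap region. */
static struct page fake_vmemmap[NR_PAGES];

#define vmemmap (fake_vmemmap)

/* PFN -> descriptor: index into the array. */
struct page *pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_PAGES);
	return vmemmap + pfn;
}

/* Descriptor -> PFN: pointer difference, counted in struct page units. */
unsigned long page_to_pfn(const struct page *page)
{
	ptrdiff_t idx = page - vmemmap;

	assert(idx >= 0 && (unsigned long)idx < NR_PAGES);
	return (unsigned long)idx;
}
```

Because the subtraction is done on struct page pointers, the result is already an element count, so no division by sizeof(struct page) is needed.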

NX huge pages

Non-executable huge pages.

iTLB multihit is an erratum where some processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack.

In other words, if a huge page is executable, an instruction fetch may hit multiple iTLB entries, which gives the guest room to exploit the erratum.

In order to mitigate the vulnerability, KVM initially marks all huge pages as non-executable. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable.

iTLB multihit — The Linux Kernel documentation

Page pinning / Page locking

A page that has been locked into memory with a call like mlock() is required to always be physically present in the system's RAM

For a pinned page, the page not only has to stay in RAM at all times, its physical address also never changes, so accessing it never causes a page fault.

Locking and pinning [LWN.net]

struct kvm_mmu_page_role KVM

union kvm_mmu_page_role {
	u32 word;
	struct {
		// Which page-table level is this entry at?
		unsigned level:4;
		//...
		// Marks this MMU page as invalid.
		// See kvm_tdp_mmu_invalidate_all_roots()
		unsigned invalid:1;
		//...
	};
};
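The point of overlaying the bitfields with a single u32 word is that two roles can be compared with one integer comparison. A cut-down user-space sketch, assuming only the two fields shown above (the real union has many more); make_role() and role_equal() are hypothetical helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cut-down model of the role union: only two of the real fields. */
union mmu_page_role {
	uint32_t word;
	struct {
		unsigned level:4;
		unsigned invalid:1;
	};
};

static_assert(sizeof(union mmu_page_role) == sizeof(uint32_t),
	      "the bitfields must fit in the 32-bit word");

/* Hypothetical helper: build a role with all unused bits cleared. */
union mmu_page_role make_role(unsigned level, bool invalid)
{
	union mmu_page_role role = { .word = 0 };

	assert(level < 16);	/* level is a 4-bit field */
	role.level = level;
	role.invalid = invalid;
	return role;
}

/* Hypothetical helper: all fields are compared at once via the word. */
bool role_equal(union mmu_page_role a, union mmu_page_role b)
{
	return a.word == b.word;
}
```

Zeroing word before setting the fields matters: otherwise the unused bits could differ and two logically identical roles would compare as unequal.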

Page fault handling process in kernel

Declare the handling function for page fault:

#define X86_TRAP_PF		14	/* Page Fault */

DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF,	exc_page_fault);

Define the handling function for page fault:

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
    //...
	handle_page_fault(regs, error_code, address);
    //...
}

handle_page_fault
    // for page fault on kernel address
    do_kern_addr_fault
        
    do_user_addr_fault
        handle_mm_fault
            __handle_mm_fault
                handle_pte_fault
                    do_fault

struct mm_struct / Memory Descriptor (kernel)

aka “Memory Descriptor”.

Represents the process’ address space. Holds the data Linux needs about the memory address space of the process.

struct mm_struct {
        //...
        // Start and end addresses of the code segment, data segment, heap, stack, and so on.
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
};

struct vm_area_struct (kernel)

A virtual memory area (VMA) represents one region of a process's virtual address space.

A VM area is any part of the process virtual memory space that has a special rule for the page-fault handlers (ie a shared library, the executable area etc).

struct vm_area_struct {
    //...
	// Function pointers to deal with this struct.
	// such as the page fault handler.
	const struct vm_operations_struct *vm_ops;
    //...
}

struct vm_operations_struct (kernel)

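The vm_operations_struct referenced from a vm_area_struct describes the set of functions that operate on that virtual memory area.

Each vm_area_struct points to exactly one vm_operations_struct through vm_ops, since the table exists to describe how that VMA is handled; the same table can be shared by many VMAs.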

/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs.
 */
struct vm_operations_struct {
    //...
    // Handler invoked when a page fault occurs
	vm_fault_t (*fault)(struct vm_fault *vmf);
    //...
}
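How a fault gets dispatched through vm_ops can be modeled in user space. The sketch below is not kernel code: the structs are reduced to the fields mentioned above, and model_file_fault(), model_file_vm_ops, model_do_fault() and the MODEL_FAULT_* values are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int vm_fault_t;

/* Hypothetical result codes for this model only. */
#define MODEL_FAULT_HANDLED 0u
#define MODEL_FAULT_SIGBUS  1u

struct vm_fault {
	unsigned long address;
};

struct vm_operations_struct {
	vm_fault_t (*fault)(struct vm_fault *vmf);
};

struct vm_area_struct {
	unsigned long vm_start, vm_end;	/* covers [vm_start, vm_end) */
	const struct vm_operations_struct *vm_ops;
};

/* Hypothetical handler standing in for e.g. a file-backed mapping. */
vm_fault_t model_file_fault(struct vm_fault *vmf)
{
	(void)vmf;
	return MODEL_FAULT_HANDLED;
}

/* One table of operations, usable by any number of VMAs. */
const struct vm_operations_struct model_file_vm_ops = {
	.fault = model_file_fault,
};

/* Dispatch a fault at 'address' to the handler of the VMA covering it. */
vm_fault_t model_do_fault(struct vm_area_struct *vma, unsigned long address)
{
	struct vm_fault vmf = { .address = address };

	assert(address >= vma->vm_start && address < vma->vm_end);
	if (!vma->vm_ops || !vma->vm_ops->fault)
		return MODEL_FAULT_SIGBUS;	/* nobody can service it */
	return vma->vm_ops->fault(&vmf);
}
```

The generic fault path only needs the function pointer; what the handler actually does is decided by whoever created the mapping.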

COCO

Confidential Computing (coco) hardware such as AMD SEV and Intel TDX.

PDPTR, PML4, PML4E, PDPT, PDPTE, PD, PDE, PT, PTE

In 64-bit:

  • CR3 points to PML4 table, each entry is PML4E
  • PML4E points to PDPT (page-directory-pointer table), each entry is PDPTE
  • PDPTE points to PD (page directory), each entry is PDE
  • PDE points to PT (page table), each entry is PTE
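The walk above consumes the linear address nine bits at a time. A small sketch of the index extraction, assuming 4-level paging with 4 KiB pages (so a 48-bit linear address); struct pt_indices and split_linear_address() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Four 9-bit table indices plus a 12-bit page offset cover 48 bits. */
static_assert(4 * 9 + 12 == 48, "4-level paging translates 48-bit addresses");

struct pt_indices {
	unsigned pml4;		/* bits 47:39, selects a PML4E */
	unsigned pdpt;		/* bits 38:30, selects a PDPTE */
	unsigned pd;		/* bits 29:21, selects a PDE   */
	unsigned pt;		/* bits 20:12, selects a PTE   */
	unsigned offset;	/* bits 11:0, byte within the 4 KiB page */
};

/* Split a linear address into the index used at each paging level. */
struct pt_indices split_linear_address(uint64_t la)
{
	struct pt_indices idx = {
		.pml4   = (unsigned)((la >> 39) & 0x1ff),
		.pdpt   = (unsigned)((la >> 30) & 0x1ff),
		.pd     = (unsigned)((la >> 21) & 0x1ff),
		.pt     = (unsigned)((la >> 12) & 0x1ff),
		.offset = (unsigned)(la & 0xfff),
	};

	return idx;
}
```

Each table therefore has 512 entries, and with 8-byte entries every paging structure occupies exactly one 4 KiB page.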

Regarding PDPTR (Page Directory Pointer Table Register):

The register CR3, which now points to the PDP, also got the alternate name PDPTR: “page directory pointer table register”

Internal, non-architectural PDPTE registers.

The PDPT comprises four (4) 64-bit entries called PDPTEs. Each PDPTE controls access to a 1-GByte region of the linear-address space. Corresponding to the PDPTEs, the logical processor maintains a set of four (4) internal, non-architectural PDPTE registers, called PDPTE0, PDPTE1, PDPTE2, and PDPTE3. The logical processor loads these registers from the PDPTEs in memory as part of certain operations

With PAE paging, a logical processor maintains a set of four (4) PDPTE registers, which are loaded from an address in CR3. Linear addresses are translated using 4 hierarchies of in-memory paging structures, each located using one of the PDPTE registers. (This is different from the other paging modes, in which there is one hierarchy referenced by CR3.)

The funny page table terminology on AMD64 – pagetable.com

#UD

Generally means that the instruction being executed does not exist.

Invalid Opcode Exception.

#GP

General Protection.

Indicates that the processor detected one of a class of protection violations called “general-protection violations.”

The conditions that cause this exception to be generated comprise all the protection violations that do not cause other exceptions to be generated (such as, invalid-TSS, segment-not-present, stack-fault, or page-fault exceptions).

Cache line

On most architectures, the size of a cache line is 64 bytes (NOT 4K, remember), meaning that all memory is divided in blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also fetching all its 63 cache line neighbors whether you want them or not.

In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back.
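Which bytes travel together follows directly from the address bits. A minimal sketch, assuming 64-byte lines; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u	/* assumed; query the hardware in real code */

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
	      "the mask tricks below need a power of two");

/* Address of the first byte of the line containing addr. */
uintptr_t cache_line_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Position of addr inside its line (0..63). */
unsigned cache_line_offset(uintptr_t addr)
{
	return (unsigned)(addr & (CACHE_LINE_SIZE - 1));
}

/* Two addresses are fetched together iff they share a line base. */
int same_cache_line(uintptr_t a, uintptr_t b)
{
	return cache_line_base(a) == cache_line_base(b);
}
```

So 0x1200 and 0x123f are neighbors in one line, while 0x123f and 0x1240 are in different lines even though they are adjacent bytes.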

VMCALL / VMFUNC / Hypercall

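VMCALL and VMFUNC are both x86 instructions. What exactly is the difference between them?

VMFUNC is similar to a syscall, while VMCALL takes much longer. VMFUNC does not cause a VM exit, so it has a significant performance advantage over VMCALL.

VMFUNC invokes a function provided by the processor hardware, not a function programmed in hypervisor software; the two must not be confused. EPTP switching, for example, is a capability provided by Intel hardware, and VMFUNC can invoke it without a root/non-root transition. The hypervisor does have to enable this capability beforehand.

I have not found anything stating that VMFUNC is a privileged instruction, so for now assume it can be executed from guest userspace.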

VMCALL is not a privileged instruction, which means it can be called from guest's userspace.

深入理解intel vmfunc指令 - L

Intel TSX (TRANSACTIONAL SYNCHRONIZATION EXTENSIONS)

SDM Chapter 16 PROGRAMMING WITH INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS

HLE: Legacy

RTM: New

Peer certificate cannot be authenticated with given CA certificates for

Add sslverify = 0 to the corresponding .repo file.

Atomic context / Process context / preempt_disable()

Kernel code generally runs in one of two fundamental contexts:

  • Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary.
  • But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.

Atomic context can be entered by:

  • Code which handles hardware and software interrupts is one obvious example of atomic context.
  • Any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.
  • preempt_disable() (This function doesn't disable IRQ).

Sleeping is not allowed after preempt_disable(). (The main job of this function is to disable kernel preemption; it cannot stop interrupts from preempting the running code.)

I think the reason lies in why you are about to use preemption disabling in the first place. When you use preemption disabling, you managed to define a critical region within which your data structure is protected from being corrupted by another process. But if you put a sleep in that critical region, I think you actually split your original critical region into two regions on purpose, so you should encompass each region by one pair of preempt_disable/enable respectively.

Why sleeping not allowed after preempt_disable

Effective Detection of Sleep-in-atomic-context Bugs in the Linux Kernel | ACM Transactions on Computer Systems

railway.app Alternatives

Koyeb (Currently used): https://app.koyeb.com/

Render: https://dashboard.render.com/

Download a patchset from lore

message_id is the id of the whole thread. b4 automatically downloads the latest version of the patchset, which is quite smart.

b4 shazam <message_id>

kvm_arch

struct kvm_arch {
    // Whether the master clock is in use
    bool use_master_clock;

    // Host's CLOCK_BOOTTIME
    u64 master_kernel_ns;
    // Used to calculate guest's CLOCK_BOOTTIME
    // guest's CLOCK_BOOTTIME = master_kernel_ns + kvmclock_offset
	s64 kvmclock_offset;

    // Host's TSC value doing the update,
    // used to calculate elapsed time
	u64 master_cycle_now;

    // 
	struct kvm_apic_map __rcu *apic_map;

	/*
	 * List of struct kvm_mmu_pages being used as roots.
	 * All struct kvm_mmu_pages in the list should have
	 * tdp_mmu_page set.
	 *
	 * Roots will remain in the list until their tdp_mmu_root_count
	 * drops to zero, at which point the thread that decremented the
	 * count to zero should removed the root from the list and clean
	 * it up, freeing the root after an RCU grace period.
	 */
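    // In a non-virtualized environment, every process needs its own page table.
    // With TDP we only need to translate GPA to HPA, so a single shadow page
    // table for the whole VM should be enough; why are there several here?
    // In testing, 2 roots are always created no matter how many vCPUs there
    // are. Why is that?
    //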
	struct list_head tdp_mmu_roots;

    // Hash cache with 4096 entries; the key is the gfn, the value is a list of shadow pages
    // See mmu_page_hash for details
	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
};

kvmclock_offset

Guest CLOCK_BOOTTIME = Host CLOCK_BOOTTIME + kvmclock_offset:

vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;

It follows that the initial value is the negative of the host's CLOCK_BOOTTIME at the time the VM is created. That is:

kvm_arch_init_vm()
    //...
	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();

There is one more place that changes the value of kvmclock_offset:

kvm_vm_ioctl_set_clock
	if (ka->use_master_clock)
		now_raw_ns = ka->master_kernel_ns;
	else
		now_raw_ns = get_kvmclock_base_ns();
    // now_raw_ns: Host's CLOCK_BOOTTIME
    // data.clock: Guest's CLOCK_BOOTTIME (we want to set)
    // because gBT = hBT + offset
    // so      offset = gBT - hBT
    ka->kvmclock_offset = data.clock - now_raw_ns;
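The offset arithmetic can be checked with a small user-space model. This is only a sketch of the relation gBT = hBT + offset; struct model_vm and the model_* functions are hypothetical and the host clock is passed in as a plain number instead of being read from get_kvmclock_base_ns().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the one field we care about. */
struct model_vm {
	int64_t kvmclock_offset;
};

/* VM creation: offset is minus the host clock at that moment. */
void model_init_vm(struct model_vm *vm, uint64_t host_boot_ns)
{
	assert(host_boot_ns <= (uint64_t)INT64_MAX);
	vm->kvmclock_offset = -(int64_t)host_boot_ns;
}

/* Guest clock = host clock + offset. */
uint64_t model_guest_clock(const struct model_vm *vm, uint64_t host_boot_ns)
{
	return host_boot_ns + (uint64_t)vm->kvmclock_offset;
}

/* Setting the guest clock: offset = wanted guest value - host clock. */
void model_set_clock(struct model_vm *vm, uint64_t guest_ns,
		     uint64_t host_boot_ns)
{
	vm->kvmclock_offset = (int64_t)(guest_ns - host_boot_ns);
}
```

With the initial offset the guest clock reads zero at VM creation and then advances in step with the host clock; after a set-clock it continues from the value that was written.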

use_master_clock / kvm_guest_has_master_clock

use_master_clock is assigned here:

pvclock_update_vm_gtod_copy() {
    //...
    // True when the host uses the TSC clocksource and all vCPUs' TSCs are matched
    ka->use_master_clock = host_tsc_clocksource && vcpus_matched
            && !ka->backwards_tsc_observed
            && !ka->boot_vcpu_runs_old_kvmclock;
    //...
}

master_cycle_now

Only updated in:

pvclock_update_vm_gtod_copy
    kvm_get_time_and_clockread
        do_monotonic_raw(master_kernel_ns, master_cycle_now)
            
            

User return MSR (uret MSR)

User return MSRs are always emulated when enabled in the guest, but only loaded into hardware when necessary, e.g. SYSCALL #UDs outside of 64-bit mode or if EFER.SCE=1, thus the SYSCALL MSRs don't need to be loaded into hardware if those conditions aren't met.


// A global variable
// only stores the MSR's indexes, values are in other place
static u32 __read_mostly kvm_uret_msrs_list[KVM_MAX_NR_USER_RETURN_MSRS];

// Each CPU has a struct of this
// which contains all the MSRs in this CPU
struct kvm_user_return_msrs {
	struct user_return_notifier urn;
    // if the above user_return_notifier is
    // registered or not
	bool registered;

    // stores values, indexed with the same index
    // used to index kvm_uret_msrs_list
	struct kvm_user_return_msr_values {
		u64 host; // Host's value of the MSR, restored on return to userspace
		u64 curr; // MSR value currently on the hardware
	} values[KVM_MAX_NR_USER_RETURN_MSRS];
};





hardware_setup
    vmx_setup_user_return_msrs
        kvm_add_user_return_msr
            

struct vmx_uret_msr

struct vmx_uret_msr {
    // When vm-entry, if this msr should be loaded into hardware
    // or not.
	bool load_into_hardware;
	u64 data;
	u64 mask;
};


struct vcpu_vmx {
    //...
    // guest's value on this MSR
	struct vmx_uret_msr   guest_uret_msrs[MAX_NR_USER_RETURN_MSRS];
    //...
}

vmx_set_guest_uret_msr

This function sets the guest's value of the MSR.

static int vmx_set_guest_uret_msr(struct vcpu_vmx *vmx,
				  struct vmx_uret_msr *msr, u64 data)
{
	unsigned int slot = msr - vmx->guest_uret_msrs;
	int ret = 0;

	if (msr->load_into_hardware) {
		preempt_disable();
		ret = kvm_set_user_return_msr(slot, data, msr->mask);
		preempt_enable();
	}
    //...
    // will update msr data in vcpu struct
	msr->data = data;
    //...
}

vmx_setup_uret_msr

Just set the MSR's load_into_hardware to the specified value.

static void vmx_setup_uret_msr(struct vcpu_vmx *vmx, unsigned int msr,
			       bool load_into_hardware)
{
	struct vmx_uret_msr *uret_msr;

	uret_msr = vmx_find_uret_msr(vmx, msr);
    //...
	uret_msr->load_into_hardware = load_into_hardware;
}

kvm_set_user_return_msr

This function is called just before VM-entry to load the guest's value back into hardware.

It writes "value" to the MSR in "slot", taking the host value and "mask" into account.

int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
	unsigned int cpu = smp_processor_id();
	struct kvm_user_return_msrs *msrs = per_cpu_ptr(user_return_msrs, cpu);
	int err;

    // use masked part of "value" and unmasked part from "host"
	value = (value & mask) | (msrs->values[slot].host & ~mask);
	if (value == msrs->values[slot].curr)
		return 0;
	err = wrmsrl_safe(kvm_uret_msrs_list[slot], value);
	if (err)
		return 1;

	msrs->values[slot].curr = value;
	if (!msrs->registered) {
		msrs->urn.on_user_return = kvm_on_user_return;
		user_return_notifier_register(&msrs->urn);
		msrs->registered = true;
	}
	return 0;
}
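The masking step is worth isolating: only the bits selected by mask come from the guest, every other bit keeps the host's value. A minimal sketch, with merge_msr_value() as a hypothetical name for that one expression:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine a guest-requested MSR value with the host's value:
 * bits set in 'mask' are taken from 'guest', the rest from 'host'.
 */
uint64_t merge_msr_value(uint64_t guest, uint64_t host, uint64_t mask)
{
	uint64_t merged = (guest & mask) | (host & ~mask);

	/* Bits outside the mask must still equal the host's bits. */
	assert(((merged ^ host) & ~mask) == 0);
	return merged;
}
```

A mask of all ones hands the whole register to the guest, and a mask of zero leaves the host value untouched.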

kvm_on_user_return()

urn->on_user_return() is called when a syscall returns to user mode.

kvm_on_user_return() is hooked onto this callback when the notifier is registered.

// Let's take syscall as an example; another example is irq
do_syscall_64
    // syscall is done and we need to return to usermode
    syscall_exit_to_user_mode
        syscall_exit_to_user_mode_work
            __syscall_exit_to_user_mode_work // irqentry_exit_to_user_mode
                exit_to_user_mode_prepare
                    arch_exit_to_user_mode_prepare
                        fire_user_return_notifiers
                            hlist_for_each_entry_safe(urn, tmp2, head, link)
                        		urn->on_user_return(urn); // finally call to here

static void kvm_on_user_return(struct user_return_notifier *urn)
{
	unsigned slot;
	struct kvm_user_return_msrs *msrs
		= container_of(urn, struct kvm_user_return_msrs, urn);
	struct kvm_user_return_msr_values *values;
	unsigned long flags;

    //...
    // unregister, i.e., this function should only be triggered once
	if (msrs->registered) {
		msrs->registered = false;
		user_return_notifier_unregister(urn);
	}
    //...
    // write host's value back, which means
    // when userspace is running, the MSR's value should 
    // start with "host"
	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
		values = &msrs->values[slot];
		if (values->host != values->curr) {
			wrmsrl(kvm_uret_msrs_list[slot], values->host);
			values->curr = values->host;
		}
	}
}
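The interplay between setting a guest value and the lazy restore on return to userspace can be modeled without touching real MSRs. The sketch below is a simplification under stated assumptions: hardware writes are replaced by a counter, there is no mask handling, and all model_* names and MODEL_NR_MSRS are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_MSRS 2

struct model_uret_msrs {
	bool registered;	/* is the return-to-user callback armed? */
	struct {
		uint64_t host;	/* value the host wants */
		uint64_t curr;	/* value currently "in hardware" */
	} values[MODEL_NR_MSRS];
	unsigned hw_writes;	/* counts simulated WRMSR executions */
};

/* Mirrors the set path: skip the write if hardware already matches. */
void model_set_msr(struct model_uret_msrs *m, unsigned slot, uint64_t value)
{
	assert(slot < MODEL_NR_MSRS);
	if (value == m->values[slot].curr)
		return;
	m->values[slot].curr = value;
	m->hw_writes++;
	m->registered = true;	/* arm the restore callback */
}

/* Mirrors the restore path: disarm, then put host values back. */
void model_on_user_return(struct model_uret_msrs *m)
{
	m->registered = false;
	for (unsigned slot = 0; slot < MODEL_NR_MSRS; slot++) {
		if (m->values[slot].host != m->values[slot].curr) {
			m->values[slot].curr = m->values[slot].host;
			m->hw_writes++;
		}
	}
}
```

The curr field is what makes both directions cheap: repeated sets of the same value and restores of untouched MSRs cost no simulated hardware write at all.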

Destroy VM / Ctrl-C in QEMU / Teardown / Shutdown

memory_region_finalize // memory_region_info.instance_finalize
    memory_region_destructor_ram
        qemu_ram_free
            reclaim_ramblock
// The main thread receives SIGINT and sends SIG_IPI to each vcpu thread
main
    qemu_init
        qemu_init_displays
            os_setup_signal_handling
                termsig_handler // sigaction(SIGINT,  &act, NULL);
                    qemu_system_killed
                        shutdown_signal = signal;
                        shutdown_pid = pid;
                        shutdown_action = SHUTDOWN_ACTION_POWEROFF;
                        shutdown_requested = SHUTDOWN_CAUSE_HOST_SIGNAL;
    qemu_main
        qemu_default_main
            qemu_main_loop
                // will return true
                main_loop_should_exit
        qemu_cleanup
            vm_shutdown
                do_vm_stop
                    pause_all_vcpus
                        //...
                        cpu->stop = true;
                        // Send SIG_IPI to every vcpu thread
                        pthread_kill(cpu->thread->thread, SIG_IPI)

// vcpu thread
kvm_vcpu_thread_fn
    kvm_init_cpu_signals
        kvm_ipi_signal
            kvm_cpu_kick
                kvm_run->immediate_exit = 1
    cpu_can_run == false

When a process terminates, all of its open files are closed automatically by the kernel. Many programs take advantage of this fact and don't explicitly close open files.

c - is it a good practice to close file descriptors on exit - Stack Overflow

Because the kernel automatically closes a process's open fds when the process exits, the following two functions get called:

  • kvm_vm_release
  • kvm_vcpu_release

static const struct file_operations kvm_vm_fops = {
	.release        = kvm_vm_release,
	.unlocked_ioctl = kvm_vm_ioctl,
	.llseek		= noop_llseek,
	KVM_COMPAT(kvm_vm_compat_ioctl),
};

static const struct file_operations kvm_vcpu_fops = {
	.release        = kvm_vcpu_release,
	.unlocked_ioctl = kvm_vcpu_ioctl,
	.mmap           = kvm_vcpu_mmap,
	.llseek		= noop_llseek,
	KVM_COMPAT(kvm_vcpu_compat_ioctl),
};

kvm_vm_release() KVM

kvm_vm_release // kvm_vcpu_release is the same
    kvm_put_kvm
        kvm_destroy_vm
            kvm_arch_destroy_vm
                // This is for VMX, for TDX, do nothing
                static_call_cond(kvm_x86_vm_destroy)(kvm);
                    vmx_vm_destroy
                    //...
                kvm_destroy_vcpus
                    kvm_vcpu_destroy
                        kvm_arch_vcpu_destroy
                        	static_call(kvm_x86_vcpu_free)(vcpu);
                                tdx_vcpu_free // vmx_vcpu_free
                // This is for TDX, for VMX, do nothing
                static_call_cond(kvm_x86_vm_free)(kvm);
                    tdx_vm_free