2023-06 Monthly Archive
struct page / struct mem_section
Kernel
Every physical page needs a struct page to describe it, so to keep the overhead down the structure makes heavy use of C unions to minimize its size.
Memory is divided into many mem_sections. Each mem_section covers 128MB and holds an array of all the struct pages for that section.
struct mem_section {
/*
* This is, logically, a pointer to an array of struct
* pages...
*/
unsigned long section_mem_map;
//...
};
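section_mem_map is not a plain pointer: its low bits are used as flag bits, so they have to be masked off before the value can be treated as a pointer. A sketch of how the kernel recovers the struct page array, modeled on __section_mem_map_addr() in include/linux/mmzone.h:
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
    unsigned long map = section->section_mem_map;
    /* The low bits carry flags; mask them off before using it as a pointer. */
    map &= SECTION_MAP_MASK;
    return (struct page *)map;
}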
page_to_pfn() / pfn_to_page()
Kernel
#define vmemmap ((struct page *)VMEMMAP_START)
#define __pfn_to_page(pfn) (vmemmap + (pfn))
#define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
Every physical page is described by a struct page in the kernel. These struct pages are stored in the vmemmap array in the order of their corresponding physical page addresses, so the offset of a page's struct page within vmemmap is determined by which physical frame it is.
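In other words, a PFN is simply an index into the vmemmap array, and the conversion is plain pointer arithmetic on struct page. A tiny sketch of the round trip (the PFN value is only illustrative):
struct page *p = __pfn_to_page(0x1234);   /* == vmemmap + 0x1234 */
unsigned long pfn = __page_to_pfn(p);     /* back to 0x1234 */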
NX huge pages
Non-executable huge pages.
iTLB multihit is an erratum where some processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack.
In other words, if a huge page is executable, an instruction fetch may hit multiple TLB entries, which gives a malicious guest room to exploit this erratum.
In order to mitigate the vulnerability, KVM initially marks all huge pages as non-executable. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable.
iTLB multihit — The Linux Kernel documentation
Page pinning / Page locking
A page that has been locked into memory with a call like mlock() is required to always be physically present in the system's RAM.
For a pinned page the guarantee is stronger: not only does the page stay resident in RAM, its physical address also never changes, so accessing it never triggers a page fault.
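A minimal user-space sketch of locking a buffer with mlock()/munlock() (error handling trimmed):
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    void *buf = aligned_alloc(4096, len);

    /* After mlock(), the pages backing buf stay resident in RAM. */
    if (buf == NULL || mlock(buf, len) != 0)
        return 1;
    /* ... use buf without it being paged out ... */
    munlock(buf, len);
    free(buf);
    return 0;
}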
struct kvm_mmu_page_role
KVM
union kvm_mmu_page_role {
u32 word;
struct {
// Which page-table level this page is at
unsigned level:4;
//...
// Indicates this MMU page is invalid
// See kvm_tdp_mmu_invalidate_all_roots()
unsigned invalid:1;
//...
};
};
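Because every bitfield aliases the same u32 word, an entire role can be compared (or hashed) with a single integer operation. A minimal sketch of that pattern (the helper name is made up; sp->role does exist in struct kvm_mmu_page):
static bool role_matches(struct kvm_mmu_page *sp, union kvm_mmu_page_role role)
{
    /* One compare covers level, invalid and all the other role bits at once. */
    return sp->role.word == role.word;
}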
Page fault handling process in kernel
Declare the handling function for page fault:
#define X86_TRAP_PF 14 /* Page Fault */
DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_PF, exc_page_fault);
Define the handling function for page fault:
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
//...
handle_page_fault(regs, error_code, address);
//...
}
handle_page_fault
// for page fault on kernel address
do_kern_addr_fault
do_user_addr_fault
handle_mm_fault
__handle_mm_fault
handle_pte_fault
do_fault
struct mm_struct
/ Memory Descriptor (kernel)
aka “Memory Descriptor”.
Represents the process’ address space. Holds the data Linux needs about the memory address space of the process.
struct mm_struct {
//...
// Start/end addresses of the code segment, data segment, heap (brk), and stack.
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
};
struct vm_area_struct (kernel)
A virtual memory area (VMA) represents one region of a process's virtual address space.
A VM area is any part of the process virtual memory space that has a special rule for the page-fault handlers (ie a shared library, the executable area etc).
struct vm_area_struct {
//...
// Function pointers for operating on this VMA,
// such as the page fault handler.
const struct vm_operations_struct *vm_ops;
//...
};
struct vm_operations_struct (kernel)
The vm_operations_struct pointed to by a vm_area_struct describes the set of functions that operate on that virtual memory area. It exists to describe a vm_area_struct, so the two correspond one-to-one.
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
* to the functions called when a no-page or a wp-page exception occurs.
*/
struct vm_operations_struct {
//...
// Handler called when a page fault occurs in this area
vm_fault_t (*fault)(struct vm_fault *vmf);
//...
};
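A minimal sketch of what a driver's .fault callback typically looks like (my_buffer and the lookup through vm_private_data are hypothetical; vmf->pgoff, vmf->page, get_page() and VM_FAULT_SIGBUS are the real kernel interfaces):
static vm_fault_t my_driver_fault(struct vm_fault *vmf)
{
    struct my_buffer *buf = vmf->vma->vm_private_data;  /* hypothetical driver state */

    if (vmf->pgoff >= buf->nr_pages)
        return VM_FAULT_SIGBUS;

    /* Hand the faulting page back to the core MM, taking a reference on it. */
    vmf->page = buf->pages[vmf->pgoff];
    get_page(vmf->page);
    return 0;
}

static const struct vm_operations_struct my_driver_vm_ops = {
    .fault = my_driver_fault,
};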
COCO
Confidential Computing (coco) hardware such as AMD SEV and Intel TDX.
PDPTR, PML4, PML4AE, PDPT, PDPTE, PD, PDE, PT, PTE
In 64-bit:
- CR3 points to PML4 table, each entry is PML4E
- PML4E points to PDPT (page-directory-pointer table), each entry is PDPTE
- PDPTE points to PD (page directory), each entry is PDE
- PDE points to PT (page table), each entry is PTE
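To make the four levels concrete, a sketch of how a 48-bit linear address is split into the index used at each level (standard 4-level paging with 4KB pages; the address value is only illustrative):
/* bits 47:39 -> PML4E index, 38:30 -> PDPTE, 29:21 -> PDE, 20:12 -> PTE,
 * bits 11:0  -> byte offset within the 4KB page. */
unsigned long addr = 0x00007f1234567abcUL;
unsigned int pml4_idx = (addr >> 39) & 0x1ff;
unsigned int pdpt_idx = (addr >> 30) & 0x1ff;
unsigned int pd_idx   = (addr >> 21) & 0x1ff;
unsigned int pt_idx   = (addr >> 12) & 0x1ff;
unsigned int offset   = addr & 0xfff;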
Regarding the PDPTR (Page Directory Pointer Table Register):
The register CR3, which now points to the PDP, also got the alternate name PDPTR: “page directory pointer table register”
Internal, non-architectural PDPTE registers.
The PDPT comprises four (4) 64-bit entries called PDPTEs. Each PDPTE controls access to a 1-GByte region of the linear-address space. Corresponding to the PDPTEs, the logical processor maintains a set of four (4) internal, non-architectural PDPTE registers, called PDPTE0, PDPTE1, PDPTE2, and PDPTE3. The logical processor loads these registers from the PDPTEs in memory as part of certain operations.
With PAE paging, a logical processor maintains a set of four (4) PDPTE registers, which are loaded from an address in CR3. Linear addresses are translated using 4 hierarchies of in-memory paging structures, each located using one of the PDPTE registers. (This is different from the other paging modes, in which there is one hierarchy referenced by CR3.)
The funny page table terminology on AMD64 – pagetable.com
#UD
Generally means the instruction being executed does not exist (i.e., it is not a valid opcode).
Invalid Opcode Exception.
#GP
General Protection.
Indicates that the processor detected one of a class of protection violations called “general-protection violations.”
The conditions that cause this exception to be generated comprise all the protection violations that do not cause other exceptions to be generated (such as invalid-TSS, segment-not-present, stack-fault, or page-fault exceptions).
Cache line
On most architectures, the size of a cache line is 64 bytes (NOT 4K, remember), meaning that all memory is divided into blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also fetching all its 63 cache line neighbors whether you want them or not.
In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back.
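One practical consequence is false sharing: two unrelated fields that land in the same 64-byte line keep bouncing between cores. A sketch of the usual fix, padding/aligning per-thread data to the line size (the struct is hypothetical):
#include <stdatomic.h>

struct counters {
    /* Without the 64-byte alignment both counters would likely share one
     * cache line, and every increment on one core would invalidate the
     * line in the other core's cache. */
    _Alignas(64) atomic_long thread_a;
    _Alignas(64) atomic_long thread_b;
};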
VMCALL
This instruction allows guest software to make a call for service into an underlying VM monitor. The details of the programming interface for such calls are VMM-specific; this instruction does nothing more than cause a VM exit.
VMCALL is not a privileged instruction, which means it can be executed from the guest's userspace.
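A sketch of issuing a hypercall from the guest, modeled on kvm_hypercall1() in arch/x86/include/asm/kvm_para.h and assuming KVM's hypercall ABI (number in RAX, arguments in RBX/RCX/RDX/RSI, result in RAX). Note that although VMCALL itself exits from any CPL, KVM rejects hypercalls made with CPL > 0:
static inline long kvm_hypercall1_sketch(unsigned long nr, unsigned long a0)
{
    long ret;

    /* VMCALL unconditionally causes a VM exit; KVM interprets the registers
     * according to its hypercall ABI. */
    asm volatile("vmcall"
                 : "=a"(ret)
                 : "a"(nr), "b"(a0)
                 : "memory");
    return ret;
}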
Intel TSX (TRANSACTIONAL SYNCHRONIZATION EXTENSIONS)
SDM Chapter 16 PROGRAMMING WITH INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS
HLE (Hardware Lock Elision): the legacy-compatible interface.
RTM (Restricted Transactional Memory): the new instruction-set interface.
Peer certificate cannot be authenticated with given CA certificates for
Add sslverify = 0 to the corresponding .repo file.
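For example (the repo id and baseurl below are placeholders):
[example-repo]
name=Example Repo
baseurl=https://repo.example.com/el9/
enabled=1
gpgcheck=1
sslverify=0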
Atomic context / Process context / preempt_disable()
Kernel code generally runs in one of two fundamental contexts:
- Process context reigns when the kernel is running directly on behalf of a (usually) user-space process; the code which implements system calls is one example. When the kernel is running in process context, it is allowed to go to sleep if necessary.
- But when the kernel is running in atomic context, things like sleeping are not allowed. Code which handles hardware and software interrupts is one obvious example of atomic context.
Atomic context can be entered by:
- Handling hardware and software interrupts, as noted above.
- Any kernel function moves into atomic context the moment it acquires a spinlock. Given the way spinlocks are implemented, going to sleep while holding one would be a fatal error; if some other kernel function tried to acquire the same lock, the system would almost certainly deadlock forever.
- preempt_disable(). This function disables kernel preemption only; it does not disable IRQs, so it cannot prevent preemption by interrupts. sleep() is not allowed after preempt_disable().
I think the reason lies in why you are about to use preemption disabling in the first place. When you use preemption disabling, you managed to define a critical region within which your data structure is protected from being corrupted by another process. But if you put a sleep in that critical region, I think you actually split your original critical region into two regions in purpose so you should encompass each region by one pair of preempt_disable/enable respectively.
Why sleeping not allowed after preempt_disable
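A minimal sketch of the usual pattern, protecting per-CPU data by disabling preemption (the per-CPU variable is made up):
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, my_counter);  /* hypothetical per-CPU counter */

static void bump_counter(void)
{
    /* Between preempt_disable() and preempt_enable() this task can neither
     * be migrated to another CPU nor preempted by another task, so the
     * per-CPU access is safe. Sleeping in here would be a bug. */
    preempt_disable();
    __this_cpu_inc(my_counter);
    preempt_enable();
}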
railway.app
Alternatives
Koyeb (currently used): https://app.koyeb.com/
Render: https://dashboard.render.com/
Download a patchset from lore
The message_id is the id of the whole thread; b4 is smart enough to automatically fetch the latest version of the patchset.
b4 shazam <message_id>
kvm_arch
struct kvm_arch {
// whether the master clock is used or not
bool use_master_clock;
// Host's CLOCK_BOOTTIME captured when the master clock was last updated
u64 master_kernel_ns;
// Used to calculate guest's CLOCK_BOOTTIME
// guest's CLOCK_BOOTTIME = master_kernel_ns + kvmclock_offset
s64 kvmclock_offset;
// Host's TSC value doing the update,
// used to calculate elapsed time
u64 master_cycle_now;
//
struct kvm_apic_map __rcu *apic_map;
/*
* List of struct kvm_mmu_pages being used as roots.
* All struct kvm_mmu_pages in the list should have
* tdp_mmu_page set.
*
* Roots will remain in the list until their tdp_mmu_root_count
* drops to zero, at which point the thread that decremented the
* count to zero should removed the root from the list and clean
* it up, freeing the root after an RCU grace period.
*/
// In a non-virtualized environment every process needs its own page table.
// With TDP, however, we only translate GPA to HPA, so in theory the whole VM
// should need just one page table root. Why is there a list of roots here?
// In testing, 2 roots are always created regardless of the number of vCPUs. Why?
//
struct list_head tdp_mmu_roots;
// A hash table with KVM_NUM_MMU_PAGES (4096) buckets; the key is the gfn,
// the value is a list of shadow pages. See mmu_page_hash^ for details.
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
};
kvmclock_offset
Guest CLOCK_BOOTTIME = Host CLOCK_BOOTTIME + kvmclock_offset:
vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
It follows that the initial value is the negative of the host's CLOCK_BOOTTIME at VM creation time, i.e.:
kvm_arch_init_vm()
//...
kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
One other place also changes the value of kvmclock_offset:
kvm_vm_ioctl_set_clock
if (ka->use_master_clock)
now_raw_ns = ka->master_kernel_ns;
else
now_raw_ns = get_kvmclock_base_ns();
// now_raw_ns: Host's CLOCK_BOOTTIME
// data.clock: Guest's CLOCK_BOOTTIME (we want to set)
// because gBT = hBT + offset
// so offset = gBT - hBT
ka->kvmclock_offset = data.clock - now_raw_ns;
use_master_clock / kvm_guest_has_master_clock
use_master_clock is assigned here:
pvclock_update_vm_gtod_copy() {
//...
// when the host clocksource is TSC and all vCPUs' TSCs are matched
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
&& !ka->backwards_tsc_observed
&& !ka->boot_vcpu_runs_old_kvmclock;
//...
}
master_cycle_now
Only updated in:
pvclock_update_vm_gtod_copy
kvm_get_time_and_clockread
do_monotonic_raw(master_kernel_ns, master_cycle_now)
User return MSR (uret MSR)
User return MSRs are always emulated when enabled in the guest, but only loaded into hardware when necessary, e.g. SYSCALL #UDs outside of 64-bit mode or if EFER.SCE=1, thus the SYSCALL MSRs don't need to be loaded into hardware if those conditions aren't met.
// A global variable
// only stores the MSR's indexes, values are in other place
static u32 __read_mostly kvm_uret_msrs_list[KVM_MAX_NR_USER_RETURN_MSRS];
// Each CPU has a struct of this
// which contains all the MSRs in this CPU
struct kvm_user_return_msrs {
struct user_return_notifier urn;
// if the above user_return_notifier is
// registered or not
bool registered;
// stores values, indexed with the same index
// used to index kvm_uret_msrs_list
struct kvm_user_return_msr_values {
u64 host; // host's value, restored when returning to userspace
u64 curr; // MSR value currently on the hardware
} values[KVM_MAX_NR_USER_RETURN_MSRS];
};
hardware_setup
vmx_setup_user_return_msrs
kvm_add_user_return_msr
struct vmx_uret_msr
struct vmx_uret_msr {
// Whether this MSR should be loaded into hardware at VM-entry.
bool load_into_hardware;
u64 data;
u64 mask;
};
struct vcpu_vmx {
//...
// guest's value of this MSR
struct vmx_uret_msr guest_uret_msrs[MAX_NR_USER_RETURN_MSRS];
//...
};
vmx_set_guest_uret_msr
This function sets the guest's value of a uret MSR.
static int vmx_set_guest_uret_msr(struct vcpu_vmx *vmx,
struct vmx_uret_msr *msr, u64 data)
{
unsigned int slot = msr - vmx->guest_uret_msrs;
int ret = 0;
if (msr->load_into_hardware) {
preempt_disable();
ret = kvm_set_user_return_msr(slot, data, msr->mask);
preempt_enable();
}
//...
// will update msr data in vcpu struct
msr->data = data;
//...
}
vmx_setup_uret_msr
Just set the MSR's load_into_hardware to the specified value.
static void vmx_setup_uret_msr(struct vcpu_vmx *vmx, unsigned int msr,
bool load_into_hardware)
{
struct vmx_uret_msr *uret_msr = vmx_find_uret_msr(vmx, msr);
//...
uret_msr->load_into_hardware = load_into_hardware;
}
kvm_set_user_return_msr
This function will be called just before VM-entry to set the guest's value back.
Write the "value" to MSR "slot", while considering host amd "mask"
int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
unsigned int cpu = smp_processor_id();
struct kvm_user_return_msrs *msrs = per_cpu_ptr(user_return_msrs, cpu);
int err;
// use masked part of "value" and unmasked part from "host"
value = (value & mask) | (msrs->values[slot].host & ~mask);
if (value == msrs->values[slot].curr)
return 0;
err = wrmsrl_safe(kvm_uret_msrs_list[slot], value);
if (err)
return 1;
msrs->values[slot].curr = value;
if (!msrs->registered) {
msrs->urn.on_user_return = kvm_on_user_return;
user_return_notifier_register(&msrs->urn);
msrs->registered = true;
}
return 0;
}
kvm_on_user_return()
urn->on_user_return() is called when a syscall returns to user mode. kvm_on_user_return() is hooked onto this callback by registering it.
// Let's take syscall as an example; another example is irq
do_syscall_64
// syscall is done and we need to return to usermode
syscall_exit_to_user_mode
syscall_exit_to_user_mode_work
__syscall_exit_to_user_mode_work // irqentry_exit_to_user_mode
exit_to_user_mode_prepare
arch_exit_to_user_mode_prepare
fire_user_return_notifiers
hlist_for_each_entry_safe(urn, tmp2, head, link)
urn->on_user_return(urn); // finally call to here
static void kvm_on_user_return(struct user_return_notifier *urn)
{
unsigned slot;
struct kvm_user_return_msrs *msrs
= container_of(urn, struct kvm_user_return_msrs, urn);
struct kvm_user_return_msr_values *values;
unsigned long flags;
//...
// unregister, i.e., this function should only be triggered once
if (msrs->registered) {
msrs->registered = false;
user_return_notifier_unregister(urn);
}
//...
// write host's value back, which means
// when userspace is running, the MSR's value should
// start with "host"
for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
values = &msrs->values[slot];
if (values->host != values->curr) {
wrmsrl(kvm_uret_msrs_list[slot], values->host);
values->curr = values->host;
}
}
}
Destroy VM / Ctrl-C in QEMU / Teardown / Shutdown
memory_region_finalize // memory_region_info.instance_finalize
memory_region_destructor_ram
qemu_ram_free
reclaim_ramblock
// The main thread receives SIGINT and sends SIG_IPI to each vCPU thread
main
qemu_init
qemu_init_displays
os_setup_signal_handling
termsig_handler // sigaction(SIGINT, &act, NULL);
qemu_system_killed
shutdown_signal = signal;
shutdown_pid = pid;
shutdown_action = SHUTDOWN_ACTION_POWEROFF;
shutdown_requested = SHUTDOWN_CAUSE_HOST_SIGNAL;
qemu_main
qemu_default_main
qemu_main_loop
// will return true
main_loop_should_exit
qemu_cleanup
vm_shutdown
do_vm_stop
pause_all_vcpus
//...
cpu->stop = true;
// send SIG_IPI to each vCPU thread
pthread_kill(cpu->thread->thread, SIG_IPI)
// the vCPU thread side
kvm_vcpu_thread_fn
kvm_init_cpu_signals
kvm_ipi_signal
kvm_cpu_kick
kvm_run->immediate_exit = 1
cpu_can_run == false
When a process terminates, all of its open files are closed automatically by the kernel. Many programs take advantage of this fact and don't explicitly close open files. c - is it a good practice to close file descriptors on exit - Stack Overflow
Because the kernel automatically closes a process's open fds when the process exits, the following two functions get called:
kvm_vm_release
kvm_vcpu_release
static const struct file_operations kvm_vm_fops = {
.release = kvm_vm_release,
.unlocked_ioctl = kvm_vm_ioctl,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vm_compat_ioctl),
};
static const struct file_operations kvm_vcpu_fops = {
.release = kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.mmap = kvm_vcpu_mmap,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vcpu_compat_ioctl),
};
kvm_vm_release()
KVM
kvm_vm_release // kvm_vcpu_release is the same
kvm_put_kvm
kvm_destroy_vm
kvm_arch_destroy_vm
// This is for VMX, for TDX, do nothing
static_call_cond(kvm_x86_vm_destroy)(kvm);
vmx_vm_destroy
//...
kvm_destroy_vcpus
kvm_vcpu_destroy
kvm_arch_vcpu_destroy
static_call(kvm_x86_vcpu_free)(vcpu);
tdx_vcpu_free // vmx_vcpu_free
// This is for TDX, for VMX, do nothing
static_call_cond(kvm_x86_vm_free)(kvm);
tdx_vm_free