See the previous session's dmesg after a kernel crash/shutdown

Change the line in /etc/systemd/journald.conf from the compiled-in default #Storage=auto to Storage=persistent.

systemctl restart systemd-journald

Then shutdown/crash…

journalctl can easily retrieve the kernel messages (the dmesg log) from a prior shutdown/crash (in dmesg -T format) with the following commands.

# all boot cycles
journalctl -o short-precise -k -b all
# Current boot
journalctl -o short-precise -k
# Last boot
journalctl -o short-precise -k -b -1
# Two boots prior
journalctl -o short-precise -k -b -2

How to read dmesg from previous session? (dmesg.0) - Unix & Linux Stack Exchange

Build kernel for guest to use

In the kernel config, disable CONFIG_MODULES so that all needed drivers are built into the kernel image: linux - qemu - where are modules pulled from if using -kernel? - Unix & Linux Stack Exchange

Run make; the resulting kernel image is at arch/x86/boot/bzImage.

-kernel "/home/lei/p/pvrar/arch/x86/boot/bzImage" \
-initrd "/boot/initramfs-6.0.0-rc7-tdx-pvrar-lei-g8883a6e1d762.img" \

migratepages / Page migration

Migrate the physical location of a process's pages.

migratepages moves the physical location of a process's pages without any change to the process's virtual address space.

Why migrate pages?

Moving the pages allows one to change the distance between a process and its memory. Performance may be optimized by moving a process's pages to the node where it is executing.

In virtualization, migratepages can be used to trigger a lot of EPT violations, because the original GPA->HPA mappings become invalid after the pages move.

An example of using migratepages:

migratepages <pid> <from nodes> <to nodes>

source code: numactl/migratepages.c at master · numactl/numactl

migratepages // binary
    migrate_pages // syscall
        kernel_migrate_pages() // kernel
            do_migrate_pages
                migrate_to_node
                    migrate_pages
                        unmap_and_move
                            __unmap_and_move
                                try_to_migrate
                                    try_to_migrate_one
                                        mmu_notifier_invalidate_range_start
                                            mn_hlist_invalidate_range_start
                                                ops->invalidate_range_start()
                                        mmu_notifier_invalidate_range
                                            subscription->ops->invalidate_range()
                                        mmu_notifier_invalidate_range_end
                                            __mmu_notifier_invalidate_range_end
                                                mn_hlist_invalidate_end
                                                    subscription->ops->invalidate_range()
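
For reference, the migratepages binary is essentially a thin wrapper around the migrate_pages(2) syscall. A minimal userspace sketch of calling it directly (the node numbers here are only for illustration):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	int pid = atoi(argv[1]);

	/* Node masks are bitmaps: bit n set means "node n". */
	unsigned long old_nodes = 1UL << 0;	/* from node 0 */
	unsigned long new_nodes = 1UL << 1;	/* to node 1 */

	/* maxnode is the number of bits usable in the masks. */
	long left = syscall(SYS_migrate_pages, pid, 8 * sizeof(unsigned long),
			    &old_nodes, &new_nodes);
	if (left < 0) {
		perror("migrate_pages");
		return 1;
	}

	/* Return value: number of pages that could not be moved. */
	printf("pages not moved: %ld\n", left);
	return 0;
}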

QEMU pin vCPU

taskset -c 0-1 <qemu command>

QEMU bind VM's memory to a NUMA node

-object memory-backend-ram,id=mem0,size=8G,prealloc=yes,policy=bind,host-nodes=0 \
-machine q35,memory-backend=mem0 \

ASCII plot

lewish/asciiflow: ASCIIFlow

ASCIIFlow

How to know NUMA information?

numactl --hardware

Kernel IPI sending path (Kernel)

There is a global variable smp_ops:

struct smp_ops {
	//...
	// Multi-cast IPI
	void (*send_call_func_ipi)(const struct cpumask *mask);
	// Uni-cast IPI
	void (*send_call_func_single_ipi)(int cpu);
	// Multi-cast IPI using RAR
	void (*send_rar_ipi)(const struct cpumask *mask);
	// Uni-cast IPI using RAR
	void (*send_rar_single_ipi)(int cpu);
	//...
};
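
For context, the generic SMP code reaches these hooks through small arch wrappers. Simplified from arch/x86/include/asm/smp.h (the RAR hooks above follow the same pattern in the custom kernel):

/* Uni-cast: used by smp_call_function_single() and friends. */
static inline void arch_send_call_function_single_ipi(int cpu)
{
	smp_ops.send_call_func_single_ipi(cpu);
}

/* Multi-cast: used by smp_call_function_many() and friends. */
static inline void arch_send_call_function_ipi_mask(const struct cpumask *mask)
{
	smp_ops.send_call_func_ipi(mask);
}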

Can a CPU send an IPI to itself?

Yes, it can.

I think the big reason is consistency: if you are writing software for a multi-core processor and you want to send an interrupt out to all cores in the system, it would suck to have to do an IPI to every other core, then execute INT to interrupt the current core, and of course you would also have to set up handlers for both interrupt sources, etc. It is just much easier to send an IPI to everyone.

x86 - Purpose of self-IPI on IA-32 - Stack Overflow
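
In the kernel this goes through the apic ops; roughly (a sketch, assuming the mainline apic->send_IPI_self() hook, which programs the ICR with the "self" destination shorthand):

/* Sketch: send an IPI to the current CPU via the local APIC.
 * The CPU interrupts itself with the given vector. */
static void send_ipi_to_myself(int vector)
{
	apic->send_IPI_self(vector);
}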

How to enable X11-forwarding / xserver

yum install xorg-x11-xauth

edit /etc/ssh/sshd_config:

AddressFamily inet
AllowTcpForwarding yes
X11Forwarding yes
X11DisplayOffset 10
X11UseLocalhost yes

linux - How does mobaXterm know whether x11 forwarding is working on remote server? - Stack Overflow

Memory ballooning

Allows the VMM to artificially enlarge its pool of memory by reclaiming unused memory previously allocated to various VMs. It enables the virtual machine monitor (VMM) to reclaim underutilized memory from a lightly loaded VM and re-allocate it to overloaded VMs (a reboot is needed if demand exceeds a VM's memory capacity).

This is achieved through a balloon driver in the guest OS, which the VMM communicates with when it needs to reclaim memory through ballooning.

Note that this is different from hotplug: with hotplug, the guest knows that its memory capacity has changed.

Ballooning is effective only when the scope of memory resizing does not exceed the VM's memory cap. Otherwise, the VM needs to reboot for memory re-configuration. That means ballooning is constrained to the memory cap that originates from the VM's creation and persists for the VM's whole lifecycle. Of course, this limitation can be addressed by memory hotplug: hotplugging can dynamically expand a VM's physical address space beyond the memory cap specified at boot, and thus can arbitrarily increase a VM's memory allocation without rebooting the VM. Due to these advantages, hotplugging is complementary to ballooning.

The genius of ballooning is that it allows the guest operating system to intelligently make the hard decision about which pages to page out, without the hypervisor's involvement. So this can also be used to generate a lot of EPT invalidations.
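
A conceptual sketch of the guest side, not the real virtio_balloon driver (struct balloon and report_pfn_to_host() are hypothetical placeholders; the page-allocation calls are real kernel APIs): on an inflate request, the driver allocates pages it will stop using and reports their PFNs to the host, which can then reclaim the backing memory.

/* Conceptual inflate path, guest side. */
static void balloon_inflate(struct balloon *b, unsigned long nr_pages)
{
	while (nr_pages--) {
		/* Take a page away from the guest's free pool. */
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			break;			/* the guest itself is under memory pressure */

		/* Remember it so the balloon can be deflated later. */
		list_add(&page->lru, &b->pages);

		/* Tell the host which guest PFN is now unused (hypothetical hook). */
		report_pfn_to_host(b, page_to_pfn(page));
	}
}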

For more, see:

Kexec

A mechanism of the Linux kernel that allows booting of a new kernel from the currently running one. Essentially, kexec skips the bootloader stage and hardware initialization phase performed by the system firmware (BIOS or UEFI), and directly loads the new kernel into main memory and starts executing it immediately. This avoids the long times associated with a full reboot, and can help systems to meet high-availability requirements by minimizing downtime.

kexec -l /boot/vmlinuz-6.2.0-staging-lei --initrd=/boot/initramfs-6.2.0-staging-lei.img --reuse-cmdline
kexec -e

TOCTOU

A class of software bugs: time-of-check to time-of-use.

Simply put: at check time everything looks fine and the resource may be used, but it gets changed before the use happens, so something goes wrong at use time.
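
The classic access()/open() race, as a minimal illustrative sketch (the path and function name are made up for the example):

#include <fcntl.h>
#include <unistd.h>

int write_if_allowed(const char *path)
{
	if (access(path, W_OK) != 0)		/* time of check */
		return -1;

	/* <-- window: another process can replace 'path' here, e.g. with a
	 * symlink to a file the caller should not be allowed to touch. */

	int fd = open(path, O_WRONLY);		/* time of use */
	if (fd < 0)
		return -1;

	/* ... write ... */
	close(fd);
	return 0;
}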

Time-of-check to time-of-use - Wikipedia

Several reasons why a page fault may occur

SDM, Chapter 4 Paging, Section 4.7 PAGE-FAULT EXCEPTIONS

An access to a linear address may cause a page-fault exception for either of two reasons:

  • There is no translation for the linear address, i.e. the PRESENT bit is 0. Note that this applies at any paging level: if the PRESENT bit is 0 at any level, it counts as this case. In particular it covers:
    • The PRESENT bit in the PTE is 0, which may mean there is no mapping at all, or that the page has been swapped out. How do we know which case it is? That is the OS's job: if the page was swapped out, the OS keeps a record of where it went. x86 does not provide this distinction; the hardware only knows that the translation is not present (see the sketch below).
  • There is a translation for the linear address, but its access rights do not permit the access. This one is easy to understand, e.g. a write to a read-only page.

Whatever pages end up evicted to their backing store will have their “PRESENT” bits cleared.
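
A rough sketch of how the OS side tells these cases apart in its fault handler (not literal kernel code, but pte_none(), pte_present() and pte_to_swp_entry() are real helpers):

/* The CPU reports only "not present" or "protection violation"; the OS
 * recovers the rest from what it left in the PTE. */
if (pte_none(pte)) {
	/* Never mapped: demand-allocate a fresh page (e.g. anonymous fault). */
} else if (!pte_present(pte)) {
	/* Non-present but non-zero: the OS stored a swap entry here when it
	 * evicted the page, so it knows where to read the page back from. */
	swp_entry_t entry = pte_to_swp_entry(pte);
	/* ... swap the page back in using 'entry' ... */
} else {
	/* Present but the access violated permissions, e.g. a write to a
	 * read-only page. */
}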

memory - How does the page table of a process knows which page was swapped? - Unix & Linux Stack Exchange

Attestation

PCCS

PCCS is a locally hosted cache for the PCS. It retrieves PCK certificates and other collateral on demand over the internet at runtime, and then caches them in a local database.

The PCCS exposes HTTPS interfaces similar to those of the PCS.

Memory folios

A folio does not necessarily have to contain a power-of-two number of pages.

folio: (n.) a sheet of paper folded once; a leaf of a manuscript; a page number or sheet count. (adj.) folded once, folio-sized. (v.) to number the pages of.

The folio groundwork started to be merged in Linux 5.16. As of Linux 6.5, folios have already made a lot of progress.

struct folio is a new abstraction intended to replace the venerable struct page.

Clarifying memory management with page folios [LWN.net]

Every base page of memory managed by the kernel is represented by a page structure in the system memory map. If a compound page is built from a group of base pages, the page structure of the first page in the group (the "head page") is marked with a special flag, clearly indicating that this is a compound page. The rest of the information in the head page's page structure applies to the whole compound page. All the other pages (the "tail pages") are also marked, each carrying a pointer to the associated head page. For how compound pages are organized, see An introduction to compound pages.

This mechanism makes it easy to get from a tail page's page structure to the head page of the whole compound page. Many interfaces in the kernel rely on this, but it introduces an ambiguity: if a function receives a pointer to a tail page's page structure, should it operate on that tail page, or on the whole compound page?

Wilcox proposed the concept of a "page folio", which is in fact still a page structure, just one that is guaranteed not to be a tail page. Any function that takes a folio as a parameter operates on the whole compound page (if what was passed in really is a compound page), so there is no ambiguity. This makes the kernel's memory-management subsystem clearer: if a function is changed to accept only folios as parameters, it is clear that it is not meant to operate on tail pages.

LWN:利用page folio来明确内存操作!_LinuxNews搬运工的博客-CSDN博客

Compound page

A compound page is simply a grouping of two or more physically contiguous pages into a unit that can, in many ways, be treated as a single, larger page. They are most commonly used to create huge pages.
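
For reference, getting from an arbitrary struct page to the head page of its compound page looks roughly like this (compound_head(), PageHead() and compound_order() are real kernel helpers; the fragment itself is only illustrative):

/* 'page' may be a head page, a tail page, or a plain base page. */
struct page *head = compound_head(page);	/* == page if it is not a tail page */

if (PageHead(head)) {
	/* A real compound page: the head holds the metadata for the whole unit. */
	unsigned int order = compound_order(head);	/* log2 of the number of base pages */
	/* ... */
}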

struct folio Kernel

A folio is a physically, virtually and logically contiguous set of bytes.

  • It is a power-of-two in size, and
  • it is aligned to that same power-of-two.

It is at least as large as PAGE_SIZE. It may be mapped into userspace at an address which is at an arbitrary page offset, but its kernel virtual address is aligned to its size.

struct folio {
	//...
	// the number of pages contained in this folio
	unsigned int _folio_nr_pages;
	//...
};

folio_mapcount() Kernel

refcount is the total number of references a page has, while mapcount is the number of page-table entries referencing the page.

mapcount indicates how many times this page has been mapped by processes, i.e. how many user PTEs already map it.

/**
 * folio_mapcount() - Calculate the number of mappings of this folio.
 * @folio: The folio.
 *
 * A large folio tracks both how many times the entire folio is mapped,
 * and how many times each individual page in the folio is mapped.
 * This function calculates the total number of times the folio is
 * mapped.
 *
 * Return: The number of times this folio is mapped.
 */
static inline int folio_mapcount(struct folio *folio)
{
	if (likely(!folio_test_large(folio)))
		return atomic_read(&folio->_mapcount) + 1;
	return folio_total_mapcount(folio);
}

can_split_folio() Kernel

Check whether a folio can be split into multiple small pages.

/* Racy check whether the huge page can be split */
bool can_split_folio(struct folio *folio, int *pextra_pins)
{
    //...
    // 0 == 513 - 512 - 1
	return folio_mapcount(folio) == folio_ref_count(folio) - folio_nr_pages(folio) - 1;
}

folio_page() Kernel

Return the nth page of a folio.

#define folio_page(folio, n)	nth_page(&(folio)->page, n)

folio_put() Kernel

Decrement the reference count on a folio. If the folio's reference count reaches zero, the memory will be released back to the page allocator and may be used by another allocation immediately. Do not access the memory or the struct folio after calling folio_put() unless you can be sure that it wasn't the last reference.

static inline void folio_put(struct folio *folio)
{
	if (folio_put_testzero(folio))
		__folio_put(folio);
}

folio_nr_pages() Kernel

If we use huge pages to reduce TLB misses, for example 2M pages, folio_nr_pages() will return 512 rather than 1. In other words, each page is fixed at 4K, even when we use a 2M THP.

This is because a struct page is associated with a physical 4K page; on x86, a struct page always corresponds to a 4K base page, never any other size.
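
A simplified sketch of the accessor, loosely based on include/linux/mm.h (details vary across kernel versions): a small folio is a single base page, while a large folio records how many 4K struct pages it spans, so a 2M THP folio reports 512.

static inline long folio_nr_pages(struct folio *folio)
{
	if (!folio_test_large(folio))
		return 1;			/* a plain 4K base page */
	return folio->_folio_nr_pages;		/* e.g. 512 for a 2M THP folio */
}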

c - Where is the struct page in Linux kernel? - Stack Overflow