2023-02 Monthly Archive

Misc ideas

Some instructions are conditionally exiting instructions, which means they will exit only some conditions are met. See the "Conditionally Exiting Instructions" section in Day 5: VM-exit handler, Event Injection, and CPUID Emulation

What does "red rum" (红色朗姆酒) mean? murder.

PREFETCHh instruction: prefetch data into caches.

Samba: n. 桑巴舞（巴西交谊舞）。

The function to handle qmp commands, e.g., query-cpu-definitions is the function with the same name qmp_query_cpu_definitions。

定期爬取博客内容

crontab -e

Add:

0 3 * * * "fish ~/.config/utils/scraper.fish"

Then:

sudo service cron restart

Running a cron job at 2:30 AM everyday - Stack Overflow

Why in C bool is 1 byte?

Because the CPU can't address anything smaller than a byte.

c++ - Why is a boolean 1 byte and not 1 bit of size? - Stack Overflow

Some useful tools for binary format

对于在命令行上提取和过滤数据，fq 提供了一种可访问的方式。
要编辑二进制数据，GNU poke 是一个强大的工具。
要解析并可能以支持的编程语言之一序列化二进制数据，Kaitai Struct 提供了一种灵活的方法。

Weixin Official Accounts Platform

NTFS directory junction vs. directory symbolic link

mklink /J linkName target
mklink /D linkName target

NTFS Hard Links, Junctions and Symbolic Links

Soft link vs. hard link

hard link: same inode with different file name, bi-directional.

soft link: a filename pointing another file.

How to ioctl in Python?

import fcntl

fcntl.ioctl()

.efi File

It contains system-level data that executes between the operating system and the firmware. EFI files are used for staging firmware updates, booting operating systems, and running pre-boot programs.

Typically, EFI files are not meant to be opened. Hardware developers and other advanced PC users can open EFI files using the EFI Developer Kit (cross-platform).

EFI File Extension - What is an .efi file and how do I open it?

这个回答可以看看，对于理解 .efi 存在的意义有所帮助。

bootloader存在的意义是什么？ - 知乎

Export thunderbird message filters

Export/Import section in Message Filters - MozillaZine Knowledge Base

You can search the file directly in "everything". Because the folder is hard to find.

SIGINT, SIGTERM, SIGQUIT, SIGKILL

SIGINT is the signal sent when we press Ctrl+C. The default action is to terminate the process. However, some programs override this action and handle it differently. One common example is the bash interpreter. When we press Ctrl+C it doesn’t quit, instead, it prints a new and empty prompt line.

If we want to use a signal to terminate it, we can’t use SIGINT with this script. We should use SIGTERM, SIGQUIT, or SIGKILL instead.

SIGTERM is the default signal when we use the kill command. The default action of both signals is to terminate the process. However, SIGQUIT also generates a core dump before exiting. When we send SIGTERM, the process sometimes executes a clean-up routine before exiting. We can also handle SIGTERM to ask for confirmation before exiting.

When a process receives SIGKILL it is terminated. This is a special signal as it can’t be ignored and we can’t change its behavior.

The default action for SIGINT, SIGTERM, SIGQUIT, and SIGKILL is to terminate the process. However

SIGTERM, SIGQUIT, and SIGKILL are defined as signals to terminate the process,
but SIGINT is defined as an interruption requested by the user. So, we shouldn’t depend solely on SIGINT to finish a process.

SIGINT And Other Termination Signals in Linux | Baeldung on Linux

硬实时和软实时

硬实时与软实时之间最关键的差别在于，软实时只能提供统计意义上的实时。例如，有的应用要求系统在 95% 的情况下都会确保在规定的时间内完成某个动作，而不一定要求 100%。

软实时到硬实时 - 知乎

Diff output to raw text

# no need to create a git repo
git diff --word-diff=plain --word-diff-regex=. ~/icelake.raw.txt ~/spr.raw.txt

command line - Using 'diff' to get character-level diff between text files - Stack Overflow

QMP reference

QEMU QMP Reference Manual — QEMU documentation

KVM_SET_CPUID2

KVM forbids KVM_SET_CPUID2 after KVM_RUN was performed on a vCPU unless the supplied CPUID data is equal to what was previously set.

[PATCH v4 5/5] KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN — Linux KVM

What will happen when we set an unsupported CPUID to KVM?

以 CPUID leaf 7 为例：现在 leaf 7 支持 subleaf 0, 1, 2，就不能只暴露给 guest subleaf 0?

better have a test, it seems it won't warn.

Using Samba.

QEMU redirect guest output to host console

-chardev stdio,id=virtiocon0 \ # Backend, Connects the chardev virtiocon0 with the qemu process' stdin/out.
-device virtio-serial \ # channel
-device virtconsole,chardev=virtiocon0 # front-end

virtio-serial simply creates a communication channel between host and guest. This is necessary for the next driver.

The last one, virtconsole creates a console device on the guest, attached to the chardev created before, which was attached to qemu's stdio/out. The guest can then use this console device like any other tty.

The device created on the guest will depend on the kernel and how it was compiled, in linux it's usually /dev/hvc0.

kvm - How do these qemu parameters for stdout redirection work? - Unix & Linux Stack Exchange

Instruction fetches and data access

Who perform the instruction fetch?

It is performed automatically, An instruction fetch is user-mode if CPL = 3 and is supervisor mode if CPL < 3. Instruction fetches are always performed with linear addresses (RIP is linear address).

I think an user mode instruction fetch indicates that we are running user space programs, a supervisor mode instruction fetch indicates that we may running in the kernel space.

ISE in commit message

The LASS details have been published in Chapter 11 in https://cdrdv2.intel.com/v1/dl/getContent/671368

Intel is terrible about maintaining web URLs like this. Better to give this reference as:

The December 2022 edition of the Intel Architecture Instruction Set Extensions and Future Features Programming Reference manual.

Maybe also still give the URL in a reply like this, but avoid using It in a commit message that will be preserved forever.

-Tony

Git rebase --onto

git rebase --onto allows you to rebase starting from a specific commit rather than the branch.

How to git rebase a branch with the onto command? - Stack Overflow

X11 Forwarding

However, X11 is an insecure plaintext protocol by default, so it's not recommended to expose an X Server directly. Instead, most users today use X11 Forwarding to take advantage of the security of SSH when running X11 programs remotely.

…

What You Need to Know About X11 Forwarding

KVM_GET_SUPPORTED_CPUID

struct kvm_cpuid2 {
      __u32 nent;
      __u32 padding;
      struct kvm_cpuid_entry2 entries[0];
};

#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX               BIT(0)
#define KVM_CPUID_FLAG_STATEFUL_FUNC          BIT(1) /* deprecated */
#define KVM_CPUID_FLAG_STATE_READ_NEXT                BIT(2) /* deprecated */

struct kvm_cpuid_entry2 {
      __u32 function;
      __u32 index;
      __u32 flags;
      __u32 eax;
      __u32 ebx;
      __u32 ecx;
      __u32 edx;
      __u32 padding[3];
};

What does each bit in eax (ebx, ecx, edx) mean? There are 2 meanings:

The first explanation:

If the bit returned is 1, it means this bit could be set as 1;
If the bit returned is 0, it means this bit must be set as 0;

The second explanation:

If the bit returned is 1, it means this bit must be set as 1;
If the bit returned is 0, it means this bit must be set as 0;

According to the code:

// kvm_arch_dev_ioctl
// kvm_dev_ioctl_get_cpuid
// get_cpuid_func
// do_cpuid_func
// __do_cpuid_func
// do_host_cpuid
// cpuid_count
void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
{
	asm volatile("cpuid"
		     : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
		     : "0" (id), "2" (count)
	);
}

So it just the output from the host. So I prefer to the first explanation.

64-bit Process virtual address space

The 64-bit x86 virtual memory map splits the address space into two:

the lower section (with the top bit set to 0) is user-space,
the upper section (with the top bit set to 1) is kernel-space.

How a 64-bit process virtual address space is divided in Linux? - Unix & Linux Stack Exchange

PFEC (page fault error code)

It's 32 bit.

// arch/x86/include/asm/trap_pf.h
/*
 * Page fault error code bits:
 *
 *   bit 0 ==	 0: no page found	1: protection fault
 *   bit 1 ==	 0: read access		1: write access
 *   bit 2 ==	 0: kernel-mode access	1: user-mode access
 *   bit 3 ==				1: use of reserved bit detected
 *   bit 4 ==				1: fault was an instruction fetch
 *   bit 5 ==				1: protection keys block access
 *   bit 15 ==				1: SGX MMU page-fault
 */
enum x86_pf_error_code {
	X86_PF_PROT	=		1 << 0,
	X86_PF_WRITE	=		1 << 1,
	X86_PF_USER	=		1 << 2,
	X86_PF_RSVD	=		1 << 3,
	X86_PF_INSTR	=		1 << 4,
	X86_PF_PK	=		1 << 5,
	X86_PF_SGX	=		1 << 15,
};

Search "Page Fault Error Code" in SDM outline.

stac/clac

Set AC Flag, Clear AC Flag.

Enable or disable SMAP.

static __always_inline void stac(void)
{
	/* Note: a barrier is implicit in alternative() */
	alternative("", __ASM_STAC, X86_FEATURE_SMAP);
}

static __always_inline void clac(void)
{
	/* Note: a barrier is implicit in alternative() */
	alternative("", __ASM_CLAC, X86_FEATURE_SMAP);
}

Kernel patching / text poking

Kernel deliberately initialized a temporary mm area within the lower half of the address range for text poking, see commit 4fc19708b165 ("x86/alternatives: Initialize temporary mm for patching").

Why?

ESP (EFI System Partition)

The EFI system partition (also called ESP) is an OS independent partition that acts as the storage place for the EFI bootloaders, applications and drivers to be launched by the UEFI firmware. It is mandatory for UEFI boot.

Typical mount points:

mount ESP to /boot. This is the most straightforward method.
mount ESP to /efi.
…

Tip:

/efi is a replacement[5] for the historical and now discouraged ESP mountpoint /boot/efi.

EFI system partition - ArchWiki

How to see current mount point?

Typically, the EFI partition is mounted as /boot/efi.

Size of the ESP

It is not fixed.

As per the Arch Linux wiki, to avoid potential problems with some EFIs, ESP size should be at least 512 MiB. 550 MiB is recommended to avoid MiB/MB confusion and accidentally creating FAT16. So, most common size guideline for EFI System Partition is between 100 MB to 550 MB.

boot - How to know the proper amount of needed disk space for EFI partition - Ask Ubuntu

Instructions will cause vm-exit

These are the instructions that always cause an exit:

INVEPT
INVVPID
VMCALL
VMCLEAR
VMLAUNCH
VMPTRLD
VMPTRST
VMRESUME
VMXOFF
VMXON

These instruction could case an exit:

CLTS
ENCLS
ENCLV
HLT
IN, INS/INSB/INSW/INSD, OUT, OUTS/OUTSB/OUTSW/OUTSD
INVLPG
INVPCID
LGDT, LIDT, LLDT, LTR, SGDT, SIDT, SLDT, STR
LMSW
MONITOR
MOV from CR3, MOV from CR8, MOV to CR0, MOV to CR3, MOV to CR4, MOV to CR8, MOV DR
MWAIT
PAUSE
RDMSR, RDPMC, RDRAND, RDSEED, RDTSC, RDTSCP
RSM
TPAUSE
UMWAIT
VMREAD
VMWRITE
WBINVD
WRMSR (MSR bitmap in VMCS enable pass-though)
XRSTORS
XSAVES

assembly - VM Instructions If an exit is possible? - Stack Overflow

MTF VM-exit

Monitor Trap flag VM-exit.

当设置该标志位时，回到 guest 时第一条指令会再次触发 VM-Exit，且退出理由为 MTF；

static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
	//...
	[EXIT_REASON_MONITOR_TRAP_FLAG]       = handle_monitor_trap,
	//...
}

This is a debugging feature.

具体可以参考 SDM 25.5.2。

VT MTF VM-Exit - OneTrainee - 博客园

x86_emulate_ops, KVM instruction emulator, how does KVM emulate an instruction?

Handle process:

Day 5: VM-exit handler, Event Injection, and CPUID Emulation

The place to advance VMX_GUEST_RIP: skip_emulated_instruction() in arch/x86/kvm/vmx/vmx.c.

The flow chart for handling an instruction: KVM Emulate Flowchart, there are some faults in it.

If you search in Google what is KVM instruction emulator, you will get NOTHING.

kvm_emulate_instruction
	x86_emulate_instruction

Operations in x86_emulate_ops represent the instruction emulator's interface to memory.

This ops is defined in arch/x86/kvm/kvm_emulate.h. Like other ops, there is a global instance emulate_ops:

static const struct x86_emulate_ops emulate_ops = {
	//...
}

这里定义的函数没有通过 static_call 来调用，而是直接 indirect call 来调用。

Difference with the ops with the prefix kvm, such as kvm_x86_ops? I don't know.

KVM_MP_STATE_*

x86 支持哪些 KVM_MP_STATE_*?

// 支持的
#define KVM_MP_STATE_RUNNABLE          0
#define KVM_MP_STATE_UNINITIALIZED     1

// 不确定的
#define KVM_MP_STATE_INIT_RECEIVED     2
#define KVM_MP_STATE_HALTED            3
#define KVM_MP_STATE_SIPI_RECEIVED     4
#define KVM_MP_STATE_STOPPED           5
#define KVM_MP_STATE_CHECK_STOP        6
#define KVM_MP_STATE_OPERATING         7
#define KVM_MP_STATE_LOAD              8
#define KVM_MP_STATE_AP_RESET_HOLD     9
#define KVM_MP_STATE_SUSPENDED         10

KVM VCPU Requests

An internal API enabling threads (Kernel) to request a VCPU thread to perform some activity. For example, a thread may request a VCPU to flush its TLB. The API consists of the following functions:

KVM VCPU Requests — The Linux Kernel documentation

struct kvm_vcpu {
	//...
	u64 requests;
	//...
}

We can notice that request is architecture-agnostic, it is a KVM-only concept.

When each time we vcpu_enter_guest (before the actual code we enter into the non-root mode), we will check if there are pending requests by kvm_request_pending, if so, we will handle them.

vcpu_block()

当 guest 执行 hlt 指令后会发生 VM Exit（不能让 vCPU 在非根模式下 hlt，这样会浪费 CPU 资源）。

kvm_emulate_halt (the vmexit handler)
	kvm_emulate_halt_noskip
		 __kvm_emulate_halt
			vcpu->arch.mp_state = KVM_MP_STATE_HALTED; // this can help us in vcpu_run(), we will execute vcpu_block()

Why there are kvm_vcpu_halt and kvm_vcpu_block?

static inline int vcpu_block(struct kvm_vcpu *vcpu)
{
		//...
		// guest execute halt and we want to halt polling mechanism
		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
			kvm_vcpu_halt(vcpu);
		// In some situations, KVM want to simply block the vcpu
		else
			kvm_vcpu_block(vcpu);
		//...
}

KVM halt-polling机制分析 - tianshidan1998 - 博客园

VT-d Interrupt Posting Code Analysis · kernelgo

kvm_vcpu_has_events

Rep; nop

rep; nop is indeed the same as the pause instruction (opcode F390). It might be used for assemblers which don't support the pause instruction yet. On previous processors, this simply did nothing, just like nop but in two bytes. On new processors which support hyperthreading, it is used as a hint to the processor that you are executing a spinloop to increase performance.

What does "rep; nop;" mean in x86 assembly? Is it the same as the "pause" instruction? - Stack Overflow

`PAUSE` Instruction

当 spinlock 执行 lock () 获得锁失败后会进行 busy loop，不断检测锁状态，尝试获得锁。这么做有一个缺陷：频繁的检测会让流水线上充满了读操作。另外一个线程往流水线上丢入一个锁变量写操作的时候，必须对流水线进行重排，因为 CPU 必须保证所有读操作读到正确的值。流水线重排十分耗时，影响 lock () 的性能。

为了解决这个问题，Intel 发明了 pause 指令。这个指令的本质功能：让加锁失败时 CPU 睡眠 30 个（about）clock，从而使得读操作的频率低很多。流水线重排的代价也会小很多。

为什么SpinLock的实现中应该加上PAUSE指令？ - xinyuyuanm - 博客园

自旋锁 spinlock 剖析与改进_知识库_博客园

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

PAUSE — Spin Loop Hint

Spin waits, spin loop and busy spin

By the way I think these terms can also use interchangeably. Precisely these are the same terms.

Spin waits, spin loop and busy spin - Stack Overflow

Operating system jitter (OS Jitter)

Latency = Delay between an event happening in the real world and code responding to the event.

Jitter = Differences in Latencies between two or more events.

Difference between Latency and Jitter in Operating-Systems - Stack Overflow

How to reply to lore email messages?

Download the message on lore:

click the mbox.gz
Install 7-zip, right click it and extract the mbox files.

Import it into thunderbird:

Install ImportExportTools NG thunderbird plugin;
Thunderbird -> Local Folders (Right click) -> ImportExportTools NG -> Import mbox file
Move to the folder you want

QEMU make check-acceptance

Use the Avocado test framework.

Testing - QEMU

SMAP (Supervisor mode access prevention)

Motivation: Protecting user space from the kernel.

This extension defines a new SMAP bit in the CR4 control register; when that bit is set, any attempt to access user-space memory while running in a privileged mode will lead to a page fault.

Supervisor mode access prevention [LWN.net]

From Intel SDM:

CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC. Section 4.6 explains how access rights are determined, including the definition of supervisor-mode accesses and user-mode accessibility.

SMEP (Supervisor-Mode Execution Prevention)

From Intel SDM:

CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode. Section 4.6 explains how access rights are determined, including the definition of supervisor-mode accesses and user-mode accessibility.

PTE (page table entry) bits illustrated

From SDM Table 4-20. Format of a Page-Table Entry that Maps a 4-KByte Page:

Uboot vs. grub

U-Boot is a full bootloader but grub is only a 'second-stage' loader, i.e. it needs something to load it. In this case U-Boot also provides the EFI support needed by grub to work.

…

How to see the feature is first introduced in which platform?

ISE Chapter 1.3: INSTRUCTION SET EXTENSIONS AND FEATURE INTRODUCTION IN INTEL® 64 AND IA-32 PROCESSORS

Kerberos on Linux

Kerberos Linux is an authentication protocol for individual Linux users in any network environment.

What is Kerberos Linux

Kinit

Thread information flags

In arch/x86/include/asm/thread_info.h:

/*
 * thread information flags
 * - these are process state flags that various assembly files
 *   may need to access
 */
#define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
#define TIF_SIGPENDING		2	/* signal pending */
#define TIF_NEED_RESCHED	3	/* rescheduling necessary */
#define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
#define TIF_SSBD		5	/* Speculative store bypass disable */
#define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
#define TIF_UPROBE		12	/* breakpointed or singlestepping */
#define TIF_PATCH_PENDING	13	/* pending live patching update */
#define TIF_NEED_FPU_LOAD	14	/* load FPU on return to userspace */
#define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
#define TIF_NOTSC		16	/* TSC is not accessible in userland */
#define TIF_NOTIFY_SIGNAL	17	/* signal notifications exist */
#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP		22	/* uses I/O bitmap */
#define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
#define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
#define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
#define TIF_ADDR32		29	/* 32-bit address space on 64 bits */

TIF_NEED_RESCHED

TIF_NEED_RESCHED is set to signal that a, usually currently running, task needs to be re-scheduled so that the core the task is running on, becomes available for other tasks. In other words: the TIF_NEED_RESCHED flag is set if it has been determine that the task has used its time slice and should be preempted. For reasons, setting the flag and actually preempting the task is done at two different occasions and points in time. For example the flag may be set in an interrupted handler but the actually re-scheduling is done at a later point.

linux kernel - What does TIF_NEED_RESCHED do? - Stack Overflow

Misc ideas

定期爬取博客内容

Why in C bool is 1 byte?

Some useful tools for binary format

NTFS directory junction vs. directory symbolic link

Soft link vs. hard link

How to ioctl in Python?

.efi File

Export thunderbird message filters

SIGINT, SIGTERM, SIGQUIT, SIGKILL

硬实时和软实时

Diff output to raw text

QMP reference

KVM_SET_CPUID2

Host and guest share folder in QEMU

QEMU redirect guest output to host console

Instruction fetches and data access

ISE in commit message

Git rebase --onto

X11 Forwarding

KVM_GET_SUPPORTED_CPUID

64-bit Process virtual address space

PFEC (page fault error code)

stac/clac

Kernel patching / text poking

ESP (EFI System Partition)

Instructions will cause vm-exit

MTF VM-exit

x86_emulate_ops, KVM instruction emulator, how does KVM emulate an instruction?

KVM_MP_STATE_*

KVM VCPU Requests

vcpu_block()

kvm_vcpu_has_events

Rep; nop

PAUSE Instruction

Spin waits, spin loop and busy spin

Operating system jitter (OS Jitter)

How to reply to lore email messages?

QEMU make check-acceptance

SMAP (Supervisor mode access prevention)

SMEP (Supervisor-Mode Execution Prevention)

PTE (page table entry) bits illustrated

Uboot vs. grub

How to see the feature is first introduced in which platform?

Kerberos on Linux

Kinit

Thread information flags

TIF_NEED_RESCHED

`PAUSE` Instruction