2023-02 Monthly Archive
Misc ideas
Some instructions are conditionally exiting instructions, which means they will exit only some conditions are met. See the "Conditionally Exiting Instructions" section in Day 5: VM-exit handler, Event Injection, and CPUID Emulation
What does "red rum" (红色朗姆酒) mean? murder.
PREFETCHh
instruction: prefetch data into caches.
Samba: n. 桑巴舞(巴西交谊舞)。
The function to handle qmp commands, e.g., query-cpu-definitions
is the function with the same name qmp_query_cpu_definitions
。
定期爬取博客内容
crontab -e
Add:
0 3 * * * "fish ~/.config/utils/scraper.fish"
Then:
sudo service cron restart
Running a cron job at 2:30 AM everyday - Stack Overflow
Why in C bool is 1 byte?
Because the CPU can't address anything smaller than a byte.
c++ - Why is a boolean 1 byte and not 1 bit of size? - Stack Overflow
Some useful tools for binary format
- 对于在命令行上提取和过滤数据,fq 提供了一种可访问的方式。
- 要编辑二进制数据,GNU poke 是一个强大的工具。
- 要解析并可能以支持的编程语言之一序列化二进制数据,Kaitai Struct 提供了一种灵活的方法。
Weixin Official Accounts Platform
NTFS directory junction vs. directory symbolic link
mklink /J linkName target
mklink /D linkName target
NTFS Hard Links, Junctions and Symbolic Links
Soft link vs. hard link
hard link: same inode with different file name, bi-directional.
soft link: a filename pointing another file.
How to ioctl in Python?
import fcntl
fcntl.ioctl()
.efi File
It contains system-level data that executes between the operating system and the firmware. EFI files are used for staging firmware updates, booting operating systems, and running pre-boot programs.
Typically, EFI files are not meant to be opened. Hardware developers and other advanced PC users can open EFI files using the EFI Developer Kit (cross-platform).
EFI File Extension - What is an .efi file and how do I open it?
这个回答可以看看,对于理解 .efi 存在的意义有所帮助。
Export thunderbird message filters
Export/Import section in Message Filters - MozillaZine Knowledge Base
You can search the file directly in "everything". Because the folder is hard to find.
SIGINT, SIGTERM, SIGQUIT, SIGKILL
SIGINT is the signal sent when we press Ctrl+C. The default action is to terminate the process. However, some programs override this action and handle it differently. One common example is the bash interpreter. When we press Ctrl+C it doesn’t quit, instead, it prints a new and empty prompt line.
If we want to use a signal to terminate it, we can’t use SIGINT with this script. We should use SIGTERM, SIGQUIT, or SIGKILL instead.
SIGTERM is the default signal when we use the kill command. The default action of both signals is to terminate the process. However, SIGQUIT also generates a core dump before exiting. When we send SIGTERM, the process sometimes executes a clean-up routine before exiting. We can also handle SIGTERM to ask for confirmation before exiting.
When a process receives SIGKILL it is terminated. This is a special signal as it can’t be ignored and we can’t change its behavior.
The default action for SIGINT, SIGTERM, SIGQUIT, and SIGKILL is to terminate the process. However
- SIGTERM, SIGQUIT, and SIGKILL are defined as signals to terminate the process,
- but SIGINT is defined as an interruption requested by the user. So, we shouldn’t depend solely on SIGINT to finish a process.
SIGINT And Other Termination Signals in Linux | Baeldung on Linux
硬实时和软实时
硬实时与软实时之间最关键的差别在于,软实时只能提供统计意义上的实时。例如,有的应用要求系统在 95% 的情况下都会确保在规定的时间内完成某个动作,而不一定要求 100%。
Diff output to raw text
# no need to create a git repo
git diff --word-diff=plain --word-diff-regex=. ~/icelake.raw.txt ~/spr.raw.txt
command line - Using 'diff' to get character-level diff between text files - Stack Overflow
QMP reference
QEMU QMP Reference Manual — QEMU documentation
KVM_SET_CPUID2
KVM forbids KVM_SET_CPUID2 after KVM_RUN was performed on a vCPU unless the supplied CPUID data is equal to what was previously set.
[PATCH v4 5/5] KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN — Linux KVM
What will happen when we set an unsupported CPUID to KVM?
以 CPUID leaf 7 为例:现在 leaf 7 支持 subleaf 0, 1, 2,就不能只暴露给 guest subleaf 0?
better have a test, it seems it won't warn.
Host and guest share folder in QEMU
Using Samba.
QEMU redirect guest output to host console
-chardev stdio,id=virtiocon0 \ # Backend, Connects the chardev virtiocon0 with the qemu process' stdin/out.
-device virtio-serial \ # channel
-device virtconsole,chardev=virtiocon0 # front-end
virtio-serial
simply creates a communication channel between host and guest. This is necessary for the next driver.
The last one, virtconsole
creates a console device on the guest, attached to the chardev
created before, which was attached to qemu's stdio/out. The guest can then use this console device like any other tty.
The device created on the guest will depend on the kernel and how it was compiled, in linux it's usually /dev/hvc0
.
kvm - How do these qemu parameters for stdout redirection work? - Unix & Linux Stack Exchange
Instruction fetches and data access
Who perform the instruction fetch?
It is performed automatically, An instruction fetch is user-mode if CPL = 3 and is supervisor mode if CPL < 3. Instruction fetches are always performed with linear addresses (RIP is linear address).
I think an user mode instruction fetch indicates that we are running user space programs, a supervisor mode instruction fetch indicates that we may running in the kernel space.
ISE in commit message
The LASS details have been published in Chapter 11 in https://cdrdv2.intel.com/v1/dl/getContent/671368
Intel is terrible about maintaining web URLs like this. Better to give this reference as:
The December 2022 edition of the Intel Architecture Instruction Set Extensions and Future Features Programming Reference manual.
Maybe also still give the URL in a reply like this, but avoid using It in a commit message that will be preserved forever.
-Tony
Git rebase --onto
git rebase --onto allows you to rebase starting from a specific commit rather than the branch.
How to git rebase a branch with the onto command? - Stack Overflow
X11 Forwarding
However, X11 is an insecure plaintext protocol by default, so it's not recommended to expose an X Server directly. Instead, most users today use X11 Forwarding to take advantage of the security of SSH when running X11 programs remotely.
…
What You Need to Know About X11 Forwarding
KVM_GET_SUPPORTED_CPUID
struct kvm_cpuid2 {
__u32 nent;
__u32 padding;
struct kvm_cpuid_entry2 entries[0];
};
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0)
#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) /* deprecated */
#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) /* deprecated */
struct kvm_cpuid_entry2 {
__u32 function;
__u32 index;
__u32 flags;
__u32 eax;
__u32 ebx;
__u32 ecx;
__u32 edx;
__u32 padding[3];
};
What does each bit in eax (ebx, ecx, edx) mean? There are 2 meanings:
The first explanation:
- If the bit returned is 1, it means this bit could be set as 1;
- If the bit returned is 0, it means this bit must be set as 0;
The second explanation:
- If the bit returned is 1, it means this bit must be set as 1;
- If the bit returned is 0, it means this bit must be set as 0;
According to the code:
// kvm_arch_dev_ioctl
// kvm_dev_ioctl_get_cpuid
// get_cpuid_func
// do_cpuid_func
// __do_cpuid_func
// do_host_cpuid
// cpuid_count
void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
{
asm volatile("cpuid"
: "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
: "0" (id), "2" (count)
);
}
So it just the output from the host. So I prefer to the first explanation.
64-bit Process virtual address space
The 64-bit x86 virtual memory map splits the address space into two:
- the lower section (with the top bit set to 0) is user-space,
- the upper section (with the top bit set to 1) is kernel-space.
How a 64-bit process virtual address space is divided in Linux? - Unix & Linux Stack Exchange
PFEC (page fault error code)
It's 32 bit.
// arch/x86/include/asm/trap_pf.h
/*
* Page fault error code bits:
*
* bit 0 == 0: no page found 1: protection fault
* bit 1 == 0: read access 1: write access
* bit 2 == 0: kernel-mode access 1: user-mode access
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
* bit 15 == 1: SGX MMU page-fault
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
X86_PF_WRITE = 1 << 1,
X86_PF_USER = 1 << 2,
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
X86_PF_SGX = 1 << 15,
};
Search "Page Fault Error Code" in SDM outline.
stac/clac
Set AC Flag, Clear AC Flag.
Enable or disable SMAP.
static __always_inline void stac(void)
{
/* Note: a barrier is implicit in alternative() */
alternative("", __ASM_STAC, X86_FEATURE_SMAP);
}
static __always_inline void clac(void)
{
/* Note: a barrier is implicit in alternative() */
alternative("", __ASM_CLAC, X86_FEATURE_SMAP);
}
Kernel patching / text poking
Kernel deliberately initialized a temporary mm area within the lower half of the address range for text poking, see commit 4fc19708b165 ("x86/alternatives: Initialize temporary mm for patching").
Why?
ESP (EFI System Partition)
The EFI system partition (also called ESP) is an OS independent partition that acts as the storage place for the EFI bootloaders, applications and drivers to be launched by the UEFI firmware. It is mandatory for UEFI boot.
Typical mount points:
- mount ESP to
/boot
. This is the most straightforward method. - mount ESP to
/efi
. - …
Tip:
-
/efi
is a replacement[5] for the historical and now discouraged ESP mountpoint/boot/efi
.
EFI system partition - ArchWiki
How to see current mount point?
Typically, the EFI partition is mounted as /boot/efi
.
Size of the ESP
It is not fixed.
As per the Arch Linux wiki, to avoid potential problems with some EFIs, ESP size should be at least 512 MiB. 550 MiB is recommended to avoid MiB/MB confusion and accidentally creating FAT16. So, most common size guideline for EFI System Partition is between 100 MB to 550 MB.
boot - How to know the proper amount of needed disk space for EFI partition - Ask Ubuntu
Instructions will cause vm-exit
These are the instructions that always cause an exit:
- INVEPT
- INVVPID
- VMCALL
- VMCLEAR
- VMLAUNCH
- VMPTRLD
- VMPTRST
- VMRESUME
- VMXOFF
- VMXON
These instruction could case an exit:
- CLTS
- ENCLS
- ENCLV
- HLT
- IN, INS/INSB/INSW/INSD, OUT, OUTS/OUTSB/OUTSW/OUTSD
- INVLPG
- INVPCID
- LGDT, LIDT, LLDT, LTR, SGDT, SIDT, SLDT, STR
- LMSW
- MONITOR
- MOV from CR3, MOV from CR8, MOV to CR0, MOV to CR3, MOV to CR4, MOV to CR8, MOV DR
- MWAIT
- PAUSE
- RDMSR, RDPMC, RDRAND, RDSEED, RDTSC, RDTSCP
- RSM
- TPAUSE
- UMWAIT
- VMREAD
- VMWRITE
- WBINVD
- WRMSR (MSR bitmap in VMCS enable pass-though)
- XRSTORS
- XSAVES
assembly - VM Instructions If an exit is possible? - Stack Overflow
MTF VM-exit
Monitor Trap flag VM-exit.
当设置该标志位时,回到 guest 时第一条指令会再次触发 VM-Exit,且退出理由为 MTF;
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
//...
[EXIT_REASON_MONITOR_TRAP_FLAG] = handle_monitor_trap,
//...
}
This is a debugging feature.
具体可以参考 SDM 25.5.2。
VT MTF VM-Exit - OneTrainee - 博客园
x86_emulate_ops, KVM instruction emulator, how does KVM emulate an instruction?
Handle process:
Day 5: VM-exit handler, Event Injection, and CPUID Emulation
The place to advance VMX_GUEST_RIP: skip_emulated_instruction()
in arch/x86/kvm/vmx/vmx.c
.
The flow chart for handling an instruction: KVM Emulate Flowchart, there are some faults in it.
If you search in Google what is KVM instruction emulator, you will get NOTHING.
kvm_emulate_instruction
x86_emulate_instruction
Operations in x86_emulate_ops
represent the instruction emulator's interface to memory.
This ops is defined in arch/x86/kvm/kvm_emulate.h
. Like other ops, there is a global instance emulate_ops
:
static const struct x86_emulate_ops emulate_ops = {
//...
}
这里定义的函数没有通过 static_call 来调用,而是直接 indirect call 来调用。
Difference with the ops with the prefix kvm
, such as kvm_x86_ops
? I don't know.
KVM_MP_STATE_*
x86 支持哪些 KVM_MP_STATE_*
?
// 支持的
#define KVM_MP_STATE_RUNNABLE 0
#define KVM_MP_STATE_UNINITIALIZED 1
// 不确定的
#define KVM_MP_STATE_INIT_RECEIVED 2
#define KVM_MP_STATE_HALTED 3
#define KVM_MP_STATE_SIPI_RECEIVED 4
#define KVM_MP_STATE_STOPPED 5
#define KVM_MP_STATE_CHECK_STOP 6
#define KVM_MP_STATE_OPERATING 7
#define KVM_MP_STATE_LOAD 8
#define KVM_MP_STATE_AP_RESET_HOLD 9
#define KVM_MP_STATE_SUSPENDED 10
KVM VCPU Requests
An internal API enabling threads (Kernel) to request a VCPU thread to perform some activity. For example, a thread may request a VCPU to flush its TLB. The API consists of the following functions:
KVM VCPU Requests — The Linux Kernel documentation
struct kvm_vcpu {
//...
u64 requests;
//...
}
We can notice that request
is architecture-agnostic, it is a KVM-only concept.
When each time we vcpu_enter_guest
(before the actual code we enter into the non-root mode), we will check if there are pending requests by kvm_request_pending
, if so, we will handle them.
vcpu_block()
当 guest 执行 hlt 指令后会发生 VM Exit(不能让 vCPU 在非根模式下 hlt,这样会浪费 CPU 资源)。
kvm_emulate_halt (the vmexit handler)
kvm_emulate_halt_noskip
__kvm_emulate_halt
vcpu->arch.mp_state = KVM_MP_STATE_HALTED; // this can help us in vcpu_run(), we will execute vcpu_block()
Why there are kvm_vcpu_halt
and kvm_vcpu_block
?
static inline int vcpu_block(struct kvm_vcpu *vcpu)
{
//...
// guest execute halt and we want to halt polling mechanism
if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
kvm_vcpu_halt(vcpu);
// In some situations, KVM want to simply block the vcpu
else
kvm_vcpu_block(vcpu);
//...
}
KVM halt-polling机制分析 - tianshidan1998 - 博客园
VT-d Interrupt Posting Code Analysis · kernelgo
kvm_vcpu_has_events
Rep; nop
rep; nop
is indeed the same as the pause
instruction (opcode F390). It might be used for assemblers which don't support the pause instruction yet. On previous processors, this simply did nothing, just like nop
but in two bytes. On new processors which support hyperthreading, it is used as a hint to the processor that you are executing a spinloop to increase performance.
PAUSE
Instruction
当 spinlock 执行 lock () 获得锁失败后会进行 busy loop,不断检测锁状态,尝试获得锁。这么做有一个缺陷:频繁的检测会让流水线上充满了读操作。另外一个线程往流水线上丢入一个锁变量写操作的时候,必须对流水线进行重排,因为 CPU 必须保证所有读操作读到正确的值。流水线重排十分耗时,影响 lock () 的性能。
为了解决这个问题,Intel 发明了 pause 指令。这个指令的本质功能:让加锁失败时 CPU 睡眠 30 个(about)clock,从而使得读操作的频率低很多。流水线重排的代价也会小很多。
为什么SpinLock的实现中应该加上PAUSE指令? - xinyuyuanm - 博客园
Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
Spin waits, spin loop and busy spin
By the way I think these terms can also use interchangeably. Precisely these are the same terms.
Spin waits, spin loop and busy spin - Stack Overflow
Operating system jitter (OS Jitter)
Latency = Delay between an event happening in the real world and code responding to the event.
Jitter = Differences in Latencies between two or more events.
Difference between Latency and Jitter in Operating-Systems - Stack Overflow
How to reply to lore email messages?
Download the message on lore:
- click the
mbox.gz
- Install 7-zip, right click it and extract the mbox files.
Import it into thunderbird:
- Install
ImportExportTools NG
thunderbird plugin; - Thunderbird -> Local Folders (Right click) -> ImportExportTools NG -> Import mbox file
- Move to the folder you want
QEMU make check-acceptance
Use the Avocado test framework.
SMAP (Supervisor mode access prevention)
Motivation: Protecting user space from the kernel.
This extension defines a new SMAP bit in the CR4 control register; when that bit is set, any attempt to access user-space memory while running in a privileged mode will lead to a page fault.
Supervisor mode access prevention [LWN.net]
From Intel SDM:
CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software operating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can override this protection by setting EFLAGS.AC. Section 4.6 explains how access rights are determined, including the definition of supervisor-mode accesses and user-mode accessibility.
SMEP (Supervisor-Mode Execution Prevention)
From Intel SDM:
CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode. Section 4.6 explains how access rights are determined, including the definition of supervisor-mode accesses and user-mode accessibility.
PTE (page table entry) bits illustrated
From SDM Table 4-20. Format of a Page-Table Entry that Maps a 4-KByte Page:
Uboot vs. grub
U-Boot is a full bootloader but grub is only a 'second-stage' loader, i.e. it needs something to load it. In this case U-Boot also provides the EFI support needed by grub to work.
…
How to see the feature is first introduced in which platform?
ISE Chapter 1.3: INSTRUCTION SET EXTENSIONS AND FEATURE INTRODUCTION IN INTEL® 64 AND IA-32 PROCESSORS
Kerberos on Linux
Kerberos Linux is an authentication protocol for individual Linux users in any network environment.
Kinit
Thread information flags
In arch/x86/include/asm/thread_info.h
:
/*
* thread information flags
* - these are process state flags that various assembly files
* may need to access
*/
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
#define TIF_SSBD 5 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
#define TIF_NEED_FPU_LOAD 14 /* load FPU on return to userspace */
#define TIF_NOCPUID 15 /* CPUID is not accessible in userland */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_NOTIFY_SIGNAL 17 /* signal notifications exist */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
#define TIF_SPEC_FORCE_UPDATE 23 /* Force speculation MSR update in context switch */
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
TIF_NEED_RESCHED
TIF_NEED_RESCHED is set to signal that a, usually currently running, task needs to be re-scheduled so that the core the task is running on, becomes available for other tasks. In other words: the TIF_NEED_RESCHED flag is set if it has been determine that the task has used its time slice and should be preempted. For reasons, setting the flag and actually preempting the task is done at two different occasions and points in time. For example the flag may be set in an interrupted handler but the actually re-scheduling is done at a later point.
linux kernel - What does TIF_NEED_RESCHED do? - Stack Overflow