Misc ideas

x86 指令编码 (硬编码) 的结构 opcode 最少 1 个字节,最多 3 个字节:X86-64 Instruction Encoding - OSDev Wiki

ftruncate()

ftruncate() is a simple, single-purpose function, it simply sets the file to the requested length.

fallocate()

fallocate() is a Linux-specific function that does a lot more, and in very specific ways. fallocate() is used to manipulate the allocated disk space for a file, either to deallocate or preallocate it.

FALLOC_FL_PUNCH_HOLE: deallocates space (i.e., creates a hole) in the byte range starting at offset and continuing for len bytes. Within the specified range, partial filesystem blocks are zeroed, and whole filesystem blocks are removed from the file. After a successful call, subsequent reads from this range will return zeros. The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE in mode; in other words, even when punching off the end of the file, the file size does not change.(memfd 是不是可以通过这种方式来把内存的映射置空?,反正实现函数是 specific 的)。

kvm_msr_user_space

/*
 * Couldn't handle rdmsr or wrmsr in KVM, so handle it in userspace.
 * index: The MSR index caused this
 * exit_reason: can only be KVM_EXIT_X86_RDMSR or KVM_EXIT_X86_WRMSR
 * data: The MSR data
 * completion: A function to complete last vm exit before next kVM_RUN
 * For more, please see the implementation of kvm_read_guest_virt_helper()
 */
static int kvm_msr_user_space(struct kvm_vcpu *vcpu, u32 index,
			      u32 exit_reason, u64 data,
			      int (*completion)(struct kvm_vcpu *vcpu),
			      int r)

Kthreadd (PID 2)

由 0 号进程创建。

所有其它的内核线程的 ppid 都是 2,也就是说它们都是由 kthreadd thread 创建的。

If you examine the list you will see all [] processes have ppid=2 (kthreadd) while all user space processes may have ppid=1 (systemd/init).

What really is kthreadd ? — Linux Foundation Forums

How to know a process's command line (not all)

sudo readlink -f /proc/<num>/exe

Wait queue in the kernel

struct wait_queue_entry {
	unsigned int		flags;
	void			*private;
	wait_queue_func_t	func; // this is the callback function
	struct list_head	entry;
};

struct wait_queue_head {
	spinlock_t		lock;
	struct list_head	head;
};

When something interesting occurs, you call every callback for each entry in the wait queue. aka, activating.

When we want to activate this queue, we call __wake_up_common on it.

Do not mix-up wait queue and list head, wait queue use list_head which means it can act as a list item in any list.

Implementation of Epoll ❚ fd3kyt's blog

Embedded Anchor in Linux / struct list_head

struct list_head {
	struct list_head *next, *prev;
};

Design philosophy: internal link (method 2 below):

Pros:

  • Don’t need a list type for every type of element
  • Elements of a list can be of different types

Implementation of Epoll ❚ fd3kyt's blog

Array in struct in C / Flexible array member(FAM)

C struct data types may end with a flexible array member with no specified size.

If there is an array in a struct, what's the size of the struct? For an example:

struct student
{
   int stud_id;
   int name_len;
   int struct_size;
   char stud_name[];
};

The size of the structure is 4 + 4 + 4 + 0 = 12. The size i.e length of array stud_name isn’t fixed and is an FAM.

Another example I encountered:

struct kvm_cpuid2 {
	__u32 nent;
	__u32 padding;
	struct kvm_cpuid_entry2 entries[];
};

sizeof(kvm_cpuid2) is 8, sizeof(struct kvm_cpuid_entry2*) is also 8, but it become 0 because it is at the and of the struct definition.

Flexible array member - Wikipedia

Virtualization Exception (VE) / EPT-violation VE / Suppress VE bit

VE 相比于 VM-exit 的好处是不需要进行模式切换。A virtualization exception can occur only in VMX non-root operation. 也就是说在 bare-metal 的情况下,不会出现 VE。

VEs occur only with certain settings of certain VM-execution controls. Generally, these settings imply that certain conditions that would normally cause VM exits instead cause virtualization exceptions.

In particular, the setting of the “EPT-violation #VE” VM-execution control causes some EPT violations to generate VEs instead of VM exits. If the control is 0, EPT violations always cause VM exits. If instead it is 1, certain EPT violations may be converted to cause VEs instead; such EPT violations are convertible.

如果不进行模式切换,那么 Host 如何帮忙设置好 EPT Violation 对应的页表并返回呢?

暂时 VMX 和 TDX 都没有 enable 这个 feature 来做 EPT Violation handling 来处理 page fault,TDX 用 VE 来做一些 MMIO 相关的内容,详参 SHADOW_NONPRESENT_VALUE

In the settings that Linux will run in, VEs are never generated on accesses to normal, TD-private memory that has been accepted (by BIOS or with tdx_enc_status_changed()).(想起 SHADOW_NONPRESENT_VALUE 了吗,就是会 suppress VE)。

不过如果你想进一步看 code,可以找 arch/x86/kernel/traps.c 里的 DEFINE_IDTENTRY(exc_virtualization_exception)

Bit 63 of certain EPT PSE may be defined to mean suppress #VE。Bit 2:0 为 present or not:

  • 如果是 non present,EPT 翻译 GPA 的时候 walk 到了这里,会出现 EPT Violation,如果 bit 63 也是 0(表示我们不 suppress VE),那么这个 EPT violation 就是 convertible 的,可以转为 VE;
  • 如果 present:
    • 如果这个 PTE 的值是非法的,这个时候应该产生的是 EPT Misconfiguration 而不是 EPT Violation,所有的 EPT Misconfiguration 都一定会出现 VM-exit。
    • 否则:
      • 如果是一个 PTE(而不是 PSE),说明其映射了一个页。这个 PTE 应该用来翻译一个 GPA,如果访问这个 GPA 的时候出现了 EPT Violation,那么其应该 VM-exit 还是 VE 取决于 suppress VE bit。
      • 如果是一个 PSE,这个 PSE 的 suppress VE bit 是 ignore 的,不会对是否产生 VE 产生影响。
SDM
CHAPTER 26 VMX NON-ROOT OPERATION
26.5 FEATURES SPECIFIC TO VMX NON-ROOT OPERATION
26.5.7 Virtualization Exceptions

It is guest's responsibility to configure and setup #VE ISR (Interrupt Service Routine).

Like other exceptions, the processor also provides the corresponding exception information in Virtualization-Exception Information Area used by ISR, e.g. the violation permissions, guest linear and physical address. This area is populated by processor when such an exception happens.

VMFUNC, which can only be executed in guest OS (VMX non-root mode), allows software in VMX non-root operation (guest) to invoke a VM function, which is processor functionality enabled and configured by software in VMX root operation (host).

SIMPLE IS BETTER: Thoughts on Hardware Virtualization Exception

kvm_read_guest_virt

/*
 * addr: guest virtual address
 * val: the value we want
 * bytes: how long we want to read?
 * For more, please see the implementation of kvm_read_guest_virt_helper()
 */
int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
	gva_t addr, void *val, unsigned int bytes,
	struct x86_exception *exception);

Seqlock

普通的 spin lock 对待 reader 和 writer 是一视同仁,RW spin lock 给 reader 赋予了更高的优先级,那么有没有让 writer 优先的锁的机制呢?答案就是 seqlock。

Linux内核同步机制之:Seqlock

Printk stuck in kernel

Sometimes printk will stuck in the kernel such as epoll_wait function.

use trace_printk, then:

sudo cat /sys/kernel/debug/tracing/trace

c - Linux booting hang up after adding a printk statement in the kernel source code - Stack Overflow

Linux trace event subsystem

Using the Linux Kernel Tracepoints — The Linux Kernel documentation

sk_buff (skb)

sk_buff 是 Linux 网络中最核心的结构体,各层协议都依赖于 sk_buff 而存在。

sk_buff 结构体在各层协议之间传输不是用拷贝,而是通过增加协议头和移动指针来操作的。

  • 高层协议往低层协议(比如 L4 -> L2):通过往 sk_buff 中增加协议头。
  • 低层到高层(比如 L2 -> L4):通过移动指针,不删除各层协议头,为了提高 CPU 的工作效率。

Intel IBT

Some times you may see ibt=off in kernel cmdline, which means disable the Indirect Branch Tracking security feature.

Indirect Branch Tracking (IBT) that is part of Intel's Control-Flow Enforcement Technology (CET).

Scheduling in kernel

内核有两个调度器:

  • 主调度器:schedule()
  • 周期性调度器:scheduler_tick() in kernel/sched/core.c

其实任务切换的过程都是由这两个函数完成的:

  • 主调度器:大多数场景是任务(task)主动去调用,完成进程切换;
  • 周期性调度器:以定时器中断的方式定时触发。

kernel调度----基本知识介绍_扫地聖的博客-CSDN博客_kernel中断调度

Call trace for scheduler_tick():


update_process_times
scheduler_tick

Soft lockup/hard lockup

首先只有内核代码才能引起 lockup,因为用户代码是可以被抢占的,不可能形成 lockup。

其次内核代码必须处于禁止内核抢占的状态 (preemption disabled),因为 Linux 是可抢占式的内核,只在某些特定的代码区才禁止抢占,在这些代码区才有可能形成 lockup。

可以参考这篇文章:内核如何检测soft lockup与hard lockup? | Linux Performance

A soft lockup is the symptom of a task or kernel thread using and not releasing a CPU for a period of time.

  1. soft lockup 是针对单独 CPU 而不是整个系统的。
  2. soft lockup 指的是发生的 CPU 上在 20 秒 (默认) 中没有发生调度切换。

As its name, it is a software-based problem, not a hardware problem (hardware has bus lock).

One possible soft lockup reason:

  1. Write a dead loop in kernel code
  2. CONFIG_PREEMPT is not enabled so kernel thread cannot be preempted.

This article is a good resource.

Linux内核为什么会发生soft lockup?_confirmwz的博客-CSDN博客

这边文章也不错:

如何启用linux内核异常自动重启机制_watchdog_thresh-CSDN博客

How to know the file is in which filesystem?

mount -l

or

df .

Rcp connection refused

Use scp instead rcp.

Procfs / sysfs

两者都是由 systemd 挂载的:

/* Mount /proc, /sys and friends, so that /proc/cmdline and /proc/$PID/fd is available. */
r = mount_setup(loaded_policy, skip_setup);

Watchdog

It is a hardware timer. Watchdog 也是内核里的一个 clocksource 哦。

Detect and recover from computer malfunctions.

During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective actions. The corrective actions typically include placing the computer and associated hardware in a safe state and invoking a computer reboot.

Watchdog timer - Wikipedia

Why sometimes a process cannot be killed?

That usually indicates one of three things:

  • a network filesystem that isn't responding;
  • a kernel bug;
  • a hardware bug.

linux - How to kill a process which can't be killed without rebooting? - Unix & Linux Stack Exchange

Install kernel by rpm

install the kernel:

  • "i": install.
  • "v": verbose. Print verbose information.
  • "h": hash. Print 50 hash marks as the package archive is unpacked. Use with -v --verbose for a nicer display.
sudo rpm -ivh <name>.rpm

Set default kernel in CentOS by grubby

List installed kernels:

sudo grubby --info=ALL | grep "^index\|^kernel" 

choose the kernel:

sudo grubby --set-default-index=<num>

PIIX (PCI IDE ISA Xcelerator)

Is a family of Intel southbridge microchips.

There are some files in QEMU, such as hw/i386/pc_piix.c

IDE, PATA, ATA

They are the same. Parallel ATA (PATA), originally ATA, also known as IDE(Integrated Drive Electronics).

It is a standard.

When SATA ( Serial ATA ) came out, people started using PATA (Parallel ATA) to refer to the older parallel connected bus.

Do not mix IDE up with ISA, ISA is an old technology that has been replaced by PCI, PCIe and so on.

What is the difference between ISA and PCI? - CAVSI

Zero copy

Zero copy 就是绕过了 page cache(内核缓冲区),直接将数据从设备读到用户空间。

Linux中的零拷贝技术,sendfile,splice和tee之间的区别是什么? - 知乎

Readelf vs. objdump

The reason is that objdump sees an ELF file through a BFD filter of the world; if BFD has a bug where, say, it disagrees about a machine constant in e_flags, then the odds are good that it will remain internally consistent. The linker sees it the BFD way, objdump sees it the BFD way, GAS sees it the BFD way. There was need for a tool to go find out what the file actually says.

This is why the readelf program does not link against the BFD library - it exists as an independent program to help verify the correct working of BFD.

linux - readelf vs. objdump: why are both needed - Stack Overflow

Process image

Seems not used now time?

Now a days, process context switch occurs through exchanging PCBs (as in Process Context Blocks) with CPU registers. The outgoing process does not get moved to a disk image in secondary storage (ie swapping). That did happen in the old days before paging but no longer.

I don't believe that modern operating systems use what I described in the answer anymore. I'm pretty sure this is an artifact from Unix

memory - What's the difference between a process and a process image? - Stack Overflow

Why code shouldn't modify itself?

That's the restriction imposed by the Operating System, actually OS can modify themselves.

OS forbit self-modifying code due to the following reasons:

  • Computer virus.
  • Pipelines, when current instruction is executing, actually the next instruction is being decoded, so if you modify the next instruction, it will break the pipeline design.

What NOT to do: Self Modifying Code - Computerphile - YouTube

B4: Download applicable patch series from lore

Install: pip install b4

Download: b4 am <message_id>

apply: git am patch.mbx

Introducing b4 and patch attestation — Konstantin Ryabitsev

Pkg-config

Retrieve information about installed libraries in the system.

Here is a typical usage scenario in a Makefile:

gcc glib_event_loop.c `pkg-config --cflags --libs glib-2.0` # which flags should be used? which lib should be linked?

Instruction (mnemonic) and opcode

Instruction and opcode are not 1-to-1, they are many-to-many.

For example, je and jz share the same opcode: assembly - Why multiple instructions with same opcode and working? - Stack Overflow

Also, A jmp can be assembled to different opcodes:

There are different jmp instructions for relative or absolute jumps or far or near jumps. The assembler will choose one of them (e.g., the shortest one) and translate the mnemonic (jmp) to the corresponding machine code.

assembly - How do JMP and CALL work in assembler? - Stack Overflow

JMP — Jump

Obsidian code block supported language

Prism

Local label/Numeric label (assembly)

There is also a "local label" concept in C.

numeric label consists of a single digit in the range zero (0) through nine (9) followed by a colon (:).

Numeric labels are used only for local reference and are not included in the object file's symbol table.

Section and segment

Don't mix-up the "section" in assembly code and the "segment" in x86 architecture, they are totally different things!

IA-32 Assembly Language Reference Manual

Particularly important information are (besides lengths):

  • section: tell the linker if a section is either:
    • raw data to be loaded into memory, e.g. .data, .text, etc.
    • or formatted metadata about other sections, that will be used by the linker, but disappear at runtime e.g. .symtab, .srttab, .rela.text
  • segment: tells the operating system:
    • where should a segment be loaded into virtual memory
    • what permissions the segments have (read, write, execute). Remember that this can be efficiently enforced by the processor: How does x86 paging work?

Does a segment contain one or more sections?

Yes, and it is the linker that puts sections into segments.

In Binutils, how sections are put into segments by ld is determined by a text file called a linker script.

Once the executable is linked, it is only possible to know which section went to which segment if the linker stores the optional section header in the executable

linux - What's the difference of section and segment in ELF file format - Stack Overflow

The ELF format is an executable and linking format. Segments are used for the former, sections for the latter.

Once the binary is linked, sections can be completely discarded from it, and only segments are needed at runtime.

linux kernel - When is an ELF .text segment not an ELF .text segment? - Stack Overflow

Any given ELF file might have only segments, or only sections, or both segments and sections. In order to be executable it must have segments to load. In order to be linkable, it must have sections describing what is where.

assembly - Section vs. segment? - Stack Overflow

.byte #Assembly

One possible use case is to write down the opcode directly rather than using the mnemonic. such as .byte 0x90 is equal to nop.

What is the use of .byte assembler directive in gnu assembly? - Stack Overflow

.align #Assembly

For example .align 8 advances the location counter until it is a multiple of 8.

.globl #Assembly

.global makes the symbol visible to ld. If you define symbol in your partial program, its value is made available to other partial programs that are linked with it. Otherwise, symbol takes its attributes from a symbol of the same name from another file linked into the same program.

Using as

Location counter #Assembler

The location counter keeps track of the next available memory location that can be assigned within the current segment.

UD1

Raise invalid opcode exception.

Use the 0F0B opcode (UD2 instruction), the 0FB9H opcode (UD1 instruction), or the 0FFFH opcode (UD0 instruction) when deliberately trying to generate an invalid opcode exception (#UD).

The instruction pointer saved by delivery of the exception references the UD instruction (and not the following instruction).

Undocumented instructions

UD1 is originally such an instruction which is used by the implementation of static_call.

x86 instruction listings - Wikipedia

EXPORT_SYMBOL_GPL

When a loadable module is inserted, any references it makes to kernel functions and data structures must be linked to the current running kernel. The module loader does not provide access to all kernel symbols, however; only those which have been explicitly exported are available.

Exports come in two flavors: vanilla (EXPORT_SYMBOL) and GPL-only (EXPORT_SYMBOL_GPL). The former are available to any kernel module, while the latter cannot be used by any modules which do not carry a GPL-compatible license.

macros - What is EXPORT_SYMBOL_GPL in Linux kernel code? - Stack Overflow

Why rebasing onto a previous commit will have conflicts?

Maybe there are merge commits.

Try to add --rebase-merges option:

git rebase -i --rebase-merges <commit>

Rebasing a Git merge commit - Stack Overflow

Header guard

#ifndef HEADER_H_NAME
#define HEADER_H_NAME
/*…
…*/
#endif

SwitchyOmega forgetting

部分切换规则重启后不见了 · Issue #1476 · FelisCatus/SwitchyOmega

Directive (programming)

In C preprocessor: such as #define and #include are referred to as preprocessor directives.

In Assembly: directives, also referred to as pseudo-operations or "pseudo-ops", which are keywords beginning with a period that behave similarly to preprocessor directives in C.

GAS / GNU Assembly Format

Top

请启用虚拟机平台 Windows 功能并确保在 BIOS 中启用虚拟化

管理员打开 powershell,输入:

bcdedit /set hypervisorlaunchtype auto

重启电脑。

Hyper-V,Windows 虚拟机监控程序平台,虚拟机平台

虚拟机平台:底层的虚拟机平台。

Hyper-V:上层的虚拟机管理软件,相当于微软开发的类似 VMware 的产品。

Hyper-V 基于虚拟机平台。

WSL2 只需要虚拟机平台打开。

WSA 需要虚拟机平台和 Hyper-V 都打开。

Symbol

Types

function, indirect function, data object, thread local data object, common data object, globally unique data object.

Type

Section and Relocation #Assembly

Roughly, a section is a range of addresses, with no gaps; A section is the smallest unit of an object file that can be relocated.

ld moves blocks of bytes of your program to their run-time addresses. These blocks slide to their run-time addresses as rigid units; their length does not change and neither does the order of bytes within them. Such a rigid unit is called a section. Assigning run-time addresses to sections is called relocation.

Using as

Allocable section

its contents are to be placed in memory at program runtime.

Subsection

Within each section, there can be numbered subsections with values from 0 to 8192.

Subsections are optional. If you do not use subsections, everything goes in subsection number zero. So .text 0 equals to .text.

Each subsection is zero-padded up to a multiple of four bytes.

2 motivation:

  1. group the objects together even though they are not contiguous in the assembler source code.
  2. Separate objects even though they are in the same section.

Sub-Sections

For motivation 1:

If sections are relocatable, why sometimes defining a label before them?

#define ANNOTATE_NOENDBR					\
	"986: \n\t"						\
	".pushsection .discard.noendbr\n\t"			\
	_ASM_PTR " 986b\n\t"					\
	".popsection\n\t"

Sendmsg(), write(), send(), Etc.

ssize_t send(int sockfd, const void *buf, size_t len, int flags);

ssize_t sendto(int sockfd, const void *buf, size_t len, int flags,
               const struct sockaddr *dest_addr, socklen_t addrlen);

ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);

Destination: the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size.

The send() call may be used only when the socket is in a connected state.

The only difference between send() and write() is the presence of flags. With a zero flags argument, send() is equivalent to write().

If sendto() is used on a connection-mode socket, the arguments dest_addr and addrlen are ignored. (sendto 可用于无连接).

readv() / writev()

readv() 称为散布读,即将文件中若干连续的数据块读入内存分散的缓冲区中。

writev() 称为聚集写,即收集内存中分散的若干缓冲区中的数据写至文件的连续区域。

Defined in fs/read_write.c.

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

writev() is a bit similar to sendmsg(), except:

  • with sendmsg(), you can specify a destination address for use with connectionless sockets like UDP.
  • with sendmsg(), you can also add ancillary data

sendmsg()

You can implement the functionality of sendto() using sendmsg(), but sendmsg() also lets you do lots of other nifty stuff you can't do via sendto()… Eg: send control/ancillary messages, or send multiple separate chunks of data in a single operation (via iovec scatter/gather arrays)… Basically, sendmsg() is the ultimate low-level socket sending function.

When will sendmsg block:

  1. If space is not available at the sending socket to hold the message to be transmitted
    • O_NONBLOCK not set: shall block until space is available.
    • O_NONBLOCK is set: shall fail.

For datagram or message sockets, you send just one datagram or message with a single sendmsg call; not one per buffer element.

It looks like send, and sendto are just wrappers for sendmsg in source code, that build the struct msghdr for you. And in fact, the UDP sendmsg implementation makes room for one UDP header per sendmsg call.

It can send file descriptor to another process (Maybe the descriptor will have another value after the duplicating).

Defined in net/socket.c.

c - How to use sendmsg to send a file-descriptor via sockets between 2 processes? - Stack Overflow

struct msghdr {
	void		*msg_name;	/* ptr to socket address structure */
	int		msg_namelen;	/* size of socket address structure */
	int		msg_inq;	/* output, data left in socket */
	struct iov_iter	msg_iter;	/* data */
	/*
	 * Ancillary data. msg_control_user is the user buffer used for the
	 * recv* side when msg_control_is_user is set, msg_control is the kernel
	 * buffer used for all other cases.
	 */
	union {
		void		*msg_control;
		void __user	*msg_control_user;
	};
	bool		msg_control_is_user : 1;
	bool		msg_get_inq : 1;/* return INQ after receive */
	unsigned int	msg_flags;	/* flags on received message */
	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
	struct ubuf_info *msg_ubuf;
	int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
			    struct iov_iter *from, size_t length);
};

Will sendmsg copy the user specified data in the iovec to kernel space?

默认应该不是 zero-copy,但是这个 patch set 引入了对于 zero-copy 的支持 socket sendmsg MSG_ZEROCOPY [LWN.net]。加了一个 MSG_ZEROCOPY 的 flag。

msg_name

This one is the address of the destination and is optional, because the socket may in a connected state.

msg_control, msg_controllen

msg_control points to a buffer for other protocol control-related messages or miscellaneous ancillary data(指向与协议控制相关的消息或者辅助数据)。

Ancillary data is a sequence of cmsghdr structures with appended data. See the specific protocol man pages for the available control message types:

struct cmsghdr {
	size_t cmsg_len;    /* Data byte count, including header (type is socklen_t in POSIX) */
	int    cmsg_level;  /* Originating protocol */
	int    cmsg_type;   /* Protocol-specific type */
	/* followed by unsigned char cmsg_data[]; */
};

msg_controllen is for counting the number of msg_control entries.

msg_flags

There are 2 flags:

  • One is served as the parameter of this function, the int flags field.
  • The other is the msg_flags, which is in the msghdr.

sendmsg 中,会忽略 msg_flags 成员,它会按照参数 flags 直接处理。那么当我们去设置 MSG_DONTWAIT(临时非阻塞)是就把 flags 设为 MSG_DONTWAIT 而不是 msg_flags

recvmsg 中,内核会使用 msg_flags 参数地址来存放一些输出(On successful completion, the msg_flags member of the message header is the bitwise-inclusive OR of all of the following flags that indicate conditions detected for the received message)。

sendmsg() -- send message from socket using structure

recvmsg

send System call

A call to send has 3 possible outcomes:

  1. There is at least one byte available in the send buffer →send succeeds and returns the number of bytes accepted (possibly fewer than you asked for).
  2. The send buffer is completely full at the time you call send.
    • if the socket is blocking, send blocks
    • if the socket is non-blocking, send fails with EWOULDBLOCK/EAGAIN
  3. An error occurred (e.g. user pulled network cable, connection reset by peer) →send fails with another error

c - When a non-blocking send only transfers partial data, can we assume it would return EWOULDBLOCK the next call? - Stack Overflow

Socket

struct socket {
	socket_state		state;
	short			type;
	unsigned long		flags;
	struct file		*file;
	struct sock		*sk;
	const struct proto_ops	*ops;
	struct socket_wq	wq;
};

Socket type

enum sock_type {
	SOCK_DGRAM = 1,
	SOCK_STREAM = 2,
	SOCK_RAW = 3,
	SOCK_RDM = 4,
	SOCK_SEQPACKET = 5,
	SOCK_DCCP = 6,
	SOCK_PACKET = 10,
};

Connection-mode socket includes: SOCK_STREAM, SOCK_SEQPACKET.

listen(fd, backlog)

某一时刻同时允许最多有 backlog 个客户端和服务器端进行连接。

The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.

listen - Linux manual page

socket->sk, Struct sock

每个 socket 数据结构都有一个 sock 数据结构成员,sock 是对 socket 的扩充,两者一一对应:

  • socket->sk 指向对应的 sock;
  • sock->socket 指向对应的 socket;

socket 和 sock 是同一事物的两个侧面,为什么不把两个数据结构合并成一个呢?这是因为 socket 是 inode 结构中的一部分(union):

struct inode {
	union {
		//...
		struct ext2_inode_info ext2_i;
		struct ext3_inode_info ext3_i;
		struct socket socket_i;
		//...
	} u;
};

由于 socket 有大量的结构成分,如果把这些成分全部放到 socket 结构中,则 inode 结构中的这个 union 就会变得很大,而对于其他文件系统(ext2_i, ext3_i)这个 union 是不需要这么大的,所以会造成巨大浪费。

系统中使用 inode 的频率要远远超过使用 socket 的频率,所以 socket 应该为 inode 做出让步。解决的办法就是分成两部分:

  • 把与 文件系统 关系密切的放在 socket 结构中;
  • 把与 通信 关系密切的放在另一个单独结构 sock 中。

socket和sock的一些分析 - kk Blog —— 通用基础

sk_data_ready

sk_data_ready: callback to indicate there is data to be processed.

使用此函数来唤醒等待的进程。

UDS 的 sk_data_ready 指向的应该是 sock_def_readable() in net/core/sock.c。这个函数会进一步 wake up 对应的 wait queue,为其中的每一个 entry 调用其之前注册的 func,对于 epitem 就是 ep_poll_callback

Unix domain socket (UDS) IPC

Exchanging data between processes executing on the same host operating system.

Can be connection-oriented (type SOCK_STREAM) or connectionless (type SOCK_DGRAM).

Address Family: AF_UNIX (also known as AF_LOCAL).

Code analysis: Linux 网络IO 优化篇 : 一种本机网络 IO 方法,让你的性能翻倍! | HeapDump性能社区

Diff. with named pipe?

c - unix domain socket VS named pipes? - Stack Overflow

Misc

wake_up_interruptible_sync_poll,只是会调用到 socket 等待队列项上设置的回调函数,并不一定有唤醒进程的操作。

图解 | 深入揭秘 epoll 是如何实现 IO 多路复用的! - 知乎

Clocksource

Typically the clock source is a monotonic, atomic counter which will provide n bits which count from 0 to (2^n-1) and then wraps around to 0 and start over.

系统中可能会同时 注册 多个 clocksource,only 1 clocksource can be current (jiffies is the default clocksource).

如果你用 linux 的 date 命令获取当前时间,内核会读取当前的 clock source,转换并返回合适的时间单位给用户空间。

内核用一个 struct clocksource 对真实的时钟源进行软件抽象。

/**
 * struct clocksource - hardware abstraction for a free running counter
 *	Provides mostly state-free accessors to the underlying hardware.
 *	This is the structure used for system time.
 *
 * @read:		Returns a cycle value, passes clocksource as argument
 * @mask:		Bitmask for two's complement
 *			subtraction of non 64 bit counters
 * @mult:		Cycle to nanosecond multiplier
 * @shift:		Cycle to nanosecond divisor (power of two)
 * @max_idle_ns:	Maximum idle time permitted by the clocksource (nsecs)
 * @maxadj:		Maximum adjustment value to mult (~11%)
 * @uncertainty_margin:	Maximum uncertainty in nanoseconds per half second.
 *			Zero says to use default WATCHDOG_THRESHOLD.
 * @archdata:		Optional arch-specific data
 * @max_cycles:		Maximum safe cycle value which won't overflow on
 *			multiplication
 * @name:		Pointer to clocksource name
 * @list:		List head for registration (internal)
 * @rating:		Rating value for selection (higher is better)
 *			To avoid rating inflation the following
 *			list should give you a guide as to how
 *			to assign your clocksource a rating
 *			1-99: Unfit for real use
 *				Only available for bootup and testing purposes.
 *			100-199: Base level usability.
 *				Functional for real use, but not desired.
 *			200-299: Good.
 *				A correct and usable clocksource.
 *			300-399: Desired.
 *				A reasonably fast and accurate clocksource.
 *			400-499: Perfect
 *				The ideal clocksource. A must-use where
 *				available.
 * @id:			Defaults to CSID_GENERIC. The id value is captured
 *			in certain snapshot functions to allow callers to
 *			validate the clocksource from which the snapshot was
 *			taken.
 * @flags:		Flags describing special properties
 * @enable:		Optional function to enable the clocksource
 * @disable:		Optional function to disable the clocksource
 * @suspend:		Optional suspend function for the clocksource
 * @resume:		Optional resume function for the clocksource
 * @mark_unstable:	Optional function to inform the clocksource driver that
 *			the watchdog marked the clocksource unstable
 * @tick_stable:        Optional function called periodically from the watchdog
 *			code to provide stable synchronization points
 * @wd_list:		List head to enqueue into the watchdog list (internal)
 * @cs_last:		Last clocksource value for clocksource watchdog
 * @wd_last:		Last watchdog value corresponding to @cs_last
 * @owner:		Module reference, must be set by clocksource in modules
 *
 * Note: This struct is not used in hotpathes of the timekeeping code
 * because the timekeeper caches the hot path fields in its own data
 * structure, so no cache line alignment is required,
 *
 * The pointer to the clocksource itself is handed to the read
 * callback. If you need extra information there you can wrap struct
 * clocksource into your own struct. Depending on the amount of
 * information you need you should consider to cache line align that
 * structure.
 */
struct clocksource {
	u64			(*read)(struct clocksource *cs);
	u64			mask;
	u32			mult;
	u32			shift;
	u64			max_idle_ns;
	u32			maxadj;
	u32			uncertainty_margin;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64			max_cycles;
	const char		*name;
	struct list_head	list;
	int			rating;
	enum clocksource_ids	id;
	enum vdso_clock_mode	vdso_clock_mode;
	unsigned long		flags;

	int			(*enable)(struct clocksource *cs);
	void			(*disable)(struct clocksource *cs);
	void			(*suspend)(struct clocksource *cs);
	void			(*resume)(struct clocksource *cs);
	void			(*mark_unstable)(struct clocksource *cs);
	void			(*tick_stable)(struct clocksource *cs);

	/* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
	/* Watchdog related data, used by the framework */
	struct list_head	wd_list;
	u64			cs_last;
	u64			wd_last;
#endif
	struct module		*owner;
};

Rating 字段

同一个设备下,可以有多个时钟源,每个时钟源的精度由驱动它的时钟频率决定。

clocksource 结构中有一个 rating 字段,代表着该时钟源的精度范围,它的取值范围如下:

  • 1-99: 不适合于用作实际的时钟源,只用于启动过程或用于测试;
  • 100-199:基本可用,可用作真实的时钟源,但不推荐;
  • 200-299:精度较好,可用作真实的时钟源;
  • 300-399:很好,精确的时钟源;
  • 400-499:理想的时钟源,如有可能就必须选择它作为时钟源;

Linux时间子系统之一:clock source - kk Blog —— 通用基础

Bootloader (Grub/grub2)

grub2 is configured through /etc/default/grub file.

The grub.cfg file is the GRUB configuration file. It is generated by the grub2-mkconfig program using a set of primary configuration files and the grub default file as a source for user configuration specifications.

Note that any manual changes to /etc/default/grub require rebuilding the grub.cfg file by grub2-mkconfig.

update-grub , at least in Debian and its relatives like Ubuntu, is basically just a wrapper around grub-mkconfig.

boot/grub/grubenv 作用:

  • 如果在 /etc/default/grub 中设定 GRUB_DEFAULT=saved,则按这一段,把本次启动项记录下来,做为下次默认启动项;
  • 把 default=x 记录下来,下次启动时调用为 set default=x 而不是默认的 set default=0;
  • 如果由于软、硬件原因不能启动的,把 recordfail=1 记录下来,下次启动就会据此设定 set timeout=-1,就是出现菜单后不会进入默认启动,要手动按 enter 才进入启动。

/boot/grub2/grubenv is a regular file on Non-uEFI machine. (NOTE: it's a symlink to /boot/efi/EFI/redhat/grubenv on uEFI machine).

Why after grub2-mkconfig and reboot, cat /proc/cmdline is not updated?

Why do we need a bootloader?

A BIOS would need to know how to load a kernel, and this would make the BIOS over complicated: imagine a BIOS that needs to know how to load the many different operating systems available, how to pass kernel parameters to them etc… BIOS is a firmware in ROM and not flexible to replace, which means we cannot change the kernel parameters easily…

Why BIOS is a firmware not flexible to replace?

我觉得可能是因为 BIOS 本来就是一个很底层的东西,一旦坏了就只能去找厂家修了(但是如果 OS 坏了可以重新刷系统),所以最好不要去经常更改 BIOS 的内容(除非要 update),能不更改就不更改是最好的。所以如果我们要改内核参数,还是设计一个中间层比如 bootloader 来改比较好,作为一个将 BIOS 和 OS 解耦的工具。

Thus, it only initializes the hardware and jumps to a known place where the bootloader is stored; then, the control is passed to it.

bios - Why do we need a boot loader? - Super User

How does BIOS find the bootloader?

For MBR: BIOS search for Master Boot Record (MBR) which contains the bootloader.

For EFI, the ESP will be mounted by EFI firmware, The ESP should be in FAT32 file system. Actually, you can switch bootloaders by changing the boot sequence in my BIOS, just like you can switch kernel when you in a bootloader menu.

ESP/EFI/boot/bootx64.efi, For EFI, This is the only bootloader pathname that the UEFI firmware on 64-bit X86 systems will look for. On Ubuntu, we can find ESP/EFI/ubuntu/grubx64.efi, this is actually the EFI application^. On UEFI based systems, GRUB works by installing an EFI application into ESP/EFI/<id>/grubx64.efi, and id is replaced with an identifier specified in the grub-install command line. GRUB will create an entry in the EFI variables^ containing the path ESP/EFI/<id>/grubx64.efi so the EFI firmware can find grubx64.efi and load it.

Which EFI variable does grub create or modify to let EFI can find the grubx64.efi?

When Grub installs itself on an EFI system, it typically creates or modifies the BootOrder and BootXXXX EFI variables to allow the system firmware to locate and boot the Grub EFI bootloader (grubx64.efi).

The BootOrder variable is a global variable that specifies the order in which the firmware should attempt to boot the available EFI boot loaders. The BootOrder variable contains a list of one or more BootXXXX variables, where XXXX is a four-digit hexadecimal identifier. Each BootXXXX variable specifies a unique boot option, which includes information about the EFI boot loader to be executed and the device from which it should be loaded. 也就是说 BootXXXX 和 boot loader 是一对一的。

When Grub installs itself, it typically creates a new BootXXXX variable with a unique identifier, and sets the DevicePath and FilePath fields of the variable to point to the location of the grubx64.efi bootloader on the EFI system partition. Grub also updates the BootOrder variable to include the new BootXXXX entry, so that the firmware will attempt to boot Grub before any other boot options.

Once the BootOrder variable has been updated, the firmware will automatically attempt to boot the Grub bootloader the next time the system is started. If Grub is successfully loaded, it will then present the user with a boot menu and allow them to select from the available operating systems or boot options.

host_initialized, msr_data, msr_info

host_initialized means the MSR request is issued from host userspace, not from guest.

struct msr_data {
	bool host_initiated; // this access is initiated by host.
	u32 index; // the index of the MSR, i.e., the address
	u64 data; // the value of the MSR
};

How does host_initialized be set and used?

For a MSR write request:

kvm_arch_vcpu_ioctl
	case KVM_SET_MSRS:
        do_set_msr
            kvm_set_msr_ignored_check
                __kvm_set_msr // host_initiated is set here
                    vmx_set_msr // host_initiated is used here to allow taking difference actions on different value

For a MSR read request:

kvm_arch_vcpu_ioctl
    case KVM_GET_MSRS:
        do_get_msr
            kvm_get_msr_ignored_check
                __kvm_get_msr // host_initiated is set here
                    vmx_get_msr // host_initiated is used here to allow taking difference actions on different value

In case of reading an MSR, there are 2 functions which both call function kvm_get_msr_ignored_check(…, bool host_initiated):

  • kvm_get/set_msr (host_initiated is false)
  • do_get_msr (host_initiated is true)

Call trace for kvm_get_msr (reversed):

kvm_get_msr
	emulator_get_msr
		 ops->get_msr()

It is accessed by KVM instruction emulator (emulate_ops), so the access if from guest.

Call trace for do_get_msr (reversed):

do_get_msr
	kvm_arch_vcpu_ioctl
	kvm_arch_dev_ioctl

That's why the host_initiated is true, because it is called by host userspace.

Get Free Page flags (GFP flags)

Linux provides a variety of APIs for memory allocation. kmalloc, vmalloc, etc.

Most of the memory allocation APIs use GFP flags to express how that memory should be allocated. The GFP acronym stands for “get free pages”, the underlying memory allocation function.

  • Most of the time GFP_KERNEL is what you need. GFP_KERNEL implies GFP_RECLAIM, which means that direct reclaim may be triggered under memory pressure; the calling context must be allowed to sleep. There is the handy GFP_KERNEL_ACCOUNT shortcut for GFP_KERNEL allocations that should be accounted.
  • If the allocation is performed from an atomic context, e.g interrupt handler, use GFP_NOWAIT. under memory pressure GFP_NOWAIT allocation is likely to fail.
  • More…

Memory Allocation Guide — The Linux Kernel documentation

GFP_DMA

When you allocate memory with the GFP_DMA flag set, the kernel prioritizes memory zones suitable for DMA transfers. These zones typically meet the following criteria:

  • Physically Contiguous: The allocated memory should consist of physically contiguous pages. This means the memory pages reside in a continuous block of physical addresses, crucial for efficient DMA operations.
  • Below 16 GB (Legacy Systems): In older 32-bit systems, the DMA zone might be restricted to physical memory below a 16 GB address boundary due to limitations in addressing capabilities. However, this restriction is less relevant in modern 64-bit systems.

dma_alloc_coherent() returns address range for which proper memory attributes are already set so cache effect is handled naturally. We need not to do any cache operation for these addresses.

If we use address allocated by kmalloc() for DMA operation then we need to do extra cache operation like cache clean and cache invalidate based on direction of transfer.

Who Uses GFP_DMA?

Device Drivers: Device drivers that require DMA functionality for data transfer with their respective devices often use the GFP_DMA flag during memory allocation. This ensures the allocated memory buffers are suitable for DMA operations.

linux kernel - GFP_KERNEL vs GFP_DMA and kmalloc() vs dma_alloc_coherent() - Stack Overflow