Over that time period, I have seen dozens, possibly hundreds, of papers that refer to micro-ops as “converting CISC instructions into RISC instructions.” (Do an internet search for “uop risc cisc” if you’d like to see every possible opinion on this.) A common phrasing goes like this:

“The x86 ISA implements a variety of complex instructions that are internally broken down into RISC-like micro-ops…” with references to “…the micro-op ISA…” This is not just wrong. It is wrong-headed. IEEE Xplore Full-Text PDF:

The bootloader is part of the Operating System.

/dev/sda 显示的是一个分区,而不是一整个磁盘设备。

64bit Process memory layout

How a 64-bit process virtual address space is divided in Linux? - Unix & Linux Stack Exchange

How to get the function name by function pointer?

#include <dlfcn.h>

int main()
{
	Dl_info  DlInfo;
	int  nRet;
	if ((nRet = dladdr(h->func, &DlInfo)) != 0)
	        printf("%s\n", DlInfo.dli_sname);
	else
			printf("error\n");
	return 0;
}

c - dladdr doesn't return the function name - Stack Overflow

File descriptors

这篇文章讲的很精炼,适合一读(里面的图很直观)。

Linux文件描述符到底是什么?

The file descriptor, i.e. the 4 in your example, is the index into the process-specific file descriptor table, not the open file table.

kernel - How can same fd in different processes point to the same file? - Unix & Linux Stack Exchange

What is file descriptors 0, 1 and 2?

File descriptor is process scope.

0, 1 and 2 are each process's stdin, stdout and stderr.

file descriptor flags

open - Linux manual page

Patchwork clients are handful for applying community patches

I recommend git-pw, it supports series merge.

Clients — Patchwork 3.1.0.alpha.0 documentation

How to rebuild a ubuntu package from source?

mkdir foo
cd foo
apt-get source <the package name>
cd <the package>
dpkg-buildpackage -rfakeroot -b --no-sign
cd ..
sudo dpkg -i <the deb>

# to checkout and apply the patches automatically
dpkg-source -x <package>.dsc

How to install dependency package?

sudo apt-get build-dep <package>

串口

Modern devices use an integrated circuit called a UART to implement a serial port.

TTY

TTY: Teletypewriter

PTY: pseudo-TTY

上古时期,terminal 包含 tty,tty 就是 terminal 的一种。

现代 terminal 并不是直接和 shell 通信,而是通过 tty 设备,让 tty 设备来和 shell 通信。

终端、Shell、tty 和控制台(console)有什么区别? - 大川的回答 - 知乎 https://www.zhihu.com/question/21711307/answer/2231006377

Why tty?

The UART driver would then deliver the incoming bytes directly to some application process. But such an approach would lack the following essential features.

Together, a particular triplet of UART driver, line discipline instance and TTY driver may be referred to as a TTY device, or sometimes just TTY. A user process can affect the behaviour of any TTY device by manipulating the corresponding device file under /dev.

The TTY demystified

What is /dev/hvc0 ?

It only appears on guest, not on host.

What is /dev/console?

Now /dev/console and /dev/tty0 represent current display and usually are the same. You can override it for example by adding console=ttyS0 to kernel command line grub.conf. After that your /dev/tty0 is the monitor and /dev/console is /dev/ttyS0.

What is /dev/tty?

/dev/tty is a special file, representing the terminal for the current process. So, when you echo 1 > /dev/tty, your message ('1') will appear on your screen.

/dev/tty doesn't 'contain' anything as such, but you can read from it and write to it.

What is /dev/tty0?

tty0 is just an alias for the current active console.

打开 Desktop 上的 Ubuntu Server 系统,会进入一个默认的命令行。在另一台机器上 SSH 过来,输入:

echo 1 > /dev/tty0

你会在这个 Desktop 上的显示器上看到输出,因为另一台机器的 SSH 不是 console。

所以,/dev/tty 的“当前”是相对于进程而言的(你执行上述命令,只会在当前的命令行输出内容 1),而 /dev/tty0 的“当前”是相对于当前计算机而言的。

What is pty?

There is no such a thing /dev/pty[n].

/dev/tty vs. /dev/pts

tty is a native terminal device, the backend is either hardware or kernel emulated.

pty (pseudo terminal device) is a terminal device which is emulated by an other program (example: xtermscreen, or ssh are such programs). A pts is the slave part of a pty.

A pseudo-terminal slave (pts) session is used when connecting to a Linux computer via another application, such as SSH or PuTTY.

linux - Difference between pts and tty - Unix & Linux Stack Exchange

Why there only one pts /dev/pts?

It is a directory, access it using /dev/pts/<number>, that's how it get implemented.

Each SSH session corresponds a pts, you can type tty to see the pts.

Although there are only several entries in ls /dev/pts/, but when you open more and more SSH sessions, the result will also expand, which means it can scale automatically.

/dev/sda And /dev/vda

/dev/sda is the first detected disk of IDE/SATA/SCSI type. In this case, emulated(full virtualized) by the hypervisor.

/dev/vda is the first detected paravirtualizated disk driver. It is faster than emulated sdX devices if both are referred to the same disk, because there are less overhead in its operation compared to an emulated drive.

In particular, /dev/vd* devices are using the virtio paravirtual disk driver.

virtualization - what is the difference between /dev/vda and /dev/sda - Server Fault

如何查看根目录在哪一个分区

findmnt /

MSR type

MSRs scope can be:

  • Unique
  • Package
  • Core
  • Thread

So not all MSRs are owned by each VCPU.

MSRs can also be:

So sometimes, MSRs can also be regarded as part of the CPU model.

Re: [PATCH v5 05/15] KVM: nVMX: Let userspace set nVMX MSR to any host supported value - Sean Christopherson

如何查看 BIOS 的版本

dmidecode -t 0

Microcode Update Procedure

The microcode update binary is uploaded to the CPU in the following way:

  • First, the patch must be placed in accessible virtual address space. Then the 64-bit virtual address must be written to Model-Specific Register (MSR) 0xc0010020.
  • Depending on the update size and microarchitecture, the wrmsr instruction initiating the update may take around 5,000 cycles to complete.
  • Rejection of a patch causes a general protection fault. Internally, the update mechanism verifies the checksum, copies the triads to microcode patch RAM, and stores the match register fields in the actual match registers. Patch RAM is mapped into the address space of the microcode ROM, whereby the patch triads directly follow the read-only triads.

How to update?

OS 里就可以直接更新,不需要刷 BIOS。

如何在 Linux 上安装/更新 Intel 微码固件 - 知乎

Microcode (简体中文) - ArchWiki

Update by who?

Firmware, OS.

微码更新通常随主板固件一起提供,并在固件初始化时被应用。但是 OEM 可能不会及时发布固件更新,并且旧系统根本不会获得新的固件更新,所以 Linux 内核提供了启动时应用微码更新的功能。Linux 微码加载器 支持三种加载方式:

  1. 早期加载 在启动过程中很早就更新微码(比 initramfs 阶段还早),所以是推荐的方式。对于具有严重硬件错误的 CPU,例如 Intel Haswell 和 Broadwell 处理器系列,必须选择这种方式。
  2. 后期加载危险)在启动后更新微码,这可能太晚了,因为 CPU 可能已经使用了有问题的指令集。即使已经使用了早期加载,后期加载依然有价值,可以在系统不重启的情况下应用较新的微码更新。
  3. 内置微码 可以编译到内核中,然后由早期加载程序应用。

Microcode (简体中文) - ArchWiki

startup.nsh

Just the official name for the EFI Shell.

As startup.nsh is the equivalent of autoexec. bat in the DOS/Windows environment, Intel usually provides the startup.nsh script in the System Firmware Update package. This script is used to perform all System Firmware Update tasks.

efi_instructions.pdf

RISC definition

single-cycle, load/store, no microcode, few instructions and addressing modes, fixed instruction format, and a decided shift of complexity to the compiler.

The Origin of Intel’s Micro-Ops

Quickly generate a flame graph for a binary using perf

git clone https://github.com/brendangregg/FlameGraph
sudo perf record -F 99 -a -g -- <binary/script>
sudo perf script > out.perf
FlameGraph/stackcollapse-perf.pl out.perf > out.folded
FlameGraph/flamegraph.pl out.folded > kernel.svg

Malloc, kmalloc, vmalloc, kzalloc, kvmalloc and kcalloc

malloc() 只是 glibc 提供的一个函数,其底层还是基于的 sbrkmmap 这两个 Linux kernel 提供的 syscall。

  malloc kmalloc vmalloc
Physical Contiguous No Yes No
Virtually Contiguous Yes Yes Yes
Space User Kernel Kernel

Note:

  • kmalloc() is for objects smaller than page size, so it is physical contiguous. But that doesn't mean it cannot allocate larger than page size. The maximum size allocatable by kmalloc() is 1024 pages, or 4MB on x86.
  • For large allocations you can use vmalloc() and vzalloc().

kcalloc() allocates memory for an array, it is NOT a replacement for kmalloc(). 返回的是物理地址。kcalloc 输入参数说明:

  • n:数组中的元素个数;
  • size:指定数组中每个元素所对应的内存对象的大小。

Linux内核API kcalloc|极客笔记

Example Linux Memory Allocation Code | Hitch Hiker's Guide to Learning

Memory Allocation Guide — The Linux Kernel documentation

The kmalloc() & vmalloc() functions are a simple interface for obtaining kernel memory in byte-sized chunks.

  1. The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
  2. The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.

https://stackoverflow.com/a/33996607/18644471

vmalloc() 一般用于申请大块物理内存,但只是虚拟地址连续,物理地址不一定连续,因为如果申请小块物理内存的话,比如说申请的小于 4K 页,那么物理地址肯定是连续的,只有当申请多个页的时候,才会出现虚拟地址连续但是物理地址不连续的情况。

kmalloc() 一般用于申请小于一页的物理内存,因为虚拟物理地址都连续。

你应该曾经纠结过是用 kmalloc(),还是 vmalloc()?现在你不用那么纠结了,因为内核里面现在有个 API 叫 kvmalloc(),可以认为是 kmalloc()vmalloc() 的双剑合一。屠龙刀和倚天剑的合体

As mentioned in the article, vmalloc() has higher overhead.

kvmalloc [LWN.net]

Then what is kzalloc()?

It is just a kmalloc() with memory zeroed before return.

宋宝华: kvmalloc ——倚天剑屠龙刀两大神器合体? - 腾讯云开发者社区-腾讯云

Char device vs. Block device on interface

块设备除了给内核提供和字符设备一样的接口外,还提供了专门面向块设备的接口。

linux三大驱动类型:字符设备、块设备、网络设备_hello_courage的博客-CSDN博客_linux设备类型

Kernel config

In defconfig we can specify only options with non-default values. This way we can keep it small and clear. Every new kernel version brings a bunch of new options, and this way we don't need to update our defconfig file each time the kernel releases.

It's better to avoid modifying it by hand. Instead you should use make savedefconfig rule.

Where is the default values?

When .config file is being generated, kernel build system goes through all Kconfig files (from all subdirs), checking all options in those Kconfig files:

  • if option is mentioned in defconfig, build system puts that option into .config with value chosen in defconfig
  • if option isn't mentioned in defconfig, build system puts that option into .config using its default value, specified in corresponding Kconfig

kbuild - What exactly does Linux kernel's make defconfig do? - Stack Overflow

Are defconfig the files under arch/x86/configs?

Yes.

Defconfig files are typically stored in the kernel tree at arch/*/configs/

Not all files under this directory are defconfig files, some are missaved by other people are just .config files.

kernel/configs - Git at Google

What does make savedefconfig do?

正确地根据 .config 文件生成 defconfig 的文件。

What does make defconfig do?

arch/x86/configs 文件夹中有命名为 xxx_defconfig 的配置文件,如果运行 make xxx_defconfig,当前 .config 文件会由此文件生成。举个例子:

lei in ~/p/leinux on  master  
❯ make x86_64_defconfig  
#
# configuration written to .config
#
lei in ~/p/leinux on  master  
❯ make x86_64_defconfig  

Bus clock / core crystal clock frequency

core crystal clock is also known as system clock, CPU clock (like 3.6Ghz).

Do not mix up the bus lock and the bus clock.

Every bus also has a clock speed. Just like the processor, manufacturers state the clock speed for a bus in hertz. Recall that one megahertz (MHz) is equal to one million ticks per second. Today’s processors usually have a bus clock speed of 400, 533, 667, 800, 1066, 1333, or 1600 MHz. The higher the bus clock speed, the faster the transmission of data, which results in programs running faster.

Difference Between System Clock and Bus Clock

kvm_stat

kvm_stat prints counts of KVM kernel module trace events. These events signify state transitions such as guest mode entry and exit.

linux/kvm_stat.txt at master · torvalds/linux

Chroot

chroot,即 change root directory (更改 root 目录)。在 Linux 系统中,系统默认的目录结构都是以 /,即以根 (root) 开始的。而在使用 chroot 之后,系统的目录结构将以指定的位置作为 / 位置。

https://www.cnblogs.com/sparkdev/p/8556075.html

What is .Xauthority file?

Since the network may be accessible to other users, a method for forbidding access to programs run by users different from the one who is logged in is necessary.

There are five standard access control mechanisms that control whether a client application can connect to an X display server. They can be grouped in three categories:

  1. access based on host
  2. access based on cookie
  3. access based on user

X Window authorization - Wikipedia

当两台电脑用网线直连时,他们之间如何通信?

只要配好了各自的 IP 地址,就可以通信。

同一子网下,以太帧是广播的。

目的主机网卡收到包查看发现目的 MAC 为自己,如是接收此包,递交上层协议(TCP/IP)进一步解包,就通了。

Bridge / Switch

Hub:信号放大器,物理层。

Bridge:原来只有两个端口,后面扩展成了多个端口,变成了交换机(也叫多端口网桥)。网桥和交换机的一个口就是一个广播域。

网桥的两端各自是一个广播域,交换机的每一个端口都是一个广播域,这就是最重要的区别。

桥接网络是指本地物理网卡和虚拟网卡通过 VMnet0 虚拟交换机进行桥接,物理网卡和虚拟网卡在拓扑图上处于同等地位,那么物理网卡和虚拟网卡就相当于处于同一个网段,虚拟交换机就相当于一台现实网络中的交换机,所以两个网卡的 IP 地址也要设置为同一网段。

局域网交换机
├── 主机 A
├── 主机 B
└── 主机 C
    └── VMnet0 虚拟交换机
        ├── 主机 C
        ├── 虚拟机 1
        └── 虚拟机 2

因此,上面 VMnet0 的位置,其实可以不使用交换机,一个网桥也可以胜任,只不过会有广播风暴的问题,任何一台虚拟机的网络包来了,其它虚拟机都会收到。

因此,我认为数据链路层最主要的作用,就是隔绝网络风暴!

集线器、网桥、交换机的区别 - 知乎

How does guest handle its clock interrupt and schedule it's vcpus?

I think we shouldn't mix the concept scheduling in host and scheduling in guest.

In host, schedule occur in each constant interval. it can schedule a guest running thread out, which is opaque to the guest and guest won't know. From the guest point of view, it may just see some performance downgrading and it didn't know the time is stolen from the host.

How does the clock be virtualized?

VMLAUNCH/VMRESUME

Are VMLAUNCH/VMRESUME synchronous or asynchronous? in other words, when the QEMU vcpu thread is scheduled out, will the VM also stop executing?

The answer is, they are synchronous.

VMLAUNCH/VMRESUME are just like a function call, with some VMCS loading and operation mode switching functions?

  Function call VMLAUNCH/VMRESUME
Enter call vmentry
Exit ret vmexit
Where to save origin RIP when enter? Stack VMCS Host State Area
Where to load RIP when enter parameter VMCS Guest State Area
Where to save RIP when return? Needn't VMCS Guest State Area
Where to load RIP when return Stack VMCS Host State Area

There is no software-visible bit whose setting indicates whether a logical processor is in VMX non-root operation. This fact may allow a VMM to prevent guest software from determining that it is running in a virtual machine.

A few key registers are saved automatically by the CPU when the interrupt occurs.

linux kernel - How scheduler save the registers of previously running process - Stack Overflow

According to above evidence, I suppose maybe the register indicating current operation mode will be saved each clock interrupt, to ensure they will not interleave each other.

VMLAUNCH: Creates an instance of a VM and enters non-root mode.

VMRESUME: Enters non-root mode for an existing VM instance.

Besides, It is expected that, in general, VMRESUME will have lower latency than VMLAUNCH.

"Intel Virtualisation: How VT-x, KVM and QEMU Work Together – Binary Debt"

What is /boot/efi?

EFI system partition - Wikipedia

EFI (Extensible Firmware Interface) system partition or ESP.

When a computer is booted, UEFI firmware loads files stored on the ESP to start installed operating systems and various utilities.

An ESP contains the boot loaders or kernel images for all installed operating systems.

The EFI system partition is formatted with a file system whose specification is based on the FAT file system and maintained as part of the UEFI specification; therefore, the file system specification is independent from the original FAT specification.

The mount point for the EFI system partition is usually /boot/efi, where its content is accessible after Linux is booted.

GNU / Linux / Unix

Unix 系统被发明之后,大家用的很爽。但是后来开始收费和商业闭源了。一个叫 RMS 的大叔觉得很不爽,于是发起 GNU 计划,模仿 Unix 的界面和使用方式,从头做一个开源的版本。然后他自己做了编辑器 Emacs 和编译器 GCC。

GNU 是一个计划或者叫运动。在这个旗帜下成立了 FSF,起草了 GPL 等。

接下来大家纷纷在 GNU 计划下做了很多的工作和项目,基本实现了当初的计划。包括核心的 gcc 和 glibc。但是 GNU 系统缺少操作系统内核。原定的内核叫 HURD,一直完不成。同时 BSD(一种 UNIX 发行版)陷入版权纠纷,x86 平台开发暂停。然后一个叫 Linus 的同学为了在 PC 上运行 Unix,在 Minix 的启发下,开发了 Linux。注意,Linux 只是一个系统内核,系统启动之后使用的仍然是 gcc 和 bash 等软件。Linus 在发布 Linux 的时候选择了 GPL,因此符合 GNU 的宗旨。

最后,大家突然发现,这玩意不正好是 GNU 计划缺的么。于是合在一起打包发布叫 GNU / Linux。然后大家念着念着省掉了前面部分,变成了 Linux 系统。实际上 Debian,RedHat 等 Linux 发行版中内核只占了很小一部分容量。

  • Unix style: tar -c
  • GNU style: tar --create

GNU 是什么,和 Linux 是什么关系? - 知乎 https://www.zhihu.com/question/319783573/answer/656033035

Why the value of PSE is physical address, not virtual?

From SDM:

Each paging-structure entry contains a physical address, which is either the address of another paging structure or the address of a page frame. In the first case, the entry is said to reference the other paging structure; in the latter, the entry is said to map a page.

If we use the virtual address, then the virtual address is also need to be translated, which leads to a recursion hell.

cr3 also has the physical address of page table.

In kernel code:

task_struct
    mm_struct
        pgd (virtual address)

pgd can be converted to CR3 by __phys_addr, you can grep cr3 pgd in kernel code for more information.

linux - Difference between CR3 value and pgd_t - Stack Overflow

READ_ONCE/WRITE_ONCE

kernel-sanitizers/READ_WRITE_ONCE.md at master · google/kernel-sanitizers

What is GNU extension?

GNU C provides several language features not found in ISO standard C.

C Extensions )

GNU Extensions are explicitly allowed in the Linux kernel.

Assembly in C (A kind of GNU extension)

Extended asm statements have to be inside a C function.

Using Assembly Language with C )

Fields defined in vmcs_field vs. fields not defined in

For an example, VM_ENTRY_LOAD_IA32_PKRS is just a bit in the entry control register, whose address is further specified inside the vmcs_field.

How to use VMREAD and VMWRITE?

SDM 24.11.1:

Software should use the VMREAD and VMWRITE instructions to access the different fields in the current VMCS (see Section 24.11.2). Software should never access or modify the VMCS data of an active VMCS using ordinary memory operations, in part because the format used to store the VMCS data is implementation-specific and not architecturally defined, and also because a logical processor may maintain some VMCS data of an active VMCS on the processor and not in the VMCS region.

SDM 24.11.1:

Every field of the VMCS is associated with a 32-bit value that is its encoding. The encoding is provided in an operand to VMREAD and VMWRITE when software wishes to read or write that field.

Intel SDM Vol 3. APPENDIX B FIELD ENCODING IN VMCS:

Fields are grouped by width (16-bit, 32-bit, etc.) and type (guest-state, host-state, etc.)

prctl()

prctl - operations on a process

such as changing process name:

(void) prctl(PR_SET_NAME, "systemd");

This is a system call.

Ramfs / tmpfs / ramdisk

How did they evolve: ramdisk (initrd use this) -> ramfs -> tmpfs (initramfs use this).

Ramfs exports Linux’s disk caching mechanisms (the page cache and dentry cache) as a dynamically resizable RAM-based filesystem. With ramfs, there is no backing store. Files written into ramfs allocate dentries and page cache as usual, but there’s nowhere to write them to. This means the pages are never marked clean, so they can’t be freed by the VM when it’s looking to recycle memory.

The amount of code required to implement ramfs is tiny, because all the work is done by the existing Linux caching infrastructure. Basically, you’re mounting the disk cache as a filesystem. Because of this, ramfs is not an optional component removable via menuconfig, since there would be negligible space savings.

The older “ram disk” mechanism created a synthetic block device out of an area of RAM and used it as backing store for a filesystem. Using a ram disk also required unnecessarily copying memory from the fake block device into the page cache. Plus it needed a filesystem driver (such as ext2) to format and interpret this data.

The RAM disk is simply unnecessary; ramfs is internally much simpler.

One downside of ramfs is you can keep writing data into it until you fill up all memory, and the VM can’t free it, because of this, only root (or a trusted user) should be allowed write access to a ramfs mount.

A ramfs derivative called tmpfs was created to add size limits, and the ability to write the data to swap space. Normal users can be allowed write access to tmpfs mounts.

Ramfs, rootfs and initramfs — The Linux Kernel documentation

Root Filesystem

Mounting the root filesystem is a two-stage procedure, shown in the following list.

  1. The kernel mounts the initial rootfs filesystem, which just provides an empty directory that serves as initial mount point.
  2. The kernel mounts the real root filesystem over the empty directory.

Why does the kernel bother to mount the rootfs filesystem before the real one? Well, the rootfs allows the kernel to easily change the real root filesystem. In fact, in some cases, the kernel mounts and unmounts several root filesystems, one after the other. For instance,

  • the initial bootstrap floppy disk of a distribution might load in RAM a kernel with a minimal set of drivers, which mounts as root a minimal filesystem stored in a RAM disk.
  • Next, the programs in this initial root filesystem probe the hardware of the system (for instance, they determine whether the hard disk is EIDE, SCSI, or whatever), load all needed kernel modules, and remount the root filesystem from a physical block device.

Mounting the Root Filesystem - Linux Kernel Reference

Most systems just mount another filesystem over rootfs and ignore it. (Why?)

How to know the device mounted as the real root filesystem?

df -a too see all the devices, see which device is mounted as /.

If /dev/root is shown which is not the real device, it is just a link-like thing, then you can:

findmnt -n -o SOURCE /

too see the real device.

How to find the root device? - Bootlin's blog

If it is /dev/mapper/cs-root, 说明使用了 LVM,需要添加 initrd 来辅助解析文件系统。所以需要给起 VM 使用的 QEMU 的 cmdline 添加 -initrd 参数,比如:

$QEMU -accel kvm \
-kernel "/boot/vmlinuz-6.5.0-next-lei" \
-initrd "/boot/initramfs-6.5.0-next-lei.img" \
-append "root=/dev/mapper/cs-root console=hvc0 console=ttyS0 earlyprintk=ttyS0 nokaslr ignore_loglevel unknown_nmi_panic" \
//...

Without real root filesystem loaded, how does bootloader know the location of each kernel?

Kernel is always located in /boot which is in the root file system, so how does bootloader such as grub find the kernel without loading the root file system first?

Grub has it's own file system driver. When the system boots, GRUB loads its own filesystem driver (such as ext2 or btrfs) to read the configuration file and locate the kernel and initrd. Once the kernel and initrd are located, GRUB passes control to the kernel and the kernel takes over the boot process, including mounting the root filesystem.

GRUB has its own set of filesystem drivers that it can use to read files from various filesystems, including ext2, ext3, ext4, btrfs, NTFS, and more. These filesystem drivers are built into the GRUB bootloader image itself and are loaded at boot time. They allow GRUB to read configuration files, kernel images, and initrd images from various filesystems without relying on the kernel or any other tools to be installed on the system.

Why doesn't kernel inherit filesystem info from GRUB?

Q: Why kernel needs to mount the filesystem while GRUB knows it? Why doesn't it inherit directly from GRUB or something?

A: The operating system can't use the driver from the bootloader because once the bootloader has loaded the operating system in memory, the bootloader is erased from memory. (And also because the bootloader's driver is typically less capable than the OS's — for example Grub's filesystem drivers can only read, not write.)

linux - Why doesn't kernel inherit filesystem info from GRUB? - Unix & Linux Stack Exchange

Initial root filesystem (initrd And initramfs)

Both initrd and initramfs are initial root file systems (not the real physical root file system). They are used to provide the kernel with the required modules, drivers, and utilities needed to mount and access the real root file system^.

The reason: The device drivers for the kernel are included as kernel modules. This raises the problem of detecting and loading the modules necessary to mount the root file system at boot time. The root file system may be on a software RAID volume, LVM, NFS (on diskless workstations), or on an encrypted partition. All of these require special preparations to mount.

To avoid having to hardcode handling for so many special cases into the kernel, an initial boot stage with a temporary root filesystem is used. This root file-system can contain user-space helpers which do the hardware detection, module loading and device discovery necessary to get the real root file-system mounted.

  • The bootloader will load the kernel and initial root file system image into memory and then start the kernel, passing in the memory address of the image. At the end of its boot sequence, the kernel tries to determine the format of the image from its first few blocks of data, which can lead either to the initrd or initramfs scheme.
  • In the initrd scheme, the image may be a file system image, which is made available in a special block device (/dev/ram) that is then mounted as the initial root file system.

The driver for that file system must be compiled statically into the kernel. Filesystems used as initrd images include compressed ext2 and cramfs, which is used on memory-limited systems since the cramfs image can be mounted in-place without requiring extra space for decompression.

Once the initial root file system is up, the kernel executes /linuxrc as its first process; when it exits, the kernel assumes that the real root file system has been mounted and executes /sbin/init to begin the normal user-space boot process.

ramdisk, ramfs, tmpfs, rootfs, initrd and initramfs

What' the relationship between the initrd and the /dev/ram?

The kernel

  • first create a device (/dev/ram)
  • then loads the contents of the initrd image into the device
  • mount the device so it can use it to mount the real root file system
// init/do_mounts_initrd.c
bool __init initrd_load(void)
{
	if (mount_initrd) {
		create_dev("/dev/ram", Root_RAM0);
		/*
		 * Load the initrd data into /dev/ram0. Execute it as initrd
		 * unless /dev/ram0 is supposed to be our actual root device,
		 * in that case the ram disk is just set up here, and gets
		 * mounted in the normal path.
		 */
		if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
			init_unlink("/initrd.image");
			handle_initrd();
			return true;
		}
	}
	init_unlink("/initrd.image");
	return false;
}

Initrd use ramdisk file system, why it include compressed ext2 and cramfs?

ramdisk is from the point of view of the the physical backend choices. while ext2 is from the perspective of how the data are organized, so I think they are orthogonal concepts.

So, I think a filesystem can be ext2 formatted, and used as a ramdisk.

ramdisk - Which filesystem to use for RAM disk? - Super User

initrd (init ramdisk)

/boot/initrd.img.

It is a type of initial rootfs, it uses ramdisk^ filesystem.

The initrd contains various executables and drivers that permit the real root file system to be mounted, after which the initrd RAM disk is unmounted and its memory freed.

The initial RAM disk (initrd) has been traditionally used in older versions of Linux. Unlike initrd, which is a static image, initramfs is a dynamically generated archive that is stored in memory. It is created by the kernel during the build process, and it contains a collection of essential files and tools that the kernel can use to mount the real root file system. Initramfs is typically built using the dracut^ utility, which allows for greater customization and flexibility than initrd.

Initial ramdisk - Wikipedia

Understanding Linux initrd - initial RAM disk - nixCraft

initramfs (init ramfs)

The image may be a cpio archive.

CONFIG_TMPFS:

  • enabled, use tmpfs;
  • disabled, use ramfs

ramdisk, ramfs, tmpfs, rootfs, initrd and initramfs

The real root filesystem

The root file system is the file system contained on the same disk partition on which the root directory is located; it is the filesystem on top of which all other file systems are mounted as the system boots up.

There is a kernel parameter root to tell the kernel what device is to be used as the root fs when booting.

bootparam - Linux manual page

bootloader -> bzImage (kernel) -> initrd/initramfs (initiial root filesystem) -> real root filesystem.

updates - RootFileSystem vs kernel updating - Stack Overflow

Some times, root file system is called the userspace. Just Linux

Is it mandatory to load a root file system when each time booting? Can we just boot the kernel?

I think it is mandatory. even for NFS: https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt

Is it possible to boot the Linux kernel without creating an initrd image? - Stack Overflow