Privileged instructions and sensitive instructions

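Most modern computer architectures have two or more privilege levels to separate system software from application software. Instructions that operate on or manage critical system resources are designated privileged instructions, and they execute correctly only at the highest privilege level. If one is executed at any lower privilege level, it raises an exception; the processor traps to the highest privilege level and hands control to system software.

Across privilege levels, not only does the effect of an instruction differ, but not every privileged instruction raises an exception either. For example, if a user on x86 breaks the rules and tries to modify the interrupt flag (IF) in the EFLAGS register from user mode, the change has no effect and causes no trap; the hardware silently ignores it.

In the virtualization world there is another class of instructions, called sensitive instructions. In short, these are instructions that operate on privileged resources: those that change the operating mode of the virtual machine or the state of the underlying physical machine; those that read or write sensitive registers or memory, such as clock or interrupt registers; those that access the storage protection system, the memory system, or the address relocation system; and all I/O instructions.

Sensitive instructions are the instructions that trigger a VMExit when the guest executes them.

Clearly, every privileged instruction is a sensitive instruction, but not every sensitive instruction is a privileged instruction.

Excerpted from 《系统虚拟化 原理与实现》.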

Ctrl z, fg and bg

postgresql - How to restart some progress which is stopped by "ctrl+z"? - Stack Overflow

Use of ! in VIM

What's the use of the exclamation mark ('!') in Vim's command line after certain commands like ":w!"? - Stack Overflow

CSP

Cloud Solution Provider

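Basic block

In compiler design, a basic block is a straight-line sequence of code. Control can enter it only at its first instruction (no other code jumps into the middle of it), and can leave it only at its last instruction (no code in the middle jumps out of it).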
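The definition above can be turned into the usual "leader" rule: a block starts at the first instruction, at every jump target, and at every instruction that follows a jump. The sketch below is only an illustration on a made-up mini instruction list; the `("jmp", i)` and `("br", i)` opcodes and the `split_basic_blocks` name are assumptions, not part of any real compiler.

```python
def split_basic_blocks(instrs):
    """Split a list of (op, arg) instructions into basic blocks.

    Hypothetical mini-IR: ("jmp", i) and ("br", i) jump to index i;
    every other op simply falls through to the next instruction.
    """
    if not instrs:
        return []
    leaders = {0}  # the first instruction always starts a block
    for i, (op, arg) in enumerate(instrs):
        if op in ("jmp", "br"):
            leaders.add(arg)        # a jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)  # so does the instruction after a jump
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [instrs[s:e] for s, e in zip(starts, ends)]


program = [
    ("load", "x"),   # 0
    ("br", 4),       # 1: conditional branch to index 4
    ("add", 1),      # 2
    ("jmp", 5),      # 3
    ("sub", 1),      # 4
    ("store", "x"),  # 5
]
blocks = split_basic_blocks(program)
for number, block in enumerate(blocks):
    print(number, block)
```

Each resulting block can only be entered at its first instruction and left at its last one, which is exactly the property the definition asks for.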

Interrupt return instruction (IRET)

The IRET (return from interrupt) instruction returns program control from an interrupt handler to the interrupted procedure. The IRET instruction performs a similar operation to the RET instruction.

See SDM 7.3.8.4 Software Interrupt Instructions.

Windows 10 system optimization

Sycnex/Windows10Debloater: Script to remove Windows 10 bloatware.

Yet Another Dotfiles Manager

"Yet Another Dotfiles Manager - yadm"

Polling

Polling is called 轮询 in Chinese.

Charm

We build tools to make the command line glamorous.

"Charm"

part of the

Grammarly for vim

"dpelle/vim-LanguageTool: A vim plugin for the LanguageTool grammar checker"

"Copyright vs. Copyleft"

VMCS Layout

cpu-internals/VMCS-Layout.pdf at master · LordNoteworthy/cpu-internals (github.com)

sTooltip - colored standard tooltip with timeout

"[function] sTooltip - colored standard tooltip with timeout - Scripts and Functions - AutoHotkey Community"

Your code editor, black on white or white on black?

Ask HN: Your code editor, black on white or white on black? | Hacker News

Georgia (typeface)

"Georgia (typeface) - Wikipedia"


For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level.

"Intel Virtualisation: How VT-x, KVM and QEMU Work Together – Binary Debt"

Most ioctl implementations consist of a switch statement. It selects the correct behavior according to the cmd argument.

"char-enhanced.pdf"

Tampermonkey

This plugin can add Javascript scripts to any site, so I can replace some characters in title to enhance the copy as markdown style feature.

Magic number

motivation:

Before writing the code for ioctl, you need to choose the numbers that correspond to commands. Unfortunately, the simple choice of using small numbers starting from 1 and going up doesn’t work well. The command numbers should be unique across the system, in order to prevent errors caused by issuing the right command to the wrong device.[^1]

Two device nodes may have the same major number. An application could open more than one device and mix up the file descriptors, thereby sending the right command to the wrong device. Sending wrong ioctl commands can have catastrophic consequences, including damage to hardware. A unique magic number should be encoded into the commands with one of the following macros.[^2]

_IO(magic, number)
_IOR(magic, number, data_type)
_IOW(magic, number, data_type)
_IOWR(magic, number, data_type)

where magic is the 8-bit magic number unique to the device.

Ioctl for block device

Block devices can provide an ioctl method to perform device control functions. The higher-level block subsystem code intercepts a number of ioctl commands before your driver ever gets to see them, however (see drivers/block/ioctl.c in the kernel source for the full set). In fact, a modern block driver may not have to implement very many ioctl commands at all.[^3]

Core difference between char device and block device

Character devices are those for which no buffering is performed, and block devices are those which are accessed through a cache.

Character devices are read from and written to with two functions: foo_read() and foo_write(). The read() and write() calls do not return until the operation is complete. By contrast, block devices do not even implement the read() and write() functions, and instead have a function which has historically been called the "strategy routine." Reads and writes are done through the buffer cache mechanism by the generic functions bread(), breada(), and bwrite(). These functions go through the buffer cache, and so may or may not actually call the strategy routine, depending on whether or not the block requested is in the buffer cache (for reads) or on whether or not the buffer cache is full (for writes). A request may be asynchronous: breada() can request the strategy routine to schedule reads that have not been asked for, and to do it asynchronously, in the background, in the hopes that they will be needed later.

The sources for character devices are kept in drivers/char/, and the sources for block devices are kept in drivers/block/. They have similar interfaces, and are very much alike, except for reading and writing. Because of the difference in reading and writing, initialization is different, as block devices have to register a strategy routine, which is registered in a different way than the foo_read() and foo_write() routines of a character device driver.[^4]

Telescope documentation for developers

telescope.nvim/developers.md at master · nvim-telescope/telescope.nvim

Write back vs write through cache

Write Through and Write Back in Cache - GeeksforGeeks

Intel vPro

Intel vPro technology is an umbrella marketing term used by Intel for a large collection of computer hardware technologies, including VT-x, VT-d, Trusted Execution Technology (TXT), and Intel Active Management Technology (AMT).

Compare two branches in Github

To compare different versions of your repository, append /compare to your repository's path.

git - How can I diff two branches in GitHub? - Stack Overflow

How to chsh in wsl

bash - How to change default shell for Linux susbsystem for Windows - Super User

Mkview and loadview

A View is the smallest subset of the three (View, Session, Viminfo). It is a collection of settings for one window.

Views, Sessions, And Viminfo | Learn Vim

Nvim-cmp builtin comparators

nvim-cmp/compare.lua at main · hrsh7th/nvim-cmp

All the lsp symbol kind name

export namespace SymbolKind {
	export const File = 1;
	export const Module = 2;
	export const Namespace = 3;
	export const Package = 4;
	export const Class = 5;
	export const Method = 6;
	export const Property = 7;
	export const Field = 8;
	export const Constructor = 9;
	export const Enum = 10;
	export const Interface = 11;
	export const Function = 12;
	export const Variable = 13;
	export const Constant = 14;
	export const String = 15;
	export const Number = 16;
	export const Boolean = 17;
	export const Array = 18;
	export const Object = 19;
	export const Key = 20;
	export const Null = 21;
	export const EnumMember = 22;
	export const Struct = 23;
	export const Event = 24;
	export const Operator = 25;
	export const TypeParameter = 26;
}

from Specification.

Default neovim highlight groups

Nvim documentation: syntax (neovim.io)

Remote tracking branch

Remote-tracking branches are references to the state of remote branches. They’re local references that you can’t move; Git moves them for you whenever you do any network communication, to make sure they accurately represent the state of the remote repository.

Remote-tracking branch names take the form <remote>/<branch>.[^6]

全角和半角

全角半角是文字的两种显示形式,“全角”指文字字身长宽比为一比一的正方形,而“半角”为宽度为全角一半的文字。

“半角/全角”源于日文,其中“角”是“方块”的意思,“全角/半角”在日文里即是原本“正方形/半个正方形大小文字”的本意。

fullwidth and halfwidth.

Cherry-pick order matters

git - Cherrypick commit orders - Stack Overflow

By default, any function that is defined in a C file is extern.

What is extern and static function in C? | Fresh2Refresh.com

VIM tab configurations for Linux Kernel Development

VIM configurations for Linux Kernel Development

Kernel develop commit pretty format

The following git config settings can be used to add a pretty format for outputting the above style in the git log or git show commands:

[core]
        abbrev = 12
[pretty]
        fixes = Fixes: %h (\"%s\")

An example call:

$ git log -1 --pretty=fixes 54a4f0239f2e
Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")

Linux coding style notes

  • Outside of comments, documentation and except in Kconfig, spaces are never used for indentation.
  • The limit on the length of lines is 80 columns and this is a strongly preferred limit.
  • Never break user-visible strings such as printk messages, because that breaks the ability to grep for them.

Use clang-format to fix kernel coding style

clang-format — The Linux Kernel documentation

Formatting in clangd / clang-format

clangd embeds clang-format, which can reformat your code: fixing indentation, breaking lines, and reflowing comments.

clangd respects your project’s .clang-format file which controls styling options.

Format-as-you-type is experimental and doesn’t work well yet.

Features

Why doesn't lua support POSIX regular expression?

Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines. Of course, the pattern matching in Lua cannot do all that a full POSIX implementation does. Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.

Programming in Lua : 20.1

Patterns in Lua

Programming in Lua : 20.2

Offset and exact in nvim-cmp

separated-word
│         │
│         └ source2 offset
└ source1 offset

The offset comparator prefers source1 candidates.

The exact comparator prefers exact match candidates.

  1. Exact match
    • The user input is word and candidates text is word
  2. Not exact match
    • The user input is word and candidates text is wording

Add doc for comparators · Issue #883 · hrsh7th/nvim-cmp

Motivation behind git patch

A git patch or git diff is used to share your changes with others without pushing them to the main branch of the repository.

How do I get a linux kernel patch set from the mailing list?

Ubuntu Manpage: mbox-extract-patch - extract a git patch series from an mbox

Using patchwork to manage your patch

getpatchwork/patchwork: Patchwork is a web-based patch tracking system designed to facilitate the contribution and management of contributions to an open-source project.

What is the difference between git am and git apply?

Both the input and output are different:

  • git apply takes a patch (e.g. the output of git diff) and applies it to the working directory (or index, if --index or --cached is used), which means it never commits, and with the default options does not even stage the changes.
  • git am takes a mailbox of commits formatted as email messages (e.g. the output of git format-patch) and applies them to the current branch.

git am uses git apply behind the scenes, but does more work before (reading a Maildir or mbox, and parsing email messages) and after (creating commits).

patch - What is the difference between git am and git apply? - Stack Overflow

Commit message in patch

git format-patch uses the first line of the commit message as the patch title and the rest of the commit message as the patch body.
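A minimal sketch of that split, assuming the conventional layout of a one-line subject, a blank line, then the body. The `split_commit_message` helper is hypothetical; git itself folds every line up to the first blank line into the subject, which gives the same result for this layout.

```python
def split_commit_message(message):
    """Return (patch subject, patch body) for a conventional commit message."""
    subject, _, body = message.strip().partition("\n")
    # "[PATCH]" is the default subject prefix used by git format-patch.
    return f"[PATCH] {subject.strip()}", body.strip()


message = "KVM: fix typo in comment\n\nLonger explanation.\nSecond line.\n"
subject, body = split_commit_message(message)
print(subject)
print(body)
```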

gitter.im

Gitter is a chat and networking platform that helps to manage, grow and connect communities through messaging, content and discovery.

Gitter — Where developers come to talk.

<cr>, <tab> And <esc> are considered equivalent to <c-m>, <c-i> and <c-[> in terminal

key bindings - How to distinguish C-m from RETURN? - Emacs Stack Exchange

Patchew Project: A patch tracking and testing system

Patchew Project

How to find the SMTP server in Outlook on Windows or Mac

如何在Windows或Mac上找到Outlook中的SMTP服务器

Microsoft Exchange Server

Microsoft Exchange Server is a mail server and calendaring server developed by Microsoft. It runs exclusively on Windows Server operating systems.

Microsoft Exchange Server - Wikipedia

Config git send-email

How to configure and use git send-email to work with gmail to email patches to developers - Stack Overflow

森林集 input method skins

森林集官方皮肤站

柚子输入法 and 影子输入法: input methods implemented in AHK

You's 输入法 - 文集 - 简书

河许人/影子输入法 - 码云 - 开源中国

Thunderbird cannot send

FIX: Send and Attach Buttons Missing in Thunderbird

Thunderbird export settings

how can I tansfer my Thunderbird account settings onto a new clean install on Windows 7? | Thunderbird Support Forum | Mozilla Support

Don't Use Terminal Emacs

Don't Use Terminal Emacs - The Chronicle

Add-apt-repository with proxy

sudo -E add-apt-repository

How do I get add-apt-repository to work through a proxy? - Ask Ubuntu

What to do when sudo apt-key adv --keyserver hangs?

Add a proxy:

sudo apt-key adv --keyserver keyserver.ubuntu.com --keyserver-options http-proxy=http://address:port --recv-keys <the_key>

See: debian - Unable to add gpg key with apt-key behind a proxy - Unix & Linux Stack Exchange

How to find out which partition is Ubuntu installed on?

partitioning - How to find out which partition is Ubuntu installed on? - Ask Ubuntu

dlopen(): Error loading libfuse.so.2 AppImages require FUSE to run.

FUSE · AppImage/AppImageKit Wiki

Error: failed to connect to the hypervisor, error: Failed to connect socket to '/run/user/1000/libvirt/libvirt-sock': No such file or directory

sudo apt install qemu qemu-kvm libvirt-clients libvirt-daemon-system virtinst bridge-utils
sudo systemctl enable libvirtd
sudo systemctl start libvirtd

virtualization - Failed to connect socket to '/var/run/libvirt/libvirt-sock' - Ask Ubuntu

Libvirt domain XML format

libvirt: Domain XML format

Converting QEMU arguments to domain XML

14.5.21. Converting QEMU Arguments to Domain XML Red Hat Enterprise Linux 6 | Red Hat Customer Portal

Unfortunately, it has been removed…

kvm - Qemu Native to Libvirt XML - Stack Overflow

Why are vms in KVM/QEMU called domains?

They're not kvm exclusive terminology. A hypervisor is a rough equivalent to domain zero, or dom0, which is the first system initialized on the kernel and has special privileges. Other domains started later are called domU and are the equivalent to a guest system or virtual machine.

Why are vms in KVM/QEMU called domains? - Unix & Linux Stack Exchange

Difference between virsh define and virsh create

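Essentially the two are the same: both create a virtual machine from an XML configuration file:

  • define creates the domain from the XML configuration file but does not start it;
  • create also creates the domain from the XML configuration file and starts it, and it accepts options such as starting the domain paused (--paused) or attaching to its console (--console).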

Virsh中创建虚拟机两种方式define和create的区别_weixin_34362991的博客-CSDN博客

Error: Disconnected from qemu:///session due to end of file, error: Failed to define domain from /etc/libvirt/qemu/ubuntu22.xml, error: End of file while reading data: Input/output error

Run it with sudo.

Virsh cannot shutdown

Just destroy it (virsh destroy).

Virsh add port forward

Networking - Libvirt Wiki

Git clone into an existing folder

cd
git clone https://github.com/tristone13th/.config.git temp
mv temp/.git .config/.git
rm -rf temp
cd .config
git checkout .

CPU stepping

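If you compare carefully, you will find that a stepping is not actually tied to one particular processor model: a die of a given stepping can be used in several processors, so the stepping really denotes a particular stage of the processor's manufacturing process. For example, the Core 2 E6400 and the Core 2 E6700 are both stepping B2, which means they use the same manufacturing process. This matters a great deal for overclocking: processors built on the same process should have similar maximum frequencies.[^9]

This article explains it well: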

CPU Stepping - baihuahua - 博客园

What is tape-out (下线)?

In electronics and photonics design, tape-out or tapeout is the final result of the design process for integrated circuits or printed circuit boards before they are sent for manufacturing. The tapeout is specifically the point at which the graphic for the photomask of the circuit is sent to the fabrication facility.

Tab groups in edge

Microsoft Edge gets Tab groups Collapse & Automatic creation features

What does the last number in a Linux version mean, e.g. the 4 in 5.17.4?

The current version numbering is slightly different from the above. The even vs. odd numbering has been dropped and a specific major version is now indicated by the first two numbers, taken as a whole. While the time-frame is open for the development of the next major, the -rcN suffix is used to identify the n'th release candidate for the next version. For example, the release of the version 4.16 was preceded by seven 4.16-rcN (from -rc1 to -rc7). Once a stable release is made, its maintenance is passed off to the "stable team". Occasional updates to stable releases are identified by a three numbering scheme (e.g., 4.13.1, 4.13.2, …, 4.13.16).
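The scheme can be made concrete with a small parser. This is only a sketch: the `parse_kernel_version` name and the returned dictionary keys are made up here, and distribution-specific suffixes are deliberately not handled.

```python
import re

# X.Y identifies the release series, an optional .Z is a stable update,
# and an optional -rcN marks a release candidate.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?$")


def parse_kernel_version(version):
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    major, minor, stable, rc = match.groups()
    return {
        "series": f"{major}.{minor}",
        "stable_update": int(stable) if stable is not None else None,
        "release_candidate": int(rc) if rc is not None else None,
    }


print(parse_kernel_version("5.17.4"))
print(parse_kernel_version("4.16-rc7"))
```

So in 5.17.4 the series is 5.17 and the trailing 4 is the fourth stable update of that series.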

Linux kernel - Wikipedia

What is linux-next?

up-to-the-second, bleeding-edge status of Linus's tree.

development work should be done against the linux-next tree rather than against the mainline kernel.

Link to Text Fragment - Chrome 网上应用店

How to add search to a blog

使用Chrome自定义搜索引擎快速查找资源 - @Lenciel

Conceal feature in vim

khzaw/vim-conceal: A vim plugin making use of vim's conceal feature for additional visual eyecandy.

Neorg, orgmode and markdown

nvim-neorg/neorg: Modernity meets insane extensibility. The future of organizing your life in Neovim.

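Neovim now has its own markup language. Bookmarking it since it may come in handy later, for note-taking and the like. For now there is no way to convert it to markdown with pandoc, so it probably still cannot be used for the blog.

The three "hot" techniques

Hot upgrade, hot replacement, and live (hot) migration.

小狼毫 (Weasel) input method documentation

Getting started:

Schema.yaml explained:

Customization guide:

Customizing userdb (note the three parameters c, d, and t):

How to delete unwanted words:

The code table is the dictionary, and the dictionary is the code table.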

A nice fuzzy finder implementation for Neovim

nvim-telescope/telescope-fzf-native.nvim: FZF sorter for telescope written in c

DHCP

DHCP is usually handled by the router.

DHCP Discover

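IP conflicts

Without DHCP, the router does not assign addresses; it only routes. Assigning addresses is the network administrator's job, and the router does not guarantee that IP addresses do not conflict.

Bridged mode, NAT mode, and host-only mode

VPC NICs presumably use the bridged style of virtualization, except that br0 (or rather the backend virtual device on the host) embeds VLAN information when sending packets, so that these virtual machines form a VPC network among themselves.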

Linux Bridge

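Linux Bridge is a software implementation of a bridge/switch (it is actually a kernel module).

What is a bridge?

A two-port switch is equivalent to a bridge; both work at the data link layer: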

如何通俗地解释什么是网桥? - 知乎

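The difference between a two-port switch (a bridge) and a two-port hub: a hub merely repeats everything it receives, whereas a bridge filters. It learns which MAC addresses sit on which side and forwards a frame only when the destination is on the other side (with only two ports, the layer-2 forwarding decision is simply forward or do not forward). Note that a bridge separates collision domains; broadcast frames are still forwarded.

Architectural difference between bridged mode and NAT mode

Both need a virtual switch (whether it is named br0 or virbr0). The only difference is whether that virtual switch is used purely at layer 2 or is combined with layer-3 functions:

  • Used purely at layer 2, it is bridged mode;
  • Still working at layer 2, but paired with layer-3 software such as dnsmasq and iptables so that it gains the functions of a layer-3 router (DHCP, a subnet, and so on), it is NAT mode.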

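What is NAT mode / virbr0

virbr0 gives the virtual NICs attached to it NAT access to the outside network. Why use NAT mode instead of bridged mode?

  • IP address shortage: a physical network usually does not hand out several public or LAN IPs to a single host.
  • Network isolation and security: we want the VM's network activity isolated from the physical network, so that the VM is not directly exposed to the outside network and the risk of attack is reduced.

NAT needs a private network and IP address translation. In the physical world a router does this, so the virtualization has to work at layer 3: unlike bridged mode, a switch alone is not enough and we need a virtual router. The functions of this virtual router are split up and implemented by several components:

  • dnsmasq: provides DNS and DHCP;
  • netfilter/iptables: provides NAT address translation and IP-based routing;
  • virbr0: provides layer-2 switching;
  • virbr0-nic: not a bridge but an ordinary interface, which exists mainly to give virbr0 a fixed MAC address of its own. The bridge would work without it, but then it could change its MAC address as interfaces enter and exit the bridge, and when the MAC of the bridge changes, external switches may be confused, making the host lose network for some time. Source: linux - What is virtual bridge with -nic in the end of name - Unix & Linux Stack Exchange

There are some other participants as well:

  • TUN/TAP devices: the device on the host that represents the NIC inside the VM, usually named vnetX and visible on the host.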

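How does a virtual machine get a private IP under NAT?

The virtual machine sends a DHCP Discover broadcast. Because it is a broadcast frame, virbr0 also delivers it to the host's own network stack on the virbr0 interface, where the dnsmasq process listening on virbr0 receives it.

In NAT mode, ip a shows both virbr0 and virbr0-nic.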

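Suppose virtual machine VM1 (IP: 192.168.122.10) wants to reach 8.8.8.8 on the outside network.

  • Inside VM1: the packet is sent with source IP 192.168.122.10 and destination IP 8.8.8.8 (a public address).
  • Through the virtual NIC: the packet leaves the VM through its virtual NIC (eth0) and, via the virtio machinery, is delivered by KVM to the corresponding TAP device vnet0 on the host.
  • Into the bridge: vnet0 is a port of the bridge virbr0, so the packet arrives at the bridge.
  • Routing decision: the bridge is a layer-2 device and does not look at the destination IP. Because 8.8.8.8 is outside the VM's subnet, the VM addressed the frame to its default gateway, which is virbr0 itself, so the bridge hands the packet up to the host's layer-3 routing subsystem.
  • iptables NAT is triggered:
    • The Linux routing subsystem finds that the packet has to go from virbr0 out through a physical interface (such as eth0).
    • Before it is sent, the packet passes through the iptables POSTROUTING chain.
    • There the rule -A POSTROUTING -s 192.168.122.0/24 -j MASQUERADE matches.
    • SNAT happens: the kernel rewrites the source IP of the packet from 192.168.122.10 to the IP of the host's physical NIC (such as 192.168.1.100).
  • Out to the network: the rewritten packet actually leaves through the host's physical NIC eth0, heading for 8.8.8.8.
  • Return packet:
    • The external server 8.8.8.8 receives the request and replies with a packet whose destination IP is the host's physical IP, 192.168.1.100.
    • The packet arrives at the host's physical NIC and enters the network stack.
    • Before the routing decision, the packet passes the PREROUTING hook. Because the connection tracking (conntrack) module recorded the earlier SNAT for this connection, it automatically rewrites the destination IP back to 192.168.122.10 (this is the strength of connection tracking: no explicit DNAT rule is needed for replies).
    • Routing then matches the route for 192.168.122.0/24, which points at the virbr0 device.
  • Back to the VM: the packet, with its destination IP restored, is sent to the bridge virbr0. The bridge knows from its MAC address table that it belongs to vnet0, and the packet finally reaches VM1 through vnet0.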
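The address rewriting in the walk-through can be imitated with a tiny connection-tracking table. This is a toy model under simplifying assumptions (one host IP, source ports never remapped, no timeouts); `Packet` and `MasqueradeNat` are hypothetical names, not kernel interfaces.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class MasqueradeNat:
    def __init__(self, host_ip):
        self.host_ip = host_ip
        # (remote_ip, remote_port, local_port) -> (vm_ip, vm_port)
        self.conntrack = {}

    def outbound(self, pkt):
        """POSTROUTING: record the connection, then rewrite the source IP."""
        key = (pkt.dst_ip, pkt.dst_port, pkt.src_port)
        self.conntrack[key] = (pkt.src_ip, pkt.src_port)
        return replace(pkt, src_ip=self.host_ip)

    def inbound(self, pkt):
        """PREROUTING: undo the translation for replies of tracked connections."""
        original = self.conntrack.get((pkt.src_ip, pkt.src_port, pkt.dst_port))
        if original is None:
            return pkt  # not a reply to anything we translated
        vm_ip, vm_port = original
        return replace(pkt, dst_ip=vm_ip, dst_port=vm_port)


nat = MasqueradeNat("192.168.1.100")
request = nat.outbound(Packet("192.168.122.10", 40000, "8.8.8.8", 53))
reply = nat.inbound(Packet("8.8.8.8", 53, "192.168.1.100", 40000))
print(request)
print(reply)
```

The reply is restored purely from the recorded state, which mirrors why no explicit DNAT rule is needed for return traffic.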

virbr0 and br0

virbr0 is the bridge set up for NAT mode, while br0 is the bridge set up for bridged mode.

br0 can also be implemented with OVS.

理解 virbr0_wuji3390的博客-CSDN博客_virbr0

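Can the source MAC address be set by software?

Absolutely, and in fact it is software that fills it in:

  1. The OS NIC driver prepares the packet to be sent;
  2. The OS fills in the source MAC address field in the frame header;
  3. Normally the OS simply reads the burned-in MAC address from the NIC's hardware ROM and puts it in that field;
  4. The packet is handed to the NIC hardware for transmission.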
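Since the header is just bytes assembled by software, any source MAC can be written into it. A minimal sketch of building an Ethernet II header follows; the helper names and the locally administered address 02:00:00:00:00:01 are made up for illustration.

```python
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_ethernet_frame(dst_mac, src_mac, ethertype, payload):
    # Ethernet II header: destination MAC, source MAC, EtherType (big-endian).
    header = struct.pack("!6s6sH", mac_to_bytes(dst_mac),
                         mac_to_bytes(src_mac), ethertype)
    return header + payload


frame = build_ethernet_frame(
    "ff:ff:ff:ff:ff:ff",   # broadcast destination
    "02:00:00:00:00:01",   # source chosen freely by software
    0x0800,                # EtherType for IPv4
    b"hi",
)
print(frame.hex())
```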

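What is bridged mode? / br0

br0 actually plays the role of a switch; do not be misled by the name. OVS also provides its own switch implementation.

Logically, the VM and the host are peers on the network, both connected directly to the external switch. The VM obtains its IP address over DHCP directly from the external DHCP server (the router).

In the implementation they are slightly unequal. The figure below shows this clearly:

The figure below is equivalent (the br0 in the figure can be replaced with an OVS switch):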

https://k48xz7gzkw.feishu.cn/wiki/SvAqwQ8i2iKiphkGt4mcJIqgnMd#share-AA6UdYaVaoQpuix5tLrce9v6nqf

How does eth0 inside the VM become a TUN/TAP device on the outside? See the figure below: through virtio and the virtio-based host device:

https://k48xz7gzkw.feishu.cn/docx/QcsXdAc3PoF6VHxhHM5cPeuHnhe#share-DzjIdXk47owsUmxwJ00czTuInoh

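From the figures above it is also easy to conclude:

  • br0 is a Linux bridge device, which can also be seen as a switch, since it has multiple ports to which multiple devices can attach;
  • OVS can also create a virtual bridge (though not named br0) with more powerful features; it is an alternative to Linux bridge.

In this setup, why does eth0 on the host automatically forward packets received from outside to the bridge? Because on its way up to the upper protocol stack, the packet from eth0 is intercepted by the bridge. The details are:

From the flow above we can see:

  • The host and the VM are in fact slightly unequal on the network.
  • On the host, a virtual machine is also presented to the kernel as a device (TUN/TAP).
  • br0 really is a virtual switch, because it has to look up the MAC table to decide whether to forward a frame to another virtual machine's NIC or to the host.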
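The MAC table lookup mentioned above is what a learning switch does. The sketch below is a simplified model, not the Linux bridge implementation; the `LearningBridge` class, the port names and the MAC addresses are assumptions.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningBridge:
    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port       # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)  # broadcast is never learned
        if out_port is None:
            # Unknown or broadcast destination: flood to every other port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []                           # destination is on the same side
        return [out_port]


bridge = LearningBridge(["eth0", "vnet0", "vnet1"])
vm1, vm2 = "52:54:00:00:00:01", "52:54:00:00:00:02"
first = bridge.receive("vnet0", vm1, BROADCAST)  # flooded
second = bridge.receive("vnet1", vm2, vm1)       # vm1 already learned
third = bridge.receive("vnet0", vm1, vm2)        # vm2 now learned too
print(first, second, third)
```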

Requirements on the physical NIC:

  • For the virtual machines to receive packets addressed to them, the physical NIC must be put into promiscuous mode.

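Why does the physical NIC need to be in promiscuous mode in bridged mode?

The physical NIC of an ordinary machine normally works like this:

  • MAC address filtering: the NIC inspects the layer-2 header of every frame passing through its physical network segment (collision domain or broadcast domain).
  • Selective reception: the NIC accepts only three kinds of frames:
    1. Frames whose destination MAC address is its own.
    2. Frames whose destination MAC is the broadcast address (FF:FF:FF:FF:FF:FF).
    3. Frames whose destination MAC is a multicast address for a group it belongs to.
  • Dropping everything else: frames whose destination MAC belongs to another device are dropped in hardware and never reported to the protocol stack in the OS kernel, which saves CPU resources.

Why must the physical NIC accept "other people's" packets in bridged mode?

  • A packet from the outside network whose destination MAC address is the VM's MAC address is delivered by the physical switch to the physical NIC of your host.
  • If the physical NIC dropped it because the MAC does not match, the packet would never reach br0. Turning on promiscuous mode makes the NIC accept all frames.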
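The filtering rule and the effect of promiscuous mode fit in a few lines. This is an illustrative model of the decision only (real NICs do it in hardware); `nic_accepts` and the sample addresses are hypothetical.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def nic_accepts(dst_mac, own_mac, multicast_groups=(), promiscuous=False):
    """Decide whether a NIC passes a frame up to the OS."""
    if promiscuous:
        return True  # promiscuous mode: accept every frame
    dst = dst_mac.lower()
    return (
        dst == own_mac.lower()                             # addressed to us
        or dst == BROADCAST                                # broadcast
        or dst in {g.lower() for g in multicast_groups}    # joined multicast group
    )


host_mac = "aa:bb:cc:00:00:01"
vm_mac = "52:54:00:00:00:01"
print(nic_accepts(vm_mac, host_mac))                    # frame for the VM, normal mode
print(nic_accepts(vm_mac, host_mac, promiscuous=True))  # same frame, promiscuous mode
```

Without promiscuous mode the frame for the VM is dropped at the NIC, so the bridge never gets a chance to forward it.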

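So in bridged mode the physical NIC acts more like a repeater, with one difference from other repeaters:

  • A physical repeater takes in a physical signal and sends out a physical signal, unchanged except that it may be amplified;
  • The physical NIC acting as a repeater takes in a physical signal and hands on data represented in software, which is convenient for the protocol stack to process and for the software switch (br0) to attach to.

Always remember: the physical NIC itself neither processes nor answers packets. Even ARP is passed up to the host protocol stack to be handled.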


In practice:

You can add a br0 by editing /etc/network/interfaces; remember to move the original NIC's IP address to br0 as well.


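Why does the IP address of the original NIC (say eth0) need to move to br0?

As said above, at the abstract level the original physical NIC is no longer a NIC but a repeater (with promiscuous mode turned on).

As with the question below of why the NIC does not need a MAC address, the NIC no longer needs an IP address of its own either.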

关于linux网桥(Linux Bridge)的一些个人记录 - 沐多 - 博客园


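Does the MAC address of the virtual switch br0 need to match the NIC's MAC address?

No. This is no longer a NIC but a repeater, so it does not even need a MAC address, because no packet needs to be addressed to this NIC. The switch can simply reuse that MAC address as the MAC address of the port the NIC is attached to (every switch port has a MAC address).

That way, when the host software stack rather than a VM wants to send a packet, it can use this MAC as the source MAC (br0 could of course give this port some other MAC address, but there is no need).

When the host software stack receives a packet, the destination MAC of that packet is likewise this port's MAC, not the MAC of a VM's NIC.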

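Relationship between bridged mode and VPC networks

A VPC is formed by having the virtual switch from bridged mode (br0 or OVS) use VXLAN to build a virtual layer-2 private network, combined with some means of assigning private IPs over DHCP.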
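What keeps tenants apart in such an overlay is the 24-bit VXLAN Network Identifier (VNI) carried in an 8-byte header in front of the inner Ethernet frame. The sketch below packs and parses that header following the RFC 7348 layout; the function names and the VNI value 5000 are made up, and the outer UDP/IP encapsulation is left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_vni(header):
    """Extract the VNI from a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8


header = vxlan_header(5000)
print(header.hex(), vxlan_vni(header))
```

Two VMs attached to the same virtual switch but tagged with different VNIs never see each other's frames, which is what gives each VPC its own layer-2 network.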