TLB Shootdown
Virtualization of TLB flush and TLB shootdown relies on the KVM steal-time mechanism.
What is a TLB shootdown / TLB flush?
When the actions of one processor cause the TLBs on other processors to be flushed, that is called a TLB shootdown.
A quick example:
- You have some memory shared by all of the processors in your system.
- One of your processors restricts access to a page of that shared memory.
- Now, all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more.
TLB shootdown virtualization performance issue
A remote TLB flush does a busy wait, which is fine on bare metal. But within a guest, the target vCPUs might have been preempted or blocked. In that case, the initiator vCPU ends up busy-waiting for a long time; it also burns CPU unnecessarily to wake up the targets of the shootdown.
A TLB flush without PV causes an IPI:
flush_tlb_mm_range()
flush_tlb_multi()
__flush_tlb_multi()
// if using kvm-pv-tlb-flush, it will be kvm_flush_tlb_multi()
native_flush_tlb_multi()
on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func, (void *)info, 1, cpumask);
smp_call_function_many_cond()
arch_send_call_function_ipi_mask()
smp_ops.send_call_func_ipi()
apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
How to solve this using a paravirtualization idea
With PV TLB shootdown, the TLB flush initiator vCPU does not wait for a sleeping vCPU; instead it just sets a flag in the guest/VMM shared area, and KVM checks this flag and performs the TLB flush when the sleeping vCPU next runs.
kvm-pv-tlb-flush is built on top of kvm-steal-time.
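The core handshake can be modeled in plain C11 with one shared byte per vCPU. Everything below (names, types) is an illustrative userspace sketch, not the real kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the per-vCPU byte shared between guest and KVM:
 * bit 0 = preempted, bit 1 = TLB flush pending
 * (mirroring KVM_VCPU_PREEMPTED / KVM_VCPU_FLUSH_TLB). */
#define VCPU_PREEMPTED  (1 << 0)
#define VCPU_FLUSH_TLB  (1 << 1)

typedef struct {
    _Atomic uint8_t preempted;
} shared_state;

/* Initiator side: if the target vCPU is preempted, record the flush in
 * the shared area instead of sending an IPI. Returns true when the PV
 * path was taken. */
static bool pv_request_flush(shared_state *s)
{
    uint8_t old = atomic_load(&s->preempted);

    while (old & VCPU_PREEMPTED) {
        if (atomic_compare_exchange_weak(&s->preempted, &old,
                                         (uint8_t)(old | VCPU_FLUSH_TLB)))
            return true;        /* flush deferred to the next VM entry */
    }
    return false;               /* vCPU is running: caller must IPI it */
}

/* Hypervisor side, on the target vCPU's next VM entry: consume the
 * byte and report whether a flush is pending. */
static bool pv_consume_flush(shared_state *s)
{
    return atomic_exchange(&s->preempted, 0) & VCPU_FLUSH_TLB;
}
```

The cmpxchg on the initiator side matters: it closes the race where the target vCPU starts running between the preempted check and the flag write.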
Guest PV feature detection
The guest checks the feature bit KVM_FEATURE_PV_TLB_FLUSH before enabling para-virtualized TLB flush.
kvm_guest_init
// Uses KVM_FEATURE_PV_TLB_FLUSH (among other checks) to detect support
if pv_tlb_flush_supported
// Install 2 PV callbacks
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
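The pv_ops table is essentially a set of function pointers that early boot code re-points when a PV feature is detected. A minimal stand-alone sketch of that idea (all names invented, not the real kernel structures):

```c
#include <stdbool.h>

/* Toy analogue of pv_ops.mmu: a table of function pointers. */
struct mmu_ops {
    const char *(*flush_tlb_multi)(void);
};

static const char *native_flush_tlb_multi_sketch(void) { return "ipi"; }
static const char *kvm_flush_tlb_multi_sketch(void)    { return "pv";  }

/* Default callback: bare-metal behavior (send IPIs). */
static struct mmu_ops pv_mmu_ops = {
    .flush_tlb_multi = native_flush_tlb_multi_sketch,
};

/* Analogue of kvm_guest_init(): override the callback if a
 * KVM_FEATURE_PV_TLB_FLUSH-style feature bit is present. */
static void pv_guest_init_sketch(bool pv_tlb_flush_supported)
{
    if (pv_tlb_flush_supported)
        pv_mmu_ops.flush_tlb_multi = kvm_flush_tlb_multi_sketch;
}
```

After init, every caller of `pv_mmu_ops.flush_tlb_multi()` transparently takes the PV path; this is the same indirection `PVOP_VCALL2(mmu.flush_tlb_multi, ...)` provides in the kernel.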
KVM_VCPU_FLUSH_TLB Bit
This bit is in the following per-vcpu shared variable:
struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
__u8 preempted; // KVM_VCPU_FLUSH_TLB bit is defined here
__u8 u8_pad[3];
__u32 pad[11];
};
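The layout can be verified with a stand-alone copy of the structure. The bit values below (KVM_VCPU_PREEMPTED = bit 0, KVM_VCPU_FLUSH_TLB = bit 1) match the kernel's UAPI header; everything else is a userspace sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Local copy of the UAPI layout (arch/x86/include/uapi/asm/kvm_para.h). */
struct kvm_steal_time {
    uint64_t steal;
    uint32_t version;
    uint32_t flags;
    uint8_t  preempted;   /* both flags below live in this one byte */
    uint8_t  u8_pad[3];
    uint32_t pad[11];
};

#define KVM_VCPU_PREEMPTED  (1 << 0)
#define KVM_VCPU_FLUSH_TLB  (1 << 1)

/* The structure is 64 bytes, and 'preempted' sits at offset 16
 * (after the u64 steal and two u32 fields). */
static int steal_time_size(void)
{
    return (int)sizeof(struct kvm_steal_time);
}

static int preempted_offset(void)
{
    return (int)offsetof(struct kvm_steal_time, preempted);
}
```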
Where it is checked and used (in KVM):
if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
record_steal_time(vcpu);
if (st_preempted & KVM_VCPU_FLUSH_TLB)
// flush this vCPU's TLB
kvm_vcpu_flush_tlb_guest(vcpu);
Where it is set (in the guest kernel):
// In guest kernel, trigger flush tlb
flush_tlb_mm_range
flush_tlb_multi
__flush_tlb_multi
PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
// set KVM_VCPU_FLUSH_TLB for each preempted target vCPU
kvm_flush_tlb_multi
KVM_VCPU_PREEMPTED Bit
Same as KVM_VCPU_FLUSH_TLB, this is also 1 bit in the preempted field:
struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
__u8 preempted; // KVM_VCPU_PREEMPTED bit is defined here
__u8 u8_pad[3];
__u32 pad[11];
};
When the vCPU is scheduled out, this flag is set in the steal_time variable shared by the host and the guest, and it is cleared at the next VM entry. When set, the bit indicates that the vCPU is scheduled out, so the guest can use the PV TLB flush feature for it directly instead of following the traditional method of sending an IPI to flush the TLB.
Where this bit is set:
kvm_sched_out
kvm_arch_vcpu_put
kvm_steal_time_set_preempted
// set KVM_VCPU_PREEMPTED on steal time
copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted))
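A toy model of this sched-out / VM-entry lifecycle (the real code writes the byte into the guest mapping with copy_to_user_nofault; the names here are invented):

```c
#include <stdatomic.h>
#include <stdint.h>

#define PREEMPTED_BIT (1 << 0)   /* stands in for KVM_VCPU_PREEMPTED */

/* Toy shared steal_time byte for one vCPU. */
static _Atomic uint8_t st_preempted;

/* Analogue of kvm_steal_time_set_preempted(): on sched-out, mark the
 * vCPU as preempted in the shared area. */
static void sched_out_sketch(void)
{
    atomic_fetch_or(&st_preempted, PREEMPTED_BIT);
}

/* On the next VM entry the host consumes and clears the byte; the
 * returned value is what record_steal_time() would inspect. */
static uint8_t vm_entry_sketch(void)
{
    return atomic_exchange(&st_preempted, 0);
}
```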
https://lore.kernel.org/all/1513128784-5924-1-git-send-email-wanpeng.li@hotmail.com/
Testing on a Xeon Gold 6142 2.6 GHz, 2 sockets, 32 cores, 64 threads, so 64 pCPUs; each VM has 64 vCPUs.
ebizzy -M:
        vanilla   optimized   boost
1VM     46799     48670       4%
2VM     23962     42691       78%
3VM     16152     37539       132%
pv_ops.mmu.flush_tlb_multi
1. Guest initiates a TLB flush request by setting the KVM_VCPU_FLUSH_TLB bit
This function is only called by:
flush_tlb_mm_range
flush_tlb_multi
static inline void __flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info)
{
PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
}
The default callback is native_flush_tlb_multi, which means the kernel is running on bare metal, not in a VM. In the KVM TLB flushing scenario, it is the kvm_flush_tlb_multi function.
kvm_flush_tlb_multi
This function tries to set the KVM_VCPU_FLUSH_TLB bit for each target vCPU:
- If the corresponding vCPU's KVM_VCPU_PREEMPTED bit is set, we can leverage its next VM entry to flush the TLB, so the PV method applies: just set the KVM_VCPU_FLUSH_TLB bit;
- if it is not set, the vCPU is running, so fall back to the legacy method and send an IPI to it.
// Guest kernel
static void kvm_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info)
{
u8 state;
int cpu;
struct kvm_steal_time *src;
struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
cpumask_copy(flushmask, cpumask);
for_each_cpu(cpu, flushmask) {
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
// set the KVM_VCPU_FLUSH_TLB flag
if (try_cmpxchg(&src->preempted, &state, state | KVM_VCPU_FLUSH_TLB))
// clear the processed cpu bit in flushmask
__cpumask_clear_cpu(cpu, flushmask);
}
}
native_flush_tlb_multi(flushmask, info);
}
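The mask-narrowing loop above can be modeled with a plain 32-bit mask in userspace (toy names and types; the real code walks a struct cpumask and the per-CPU steal_time area):

```c
#include <stdatomic.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED  (1 << 0)
#define KVM_VCPU_FLUSH_TLB  (1 << 1)
#define NR_VCPUS 4

/* Per-vCPU model of st->preempted; a stand-in for per_cpu(steal_time, cpu). */
static _Atomic uint8_t steal_preempted[NR_VCPUS];

/* Sketch of kvm_flush_tlb_multi's loop: for each target vCPU, if it is
 * preempted, set KVM_VCPU_FLUSH_TLB with a cmpxchg and drop it from the
 * IPI mask; otherwise leave it in the mask. Returns the mask of vCPUs
 * that still need a real IPI. */
static uint32_t pv_flush_tlb_multi(uint32_t cpumask)
{
    uint32_t flushmask = cpumask;

    for (int cpu = 0; cpu < NR_VCPUS; cpu++) {
        if (!(flushmask & (1u << cpu)))
            continue;
        uint8_t state = atomic_load(&steal_preempted[cpu]);
        if (state & KVM_VCPU_PREEMPTED) {
            if (atomic_compare_exchange_strong(&steal_preempted[cpu], &state,
                                               (uint8_t)(state | KVM_VCPU_FLUSH_TLB)))
                flushmask &= ~(1u << cpu);  /* flush deferred to next VM entry */
        }
    }
    return flushmask;  /* caller IPIs only these, like native_flush_tlb_multi */
}
```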
2. Host KVM handles the TLB flush request (KVM_VCPU_FLUSH_TLB)
KVM_REQ_STEAL_UPDATE
This request bit (KVM_REQ_STEAL_UPDATE) is set for a vCPU when:
- the guest kernel writes to the MSR MSR_KVM_STEAL_TIME;
- kvm_arch_vcpu_load is called.
Because the guest kernel writes MSR_KVM_STEAL_TIME only once, when registering it, the usual trigger is the second condition: the vCPU gets scheduled back and kvm_arch_vcpu_load is called, which makes the request KVM_REQ_STEAL_UPDATE; the handler then checks whether the KVM_VCPU_FLUSH_TLB bit is set:
vcpu_enter_guest
if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
record_steal_time
// here, the KVM_VCPU_FLUSH_TLB is set
if (st_preempted & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
vmx_flush_tlb_guest
vmx_get_current_vpid
vpid_sync_context
vpid_sync_vcpu_single
vmx_asm2(invvpid, "r"(ext), "m"(operand), ext, vpid, gva);
pv_ops.mmu.tlb_remove_table
// Each of these 4 functions is called when a page of the corresponding page-table level is freed:
___pte_free_tlb
___pmd_free_tlb
___pud_free_tlb
___p4d_free_tlb
paravirt_tlb_remove_table
PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);
The original callback function is tlb_remove_page; the updated (PV) callback function is tlb_remove_table.
tlb_remove_page
This function mainly adds the physical page to a gather structure; when the maximum is reached (checked in __tlb_remove_page_size), the accumulated pages are freed in a batch.
Its call chain:
tlb_remove_page
tlb_remove_page_size
if __tlb_remove_page_size
tlb_flush_mmu
tlb_flush_mmu
Frees the previously accumulated physical pages.
tlb_remove_table
Accumulates page-table pages; when MAX_TABLE_BATCH is reached, it runs tlb_table_flush to free the previously accumulated physical pages holding the page directories of each level.
void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
struct mmu_table_batch **batch = &tlb->batch;
//...
(*batch)->tables[(*batch)->nr++] = table;
if ((*batch)->nr == MAX_TABLE_BATCH)
tlb_table_flush(tlb);
}
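The batching pattern itself can be sketched stand-alone (MAX_TABLE_BATCH is a toy constant here; the kernel derives the real one from the page size, and tlb_table_flush actually frees the pages):

```c
#define MAX_TABLE_BATCH 8   /* toy value for illustration */

/* Minimal sketch of the tlb_remove_table batching idea: accumulate
 * page-table pages and release them in one go when the batch fills. */
struct table_batch {
    void *tables[MAX_TABLE_BATCH];
    int   nr;
    int   freed;   /* pages released so far (stands in for real freeing) */
};

static void table_flush_sketch(struct table_batch *b)
{
    b->freed += b->nr;   /* a real kernel would free each entry here */
    b->nr = 0;
}

static void table_remove_sketch(struct table_batch *b, void *table)
{
    b->tables[b->nr++] = table;
    if (b->nr == MAX_TABLE_BATCH)
        table_flush_sketch(b);
}
```

Batching amortizes the cost of the flush: one TLB flush covers many freed page-table pages instead of one flush per page.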
tlb_table_flush
Frees the previously accumulated physical pages holding the page directories of each level.