RAR (Remote Action Request)
Spec: 341431-remote-action-request-white-paper.pdf
Remote Action Request (RAR) is introduced in Intel Architecture as a model-specific feature to speed up inter-processor operations by moving parts of those operations from software (OS, App) to hardware (the IA core).
RAR is used for speeding up remote TLB shootdowns and allowing them to be serviced while a long instruction is executing on the remote processor or when interrupts are disabled on that processor.
When the target pCPU receives a RAR while it is in non-root mode, it won't cause a VM exit!
Concretely, a remote action request can replace an IPI, so its use cases are not limited to TLB shootdown.
Benefits:
- Fewer context switches: a RAR does not need to be handled by software, so there are fewer context switches caused by interrupts.
- Lower latency: if a long instruction is executing on the remote processor, or interrupts are disabled on that processor, a TLB flush using an IPI has high latency; a RAR doesn't need interrupts and can be handled while a long instruction is executing, so it has lower latency.
Note the difference between PVRAR and RAR: RAR itself is a bare-metal feature and can be used to speed up TLB invalidation on bare metal.
The reason for introducing RAR on bare metal is to avoid sending an IPI to the RLP. That is, software on the RLP side does not need to participate; the hardware handles the request directly, which improves efficiency. The underlying idea is that an IPI is a more general mechanism that can ask the RLP to do anything, depending on the handler on the RLP side; but in most cases there are only a few kinds of workloads, such as TLB shootdown, EPT invalidation, and so on, so we can implement the flow of these workloads in hardware to make them execute faster. In essence, RAR trades flexibility for efficiency.
- RAR_INFO (read-only): reports capabilities.
- RAR_CONTROL: enables the feature.
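As a rough illustration of how these MSRs might be used, here is a minimal sketch; the MSR indices and the enable-bit position are placeholders I made up, not values from the white paper:
/* Hedged sketch only: MSR indices and bit layout below are placeholders,
 * not the values defined by the RAR white paper. */
#include <asm/msr.h>

#define MSR_RAR_INFO_PLACEHOLDER    0xdeadbee0  /* hypothetical index */
#define MSR_RAR_CONTROL_PLACEHOLDER 0xdeadbee1  /* hypothetical index */
#define RAR_CONTROL_ENABLE          (1ULL << 0) /* hypothetical enable bit */

static bool rar_try_enable(void)
{
	u64 info;

	/* RAR_INFO is read-only and reports the supported capabilities. */
	rdmsrl(MSR_RAR_INFO_PLACEHOLDER, info);
	if (!info)
		return false;

	/* RAR_CONTROL turns the feature on for this logical processor. */
	wrmsrl(MSR_RAR_CONTROL_PLACEHOLDER, RAR_CONTROL_ENABLE);
	return true;
}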
How to perform a RAR?
Signaling a RAR is done similarly to sending an INTR, by writing to the Interrupt Command Register (ICR).
A new delivery mode is added: RAR.
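A minimal sketch of the signaling step in x2APIC mode, assuming a made-up delivery-mode encoding (APIC_DM_RAR_PLACEHOLDER below is not the real value; only the "write the ICR like an IPI" shape comes from the text above):
#include <asm/apic.h>
#include <asm/apicdef.h>
#include <asm/msr.h>

/* Placeholder encoding; the real RAR delivery mode is defined in the spec. */
#define APIC_DM_RAR_PLACEHOLDER (3 << 8)

/* Signal a RAR to one target, analogous to sending a fixed-vector IPI. */
static void rar_signal_one(u32 dest_apicid)
{
	/* In x2APIC mode the ICR is one 64-bit MSR write: destination in the
	 * high 32 bits, delivery mode in bits 10:8 of the low 32 bits. */
	wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4),
	       ((u64)dest_apicid << 32) | APIC_DM_RAR_PLACEHOLDER);
}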
RAR Physical Memory Regions
Two memory regions need to be pointed to by MSRs:
- Payload table;
- Action vector.
Payload Table
It is a 4KB region with 64 entries of 64 bytes each. A payload table is also called a "RAR island".
The payload table contains RAR payloads. It is allocated by the OS in contiguous physical memory and pointed to by each LP's RAR_PAYLOAD_TABLE_BASE MSR. Each entry defines the payload for a single action.
This architecture doesn't preclude the option for the OS to allocate multiple payload tables, one per subset of LPs, and to set the RAR_PAYLOAD_TABLE_BASE MSRs of those logical processors accordingly.
In other words, the MSRs of different LPs can all share the same payload table base address, or they can choose not to share it (the current implementation in the code shares it):
In a PCID enabled system where software threads are allocated each with a different PCID, the OS may choose to allocate a separate Payload Table for each RLP, i.e., each RLP is its own ‘RAR Island’.
This architecture does not restrict the use of multiple payload tables; in the extreme case, the OS can even allocate one payload table per LP.
The payload table has a fixed size of 4KB and contains 64 entries of 64 bytes (512 bits) each.
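For the shared-table case described above, allocation could look roughly like the sketch below; the MSR index macro is a placeholder, and each LP would have to program its own copy of the base MSR (e.g. from an on_each_cpu() callback):
#include <linux/gfp.h>
#include <linux/smp.h>
#include <asm/msr.h>
#include <asm/page.h>

#define MSR_RAR_PAYLOAD_TABLE_BASE_PLACEHOLDER 0xdeadbee2 /* hypothetical */

/* Allocate one 4KB, physically contiguous payload table (shared case). */
static void *rar_alloc_payload_table(void)
{
	return (void *)get_zeroed_page(GFP_KERNEL);
}

/* Run on each LP (e.g. via on_each_cpu()) to point its base MSR at the
 * shared table; a per-LP "RAR island" setup would pass different tables. */
static void rar_set_payload_table_base(void *table)
{
	wrmsrl(MSR_RAR_PAYLOAD_TABLE_BASE_PLACEHOLDER, __pa(table));
}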
RAR Payload Type / Subtype
Each payload has two fields: Type and Subtype. The Type field:
- Type 0 = Page invalidation; invalidate one or more pages.
- Type 1 = Page invalidation; invalidate one or more pages, ignore CR3 match.
- Type 2 = PCID invalidation; invalidate pages associated with a specific PCID.
- Type 3 = EPT invalidate; invalidate pages associated with a specific EPTP.
- Type 4 = VPID invalidation; invalidate pages associated with a specific VPID.
- Type 5 = MSR write; write value to specified MSR.
- Other types reserved for future usages.
For RAR-based PV TLB flush, Type 4 (VPID invalidation) is used.
The Subtype field selects the specific action within the type; for example, Type 0 is page invalidation, and the subtype determines whether to perform an address-specific invalidation or an all-context invalidation.
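To make the entry format concrete, one 64-byte payload entry could be modeled roughly as below; only the Type/Subtype fields and the 64-byte size come from the description above, the remaining field names and placement are assumptions:
#include <linux/build_bug.h>
#include <linux/types.h>

/* Hypothetical layout of one 64-byte payload entry; the exact offsets are
 * defined by the spec, this is only an illustration. */
struct rar_payload {
	u32 type;       /* 0=page inv, 1=page inv (ignore CR3), 2=PCID inv,
	                   3=EPT inv, 4=VPID inv, 5=MSR write */
	u32 subtype;    /* e.g. address-specific vs. all-context invalidation */
	u64 operand[6]; /* type-specific data: addresses, PCID/VPID/EPTP,
	                   MSR index and value, ... (assumed) */
	u64 reserved;
};
static_assert(sizeof(struct rar_payload) == 64);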
Action Vector
The action vector is a per-RLP, 64-byte-aligned vector of actions, containing 64 entries of 8 bits each. It is pointed to by the RAR Action Vector MSR. Each entry holds the status of the corresponding requested action.
Why 64 entries? So that they match the payload entries one to one. Each 8-bit entry j (0 <= j < N) defines the per-RLP status of the jth action request:
- 0x00 = RAR_SUCCESS.
- 0x01 = RAR_PENDING.
- 0x02 = RAR_ACKNOWLEDGED.
- 0x80 = RAR_FAILURE.
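Putting the table and the vector together, the ILP side of one request might look like the sketch below; it reuses the hypothetical struct rar_payload and rar_signal_one() from the earlier sketches, and the exact ordering/barrier requirements would come from the spec:
#include <linux/compiler.h>
#include <linux/errno.h>
#include <asm/processor.h>

/* Action-vector status values from the list above. */
#define RAR_SUCCESS      0x00
#define RAR_PENDING      0x01
#define RAR_ACKNOWLEDGED 0x02
#define RAR_FAILURE      0x80

/* Hedged sketch of the ILP side: payload_table and action_vector are the
 * target RLP's regions (as pointed to by its MSRs), idx selects one of the
 * 64 matching payload/action slots. */
static int rar_request(struct rar_payload *payload_table, u8 *action_vector,
		       int idx, const struct rar_payload *req, u32 dest_apicid)
{
	payload_table[idx] = *req;                   /* 1. publish the payload */
	WRITE_ONCE(action_vector[idx], RAR_PENDING); /* 2. mark it pending */
	rar_signal_one(dest_apicid);                 /* 3. signal via the ICR */

	/* 4. poll until the RLP reports completion in the action vector */
	while (READ_ONCE(action_vector[idx]) == RAR_PENDING ||
	       READ_ONCE(action_vector[idx]) == RAR_ACKNOWLEDGED)
		cpu_relax();

	return READ_ONCE(action_vector[idx]) == RAR_SUCCESS ? 0 : -EIO;
}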
RAR TLB Flush Design
Shoot4U
Shoot4U is the paper's name; it is the same as PV_SEND_IPI in KVM.
It is different from KVM's PV-based TLB flush mechanism: there, when the target vCPU is online and NOT preempted, the guest still needs to send a vIPI to the target vCPU. In Shoot4U, all vIPIs are translated into physical IPIs and the polling is also done by the VMM, not the guest, so we don't need to inject an interrupt back into the vCPUs to perform the TLB flush.
Traditional vIPI-based TLB flush (note: vIPI is not the same as IPIv) / PV-based flush when the vCPU is NOT preempted:
1. Prepare the payload in memory
2. vCPU0's sending of the vIPI traps to the VMM
3. The VMM sends an IPI to pCPU1
4. pCPU1 injects an interrupt into vCPU1
5. vCPU1 reads the payload from memory and invalidates TLB entries with INVLPG
6. vCPU1 notifies vCPU0 that the invalidation is completed
┌────────────────────────────────┐
│ Guest VM │
│ ┌──────────────────────────┐ │
│ │ │ │
│ │ ┌───────────┐ │ │
│ │ │ │ │ │
│ │ Memory │ Payload │ │ │
│ │ │ │ │ │
│ │ ┌───►└─────▲──┬──┘ │ │
│ │ 1 │ 6 │ │ 5 │ │
│ └──────┼──────────┼──┼─────┘ │
│ │ │ │ │
│ ┌──────┴────┐ ┌──┴──▼─────┐ │
│ │ vCPU0 │ │ vCPU1 │ │
│ └────┬──────┘ └──────▲────┘ │
│ │ │ │
└───────┼────────────────┼───────┘
2 │ 4 │
┌───────┼────────────────┼───────┐
│ │ │ │
│ ┌────▼────┐ ┌────┴────┐ │
│ │ pCPU0 ├──────► pCPU1 │ │
│ └─────────┘ 3 └─────────┘ │
│ │
└────────────────────────────────┘
Note: with IPIv, a unicast vIPI won't cause a VM exit. IPIv is a hardware-supported feature.
Shoot4U:
1. The guest invokes a hypercall to flush TLBs
2. The VMM converts the hypercall into multiple remote INVVPID requests in the payload
3. The VMM sends IPIs to the target pCPUs
4. pCPU1 reads the payload from memory and invalidates TLB entries with INVVPID
5. pCPU1 notifies that the invalidation is completed
It can eliminate the problems that the KVM PV-based approach has:
- Overheads of IPI routing between vCPUs
- It is possible that the preemption state of a vCPU can change after its state has been checked by the invoking CPU but before the IPI is actually delivered (TOCTOU problem)
Shortcomings:
- External-interrupt VM exits on the host in step 3 (when the target pCPU is in non-root mode, it still has to take a VM exit)
┌────────────────────────────────┐
│ Guest VM │
│ ┌───────────┐ ┌───────────┐ │
│ │ vCPU0 │ │ vCPU1 │ │
│ └────┬──────┘ └───────────┘ │
│ │ │
└───────┴────────────────────────┘
1 Hypercall
┌───────┬────────────────────────┐
│ │ │
│ ┌────▼────┐ ┌─────────┐ │
│ │ pCPU0 ├──────► pCPU1 │ │
│ └────────┬┘ 3 └─┬───▲───┘ │
│ 2│ │5 │4 │
│ ┌─────────┼─────────┼───┼────┐ │
│ │ │ │ │ │ │
│ │ └─►┌──────▼───┴─┐ │ │
│ │ Memory │ Payload │ │ │
│ │ └────────────┘ │ │
│ │ │ │
│ └────────────────────────────┘ │
│ │
└────────────────────────────────┘
Shoot4U + RAR
Just replace steps 2, 3, 4 and 5 above with a RAR instead of physical IPIs. Because a RAR can be serviced while the target CPU is running in non-root mode, we can avoid the VM exits introduced by the IPIs.
However, the VM exit caused by the source vCPU's hypercall presumably still cannot be avoided.
RAR TLB Flush in KVM
native_send_rar_ipi()
Kernel
void native_send_rar_ipi(const struct cpumask *mask)
{
cpumask_var_t allbutself;
if (!alloc_cpumask_var(&allbutself, GFP_ATOMIC)) {
// allocation failed, fall back to sending to the given mask
apic->send_IPI_mask(mask, RAR_VECTOR);
return;
}
// allbutself: all online CPUs except the current one
cpumask_copy(allbutself, cpu_online_mask);
cpumask_clear_cpu(smp_processor_id(), allbutself);
// If the mask we want to send to equals allbutself, call send_IPI_allbutself
// directly, which may enable some optimizations.
if (cpumask_equal(mask, allbutself) && cpumask_equal(cpu_online_mask, cpu_callout_mask))
apic->send_IPI_allbutself(RAR_VECTOR);
else
apic->send_IPI_mask(mask, RAR_VECTOR);
//...
}
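A hypothetical caller on bare metal could then look like this; rar_setup_flush_payloads() and rar_wait_for_completion() are made-up helper names standing in for the payload-table and action-vector handling sketched earlier:
/* Illustration only: publish the invalidation payloads for every target,
 * signal them with a RAR instead of a normal IPI, then poll for completion.
 * The two helpers are assumed, not real kernel functions. */
static void rar_flush_tlb_mask(const struct cpumask *mask,
			       unsigned long start, unsigned long end)
{
	rar_setup_flush_payloads(mask, start, end); /* assumed helper */
	native_send_rar_ipi(mask);                  /* from the code above */
	rar_wait_for_completion(mask);              /* assumed helper */
}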
RAR TLB Flush Exposure and Detection
The guest enumerates this PV feature via CPUID in the KVM_CPUID_FEATURES leaf, and KVM exposes the feature when possible:
#define KVM_FEATURE_RAR_TLBFLUSH 18
__do_cpuid_func
case KVM_CPUID_FEATURES:
//...
if (rar_invvpid_supported())
entry->eax |= (1 << KVM_FEATURE_RAR_TLBFLUSH);
//...
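On the guest side, detecting the bit set above is the usual kvm_para_has_feature() check (the actual wiring into the flush path is shown below in kvm_flush_tlb_multi()); a minimal sketch:
#include <asm/kvm_para.h>

/* Guest-side detection of the PV feature bit exposed by KVM above. */
static bool pv_rar_tlbflush_available(void)
{
	return kvm_para_available() &&
	       kvm_para_has_feature(KVM_FEATURE_RAR_TLBFLUSH);
}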
How does the guest kernel use RAR TLB flush?
A new hypercall is defined:
#define KVM_HC_FLUSH_TLB 13
If the guest wants to use this feature, it should first use the KVM-based PV TLB flush feature, which means the KVM_FEATURE_PV_TLB_FLUSH and KVM_FEATURE_STEAL_TIME PV features must be enabled as dependencies:
// If KVM_FEATURE_PV_TLB_FLUSH and KVM_FEATURE_STEAL_TIME are used,
// this function is added as the handler function of pv_ops.mmu.flush_tlb_others
//
static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
//...
// first use the PV deferred flush where possible, i.e., for vCPUs that are preempted
for_each_cpu(cpu, flushmask) {
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
if (try_cmpxchg(&src->preempted, &state,
state | KVM_VCPU_FLUSH_TLB))
__cpumask_clear_cpu(cpu, flushmask);
}
}
// nopvrar will be true if "nopvrar" is added to the guest kernel cmdline.
// At this point the remaining vCPUs are not preempted and we would need to
// send IPIs to them to flush their TLBs; with the RAR PV feature, we can
// replace this step with RAR to avoid the IPIs where possible.
if (!nopvrar && kvm_para_has_feature(KVM_FEATURE_RAR_TLBFLUSH)) {
// A zero return value means the RAR-based flush succeeded
if (!pv_rar_flush_tlb_others(flushmask, info)) {
CNT_STOP(kvm_flush_tlb, cyc);
return;
}
}
// If RAR PV is not used, use traditional non-pv method to flush TLB
native_flush_tlb_multi(flushmask, info);
}
static long pv_rar_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
u64 flags = 0, start = 0, end = 0, mask=0;
long ret;
int cpu, apic_id;
if (!info->mm || (info->end == TLB_FLUSH_ALL))
flags |= PV_TLB_FLUSH_ALL;
else {
start = info->start;
end = info->end;
}
for_each_cpu(cpu, cpumask) {
// apic_id may not equal cpu_id, so first map the cpu id to the apic_id
apic_id = per_cpu(x86_cpu_to_apicid, cpu);
// set the bit for each apic_id in "mask"
__set_bit(apic_id, (unsigned long *)&mask);
}
// invoke a hypercall "KVM_HC_FLUSH_TLB"
ret = kvm_hypercall4(KVM_HC_FLUSH_TLB, mask, flags, start, end);
//...
}
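Note that the mask passed as the first hypercall argument is a single unsigned long bitmap of APIC IDs, so this interface can only address APIC IDs 0..BITS_PER_LONG-1; a small worked example:
/* Illustration only: two target vCPUs whose APIC IDs are 1 and 3. */
static u64 example_rar_mask(void)
{
	u64 mask = 0;

	__set_bit(1, (unsigned long *)&mask); /* APIC ID 1 */
	__set_bit(3, (unsigned long *)&mask); /* APIC ID 3 */
	/* mask == 0xa, passed as a0 of KVM_HC_FLUSH_TLB; APIC IDs >= 64
	 * cannot be represented, matching the host-side BITS_PER_LONG clamp. */
	return mask;
}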
kvm_pv_rar_flush_tlb()
How KVM handles the hypercall KVM_HC_FLUSH_TLB:
[EXIT_REASON_VMCALL] = kvm_emulate_hypercall
__kvm_emulate_hypercall
case KVM_HC_FLUSH_TLB:
kvm_pv_rar_flush_tlb(vcpu, a0, a1, a2, a3);
static long kvm_pv_rar_flush_tlb(struct kvm_vcpu *vcpu, unsigned long mask, unsigned long flags,
unsigned long start, unsigned long end)
{
//...
map = rcu_dereference(vcpu->kvm->arch.apic_map);
if (likely(map)) {
//...
// map from apic_id to physical cpu id
for_each_set_bit(i, &mask, min) {
if (map->phys_map[i]) {
lvcpu = map->phys_map[i]->vcpu;
// get all vpid
lvpid[pcpus] = to_vmx(lvcpu)->vpid;
// get all physical cpu id mapped from apic_id
phys_cpu[pcpus] = lvcpu->cpu;
pcpus++;
}
}
}
//...
// lvpid: the vpid list to be used for RAR_ACTION_INVVPID
rar_invalidate_vpid_others_batch(lvpid, phys_cpu, pcpus, &info);
//...
}
static long kvm_pv_rar_flush_tlb(struct kvm_vcpu *vcpu,
unsigned long mask, unsigned long flags,
unsigned long start, unsigned long end)
{
struct flush_tlb_info info;
int *lvpid;
int *phys_cpu;
int pcpus = 0;
struct kvm_apic_map *map;
int i;
unsigned long long __maybe_unused cyc;
// Assume INVVPID type is supported if RAR is present
if (!rar_invvpid_supported())
return 1;
map = rcu_dereference(vcpu->kvm->arch.apic_map);
if (likely(map)) {
u32 min;
struct kvm_vcpu *lvcpu;
min = min((u32)BITS_PER_LONG, (map->max_apic_id + 1));
printk_once("min: %d\n", min);
lvpid = kmalloc(min * sizeof(int), GFP_KERNEL);
phys_cpu = kmalloc(min * sizeof(int), GFP_KERNEL);
for_each_set_bit(i, &mask, min) {
if (map->phys_map[i]) {
lvcpu = map->phys_map[i]->vcpu;
lvpid[pcpus] = to_vmx(lvcpu)->vpid;
phys_cpu[pcpus] = lvcpu->cpu;
pcpus++;
}
}
}
if (pcpus <= 0)
return 0;
if (flags == TLB_FLUSH_ALL)
end = TLB_FLUSH_ALL;
info.start = start;
info.end = end;
info.stride_shift = PAGE_SHIFT; /* need guest to provide */
rar_invalidate_vpid_others_batch(lvpid, phys_cpu, pcpus, &info);
kfree(lvpid);
kfree(phys_cpu);
return 0;
}
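rar_invalidate_vpid_others_batch() itself is not shown in this note; based on the hardware interface described earlier, it would presumably publish one Type 4 (VPID invalidation) payload per target and signal the pCPUs via RAR. A heavily hedged sketch, reusing the hypothetical struct rar_payload and rar_request() from above (not the real implementation):
/* Sketch only: serialized per target for simplicity; a real batch would
 * publish all payloads first, signal every target, then wait once. */
static void rar_invvpid_batch_sketch(struct rar_payload *payload_table,
				     u8 *action_vector,
				     const int *vpid, const u32 *apicid,
				     int n, const struct flush_tlb_info *info)
{
	int i;

	for (i = 0; i < n; i++) {
		struct rar_payload req = {
			.type    = 4,  /* VPID invalidation */
			.subtype = 0,  /* assumed: address-range subtype */
			.operand = { (u64)vpid[i], info->start, info->end },
		};

		rar_request(payload_table, action_vector, i, &req, apicid[i]);
	}
}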