../16-sosp25-pebs-design-fixup

v1

Effective guest virtual address hotness management requires tracking application accesses directly in the guest virtual address space. EPT-friendly PEBS (introduced in @h3-pebs-scalability) fits this need nicely: it provides such tracking to guest VMs without compromising memory elasticity.
Our design eliminates additional address translation costs by consuming the virtual-address samples directly, and keeps sample collection itself efficient.

Traditional PEBS sampling approaches typically vary sample frequency to optimize collection speed within CPU overhead constraints.
However, our analysis revealed significant inefficiencies in this approach due to Performance Monitoring Interrupt (PMI) overhead from buffer overshoots.
PEBS samples become visible to software only when the buffer is drained, either passively (a PMI fires once the buffer fills to a configured threshold) or proactively (routine draining during task scheduling).
At higher sample frequencies, the buffer frequently crosses the threshold before routine draining occurs, triggering expensive PMIs that eat into the allocated CPU budget.
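
To make this effect concrete, consider a back-of-the-envelope model (our own notation, not a measured result): let $R$ be the rate of sampled memory events, $P$ the sampling period (so the sample frequency is $1/P$), $Q$ the interval between two proactive drains, and $B$ the number of PEBS records that fit below the interrupt threshold. A PMI fires roughly whenever
$ (R / P) dot Q > B $
so shrinking $P$ (raising the frequency) proportionally increases the chance that the buffer crosses the threshold before the next proactive drain.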

To address this fundamental limitation, we implemented two key optimizations.
First, we selected a small, constant sample frequency instead of dynamic adjustment.
Second, we integrated our sample collection directly into process context switches.
This approach eliminates dedicated collection threads, which would hinder scalability in multi-tenant environments, while the fixed sampling rate prevents CPU from being wasted on unnecessary PMIs.
Samples are efficiently drained from the PEBS buffer immediately after the scheduler switches away from the generating process and fed to our range-based classifier through a lock-free multi-producer single-consumer channel.
Our empirical testing shows that a sampling frequency of $1/4093$ delivers consistent performance across diverse workloads, though our evaluation in @h4-sensitivity demonstrates that the design tolerates a wide range of sampling frequencies without significant performance degradation.
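
For concreteness, the sketch below shows one way to request fixed-period, PEBS-precise load sampling with a period of 4093 through the standard perf_event_open(2) interface. The raw event encoding (Intel's load-latency event, raw config `0x1cd`, with the minimum latency in `config1`) and the attribute choices are illustrative assumptions; our actual collection path runs in the kernel at context-switch time rather than through this userspace interface.

```c
/* Minimal userspace sketch: fixed-period PEBS load sampling via perf_event_open(2).
 * The raw event encoding is Intel-specific (MEM_TRANS_RETIRED.LOAD_LATENCY);
 * shown only to illustrate the fixed sample_period setup. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

static int open_pebs_loads(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_RAW;
    attr.config         = 0x1cd;      /* event 0xcd, umask 0x1: load-latency event (Intel) */
    attr.config1        = 3;          /* minimum load latency to sample, in cycles (ldlat) */
    attr.sample_period  = 4093;       /* fixed period: one sample per 4093 events */
    attr.sample_type    = PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;
    attr.precise_ip     = 2;          /* request PEBS ("precise") sampling */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    /* One event per monitored task; -1 = any CPU. */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}
```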

old version

=== Scalable EPT-friendly PEBS <h3-ept-pebs>
Guest virtual address hotness management requires tracking application accesses in exactly this address space. Such tracking can be supplied directly by EPT-friendly PEBS (introduced in @h3-pebs-scalability), which is available to guest VMs without sacrificing memory elasticity.
Our design eliminates additional address translation costs through efficient sample collection.

// // TODO: rewrite: how to enable PEBS for various hardware versions
// Until recently, PEBS was widely misunderstood as being unavailable for guest VMs, creating a blind spot in tiered memory research @eurosys23vtmm @osdi24memstrata.
// We discovered that the root cause for such confusion was an architectural bug @lkml14silvermontpebs where the PEBS write process could not be interrupted by EPT page faults without risking machine malfunction.
// Cloud environments require memory overcommitment, in which guests' memory is lazily allocated and corresponding EPT entries are lazily populated at the EPT page fault triggered by first access.
// Although guest PEBS could technically be enabled by avoiding this architectural defect through eagerly mapping all available memory assigned to a VM and disabling swapping, with the recently introduced PEBS version 5 @lkml22eptfriendlypebs, this bug no longer exists, enabling painless EPT-friendly PEBS.

// // TODO: rewrite: highlight "avoid tracking using aux pgtbl"?
// The PMU operating under guest mode hardware captures load/store samples with guest virtual addresses and writes to the PEBS buffer.
// Despite the directly available and favorable virtual address samples, previous PEBS-based designs like HeMem and Memtis opted to use physical addresses and record physical page hotness inside an auxiliary page table structure.
// This approach requires walking the page table and translating the virtual address to physical address for every single valid sample generated by PEBS, introducing address translation costs that hurt efficiency and scalability.
// Our design feeds the readily available virtual address samples directly to range-based hotness classification without any address translation.

// // TODO: rewrite: highlight "efficiencybility"
// Beyond address translation costs, sample collection costs also constitute a significant portion of overall management overhead @sosp23memtis.
// Prior works either dedicated one core for sample collection through busy polling @sosp21hemem or assigned the collection process with a CPU overhead budget @sosp23memtis.
// However, our findings in @fig-motivation-scalability show that the tracking overhead per VM often overshoots the 3% CPU budget employed by Memtis due to the feedback delay in Memtis' sample frequency adjustment scheme: the frequency is not lowered until samples overshoot the perf buffer and overwhelm the collection thread.

Traditional PEBS sampling designs optimize for maximum collection speed under given CPU overhead constraints by varying the sample frequency @sosp23memtis.
We found that this approach suffers from additional PMI overhead caused by buffer overshoots.
PEBS samples are not directly visible right after generation;
rather, software sees them either during passive draining triggered by a PMI or during proactive draining.
A PMI occurs when the PEBS buffer fills up to a configured threshold, and proactive draining happens routinely during task scheduling.
When increasing the sample frequency, the sample buffer is more likely to overshoot before routine draining, triggering expensive PMIs.
Because more CPU time is spent handling PMIs rather than collecting samples, the allocated CPU budget is wasted.
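
For orientation, the threshold that triggers the PMI lives in the per-CPU debug-store (DS) area that the kernel programs for PEBS. The abridged sketch below is illustrative: the field names follow Linux's arch/x86 perf code, but the layout is simplified and not version-exact.

```c
#include <stdint.h>

/* Abridged, illustrative view of the x86 debug-store (DS) area backing PEBS.
 * Field names follow Linux's arch/x86 perf code; BTS fields and per-event
 * reset values are omitted, and the exact layout varies across kernels. */
struct debug_store_sketch {
    uint64_t pebs_buffer_base;         /* start of the in-memory PEBS record buffer */
    uint64_t pebs_index;               /* hardware write pointer: next record lands here */
    uint64_t pebs_absolute_maximum;    /* end of the buffer */
    uint64_t pebs_interrupt_threshold; /* a PMI is raised once pebs_index reaches this */
};
```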

To address this, we choose a small, constant sample frequency and integrate our sample collection into process context switches, avoiding overshoots and eliminating PMIs.
This method avoids dedicated threads, which would prohibit scalable management in multi-tenant environments.
The fixed sample frequency also eliminates CPU wastage on PMIs triggered by problematic sample frequency scaling.
Samples are drained from the PEBS buffer right after the scheduler switches away from the generating process, and fed to range-based classification through a lock-free multi-producer single-consumer channel.
We find that a frequency of $1/4093$ works well in practice, and we show that our design tolerates a wide range of sample frequencies in @h4-sensitivity.
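
As an illustration of this hand-off, the sketch below shows one common way to build such a bounded, lock-free multi-producer single-consumer channel (a Vyukov-style sequenced ring). The names, payload, and capacity handling are our own illustrative choices, not the actual implementation; producers correspond to per-CPU context-switch drain paths and the single consumer to the range-based classifier.

```c
/* Illustrative bounded lock-free MPSC channel (Vyukov-style sequenced ring). */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct sample_slot {
    _Atomic uint64_t seq;   /* slot sequence number used for lock-free hand-off */
    uint64_t         va;    /* sampled guest virtual address */
};

struct sample_chan {
    struct sample_slot *slots;
    uint64_t            mask;   /* capacity - 1; capacity must be a power of two */
    _Atomic uint64_t    head;   /* producers claim positions here */
    uint64_t            tail;   /* single consumer: a plain counter suffices */
};

static bool chan_init(struct sample_chan *c, uint64_t capacity)
{
    c->slots = calloc(capacity, sizeof(*c->slots));
    if (!c->slots)
        return false;
    for (uint64_t i = 0; i < capacity; i++)
        atomic_store_explicit(&c->slots[i].seq, i, memory_order_relaxed);
    c->mask = capacity - 1;
    atomic_store_explicit(&c->head, 0, memory_order_relaxed);
    c->tail = 0;
    return true;
}

/* Called by any producer (e.g. the context-switch drain path). */
static bool chan_push(struct sample_chan *c, uint64_t va)
{
    uint64_t pos = atomic_load_explicit(&c->head, memory_order_relaxed);
    for (;;) {
        struct sample_slot *s = &c->slots[pos & c->mask];
        uint64_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        int64_t diff = (int64_t)(seq - pos);
        if (diff == 0) {
            /* Slot is free; try to claim position pos. */
            if (atomic_compare_exchange_weak_explicit(&c->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                s->va = va;
                atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
                return true;
            }
            /* CAS failed: pos now holds the current head; retry. */
        } else if (diff < 0) {
            return false;   /* channel full: drop the sample rather than block */
        } else {
            pos = atomic_load_explicit(&c->head, memory_order_relaxed);
        }
    }
}

/* Called only by the single consumer (the classifier). */
static bool chan_pop(struct sample_chan *c, uint64_t *va)
{
    struct sample_slot *s = &c->slots[c->tail & c->mask];
    uint64_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    if ((int64_t)(seq - (c->tail + 1)) < 0)
        return false;       /* nothing published yet */
    *va = s->va;
    /* Mark the slot reusable for the producer that wraps around. */
    atomic_store_explicit(&s->seq, c->tail + c->mask + 1, memory_order_release);
    c->tail++;
    return true;
}
```

In this sketch a full ring drops the sample instead of blocking, so a producer never waits on the consumer; that property matters when the push runs on the scheduler's context-switch path.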