
We sincerely appreciate the detailed and constructive feedback from all reviewers. We are particularly grateful for the insightful suggestions from Reviewer D, who pinpointed the essential elements of our work, and from Reviewer A, whose suggestion regarding a hypervisor-based design could further highlight our contribution. Below, we address the questions and concerns raised by all reviewers:

Novelty

The core insight behind our work is that existing hypervisor-based hotness tracking methods impose significant performance penalties. HeteroVisor/HeteroOS (Section 3.1) and MMU-notifier solutions (as proposed by Reviewer A) capture A-bit hotness information by trapping every guest page table modification, causing expensive traps on the first access to every clean page. While vTMM uses PML hardware to track page table modifications, it only reduces trap frequency by a constant factor without eliminating traps. All of these designs additionally require guest page table walks to translate guest addresses into host-understandable ones.

HeteroOS (2017) did not solve the fundamental challenge of efficient hotness tracking, instead implementing expensive software-based methods despite the availability of hardware features such as PEBS (Nehalem, 2008) and PML (Broadwell, 2015). Its primary contribution was limited to delegating page migration to the guest.

Our core innovation (as Reviewer D noted) is fully disaggregating data placement responsibilities—both hotness tracking and data migration—from tiered memory provisioning in the host. We utilize PEBS to generate samples with ready-to-use guest virtual addresses written directly to guest memory buffers. By draining these buffers during context switches, we eliminate trapping and VMexits. Our identification algorithm operates in virtual address space, removing the need for page table walks and avoiding the fragmentation issues common in physical address space due to Linux's page allocator and map-on-first-touch policy, which break spatial locality.
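To make the trap-free design concrete, below is a minimal sketch of the in-guest PEBS buffer drain performed at context-switch time. The record layout, buffer structure, and the record_sample() callback are hypothetical simplifications for illustration; the actual PEBS record format follows the Intel SDM and our implementation details differ.

```c
/* Minimal sketch of in-guest PEBS buffer draining on a context switch.
 * struct pebs_record, struct pebs_buf, and record_sample() are hypothetical
 * simplifications, not the exact structures used in our implementation. */
#include <stdint.h>
#include <stddef.h>

struct pebs_record {
    uint64_t ip;         /* instruction pointer at sample time */
    uint64_t lin_addr;   /* guest-virtual data address, ready to use */
    uint64_t latency;    /* load latency in core cycles */
};

struct pebs_buf {
    struct pebs_record *base;   /* start of the per-CPU PEBS area */
    struct pebs_record *index;  /* hardware write cursor */
};

/* Consume all records accumulated since the last context switch, so no
 * buffer-full interrupt (and hence no trap or VMexit) is ever needed. */
static void drain_pebs_on_context_switch(struct pebs_buf *buf,
                                         void (*record_sample)(uint64_t va))
{
    for (struct pebs_record *r = buf->base; r < buf->index; r++)
        record_sample(r->lin_addr);   /* feed the range-based classifier */
    buf->index = buf->base;           /* reset cursor; hardware keeps writing */
}
```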

Beyond low-cost tracking, we recognize that accurate identification is crucial for tiered memory performance. Our range-based hotness identification demonstrates this approach, with migration simply executing decisions made during identification. Our solution was inspired by DAMON (Middleware'19, HPDC'22), a citation we inadvertently omitted due to space constraints, yet it addresses DAMON's limitations. DAMON is a region-based memory profiling tool that estimates regional hotness from the A-bit of a randomly selected page within each region, with user-configured region splits producing only a user-specified number of ranges. This design may aid manual performance tuning, but it can neither make use of the rich hotness information generated by PEBS nor serve as a comprehensive automatic hotness identification solution for diverse applications in virtualized environments.
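To contrast with DAMON's fixed, user-specified region count, the following sketch shows the kind of hotness-driven range split our identification performs in virtual address space. All names and the split rule are illustrative; the actual policy is governed by the split threshold/period parameters evaluated in Figure 8.

```c
/* Illustrative sketch of hotness-driven range splitting in virtual address
 * space. Field names and the split rule are simplified placeholders. */
#include <stdint.h>
#include <stdlib.h>

struct vrange {
    uint64_t start, end;        /* guest-virtual address range */
    uint64_t samples_lo;        /* PEBS samples falling in the lower half */
    uint64_t samples_hi;        /* PEBS samples falling in the upper half */
    struct vrange *next;
};

/* Split a range when its two halves differ in hotness by more than
 * split_threshold, so ranges adapt to the data rather than to a
 * user-specified region count. */
static void maybe_split(struct vrange *r, uint64_t split_threshold)
{
    uint64_t diff = r->samples_lo > r->samples_hi
                        ? r->samples_lo - r->samples_hi
                        : r->samples_hi - r->samples_lo;
    if (diff < split_threshold)
        return;

    struct vrange *upper = malloc(sizeof(*upper));
    if (!upper)
        return;
    uint64_t mid = r->start + (r->end - r->start) / 2;
    *upper = (struct vrange){ .start = mid, .end = r->end, .next = r->next };
    r->end = mid;
    r->samples_lo = r->samples_hi = 0;
    r->next = upper;
}
```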

Threat model

We build upon a hypervisor similar to Firecracker and follow a similar threat model: guest code operating within the virtualization boundary is untrusted, while the hypervisor itself is trusted. Hypervisor processes serving different guests remain isolated from one another. We do not focus on confidentiality, as most real-world VMs do not require strict hardware-based containment (thanks to Reviewer D). As such, we assume the hypervisor does not intentionally introspect guest memory by following information exposed in the VMCS. We will withdraw our claim on isolation; despite this, our architecture still presents significant innovation through its fully disaggregated hotness management design.

Malicious attacks against our HyperPlace guest component could be prevented using cryptographic authentication with our hypervisor during VM boot and page migrations. Although we do not rely on the page table to obtain hotness information, an adversarial guest could attempt to manipulate our system by falsifying memory access patterns in PEBS buffers to mislead our classification algorithms; we note two mitigating factors. First, such manipulation would require substantial CPU and memory bandwidth, a counterproductive cost that undermines the attacker's goal of gaining memory performance advantages. Second, resource allocation decisions—including hard caps on both fast and slow memory—are ultimately controlled by the host. Although these allocation decisions depend on performance statistics from our balloon driver (potentially vulnerable to guest manipulation), we can implement cryptographic signing to authenticate both our driver and the data production chain. To address attacks against the virtio protocol through blocking or flooding, we could employ timeout mechanisms and rate-limiting safeguards, keeping ultimate control at the host level with the ability to detect and terminate malicious guests.
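As a rough illustration of the timeout and rate-limiting safeguards mentioned above, the sketch below shows a host-side token bucket plus silence detection for balloon/virtio requests. All names, constants, and the overall structure are hypothetical, not our actual implementation.

```c
/* Hypothetical sketch of host-side safeguards: a token-bucket rate limiter
 * plus a timeout check for guest balloon/virtio requests. */
#include <stdbool.h>
#include <time.h>

struct guest_guard {
    double tokens;              /* remaining request budget */
    double rate_per_sec;        /* refill rate: allowed requests per second */
    double burst;               /* maximum bucket size */
    struct timespec last_seen;  /* time of the last request from the guest */
};

static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Returns false if the guest floods the channel; the host can then stop
 * honoring its hints or terminate it. */
static bool admit_request(struct guest_guard *g)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    g->tokens += elapsed_sec(&g->last_seen, &now) * g->rate_per_sec;
    if (g->tokens > g->burst)
        g->tokens = g->burst;
    g->last_seen = now;

    if (g->tokens < 1.0)
        return false;           /* flooding detected */
    g->tokens -= 1.0;
    return true;
}

/* Detect a guest that blocks the protocol by going silent. */
static bool guest_timed_out(const struct guest_guard *g, double timeout_sec)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return elapsed_sec(&g->last_seen, &now) > timeout_sec;
}
```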

Evaluation

(Reviewer A) (hypervisor-based works) Regarding Reviewer A's suggestion to use MMU notifiers for tracking A-bit changes: This approach resembles the methodology employed by HeteroVisor/HeteroOS. However, it has significant drawbacks, primarily frequent traps into the host kernel that cause severe performance degradation. MMU notifiers were originally designed for maintaining shadow page tables in software virtualization. Reintroducing them solely to track hotness information would effectively negate the core performance benefits of hardware virtualization—essentially trading a fundamental advantage for a minor feature. Nevertheless, we are willing to implement this approach as a baseline to better highlight our contribution's value. We have attempted to evaluate our solution against existing hypervisor-based approaches, but this has proven challenging as these solutions are not open-source and their papers lack sufficient design and implementation details for reproduction.
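For concreteness, a sketch of what such a baseline might look like is shown below: an MMU-notifier hook that counts accessed-bit (young) clears over the VM's address space. The callback signature and registration call follow the upstream struct mmu_notifier_ops interface; the surrounding bookkeeping (struct abit_tracker and its counter) is an illustrative placeholder rather than our planned implementation.

```c
/* Sketch of the MMU-notifier baseline suggested by Reviewer A: count A-bit
 * (young) clears for the VM's mm. Bookkeeping names are illustrative. */
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/atomic.h>

struct abit_tracker {
    struct mmu_notifier mn;
    atomic64_t young_clears;    /* coarse hotness signal for this mm */
};

/* Invoked when the host kernel clears the accessed bit for [start, end). */
static int abit_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long start, unsigned long end)
{
    struct abit_tracker *t = container_of(mn, struct abit_tracker, mn);

    atomic64_add((end - start) >> PAGE_SHIFT, &t->young_clears);
    return 0;
}

static const struct mmu_notifier_ops abit_ops = {
    .clear_young = abit_clear_young,
};

/* Register the tracker on the VM's address space (e.g., the VMM's mm). */
static int abit_tracker_register(struct abit_tracker *t, struct mm_struct *mm)
{
    t->mn.ops = &abit_ops;
    atomic64_set(&t->young_clears, 0);
    return mmu_notifier_register(&t->mn, mm);
}
```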

(Reviewer A, D) (GUPS modification) Our GUPS modification merely adds work-stealing support and is motivated purely by reproducibility. The original GUPS assigns each worker thread a fixed number of read-modify-write operations at startup; a thread that finishes early then sits idle, skewing the execution time and overall throughput measurements. Additionally, we developed a Zipf variant of GUPS to evaluate performance under skewed workloads, providing more comprehensive performance data across diverse access patterns, which we are happy to include.
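A minimal sketch of the work-stealing change follows: instead of a fixed per-thread quota, threads claim chunks of updates from a shared counter, so early finishers keep contributing until all updates are done. CHUNK, table, table_mask, and rng64() are illustrative assumptions, not our exact code.

```c
/* Sketch of work-stealing GUPS: threads pull chunks from a shared counter. */
#include <stdatomic.h>
#include <stdint.h>

#define CHUNK 4096

extern uint64_t table[];                  /* the GUPS update table */
extern uint64_t table_mask;               /* TABLE_SIZE - 1, power of two */
extern uint64_t rng64(uint64_t *state);   /* per-thread random stream */

static _Atomic uint64_t next_update;      /* shared progress counter */

static void worker(uint64_t total_updates, uint64_t seed)
{
    uint64_t state = seed;
    for (;;) {
        /* Claim the next chunk; no thread idles while work remains. */
        uint64_t start = atomic_fetch_add(&next_update, CHUNK);
        if (start >= total_updates)
            break;
        uint64_t end = start + CHUNK < total_updates ? start + CHUNK
                                                     : total_updates;
        for (uint64_t i = start; i < end; i++) {
            uint64_t r = rng64(&state);
            table[r & table_mask] ^= r;   /* read-modify-write, as in GUPS */
        }
    }
}
```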

(Reviewer E) (responsiveness) While large memory capacity primarily benefits analytical applications (for which we present throughput results), we have included responsiveness evaluations in Figure 5, demonstrating how our design efficiently places hot data, as evidenced by the steeper slope of the curve. Our sampling approach inherently avoids the batching trade-offs that typically compromise latency performance. By actively draining sample buffers during every context switch—which occurs with sufficient frequency—applications experience no delays due to buffer-full interrupts. Furthermore, our evaluation includes Silo, an interactive/transactional database, for which we are happy to provide various latency measurements.

(All) (low level microbenchmark) The exceptional end-to-end improvements of our system stem from our low-cost hotness tracking and range-based classification mechanisms. Figure 7 presents a detailed overhead breakdown, while Figure 8 explores the configuration space. Our design classifies the maximum amount of hot data with superior agility (demonstrated by peak values and slopes in Figure 5) while maintaining minimal overhead throughout the system, particularly during the sampling stage (Figure 7). Figure 8 evaluates the configuration parameter space for both tracking and classification stages, presenting performance matrices across various dimensions: for sampling, we vary both what to capture (load latency threshold) and sampling frequency (sample period); for classification, we examine sensitivity to hotness differences across ranges (split threshold) and response frequency to collected samples (split period).
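For reference, the parameter space swept in Figure 8 can be summarized as a small configuration structure. The field names mirror the parameters described above; the numeric values below are placeholders for illustration, not the tuned settings from the paper.

```c
/* Illustrative summary of the configuration space swept in Figure 8. */
#include <stdint.h>

struct hotness_config {
    /* Tracking (PEBS sampling) stage */
    uint32_t load_latency_threshold;  /* min load latency (cycles) to capture */
    uint32_t sample_period;           /* one sample every N qualifying loads */

    /* Classification (range-based identification) stage */
    uint64_t split_threshold;         /* min hotness difference to split a range */
    uint64_t split_period_ms;         /* how often collected samples are applied */
};

static const struct hotness_config example_config = {
    .load_latency_threshold = 128,    /* placeholder value */
    .sample_period          = 1009,   /* placeholder value */
    .split_threshold        = 32,     /* placeholder value */
    .split_period_ms        = 100,    /* placeholder value */
};
```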

(Reviewer E) In practice, system administrators can select optimal parameters from our performance matrices for their representative applications. Our tracking parameters depend on the PMU hardware, but cloud environments typically deploy large fleets of physical machines with similar CPU microarchitectures and corresponding PMU hardware. Consequently, parameters selected on one platform can be readily transferred to other machines with comparable hardware configurations.

Figures

We apologize for the crowded figures and overlapping text labels caused by space constraints. We would be happy to offset the labels and enlarge the figures.
