
OCOLOS: Online COde Layout OptimizationS

Yuxuan Zhang∗, Tanvir Ahmed Khan†, Gilles Pokam‡, Baris Kasikci†, Heiner Litz§, Joseph Devietti∗
∗University of Pennsylvania  †University of Michigan  ‡Intel Corporation  §University of California, Santa Cruz
∗{zyuxuan, devietti}@seas.upenn.edu  †{takh, barisk}@umich.edu  ‡[email protected]  §[email protected]
Abstract—The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application's code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.

Fig. 1: AMD & Intel per-core L1i capacity over time

In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be re-launched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41×, the Verilator hardware simulator by up to 2.20×, and a build of the Clang compiler by up to 1.14×.

I. INTRODUCTION

As the world demands ever more from software, code sizes have increased to keep up. Google, for example, reports annual growth of 30% in the instruction footprint of important internal workloads [6, 45]. This code growth has created bottlenecks in the front-end of the processor pipeline [50], as the sizes of front-end hardware resources close to the processor have been relatively static over time [52]. Figure 1 shows that, despite Moore's Law, the per-core L1 instruction cache (L1i) capacity of server microarchitectures from Intel and AMD has remained effectively constant (literally so in Intel's case) over the past 15 years, because the L1i is so latency-critical [23].

As we attempt to cram ever more code into a fixed-size L1i, strain on the processor front-end is inevitable. The inability to deliver instructions to the processor leads to front-end stalls that tank IPC and end-to-end application performance, as even advanced techniques like out-of-order processing cannot hide these stalls [58].

To address front-end stalls, large software companies have turned to Profile-Guided Optimizations (PGO)¹ from the compiler community that reorganize code within a binary to optimize the utilization of the limited L1i for the common-case control-flow paths (we describe these compiler optimizations in detail in Section II). Google's AutoFDO [10] and Propeller [29], Meta's BOLT [76, 77], and gcc's and clang's built-in PGO passes are popular examples of this approach. While these systems have seen successful deployment at scale, there exist three significant challenges.

¹ Many optimizations can be driven by profiling information, so the term "profile-guided optimization" is quite broad. In this paper, we use it to refer exclusively to profile-driven code layout optimizations.

First, the results of PGO are only as good as the profiling information that drives PGO's optimization decisions [53, 108, 110]. PGO requires relevant, fresh profiling information to produce high-performance code layouts [72]. However, PGO is an offline optimization, applied either during compilation (AutoFDO, gcc, clang) or to a compiled binary (BOLT and Propeller), creating a fundamental lag between when profiling information is collected and when it is used to optimize the code layout [10]. If program inputs shift during this time, previous profiling information is rendered irrelevant or even harmful when it contradicts newer common-case behavior [69, 105]. Maintaining profiles for each input or program phase is prohibitive in terms of storage costs, so profiles are merged together to capture average-case behavior at the cost of input-specific optimization opportunities [104, 111].

Second, even if we have secured timely profiling information, if the program code itself changes then it is difficult to map the profiling information onto the new code [10]. By its nature, profiling information is captured at the machine code level, and even modest changes to the source code can lead to significant differences in machine code [36]. To make things worse, in large software organizations, code changes can arrive every few minutes for important applications [3]. Since we always need to deploy the latest version of an application, there is a constant challenge when applying PGO with profiling data collected from version k to the compilation of the latest version k′. Profiling data that cannot be mapped to k′ is discarded, causing us to miss optimization opportunities [22].

The third key challenge with offline PGO approaches is that recording, storing, and accessing PGO profiles adds an operational burden to code deployment. In particular, for end-user mobile applications, managing profiles can itself be a non-trivial task [69, 90].

In this paper, we propose OCOLOS, a novel system for online profile-guided optimizations in unmanaged languages. OCOLOS performs code layout optimizations at run time, on a running process. By moving PGO from compile time to run time, we avoid the challenges listed previously. Profile information is always up-to-date with the current behavior the program is exhibiting. OCOLOS also supports continuous optimizations to keep up with input changes over time. In OCOLOS, profiling data always maps perfectly onto the code being optimized, since we profile and optimize the currently-running process. There is no burden of profile management, as the profile is produced and then immediately consumed. Some managed language runtimes (e.g., Oracle HotSpot JVM [39] and Meta HHVM [73, 74]) support online code layout optimizations and achieve similar benefits. We are not aware, however, of any system before OCOLOS that brings the benefits of online PGO to unmanaged code written in languages like C/C++.

To realize the benefits of PGO in the online setting, OCOLOS builds on the BOLT [76, 77] offline PGO system, which takes a profile and a compiled binary as inputs and produces a new, optimized binary as the output. OCOLOS instead captures profiles during execution of a deployed, running application, uses BOLT to produce an optimized binary, extracts the code from that BOLTed binary, and patches the code in the running process. To avoid corrupting the process, code patching requires careful handling of the myriad code pointers in registers and throughout memory. OCOLOS takes a pragmatic approach that requires no changes to application code, which enables support for complex software like relational databases.

OCOLOS is different from other Dynamic Binary Instrumentation (DBI) frameworks like Intel Pin [64], DynamoRIO [8, 35], and Valgrind [71] in that OCOLOS 1) focuses on code replacement, instead of providing APIs for instrumentation, and 2) has a "1-time" cost model where major work is done only during code replacement and the program runs with native performance once the replacement is complete. Existing DBI frameworks would be unsuitable for our online PGO use-case because programs running under, say, Pin experience a non-trivial ongoing overhead to intercept control-flow transfers and maintain the code cache. The performance benefits of the improved code layout would, in practice, often be outweighed by these ongoing overheads. Instead, OCOLOS exacts a 1-time cost for code replacement which is readily amortized, along with a small amount of run-time instrumentation on function pointer creation (see Section IV-C).

For some short-running programs, even the 1-time cost of OCOLOS is too high to be effectively amortized at run time. To address this problem, we propose BATCH ACCELERATOR MODE (BAM), a technique that allows batch workloads consisting of a large collection of short-running processes, like large software builds, to benefit from OCOLOS. BAM works by profiling initial executions of a binary, generating an optimized binary with BOLT, and using that optimized binary in subsequent executions. BAM operates transparently to the workload, via LD_PRELOAD injection, allowing BAM to accelerate builds of the Clang compiler without any changes to Clang code or build scripts.

While in this paper we focus on using OCOLOS to enable online PGO, we envision OCOLOS also being applicable in a range of other cases such as performance optimization and security. We will open-source OCOLOS to facilitate this exploration.

In summary, this paper makes the following contributions:
• We describe the design of OCOLOS, the first online profile-guided code layout optimization tool for unmanaged code.
• We show how to perform online code replacement efficiently in unmodified, large-scale C/C++ programs.
• We evaluate OCOLOS on a series of big-code applications like MySQL and MongoDB, demonstrating speedups of up to 1.41× on MySQL.
• We evaluate a variant of OCOLOS targeted at batch workloads with many short-running processes, demonstrating a 1.14× speedup on a from-scratch Clang build.

II. BACKGROUND

In this section, we provide the necessary background on PGO passes implemented in tools like AutoFDO [10] and BOLT [76, 77], which are now the state of the art at all major cloud companies including Google and Meta.

A. Hardware Performance Profiling

Profile collection is the first step of all PGO workflows. There are two different methods of profile collection: 1) through compiler instrumentation of branch instructions (e.g., Clang and GCC), and 2) through hardware support (e.g., Intel's Last Branch Record [56] and Processor Trace [55]). Due to the high overheads of compiler instrumentation [10], cloud providers generally leverage hardware profiling support [10, 13, 18, 67, 76, 77]. For instance, Intel's LBR [56] facility, which dates back to the Netburst architecture (Pentium 4), is widely available at this point. When LBR tracing is enabled, the processor records the Program Counter (PC) and target of taken branches in a ring buffer (32 entries in Skylake and newer cores). The recording overhead via LBR is negligible [43, 67] and software can then sample this ring buffer to learn the branching behavior of an application. The Linux perf utility provides simple access to LBR sampling, including the ability to start and stop LBR recording of arbitrary running processes.

Each LBR sample represents a snippet of the program's control flow. By aggregating these snippets, the approximate frequency of branch taken/not-taken paths through the code can be reconstructed. With these branch frequencies in hand, we can make intelligent decisions about optimizing the code layout as described below.
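
To make the profiling step concrete, the sketch below (ours, not from the paper) shows roughly how LBR-style branch-stack sampling can be enabled on a running process through Linux's perf_event_open, the same kernel interface that perf record uses; decoding samples from the mmap'd ring buffer is elided.

// Minimal sketch: attach a branch-stack sampling event to a running
// process, much as `perf record -j any,u -p <pid>` would. Error
// handling and ring-buffer parsing are omitted for brevity.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstring>

int open_lbr_sampling(pid_t target_pid) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;          // sample on cycles
  attr.sample_period = 100003;                     // roughly every 100k cycles
  attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
  attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                            PERF_SAMPLE_BRANCH_ANY; // all user-mode taken branches
  attr.exclude_kernel = 1;
  // Follow the target pid on any CPU (cpu = -1); a fleet profiler would
  // typically open one event per CPU instead.
  return syscall(SYS_perf_event_open, &attr, target_pid,
                 /*cpu=*/-1, /*group_fd=*/-1, /*flags=*/0);
}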

B. Basic Block Reordering

Basic block reordering is typically the most impactful PGO code transformation [76]. Whenever programs contain if statements, the compiler must decide how to place the resulting basic blocks into a linear order in memory [80]. Without profiling information, the compiler must use some heuristics to decide on a layout based on static code properties [7, 9, 15, 44, 66, 85, 107, 112], often leading to sub-optimal results [69].

The ideal layout places the common-case blocks consecutively, maximizing L1i and instruction Translation Lookaside Buffer (iTLB) locality while reducing pressure on the branch prediction mechanism [72]. In particular, by linearizing the code, the number of taken branches is minimized, reducing the pressure on the Branch Target Buffer (BTB), which only stores information about taken branches [40, 41, 48, 98]. Consider the example program in Figure 2. Assuming both conditions are typically true, the shaded basic blocks constitute the common-case execution. A naive layout which places the blocks from each if statement together results in two taken branches (shown by arrows). The optimal layout, however, avoids any taken branches, and results in better performance.

if (cond1) { // A
  // B
} else {
  // C
}
if (cond2) { // D
  // E
} else {
  // F
}
// G

naive layout:   A B C D E F G
optimal layout: A B D E G C F

Fig. 2: Example program which benefits from PGO. On the common path (A, B, D, E, G), the naive layout takes a branch from B over C to D and from E over F to G, while the optimal layout executes the common path entirely as fall-throughs.

C. Function Reordering

Function reordering optimizes the linear order of functions within a binary to take advantage of caller-callee relationships. This optimization first uses profiling information to construct a call graph and annotates edges with the frequency of calls. The classic Pettis-Hansen (PH) algorithm [80] puts functions next to each other if they call or are called by each other frequently. While the PH algorithm uses a greedy approach to place the most frequently-invoked functions adjacent to each other, it makes no distinction between callers and callees.

The C3 [75] algorithm improves upon Pettis-Hansen by placing callers before callees, which is especially helpful in asymmetric calling relationships where A calls B frequently but B never calls A. This allows C3 to move the target of a function call closer to the call instruction itself, improving L1i and iTLB locality beyond what PH can provide.

Sometimes PGO passes will incorporate additional optimizations, such as function inlining or peephole optimizations. However, nearly all of the performance benefit of PGO passes comes from basic block reordering and function reordering [76].
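
The flavor of these layout passes can be sketched in a few lines (our simplified illustration of greedy call-graph clustering in the spirit of Pettis-Hansen [80]; production implementations add direction- and density-aware heuristics this omits):

// Simplified sketch of greedy call-graph-based function layout:
// repeatedly merge the two function chains joined by the hottest
// remaining call edge, then concatenate the surviving chains.
#include <algorithm>
#include <cstdint>
#include <vector>

struct CallEdge { int caller, callee; uint64_t count; };

std::vector<int> layoutFunctions(int numFuncs, std::vector<CallEdge> edges) {
  std::vector<std::vector<int>> chains(numFuncs);
  std::vector<int> chainOf(numFuncs);
  for (int f = 0; f < numFuncs; ++f) { chains[f] = {f}; chainOf[f] = f; }

  // Hottest edges first.
  std::sort(edges.begin(), edges.end(),
            [](const CallEdge& a, const CallEdge& b) { return a.count > b.count; });

  for (const CallEdge& e : edges) {
    int a = chainOf[e.caller], b = chainOf[e.callee];
    if (a == b) continue;                 // already in the same chain
    // Append the callee's chain after the caller's chain (C3-style
    // caller-before-callee orientation; PH merges more symmetrically).
    for (int f : chains[b]) { chains[a].push_back(f); chainOf[f] = a; }
    chains[b].clear();
  }

  std::vector<int> order;                 // final linear layout
  for (const auto& c : chains)
    order.insert(order.end(), c.begin(), c.end());
  return order;
}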

D. BOLT: Binary Optimization & Layout Tool

BOLT [76, 77] is a post-link optimization tool built in the LLVM framework, which operates on compiled binaries. The BOLT workflow begins with gathering profiling information. Though BOLT can use a variety of profile formats, LBR samples are preferred. Armed with the profile and the original, non-BOLTed binary, BOLT decompiles the machine code into LLVM's low-level Machine Intermediate Representation (MIR) format, not to be confused with the more commonplace LLVM IR. BOLT performs a series of optimizations, including basic block reordering and function reordering, at the MIR level before performing code generation to emit a new, BOLTed binary.

The layout of a BOLTed binary is unconventional in a few ways. First, cold functions are left in place in the original .text section of the binary, which is renamed to the bolt.org.text section. These cold functions are subject to small peephole optimizations, but their starting addresses do not change and their basic blocks are not reordered, because there was insufficient profiling information to justify stronger optimizations. The hot functions, however, are heavily optimized by BOLT (via basic block and function reordering) and are moved to a new .text section at a higher address range. Additionally, BOLT may perform hot-cold code splitting, where the cold basic blocks of a hot function f are not stored contiguously with the hot blocks of f, but are instead exiled to another region of the binary with other cold blocks from other hot functions. Functions that are entirely cold are not worth splitting in this way, and have their code stored contiguously in the bolt.org.text section.

III. CHALLENGES

A well-known and intuitive challenge with offline profiling-based optimizations like conventional PGO is ensuring that the gathered profile data is of high quality [11, 53, 54, 106]. Profiling with one program input and then running on a different input can lead to many sub-optimal optimization decisions [2, 51]. We validate this effect experimentally in Section III-A.

OCOLOS offers a solution to offline PGO's input sensitivity: since OCOLOS profiles and optimizes a running process, the profile data is always for the current binary and the current inputs. However, performing code replacement at run time introduces other challenges. Chief among them is that changing code can break any explicit or implicit code pointers (pointers to other instructions, not data) that referenced the changed code. In Section III-B, we catalog the myriad sources of code pointers in a running process, to better motivate the design of OCOLOS in Section IV, which can update or preserve these code pointers as necessary.

A. Input Sensitivity

Figure 3 quantifies the sensitivity of BOLT's performance to the quality of its profile data. The bars show along the x-axis the throughput of a BOLTed MySQL binary running the read only input from Sysbench [1]. The y-axis lists the Sysbench input used to provide profile data to BOLT. Thus, the bottom bar shows the performance when profiling the insert input, BOLTing the binary, and then running with the read only input. For reference, the dashed line shows the performance of the original MySQL binary without BOLT optimizations. While BOLT improves performance regardless of the training input used, the worst profile (insert) is 21% slower than the best profile (read only). Aggregating all profiling inputs (the bar labeled all) experiences some destructive interference between profiles and is about 8% slower than the best profile. Because OCOLOS (shown with the solid line) runs online instead of ahead of time, it always profiles the current input, and achieves high performance comparable to the best profile.

[Figure 3: horizontal bars, one per profiling input (read only, read write, point select, all, random ranges, random points, update index, write only, update nonindex, delete, insert); x-axis: transactions / second on read_only; vertical lines mark original and OCOLOS performance.]
Fig. 3: Performance achieved when running MySQL with the Sysbench read only input, using BOLT to produce a binary from the given profiling input or, with the all bar, from profiles of all inputs combined.

B. Challenges of Changing Code Pointers

OCOLOS requires modification of code pointers at run time to perform its optimizations. To better understand the challenges of changing these code pointers, we first discuss the different flavors of code pointers that arise in a running process. The conventional compilation flow of offline PGO deals only with static code, so many of the challenges we discuss below are unique to OCOLOS's run-time approach.

First, we distinguish between code pointers that refer to the starting address of a function versus those that reference a specific instruction within a function (e.g., the target of a conditional branch). We discuss function starting addresses first.

Functions can call each other via direct calls, encoding the callee function's starting address as a PC-relative offset. There may also be indirect calls via function pointers stored in a v-table (a virtual function/method table, used to implement dynamic dispatch of virtual functions in object-oriented languages; the table itself stores function pointers to the methods of a class), or programmer-created function pointers stored on the stack, heap, or in global variables. As with other pointers in C/C++, a function pointer might be cast to an integer, obfuscated via arithmetic, then restored to the original value, cast back to a pointer again, and dereferenced.

Code pointers that do not refer to the start of a function are also commonplace. Jump and conditional branch instructions within a function reference code locations via PC-relative offsets. Sometimes indirect jumps rely on compile-time constants that are used to compute a code pointer at run time, e.g., in the implementation of some switch statements. Return addresses on the stack are code pointers to functions that are on the call stack. The value of the PC in each thread (the rip register in x86) is a pointer to an instruction in the currently-running function in each thread. A thread may be blocked doing a system call, in which case its PC is effectively stored in the saved context held by the operating system. libc's setjmp/longjmp API can be used to create programmer-managed code pointers to essentially arbitrary code locations.

As is clear from the discussion above, the address space of a typical process contains a large number of code pointers. Keeping track of all of them so that they can be updated if a piece of code moves is essentially impossible for any serious program. Sometimes a code pointer of interest resides in kernel space where it is inaccessible to user code. In a managed language, the runtime can indeed track all code pointers and update or invalidate them as needed. However, OCOLOS targets complex unmanaged code, so we have to be able to live with code pointers that are outside our control. Even small code changes can silently break such code pointers.

In the next section, we discuss how OCOLOS overcomes these challenges by retaining the original code within a process, adding optimized code at a new location, and patching up as many code pointers as possible to steer execution towards the optimized code in the common case.
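
To make this catalog concrete, the following few lines of C++ (our illustration; all identifiers are hypothetical) exhibit most of the code-pointer flavors discussed above:

// Our illustration of the code-pointer flavors cataloged above; every
// one of these values silently breaks if the code it targets moves.
#include <csetjmp>
#include <cstdint>

struct Base { virtual void method() {} };   // v-table entry is a code pointer

void callee() {}
void (*global_fp)() = callee;               // function pointer in a global

void example(Base* obj) {
  callee();                                 // direct call: PC-relative offset
  obj->method();                            // indirect call through the v-table
  void (*local_fp)() = callee;              // function pointer on the stack
  uintptr_t hidden = (uintptr_t)local_fp ^ 0x1234;  // cast + obfuscation...
  ((void (*)())(hidden ^ 0x1234))();        // ...restored and dereferenced
  static jmp_buf env;
  if (setjmp(env) == 0)                     // saves a mid-function code address
    longjmp(env, 1);
  // Implicitly: example()'s return address on the stack, and this
  // thread's current PC (rip), are code pointers too.
}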

IV. OCOLOS

In Figure 4a, we show a high-level overview of the steps OCOLOS performs to optimize the code of a target process at run time. First, we gather profiling information from the target process ①, then build the BOLTed binary ②, pause the target process ③, inject code ④, update pointers to refer to the injected code ⑤, and finally resume the process ⑥. In this section, we assume the presence of the BOLTed binary and focus on the key components of OCOLOS's online code replacement mechanism: Steps ③-⑤. Later, in Section V, we discuss Steps ① and ②, which are conceptually simpler as they leverage existing tools like Linux's perf utility for performance profiling and BOLT. Note that Steps ① and ②, which consume the most time, are done concurrently in the background while the target process continues to run. Though operations like running BOLT are CPU-intensive, they compete for cycles with the target process for only a limited time. Steps ③-⑤ are done synchronously while the target process is paused.

[Figure 4a: Main steps OCOLOS takes to optimize a target process: ① profile target process, ② build optimized binary, ③ pause process, ④ inject code, ⑤ update pointers, ⑥ run with optimized code. Figure 4b: Starting state of the address space (left) and state after code replacement (right). Figure 4c: Before (left) and after (right) continuous optimization.]

To better describe key operations within OCOLOS, we first describe the important regions of the address space of the target process, shown in the left part of Figure 4b. The code from the original binary we refer to as C0, which consists of 3 functions: a0, b0 and c0. A v-table contains a pointer to b0. Finally, each thread's stack is also important, as it contains return addresses of currently-executing functions. In Figure 4b, c0 is on the call stack.

OCOLOS takes as input an optimized binary, with modified code for functions in C0 or code for entirely new functions. While OCOLOS's code replacement ultimately requires a short stop-the-world period (Section IV-B) to modify code and update code pointers, OCOLOS performs some bookkeeping in advance. In particular, OCOLOS parses the original binary offline to identify the locations of all direct call instructions. OCOLOS patches these calls at run time, but identifying the call sites in advance significantly shortens the stop-the-world period.

OCOLOS leverages the Linux ptrace API, which allows one process (often a debugger like gdb) to control and inspect another process. OCOLOS uses ptrace to stop the target process and to inspect and adjust its register state.

A. Adding Code

As we describe in Section III-B, finding and updating all code pointers is fraught with corner cases. This leads to the first principle guiding OCOLOS's design:

Design Principle #1: preserve addresses of C0 instructions

To enable significant performance gains by optimizing both the function-level and basic block layout, while preserving correctness, we design the following technique. Instead of updating the code of a function in place, OCOLOS injects a new version of the code C1 into the address space while leaving the original code intact (see Figure 4b). OCOLOS then changes a subset of code pointers within C0 to redirect execution to the C1 code. Remaining code pointers are not perturbed and continue to point to C0 code. This approach can also handle thorny cases like setjmp/longjmp where a target instruction (not function) address has been saved on the heap or stack at run time.

B. Updating Code Pointers

When patching code pointers to make the C1 code reachable, OCOLOS follows our second design principle:

Design Principle #2: run C1 code in the common case

OCOLOS executes code from C0 instead of C1 occasionally to ensure correctness. However, the more frequently OCOLOS executes code from C0, the more it reduces the potential performance gains C1 can provide. Therefore, we seek to make C1 the common case. Other OCOLOS use-cases such as profiling are likely to also be amenable to this trade-off. For instance, if we need to count function invocations then we can instrument only the C1 code, ignoring the rare invocations of the old C0 version of a function. For security or debugging use-cases, however, it may be necessary to redirect all invocations of C0 functions to their C1 counterparts instead, e.g., via trampoline instructions [16] at the start of C0 functions and at call sites within them.

Since our goal for the current version of OCOLOS is minimizing (but not eliminating) time spent in C0, OCOLOS updates as many code pointers to refer to C1 as it is worthwhile to update. Note first of all that hot code gets optimized by BOLT and resides in C1. Direct calls in C1 will already refer to C1 (e.g., c1 calls b1) and do not require updating.

Figure 4b illustrates the changes OCOLOS makes. We update function pointers in v-tables and direct calls in C0 for functions on the call stack (like c0). Recall that these C0 changes preserve instruction addresses, honoring our first design principle. We found that, in practice, updating direct calls in all functions (i.e., including those, like a0, not on the stack) does not improve performance – because functions like a0 are cold – though it does slow code replacement.

We could additionally seek out function pointers in registers and memory, though doing so would require expensive always-on run-time instrumentation to track their propagation throughout the program's execution. This tracking would violate OCOLOS's "fixed-costs only" cost model:

Design Principle #3: code replacement can incur fixed costs, but must avoid all possible recurring costs

Our experiments show that leaving these remaining function pointers (which our workloads do contain) pointing to C0 code is fine, since C0 code does not execute very long before it encounters a direct call or a virtual function call which steers execution back to C1.
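
As a rough sketch of what patching a single direct call looks like during the stop-the-world window (our illustration, not OCOLOS source; OCOLOS batches such writes and performs them from inside the target process for speed):

// Our sketch of patching one 5-byte x86-64 direct call (e8 rel32) in a
// stopped target process so that it calls the C1 version of its callee.
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cerrno>
#include <cstdint>
#include <cstring>

void patch_direct_call(pid_t pid, uintptr_t call_site, uintptr_t new_target) {
  ptrace(PTRACE_ATTACH, pid, nullptr, nullptr);    // pause the target
  waitpid(pid, nullptr, 0);

  // rel32 is relative to the instruction *after* the 5-byte call.
  int32_t rel32 = (int32_t)(new_target - (call_site + 5));

  // Read-modify-write the word holding bytes 1-4 of the call instruction.
  errno = 0;
  long word = ptrace(PTRACE_PEEKTEXT, pid, (void*)(call_site + 1), nullptr);
  std::memcpy(&word, &rel32, sizeof(rel32));       // splice in the new offset
  ptrace(PTRACE_POKETEXT, pid, (void*)(call_site + 1), (void*)word);

  ptrace(PTRACE_DETACH, pid, nullptr, nullptr);    // resume the target
}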

C. Continuous Optimization

A natural use-case for OCOLOS is to perform continuous optimization, whereby OCOLOS can replace C1 with C2, and Ci with Ci+1 more generally. These subsequent code versions Ci can be generated by periodic re-profiling of the target process, to account for program phases, daily patterns in workload behavior like working versus at-home hours, and so on. OCOLOS can perform continuous optimization largely through the same code replacement algorithm described above, though functions on the stack and function pointers require delicate handling, as explained below.

The key challenge in continuous optimization is the need to replace code, instead of just adding new code elsewhere in the address space. If we continuously add code versions without removing old versions, the code grows linearly over time, wasting DRAM and hurting front-end performance. To address this challenge, we introduce a garbage collection mechanism for removing dead code. We define dead code as code that can no longer be reached via any code pointers and hence is safe to remove.

Instead of waiting for code version Ci to naturally become unreachable, as in conventional garbage collection, we can proactively update code pointers to enforce the unreachability of Ci. OCOLOS patches v-tables, direct calls from C0, return addresses on the stack, and threads' PCs to refer to the incoming Ci+1 code instead, as described in Section IV-B and illustrated in Figure 4c.

1) Return addresses: Code pointers in return addresses and in threads' PCs may reference Ci, so OCOLOS must update these references to point to Ci+1. To update these references, OCOLOS first crawls the stack of each thread via libunwind to find all return addresses. OCOLOS examines RIP for each thread via ptrace. Collectively, this examination provides OCOLOS with the set of stack-live functions that are currently being executed. If any stack-live function is in Ci (such as bi in Figure 4c), OCOLOS must copy its code to Ci+1. While there may be an optimized version bi+1 in Ci+1, it is challenging to update the return address to refer to bi+1 because, in general, the optimizations applied to produce bi+1 can have a significant impact on the number and order of instructions within a function.

Thus, OCOLOS makes a copy of bi in Ci+1, which we call bi,i+1 to distinguish it from the more-optimized version bi+1. bi,i+1 may need to have a different starting address than bi, so OCOLOS updates PC-relative addressing within bi,i+1 to accommodate its new location. OCOLOS must also update the return address to refer to the appropriate instruction within bi,i+1, but OCOLOS can treat the original return address into bi as an offset from bi's starting address, and then use this offset into bi,i+1 to compute the new return address.

While copying bi to bi,i+1 is a key part of enabling continuous optimization, it does not improve performance of the currently-running call to bi since the code is the same. However, subsequent calls are likely to reach bi+1 instead via other code pointers, like the v-table in Figure 4c.

2) Function pointers: Apart from return addresses, function pointers may also point to Ci. At any time during execution, programs can create function pointers that may exist on the stack, heap, or in registers and point to a function in Ci. Instead of trying to track down and update these pointers while moving from Ci to Ci+1, OCOLOS enforces a simpler invariant: a program cannot create function pointers to Ci code in the first place – rather, function pointers must always refer to C0. This allows function pointers to propagate freely throughout the program without the risk that they will be broken during code replacement.

OCOLOS enforces this invariant via a simple LLVM compiler pass that instruments function pointer creation sites with a callback function: void* wrapFuncPtrCreation(void*). This function takes as its argument the function pointer being created (which may reference Ci code), and returns the value that the program will actually use: a pointer to the corresponding C0 function instead. OCOLOS maintains a map from Ci to C0 addresses to enable this translation. If OCOLOS has not yet replaced any code, or the function pointer being created does not reference Ci (e.g., it references library code), wrapFuncPtrCreation simply acts as the identity function.

Once a function pointer is created, it can freely propagate through registers and memory without any instrumentation; intervention is required only on function pointer creation. This instrumentation has a negligible cost: MySQL running the read only input creates just 45 function pointers per millisecond on average. While we have not found the need to implement it for our workloads, calls to setjmp could be similarly redirected to C0.

Having avoided function pointers to Ci, OCOLOS is able to update all other references to Ci code to refer to the incoming Ci+1 code instead. Thus, OCOLOS can safely overwrite Ci code.

Due to technical limitations in the current version of BOLT, BOLT assumes the presence of a single .text code section and refuses to run on a BOLTed binary. Unfortunately, this prevents us from evaluating continuous optimization, because our profiling data will refer to Ci code, and we need BOLT to run optimizations on Ci to produce Ci+1. We plan to add this feature to BOLT in the future.
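
The paper specifies only the callback's signature; a minimal sketch of its semantics might look as follows (the map and locking are our hypothetical details):

// Our sketch of OCOLOS's function-pointer-creation callback. The LLVM
// pass rewrites `fp = &f` into `fp = wrapFuncPtrCreation(&f)`. The map
// below is hypothetical: some lookup from Ci addresses to C0 addresses,
// filled in at code replacement time.
#include <cstdint>
#include <mutex>
#include <unordered_map>

static std::unordered_map<uintptr_t, uintptr_t> ci_to_c0;
static std::mutex map_mu;

extern "C" void* wrapFuncPtrCreation(void* fp) {
  std::lock_guard<std::mutex> g(map_mu);
  auto it = ci_to_c0.find((uintptr_t)fp);
  // If fp points into replaced Ci code, hand back the original C0
  // address so the pointer stays valid across future replacements;
  // otherwise (library code, or no replacement yet) act as identity.
  return it != ci_to_c0.end() ? (void*)it->second : fp;
}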

D. Limitations

OCOLOS currently does not support jump tables, as they rely on compile-time constants to compute the jump target, and OCOLOS does not yet update these constants during code replacement. Thus, OCOLOS currently requires that a binary be compiled with the -fno-jump-tables flag. The binaries for BOLT and the non-PGO baseline, however, can include jump tables. This jump table restriction is not fundamental to OCOLOS's approach: with a little extra support from BOLT to identify these constants within the optimized binary, OCOLOS can extract and update them as part of code replacement.

OCOLOS requires a pause time during code replacement, during which the target process cannot respond to incoming requests or do other useful work. This may hurt application performance, especially in terms of tail latency. There is scope to reduce the latency of OCOLOS's code replacement: it requires a few MiB of scattered writes throughout the address space, all of which are currently done sequentially. If OCOLOS, say, updated v-tables in parallel with patching direct calls, that should reduce the end-to-end replacement time further.

An additional approach to preserving tail latency during code replacement is to leverage techniques proposed for mitigating the effects of garbage collection pauses in distributed systems [82, 103]. If the system includes a load-balancing tier, as many modern web services do, then the load balancer can be made aware of application pauses (like major garbage collections, or OCOLOS code replacement) and can route traffic to other nodes temporarily. Because code optimizations are explicitly triggered by the operator, pause times are well known and can be scheduled accordingly.

OCOLOS requires that the functionality and ABI of C1 are unchanged with respect to C0, so that a function f0 in C0 has equivalent application semantics to a function f1 in C1. f1 can, however, vary in non-semantic ways, such as having extra instrumentation or different performance.

Global variables cannot change location in OCOLOS, since C0 code often hard-codes a global variable's original location via RIP-relative addressing. C1 code thus needs to reference those same global variables.

V. IMPLEMENTATION

In this section, we discuss some of OCOLOS's implementation issues, including OCOLOS's methodology for profiling running processes, its steps to run BOLT, and its mechanism for transforming code. Finally, we describe OCOLOS's BAM mode for accelerating batch workloads such as software builds.

Profiling. As Figure 4a shows, OCOLOS's first step is to profile the target process, to determine whether it suffers from sufficient front-end stalls to merit OCOLOS's optimizations. OCOLOS uses the standard Linux perf utility to record hardware performance counters for this purpose. perf can attach to an already-running process, allowing OCOLOS to be deployed on a new process or an existing one.

OCOLOS adopts a 2-stage approach to profiling. The first stage follows the methodology proposed in DMon [49], which itself is built on Intel's TopDown microarchitectural bottleneck analysis [109]. Note that in many data centers, systems such as GWP [89] already continuously profile all applications in the fleet. We have not integrated this analysis into OCOLOS yet, since it is not the primary focus of our work, but we perform measurements to validate the feasibility of this approach in Section VI-C4.

If this first-stage exploration reveals significant time spent in the processor front-end, we continue with the second profiling stage. Here we use perf to record the hot control-flow paths of the target process via Intel's LBR mechanism (Section II-A). We feed this information into the BOLT optimizer, as discussed next.

Running BOLT. We provide a quick summary of how BOLT operates here to keep this paper self-contained. More details can be found in the BOLT papers [76, 77].

First, we use the perf2bolt utility to extract the LBR information recorded by perf into an internal format that BOLT can consume more easily. Armed with the extracted LBR information and the binary corresponding to the target process, BOLT runs a series of optimization passes (most notably basic-block and function reordering, see Section II) to produce a new, optimized binary.

Efficient Code Copying. To provide direct copying of code from the optimized binary into the target process, we launch the target process with an LD_PRELOAD library. LD_PRELOAD is a Linux feature that allows a user-specified shared library to be loaded, alongside a program's required shared libraries, when a process is launched. We use LD_PRELOAD to add some functions for code replacement into the address space of the target process. We then use ptrace to transfer control to our code, which reads in the optimized binary and copies its relevant contents into place. While ptrace can also perform memory copies into the target process, they are prohibitively slow, since each copy requires a system call and several context switches. Performing this memory copy from within the target process is much more efficient and helps minimize the stop-the-world time.

A. BAM: Batch Accelerator Mode

For programs with short running times, OCOLOS's fixed optimization costs cannot be effectively amortized. If these programs are executed frequently, as is common in data centers, it may still be worthwhile to optimize them. To address this problem, we have developed an alternative deployment mode for OCOLOS called BATCH ACCELERATOR MODE or BAM.

As the name implies, BAM is focused on batch workloads where the same binary is invoked repeatedly. Early invocations of the binary can be profiled and fed into BOLT, so that subsequent invocations can use the BOLTed binary instead and see improved performance. BAM performs its optimization online as the batch workload runs, so it does not suffer from stale profiles or stale binary mapping issues, nor does it require any profile management – all of which can hinder the use of offline PGO systems like BOLT.

BAM is a Linux shared library that is attached to a command, e.g., with LD_PRELOAD=bam.so make. BAM additionally needs to be told, via a configuration file, the binary to optimize. The BAM library makes use of another LD_PRELOAD feature, which is transparent interception of calls to functions in any shared library. In particular, BAM intercepts libc's exec* calls and, if it finds an invocation of the target binary, adjusts the exec arguments to launch the binary with perf's profiling enabled. BAM also attaches its shared library to child processes to find invocations of the target binary no matter where they occur in the process tree.

Once BAM has collected a (configurable) number of profiles of the target binary's execution, it runs BOLT in a background process to produce the BOLTed binary. Once the BOLTed binary is available, BAM rewrites exec calls to use the BOLTed binary instead of the original binary, leading to an automatic performance boost for the remainder of the batch workload.

Similarly to OCOLOS's single-process mode (Section IV), BAM automatically profiles a workload as it runs, avoiding the challenges of stale profiling data and of storing and retrieving profiles at scale. BAM's highly-compatible LD_PRELOAD-based design is also similar in spirit to OCOLOS, in that no application changes are required to use BAM. In the make example above, no changes are required to the Makefiles, application source code, make program, or the compiler toolchain.

One unique feature of BAM compared to OCOLOS is that BAM does not replace the code of a running process; it instead requires a subsequent exec call to allow the optimized binary to run. There is thus no stop-the-world component to BAM, and the overhead of switching from the original binary to the optimized one is essentially zero.

We see BAM being especially useful for accelerating Continuous Integration (CI) builds of large software projects. These CI builds are always done from scratch to ensure the software builds correctly on a fresh system [14]. So long as the software build is long enough for BAM to obtain useful profiling and run BOLT, BAM can transparently accelerate compiler invocations for the latter part of the build. BAM is complementary to build optimization techniques like distributed build caches [19, 20, 30, 37, 83, 84]. While a build cache can avoid some compiler invocations, BAM accelerates those compiler invocations that remain. BAM is also simpler to deploy than a build cache, as BAM does not need any remote web services to be provisioned – BAM is purely local to each build.
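
A minimal sketch of this interception pattern (ours, not BAM's source; the binary paths are hypothetical placeholders, and the argv rewrite that inserts perf is elided):

// Our sketch of LD_PRELOAD-based exec interception in the style BAM
// uses: override execve, and for invocations of the target binary
// either substitute the BOLTed binary (once it exists) or re-exec the
// target under perf to collect a profile.
#include <dlfcn.h>    // dlsym, RTLD_NEXT (link with -ldl)
#include <cstring>

extern "C" int access(const char*, int);   // avoid redeclaring libc's execve

extern "C" int execve(const char* path, char* const argv[], char* const envp[]) {
  using execve_fn = int (*)(const char*, char* const[], char* const[]);
  static execve_fn real_execve = (execve_fn)dlsym(RTLD_NEXT, "execve");

  if (std::strcmp(path, "/usr/bin/clang") == 0) {     // hypothetical target
    if (access("/tmp/clang.bolt", /*X_OK=*/1) == 0) {
      // A BOLTed binary already exists: transparently run it instead.
      return real_execve("/tmp/clang.bolt", argv, envp);
    }
    // Otherwise BAM would rewrite argv here to run the target under
    // `perf record -j any,u -- ...` to collect an LBR profile (elided).
  }
  return real_execve(path, argv, envp);
}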

VI. EVALUATION

In our evaluation of OCOLOS, we set out to demonstrate that OCOLOS can provide significant performance improvements for programs that suffer from processor front-end bottlenecks. To demonstrate OCOLOS's robustness, we evaluate it across a range of benchmarks, from complex, multithreaded programs such as the MySQL relational database to compute-bound, single-threaded workloads like the Verilator chip simulator and batch workloads like building the Clang compiler.

A. Experimental Setup

We run our experiments on a 2-socket Intel Broadwell Xeon E5-2620v4 server with 8 cores and 16 threads per socket (16 cores and 32 threads total) running at 2.1 GHz. Each core has a 64-entry iTLB, a 1536-entry L2 TLB, a 32 KiB L1i, a 32 KiB L1d, a 256 KiB L2 cache, and access to a shared 20 MiB L3 cache and 128 GiB of RAM. The server runs Linux version 4.18. We use commit 88c70afe of the Lightning BOLT system [77] from its GitHub repository [21].

For our benchmarks, we use MySQL version 8.0.28, driven by inputs from Sysbench version 1.1.0-ead2689. We use MongoDB version 6.0.0-alpha-655-gea6cea6, driven by inputs from YCSB. We use Memcached version 1.6.12, driven by inputs from memaslap version 1.0. For MongoDB and Memcached the input names show the mix of operations, e.g., read95 insert5 means 95% of operations are reads and the other 5% are inserts. We use Verilator version 3.904, simulating an in-order rv64imafdc RISC-V single-core processor generated from RocketChip [5], with the processor running a set of RISC-V benchmarks [91]. All benchmarks are compiled with their default optimization level: -O3 for MySQL and Verilator, and -O2 for MongoDB and Memcached. We measure Verilator's performance as the throughput of iterations of the main Dhrystone loop, or iterations over the input array for median and vvadd. We evaluate BAM on a build of Clang version 14.0.

All performance measurements, unless otherwise noted, show steady-state performance. For OCOLOS, we measure performance after code replacement is complete, except in Figure 7 where we show MySQL's performance before, during, and after code replacement. OCOLOS and BOLT results are based on 60 seconds of profiling unless otherwise noted. Unless otherwise noted, we show averages of 5 runs with error bars indicating the standard deviation.

B. Performance and Characterization

Figure 5 shows the throughput improvement OCOLOS provides across our set of benchmarks. We compare OCOLOS to four baselines. Original is the performance of the original binary, compiled with only static optimizations (nothing profile-guided). BOLT oracle input is the performance offline BOLT provides when profiling and running the same input; PGO oracle input uses the same profiling file as BOLT oracle input but feeds it to clang's builtin PGO pass [62]. Finally, BOLT average-case input is the performance offline BOLT achieves when aggregating profiles from all inputs and then running on the input shown on the y-axis. We show throughput normalized to original.

[Figure 5: normalized throughput (0.5-2.25) across the MySQL, MongoDB (mongo), Memcached (mem$), and Verilator (veriltr) workloads for original, OCOLOS, BOLT oracle input, PGO oracle input, and BOLT average-case input.]
Fig. 5: Performance of OCOLOS (light blue bars) compared to BOLT using an oracle profile of the input being run (dark blue bars), Clang PGO using the same oracle profile (purple bars) and BOLT using an average-case profiling input aggregated from all inputs (pink bars). All bars are normalized to original non-PGO binaries (white bars).

Figure 5 shows that OCOLOS uniformly improves performance over the original binary, by up to 1.41× on MySQL read only, 1.29× on MongoDB read update, 1.05× on Memcached and 2.20× on Verilator. Clang's PGO generally falls short of BOLT, similar to the results from the BOLT paper [76] though our benchmarks are different, likely due to the challenges of mapping PCs back to the source code [36]. Aggregating profiling information across inputs is worse than using just the oracle profile of the input being run, as different inputs tend to exhibit contradictory control-flow biases that cancel each other out.

The results for BOLT oracle input represent an upper bound for OCOLOS's performance, since BOLT has access to the oracle profiling data and ensures that all code pointers refer to optimized code, not just a judicious subset of them as with OCOLOS (Section IV-B). In some cases like MySQL delete and write only, use of code pointers that continue to refer to unoptimized C0 code results in a non-trivial performance gap (18 and 13 percentage points, respectively).

However, on average OCOLOS is close to the BOLT oracle's performance, with a slowdown of just 4.6 points. Compared to offline BOLT with an average-case profile, OCOLOS is 8.9 points faster on average. This shows that OCOLOS's efficient design enables dynamic code optimizations with almost the same performance gains as PGO, while providing additional benefits such as guaranteeing the accuracy of profiling information and easy mapping to the target binary, and a simple deployment model that avoids the need for profiles to be stored and queried.

MongoDB scan95 insert5 is an odd case where conventional static compilation outperforms all of the profile-guided techniques (e.g., OCOLOS is 14% slower than original). To understand this behavior better, we applied Intel's TopDown [109] performance measurement methodology, which can identify the root microarchitectural cause of low IPC. TopDown classifies pipeline slots in each cycle into one of four top-level cases: Retiring (useful work), Front-End Bound (L1i, iTLB, and decoder bottlenecks), Back-End Bound (L1d or functional unit bottlenecks) or Bad Speculation (branch or memory aliasing mispredictions). With scan95 insert5, in all of the BOLT-based configurations (OCOLOS, BOLT oracle and BOLT average-case) the workload shifts from being front-end bound to back-end bound, with many memory accesses in particular stalled waiting for DRAM, suggesting that poor memory controller scheduling may be the root cause of the slowdown. The PGO version of scan95 insert5 has very similar TopDown metrics to original, so the cause of its slowdown is unclear.

Table I shows characterization data for our benchmarks, such as code size metrics, the average number (across inputs) of functions that are reordered by BOLT, on the call stack when code replacement occurs, and direct call sites that are patched. We also report memory consumption in terms of maximum resident set size, which is the peak amount of physical memory allocated to a process, when running the original binary, BOLT, and OCOLOS on MySQL oltp read only, mongodb read update, Memcached set10 get90, and Verilator dhrystone. OCOLOS requires a modest amount of extra memory, 208 MiB for mongodb and much less for other benchmarks. OCOLOS's memory consumption is affected primarily by binary size, and does not scale up with larger or longer-running inputs. Note also that OCOLOS's memory consumption is not an ongoing cost, but is incurred during code replacement and can be deallocated afterwards.

OCOLOS's storage requirements are under 200 MiB for each benchmark, chiefly for profiling data and the optimized binary, which does not produce a significant amount of disk I/O. Note that these files are also transient: after the optimized binary is produced they can be deleted.

                          MySQL     Mongo      Mem$    Verilator
functions                 33,170    69,807     374     406
v-tables                  3,812     6,165      0       10
.text section (MiB)       24.6      50.0       0.142   2.3
avg funcs reordered       963.6     2,364.2    74.2    83.2
avg funcs on stack        79        100.6      10      5
avg call sites changed    31,677.2  309,297.8  496.6   251.2
max RSS (MiB): original   397.4     1,434.4    67.8    263.4
max RSS (MiB): BOLT       398.0     1,432.8    67.9    263.7
max RSS (MiB): OCOLOS     438.5     1,640.5    69.8    265.4

TABLE I: Benchmark characterization data

C. MySQL Case Study

Next we present an in-depth case study of MySQL, using it to illustrate different aspects of OCOLOS's performance. We focus on MySQL because it is a complex workload and it has the widest variety of inputs among our benchmarks.

To see a small, concrete example of how OCOLOS can improve performance, we used the perf report and perf annotate utilities to examine the distribution of L1i misses in an execution of MySQL oltp read only. Under both BOLT with average-case input and Clang PGO, the MYSQLparse function is a common source of L1i misses; with BOLT average-case it actually has the most L1i misses of any function. MYSQLparse is auto-generated by Bison as the main parsing function for SQL queries, with over 176 KiB of binary code. perf reports frequent L1i misses in basic blocks dealing with backtracking and checking for additional tokens. It makes sense that the average-case input has a difficult time, as it is unable to specialize the parser code for the current query mix, and oltp read only has only select queries. It is less clear why PGO performs poorly since it has the oracle profile, but it is likely due to problems mapping low-level PCs back to source code and LLVM IR. With both OCOLOS and BOLT oracle, MYSQLparse does not even appear on perf's radar, as no L1i misses are sampled within it.

1) Profiling Duration: The amount of time that OCOLOS spends gathering profile information is configurable. While we use a default of 60 seconds for our current experiments, OCOLOS can still perform well with significantly less profiling information. Figure 6 shows the speedup over the original binary when varying the duration of profiling. The green squares show OCOLOS, and the blue triangles show BOLT, to represent how well offline BOLT can optimize when given the same profiling information as OCOLOS. BOLT again provides a ceiling on OCOLOS's expected performance. Figure 6 illustrates that profiling for at least 1 second offers a good absolute speedup over the original binary and also achieves most of the benefits that offline BOLT does. Below 100 milliseconds, profile quality suffers significantly for both OCOLOS and BOLT.

[Figure 6: speedup over original (1.0-1.4) versus profile duration (10⁻⁵ to 10² seconds, log scale) for OCOLOS and BOLT.]
Fig. 6: The impact of profile duration on speedup for MySQL read only

2) Code Replacement Costs: To better understand the performance impact of OCOLOS's code replacement mechanism, we performed an experiment with MySQL read only, with Sysbench reporting the client's transaction throughput every second. Figure 7 shows the results. The first 20 seconds (region 1 of the graph) are a warm-up period, showing the performance of the original binary at around 4,200 transactions per second (tps). After this, perf profiling begins collecting LBR samples (region 2), reducing throughput to about 3,600 tps. In region 3, perf2bolt runs 4 background threads to translate the LBR samples into a format that BOLT can use, and then single-threaded BOLT generates the optimized binary. BOLT, in particular, is quite CPU-intensive, causing a reduction in throughput just after the 100-second mark. In region 4, OCOLOS performs code replacement, which entails a brief single-threaded stop-the-world phase of 669 milliseconds (see Table II for other benchmarks). After that, in region 5, MySQL's parallel execution resumes with the optimized code in place, lifting the performance to almost 6,000 tps. Sysbench also reports 95th percentile tail latency for each 1-second window of execution. Analyzing a single representative run, the average 95% transaction latency during region 1 is 1.00 milliseconds, degrading to a worst-case of 1.55 ms during regions 3 and 4, and improving to 0.73 ms on average in region 5.

[Figure 7: MySQL throughput (transactions/second, 1,000-7,000) over time (0-150 s), with regions 1-5 marked.]
Fig. 7: Throughput of MySQL read only before, during, and after code replacement. 95% tail latency degrades from 1.00 ms to at most 1.55 ms during code replacement.

Figure 7 shows that the performance impact of OCOLOS is modest, even during code replacement. As we discussed in Section VI-C1, OCOLOS can still perform well with as little as 1 second of profiling. Although we are already using the Lightning BOLT system [77], which has been optimized for lower execution times, there likely exist further opportunities to reduce region 3 costs by shifting some of BOLT's work into an offline phase. Such optimizations do not matter for BOLT's original offline setting; however, they would be beneficial for OCOLOS. Finally, there is scope to reduce OCOLOS's pause time further by shifting more work to occur inside the target process via the OCOLOS LD_PRELOAD library and parallelizing the code replacement routines that currently execute serially.

3) End-to-End Overheads: Table II shows OCOLOS's overheads for code replacement. The intervals between code replacements are configurable with OCOLOS; longer intervals amortize code replacement costs better but are less sensitive to application phases or input changes.

One way to evaluate OCOLOS's overheads in an end-to-end manner is to consider how long it takes OCOLOS to "recover" the ground lost during code replacement. Considering MySQL read only as an example (Figure 7), the steady-state throughput of the original program is about 4,200 transactions per second, which OCOLOS boosts to 5,850 tps after code replacement is complete. Taking the reduced throughput during code replacement into account, at 30 seconds after code replacement completes OCOLOS has processed as many transactions as if we had run the original binary the entire time. All execution after this point is a net gain for OCOLOS, so running for several minutes before the next code replacement is advisable in practice. With smaller speedups, OCOLOS must run for longer before performing code replacement again. More generally, if OCOLOS hurts performance by a factor of a during code replacement which lasts for s seconds, and then boosts performance by a factor of b after code replacement completes, we should run the optimized code for at least as/b seconds to recover the ground lost during code replacement.
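
As a rough sanity check of this rule against Figure 7's numbers (our arithmetic, using region-level averages rather than the paper's exact accounting): profiling and BOLT run for roughly s ≈ 100 seconds at about 3,600 tps instead of 4,200 tps, giving a ≈ (4,200 − 3,600)/4,200 ≈ 0.14, while the post-replacement boost is b ≈ (5,850 − 4,200)/4,200 ≈ 0.39. Then as/b ≈ 0.14 × 100 / 0.39 ≈ 36 seconds, in the same ballpark as the roughly 30-second recovery reported above.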
the target process via the O COLOS LD PRELOAD library
and parallelizing the code replacement routines that currently
execute serially

60
L1i misses
40
3) End-to-End Overheads: Table II shows O COLOS’s over- 20
heads for code replacement. The intervals between code 0
replacements are configurable with O COLOS; longer intervals
iTLB misses
amortize code replacement costs better but are less sensitive 10.0 original
OCOLOS
to application phases or input changes. 7.5 BOLT,oracle
5.0 PGO
One way to evaluate OCOLOS's overheads in an end-to-end manner is to consider how long it takes OCOLOS to "recover" the ground lost during code replacement. Considering MySQL read only as an example (Figure 7), the steady-state throughput of the original program is about 4,200 transactions per second, which OCOLOS boosts to 5,850 tps after code replacement is complete. Taking the reduced throughput during code replacement into account, at 30 seconds after code replacement completes OCOLOS has processed as many transactions as if we had run the original binary the entire time. All execution after this point is a net gain for OCOLOS, so running for several minutes before the next code replacement is advisable in practice. With smaller speedups, OCOLOS must run for longer before performing code replacement again. More generally, if OCOLOS hurts performance by a factor of a during code replacement which lasts for s seconds, and then boosts performance by a factor of b after code replacement completes, we should run the optimized code for at least as/b seconds to recover the ground lost during code replacement.
Fig. 8: Microarchitectural events (per 1,000 instructions) for MySQL inputs (panels from top to bottom: L1i misses, iTLB misses, taken branches, branch mispredictions; series: original, OCOLOS, BOLT oracle, BOLT average-case, PGO)

                         MySQL    Mongo    Mem$     Verilator
perf2bolt time (sec)     28.186   26.624   12.918   4.181
llvm-bolt time (sec)      8.237   17.882    0.1404  1.935
replacement time (sec)    0.669    1.221    0.020   0.146

TABLE II: Fixed costs of code replacement

4) Microarchitectural Impacts: Next we investigate the microarchitectural causes of OCOLOS's performance benefits. Figure 8 shows a variety of front-end performance counter measurements, each represented as events per 1,000 instructions. The MySQL inputs along the x-axis are sorted from highest (left) to lowest (right) speedup with OCOLOS to match the order in Figure 5. Moving from top to bottom in Figure 8, we see that OCOLOS is able to achieve significant reductions in L1i and iTLB MPKI. All MySQL inputs also show large reductions in the number of taken branches; fewer taken branches means less pressure on branch prediction resources, which may reduce mispredicted branches as well. Across all of these front-end metrics, OCOLOS achieves results very similar to offline BOLT.

Somewhat surprisingly, the front-end metrics in Figure 8 often do not correlate particularly well with the speedup that OCOLOS provides. To overcome this, we again turned to Intel's TopDown [109] methodology. Using TopDown's Front-End Latency and Retiring percentages, a simple linear regression can accurately determine workloads that will and won't benefit from OCOLOS (Figure 9). Moreover, with OCOLOS's online approach, even if identifying performance losses a priori proves challenging, we can always revert to C0 code to at least recover the original performance.
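The paper does not report the regression's fitted coefficients, so the sketch below only illustrates the shape of such a predictor over the two TopDown features; the weights and threshold are placeholders, not fitted values.

/* Illustrative linear predictor over TopDown metrics; the weights and
 * bias are placeholders, not the paper's fitted values. */
typedef struct {
    double w_frontend; /* weight on Front-End Latency % */
    double w_retire;   /* weight on Retiring % */
    double bias;
} speedup_model;

static int predicts_speedup(const speedup_model *m,
                            double frontend_latency_pct,
                            double retire_pct) {
    double score = m->w_frontend * frontend_latency_pct
                 + m->w_retire   * retire_pct
                 + m->bias;
    return score > 0.0; /* nonzero: expect OCOLOS to help */
}

Intuitively, workloads with a high Front-End Latency percentage and a low Retiring percentage have front-end headroom that code layout optimization can reclaim.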
Fig. 9: TopDown's [109] Front-end Latency and Retire percentages allow us to accurately classify which workloads will benefit from OCOLOS and which won't

Fig. 10: The running time of a Clang build with the original compiler, and compilers optimized by BOLT and BAM

D. Batch Accelerator Mode

In this section, we examine the impact of BAM on a from-scratch build of the Clang compiler. In a large software build, BAM profiles the initial compiler executions to generate an optimized compiler binary that is tuned to the source program being compiled. A full Clang build contains 2,624 compiler executions in all. The dashed red line near the top of Figure 10 illustrates the running time of the original Clang build, executing parallel jobs via make -j. For the dashed orange line at the bottom, we aggregate profiling information from the entire build and feed it to BOLT, and then measure a fresh build using the resulting BOLTed binary. This represents a lower bound on the running time that BAM can achieve.

The green triangles (with a polynomial curve fit to them) show the performance of BOLT when we profile only a limited number of compiler executions (given on the x-axis), generating an optimized binary while measuring the time of a fresh build using this binary. The cost of collecting profiles and running BOLT is excluded; the optimized binary is available at the start of the build. These results show how well an ideal BAM implementation can perform if it did not suffer from any profiling and optimization overheads, revealing the marginal utility of extra profiling data.

Finally, the blue squares (and polynomial curve) show the performance achieved by BAM. We first observe that, even when profiling just one compiler execution, BAM provides a speedup of 1.09× over the original build. At first, profiling additional compiler executions leads to a speedup of up to 1.14×, as this profiling data is "worth the wait". However, after about 5 executions BAM suffers diminishing returns from additional profiling for two reasons. First, the value of that profiling data is relatively low, as shown by the decreasing slope of the green line. Second, as BAM waits for more profile data, it starts the optimization process later, losing out on opportunities to use the optimized binary. This opportunity cost increases over time, causing the BAM running time to eventually surpass that of the original build.
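One illustrative way to formalize this opportunity cost (our own model, not one given in the paper): for a build of E compiler executions with average unoptimized duration d, profiling the first n executions before optimizing yields a total build time of roughly

  T(n) \;\approx\; n\,d \;+\; C(n) \;+\; (E - n)\,\frac{d}{s(n)},

where C(n) is the time to run BOLT on n profiles and s(n) is the resulting speedup. Because s(n) plateaus after a few executions (here, about 5) while the n d term keeps growing, T(n) eventually climbs back above the original build time E d.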
Ultimately, our BAM investigation demonstrates that the amount of profiling data needed to run PGO effectively is quite low, mirroring our results from Section VI-C1. BAM is able to leverage this property to accelerate the Clang build, without any changes to Clang or the build infrastructure.

VII. RELATED WORK

The performance implications of front-end stalls have inspired computer architecture and compiler researchers to propose numerous techniques for improving instruction locality. We divide this work into three categories and qualitatively compare these techniques against OCOLOS to describe how OCOLOS addresses their shortcomings.
Instruction prefetching mechanisms. Computer architects primarily aim to solve the front-end stall problem via instruction prefetching [4, 24, 25, 28, 31, 33, 42, 46, 47, 57, 58, 59, 68, 70, 87, 88, 92, 93, 94, 96, 97]. A plethora of such techniques, ranging from simpler next-line [97] and discontinuity [42, 81, 100, 101] prefetchers to sophisticated temporal [24, 25, 46, 47] (or record-and-replay [6]) prefetching, aim to strike a balance between performance and high metadata storage overhead. Branch predictor-guided prefetchers [87, 88] are extremely effective [40, 41, 58, 59] and consequently have been adopted in many recent processors [32, 78, 95, 102]. Nevertheless, these state-of-the-art prefetchers fall short when applications contain a large number of taken branch instructions that exhaust the capacity of the branch predictor and BTB [25, 48, 58, 99]. OCOLOS can convert taken branches into not-taken ones, easing pressure on the branch predictor (Figure 8) and improving overall performance.

Profile-guided code layout optimizations. Compiler techniques to address the front-end stall problem mainly focus on improving instruction locality via code layout optimizations [10, 29, 34, 36, 61, 63, 65, 76, 77, 79, 86, 113]. These techniques perform basic-block reordering [72, 80], function reordering [75], and hot/warm/cold code splitting [12] (also known as function splitting [76]) using profiles collected from previous executions [27, 38]. While these techniques are extremely effective at improving instruction locality [6] and therefore widely adopted in today's data centers [10, 76, 77], profile quality limits their ability to achieve close-to-optimal performance, as we show in Section III-A. To address this limitation, OCOLOS always uses the best-quality profile from the current execution. Some managed language runtimes, like the HotSpot JVM [39], also perform PGO at run time, profiling the application running on the VM. While OCOLOS targets unmanaged languages instead, OCOLOS could complement a system like HotSpot by performing PGO on the running HotSpot binary itself.

Other systems [17, 26, 60] have also proposed run-time code optimization for unmanaged languages. ClangJIT [26] can perform C++ template specialization at run time, improving
Documemt Updated