OCOLOS: Online COde Layout OptimizationS

Yuxuan Zhang*, Tanvir Ahmed Khan†, Gilles Pokam‡, Baris Kasikci†, Heiner Litz§, Joseph Devietti*
*University of Pennsylvania  †University of Michigan  ‡Intel Corporation  §University of California, Santa Cruz
*{zyuxuan, devietti}@seas.upenn.edu  †{takh, barisk}@umich.edu  ‡[email protected]  §[email protected]

Abstract—The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application's code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.

In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be re-launched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41x, the Verilator hardware simulator by up to 2.20x, and a build of the Clang compiler by up to 1.14x.

Fig. 1: AMD & Intel per-core L1i capacity over time

I. INTRODUCTION

As the world demands ever more from software, code sizes have increased to keep up. Google, for example, reports annual growth of 30% in the instruction footprint of important internal workloads [6, 45]. This code growth has created bottlenecks in the front-end of the processor pipeline [50], as the sizes of front-end hardware resources close to the processor have been relatively static over time [52]. Figure 1 shows that, despite Moore's Law, the per-core L1 instruction cache (L1i) capacity of server microarchitectures from Intel and AMD has remained effectively constant (literally so in Intel's case) over the past 15 years, because the L1i is so latency-critical [23].

As we attempt to cram ever more code into a fixed-size L1i, strain on the processor front-end is inevitable. The inability to deliver instructions to the processor leads to front-end stalls that tank IPC and end-to-end application performance, as even advanced techniques like out-of-order processing cannot hide these stalls [58].

To address front-end stalls, large software companies have turned to Profile-Guided Optimizations¹ (PGO) from the compiler community that reorganize code within a binary to optimize the utilization of the limited L1i for the common-case control-flow paths (we describe these compiler optimizations in detail in Section II). Google's AutoFDO [10] and Propeller [29], Meta's BOLT [76, 77], and gcc's and clang's built-in PGO passes are popular examples of this approach. While these systems have seen successful deployment at scale, there exist three significant challenges.

¹Many optimizations can be driven by profiling information, so the term "profile-guided optimization" is quite broad. In this paper, we use it to refer exclusively to profile-driven code layout optimizations.

First, the results of PGO are only as good as the profiling information that drives PGO's optimization decisions [53, 108, 110]. PGO requires relevant, fresh profiling information to produce high-performance code layouts [72]. However, PGO is an offline optimization, applied either during compilation (AutoFDO, gcc, clang) or to a compiled binary (BOLT and Propeller), creating a fundamental lag between when profiling information is collected and when it is used to optimize the code layout [10]. If program inputs shift during this time, previous profiling information is rendered irrelevant or even harmful when it contradicts newer common-case behavior [69, 105]. Maintaining profiles for each input or program phase is prohibitive in terms of storage costs, so profiles are merged together to capture average-case behavior at the cost of input-specific optimization opportunities [104, 111].

Second, even if we have secured timely profiling information, if the program code itself changes then it is difficult to map the profiling information onto the new code [10]. By its nature, profiling information is captured at the machine code level, and even modest changes to the source code can lead to significant differences in machine code [36]. To make things worse, in large software organizations, code changes can arrive every few minutes for important applications [3]. Since we always need to deploy the latest version of an application, there is a constant challenge when applying PGO with profiling data collected from version k to the compilation of the latest version k'. Profiling data that cannot be mapped to k' is discarded, causing us to miss optimization opportunities [22].

The third key challenge with offline PGO approaches is that recording, storing, and accessing PGO profiles adds an operational burden to code deployment. In particular, for end-user mobile applications, managing profiles can itself be a non-trivial task [69, 90].

In this paper, we propose OCOLOS, a novel system for online profile-guided optimizations in unmanaged languages. OCOLOS performs code layout optimizations at run time, on a running process. By moving PGO from compile time to run time, we avoid the challenges listed previously. Profile information is always up-to-date with the current behavior the program is exhibiting. OCOLOS also supports continuous optimizations to keep up with input changes over time. In OCOLOS, profiling data always maps perfectly onto the code being optimized since we profile and optimize the currently-running process. There is no burden of profile management, as the profile is produced and then immediately consumed. Some managed language runtimes (e.g., Oracle HotSpot JVM [39] and Meta HHVM [73, 74]) support online code layout optimizations and achieve similar benefits. We are not aware, however, of any system before OCOLOS that brings the benefits of online PGO to unmanaged code written in languages like C/C++.

To realize the benefits of PGO in the online setting, OCOLOS builds on the BOLT [76, 77] offline PGO system, which takes a profile and a compiled binary as inputs and produces a new, optimized binary as the output. OCOLOS instead captures profiles during execution of a deployed, running application, uses BOLT to produce an optimized binary, extracts the code from that BOLTed binary, and patches the code in the running process. To avoid corrupting the process, code patching requires careful handling of the myriad code pointers in registers and throughout memory. OCOLOS takes a pragmatic approach that requires no changes to application code, which enables support for complex software like relational databases.

OCOLOS is different from other Dynamic Binary Instrumentation (DBI) frameworks like Intel Pin [64], DynamoRIO [8, 35], and Valgrind [71] in that OCOLOS 1) focuses on code replacement, instead of providing APIs for instrumentation, and 2) has a "1-time" cost model where major work is done only during code replacement and the program runs with native performance once the replacement is complete. Existing DBI frameworks would be unsuitable for our online PGO use-case because programs running under, say, Pin experience a non-trivial ongoing overhead to intercept control-flow transfers and maintain the code cache. The performance benefits of the improved code layout would, in practice, often be outweighed by these ongoing overheads. Instead, OCOLOS exacts a 1-time cost for code replacement which is readily amortized, along with a small amount of run-time instrumentation on function pointer creation (see Section IV-C).

For some short-running programs, even the 1-time cost of OCOLOS is too high to be effectively amortized at run time. To address this problem, we propose BATCH ACCELERATOR MODE (BAM), a technique that allows batch workloads consisting of a large collection of short-running processes, like large software builds, to benefit from OCOLOS. BAM works by profiling initial executions of a binary, generating an optimized binary with BOLT, and using that optimized binary in subsequent executions. BAM operates transparently to the workload, via LD_PRELOAD injection, allowing BAM to accelerate builds of the Clang compiler without any changes to Clang code or build scripts.

While in this paper we focus on using OCOLOS to enable online PGO, we envision OCOLOS also being applicable in a range of other cases such as performance optimization and security. We will open-source OCOLOS to facilitate this exploration.

In summary, this paper makes the following contributions:
• We describe the design of OCOLOS, the first online profile-guided code layout optimization tool for unmanaged code.
• We show how to perform online code replacement efficiently in unmodified, large-scale C/C++ programs.
• We evaluate OCOLOS on a series of big-code applications like MySQL and MongoDB, demonstrating speedups of up to 1.41x on MySQL.
• We evaluate a variant of OCOLOS targeted at batch workloads with many short-running processes, demonstrating a 1.14x speedup on a from-scratch Clang build.

II. BACKGROUND

In this section, we provide the necessary background on PGO passes implemented in tools like AutoFDO [10] and BOLT [76, 77], which are now the state of the art at all major cloud companies including Google and Meta.

A. Hardware Performance Profiling

Profile collection is the first step of all PGO workflows. There are two different methods of profile collection: 1) through compiler instrumentation of branch instructions (e.g., Clang and GCC), and 2) through hardware support (e.g., Intel's Last Branch Record [56] and Processor Trace [55]). Due to the high overheads of compiler instrumentation [10], cloud providers generally leverage hardware profiling support [10, 13, 18, 67, 76, 77]. For instance, Intel's LBR [56] facility, which dates back to the Netburst architecture (Pentium 4), is widely available at this point. When LBR tracing is enabled, the processor records the Program Counter (PC) and target of taken branches in a ring buffer². The recording overhead via LBR is negligible [43, 67] and software can then sample this ring buffer to learn the branching behavior of an application. The Linux perf utility provides simple access to LBR sampling, including the ability to start and stop LBR recording of arbitrary running processes.

²The LBR buffer in Skylake and newer cores has 32 entries.

Each LBR sample represents a snippet of the program's control flow. By aggregating these snippets, the approximate frequency of branch taken/not-taken paths through the code can be reconstructed. With these branch frequencies in hand, we can make intelligent decisions about optimizing the code layout as described below.
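To make the sampling mechanism concrete, the sketch below shows how software can request LBR-backed branch samples from the Linux kernel via the perf_event_open system call, the same facility the perf utility builds on. This is our own illustrative sketch, not OCOLOS's code; error handling and parsing of the mmap'd sample ring buffer are omitted, and a real profiler would open one event per CPU.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>

    // Open a cycles-based sampling event on an existing process whose
    // samples carry the LBR contents (PERF_SAMPLE_BRANCH_STACK). Each
    // sample delivers an array of perf_branch_entry records, i.e., the
    // from/to PCs of recently taken branches.
    int open_lbr_sampling(pid_t target_pid) {
        struct perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100003;   // sample roughly every 100k cycles
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;  // all taken branches
        attr.exclude_kernel = 1;       // user-level control flow only
        attr.disabled = 1;             // start/stop explicitly via ioctl()
        // -1 = any CPU; 0 = no flags.
        return syscall(SYS_perf_event_open, &attr, target_pid, -1, -1, 0);
    }

The returned file descriptor is then mmap'd, and the branch records in each sample are aggregated into the taken/not-taken frequencies described above.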
B. Basic Block Reordering

Basic block reordering is typically the most impactful PGO code transformation [76]. Whenever programs contain if statements, the compiler must decide how to place the resulting basic blocks into a linear order in memory [80]. Without profiling information, the compiler must use some heuristics to decide on a layout based on static code properties [7, 9, 15, 44, 66, 85, 107, 112], often leading to sub-optimal results [69].

The ideal layout places the common-case blocks consecutively, maximizing L1i and instruction Translation Lookaside Buffer (iTLB) locality while reducing pressure on the branch prediction mechanism [72]. In particular, by linearizing the code, the number of taken branches is minimized, reducing the pressure on the Branch Target Buffer (BTB), which only stores information about taken branches [40, 41, 48, 98]. Consider the example program in Figure 2. Assuming both conditions are typically true, the shaded basic blocks constitute the common-case execution. A naive layout which places the blocks from each if statement together results in two taken branches (shown by arrows). The optimal layout, however, avoids any taken branches, and results in better performance.

    if (cond1) {  // A          naive     optimal
        ...       // B          layout    layout
    } else {                    ------    -------
        ...       // C            A         A
    }                             B         B
    if (cond2) {  // D            C         D
        ...       // E            D         E
    } else {                      E         G
        ...       // F            F         C
    }                             G         F
    ...           // G

Fig. 2: Example program which benefits from PGO

C. Function Reordering

Function reordering optimizes the linear order of functions within a binary to take advantage of caller-callee relationships. This optimization first uses profiling information to construct a call graph and annotates edges with the frequency of calls. The classic Pettis-Hansen (PH) algorithm [80] puts functions next to each other if they call or are called by each other frequently. While the PH algorithm uses a greedy approach to place the most frequently-invoked functions adjacent to each other, it makes no distinction between callers and callees.

The C3 [75] algorithm improves upon Pettis-Hansen by placing callers before callees, which is especially helpful in asymmetric calling relationships where A calls B frequently but B never calls A. This allows C3 to move the target of a function call closer to the call instruction itself, improving L1i and iTLB locality beyond what PH can provide.
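The sketch below conveys the flavor of this greedy, profile-driven clustering. It is a simplified illustration under our own assumptions (names and data structures are ours), not the exact algorithm in BOLT or the original PH paper, which additionally merge and reorder whole chains of functions.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct CallEdge { int caller, callee; uint64_t count; };

    // Greedy PH/C3-flavored layout: visit call-graph edges hottest-first
    // and place the callee immediately after its caller when both are
    // still unplaced, so hot caller/callee pairs become adjacent in memory.
    std::vector<int> greedyLayout(int numFuncs, std::vector<CallEdge> edges) {
        std::sort(edges.begin(), edges.end(),
                  [](const CallEdge& a, const CallEdge& b) { return a.count > b.count; });
        std::vector<bool> placed(numFuncs, false);
        std::vector<int> layout;
        for (const CallEdge& e : edges) {     // hottest edges first
            if (!placed[e.caller]) { layout.push_back(e.caller); placed[e.caller] = true; }
            if (!placed[e.callee]) { layout.push_back(e.callee); placed[e.callee] = true; }
        }
        for (int f = 0; f < numFuncs; f++)    // cold functions go last
            if (!placed[f]) layout.push_back(f);
        return layout;
    }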
Sometimes PGO passes will incorporate additional optimizations, such as function inlining or peephole optimizations. However, nearly all of the performance benefit of PGO passes comes from basic block reordering and function reordering [76].

D. BOLT: Binary Optimization & Layout Tool

BOLT [76, 77] is a post-link optimization tool built in the LLVM framework, which operates on compiled binaries. The BOLT workflow begins with gathering profiling information. Though BOLT can use a variety of profile formats, LBR samples are preferred. Armed with the profile and the original, non-BOLTed binary, BOLT decompiles the machine code into LLVM's low-level Machine Intermediate Representation (MIR) format, not to be confused with the more commonplace LLVM IR. BOLT performs a series of optimizations, including basic block reordering and function reordering, at the MIR level before performing code generation to emit a new, BOLTed binary.

The layout of a BOLTed binary is unconventional in a few ways. First, cold functions are left in place in the original .text section of the binary, which is renamed to the bolt.org.text section. These cold functions are subject to small peephole optimizations, but their starting addresses do not change and their basic blocks are not reordered, because there was insufficient profiling information to justify stronger optimizations. The hot functions, however, are heavily optimized by BOLT (via basic block and function reordering) and are moved to a new .text section at a higher address range. Additionally, BOLT may perform hot-cold code splitting, where the cold basic blocks of a hot function f are not stored contiguously with the hot blocks for f, but are instead exiled to another region of the binary with other cold blocks from other hot functions. Functions that are entirely cold are not worth splitting in this way, and have their code stored contiguously in the bolt.org.text section.

III. CHALLENGES

A well-known and intuitive challenge with offline profiling-based optimizations like conventional PGO is ensuring that the gathered profile data is of high quality [11, 53, 54, 106]. Profiling with one program input and then running on a different input can lead to many sub-optimal optimization decisions [2, 51]. We validate this effect experimentally in Section III-A.

OCOLOS offers a solution to offline PGO's input sensitivity: since OCOLOS profiles and optimizes a running process, the profile data is always for the current binary and the current inputs. However, performing code replacement at run time introduces other challenges. Chief among them is that changing code can break any explicit or implicit code pointers (pointers to other instructions, not data) that referenced the changed code. In Section III-B, we catalog the myriad sources of code pointers in a running process, to better motivate the design of OCOLOS in Section IV, which can update or preserve these code pointers as necessary.

A. Input Sensitivity

Figure 3 quantifies the sensitivity of BOLT's performance to the quality of its profile data. The bars show along the x-axis the throughput of a BOLTed MySQL binary running the read_only input from Sysbench [1]. The y-axis lists the Sysbench input used to provide profile data to BOLT. Thus, the bottom bar shows the performance when profiling the insert input, BOLTing the binary, and then running with the read_only input. For reference, the dashed line shows the performance of the original MySQL binary without BOLT optimizations. While BOLT improves performance regardless of the training input used, the worst profile (insert) is 21% slower than the best profile (read_only). Aggregating all profiling inputs (the bar labeled all) experiences some destructive interference between profiles and is about 8% slower than the best profile. Because OCOLOS (shown with the solid line) runs online instead of ahead of time, it always profiles the current input, and achieves high performance comparable to the best profile.

Fig. 3: Performance achieved when running MySQL with the Sysbench read_only input, using BOLT to produce a binary from the given profiling input or, with the all bar, from profiles of all inputs combined.

B. Challenges of Changing Code Pointers

OCOLOS requires modification of code pointers at run time to perform its optimizations. To better understand the challenges of changing these code pointers, we first discuss the different flavors of code pointers that arise in a running process. The conventional compilation flow of offline PGO deals only with static code, so many of the challenges we discuss below are unique to OCOLOS's run-time approach.

First, we distinguish between code pointers that refer to the starting address of a function versus those that reference a specific instruction within a function (e.g., the target of a conditional branch). We discuss function starting addresses first. Functions can call each other via direct calls, encoding the callee function's starting address as a PC-relative offset. There may also be indirect calls via function pointers stored in a v-table³, or programmer-created function pointers stored on the stack, heap, or in global variables. As with other pointers in C/C++, a function pointer might be cast to an integer, obfuscated via arithmetic, then restored to the original value, cast back to a pointer again, and dereferenced.

³A virtual function/method table (v-table) is used to implement dynamic dispatch of virtual functions in object-oriented languages. The table itself stores function pointers to the methods of a class.
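The snippet below (our own illustration, not taken from any of our workloads) shows why such pointers defeat any attempt at exhaustive tracking: perfectly legal C++ can hide a function's address in plain integer data for an arbitrarily long time.

    #include <cstdint>

    void hot_function();

    // A function pointer round-trips through an integer with arbitrary
    // arithmetic applied. Between encode() and the call in decode_and_call(),
    // no scan of registers or memory can recognize the stashed value as a
    // code pointer at all.
    uintptr_t encode() {
        return reinterpret_cast<uintptr_t>(&hot_function) ^ 0xdeadbeef;
    }

    void decode_and_call(uintptr_t stashed) {
        auto fp = reinterpret_cast<void (*)()>(stashed ^ 0xdeadbeef);
        fp();  // must still land on valid code when finally dereferenced
    }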
Code pointers that do not refer to the start of a function are also commonplace. Jump and conditional branch instructions within a function reference code locations via PC-relative offsets. Sometimes indirect jumps rely on compile-time constants that are used to compute a code pointer at run time, e.g., in the implementation of some switch statements. Return addresses on the stack are code pointers to functions that are on the call stack. The value of the PC in each thread (the rip register in x86) is a pointer to an instruction in the currently-running function in each thread. A thread may be blocked doing a system call, in which case its PC is effectively stored in the saved context held by the operating system. libc's setjmp/longjmp API can be used to create programmer-managed code pointers to essentially arbitrary code locations.

As is clear from the discussion above, the address space of a typical process contains a large number of code pointers. Keeping track of all of them so that they can be updated if a piece of code moves is essentially impossible for any serious program. Sometimes a code pointer of interest resides in kernel space where it is inaccessible to user code. In a managed language, the runtime can indeed track all code pointers and update or invalidate them as needed. However, OCOLOS targets complex unmanaged code, so we have to be able to live with code pointers that are outside our control. Even small code changes can silently break such code pointers.

In the next section, we discuss how OCOLOS overcomes these challenges by retaining the original code within a process, adding optimized code at a new location, and patching up as many code pointers as possible to steer execution towards the optimized code in the common case.

IV. OCOLOS

In Figure 4a, we show a high-level overview of the steps OCOLOS performs to optimize the code of a target process at run time. First, we gather profiling information from the target process (step 1), then build the BOLTed binary (step 2), pause the target process (step 3), inject code (step 4), update pointers to refer to the injected code (step 5), and finally resume the process (step 6). In this section, we assume the presence of the BOLTed binary and focus on the key components of OCOLOS's online code replacement mechanism: steps 3-5. Later, in Section V, we discuss steps 1 and 2, which are conceptually simpler as they leverage existing tools like Linux's perf utility for performance profiling and BOLT. Note that steps 1 and 2, which consume the most time, are done concurrently in the background while the target process continues to run. Though operations like running BOLT are CPU-intensive, they compete for cycles with the target process for only a limited time. Steps 3-5 are done synchronously while the target process is paused.

Figure 4a: Main steps OCOLOS takes to optimize a target process. Figure 4b: Starting state of the address space (left) and state after code replacement (right). Figure 4c: Before (left) and after (right) continuous optimization.

To better describe key operations within OCOLOS, we first describe the important regions of the address space of the target process, shown in the left part of Figure 4b. The code from the original binary we refer to as C0, which consists of 3 functions: a0, b0 and c0. A v-table contains a pointer to b0. Finally, each thread's stack is also important as it contains return addresses of currently-executing functions. In Figure 4b, c0 is on the call stack.

OCOLOS takes as input an optimized binary, with modified code for functions in C0 or code for entirely new functions. While OCOLOS's code replacement ultimately requires a short stop-the-world period (Section IV-B) to modify code and update code pointers, OCOLOS performs some bookkeeping in advance. In particular, OCOLOS parses the original binary offline to identify the locations of all direct call instructions. OCOLOS patches these calls at run time, but identifying the call sites in advance significantly shortens the stop-the-world period.

OCOLOS leverages the Linux ptrace API, which allows one process (often a debugger like gdb) to control and inspect another process. OCOLOS uses ptrace to stop the target process and to inspect and adjust its register state.
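A minimal sketch of this pause-and-inspect step via the standard ptrace interface follows; it is illustrative only, and omits error handling and the per-thread iteration (over the tid list in /proc/<pid>/task) that a real tool needs.

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    // Stop the target and read its register state, including the program
    // counter (rip on x86-64). The caller may then adjust registers with
    // PTRACE_SETREGS and resume the target with PTRACE_DETACH.
    bool pause_and_inspect(pid_t pid, user_regs_struct* regs) {
        if (ptrace(PTRACE_ATTACH, pid, nullptr, nullptr) == -1) return false;
        waitpid(pid, nullptr, 0);   // wait until the target has stopped
        return ptrace(PTRACE_GETREGS, pid, nullptr, regs) != -1;
    }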
A. Adding Code

As we describe in Section III-B, finding and updating all code pointers is fraught with corner cases. This leads to the first principle guiding OCOLOS's design:

Design Principle #1: preserve addresses of C0 instructions.

To enable significant performance gains by optimizing both the function-level and basic block layout, while preserving correctness, we design the following technique. Instead of updating the code of a function in place, OCOLOS injects a new version of the code, C1, into the address space while leaving the original code intact (see Figure 4b). OCOLOS then changes a subset of code pointers within C0 to redirect execution to the C1 code. Remaining code pointers are not perturbed and continue to point to C0 code. This approach can also handle thorny cases like setjmp/longjmp where a target instruction (not function) address has been saved on the heap or stack at run time.

B. Updating Code Pointers

When patching code pointers to make the C1 code reachable, OCOLOS follows our second design principle:

Design Principle #2: run C1 code in the common case.

OCOLOS executes code from C0 instead of C1 occasionally to ensure correctness. However, the more frequently OCOLOS executes code from C0, the more it reduces the potential performance gains C1 can provide. Therefore, we seek to make C1 the common case. Other OCOLOS use-cases such as profiling are likely to also be amenable to this trade-off. For instance, if we need to count function invocations then we can instrument only the C1 code, ignoring the rare invocations of the old C0 version of a function. For security or debugging use-cases, however, it may be necessary to redirect all invocations of C0 functions to their C1 counterparts instead, e.g., via trampoline instructions [16] at the start of C0 functions and at call sites within them.

Since our goal for the current version of OCOLOS is minimizing (but not eliminating) time spent in C0, OCOLOS updates as many code pointers to refer to C1 as it is worthwhile to update. Note first of all that hot code gets optimized by BOLT and resides in C1. Direct calls in C1 will already refer to C1 (e.g., c1 calls b1) and do not require updating.

Figure 4b illustrates the changes OCOLOS makes. We update function pointers in v-tables and direct calls in C0 for functions on the call stack (like c0). Recall that these C0 changes preserve instruction addresses, honoring our first design principle. We found that, in practice, updating direct calls in all functions (i.e., including those, like a0, not on the stack) does not improve performance – because functions like a0 are cold – though it does slow code replacement.
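On x86-64, patching a direct call is a small in-place write: the 5-byte near call (opcode 0xE8) encodes its target as a 32-bit displacement relative to the end of the instruction. The sketch below shows the displacement computation under our own simplifying assumptions (the page is already writable, and the new target is within +/-2 GiB of the call site); it is an illustration, not OCOLOS's actual patching code.

    #include <cstdint>
    #include <cstring>

    // Redirect the direct call at call_site (an 0xE8 rel32 instruction)
    // to new_target, e.g., from b0 in C0 to b1 in C1. The displacement
    // is measured from the byte after the 5-byte call instruction.
    void patch_direct_call(uint8_t* call_site, uintptr_t new_target) {
        intptr_t disp = (intptr_t)new_target - ((intptr_t)call_site + 5);
        int32_t rel32 = (int32_t)disp;   // assumes target is in rel32 range
        std::memcpy(call_site + 1, &rel32, sizeof(rel32));
    }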
We could additionally seek out function pointers in registers and memory, though doing so would require expensive always-on run-time instrumentation to track their propagation throughout the program's execution. This tracking would violate OCOLOS's "fixed-costs only" cost model:

Design Principle #3: code replacement can incur fixed costs, but must avoid all possible recurring costs.

Our experiments show that leaving these remaining function pointers (which our workloads do contain) pointing to C0 code is fine, since C0 code does not execute very long before it encounters a direct call or a virtual function call which steers execution back to C1.

C. Continuous Optimization

A natural use-case for OCOLOS is to perform continuous optimization, whereby OCOLOS can replace C1 with C2, and Ci with Ci+1 more generally. These subsequent code versions Ci can be generated by periodic re-profiling of the target process, to account for program phases, daily patterns in workload behavior like working versus at-home hours, and so on. OCOLOS can perform continuous optimization largely through the same code replacement algorithm described above, though functions on the stack and function pointers require delicate handling, as explained below.

The key challenge in continuous optimization is the need to replace code, instead of just adding new code elsewhere in the address space. If we continuously add code versions without removing old versions, the code grows linearly over time, wasting DRAM and hurting front-end performance. To address this challenge, we introduce a garbage collection mechanism for removing dead code. We define dead code as code that can no longer be reached via any code pointers and hence is safe to remove.

Instead of waiting for code version Ci to naturally become unreachable, as in conventional garbage collection, we can proactively update code pointers to enforce the unreachability of Ci. OCOLOS patches v-tables, direct calls from C0, return addresses on the stack, and threads' PCs to refer to the incoming Ci+1 code instead, as described in Section IV-B and illustrated in Figure 4c.

1) Return addresses: Code pointers in return addresses and in threads' PCs may reference Ci, so OCOLOS must update these references to point to Ci+1. To update these references, OCOLOS first crawls the stack of each thread via libunwind to find all return addresses. OCOLOS examines RIP for each thread via ptrace. Collectively, this examination provides OCOLOS with the set of stack-live functions that are currently being executed. If any stack-live function is in Ci (such as bi in Figure 4c), OCOLOS must copy its code to Ci+1. While there may be an optimized version bi+1 in Ci+1, it is challenging to update the return address to refer to bi+1 because, in general, the optimizations applied to produce bi+1 can have a significant impact on the number and order of instructions within a function.

Thus, OCOLOS makes a copy of bi in Ci+1, which we call bi,i+1 to distinguish it from the more-optimized version bi+1. bi,i+1 may need to have a different starting address than bi, so OCOLOS updates PC-relative addressing within bi,i+1 to accommodate its new location. OCOLOS must also update the return address to refer to the appropriate instruction within bi,i+1, but OCOLOS can treat the original return address into bi as an offset from bi's starting address, and then use this offset into bi,i+1 to compute the new return address.

While copying bi to bi,i+1 is a key part of enabling continuous optimization, it does not improve performance of the currently-running call to bi since the code is the same. However, subsequent calls are likely to reach bi+1 instead via other code pointers, like the v-table in Figure 4c.
2) Function pointers: Apart from return addresses, function pointers may also point to Ci. At any time during execution, programs can create function pointers that may exist on the stack, heap, or in registers and point to a function in Ci. Instead of trying to track down and update these pointers while moving from Ci to Ci+1, OCOLOS enforces a simpler invariant: a program cannot create function pointers to Ci code in the first place – rather, function pointers must always refer to C0. This allows function pointers to propagate freely throughout the program without the risk that they will be broken during code replacement.

OCOLOS enforces this invariant via a simple LLVM compiler pass that instruments function pointer creation sites with a callback function: void* wrapFuncPtrCreation(void*). This function takes as its argument the function pointer being created (which may reference Ci code), and returns the value that the program will actually use – a pointer to the corresponding C0 function instead. OCOLOS maintains a map from Ci to C0 addresses to enable this translation. If OCOLOS has not yet replaced any code, or the function pointer being created does not reference Ci (e.g., it references library code), wrapFuncPtrCreation simply acts as the identity function.
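A sketch of what this callback might look like follows. The signature is the one the paper's pass inserts, but the map's name, structure, and (omitted) synchronization are our own assumptions:

    #include <cstdint>
    #include <unordered_map>

    // Maps function start addresses in any optimized version Ci back to
    // the corresponding original C0 address. Populated during code
    // replacement; empty until then. (Illustrative; not OCOLOS's code.)
    static std::unordered_map<uintptr_t, uintptr_t> ci_to_c0;

    extern "C" void* wrapFuncPtrCreation(void* fp) {
        auto it = ci_to_c0.find(reinterpret_cast<uintptr_t>(fp));
        if (it == ci_to_c0.end())
            return fp;  // not Ci code (e.g., a library function): identity
        return reinterpret_cast<void*>(it->second);  // pin the pointer to C0
    }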
Once a function pointer is created, it can freely propagate through registers and memory without any instrumentation – intervention is required only on function pointer creation. This instrumentation has a negligible cost: MySQL running the read_only input creates just 45 function pointers per millisecond on average. While we have not found the need to implement it for our workloads, calls to setjmp could be similarly redirected to C0.

Having avoided function pointers to Ci, OCOLOS is able to update all other references to Ci code to refer to the incoming Ci+1 code instead. Thus, OCOLOS can safely overwrite Ci code.

Due to technical limitations in the current version of BOLT, BOLT assumes the presence of a single .text code section and refuses to run on a BOLTed binary. Unfortunately, this prevents us from evaluating continuous optimization because our profiling data will refer to Ci code, and we need BOLT to run optimizations on Ci to produce Ci+1. We plan to add this feature to BOLT in the future.

D. Limitations

OCOLOS currently does not support jump tables, as they rely on compile-time constants to compute the jump target, and OCOLOS does not yet update these constants during code replacement. Thus, OCOLOS currently requires that a binary be compiled with the -fno-jump-tables flag. The binaries for BOLT and the non-PGO baseline, however, can include jump tables. This jump table restriction is not fundamental to OCOLOS's approach. With a little extra support from BOLT to identify these constants within the optimized binary, OCOLOS can extract and update them as part of code replacement.

OCOLOS requires a pause time during code replacement, during which the target process cannot respond to incoming requests or do other useful work. This may hurt application performance, especially in terms of tail latency. There is scope to reduce the latency of OCOLOS's code replacement: it requires a few MiB of scattered writes throughout the address space, all of which are currently done sequentially. If OCOLOS, say, updated v-tables in parallel with patching direct calls, that should reduce the end-to-end replacement time further.

An additional approach to preserving tail latency during code replacement is to leverage techniques proposed for mitigating the effects of garbage collection pauses in distributed systems [82, 103]. If the system includes a load-balancing tier, as many modern web services do, then the load balancer can be made aware of application pauses (like major garbage collections, or OCOLOS code replacement) and can route traffic to other nodes temporarily. Because code optimizations are explicitly triggered by the operator, pause times are well known and can be scheduled accordingly.

OCOLOS requires that the functionality and ABI of C1 are unchanged with respect to C0, so that a function f0 in C0 has equivalent application semantics to a function f1 in C1. f1 can, however, vary in non-semantic ways, such as having extra instrumentation or different performance.

Global variables cannot change location in OCOLOS, since C0 code often hard-codes a global variable's original location via RIP-relative addressing. C1 code thus needs to reference those same global variables.
V. IMPLEMENTATION

In this section, we discuss some of OCOLOS's implementation issues, including OCOLOS's methodology to profile running processes, steps to run BOLT, and mechanism to transform code. Finally, we describe OCOLOS's BAM mode for accelerating batch workloads such as software builds.

Profiling. As Figure 4a shows, OCOLOS's first step is to profile the target process, to determine whether it suffers from sufficient front-end stalls to merit OCOLOS's optimizations. OCOLOS uses the standard Linux perf utility to record hardware performance counters for this purpose. perf can attach to an already-running process, allowing OCOLOS to be deployed on a new process or an existing one.

OCOLOS adopts a 2-stage approach for profiling. The first stage follows the methodology proposed in DMon [49], which itself is built on Intel's TopDown microarchitectural bottleneck analysis [109]. Note that in many data centers, systems such as GWP [89] already continuously profile all applications in the fleet. We have not integrated this analysis into OCOLOS yet since it is not the primary focus of our work, but we perform measurements to validate the feasibility of this approach in Section VI-C4.

If this first-stage exploration reveals significant time spent in the processor front-end, we continue with the second profiling stage. Here we use perf to record the hot control-flow paths of the target process via Intel's LBR mechanism (Section II-A). We feed this information into the BOLT optimizer, as discussed next.

Running BOLT. We provide a quick summary of how BOLT operates here to keep this paper self-contained. More details can be found in the BOLT papers [76, 77].

First, we use the perf2bolt utility to extract the LBR information recorded by perf into an internal format that BOLT can consume more easily. Armed with the extracted LBR information and the binary corresponding to the target process, BOLT runs a series of optimization passes (most notably basic-block and function reordering, see Section II) to produce a new, optimized binary.
Efficient Code Copying. To provide direct copying of code from the optimized binary into the target process, we launch the target process with an LD_PRELOAD library. LD_PRELOAD is a Linux feature that allows a user-specified shared library to be loaded, alongside a program's required shared libraries, when a process is launched. We use LD_PRELOAD to add some functions for code replacement into the address space of the target process. We then use ptrace to transfer control to our code, which reads in the optimized binary and copies its relevant contents into place. While ptrace can also perform memory copies into the target process, they are prohibitively slow since each copy requires a system call and several context switches. Performing this memory copy from within the target process is much more efficient and helps minimize the stop-the-world time.
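The in-process copy routine contributed by the preloaded library can be as simple as the following sketch (the function name and page-size handling are our assumptions); the key point is that the memcpy runs inside the target, so no per-write system call is needed:

    #include <sys/mman.h>
    #include <cstdint>
    #include <cstring>

    // Runs *inside* the paused target process (reached via a ptrace-steered
    // control transfer). Copies one region of optimized code into place,
    // first making the containing pages writable.
    extern "C" void copy_code_region(void* dst, const void* src, size_t len) {
        uintptr_t page = (uintptr_t)dst & ~(uintptr_t)0xfff;  // 4 KiB pages
        size_t span = ((uintptr_t)dst + len) - page;
        mprotect((void*)page, span, PROT_READ | PROT_WRITE | PROT_EXEC);
        std::memcpy(dst, src, len);                  // one user-level copy
        __builtin___clear_cache((char*)dst, (char*)dst + len);
    }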
A. BAM: Batch Accelerator Mode

For programs with short running times, OCOLOS's fixed optimization costs cannot be effectively amortized. If these programs are executed frequently, as is common in data centers, it may still be worthwhile to optimize them. To address this problem, we have developed an alternative deployment mode for OCOLOS called BATCH ACCELERATOR MODE or BAM.

As the name implies, BAM is focused on batch workloads where the same binary is invoked repeatedly. Early invocations of the binary can be profiled and fed into BOLT, so that subsequent invocations can use the BOLTed binary instead and see improved performance. BAM performs its optimization online as the batch workload runs, so it does not suffer from stale profiles, stale binary mapping issues, or require any profile management – all of which can hinder the use of offline PGO systems like BOLT.

BAM is a Linux shared library that is attached to a command, e.g., with LD_PRELOAD=bam.so make. BAM additionally needs to be told, via a configuration file, the binary to optimize. The BAM library makes use of another LD_PRELOAD feature, which is transparent interception of calls to functions in any shared library. In particular, BAM intercepts libc's exec* calls and, if it finds an invocation of the target binary, adjusts the exec arguments to launch the binary with perf's profiling enabled. BAM also attaches its shared library to child processes to find invocations of the target binary no matter where they occur in the process tree.
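This interception relies on standard symbol interposition: the preloaded library defines its own execve that forwards to the real one obtained via dlsym(RTLD_NEXT, ...). A minimal sketch follows; the two helper functions are hypothetical stand-ins for BAM's configuration lookup and argument rewriting, which we do not know the details of.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <unistd.h>

    // Hypothetical helpers (not BAM's actual API):
    bool is_target_binary(const char* path);  // consult BAM's config file
    char** rewrite_args_for_profiling(const char* path, char* const argv[]);

    extern "C" int execve(const char* path, char* const argv[], char* const envp[]) {
        using execve_fn = int (*)(const char*, char* const[], char* const[]);
        static execve_fn real_execve = (execve_fn)dlsym(RTLD_NEXT, "execve");
        if (is_target_binary(path))  // launch under perf, or swap in the
                                     // BOLTed binary once it exists
            return real_execve(path, rewrite_args_for_profiling(path, argv), envp);
        return real_execve(path, argv, envp);  // everything else: pass through
    }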
Once BAM has collected a (configurable) number of profiles of the target binary's execution, it runs BOLT in a background process to produce the BOLTed binary. Once the BOLTed binary is available, BAM rewrites exec calls to use the BOLTed binary instead of the original binary, leading to an automatic performance boost for the remainder of the batch workload.

Similarly to OCOLOS's single-process mode (Section IV), BAM automatically profiles a workload as it runs, avoiding the challenges of stale profiling data and storing and retrieving profiles at scale. BAM's highly-compatible LD_PRELOAD-based design is also similar in spirit to OCOLOS, in that no application changes are required to use BAM. In the make example above, no changes are required to the Makefiles, application source code, make program, or the compiler toolchain.

One unique feature of BAM compared to OCOLOS is that BAM does not replace the code of a running process; it instead requires a subsequent exec call to allow the optimized binary to run. There is thus no stop-the-world component to BAM, and the overhead of switching from the original binary to the optimized one is essentially zero.

We see BAM being especially useful for accelerating Continuous Integration (CI) builds of large software projects. These CI builds are always done from scratch to ensure the software builds correctly on a fresh system [14]. So long as the software build is long enough for BAM to obtain useful profiling and run BOLT, BAM can transparently accelerate compiler invocations for the latter part of the build. BAM is complementary to build optimization techniques like distributed build caches [19, 20, 30, 37, 83, 84]. While a build cache can avoid some compiler invocations, BAM accelerates those compiler invocations that remain. BAM is also simpler to deploy than a build cache, as BAM does not need any remote web services to be provisioned – BAM is purely local to each build.

VI. EVALUATION

In our evaluation of OCOLOS, we set out to demonstrate that OCOLOS can provide significant performance improvements for programs that suffer from processor front-end bottlenecks. To demonstrate OCOLOS's robustness, we evaluate it across a range of benchmarks, from complex, multithreaded programs such as the MySQL relational database to compute-bound, single-threaded workloads like the Verilator chip simulator and batch workloads like building the Clang compiler.
A. Experimental Setup

We run our experiments on a 2-socket Intel Broadwell Xeon E5-2620v4 server with 8 cores and 16 threads per socket (16 cores and 32 threads total) running at 2.1GHz. Each core has a 64-entry iTLB, a 1536-entry L2 TLB, a 32KiB L1i, a 32KiB L1d, a 256KiB L2 cache, and access to a shared 20MiB L3 cache and 128 GiB of RAM. The server runs Linux version 4.18. We use commit 88c70afe of the Lightning BOLT system [77] from its GitHub repository [21].

For our benchmarks, we use MySQL version 8.0.28, driven by inputs from Sysbench version 1.1.0-ead2689. We use MongoDB version 6.0.0-alpha-655-gea6cea6, driven by inputs from YCSB. We use Memcached version 1.6.12, driven by inputs from memaslap version 1.0. For MongoDB and Memcached the input names show the mix of operations, e.g., read95_insert5 means 95% of operations are reads and the other 5% are inserts. We use Verilator version 3.904, simulating an in-order rv64imafdc RISC-V single-core processor generated from RocketChip [5], with the processor running a set of RISC-V benchmarks [91]. All benchmarks are compiled with their default optimization level: -O3 for MySQL and Verilator, and -O2 for MongoDB and Memcached. We measure Verilator's performance as the throughput of iterations of the main Dhrystone loop or iterations over the input array for median and vvadd. We evaluate BAM on a build of Clang version 14.0.

All performance measurements, unless otherwise noted, show steady-state performance. For OCOLOS, we measure performance after code replacement is complete, except in Figure 7 where we show MySQL's performance before, during, and after code replacement. OCOLOS and BOLT results are based on 60 seconds of profiling unless otherwise noted. Unless otherwise noted, we show averages of 5 runs with error bars indicating the standard deviation.

B. Performance and Characterization

Figure 5 shows the throughput improvement OCOLOS provides across our set of benchmarks. We compare OCOLOS to four baselines. Original is the performance of the original binary, compiled with only static optimizations (nothing profile-guided). BOLT oracle input is the performance offline BOLT provides when profiling and running the same input; PGO oracle input uses the same profiling file as BOLT oracle input but feeds it to clang's builtin PGO pass [62]. Finally, BOLT average-case input is the performance offline BOLT achieves when aggregating profiles from all inputs and then running on the input shown on the y-axis. We show throughput normalized to original.

Figure 5 shows that OCOLOS uniformly improves performance over the original binary, by up to 1.41x on MySQL read_only, 1.29x on MongoDB read_update, 1.05x on Memcached, and 2.20x on Verilator. Clang's PGO generally falls short of BOLT, similar to the results from the BOLT paper [76] though our benchmarks are different, likely due to the challenges of mapping PCs back to the source code [36]. Aggregating profiling information across inputs is worse than using just the oracle profile of the input being run, as different inputs tend to exhibit contradictory control-flow biases that cancel each other out.

The results for BOLT oracle input represent an upper bound for OCOLOS's performance, since BOLT has access to the oracle profiling data and ensures that all code pointers refer to optimized code, not just a judicious subset of them as with OCOLOS (Section IV-B). In some cases like MySQL delete and write_only, use of code pointers that continue to refer to unoptimized C0 code results in a non-trivial performance gap (18 and 13 percentage points, respectively).
Fig. 5: Performance of OCOLOS (light blue bars) compared to BOLT using an oracle profile of the input being run (dark blue bars), Clang PGO using the same oracle profile (purple bars) and BOLT using an average-case profiling input aggregated from all inputs (pink bars). All bars are normalized to original non-PGO binaries (white bars).

However, on average OCOLOS is close to the BOLT oracle's performance, with a slowdown of just 4.6 points. Compared to offline BOLT with an average-case profile, OCOLOS is 8.9 points faster on average. This shows that OCOLOS's efficient design enables dynamic code optimizations with almost the same performance gains as PGO while providing additional benefits such as guaranteeing the accuracy of profiling information and easy mapping to the target binary, and a simple deployment model that avoids the need for profiles to be stored and queried.

MongoDB scan95_insert5 is an odd case where conventional static compilation outperforms all of the profile-guided techniques (e.g., OCOLOS is 14% slower than original). To understand this behavior better, we applied Intel's TopDown [109] performance measurement methodology, which can identify the root microarchitectural cause of low IPC. TopDown classifies pipeline slots in each cycle into one of four top-level cases: Retiring (useful work), Front-End Bound (L1i, iTLB, and decoder bottlenecks), Back-End Bound (L1d or functional unit bottlenecks) or Bad Speculation (branch or memory aliasing mispredictions). With scan95_insert5, in all of the BOLT-based configurations (OCOLOS, BOLT oracle and BOLT average-case) the workload shifts from being front-end bound to back-end bound, with many memory accesses in particular stalled waiting for DRAM, suggesting that poor memory controller scheduling may be the root cause of the slowdown. The PGO version of scan95_insert5 has very similar TopDown metrics to original, so the cause of its slowdown is unclear.

Table I shows characterization data for our benchmarks, such as code size metrics, the average number (across inputs) of functions that are reordered by BOLT, functions on the call stack when code replacement occurs, and direct call sites that are patched. We also report memory consumption in terms of maximum resident set size, which is the peak amount of physical memory allocated to a process, when running the original binary, BOLT, and OCOLOS on MySQL oltp_read_only, MongoDB read_update, Memcached set10_get90, and Verilator dhrystone. OCOLOS requires a modest amount of extra memory, 208 MiB for MongoDB and much less for other benchmarks. OCOLOS's memory consumption is affected primarily by binary size, and does not scale up with larger or longer-running inputs. Note also that OCOLOS's memory consumption is not an ongoing cost, but is incurred during code replacement and can be deallocated afterwards.

OCOLOS's storage requirements are under 200 MiB for each benchmark, chiefly for profiling data and the optimized binary, which does not produce a significant amount of disk I/O. Note that these files are also transient: after the optimized binary is produced they can be deleted.

                            MySQL      Mongo      Mem$     Verilator
    functions               33,170     69,807     374      406
    v-tables                3,812      6,165      0        10
    .text section (MiB)     24.6       50.0       0.142    2.3
    avg funcs reordered     963.6      2,364.2    74.2     83.2
    avg funcs on stack      79         100.6      10       5
    avg call sites changed  31,677.2   309,297.8  496.6    251.2
    max RSS (MiB)
      original              397.4      1434.4     67.8     263.4
      BOLT                  398.0      1432.8     67.9     263.7
      OCOLOS                438.5      1640.5     69.8     265.4

TABLE I: Benchmark characterization data

C. MySQL Case Study

Next we present an in-depth case study of MySQL, using it to illustrate different aspects of OCOLOS's performance. We focus on MySQL because it is a complex workload and it has the widest variety of inputs among our benchmarks.
To see a small, concrete example of how OCOLOS can improve performance, we used the perf report and perf annotate utilities to examine the distribution of L1i misses in an execution of MySQL oltp_read_only. Under both BOLT with average-case input and Clang PGO, the MYSQLparse function is a common source of L1i misses – with BOLT average-case it actually has the most L1i misses of any function. MYSQLparse is auto-generated by Bison as the main parsing function for SQL queries, with over 176 KiB of binary code. perf reports frequent L1i misses in basic blocks dealing with backtracking and checking for additional tokens. It makes sense that the average-case input has a difficult time, as it is unable to specialize the parser code for the current query mix, and oltp_read_only has only select queries. It is less clear why PGO performs poorly since it has the oracle profile, but it is likely due to problems mapping low-level PCs back to source code and LLVM IR. With both OCOLOS and BOLT oracle, MYSQLparse does not even appear on perf's radar, as no L1i misses are sampled within it.

1) Profiling Duration: The amount of time that OCOLOS spends gathering profile information is configurable. While we use a default of 60 seconds for our current experiments, OCOLOS can still perform well with significantly less profiling information. Figure 6 shows the speedup over the original binary when varying the duration of profiling. The green squares show OCOLOS, and the blue triangles show BOLT, to represent how well offline BOLT can optimize when given the same profiling information as OCOLOS. BOLT again provides a ceiling on OCOLOS's expected performance. Figure 6 illustrates that profiling for at least 1 second offers a good absolute speedup over the original binary and also achieves most of the benefits that offline BOLT does. Below 100 milliseconds, profile quality suffers significantly for both OCOLOS and BOLT.

Fig. 6: The impact of profile duration on speedup for MySQL read_only

2) Code Replacement Costs: To better understand the performance impact of OCOLOS's code replacement mechanism, we performed an experiment with MySQL read_only, with Sysbench reporting the client's transaction throughput every second. Figure 7 shows the results. The first 20 seconds (region 1 of the graph) are a warm-up period, showing the performance of the original binary at around 4,200 transactions per second (tps). After this, perf profiling begins collecting LBR samples (region 2), reducing throughput to about 3,600 tps. In region 3, perf2bolt runs 4 background threads to translate the LBR samples into a format that BOLT can use, and then single-threaded BOLT generates the optimized binary. BOLT, in particular, is quite CPU-intensive, causing a reduction in throughput just after the 100-second mark. In region 4, OCOLOS performs code replacement, which entails a brief single-threaded stop-the-world phase of 669 milliseconds (see Table II for other benchmarks). After that, in region 5, MySQL's parallel execution resumes with the optimized code in place, lifting performance to almost 6,000 tps. Sysbench also reports 95th percentile tail latency for each 1-second window of execution. Analyzing a single representative run, the average 95% transaction latency during region 1 is 1.00 milliseconds, degrading to a worst-case of 1.55 ms during regions 3 and 4, and improving to 0.73 ms on average in region 5.

Fig. 7: Throughput of MySQL read_only before, during, and after code replacement. 95% tail latency degrades from 1.00ms to at most 1.55ms during code replacement.
60 L1i misses 40 3) End-to-End Overheads: Table II shows O COLOS’s over- 20heads for code replacement. The intervals between code 0replacements are configurable with O COLOS; longer intervals iTLB missesamortize code replacement costs better but are less sensitive 10.0 original OCOLOSto application phases or input changes. 7.5 BOLT,oracle 5.0 PGO One way to evaluate O COLOS’s overheads in an end-to- BOLT,average-case 2.5end manner is to consider how long it takes O COLOS to 0.0“recover” the ground lost during code replacement. Considering 125MySQL read only as an example (Figure 7), the steady state 100 branches takenthroughput of the original program is about 4,200 transactions 75per second, which O COLOS boosts to 5,850 tps after code 50replacement is complete. Taking the reduced throughput 25 0during code replacement into account, at 30 seconds after 30 mispredictionscode replacement completes O COLOS has processed as many branchtransactions as if we had run the original binary the entire time. 20All execution after this point is a net gain for O COLOS, so 10running for several minutes before the next code replacementis advisable in practice. With smaller speedups, O COLOS must 0 read update select ranges points write insert write delete nonindex only read only point rand index update randrun for longer before performing code replacement again. Moregenerally, if O COLOS hurts performance by a factor of a duringcode replacement which lasts for s seconds, and then boosts Fig. 8: Microarchitectural events (per 1,000 instructions) forperformance by a factor of b after code replacement completes, MySQL inputswe should run the optimized code for at least as/b seconds torecover the ground lost during code replacement
MySQL Mongo Mem$ Verilator can accurately determine workloads that will and won’t benefit perf2bolt time (sec) 28.186 26.624 12.918 4.181 from O COLOS (Figure 9). Moreover, with O COLOS’s online llvm-bolt time (sec) 8.237 17.882 0.1404 1.935 approach, even should identifying performance losses a priori replacement time (sec) 0.669 1.221 0.020 0.146 prove challenging, we can always revert to C0 code to at least TABLE II: Fixed costs of code replacement recover the original performance
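As a worked example of this break-even rule (reading approximate values off Figure 7, and interpreting a and b as fractional throughput changes relative to the original binary, which is our reading of the formula): profiling and replacement drop throughput from roughly 4,200 tps to 3,600 tps, so a ~ 600/4,200 ~ 0.14; afterwards throughput rises to about 5,850 tps, so b ~ 1,650/4,200 ~ 0.39. The degraded period (regions 2-4) spans roughly s ~ 85 seconds, giving a break-even time of as/b ~ 0.14 x 85 / 0.39 ~ 31 seconds, consistent with the 30-second recovery point reported above.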
4) Microarchitectural Impacts: Next we investigate the microarchitectural causes of OCOLOS's performance benefits. Figure 8 shows a variety of front-end performance-counter measurements, each represented as events per 1,000 instructions. The MySQL inputs along the x-axis are sorted from highest (left) to lowest (right) speedup with OCOLOS, matching the order in Figure 5. Moving from top to bottom in Figure 8, we see that OCOLOS achieves significant reductions in L1i and iTLB MPKI. All MySQL inputs also show large reductions in the number of taken branches; fewer taken branches mean less pressure on branch prediction resources, which may reduce mispredicted branches as well. Across all of these front-end metrics, OCOLOS achieves results very similar to offline BOLT.

Somewhat surprisingly, the front-end metrics in Figure 8 often do not correlate particularly well with the speedup that OCOLOS provides. To overcome this, we again turned to Intel's TopDown [109] methodology. Using TopDown's Front-End Latency and Retiring percentages, a simple linear regression can accurately determine which workloads will and won't benefit from OCOLOS (Figure 9). Moreover, with OCOLOS's online approach, even if identifying performance losses a priori proves challenging, we can always revert to the C0 code to at least recover the original performance.

[Fig. 8: Microarchitectural events (per 1,000 instructions) for MySQL inputs: L1i misses, iTLB misses, branches taken, and branch mispredictions for the original, OCOLOS, BOLT (oracle), PGO, and BOLT (average-case) configurations.]

[Fig. 9: TopDown's [109] Front-End Latency and Retire percentages allow us to accurately classify which workloads will benefit from OCOLOS and which won't.]
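A sketch of such a classifier appears below. The weights and threshold are hypothetical placeholders standing in for the fitted regression coefficients, which the paper does not list; only the two TopDown features come from the text.

// Linear decision function over TopDown's Front-End Latency and Retire
// percentages, in the spirit of the regression behind Figure 9.
#include <cstdio>

bool benefitsFromOcolos(double frontEndLatencyPct, double retirePct) {
    // Intuition: workloads stalling heavily on the front end while retiring
    // few slots have the most headroom for code-layout optimization.
    double score = 0.05 * frontEndLatencyPct - 0.03 * retirePct;  // assumed weights
    return score > 0.5;                                           // assumed threshold
}

int main() {
    // A workload with 40% front-end-latency slots and 15% retiring slots:
    std::printf("%s\n", benefitsFromOcolos(40.0, 15.0)
                            ? "likely benefits"
                            : "unlikely to benefit");
}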
D. Batch Accelerator Mode

In this section, we examine the impact of BAM on a from-scratch build of the Clang compiler. In a large software build, BAM profiles the initial compiler executions to generate an optimized compiler binary that is tuned to the source program being compiled. A full Clang build contains 2,624 compiler executions in all. The dashed red line near the top of Figure 10 illustrates the running time of the original Clang build, executing parallel jobs via make -j. For the dashed orange line at the bottom, we aggregate profiling information from the entire build and feed it to BOLT, and then measure a fresh build using the resulting BOLTed binary. This represents a lower bound on the running time that BAM can achieve.

The green triangles (with a polynomial curve fit to them) show the performance of BOLT when we profile only a limited number of compiler executions (given on the x-axis), generating an optimized binary and measuring the time of a fresh build that uses this binary. The cost of collecting profiles and running BOLT is excluded; the optimized binary is available at the start of the build. These results show how well an idealized BAM implementation could perform if it did not suffer from any profiling and optimization overheads, revealing the marginal utility of extra profiling data.
Finally, the blue squares (and polynomial curve) show the performance achieved by BAM. We first observe that, even when profiling just one compiler execution, BAM provides a speedup of 1.09× over the original build. At first, profiling additional compiler executions leads to a speedup of up to 1.14×, as this profiling data is "worth the wait". However, after about 5 executions BAM suffers diminishing returns from additional profiling, for two reasons. First, the value of the extra profiling data is relatively low, as shown by the decreasing slope of the green line. Second, as BAM waits for more profile data, it starts the optimization process later, losing out on opportunities to use the optimized binary. This opportunity cost increases over time, causing the BAM running time to eventually surpass that of the original build.

Ultimately, our BAM investigation demonstrates that the amount of profiling data needed to run PGO effectively is quite low, mirroring our results from Section VI-C1. BAM is able to leverage this property to accelerate the Clang build, without any changes to Clang or the build infrastructure.

[Fig. 10: The running time of a Clang build with the original compiler and with compilers optimized by BOLT and BAM.]
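The trade-off just described can be captured in a toy model, sketched below. Everything in it is assumed for illustration: the per-execution compile time, and the diminishing-returns curve (calibrated only loosely to the 1.09× and 1.14× endpoints reported above). It reproduces the qualitative shape of Figure 10, not its actual data.

// Toy model of BAM's profiling trade-off (Section VI-D): the first n
// compiler executions run unoptimized (and are profiled), then the rest of
// the build runs at an improved speed that grows, with diminishing returns,
// in the amount of profile data collected.
#include <cmath>
#include <cstdio>

const double kTotalExecs = 2624;   // compiler executions in a full Clang build
const double kSecsPerExec = 0.36;  // assumed average unoptimized compile time

// Assumed diminishing-returns curve: 1.09x after one profiled execution,
// approaching the 1.14x "profile everything" bound as n grows.
double speedup(double n) { return 1.14 - 0.05 * std::exp(-(n - 1) / 2.0); }

double buildTimeSecs(double n) {
    double profilingPhase = n * kSecsPerExec;  // runs unoptimized
    double optimizedRest = (kTotalExecs - n) * kSecsPerExec / speedup(n);
    return profilingPhase + optimizedRest;     // opportunity cost grows with n
}

int main() {
    for (double n : {1.0, 5.0, 25.0, 500.0})
        std::printf("profile %4.0f executions -> build takes %.0f s\n",
                    n, buildTimeSecs(n));
}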
VII. RELATED WORK

The performance implications of front-end stalls have inspired computer architecture and compiler researchers to propose numerous techniques for improving instruction locality. We divide this work into three categories and qualitatively compare these techniques against OCOLOS to describe how OCOLOS addresses their shortcomings.
Instruction prefetching mechanisms. Computer architects primarily aim to solve the front-end stall problem via instruction prefetching [4, 24, 25, 28, 31, 33, 42, 46, 47, 57, 58, 59, 68, 70, 87, 88, 92, 93, 94, 96, 97]. A plethora of such techniques, ranging from simpler next-line [97] and discontinuity [42, 81, 100, 101] prefetchers to sophisticated temporal [24, 25, 46, 47] (or record-and-replay [6]) prefetchers, aim to strike a balance between performance and high metadata storage overhead. Branch-predictor-guided prefetchers [87, 88] are extremely effective [40, 41, 58, 59] and consequently have been adopted in many recent processors [32, 78, 95, 102]. Nevertheless, these state-of-the-art prefetchers fall short when applications contain a large number of taken branch instructions that exhaust the capacity of the branch predictor and BTB [25, 48, 58, 99]. OCOLOS can convert taken branches into not-taken ones, easing pressure on the branch predictor (Figure 8) and improving overall performance.
Profile-guided code layout optimizations. Compiler techniques to address the front-end stall problem mainly focus on improving instruction locality via code layout optimizations [10, 29, 34, 36, 61, 63, 65, 76, 77, 79, 86, 113]. These techniques perform basic-block reordering [72, 80], function reordering [75], and hot/warm/cold code splitting [12] (also known as function splitting [76]) using profiles collected from previous executions [27, 38]. While these techniques are extremely effective at improving instruction locality [6] and are therefore widely adopted in today's data centers [10, 76, 77], profile quality limits their ability to achieve close-to-optimal performance, as we show in Section III-A. To address this limitation, OCOLOS always uses the best-quality profile: one from the current execution. Some managed-language runtimes, like the HotSpot JVM [39], also perform PGO at run time, profiling the application running on the VM. While OCOLOS targets unmanaged languages instead, OCOLOS could complement a system like HotSpot by performing PGO on the running HotSpot binary itself.
Other systems [17, 26, 60] have also proposed run-time code optimization for unmanaged languages. ClangJIT [26] can perform C++ template specialization at run time, improving
…Profile-Guided code layout optimization tool for unmanaged code. We show how to perform online code replacement efficiently in unmodified, large-scale C/C++ programs. We evaluate OCOLOS …