logo资料库

PHP外语文献.pdf

第1页 / 共14页
第2页 / 共14页
第3页 / 共14页
第4页 / 共14页
第5页 / 共14页
第6页 / 共14页
第7页 / 共14页
第8页 / 共14页
资料共14页,剩余部分请下载后查看
Abstract
1 Introduction
2 Microarchitectural Analysis
3 Mitigate Abstraction Overhead
4 Specializing the General-Purpose Core
4.1 Accelerator Design Principles
4.2 Hash Table
4.3 Heap Manager
4.4 String Accelerator
4.5 Regular Expression Accelerator
4.6 ISA Extensions
5 Experimental Framework and Results
5.1 Methodology
5.2 Performance and Energy Improvement
5.3 Breakdown of Execution Time
6 Related Work
7 Conclusion
References
Architectural Support for Server-Side PHP Processing Dibakar Gope David J. Schlais Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison Madison, WI, USA gope@wisc.edu,schlais2@wisc.edu,mikko@engr.wisc.edu ABSTRACT PHP is the dominant server-side scripting language used to im- plement dynamic web content. Just-in-time compilation, as imple- mented in Facebook’s state-of-the-art HipHopVM, helps mitigate the poor performance of PHP, but substantial overheads remain, especially for realistic, large-scale PHP applications. This paper ana- lyzes such applications and shows that there is little opportunity for conventional microarchitectural enhancements. Furthermore, prior approaches for function-level hardware acceleration present many challenges due to the extremely flat distribution of execution time across a large number of functions in these complex applications. In-depth analysis reveals a more promising alternative: targeted ac- celeration of four fine-grained PHP activities: hash table accesses, heap management, string manipulation, and regular expression han- dling. We highlight a set of guiding principles and then propose and evaluate inexpensive hardware accelerators for these activities that accrue substantial performance and energy gains across dozens of functions. Our results reflect an average 17.93% improvement in performance and 21.01% reduction in energy while executing these complex PHP workloads on a state-of-the-art software and hardware platform. CCS CONCEPTS • Computer systems organization → Cloud computing; Special purpose systems; KEYWORDS PHP, dynamic languages, domain-specific accelerators ACM Reference format: Dibakar Gope David J. Schlais Mikko H. Lipasti Department of Electri- cal and Computer Engineering University of Wisconsin - Madison Madison, WI, USA . 2017. Architectural Support for Server-Side PHP Processing. In Proceedings of ISCA ’17, Toronto, ON, Canada, June 24-28, 2017, 14 pages. https://doi.org/10.1145/3079856.3080234 1 INTRODUCTION In recent years, the importance and quantity of code written in dy- namic scripting languages such as PHP, Python, Ruby and Javascript Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ISCA ’17, June 24-28, 2017, Toronto, ON, Canada © 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-4892-8/17/06. . . $15.00 https://doi.org/10.1145/3079856.3080234 has grown considerably. Among all scripting languages used for server-side web development, PHP is the most commonly used [70, 72], representing about 82.3% [13] of all web applications. In par- ticular, server-side PHP web applications have created an ecosystem of their own in the world of web development. They are used to build and maintain websites and web applications of all sizes and complexity, ranging from small websites to complex large scale en- terprise management systems. PHP powers many of the most popular web applications, such as Facebook and Wikipedia. Despite their considerable increase in popularity, their performance is still the main impediment for deploying large applications. Since these PHP applications run on live datacenters hosting millions of such web applications, even small improvements in performance or utilization will translate into immense cost savings. The desire to reduce datacenter load inspired the design of the HipHop JIT compiler (HHVM) [19, 72], which translates PHP to native code. HHVM demonstrates a significant performance increase for a set of micro-benchmark suites. However, runtime character- istics of popular real-world PHP web applications (e-commerce platforms, social networking sites, online news forums, banking servers, etc.) are found to be dramatically different than the de- facto benchmark suites SPECWeb2005 [10], bench.php [18], and the computer language benchmarks [1] used so far for evaluating the performance of web servers. These micro-benchmark suites have been the primary focus for most architectural optimizations of web servers [20]. Furthermore, these micro-benchmarks spend most of their time in JIT-generated compiled code, contrary to the real-world PHP applications that tend to spend most of their time in various library routines of the VM. Figure 1 depicts the distribution of CPU cycles spent in the hottest leaf functions of a few real-world PHP applications compared to SPECWeb2005’s banking and e-commerce workloads. Clearly, the SPECWeb2005 workloads contain significant hotspots – with very few functions responsible for about 90% of their execution time. However, the PHP web applications exhibit significant diversity, having very flat execution profiles – the hottest single function (JIT compiled code) is responsible for only 10− 12% of cycles, and they take about 100 functions to account for about 65% of cycles. This tail-heavy behavior presents few obvious or compelling opportunities for microarchitectural optimizations. In order to understand the microarchitectural implications (perfor- mance and energy-efficiency bottlenecks) of these workloads with hundreds of leaf functions, we undertook a detailed architectural characterization of these applications. Despite our best effort, we could not find any obvious target or easy opportunity for architectural optimization. 507
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada D. Gope et al. Heap Management. A significant fraction of execution time in these applications come from memory allocation and deallocation, despite significant efforts to optimize them in software [37, 51]. Current software implementations mostly rely on recursive data structures and interact with the operating system, which makes them non-trivial to implement in hardware. However, these applications exhibit an unique memory usage pattern that allows us to develop a heap manager with its most frequently accessed components in hardware. It can respond to most of the memory allocation and deallocation requests from this small and simple hardware structure. String Functions. These PHP applications exercise a variety of string copying, matching, and modifying functions to turn large volumes of unstructured textual data (such as social media updates, web documents, blog posts, news articles, and system logs) into appropriate HTML format. Prior works [68] have realized the po- tential of hardware specialization for string matching, but do not support all the necessary string functions frequently used in these PHP applications. Nevertheless, designing separate accelerators to support all the necessary string functions will deter their commercial deployment. Surprisingly, all the necessary string functions can be supported with a few common sub-operations. We propose a string accelerator that supports these string functions by accelerating the common sub-operations rather than accelerating each function. It processes multiple bytes per cycle for concurrency to outperform the currently optimal software with SSE extensions. Regular Expressions. These PHP applications also use regular expressions (regexps) to dynamically generate HTML content from the unstructured textual data described above. Software-based reg- exp engines [8, 57] or recent hardware regexp accelerators [34, 39, 58, 66] are overly generic in nature for these PHP applications as they do not take into consideration the inherent characteristics of the regular expressions in them. We introduce two novel techniques – Content Sifting and Content Reuse – to accelerate the execution of regexp processing in these PHP applications and achieve high performance and energy efficiency. These two techniques signifi- cantly reduce the repetitive processing of textual data during regular expression matching. They essentially exploit the content locality across a particular or a series of consecutive regular expressions in these PHP applications. The remainder of this paper is organized as follows. Section 2 performs an in-depth microarchitectural analysis of the PHP applica- tions. Failing to find any clear microarchitectural opportunities shifts our focus towards designing specialized hardware. We discover the potential candidates for specialization in Section 3. In Section 4 we design specialized hardware accelerators for them. Section 5 presents results. Section 6 discusses related work and Section 7 concludes the paper. 2 MICROARCHITECTURAL ANALYSIS We begin by performing an in-depth architectural characterization of the PHP applications to identify performance and energy-efficiency bottlenecks (if any) in them. We use gem5 [4] for our architectural simulation. Surprisingly, the most significant bottlenecks lie in the processor front-end, with poor branch predictor and branch target buffer performance. Neither increased memory bandwidth, nor larger Figure 1: Distribution of CPU cycles of SPECWeb2005 work- loads and few real-world PHP web applications over leaf func- tions (details in evaluation section). As the processor industry continues to lean towards customization and heterogeneity [41, 54, 65, 67] to improve performance and en- ergy efficiency, we seek to embrace domain-specific specializations for these real-world PHP applications. We note function level special- ization is not a viable solution for these applications given their very flat execution profiles. However, a closer look into the leaf functions’ overall distribution reveals that many leaf functions suffer from ei- ther the abstraction overheads of scripting languages (such as type checking [22], hash table accesses for user-defined types [31, 32], etc.) or the associated overhead of garbage collected languages [25]. These observations guide us to apply several hardware and software optimization techniques from prior works [22, 31, 32, 40] together to these PHP applications in order to minimize those overheads. After applying these optimizations, a considerable fraction of their exe- cution time falls into four major categories of activities – hash map access, heap management, string manipulation, and regular expres- sion processing. These four categories show the potential to improve the performance and energy efficiency of many leaf functions in their overall distribution. This motivates us to develop specialized hardware to accelerate these four major activities. Hash Table Access. Unlike micro-benchmarks that mostly ac- cesses hash maps with static literal names, these real-world applica- tions often tend to exercise hash maps in their execution environment with dynamic key names. These accesses cannot be converted to regular offset accesses by software methods [31, 32, 40]. Programma- bility and flexibility in writing code are two of several reasons to use dynamic key names. In order to reduce the overhead and inher- ent sources of energy inefficiency from hash map accesses in these PHP applications, we propose to deploy a hash table in hardware. This hash table processes both GET and SET requests entirely in hardware to satisfy the unique access pattern of these PHP applica- tions, contrary to prior works deploying a hash table that supports only GET requests in a memcached environment [55]. Furthermore, supporting such a hash table in the PHP environment presents a new set of challenges in order to support a rich set of PHP features com- municating with hash maps that these real-world PHP applications tend to exercise often. 02040608010016111621263136414651566166717681869196Distribution of cycles (CDF %)# Leaf functionsWordpressDrupalMediawikiSPECWeb(Banking)SPECWeb(E-commerce)Hottest: JIT-compiled code: ~10%508
Architectural Support for Server-Side PHP Processing ISCA ’17, June 24-28, 2017, Toronto, ON, Canada (a) Sensitivity to BTB and ICache sizes, normalized to 4K BTB 32KB ICache. Other PHP applications show similar behavior. (b) Cache analysis (c) In-order vs. out-of-order behavior Figure 2: Microarchitectural characterization of the PHP applications. instruction or data caches show significant opportunity for improving performance. Branch predictor bottlenecks. We experimented with the state- of-the-art TAGE branch predictor [63]1 with 32KB storage budget. The branch mispredictions per kilo-instructions (MPKI) for the three PHP applications considered in this work are 17.26, 14.48, and 15.14. Compared to the 2.9 MPKI average for the SPEC CPU2006 benchmarks, these applications clearly suffer from high mispre- diction rates. The poor predictor performance is primarily due to the presence of large number of data-dependent branches in the PHP applications. The outcomes of data-dependent branches depend solely on unpredictable data that the PHP applications require to process during most of their execution time. Prior work on predicting data-dependent branches [35] may improve the MPKI of the PHP applications. Branch target buffer bottlenecks. Real-world PHP applica- tions suffer significantly from the poor performance of branch target buffers (BTBs) in server cores. We simulate a BTB that resembles the BTB found in modern Intel server cores with 4K entries and 2-way set associativity. Such behavior stems from the large number of branches in the PHP applications. Around 12% of all dynamic instructions are branches in the SPEC CPU2006 workloads [73], whereas in the PHP applications about 22% of all instructions are branches, thus adding more pressure on BTB. Figure 2(a) shows the execution time of one of the PHP applications, as the BTB size is progressively increased from 4K entries to 64K entries for different sizes of I-cache. However even with 64K entries, the PHP appli- cation obtains a modest BTB hit rate of 95.85%. Deploying such large BTBs is not currently feasible in server cores due to power constraints. Cache analysis. Figure 2(b) presents the cache performance. L1 instruction and data cache behavior are more typical of SPEC CPU-like workloads contrary to the instruction cache behavior ob- served in prior works with other server-side applications (servers- side Javascript applications [73] or memcached workloads [55]). Note that we simulate an aggressive memory system with prefetchers at every cache level. Although there are hundreds of leaf functions in the execution time distribution of these PHP applications, they are 1The branch prediction accuracy observed on Intel server processors and TAGE are in the same range [60] compact enough that can be effectively cached in the L1. Besides, the numerous data structures in these PHP applications do not appear to stress the data cache heavily. The L2 cache has very low MPKI, as the L1 filters out most of the cache references. Figure 2(a) shows the potential of minor performance gain with very large instruction caches. In-order vs. out-of-order. Figure 2c shows the impact four different architectures (2-wide in-order, 2-wide out-of-order (OoO), 4-wide OoO, and 8-wide OoO) had on workload execution time. Changing from in-order to OoO cores shows a significant increase in performance. We also note that the 4-wide OoO shows fairly significant performance gains over the 2-wide OoO architecture, hinting that some ILP exists in these workloads. However, increasing to an 8-wide OoO machine shows very little (< 3%) performance increase, hinting that ILP cannot be exploited for large performance benefits beyond 4-wide OoO cores. Overall, our analysis suggests that PHP applications require far more BTB capacity and much larger caches than server cores cur- rently provide to obtain even minor performance benefit. In short, our analysis does not present any obvious potential target for mi- croarchitectural optimizations. 3 MITIGATE ABSTRACTION OVERHEAD As microarchitectural analysis fails to reveal any clear opportuni- ties for improvement, we shift our focus towards augmenting the base processor with specialized hardware accelerators. However, function-level acceleration is not an appealing solution for these ap- plications given their very flat execution profiles. Nevertheless, it is commonly known that PHP-like scripting languages suffer from the high abstraction overheads of managed languages. Overhead exam- ples include dynamic type checks for primitive types, and hash table accesses for user-defined types. Furthermore PHP, like all garbage collected languages, suffers from the overhead of reference counting. While there are many research proposals [22, 31, 32, 40] from the academic community in mitigating each of these abstraction over- heads separately, most of them have not been adopted so far by the industry into commercial server processors. Considering the fact that these abstraction overheads constitute a significant source of perfor- mance and energy overhead in these PHP applications (as our results indicate), we believe the industry is more likely to embrace these 0.850.870.890.910.930.950.970.9932KB64KB128KB512KBNormalized execution timeICache sizeWordPress4K BTB8K BTB16K BTB32K BTB64K BTB0510152025ICacheMPKIDCacheMPKIL2 MPKIMisses per kilo-instructions (MPKI)WordPressDrupalMediaWiki00.20.40.60.81WordPressDruaplMediaWikiNormalized Execution TimeInOrder(2 wide)OoO(2 wide)OoO(4 wide)OoO(8 wide)509
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada D. Gope et al. subsystem performs the required type check for a variable before returning its value. Reference Counting. Reference counting constitutes a major source of overhead in these PHP applications as it is spread across compiled code and many library functions. We adopted a hardware proposal from prior work that introduces minimal changes to the cache subsystem to mitigate this overhead [46]. In addition to applying the above four optimizations, we tuned the applications to reduce their overhead from expensive memory allocation and deallocation calls to the kernel. Figure 3 demonstrates the effect of applying these above op- timizations to the leaf functions of one of the PHP applications, WordPress [15]. The left bar shows the contribution of the leaf func- tions to the overall execution time before applying the optimizations, whereas the right bar shows their corresponding contribution after applying the optimizations. Clearly, the contribution of many leaf functions diminishes with these optimizations (indicated by arrows), and as a result, the contributions of the remaining functions in the overall distribution have gone up. More interestingly, many of the leaf functions in the overall distribution now fall into four major categories – hash map access, heap management, string manipula- tion, and regular expression processing, as shown by the different color coding in Figure 4. This consequently presents opportunities to accelerate them in hardware to obtain performance and energy efficiency. Figure 5 shows the execution time breakdown of a few real-world PHP applications after applying the above optimizations. If a function contained aspects of one of the four categories, we grouped it into execution time for that category.2 Since these four categories show the potential to improve the execution efficiency of many leaf functions, we propose specialized hardware to accelerate these activities. 4 SPECIALIZING THE GENERAL-PURPOSE CORE In this section, we propose accelerators for the four major hot spots observed in the PHP applications. We first describe the design prin- ciples that we followed while developing these accelerators and then describe their hardware design. 4.1 Accelerator Design Principles Recent explorations in cache-friendly accelerator design demonstrate the criticality and feasibility of balancing the efficiency of applica- tion specific specializations with general-purpose programmability using tightly-coupled accelerators [64]. We espouse a similar accel- erator design philosophy, and propose accelerators that adhere to the following design principles so that they fit naturally into multi-core server SoCs. a) They are VM and OS agnostic. The VM still observes the same view of software data structures in memory. b) They have cache interfaces and participate in the cache coher- ence mechanism. c) They only accelerate the frequently-executed common path through each function. Any unusual or unexpected condition is relegated to a software handler. 2For the few functions containing aspects of multiple categories, it was placed in the category it spent more of its execution time. Figure 3: Contribution of leaf functions to the execution time of WordPress before and after applying all optimizations. Figure 4: Categorization of leaf functions of WordPress into ma- jor categories. proposals sooner than later. So in order to mitigate the abstraction overheads and get a clear view of what other fundamental activities are going to dominate the execution time of these PHP applications in near future, we apply several hardware and software optimizations from prior research together to these applications in our simulated environment. Note that the objective of this exercise is not only to move the abstraction overheads towards the tail of the distribution of these applications, but also to determine the fundamental dominant activities in many of the leaf functions which were obscured by these overheads that have known solutions. Next, we describe briefly those optimizations from prior research proposals. Inline Caching [31, 32] and Hash Map Inlining [40]. Modern JIT compilers [14, 19] use inline caching (IC) to specialize code that accesses members in dynamically-typed objects. With IC, access to a dynamically-typed object is specialized to a simple offset access from the start of the object. A type check around that offset access ensures that the assumptions used in the generation of specialized code hold at runtime. We extend the IC technique to specialize these PHP applications’ accesses to hash maps as done in [40]. Further, we adopt the recent proposal on hash map inlining [40] (HMI) to specialize hash map accesses with variable though predictable key names. Type Checking. As discussed above, specialized code for ac- cessing variables of primitive or user-defined types now requires run-time type checks. We adopted a technique from prior work [22] to mitigate this overhead in hardware. With this technique, the cache 0246810121234567891011121314151617181920212223242526272829303132333435% of Execution TimeLeaf functionsBaseline HipHopBaseline HipHop + Prior OptimizationsCompiledCodeType Checking + ReferenceCountingInlineCachingMallocTuningReferenceCountingWordPress0123456781234567891011121314151617181920212223242526272829303132333435% of Execution TimeLeaf functionsRegular ExpressionString FunctionHeap ManagementHash Map AccessHeap ManagementString FunctionRegular ExpressionOthersHash Map Access510
Architectural Support for Server-Side PHP Processing ISCA ’17, June 24-28, 2017, Toronto, ON, Canada (a) WordPress (b) Drupal (c) MediaWiki Figure 5: Execution time breakdown after mitigating the abstraction overheads. d) They are tightly coupled and are invoked via a small set of single-cycle instruction extensions to the general-purpose ISA. e) They rely on a shared virtual address space and maintain a coherent view of memory to avoid the need for explicitly managed scratchpad memories. Our evaluation shows that most memory refer- ences from these accelerators are small and fall within the boundaries of a single 4KB page, so a single TLB lookup performed by the invoking instruction is sufficient. The hardware for managing co- herence is necessarily somewhat complex, but, in our view, well worth the effort since it dramatically simplifies deployment of these accelerators in a realistic multicore system. 4.2 Hash Table In order to reduce the overhead from hash map accesses, we deploy a hash table in hardware. Overview Real-world PHP applications frequently access hash maps with dynamic key names; such accesses cannot be converted to efficient offset references by software methods [31, 32, 40]. Typi- cally, these applications use many PHP commands that access short- lived hash maps using dynamic key names. For example, the PHP extract command is commonly used to import key-value pairs from a hash map into a local symbol table3 in order to communicate their values later to an appropriate application template that is re- sponsible for generating some dynamic content. Populating such a symbol table always occurs using dynamic key names. Furthermore, these PHP applications often store key-value pairs in a global or local symbol table to communicate their values to other functions in the appropriate scope. For example, the regular expression man- ager shares a search pattern (key) and its FSM table (value) with other appropriate functions through a hash map. Accesses to all such hash maps occur using dynamic key names. More importantly, such accesses to hash maps ease programming while developing large applications. Proposed design Figure 6 shows the hash table accelerator de- sign. A critical requirement with hardware-traversable hash tables is to bound the number of hash table entries accessed per lookup. Thus, when a key is looked up in the hash table in our design, several consecutive entries are accessed in parallel, starting from the first indexed entry, to find a match. A hash table lookup in hardware thus 3A symbol table is implemented using a hash map. Figure 6: Hardware hash table. Figure 7: Hash table hit rate. reduces control-flow divergence and exploits parallelism with hash map accesses. Note that it is not easy to extract parallelism from a serial hash map walk in software because of the complex control flow, memory aliasing, and numerous loop-carried dependencies. Each hash table entry contains a string field to store a key, an 8-byte address (base address of a hash map structure in memory), a pointer to the memory location of the value, a dirty bit to indicate if the hash map structure in memory is required to be updated with the corresponding key, and a valid bit. The valid and dirty bits assist in replacing entries and making space for new incoming keys. The 8-byte address field contains the base address of a hash map data structure, accesses to which for a key-value pair the hash table Hash map accessHeap managementString manipulationRegular exp.OthersHash map accessHeap managementString manipulationRegular exp.OthersHash map accessHeap managementString manipulationRegular exp.OthersHashKeyBase address (HashmapDS)Base AddressKeyValue pointerFlagsPointer to MemoryPointer to Hash map entryReverse Translation Table0204060801001(1)2(1)4(1)32(1)64(1)128(1)128(4)256(4)512(4)1024(4)Hit Rate %# Entries (Associativity)WordPressDrupalMediaWiki511
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada D. Gope et al. attempts to provide a fast lookup. Thus, in response to a request, the hash table performs a hash on the combined value of the key and the base address of the requested hash map to index into an entry and begin the lookup process. Starting with the entry, when a hash table lookup is performed, each entry has its key and 8-byte base address compared with the key and the base address of the request. Upon a match, the hash table updates the entry’s last-access time stamp (for LRU replacement) and sends the response to the request. If no match is found, control falls back to the software to perform the regular hash map access in memory. Design considerations The HHVM JIT compiler [19] already uses an efficient hash computation function that can operate on variable length strings (in our case the combination of a hash map base address and a key in it). However, we used a simplified hash function for the hash table without compromising its hit rate. This is because the HHVM hash function is overly complex to map into an efficient hardware module and it requires many processor cycles to compute a hash. Our design leverages several inherent characteristics of these PHP applications. First, in contrast to the most large-scale memcached deploy- ments [23, 55] where the GET requests vastly outnumber the SET and other request types, these PHP applications observe relatively higher percentage of SET requests (ranging from 15− 25%) when generating dynamic contents for web pages. As a result, a hash table deployed for such applications should respond to both GET and SET requests in order to take full advantage of the hardware hash table and offload most operations associated with hash map accesses from the core. Second, the majority (about 95%) of the hash map keys accessed in these PHP applications are at most 24 bytes in length. As a result, we store the keys in the hash table itself, unlike the hash table designed for memcached deployments [55]. Storing the keys directly in the hash table eases the traversal of the hash table in hardware. We next discuss typical operations offloaded from a traditional software hash map to our hardware hash table. Note that the alloca- tion or the deallocation of a hash map structure in memory is still handled by software. GET: A GET request attempts to retrieve a key-value pair of the requested hash map from the hash table without any software interaction. Upon a match, the hash table updates the entry’s last- access time stamp. If the key is not found for the requested hash map, control transfers to the software to retrieve the key-value pair from memory and places it into the hash table. In order to make space for the retrieved key, if an invalid entry is not found, then a clean entry (dirty bit not set) is given more priority for replacement. This avoids any costly software involvement associated with replacing a dirty entry from the hardware hash table. The hash map of a dirty entry is not up-to-date in memory with the key-value pair and therefore requires software intervention in our design to be updated when the entry is evicted. If none is found, the LRU dirty entry is replaced with incurring the associated software cost. SET: A SET request attempts to insert a key-value pair of the requested hash map into the hash table without any software interac- tion. Upon finding a match, SET simply updates the value pointer and the entry’s last-access time stamp. If the key is not found, then the key-value is inserted into the hash table and the entry’s dirty bit is set. If an eviction is necessary (due to hash table overflow), the same replacement policy as described for GET is followed. Note that a SET operation silently updates the hash table in our design without updating the memory. Figure 7 demonstrates the hit rate of such a hash table. Even a hash table with only 256 entries observes a high hit rate of about 80%. Since SET operations never miss in our design, a hash table with very few entries (1, 2 or 4) shows such a decent hit rate. Having support of a SET operation in hardware helps in serving a major fraction of the short-lived hash map accesses from the hash table. Free: In response to deallocating a hash map from the memory, the hardware hash table would (in a naive implementation) need to scan the entire table to determine the set of entries belonging to the requested hash map. The reverse translation table (RTT) in Figure 6 assists the hash table during this seemingly expensive operation. The RTT is indexed by the base address of a requested hash map. Each RTT entry stores back pointers to the set of hash table entries containing key-value pairs of a hash map. Each RTT entry also has a write pointer. The write pointer assists in adding these back-pointers associated with different key-value pairs of a hash map into the RTT entry in the same order as they are inserted into the hash table. As a result, a SET operation adds a back pointer in a RTT entry using the associated write pointer and increments it to point to the next available position in the RTT entry. Consequently, each entry in the RTT is implemented using a circular buffer. When an entry is evicted from the hash table, its back pointer in the RTT is invalidated. Hence in response to a Free request, the RTT invalidates the hash table entries of the requested hash map. This way, short-lived hash maps mostly stay in the hash table throughout their lifetime without ever being written back to the memory. Replacement: Replacement is handled by software as discussed above. foreach: The foreach array iterator in PHP iterates over the key- value pairs of a hash map in their order of insertion. The RTT assists in performing this operation. It captures the order of insertion and later updates the memory of the hash map using it in response to a foreach command. Even if an inserted key-value pair is evicted from the hash table and re-inserted later, the RTT can still guarantee the required insertion order invariant. Ensure coherence Each hash map in the hardware hash table has an equivalent software hash map laid out in the conventional address space. The two are kept coherent by enforcing a writeback policy from the hardware hash map, and by requiring inclusion of the software equivalent at the L2 level. When a hash map is first inserted into the hardware hash table, the accelerator acquires exclusive coherence permission to the entire address range (depending on the size of the hash map this could be several cache lines). If remote coherence requests arrive for any hardware hash map entries, they are forwarded via the RTT to the accelerator, which flushes out any entries corresponding to the address range of the hash map (the same thing happens for L2 evictions to enforce inclusion). In practice, the hash maps we target are small (requiring a handful of cache lines), process private, and exhibit a lot of temporal locality (small hash maps are freed and reallocated from the process-local heap), so there is virtually no coherence activity due to the hash map accelerator. The software hash map stores each key/value pair in a table ordered 512
Architectural Support for Server-Side PHP Processing ISCA ’17, June 24-28, 2017, Toronto, ON, Canada (a) Cumulative memory usage with different alloca- tion slabs. (b) Memory usage of WordPress. (c) Memory usage of MediaWiki. Figure 8: Memory usage pattern of the PHP applications. based on insertion, and also stores a pointer to that table in a hash table for fast lookup. The hardware hash table only writes back to the former table while flushing out entries, and marks a flag in the software hash map to indicate that the hash table of the software hash map is now stale. Subsequent software accesses to the software hash map check this flag and reconstruct the hash table if the flag is set. This is exceedingly rare in practice (triggered only by process migration), so the reconstruction mechanism is necessary only for correctness. 4.3 Heap Manager Memory allocations and deallocations are ubiquitous in real-world PHP applications. They consume a substantial fraction of the execu- tion time and are spread across many leaf functions. To handle dy- namic memory management, the VM typically uses the well-known slab allocation technique. In slab allocation, the VM allocates a large chunk of memory and breaks it up into smaller segments of a fixed size according to the slab class’s size and stores the pointer to those segments in the associated free list. Overview We propose a heap manager with its most frequently accessed components in hardware to improve its performance and energy-efficiency. Our hardware heap manager is motivated by the unique memory usage pattern of these PHP applications. First, a majority of the allocation and deallocation requests re- trieve at most 128 bytes, which reflects heavy usage of small memory objects. Figure 8a shows the cumulative distribution of memory us- age with different memory allocation slabs. Second, these applications exhibit strong memory reuse. Such strong memory reuse is due to the following reasons: (a) these applications insert many HTML tags while generating dynamic contents for web pages. HTML tags are the keywords within a web page that define how the browser must format and display the content. Generating and formatting these HTML tags often require retrieving many attribute values from the database, storing them in string objects and later concatenating those values to form the overall formatted tag. Once a HTML tag is produced with all its required attributes, the memory associated with these strings are recycled. (b) These applications typically process large volumes of textual data, URL etc. Such processing commonly parses the content through many string functions and regular expressions with frequent memory allocations and deallocations to hold the string contents. To illustrate Figure 9: Hardware heap manager. this, Figure 8b and Figure 8c show memory usage pattern of two workloads during the course of their execution. We see that memory usage stays flat for the four smallest slabs of 0−32, 32−64, 64−96, and 96− 128 bytes, demonstrating their strong memory reuse for the slabs that dominate the total memory usage. Proposed design We deploy a heap manager with only its size class table and a few free lists in hardware to capture the heavy memory usage of small objects and their strong memory reuse. Fig- ure 9 shows the high-level block diagram. The comparator limits the maximum size of a memory allocation request that the hardware heap manager can satisfy. The size class table chooses an appropri- ate free list for an incoming request depending on its request size. Typically, a memory allocation request accesses the size class table and retrieves an available memory block from the chosen hardware free list without any software involvement. The free list for each size class has head and tail pointers to orchestrate allocation and deallo- cation of memory blocks. The core uses the head pointer for push and pop requests, and the prefetcher pushes to the location of the tail pointer. Upon finding a miss in the hardware free list, control trans- fers to software to satisfy the memory allocation request. At the start of a program, upon finding a miss for an available memory block in the hardware free list, the runtime retrieves memory blocks from the software heap manager and places it into the appropriate hardware free list. A prefetcher ensures that the free lists stay populated with available memory blocks so that a request for memory allocation can hide the latency of software involvement whenever possible. We use a pointer-based prefetcher to prefetch the next available memory blocks from the software heap manager structure. 0204060801000-32 B32-64 B64-96 B96-128 B128-256 B256-512 B> 512 BCDF %Allocation SlabsWordPressDrupalMediaWiki050100150200250300Memory Usage (Bytes)ThousandsProgram Execution0-32 B32-64 B64-96 B96-128 B128-256 B256-512 B> 512 BSmaller Slabs05001000150020002500300035004000Memory Usage (Bytes)ThousandsProgram Execution0-32 B32-64 B64-96 B96-128 B128-256 B256-512 B> 512 BSmaller SlabsPrefetcherComparator>>Size Class TableSet zero flagFreelist513
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada D. Gope et al. in sequential (single-byte) string accelerators. Figure 10 shows a block diagram of some of the sub-blocks used within our accelerator design. ASCII compare uses combinational logic to find the presence of pattern characters within the subject string to populate a match- ing matrix. This operation is done in parallel, and can process as many characters per cycle as is supported by the table width and the width of subject string reads. Operations that require matching of multiple characters use AND gates of diagonal entries within the matching matrix to find the position of consecutive character matches. For example, Figure 10 shows the subject string ’babc’ doing a string_find for ’abc.’ String operations that require index calculation of pattern/character matches use a priority encoder to find the first instance of a valid match. Output logic forwards the updated ASCII value to the final output for string functions writing new characters to the result string. Shifting logic aligns the subject string to the correct address offset for string manipulation that re- quires (re-)writing a resulting string to memory. The datapath is organized in such a way that control logic determines the correct combination of sub-blocks enabled based on the incoming string operation. Multi-byte character sets (Unicode) can be handled by grouping the single-byte characters comparisons in the simplified ASCII example shown. Our accelerator design extracts concurrency by processing many bytes of the subject string in parallel. Our design is not limited to sequential text processing, and therefore can significantly increase string processing throughput. This is because each element in the matching matrix does not require any other element’s output for cor- rect calculation. Additionally, due to the commonalities between the string manipulation functions, our generalized accelerator supports many different operations with low overheads. Design considerations There are several important implementa- tion details that are required for correct operation. First, it is impor- tant to support wrap-around in string matching operation, since a match is possible between read text-block boundaries. We support this by buffering previous matching matrix values, and feeding them into the glue-logic sub-block. Second, since a few string functions such as string_to_upper and string_to_lower are dependent on a range of many ASCII characters, we allow 6 of our matching matrix rows to also support inequality comparisons, as opposed to exclu- sively equal comparisons. We also allow our pattern length (rows in the matching matrix) and size of subject string processed per cycle (columns in the matching matrix) to be configurable. Entries within the ASCII compare matrix that are unused during a given operation can be clock-gated to further reduce energy consumption. Coherence for writes to destination strings are handled by standard coherence mechanisms, while ordering of memory writes with respect to trail- ing loads is handled with a hardware interlock similar to a store queue in an out-of-order processor. Since PHP strings are of known length (rather than null-terminated), this coherence and consistency logic is straightforward to implement. 4.5 Regular Expression Accelerator Traditional regular expression (regexp) processing engines [8] are built around a character-at-a-time sequential processing model that Figure 10: String accelerator. Design considerations Since a major fraction of the requests attempt to retrieve at most 128 bytes, the hardware heap manager is restricted to serve requests that are at most 128 bytes in size. It uses only 8 memory allocation slabs to perform this, resulting in a very small, power-efficient hardware heap manager. Furthermore, the strong memory reuse means that in the common case it satisfies the requests from the hardware free list, resulting in very infrequent fall-backs to the software routine that handles any complex cases. Concurrent research work on accelerating memory allocation [48] eagerly updates the memory’s head pointer and linked list on all malloc and free requests to keep coherency with the memory’s data structure. However, due to the high memory reuse of these work- loads, we instead lazily update the memory’s heap manager data structure only on overflow or during context switches. This avoids the overheads of constantly updating memory to be coherent with the hardware table during periods that the heap manager accelera- tor is servicing the common case requests, while not causing any correctness errors or memory leaks. 4.4 String Accelerator In order to reduce the cost of searching, modifying, or otherwise processing text in PHP applications, we propose a generalized string accelerator. Although a significant portion of execution time comes from this category of functions, the execution time is spread out through numerous different string operations. These tasks include string finding, matching, replacing, trimming, comparing, etc. Pre- vious work, such as [68], propose methods for string matching in hardware. However, the hardware proposed processes a single char- acter every cycle, leaving large opportunities for parallelism and higher throughput. Prior designs also do not support the large variety of string operations we wish to accelerate. Proposed design We note that although string processing is an aggregation of many functions, their overall operation can be bro- ken down into common hardware sub-blocks. By sharing hardware resources, we propose a single string accelerator that can perform several of these operations, as opposed to creating a separate accel- erator for every string function. In addition to matching, our accelerator can substitute charac- ters, perform priority encoding of matches, and determine ranges of character types (useful for detecting lower case, upper case, alphanu- meric, etc. characters). We also design our accelerator to process multiple bytes per cycle to exploit concurrency that is not utilized babca0100b1010c00010010StringOpControl logicpatternSubject StringASCII CompareMatchingMatrix+Glue logicResult StringPriority encoder / Shifter Output logic514
分享到:
收藏