Sequential Optimization for Low Power Digital Design

by Aaron Paul Hurst

B.S. (Carnegie Mellon University) 2002
M.S. (Carnegie Mellon University) 2002

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Robert K. Brayton, Chair
Professor Andreas Kuehlmann
Professor Margaret Taylor

Spring 2008

Copyright © 2008 by Aaron Paul Hurst

Abstract

Sequential Optimization for Low Power Digital Design
by Aaron Paul Hurst
Doctor of Philosophy in Electrical Engineering and Computer Science
University of California, Berkeley
Professor Robert K. Brayton, Chair

The power consumed by digital integrated circuits has grown with increasing transistor density and system complexity. One particularly power-hungry design feature is the generation, distribution, and utilization of one or more synchronization signals (clocks). In many state-of-the-art designs, 30% to 50% of the total power is dissipated in the clock distribution network. In this work, we examine the application of sequential logic synthesis techniques to reduce the dynamic power consumption of the clocks. These optimizations are sequential because they alter the structural location, functionality, and/or timing of the synchronization elements (registers) in a circuit netlist. A secondary focus is on developing algorithms that scale well to large industrial designs. The first part of the work deals with the use of retiming to minimize the number of registers and therefore the capacitive load on the clock network.
We introduce a new formulation of the problem and then show how it can be extended to include necessary constraints on the worst-case timing and initializability of the resulting netlist. It is then demonstrated how retiming can be combined with the orthogonal technique of intentional clock skewing to minimize the combined capacitive load under a timing constraint.

The second part introduces a new technique for inserting clock gating logic, whereby a clock's propagation is conditionally blocked for subsets of the registers in the design that are not actively switching logic state. The conditions under which the clock is disabled are detected through the use of random simulation and Boolean satisfiability checking. This process is quite scalable and also offers the potential for additional logic simplification.

Professor Robert K. Brayton
Dissertation Committee Chair

Contents

Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 Low Power Digital Design
    1.1.1 Technological
    1.1.2 Commercial
    1.1.3 Environmental
  1.2 Sequential Optimization
    1.2.1 Retiming
    1.2.2 Clock Skew Scheduling
  1.3 Organization of this Dissertation

2 Unconstrained Min-Register Retiming
  2.1 Problem
    2.1.1 Motivation
  2.2 Previous Work
    2.2.1 LP Formulation
    2.2.2 Min-Cost Network Circulation Formulation
  2.3 Algorithm
    2.3.1 Definitions
    2.3.2 Single Frame
    2.3.3 Multiple Frames
  2.4 Analysis
    2.4.1 Proof
    2.4.2 Complexity
    2.4.3 Limitations
  2.5 Experimental Results
    2.5.1 Setup
    2.5.2 Runtime
    2.5.3 Characteristics
    2.5.4 Large Artificial Benchmarks
  2.6 Summary

3 Timing-Constrained Min-Register Retiming
  3.1 Problem
  3.2 Previous Work
    3.2.1 LP Formulation
    3.2.2 Minaret
  3.3 Algorithm
    3.3.1 Single Frame
    3.3.2 Multiple Frames
    3.3.3 Examples
  3.4 Analysis
  3.5 Proof
    3.5.1 Complexity
  3.6 Experimental Results
    3.6.1 Runtime
    3.6.2 Characteristics
  3.7 Summary

4 Guaranteed Initializability Min-Register Retiming
  4.1 Problem
  4.2 Previous Work
    4.2.1 Initial State Computation
    4.2.2 Constraining Retiming
  4.3 Algorithm
    4.3.1 Feasibility Constraints
    4.3.2 Incremental Bias
  4.4 Analysis
    4.4.1 Proof
    4.4.2 Complexity
  4.5 Experimental Results
  4.6 Summary

5 Min-Cost Combined Retiming and Skewing
  5.1 Problem
    5.1.1 Motivation
    5.1.2 Definitions
  5.2 Previous Work
  5.3 Algorithm: Exact
  5.4 Algorithm: Heuristic
    5.4.1 Incremental Retiming
    5.4.2 Overview
  5.5 Experimental Results
  5.6 Summary

6 Clock Gating
  6.1 Problem
    6.1.1 Motivation
    6.1.2 Implementation
  6.2 Previous Work
    6.2.1 Structural Analysis
    6.2.2 Symbolic Analysis
    6.2.3 RTL Analysis
    6.2.4 ODC-Based Gating
  6.3 Algorithm
    6.3.1 Definitions
    6.3.2 Power Model
    6.3.3 Overview
    6.3.4 Literal Collection
    6.3.5 Candidate Pruning
    6.3.6 Candidate Proof
    6.3.7 Candidate Grouping
    6.3.8 Covering
  6.4 Circuit Minimization
  6.5 Experimental Results
    6.5.1 Setup
    6.5.2 Structural Analysis
    6.5.3 Power Savings
    6.5.4 Circuit Minimization
  6.6 Summary

7 Conclusion
  7.1 Minimizing Total Clock Capacitance
  7.2 Minimizing Effective Clock Switching Frequency

Bibliography

A Benchmark Characteristics

List of Figures

1.1 Tradeoff of performance and power.
1.2 Cost of IC cooling system technologies.
1.3 Overview of US power consumption. [1]
1.4 Forward and backward retiming moves.
1.5 A circuit and its corresponding retiming graph.
1.6 Retiming to improve worst-case path length.
1.7 Retiming to reduce the number of registers.
1.8 Intentional clock skewing.
2.1 The elimination of clock endpoints also reduces the number of distributive elements required.
2.2 A scan chain for manufacturing test.
2.3 A three bit binary counter with enable.
2.4 An example circuit requiring unit backward flow.
2.5 An example circuit requiring multiple backward flow.
2.6 Fan-out sharing in flow graph.
2.7 The illegal retiming regions induced by the primary input/outputs.
2.8 The corresponding flow problem for a combinational network.
2.9 Flow chart of min-register retiming over multiple frames.
2.10 A cut in the unrolled circuit.
2.11 Retiming cut composition.
2.12 The runtime of flow-based retiming vs. CS2 and MCF for the largest designs.
2.13 The runtime of flow-based retiming vs. CS2 and MCF for the medium designs.
2.14 The distribution of design size vs. total number of iterations in the forward and backward directions.
2.15 The percentage of register savings contributed by each direction / iteration.
3.1 Bounding timing paths using ASAP and ALAP positions.
3.2 The computation of conservative long path timing constraints.
3.3 The implementation of conservative timing constraints.
3.4 The computation of exact long path timing constraints.
3.5 The implementation of exact long path timing constraints.
3.6 An example of timing-constrained min-register forward retiming.
3.7 An example of timing-constrained min-register retiming on a critical cycle.
3.8 Average fraction of conservative nodes refined in each iteration.
3.9 Registers in over-constrained cut vs under-constrained cut over time relative to final solution.
3.10 Registers after min-reg retiming vs. max delay constraint for selected designs.
4.1 A circuit with eight registers and their initial states.
4.2 Computing the initial states after a forward retiming move.
4.3 Computing the initial states after a backward retiming move.
4.4 Binary search for variables in feasibility constraint.
4.5 Feasibility bias structure.
5.1 Costs of moving register boundary with retiming and skew on different topologies.
5.2 Overall progression of retiming exploration.
5.3 Dynamic power of two designs over course of optimization.
6.1 Clock gating circuits.
6.2 Opportunities for structural gating.
6.3 Non-structural gating.
6.4 Unknown relationship between BDDs and post-synthesis logic.
6.5 Timing constraints based upon usage.
6.6 Distance constraints.
6.7 Proving candidate function.
6.8 Heuristic candidate grouping.
6.9 ODC-Based Circuit Simplification after Gating.
6.10 Four-cut for structural check.
List of Tables

1.1 Power consumption of performance-oriented NVIDIA GPUs in 2004 and 2008 [2].
2.1 Worst-case runtimes of various min-cost network flow algorithms.
2.2 Worst-case runtimes of selected maximum network flow algorithms [3].
2.3 Unconstrained min-reg runtime, LGsynth benchmarks.
2.4 Unconstrained min-reg runtime, QUIP benchmarks.
2.5 Unconstrained min-reg runtime, OpenCores benchmarks.
2.6 Unconstrained min-reg runtime, Intel benchmarks.
2.7 Unconstrained min-reg characteristics, LGsynth benchmarks w/ improv.
2.8 Unconstrained min-reg characteristics, QUIP benchmarks.
2.9 Unconstrained min-reg characteristics, Intel benchmarks.
2.10 Unconstrained min-reg characteristics, OpenCores benchmarks.
2.11 Unconstrained min-reg runtime, large artificial benchmarks.
3.1 Delay-constrained min-reg runtime vs. Minaret.
3.2 Period-constrained min-reg characteristics, LGsynth benchmarks.
3.3 Period-constrained min-reg characteristics, OpenCores benchmarks.
3.4 Period-constrained min-reg characteristics, QUIP benchmarks.
3.5 Min-delay-constrained min-reg characteristics, LGsynth benchmarks.
3.6 Min-delay-constrained min-reg characteristics, OpenCores benchmarks.
3.7 Min-delay-constrained min-reg characteristics, QUIP benchmarks.
4.1 Guaranteed-initializability retiming applied to benchmarks.
5.1 Runtime and quality of exact and heuristic approaches.
5.2 Power-driven combined retiming/skew optimization.
5.3 Area-driven combined retiming/skew optimization.
5.4 Results summary.
6.1 Structural clock gating results.
6.2 New clock gating results.
6.3 ODC-based simplification results.
A.1 Benchmark Characteristics: LGsynth
A.2 Benchmark Characteristics: QUIP
A.3 Benchmark Characteristics: OpenCores
A.4 Benchmark Characteristics: Intel

Acknowledgements

Prof. Robert Brayton has my infinite gratitude for making the last five years enjoyable and educational and for allowing my graduate school experience to exceed my expectations. There was never a thought that he was unwilling to explore, and I thank him for the intellectual freedom to walk down so many paths and the experienced guidance on every one of them. His impact on his students and the field as a whole is immeasurable.

I'd like to thank Prof. Andreas Kuehlmann for his support in so many different and varied ways: as an instructor (twice), for a GSI experience, for an internship (twice), as a committee member (on all of my preliminary, qualifying, and dissertation committees), and as a manager. His possession of both detailed insight and broad vision is a rare combination.

Alan Mishchenko has been an absolute joy to interact with, and I thank him for all of his effort on paper writing, in code, and in ideas. I can only aspire to a fraction of his perpetual enthusiasm for new ideas. It was a conversation in his car that sparked my interest in pursuing flow-based retiming, and he deserves credit for much of it.

Christoph Albrecht has been a wonderful collaborator, coworker, and mentor throughout this work.
His thoughtfulness and careful precision are unsurpassed and have challenged me in many ways. His expertise in sequential optimization has also contributed much to this work.

Philip Chong was a mentor for my EE219B project, my first foray into the area of sequential optimization. He was great to collaborate with on that and other projects, including my summer work on clock skewing under variation and OpenAccess Gear.

I would like to thank Prof. Andrew Neureuther for the feedback during my qualifying exam, and Prof. Margaret Taylor for being on my committee and supporting this small piece of work on reducing unnecessary power usage. It is but a small step in our larger pursuit of better energy policy.

My summer internships were an invaluable piece of my education, and I'd like to thank everyone who gave me a taste of the world outside of academia. I thank Premal Buch and C. Van Eijk at Magma Design Automation for enriching my first summer, everyone at Cadence Research Labs, and Peter Hazewindus, Ken McElvain, and Bing Tian at Synplicity for making my two hour commute absolutely worthwhile. Thank you to Katharina GroteSchwinges, Lydia Probst, and Miteinander for an unforgettable summer spent pursuing interests outside of engineering.

Bob's other students, past and present, will hopefully remain lifelong collaborators, compatriots, and friends. I've enjoyed my interaction with Fan Mo, Yinghua Li, William Jiang, Zile Wei, and Sungmin Cho and would like to especially thank Shauki Elisaad, Satrajit Chatterjee, and Mike Case. I will come to miss our late Friday meetings. Donald Chai and Nathan Kitchen also deserve thanks for ideas, feedback, and enjoyable trips out of town. Thank you Arthur Quiring, Martin Barke, and Sayak Ray for the efforts on our joint projects on clock gating.
My studies were supported through the generous contributions of the State of California MICRO program, the Center for Circuits and Systems Solutions (C2S2), and our industrial collaborators Actel, Altera, Calypto, Intel, Magma, Synopsys, Synplicity, and Xilinx. I will strive to repay their far-sighted investment in the educational system.

Without a constant flow of caffeine, I'd have been a walking zombie for the last few years. Perhaps more importantly, the coffee shops of Berkeley gave me a place to escape to work and a truly comfortable third space. Thank you (in no particular order) to the employees and owners of Cafe Strada, Milano, Spasso, Roma, Nomad, Jumpin' Java, A Cuppa Team, Bittersweet, and Peet's.

My friends and roommates have been an integral part of the last five years and deserve credit for keeping me sane: thank you Andrew Main, Tim DeBenedictis, Ryan Huebsch, Bryan Vodden, Josh Walstrom, Jay Kuo, Jimmy Tiehm, William Ma, Jen Archuleta, Steve Ulrich, Andrew MacBride, Adrian Rivera, Luis di Silva, and Simon Goldsmith.

Thank you Chris for tolerating the late hours and all the sacrifices that were made in the name of completing this dissertation. Your support has meant the world to me. I will do everything in my power to return the favor when your turn comes!

A life-long thank you is owed to my family for the unconditional support and love through all of these years. I must have been destined to be an electrical engineer from the weekends spent filling breadboards in my father's electronics lab: to this day, I still remember the function of a 74LS90 (it's a 4-bit decimal counter). Perhaps the greatest credit is due my parents for instilling in me a love of science and thought that has propelled me this far.

Chapter 1

Introduction

This dissertation is a study of how automatic digital circuit design techniques that manipulate the sequential components can be used to minimize the power consumption of integrated circuit devices.
In the course of this study, several new techniques are introduced to enhance the potential for power reduction and are characterized on a set of benchmark designs. Before moving to the main part of the work, we begin with an introduction to and some background in the two facets of the subject: Low Power Digital Design, discussed in Section 1.1, and Sequential Optimization, discussed in Section 1.2. It is assumed throughout that the reader has some familiarity and comfort with mathematical and algorithmic notation, the vocabulary and terminology of digital design, and computer science, especially with regard to complexity theory.

1.1 Low Power Digital Design

The power consumption of CMOS integrated circuits (ICs) has remained at most a secondary concern for most of their history. While low power devices and design technologies have been in existence for decades, it is really only the last ten years that have seen the promotion of low power from a niche or secondary issue to a critical concern in digital design. The convergence of technological, market, and societal forces has brought this issue to the forefront. As a broad motivation for this work, we examine in detail why power consumption is such an important issue at present. Section 1.1.1 examines the technological changes that lie behind skyrocketing power densities and increasing per-die consumption. Section 1.1.2 discusses the commercial applications and drivers behind the push for lower power technology. Finally, on a macroscopic level, Section 1.1.3 examines the environmental ramifications of these technological trends.

1.1.1 Technological

As in many other aspects of integrated circuit technology, the fundamental driver of the changing role of power is continued semiconductor device scaling. The ever-shrinking size of each transistor results in ever-increasing power consumption through the consequent increases in speed, density, and parasitic device behavior.
The total power consumed by a digital design can be decomposed into two main components: dynamic and static. The static component of power is that which is consumed by a device regardless of its operational behavior; this includes the case when all transistors are quiescent. In a modern CMOS design, static power includes transistor gate leakage but is dominated by the sub-threshold leakage: the flow of current that passes through the transistor stack from the supply to ground due to the gate voltage being insufficiently above/below the threshold and the transistors being incompletely switched off. The sub-threshold leakage current scales exponentially with the threshold voltage Vth of the device. Smaller transistors and smaller supply voltages have driven this parameter downward and in a short time brought the resulting leakage power from near zero to a real concern.

The dynamic component of the power is the energy that is dissipated per unit time due to the switching of transistors. In a CMOS circuit, this is primarily composed of two components: the short-circuit current and the capacitive switching. In a well-balanced cell library, the short-circuit current is a small fraction of the total; generally, the capacitive switching dominates. As the capacitive elements in a circuit (e.g. the nets, transistor gates, and internal capacitances) switch logic state and charge from a low to high voltage, a quantity of energy is required to effect the transition. The power required is a function of the capacitance to be charged, the switching frequency, and the rail-to-rail voltage. This is expressed by Equation 1.1, where f is the transition frequency, Vdd the supply voltage, and C the switched capacitance.

    P = (1/2) C Vdd^2 f    (1.1)

The focus in this work is on the dynamic power dissipated during capacitive switching. This presents three variables with the potential to be optimized, all of which are affected by synthesis choices.
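Equation 1.1, P = (1/2) C Vdd^2 f, can be illustrated with a short numeric sketch. The capacitance, supply voltage, and frequency values below are hypothetical, chosen only to show the scale of the result; they are not taken from any design discussed in this dissertation.

```python
def dynamic_power(c_farads, vdd_volts, f_hz):
    """Dynamic switching power, P = 1/2 * C * Vdd^2 * f (Equation 1.1)."""
    return 0.5 * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical clock network: 1 nF of total switched capacitance,
# a 1.2 V supply, toggling at 1 GHz.
p_clk = dynamic_power(1e-9, 1.2, 1e9)
print(f"{p_clk:.2f} W")  # 0.72 W
```

Note that P is linear in C and f but quadratic in Vdd: halving the switched clock capacitance halves the clock's dynamic power contribution directly, which is the motivation for the register-minimization techniques in the chapters that follow.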
While the supply voltage can be increased to improve performance (at the expense of power), the challenges to its further decrease are substantial: the largest being the maintenance of the relationship Vth ≈ (1/2)Vdd and the increase in static power dissipation that results from pushing Vth any lower. Although useful, we do not wish to consider the tradeoff between dynamic and static power at this time and instead turn to methods that accomplish a straightforward reduction.

Chapters 2 through 4 discuss techniques for reducing the switched capacitance on the clock, Cclk. In a typical design, the clock network possesses both the single largest total capacitive sink and the greatest switching frequency: its share of the power is accordingly large, often in the range of 30% to 50% of the total. The clock network therefore presents an important and attractive target for power optimization techniques. Chapter 6 introduces a new algorithm for reducing the average frequency with which the clock must be switched, fclk, for particular subsets of the network. This is completely compatible with the above methods.

1.1.2 Commercial

The market forces that lie behind the drive to reduce integrated circuit power requirements are driven not by the quantity and cost of the energy itself (though this will be examined in the next section) so much as by the consequences of the power usage on device functionality and value. Unlike the cost of the energy use itself, these pressures are felt more directly by the manufacturers of the integrated circuits, and it is these manufacturers who are the consumers of design technology such as is the subject of this research. There are many possible channels through which power affects the functionality and competitive value of a particular digital device, and we examine two of them in more detail now.

Consider the set of all digital devices characterized jointly by performance (measured via clock frequency, computational operations per second, etc.)
and total power consumption. As illustrated in Figure 1.1, there is a direct trade-off between these two characteristics through the frequency-power relationship of Equation 1.1, but also through other design choices and variables. Given the current state-of-the-art design technology, we can then establish a maximum performance-power frontier, as illustrated by the curve in Figure 1.1. This curve aids in differentiating two broad market segments of interest.

Figure 1.1. Tradeoff of performance and power.

High-Performance Systems

One segment lies to the right of the graph and could be labeled high-performance systems. For our purposes, this includes scientific and super-computers, information servers, networking equipment, and personal computers: any system whose market value is driven primarily or in part by its performance. For this type of design, the power consumption is an issue not because of its effect on value but because of the limitations it presents in the pursuit of continued performance improvement. Because of cost or application requirements, these designs eventually face such a limitation in the form of thermal constraints. A hypothetical barrier Pmax^thermal is depicted on the graph.

The thermal constraints arise from the fact that the energy consumed (through the mechanisms described in the previous section) ends up almost entirely as waste heat. At high enough rates of energy consumption, the accumulation of waste heat surpasses the ability of the integrated circuit's environment to passively dissipate it. The resulting temperature increase can quickly disrupt or even permanently damage the device. This necessitates the inclusion of heat-dissipation systems, from passive heat-sinks to active air-flow control and air-conditioning and eventually to liquid cooling. However, the cost of these options does not scale well with increased capacity and presents an economic limitation on chasing increased computational performance.
Beyond 35-40W, the cost of additional capacity is approximately $1 per watt [4]. An overview of the capacity and cost of various cooling technologies is outlined in Figure 1.2. The high supply currents also add additional cost to the power delivery and regulation systems. This can be especially costly in large installations with multiple computers.

Figure 1.2. Cost of IC cooling system technologies.

Portable Systems

At the opposite end of the performance-power curve lies a market segment identifiable as portable systems. These are the devices that depend on mobile power sources (e.g. batteries) and include phones, music and media players, hand-held computers and game systems, and remote sensors and monitoring devices. Here, the power consumption affects value through the cost and weight of the energy storage necessary to meet the minimal functionality, as well as through the single-charge operating lifetime. This hypothetical barrier is depicted as fmin.

Energy storage has become an issue because of the divergence between the quantity of energy that can be stored per unit weight and the quantity consumed by increasingly power-hungry devices. Improvements in battery technology and the amount of energy that can be stored have lagged significantly behind increases in the rate at which it is consumed. Whereas transistor density has increased exponentially (a la Moore's law), battery energy density has only improved about 6% per year [5]. The next-generation technologies (e.g. fuel cells) are not yet close to productization. Any decrease in power consumption can therefore be translated into either increased single-charge lifetime or reduced battery costs.

Both high-performance and portable system applications are facing immediate constraints imposed by the power consumption of digital integrated circuits. This work has industrial applications for both market segments.
1.1.3 Environmental

Beyond the market pressures that are driving low power IC technology, there are strong environmental reasons to strive for minimizing the energy consumed in digital devices. While the power required to charge a 0.2 femtofarad transistor gate input to 1.5V is trivially small (0.2 fJ, approximately the energy needed to lift a grain of chalk dust a few centimeters), the combined frequency, transistor density, and pervasiveness of digital devices totals to a substantial rate of energy use. Furthermore, as each of these quantities continues to grow, so does the energy used. In a carbon-based energy economy, this exploding growth in electricity consumption represents a dangerous proposition for atmospheric health. Consider the 2001 total U.S. annual energy usage, broken down by sector in the left side of Figure 1.3. Within the residential component, retail electric power (and the accompanying loss through distribution) accounted for 70% of the total, and within this 70%, 67% was used for appliances (excluding air conditioners, water heaters, and household heating). While large mechanical appliances (and especially refrigerators) make up the bulk of this total, the home office and entertainment devices that are wholly digital or digitally-centric contribute a 10% share. This corresponds to 82 billion kilowatt-hours per year, or an average 9,360 megawatts of continuous usage. The relative contribution of digital devices in the commercial segment is even higher. While non-digital uses for energy still represent the most substantial target for energy-efficiency technology and conservation efforts, this is rapidly changing. The power usage (both per capita and total) of many large appliances has actually been shrinking in recent years due to continuing improvements in efficiency and an effective campaign to replace older models with newer energy-saving versions.
Unfortunately, this trend does not extend to digital-centric devices: several of the last few years have seen double-digit growth in their combined power draw. Again, this can likely be attributed to both their increased proliferation and their increasing per-device energy consumption. This represents an increasingly large cause for concern.

Figure 1.3. Overview of US power consumption. [1]

While this type of top-down analysis illustrates the total energy used by digital devices, it is difficult to isolate the exact contribution of integrated circuit power consumption to the total. Even within a personal computer, a substantial fraction of the power goes towards the power supply, cooling system, and mechanical disks. To make the case that individual integrated circuits consume a non-trivial fraction of the total energy output, we examine a case built from the bottom up. As an example, consider the latest GPU (graphics processing unit) offerings from NVIDIA, Inc. This one company represents a tiny fraction of the integrated circuit industry, though its products do find themselves in a sizeable number of personal computers. Based on the figures from [6], approximately 115.9 million desktop-based GPUs were sold in 2007.

GPU                 Year  Market Positioning              Idle Power (W)  Peak Power (W)
Geforce 5900 Ultra  2004  Consumer, Performance-Oriented  26.8            59.2
Geforce 8800 GTX    2008  Consumer, Performance-Oriented  46.4            131.5

Table 1.1. Power consumption of performance-oriented NVIDIA GPUs in 2004 and 2008 [2].

The power consumption of one of the latest performance desktop products, the Geforce 8800 GTX, is presented in Table 1.1. Note that the idle power (when neither the GPU nor the computer are performing any computation) is 46.4W. While it is difficult to estimate typical usage patterns, it is not unreasonable to assume that a significant fraction of the host machines are on at any given time and wasting this power.
If 115.9 million Geforce 8800 GTX units are on (and idle), the combined power draw would total 5,380 MW. While the capacity of any given generating station may vary dramatically, a typical output of a coal-based electric generator is roughly 1,000 MW. Approximately five coal plants are required to supply the energy that is wasted in this scenario. As the average pollution rate for a coal-fired electricity station was 2.095 pounds CO2 per kilowatt-hour in 1999 [7], the resulting CO2 released would amount to 49.3 million tons in one year. Not all of NVIDIA's graphics products consume as much electricity as their performance products, but the trend towards increased energy usage is unmistakable. Table 1.1 also lists the power consumed by a component with identical market positioning just four years earlier. Within this span of time, the idle power dissipation has increased 1.7x and the peak power 2.2x! This trend has been continuing for a long time (even though power was not enough of a concern to have been widely characterized for desktop components in earlier times) and is likely to continue into the future. An unfortunate consequence of the relationship between power and performance (as depicted in Figure 1.1) is that any improvements in digital power-efficiency are likely to be traded for increased computational performance in the class of speed-driven devices that consume the bulk of the total IC power draw. While this is an unmitigated good for the future utility of computation and the many benefits that it brings, it utterly fails to address the problem of the resulting energy use. As with most viable attempts to address the problem of energy sustainability, it is likely that new technology will have to be coupled with fresh approaches to policy to achieve the needed results.
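The aggregate figures above follow from simple arithmetic. As a sketch, the scenario can be reproduced using only the numbers quoted in this section (unit counts from [6], power figures from Table 1.1, emissions rate from [7]):

```python
# Back-of-the-envelope check of the idle-power scenario described above,
# using only figures quoted in this section.
units = 115.9e6   # desktop GPUs sold in 2007 [6]
idle_w = 46.4     # Geforce 8800 GTX idle power in watts (Table 1.1)

total_mw = units * idle_w / 1e6
print(round(total_mw))        # ~5378 MW, matching the quoted 5,380 MW

plant_mw = 1000.0             # typical coal-plant output assumed in the text
print(total_mw / plant_mw)    # ~5.4 plants worth of generation

kwh_per_year = total_mw * 1000 * 24 * 365
lbs_co2 = kwh_per_year * 2.095          # lb CO2 per kWh in 1999 [7]
print(lbs_co2 / 2000 / 1e6)             # ~49.3 million (short) tons per year
```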
1.2 Sequential Optimization

The collection of low power technologies discussed in this work falls into the class of logic synthesis techniques broadly known as sequential optimization. In synchronous digital designs, the correct temporal behavior of the system is achieved through the insertion of synchronization logic. This usually consists of state storage elements known as registers that are driven by one or more clock signals. This is the overwhelmingly dominant paradigm for current digital design. It is the manipulation of these elements on which we focus. In contrast, combinational optimization represents a variety of logic synthesis techniques that treat certain aspects of the circuit behavior as invariants. Though the logic implementation can be dramatically altered, the function implemented at the inputs of every register is exactly preserved. For timing-driven combinational optimizations, it is also assumed that the timing relationships at and between the registers are fixed. It is exactly the relaxation of these two assumptions that is considered in this work. We also explicitly consider the clock network and its accordant power consumption; the mechanics of the clock distribution are outside the scope of combinational logic synthesis. The rest of this section gives a general overview of two of the sequential optimization techniques that are central to several of the chapters in this dissertation: retiming [8] and clock skew scheduling [9]. If additional background is necessary, we refer the reader to the original works; these are complete and still very relevant sources for understanding the motivations for and details of the transformations. The other general technique, clock gating, is introduced and motivated in Chapter 6.

1.2.1 Retiming

Retiming is a method for relocating the structural positions of the registers in a design such that the output functionality is preserved. First proposed in [8], it has been utilized for two decades.
Implementations of retiming are found in all of the major commercial logic synthesis tools in both the ASIC and FPGA markets.

Figure 1.4. Forward and backward retiming moves.

The retiming transformation can be most easily understood as the repeated application of a set of simple moves. If every direct output of a combinational gate is a register, these registers can be removed and one inserted on every input of the node. Correspondingly, if every direct input of a node is a register, these can be removed and a register inserted on every output. In this manner sequential elements can be "pushed across" combinational ones. This is illustrated in Figure 1.4. Every valid retiming transformation can be decomposed into a sequence of these incremental moves. The work of [8] describes an elegant method of capturing any legal retiming without explicitly enumerating a sequence of moves; we review this now.

A retiming graph G is defined as follows. Let G = <V, E, wi> be a directed graph with edge weights. The vertices correspond to the combinational elements and external connections in a circuit and the edges E ⊆ V × V to the dependencies between them. (Hereafter, edges are interchangeably referred to by their endpoints or their label: for example, e ≡ (u, v).) Each edge e represents a path between two combinational elements or primary IOs through zero or more sequential elements. The number of registers present on each edge is captured by wi(e) : E → Z. wi is the initial register weight or sequential latency. In the timing-constrained version of the problem, each combinational element has an associated worst-case delay, d(v) : V → R.

Figure 1.5. A circuit and its corresponding retiming graph.

A circuit and its corresponding retiming graph are depicted in Figure 1.5. The sequential elements have been removed. Note that there are two edges g2 → g4 because there are two paths with different sequential latencies.
The problem of retiming consists of generating a new graph G′ by altering only the number of registers on each edge w(e). The retiming transformation can be completely described by a retiming lag function r(v) : V → Z. The lag function describes the number of registers that are to be moved backwards over each combinational node. After the registers have been relocated, the number present on each edge, wr, is given by Equation 1.2.

wr(u, v) = wi(u, v) − r(u) + r(v)    (1.2)

There may be restrictions imposed on the lag function. For the retimed circuit to be physical, the final register count wr(e) must be non-negative for every edge. This imposes a constraint on the lag function of the form of Equation 1.3.

r(u) − r(v) ≤ wi(u, v)    (1.3)

It is typically also desirable to fix the lags of all external inputs and outputs to zero; this prevents desynchronization with the environment. Typically, other constraints are imposed upon the selection of this function and an objective is defined. Common objectives include minimizing the register count, minimizing the worst-case combinational path delay, or either with a constraint imposed on the other. The additional details of these problems will be discussed in later chapters. Examples of how retiming can be applied to improve either the worst-case delay or the number of registers are illustrated in Figures 1.6 and 1.7, respectively. The green combinational gates are labelled with their worst-case delays D. The optimum retiming moves (for either delay or register count) are illustrated using magenta arcs to the new silhouetted locations. The corresponding retiming lags r(v) are also indicated on the combinational gates (where the value is non-zero). The retiming problem is an instance of integer linear programming (ILP). In most cases, the structure can be used with specialized solvers to attack the specific problem more efficiently.
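As an illustration of Equations 1.2 and 1.3, the following sketch applies a lag function to a small invented retiming graph (a gate g with one fanin and two registered fanouts; the graph and names are made up for this example) and checks legality:

```python
# Sketch of Equations 1.2 and 1.3: apply a lag function r to initial edge
# weights w_i and check that the result is a physical (non-negative) retiming.
# The tiny fanout graph below is an invented example, not one from the text.

def retime(wi, r):
    """w_r(u, v) = w_i(u, v) - r(u) + r(v) for every edge (Equation 1.2)."""
    return {(u, v): w - r[u] + r[v] for (u, v), w in wi.items()}

def is_legal(wr):
    """Physical iff every retimed edge weight is non-negative (Equation 1.3)."""
    return all(w >= 0 for w in wr.values())

# Gate g drives b and c, with one register on each fanout branch.
wi = {("a", "g"): 0, ("g", "b"): 1, ("g", "c"): 1}  # 2 registers initially
r = {"a": 0, "g": 1, "b": 0, "c": 0}  # lag 1: move registers backward over g

wr = retime(wi, r)
print(sum(wi.values()), "->", sum(wr.values()), is_legal(wr))  # 2 -> 1 True
```

A backward move over the fanout gate merges the two registers into one on its input, which is exactly the register-sharing effect that min-register retiming exploits.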
If the network structure of the problem is maintained, the worst-case bound is polynomial (and strongly polynomial for the classes of problems that we will be examining). This will be examined more closely in Chapter 2. However, we will also see multiple subproblems that require the solution of a mixed-integer linear program (MILP). The MILP variant is NP-hard.

Figure 1.6. Retiming to improve worst-case path length.

Figure 1.7. Retiming to reduce the number of registers.

Multiple clocks Modern synchronous designs may utilize anywhere between one and hundreds of clocks. The complicating problem is that registers with different clock signals cannot be merged into a single register in either the final solution or any of the intermediate points over which the registers must be retimed. This reality is often not explicitly addressed in works on retiming, but there is a relatively straightforward solution. The registers can be partitioned into clock domains, and the domain boundaries never crossed. A similar partitioning is also necessary for differences in any other asynchronous control signals. Though all of our example circuits have a single clock and reset, we assume that this method would be applied for multiple-clocked designs.

Advantages As a sequential optimization, the advantages of retiming are several. The first is relative ease of computation: the runtime of the optimization algorithm grows with the size of the circuit with a low-degree polynomial asymptote. This is better than many combinational synthesis problems, let alone sequential ones, many of which lie in the PSPACE class. Despite being easy to compute, retiming often holds significant potential to improve the desired objective. In the case of performance-oriented optimization, the resulting improvement in speed can be quite significant.
In [10], a profiling of several industrial circuits indicated that the worst-case average cycle length was significantly shorter than the single worst-case path length. Through some combination of misbalanced sequential partitioning of the original designs and the inability of the design tools to perfectly balance the path lengths, there remains significant potential to reposition the registers and balance slack. Retiming can also be applied to reduce power consumption. This will be examined in more detail in Chapters 2 through 5.

Challenges The primary challenges involved with retiming a design arise from the re-encoding of the state machine. The values (and number of bits) stored in the retimed registers at each cycle do not correspond with those in the original design. First and foremost, this complicates the formal verification of the netlist revision. The problem is no longer one of combinational equivalence checking at the register inputs and primary outputs, though it is still quite tractable due to the maintenance of other equivalent points. However, if the retiming is interleaved with additional resynthesis, it immediately becomes very difficult to resolve even potential locations for equivalence between the original and transformed circuits. Until very recently, there were no commercial verification tools that could overcome this problem in a completely satisfactory manner, and in industrial practice, no change is generally allowable unless it can be verified. While still a difficulty, advances in sequential verification have brought retiming into the realm of verifiable optimizations. The burden due to state re-encoding is not only borne by the automated tools but by the human designer as well. The latched state values are often the primary points for debugging a simulated version of the design and the only points for debugging a silicon device.
The translation of these values to and from the original specification requires additional tools and/or effort.

1.2.2 Clock Skew Scheduling

Clock skew scheduling [9] offers a technique for balancing the computation across sequential elements by applying different non-zero delays on the clock inputs of each register. To differentiate it from the unwanted version, this is often called intentional skewing. The latest arrival times at the latches in a single design may vary considerably. This imbalance may come as a result of timing misprediction in the design flow or because of a fundamental imbalance in the sequential partitioning of a design. Since the latches in a single clock domain must all operate at the same frequency, performance is limited by the slowest delay path, even if the others could operate at a higher speed. With the delay balancing of skew scheduling, the timing of the design is no longer limited by the single worst-case path but by the maximum average delay around any loop of register-to-register path segments. The insertion of intentional clock skew is illustrated in Figure 1.8.

Figure 1.8. Intentional clock skewing.

The register-to-register timing paths are labeled with a maximum delay D and a minimum delay d. The arrival of the clock at register R2 is intentionally delayed by τ(R2) time. There is assumed to be clock insertion delay along all of the paths, and the τ value represents the deviation from the nominal. We will see that only the relative values of τ matter; the choice of the nominal value is arbitrary. The re-balancing of the timing criticality can then be observed. While the insertion of an intentional skew of delay τ(R2) > 0 delays the arrival of the clock at R2 and increases the allowable delay D(R1, R2) along the longest path R1 → R2 (before the setup timing of register R2 is violated), the permissible worst-case delay D(R2, R3) along the path R2 → R3 is correspondingly decreased.
Any timing slack added to the incoming paths of R2 is directly borrowed from the outgoing paths. The opposite effect occurs for the shortest paths and hold timing constraints. The problem of computing an optimal clock skew schedule can be formulated as a continuous linear program. The objective is to minimize the clock period by choosing a set of per-register skews, τ(r), subject to the linear constraints arising from setup and hold requirements along each register-to-register timing path. The setup and hold constraints are Equations 1.4 and 1.5, respectively. D(u, v) is the maximum path delay along u → v, d(u, v) is the minimum path delay along u → v, and T is the clock period.

Su→v :  D(u, v) ≤ T − τ(u) + τ(v)    (1.4)
Hu→v :  d(u, v) ≥ τ(v) − τ(u)    (1.5)

The final problem can be solved using a general approach to linear programming (e.g. simplex, interior-point methods, etc.). While simplex is exponential in the size of the problem in the worst case, it does not generally perform so poorly for practical problems. Our experience is that it is slow but tractable for solving retiming problems. In any case, weakly polynomial alternatives exist [11] [12]. This minimization also corresponds to the determination of the maximum mean distance around any cycle in the register-to-register timing graph. There exist several algorithms [13] for solving the maximum-mean-cycle-time problem, several of which are quite efficient in practice. Our experience agrees with the observation that Howard's algorithm is the most efficient, though Burns' method is also useful for incremental analysis and computations performed directly on the circuit structure.

Advantages In contrast to retiming, clock skew scheduling possesses the desirable feature of preserving circuit structure and functionality. The verification and testing issues that plague retiming do not apply to clock skew scheduling.
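To make the maximum-mean-cycle computation concrete, here is a compact sketch using Karp-style dynamic programming. (The Howard's and Burns' algorithms mentioned above are more efficient in practice; this variant, and the toy delay graph, are chosen here only for brevity.)

```python
# Maximum mean cycle weight of a small register-to-register delay graph,
# computed by a Karp-style recurrence. The graph is an invented example.
NEG = float("-inf")

def max_mean_cycle(n, edges):
    """edges: list of (u, v, delay). Returns max over cycles of delay/length."""
    # d[k][v] = maximum total weight of any k-edge walk ending at vertex v
    d = [[NEG] * n for _ in range(n + 1)]
    for v in range(n):
        d[0][v] = 0.0
    for k in range(1, n + 1):
        for (u, v, w) in edges:
            if d[k - 1][u] > NEG:
                d[k][v] = max(d[k][v], d[k - 1][u] + w)
    best = NEG
    for v in range(n):
        if d[n][v] == NEG:
            continue
        # Karp's characterization: min over k of (d_n - d_k)/(n - k), max over v
        worst = min((d[n][v] - d[k][v]) / (n - k)
                    for k in range(n) if d[k][v] > NEG)
        best = max(best, worst)
    return best

# Two loops: 0->1->0 with delays 8+2 (mean 5) and 1->2->1 with 3+1 (mean 2).
edges = [(0, 1, 8.0), (1, 0, 2.0), (1, 2, 3.0), (2, 1, 1.0)]
print(max_mean_cycle(3, edges))  # 5.0
```

The result, 5.0, is the minimum clock period achievable with optimal skews on this toy graph: the loop 0↔1 has mean delay (8+2)/2 = 5, and no assignment of skews can do better around a cycle.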
In recent years, clock skew scheduling has gained practical acceptance in multiple design tools, usually applied at the end of the flow after physical synthesis is nearly complete.

Challenges There are real difficulties in the implementation of a specific clock skew schedule. The challenges of constructing a near-zero-skew clock distribution network are already significant, and the requirement that each endpoint have a different and specific insertion delay complicates the problem. Furthermore, the physical difficulties of inserting multiple buffers in the vicinity of every skewed register are also problematic. There have been some recent advances in the ease of clock skew implementation. The use of routing delays to insert skews has been studied in [14], thereby minimizing the number of delay buffers that must be placed. The clock skew schedule itself can be altered by using the flexibility of the non-critical constraints and/or the relinquishment of optimality. This was studied by [15] with favorable results.

1.3 Organization of this Dissertation

The overall theme of this dissertation is low-power design using sequential logic optimization techniques. Within this space, each chapter represents the study of a problem and a corresponding solution. We define a problem as consisting of the optimization of some objective under a given transformation that is subject to a particular set of constraints. There is (unsurprisingly) significant overlap in these elements between the chapters, and it may benefit the reader to consider the thesis in its entirety. However, significant effort has been put into the organization to provide coherent boundaries between the particular problem features that may be of interest to different readers. A secondary but very important theme that is common to all of the work presented here is scalability to large designs.
While computationally expensive and powerful optimization techniques can be applied to small circuits with impressive results, the utility of such methods is very limited in practice. We have intentionally focused on algorithms that are applicable to large (and growing) design sizes, even if this comes at the expense of obviously better or more complete solutions. This important property of our approach should be observed throughout this dissertation. The structure of each chapter follows the following general format (with the section titles in bold): an introduction to the Problem and its motivation, background and information about Previous Work, a detailed description of our Algorithm and a corresponding Analysis of its behavior, and finally a presentation of Experimental Results. The content of each chapter is roughly as follows:

• Chapter 2. Unconstrained Minimum-Register Retiming. We introduce a new algorithm for the minimization of the number of registers in a circuit using retiming. At this point, the only constraint on the solution is functional correctness. The technique is compared to existing solutions both analytically and empirically. This chapter serves as the foundation of the two subsequent ones.

• Chapter 3. Delay-Constrained Minimum-Register Retiming. In this chapter we extend the algorithm in Chapter 2 to include constraints on both the worst-case minimum and maximum path delays in the problem of minimizing the number of registers in a circuit under retiming. For synthesis applications, these constraints are critical to ensure the timing correctness of the resulting circuit.

• Chapter 4. Guaranteed Initializable Minimum-Register Retiming. The algorithm in Chapter 2 is further extended to guarantee that the resulting retiming will be initializable to a state that corresponds to the initial one in the original circuit.
The worst-case complexity of this problem is in class NP, but we show that our technique is quite efficient in practice for the examples that we examined.

• Chapter 5. Combined Minimum-Cost Retiming and Clock Skewing. We discuss algorithms for simultaneously minimizing both the number of registers in a circuit and the number of clock skew buffers under a maximum path delay constraint. A general cost function is defined (that is inclusive of power minimization) and both its exact and heuristic minimization are studied. It is demonstrated that combining the features of both retiming and skewing can lead to a significantly better solution than either on its own.

• Chapter 6. A New Technique for Clock Gating. A new technique for the synthesis of clock gating logic is introduced using the efficient combination of functional simulation and a satisfiability solver. Clock gating inserts combinational logic on the clock path to minimize the conditions under which the registers in the design must be switched. We improve on previous methods in runtime, quality, and/or the minimization of netlist perturbation.

Chapter 2

Unconstrained Min-Register Retiming

In this chapter we introduce a new algorithm for the minimization of the number of registers in a circuit using retiming. At this point, the only constraint on the solution is functional correctness: the primary outputs in the retimed design must exhibit functionally identical behavior to the original circuit under every possible sequence of inputs. It is assumed that the registers do not have any specific reset or initial state. This flavor of the retiming problem is known as unconstrained min-register retiming. We assume that the retiming transformation is understood. The reader may review Section 1.2.1 for more background on retiming. The chapter begins in Section 2.1 by defining the problem of register minimization and discussing the motivations behind and importance of reducing the number of registers in the design.
In Section 2.2, we discuss the background and previous solutions to this problem. Section 2.3 introduces a new algorithm to compute the optimal min-register retiming using a maximum-flow-based formulation and illustrates its behavior on several small examples. Further analysis of the correctness, complexity, and limitations of the new algorithm is described in Section 2.4. Finally, experimental results, including a direct comparison with existing best practices, are presented in Section 2.5. Chapters 3 and 4 further develop the maximum-flow-based retiming technique introduced in this chapter, describing the means to constrain the solution's worst-case delay and correctness at initialization, respectively.

2.1 Problem

A circuit is assumed to be a directed hypergraph Ghyp = <V, H> where each directed hyperedge h ∈ H is an element of 2^V × 2^V. There exist three classes of vertices: Vseq, the sequential elements (hereafter referred to as registers), Vcomb, the combinational gates, and Vio, the primary inputs and outputs (PIOs). The sequential and combinational gates may correspond to either specific objects in a technology library, generic primitives, black-boxed hierarchy, or some other complex technology-independent descriptions (e.g. sum-of-products). This flexibility makes retiming applicable to any stage of a synthesis flow, from a block-level RTL netlist to a placed physical one. We first decompose Ghyp into an equivalent graph G with pair-wise directed edges E ⊆ V × V such that E = {u → v : ∃h s.t. u ∈ sources(h) ∧ v ∈ sinks(h)}. Each hyperedge is broken into the complete set of connections from sources to sinks. The problem studied here is the simple minimization of the number of sequential vertices |Vseq| via retiming. The only constraint is that the functionality of the circuit, as observed at the primary outputs, remains unaltered.
Here, functionality does not include timing or electrical considerations; only the logical values at the end of an unbounded period of evaluation determine correctness. It is assumed that the initial values of the registers are unspecified.

2.1.1 Motivation

Registers are particularly important targets for optimization. There are common optimizations that are shared between combinational and sequential elements: for example, several functionally identical cells with varying drive strengths and threshold voltages may be present in a library to trade off area and dynamic power against performance and leakage power. There are also unique complexities inherent to only sequential elements. These present opportunities for improving design characteristics that are not applicable to the combinational logic. It is these design features that make sequential optimizations (such as retiming) of particular interest, and we examine them now. The first critical aspect of design that is not directly touched by combinational optimization is squarely within the domain of sequential optimization: the clock.

Clock Power It is typical for the current generation of integrated circuits to dissipate about 30% of their total dynamic power in the generation, distribution, and utilization of sequential synchronization signals, and it is possible for this fraction to climb above one half [16]. In most architectures, this takes the form of one or more large clock networks [17]. These signals must be distributed with extreme timing precision across large areas (or in many cases the entire die) to thousands of synchronization points. The total dynamic power consumed in the clock network takes the form of Equation 2.1, where Vpp is the peak-to-peak signal voltage, Cclk is the total capacitance in the clock distribution network (including endpoints), and fclk is the frequency.
The minimum voltage necessary to switch the transistors, and therefore Vpp, is dictated by the process technology. The performance is proportional to fclk and is often either tightly constrained or is the primary optimization objective.

P = (1/2) Vpp^2 Cclk fclk    (2.1)

This leaves the total capacitance of the clock network as the best target for minimizing the dynamic power consumption. The components of this capacitance can be broken into three categories: wire, intermediate buffers, and the clock-driven sequential gates. This is captured in Equation 2.2.

Cclk = Cnet + Cbuf + Σ_{i=1..R} c_reg^i    (2.2)

We are mostly concerned with minimizing the total power consumption by minimizing the capacitance on the leaves of the clock distribution network, the registers R. We assume that the capacitance of each clock input (that is, each register or latch) is determined by the technology and the details of the standard cell implementation. Further advances in device and library technology present excellent opportunities for reducing clock power consumption by improving these values. However, with these characteristics fixed, the design problem becomes one of minimizing the total number of clock inputs, or correspondingly, minimizing the total number of sequential elements. Reducing the number of points to which the clock must be distributed also reduces the power that must be consumed by the purely distributional components (i.e. Cnet and Cbuf). For this reason, minimizing the number of registers has a greater effect on the total power consumption than the reduction in leaf capacitance alone. Though we do not measure and include this effect in our results, its contribution should be recognized. An alternative to minimize the clock power would be to abandon the synchronous paradigm and eliminate the need for an explicit clock entirely.
Various asynchronous design methods have been proposed that do not require the expensive distribution of regional synchronization signals [18] [19] [20] [21]. There has even been success in employing this strategy in both academic [22] and commercial designs [23], but for now, the synchronous model retains a commanding dominance. Its simplicity, tool support, and maturity are unmatched. For the immediate future, digital design is wedded to the existence of a clock.

Clock Tree Synthesis Effort The design of the clock network is consistently one of the most challenging aspects of VLSI timing closure and typically involves significant effort on the part of both the automated tools and the human designers. Routing, placement, and buffering each present physical, electrical, and timing challenges. While this process is beyond the scope of this work, [17] presents an overview of the problem and current methodologies. The reduction in the number of clock distribution endpoints simplifies each of these problems.

Figure 2.1. The elimination of clock endpoints also reduces the number of distributive elements required.

As illustrated in Figure 2.1, if retiming is able to eliminate registers R5 and R6, additional savings in design effort, area, power, and routing can be realized because of the elimination of buffers B3 and B6. In general, the reduction in registers may not be concentrated in any one branch of the clock network; however, because the levels are (re)allocated to balance the electrical loads, the effect is the same. Retiming is often performed before clock tree synthesis.

Manufacturing Test Because of the tremendous complexity of a silicon device, there are a correspondingly large number of opportunities and locations for defects to appear during manufacturing. An overview is presented in [24]. Ensuring that each device conforms to the functional and operational specifications is time-consuming, challenging, and requires expensive equipment.
However, the costs of missing defects range from loss in yield to in-field replacement to the consequences of jeopardizing human safety. There are several different styles of test logic, but the dominant one involves the insertion of scan chains, as illustrated by Figure 2.2. A serially-connected path is created through all of the testable registers in the design. The testing then consists of three phases: first, the register scan inputs and outputs ("scin" and "scout") are enabled to allow the shift of test vectors into the sequential elements; second, the regular inputs and outputs are enabled and the values of the next state are computed; third, these result vectors are shifted out and evaluated for correctness. The shift in and out operations can be combined, but because of the large number of registers, the length of a chain can still be quite long. With increasing design complexity, the number of registers in a design grows, and the time required to shift in/out a single vector increases proportionally. As a complex design can require thousands of test vectors, this becomes the driving component of total test time.

Figure 2.2. A scan chain for manufacturing test.

The total test time is the main component of per-unit test cost. Register reduction addresses this directly. A decrease in the register count yields a proportional decrease in the scan chain length and the time required to load each vector. This offers a valuable means for reducing the test cost.

Verification

While the flow-based formulation of min-register retiming developed in this chapter is extended in Chapters 3 and 4 to include design constraints that are necessary for synthesis, the unconstrained problem does have intrinsic value in the area of sequential verification. The goal of sequential verification is to prove one or more properties over the entire operation of the state machine implemented by a sequential circuit [25] [26] [27] [28].
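The proportionality between register count and scan-shift time can be captured in a toy model (chain length, vector count, and shift period below are invented for illustration):

```python
# Toy model of scan-test shift time: roughly
#   (number of vectors) x (chain length in registers) x (shift-clock period),
# so a reduction in register count shrinks test time proportionally.
# All numbers are hypothetical.

def scan_test_time(n_registers, n_vectors, shift_period):
    """Seconds spent shifting vectors through a single full-length chain."""
    return n_vectors * n_registers * shift_period

before = scan_test_time(100_000, 5_000, 10e-9)   # original register count
after  = scan_test_time( 80_000, 5_000, 10e-9)   # 20% fewer registers
```

A 20% register reduction yields a 20% reduction in shift time under this model, which is the effect described above.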
The state space of the design is the critical driver of the complexity; the number of potential design states grows exponentially with the number of state bits. Because each register implements a state bit, register minimization can be used to significantly reduce the size of the problem. While the corresponding reduction in the total state space is exponential, this reduction doesn't necessarily come within the reachable state space. Nevertheless, the guaranteed linear reduction in the state representation is useful to improve the practical memory and runtime requirements of a sequential verification tool. The work of [29] demonstrates an empirical relationship between retimed register count and the difficulty of sequential equivalence checking. In this work, it was shown that preprocessing with retiming decreases the total runtime of sequential verification. Retiming is used industrially in IBM's SixthSense tool to this same end [30]. Although anecdotal, the experience of others in using these retiming algorithms in sequential verification has been decidedly positive.

2.2 Previous Work

2.2.1 LP Formulation

The use of retiming to minimize the number of registers in a design was first suggested by [8]. This objective was one of the first suggested applications for retiming in this original work. The problem can be formulated as an integer linear program of the form of Equation 2.3. As introduced in Section 1.2.1, let G = <V, E> be a retiming graph, r(v) a retiming lag function, and wi(u, v) the initial number of registers present on each edge u → v. The number of registers on edge u → v after retiming is wr(u, v) = wi(u, v) − r(u) + r(v).

    min Σ_{e∈E} wr(e)
    s.t. r(u) − r(v) ≤ wi(u, v)  ∀ u → v    (2.3)

Let |G| be the total number of registers present in the circuit described by the graph G. If the circuit is retimed as described by the lag function r(v), let |G′| be the total number of registers after retiming.
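The quantities in Equation 2.3 can be sketched directly in code (the three-node graph and lag values below are invented for illustration):

```python
# Sketch of Equation 2.3 on a made-up retiming graph: edges carry their
# initial register counts w_i; a lag function r is legal if every retimed
# weight w_r(u,v) = w_i(u,v) - r(u) + r(v) stays non-negative.

def retimed_weight(w_i, r, u, v):
    """w_r(u,v) = w_i(u,v) - r(u) + r(v)."""
    return w_i[(u, v)] - r[u] + r[v]

def is_legal(w_i, r):
    """Check the ILP constraints r(u) - r(v) <= w_i(u,v) for every edge."""
    return all(retimed_weight(w_i, r, u, v) >= 0 for (u, v) in w_i)

def total_registers(w_i, r):
    """Objective of Equation 2.3: sum of retimed edge weights."""
    return sum(retimed_weight(w_i, r, u, v) for (u, v) in w_i)

# Toy 3-node cycle a -> b -> c -> a with registers on two edges.
w_i = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 0}
r   = {"a": 0, "b": 1, "c": 0}   # shift registers across node b
```

An ILP solver would search over all legal r to minimize total_registers; the snippet only evaluates one candidate.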
This quantity can be computed from r(v) as described by Equation 2.4. Let the outdegree of a node be the number of outgoing edges: outdegree(v) = |{e = v → u : ∃u ∈ V ∧ e ∈ E}|. Indegree is defined similarly.

    |G′| = Σ_{e∈E} wr(e)    (2.4)
         = |G| + Σ_{v∈V} r(v)(indegree(v) − outdegree(v))    (2.5)

Fan-out Sharing

Because the retiming graph G represents all connectivity as pair-wise edges, it does not adequately model the hypergraph connectivity of the netlist Ghyp. A single physical wire may implement multiple point-to-point connections. For certain applications of retiming, this is irrelevant to the problem; for the minimum-register objective, correctly accounting for the connectivity of the retimed circuit is imperative. In particular, the registers on edges fanning out from the same vertex can be shared. With this register fan-out sharing, the correct register count is described by Equation 2.6.

    |G′| = Σ_{u∈V} max_{{(u,v) : u→v}} wr(u, v)    (2.6)

Leiserson and Saxe introduce a transformed graph Ĝ that exactly models register fan-out sharing. A mirror vertex v̂ is added for every vertex v that has outdegree(v) > 1 (i.e. multiple fan-outs). For every edge v → u, a mirror edge ê = u → v̂ is also created and assigned an initial register count as given by Equation 2.7. An edge breadth function β(e) : E → ℜ is also applied, where β(e) = 1/outdegree(v). With the edge breadths, the total number of registers in the retimed circuit becomes Equation 2.8. The number of registers |Ĝ| can be shown to be identical to the number of registers after maximally collapsing the registers in G with fan-out sharing.

    wi(u → v̂) = max_{e∈outgoing(v)} wi(e) − wi(v → u)    (2.7)

    |Ĝ′| = Σ_{e∈E} β(e) wr(e)    (2.8)
         = |Ĝ| + Σ_{v∈V} r(v) ( Σ_{e∈incoming(v)} β(e) − Σ_{e∈outgoing(v)} β(e) )    (2.9)

We assume this method of modeling fan-out sharing is used throughout this work.

Problem Size

The size of the final unconstrained min-register linear program is quite compact.
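The effect of fan-out sharing in Equation 2.6 can be sketched on an invented netlist fragment:

```python
# Sketch of Equation 2.6 (toy edge weights, invented for illustration): with
# fan-out sharing, registers on edges leaving the same vertex are shared, so
# each source vertex contributes the max (not the sum) of its retimed edge
# weights to the register count.

def shared_register_count(w_r):
    """|G'| under fan-out sharing: sum over sources of the max edge weight."""
    per_source = {}
    for (u, v), w in w_r.items():
        per_source[u] = max(per_source.get(u, 0), w)
    return sum(per_source.values())

# Vertex "a" fans out to three sinks with 2, 1, and 2 registers on its edges;
# sharing collapses these register chains so only max(2, 1, 2) = 2 are needed.
w_r = {("a", "x"): 2, ("a", "y"): 1, ("a", "z"): 2, ("b", "x"): 1}
```

The naive pairwise count (summing all edge weights) would report 6 registers here, while the shared count is 3, which is why the correction is imperative for the minimum-register objective.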
The number of variables is 2Vcomb, where Vcomb is the number of combinational nodes in the circuit. The number of constraints is proportional to the number of pairwise combinational edges. The LP can be generated from the circuit in O(P) time, where P is the number of node connections (i.e. pins). This can be solved directly using a general integer linear programming (ILP) solver. In practice, a more efficient solution is possible due to the specific nature of the problem. The dual of the problem does not require integer methods (which are NP-hard in the worst case) and can instead be solved as a continuous LP. Furthermore, the min-register retiming formulation has a particular structure that makes it well suited to network solutions. Next, we look at the minimum-cost network circulation problem and how it can be applied to the problem at hand.

2.2.2 Min-Cost Network Circulation Formulation

The dual of the linear program in Equation 2.3 possesses a network structure. In particular, this problem corresponds to the computation of the minimum-cost network circulation (MCC). The minimum-cost network circulation problem is as follows. Given a graph G = (V, E), let u(e) : E → ℜ be the capacity of each edge, and c(e) : E → ℜ the cost per unit of flow along each edge. A flow demand d(v) : V → ℜ is associated with each vertex; the total demand of all vertices is zero. The objective is to find a flow along each edge f(e) : E → ℜ that satisfies the demand at each vertex and minimizes the total cost. An MCC problem can be expressed as a linear program of the form of Equations 2.10 and 2.11.

    min Σ_{e∈E} c(e) f(e)
    s.t. f(e) ≤ u(e)  ∀e ∈ E    (2.10)
         Σ_{e∈incoming(v)} f(e) − Σ_{e∈outgoing(v)} f(e) = d(v)  ∀v ∈ V    (2.11)

When MCC is applied to retiming, the vertices V in the minimum-cost circulation problem correspond exactly to the vertices in the retiming graph. The demand d(v) is defined by Equation 2.12 and is equal to the net weight of the incoming less the outgoing edges.
The capacity of each edge u(e) is unbounded, and the cost of each edge c(e) is equal to the register weight, as specified in Equations 2.13 and 2.14, respectively. All of the costs and demands are integers (or rationals, if fan-out sharing is used); this property can be shown to improve the worst-case runtime of some methods.

    d(v) = incoming(v) − outgoing(v)    (2.12)
    u(e) = ∞    (2.13)
    c(e) = wi(e)    (2.14)

Algorithm              Year of Publication    Worst-Case Runtime             Strongly Polynomial
Edmonds and Karp       1972                   O(e(log U)(e + v log v))       No
Tardos                 1985                   O(e^4)                         Yes
Goldberg and Tarjan    1987                   O(ve^2 log v log(v^2/e))       Yes
Goldberg and Tarjan    1988                   O(ve^2 log^2 v)                Yes
Ahuja et al.           1988                   O(ve log log U log(vC))        Yes
Orlin                  1988                   O(e(log v)(e + v log v))       Yes

Table 2.1. Worst-case runtimes of various min-cost network flow algorithms

The algorithms available to solve MCC problems have expanded and been improved in recent decades. In Table 2.1, the worst-case asymptotic runtime bounds of several algorithms are compared. Here, e is the number of arcs, v is the number of vertices, U is the maximum capacity of any edge, and C is the maximum cost of any edge. Currently, the (generally) best-performing solution methods are based upon scaling and preflow-push. [31] describes an algorithm with O(ve log(v^2/e) log(vC)) worst-case time, although other methods have non-comparable bounds. Within the class of network linear programs, minimum-cost flow appears to be one of the trickier problems. While its worst-case bound is not strictly greater than those of other similar problems (e.g. maximum-flow), its application to practical problems is widely understood to require a greater degree of effort. Theoretically, it has also proved to be a challenge. It wasn't until 1985, with the work of [32], that an algorithm with strongly polynomial worst-case runtime was developed; this is over a decade later than the same bound was established for computing maximum-flow.
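The retiming-to-MCC mapping of Equations 2.12–2.14 can be sketched on an invented retiming graph; the demand of each vertex is its net register weight, capacities are unbounded, and edge costs are the initial register weights:

```python
# Sketch of Equations 2.12-2.14 on a made-up retiming graph.  The instance is
# only constructed, not solved; any MCC solver could consume these dicts.

import math

def mcc_instance(w_i):
    """Return (demand, capacity, cost) for the MCC dual of Equation 2.3."""
    demand = {}
    for (u, v), w in w_i.items():
        demand[v] = demand.get(v, 0) + w   # weight of incoming edges
        demand[u] = demand.get(u, 0) - w   # less weight of outgoing edges
    capacity = {e: math.inf for e in w_i}  # Equation 2.13: u(e) = infinity
    cost = dict(w_i)                       # Equation 2.14: c(e) = w_i(e)
    return demand, capacity, cost

# Toy 3-node cycle with registers on two of the edges.
w_i = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 1}
demand, capacity, cost = mcc_instance(w_i)
```

As required for a circulation to exist, the demands constructed this way always sum to zero over all vertices.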
Several of the procedures in Table 2.1 even require solving a maximum-flow problem in the course of the algorithm.

2.3 Algorithm

We introduce a new method for unconstrained minimum-register retiming. Instead of the traditional minimum-cost network circulation formulation, we utilize a technique based upon iterating a maximum network flow problem.

The overall outline of our algorithm is presented in Algorithm 1. A maximum network flow problem is constructed from the circuit graph and solved, the residual graph is used to generate a minimum cut, and the registers are retimed forward to the cut location; this procedure is iterated until a fix-point is reached. Next, a similar sequence of operations is performed to retime the registers backward. When the backward fix-point is reached, the resulting circuit is the optimal minimum register retiming.

Algorithm 1: Flow-based Min-register Retiming: FRETIME()
  Input : a sequential circuit G
  Output: min-register retimed circuit G′
  let |G| be the number of registers in G
  direction ← forward
  repeat
    nprev ← |G|
    Gresidual ← maxflow(G)
    C ← mincut(Gresidual)
    move registers to C
  until |G| = nprev
  direction ← backward
  repeat
    nprev ← |G|
    Gresidual ← maxflow(G)
    C ← mincut(Gresidual)
    move registers to C
  until |G| = nprev

2.3.1 Definitions

A combinational frame of the circuit is comprised of the acyclic network between the register outputs / PIs and the register inputs / POs. An example of this is illustrated in Figure 2.3(ii) for the circuit in 2.3(i). The inputs (the register outputs / PIs) lie on the left, and the outputs (the register inputs / POs) on the right. The registers are denoted with rectangles and the primary IOs with squares. The registers are duplicated for ease of illustration, and in the duplicated names we use superscripts to denote the n-th cycle replication of the element.
The cycles that exist in the original sequential circuit are implied by the connections through the registers between the inputs and the duplicated outputs. These conventions apply to subsequent diagrams of a similar nature.

Figure 2.3. A three bit binary counter with enable.

Let G = <V, E, Vsrc, Vsink> be a directed acyclic graph with a set of source nodes Vsrc ⊂ V with no incoming edges and a set of sink nodes Vsink ⊂ V with no outgoing edges. A source-to-sink path is a set of edges p = vsrc ⇝ vsink that transitively connect some vsrc ∈ Vsrc and vsink ∈ Vsink. The fan-out of v is the set of nodes U = {u : v → u ∈ E}. Similarly, the fan-in of v is the set of nodes U = {u : u → v ∈ E}. The set TFO(v) is the set of vertices in the transitive fan-out of v; TFI(v) is the transitive fan-in of v. Unless stated otherwise, the transitive connectivity is assumed to be broken at sequential vertices. A cut of G is a subset of edges C ⊆ E that partitions G into two disjoint subgraphs with the source and sink nodes in separate halves. It holds that for any cut there exists no source-to-sink path p where p ∩ C = ∅; every source-to-sink path is cut at least once. A retiming cut of G is a cut such that there exists no path p from vsrc → vsink where |p ∩ C| > 1. Every source-to-sink path is cut exactly once.

2.3.2 Single Frame

The core of the algorithm consists of minimizing the number of registers within a single combinational frame. Let us consider only the paths through the combinational logic that lie between two registers (thus temporarily ignoring the primary inputs and outputs). In this combinational frame, we assign Vsrc = R^0 and Vsink = R^1. The current position of the registers clearly forms a complete cut through the network (immediately at its inputs) and also meets the above definition of a retiming cut. The width of the cut is the initial number of registers. Consider retiming the registers in the forward direction through the combinational circuit.
As the registers are retimed over the combinational nodes, the corresponding cut moves forward through the network and may grow or shrink in width as registers are replicated and/or shared as dictated by the graph structure. The problem of minimizing the number of registers by retiming them to new positions within the scope of the combinational frame is equivalent to finding a minimum width cut. This is the dual of the maximum network flow problem, for which efficient solutions exist.

Maximum Flow

The maximum flow problem is defined as follows. A flow graph G = <V, E, Vsrc, Vsink, u> extends the previous graph by adding a capacity function. u(e) : E → ℜ is the capacity of each edge. Without loss of generality, we also identify a single source and sink: vsrc and vsink. The multiple sources and sinks can be simulated by adding an unconstrained edge from every element of those sets to the appropriate singular version. The objective is to find a flow along every edge f(e) : E → ℜ that maximizes the total flow from vsrc to vsink without violating any of the individual edge capacities. The maximum-flow problem can also be expressed as a linear program of the form of Equation 2.15.

    max Σ_{e≡(s,t) : s=vsrc} f(e)    (2.15)
    s.t. f(e) ≤ u(e)  ∀e ∈ E
         Σ_{e∈incoming(v)} f(e) − Σ_{e∈outgoing(v)} f(e) = 0  ∀v ∈ V \ {vsrc, vsink}
         Σ_{e∈incoming(vsink)} f(e) − Σ_{e∈outgoing(vsrc)} f(e) = 0

Maximum-flow is one of the fundamental problems in the class of network algorithms. It can be viewed as one of the essential "halves" of the more general minimum-cost network circulation problem (the other being the shortest path computation) [33]. A maximum-flow problem can be written in the more general MCC form by setting all of the costs to zero and adding an unconstrained edge from vsink to vsrc. Similarly to minimum-cost network circulation, there are specialized solution methods that make use of the particular structure of the maximum-flow problem.
Table 2.2 describes some of these algorithms of historical and practical interest. Here, v is the number of nodes in the problem and e the number of edges. U is the maximum capacity of any edge in the graph. There are both pseudo-polynomial algorithms (whose complexity involves the maximum edge capacity U) and strongly polynomial algorithms (whose complexity only depends on the size of the flow graph).

Algorithm              Year of Publication    Worst-Case Runtime
Dantzig                1951                   O(v^2 eU)
Ford and Fulkerson     1956                   O(veU)
Dinitz                 1970                   O(v^2 e)
Edmonds and Karp       1972                   O(e^2 log U)
Karzanov               1974                   O(v^3)
Cherkassky             1977                   O(v^2 e^{1/2})
Sleator and Tarjan     1983                   O(ve log v)
Goldberg and Tarjan    1986                   O(ve log(v^2/e))
Ahuja and Orlin        1987                   O(ve + v^2 log U)
Cheriyan et al.        1990                   O(v^3 / log v)
Goldberg and Rao       1997                   O(min(v^{2/3}, e^{1/2}) e log(v^2/e) log U)

Table 2.2. Worst-case runtimes of selected maximum network flow algorithms [3]

Given a flow f(e) in the original graph G, the residual graph Gresidual = <V, E, Vsrc, Vsink, uresidual> is defined as having the same structure as the original flow network but a set of capacities uresidual(e) as in Equation 2.16. The residual graph captures the amount of remaining capacity available on each edge. If the flow f(e) is indeed maximal, there will exist no path vsrc ⇝ vsink in the edges with remaining capacity in the residual graph. Correspondingly, the residual flow at every edge in a minimum-width cut will be zero.

    uresidual(e) = u(e) − f(e)    (2.16)

Deriving Minimum Cut

Once the maximum flow through the combinational network has been determined, the corresponding minimum cut is derived. The width of this cut is identical to the maximum flow and corresponds to the number of registers in the circuit after the retiming has been applied. The residual graph is used to generate a corresponding minimum cut.
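Equation 2.16 amounts to a one-line transformation; the capacities and flow below are an invented illustration:

```python
# Sketch of Equation 2.16 on a made-up flow assignment: the residual capacity
# of each edge is its capacity minus the flow already pushed through it.
# Saturated edges (residual 0) are the candidates for the minimum cut.

def residual_capacities(capacity, flow):
    """u_residual(e) = u(e) - f(e) for every edge e."""
    return {e: capacity[e] - flow.get(e, 0) for e in capacity}

capacity = {("s", "a"): 1, ("a", "b"): 1, ("b", "t"): 1}
flow     = {("s", "a"): 1, ("a", "b"): 1, ("b", "t"): 1}
res = residual_capacities(capacity, flow)
# Every edge on the single s-to-t path is saturated, so no augmenting path
# remains and the unit flow is maximal.
```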
The vertices in the network are partitioned into two sets: S, those that are reachable in the residual graph with additional flow from the source, and R, those that are not. Generating this partition is O(E) in the worst case. The partition must be a complete cut, because there can exist no additional flow path from the source to the sink if a maximal flow has already been assigned. We define the minimum-width cut Cmin to be the set of edges u → v where (u ∈ S) ∧ (v ∈ R).

Reverse Edges

Without additional constraints, the cut Cmin may not be a legal retiming. A cut in a directed graph only guarantees that all paths in the graph are cut at least once. This is a necessary but not sufficient condition for the cut to be a valid retiming. In the initial circuit, it is evident that any path in the combined graph passes through exactly one register (i.e. it is a retiming cut). Any valid retiming must preserve this property. If this were not the case, the latency of that path would be altered and the sequential behavior of the circuit changed. We seek the minimum cut in the graph such that all paths are crossed exactly once. Figure 2.4 illustrates a simple example of how this can lead to an illegal retiming. There are exactly two forward flow paths: {R1^0 → v1 → R1^1} and {R3^0 → v4 → R3^1}. The corresponding cut Cmin = {(v1, R1^1), (R3^0, v4)}, but this is illegal. The path {R3^0 → v4 → v3 → v2 → v1 → R1^1} now has two registers where it previously had one. This would insert additional sequential latency and alter the functionality of the circuit. Another example is provided in Figure 2.5.

Figure 2.4. An example circuit requiring unit backward flow.

Figure 2.5. An example circuit requiring multiple backward flow.

The network flow problem can be altered to eliminate the possibility that a path is crossed more than once. Reverse edges with unbounded capacity are added in the direction opposite to the constrained edges in the original network.
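The max-flow / min-cut step described above can be sketched compactly with BFS augmentation (Edmonds-Karp); the edge-dictionary representation and the toy graph at the end are assumptions for illustration, not the dissertation's circuit data structures:

```python
# Sketch of the single-frame flow step: find a maximum flow by BFS
# augmentation, then take S = nodes still reachable in the residual graph;
# the cut is the set of original edges crossing from S to its complement.

from collections import defaultdict, deque

def max_flow_min_cut(edges, src, sink):
    """edges: dict (u, v) -> capacity.  Returns (flow value, cut edge list)."""
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), cap in edges.items():
        residual[(u, v)] += cap
        adj[u].add(v)
        adj[v].add(u)           # allow traversal of reverse residual arcs
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break               # no augmenting path: the flow is maximal
        # Recover the path, find its bottleneck, and push flow along it.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push
    # S = nodes reachable from the source with remaining residual capacity.
    S, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in S and residual[(u, v)] > 0:
                S.add(v)
                stack.append(v)
    cut = [(u, v) for (u, v) in edges if u in S and v not in S]
    return flow, cut

# Toy graph: two unit paths s->a->t and s->b->t plus a shortcut a->b.
flow, cut = max_flow_min_cut(
    {("s", "a"): 1, ("s", "b"): 1, ("a", "b"): 1,
     ("a", "t"): 1, ("b", "t"): 1}, "s", "t")
```

On the toy graph the flow value is 2 and the derived cut has exactly that width, matching the duality the text relies on; the legality constraints (reverse edges, fan-out sharing) are layered on top of this core routine.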
These additional paths may increase the maximum flow (and therefore the size of the minimum cut) but guarantee that the resulting minimum cut will correspond to a legal retiming. For a path in the original graph to cross the finite width cut more than once from S → R, there must be at least one edge that crosses from R → S. If the unbounded reverse edges are also present, this would imply an infinite-capacity edge from S → R, thus violating the finiteness of the cut-width. We label this Property 1.

Property 1. If there exists an unconstrained flow edge e = u → v, a finite minimum cut Cmin will never contain edge e.

Proof. If e were in Cmin, this implies that u ∈ S ∧ v ∈ R from the method in which the cut is generated. However, if u(e) = ∞, the edge can never become saturated. There will always exist a path to R, and this edge could not have possibly been included in Cmin.

The addition of the unbounded reverse flow corrects the example in Figure 2.4. A new flow path is created: {R2^0 → v2 → v3 → R2^1}. With this increase in the maximum flow comes an increase in the width of the minimum cut; the new locations of the registers after retiming are identical to their pre-retiming positions. Similarly, the number of registers in Figure 2.5 will be increased, and the correct functionality of the circuit restored. In this example, f(v1 → v6) = 2 in the maximum possible flow. This demonstrates that unit flow on the reverse edges is not sufficient to guarantee that the resulting cut is a legal retiming cut.

Fan-out Sharing

As with the method in Section 2.2.1, it is also necessary to account for the sharing of registers at nodes with multiple fan-outs. This requires another simple modification to the network flow problem. Each circuit node v is decomposed into two vertices: a receiver of all of the former fan-in arcs, v^receiver, and an emitter of all of the former fan-out arcs, v^emitter.
The flow constraints are removed from these structural edges, and a single edge with a unit flow constraint is inserted from the receiver to the emitter. This transformation is depicted in Figure 2.6.

Figure 2.6. Fan-out sharing in flow graph.

Via Property 1, the unconstrained edges can not participate in the minimum cut; only the internal edge is available to make a unit contribution to the cut-width. Each node will therefore require at most one register regardless of its fan-out degree. Then, to model fan-out (as opposed to fan-in) sharing, the reverse edges are connected between adjacent receivers. This idea can also be extended to model fan-in sharing as in [34].

Primary Inputs and Outputs

The primary inputs and outputs (PIOs) can be treated in different ways, depending on the application. The allowed locations of the minimum cut and the subsequent insertion/removal of registers can be adjusted to either fix or selectively alter the sequential behavior of the circuit with respect to the external environment. In synthesis, the relative latencies at all of the PIOs are assumed to be invariant. In verification applications, it is not necessary to preserve the synchronization of the inputs and outputs. It may be desirable to borrow or loan registers to the environment individually for each PIO if the result is a net decrease in the total register count. To allow register borrowing, the external connections should be left dangling. Registers will be donated to the environment if the minimum cut extends past the dangling primary outputs (POs); conversely, registers will be borrowed if the minimum cut appears in the transitive fan-out region of the dangling primary inputs (PIs). The inclusion of this region introduces additional flow paths and allows additional possibilities for minimizing the total register count.

Figure 2.7. The illegal retiming regions induced by the primary input/outputs.
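The node-splitting construction can be sketched as follows; the string-suffix naming (`_rx`/`_tx`) and the two-gate netlist are invented for illustration:

```python
# Sketch of the flow-graph construction described above: each node v splits
# into a receiver (v_rx) and an emitter (v_tx) joined by a unit-capacity edge,
# structural arcs become unbounded (so, by Property 1, uncuttable), and
# unbounded reverse edges between adjacent receivers enforce cut legality
# while modeling fan-out sharing.

import math

def build_flow_graph(nodes, arcs):
    """arcs: list of (u, v) structural connections.  Returns edge->capacity."""
    cap = {}
    for v in nodes:
        cap[(v + "_rx", v + "_tx")] = 1         # only cuttable edge per node
    for (u, v) in arcs:
        cap[(u + "_tx", v + "_rx")] = math.inf  # structural arc, uncuttable
        cap[(v + "_rx", u + "_rx")] = math.inf  # reverse edge for legality
    return cap

cap = build_flow_graph(["g1", "g2"], [("g1", "g2")])
```

Because only the internal receiver-to-emitter edges carry finite capacity, a minimum cut in this graph charges each node at most once, which is exactly the fan-out sharing behavior of Equation 2.6.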
To disallow desynchronization with the environment, a host node and normalization can be employed, or alternatively, the flow problem can be suitably modified: the POs are connected to the sink and the transitive fan-out of the PIs is blocked from participating in the minimum cut. This is illustrated in Figure 2.7. All paths through the combinational network that originate from a PI have a sequential latency that must remain at zero; inserting a register anywhere in TFO({PIs}) would alter this. To exclude this region, one of two methods can be used: (i) temporarily redirecting to the sink all edges e ≡ (u, v) where v ∈ TFO({PIs}), or (ii) replacing the constrained flow arcs in this fan-out cone with unconstrained ones, thus preventing these nodes from participating in the minimum cut. Both methods restrict the insertion or deletion of registers in the invalid region. We primarily utilize (ii), as it decreases the length of the flow paths from source to sink.

Selectively disallowing desynchronization during verification may be motivated by the need to control complexity. Because register borrowing requires the initial values of the new registers to be constrained to those reachable in the original circuit, it is necessary to construct additional combinational logic for computing the initial state. If the size of this logic grows undesirably large, register borrowing can be turned off at any point for any individual inputs or outputs.

Implementing the Retiming

The mechanics of moving the registers to their new locations are trivial. First, the register nodes are disconnected from their original graph locations. The former fan-outs of each register are reconnected to its fan-ins. Secondly, new register nodes are inserted along the arcs in the minimum cut, one register per arc source (if fan-out sharing is enabled).
Note that this does not require that all of the fan-outs of a node in the circuit are transferred to the register; the connections to fan-outs that correspond to any outgoing arcs of a node that do not cross the minimum cut are left untouched. If the registers have defined initial states, some computation must be performed to calculate a set of equivalent initial states at the new positions (if one exists). This is addressed in detail in Chapter 4.

We can now prove an important property of the resulting retiming: that it minimizes the registers with the minimum amount of movement. Let C1 and C2 be two minimum width cuts. We define topological nearness to v as follows. Let p = v ⇝ v′ be a path in G that consists of a set of ordered edges e0..i. Cut C1 is strictly nearer than C2 if ∀p, ei ∈ C1 ∧ ej ∈ C2 ⇒ i ≤ j. It is partially nearer if ∃p, ei ∈ C1 ∧ ej ∈ C2 ∧ i ≤ j. A cut is the strictly nearest of a set of cuts if and only if there exists no other member of the set that is partially nearer.

Lemma 2. The returned cut Cmin will be the minimum-width retiming cut that is topologically strictly nearest to vsrc.

Proof. We prove this by demonstrating that the existence of a nearer cut violates the manner in which Cmin was constructed. The source u of every edge (u, v) ∈ Cmin must have been reachable in the residual graph from vsrc, but this can not be the case for every edge if there exists another closer cut. Assume that there exists some minimum-width retiming cut C′ that lies topologically partially nearer to vsrc than the returned cut Cmin. From the definition of partial topological nearness, we know that there exists some path p = vsrc ⇝ v′ such that ei ∈ C′ ∧ ej ∈ Cmin ∧ i < j. If ej is further from vsrc than ei in path p, then we know that for every other path containing ej there can exist no edge ek ∈ C′ where k ≥ j. If this were the case, then by transitivity ek ∈ TFO(ej) and C′ would not be a retiming cut.
However, because C′ is a cut, every path vsrc ⇝ vsink must contain some edge ek in C′, including those paths through ej. Therefore, every path to ej must have an edge ek ∈ C′ where k < j. Because uresidual(ek) = 0, there could not have been a flow path in the residual graph to ej. Because Cmin was constructed exactly by partitioning the nodes by source-reachability and ej is in the shallower source-reachable region, this implies a contradiction. This situation can not happen. Cmin must therefore be the topologically strictly nearest cut to vsrc.

Lemma 2 also holds when expressed in terms of the corresponding retiming lag functions. Let r1(v) and r2(v) be two valid retiming lag functions. r1 is strictly nearer than r2 if ∀v ∈ V, r1(v) ≤ r2(v). It is partially nearer if ∃v ∈ V, r1(v) < r2(v). Given a set of valid retimings r1..i, one can be said to be strictly nearest if and only if there exists no other member of the set that is partially nearer. The proof is similar.

Final Problem

The final flow graph on which the minimum-register retiming with fan-out sharing can be computed for a single combinational frame is illustrated in Figure 2.8(ii). The complete local flow graph is shown for gate g3. Solid lines represent edges with unit flow constraints and dotted lines edges with unbounded flow constraints. This has been derived from 2.8(i) by decomposing the hypergraph into arcs, splitting the nodes into emitters and receivers to model fan-out sharing, and adding reverse unconstrained edges. We will see in Section 2.4.2 that it is not necessary to explicitly construct this network.

2.3.3 Multiple Frames

Thus far, we have only considered the forward retiming of registers in the circuit. It is sufficient to consider only one direction if the circuit is strongly connected (i.e. through the use of a host or environment node) and normalization is applied. However, in general, the optimum minimum-register retiming requires both forward and backward moves.
The procedure for a single iteration of backward retiming is nearly identical, except that the maximum flow from the register inputs (sources) to the primary inputs and register outputs (sinks) is computed. Note that the fan-out sharing receivers now correspond to the original outputs and the fan-out sharing emitters to the inputs. The overall algorithm consists of two iterative phases: forward and backward. In each phase, the single-frame iteration is repeated until the number of registers reaches a fix-point. The procedure is outlined in Algorithm 1 and Figure 2.9. At no point during retiming is it necessary to unroll the circuit or alter the combinational logic; only the register boundary is moved by extracting registers from their initial position and inserting them in their final position. Therefore, each iteration is fast. In each iteration, every node's retiming lag r(v) is either changed by one or unchanged.

Figure 2.8. The corresponding flow problem for a combinational network.

The ordering of the two phases (forward and backward) doesn't affect the number of registers in the result, but because min-register retiming is in general not unique, the final register positions may differ: the registers can be moved to identical-sized cuts that are closest in either the forward or backward direction. It is also possible to interleave forward and backward steps. However, the forward-first approach reduces the amount of logic that has to be retimed backward, thereby reducing the difficulty of computing a new initial state. This process will be explained in Chapter 4.

Figure 2.9. Flow chart of min-register retiming over multiple frames

2.4 Analysis

2.4.1 Proof

In this section, we prove two facts about our flow-based retiming algorithm: (i) the result preserves functional correctness (less initialization behavior), and (ii) the result has the minimum number of registers possible via retiming alone.

Correctness

Theorem 3.
Algorithm 1 preserves functional correctness.

Proof. From the result of [8], we know that all legal retimings preserve functionality, and also that a transformation is a legal retiming if and only if it can be described by some retiming lag function. To prove functional correctness, it is therefore sufficient to describe the retiming lag function that exactly corresponds to the transformation that results from the application of our algorithm. We do this constructively.

We begin with the registers in their initial positions and the lag value of every combinational node initialized to zero. That is, ∀v ∈ V, r0(v) = 0. The algorithm proceeds by iterating the single frame register minimization in either direction. In each iteration, a minimum cut is computed under the constraint that every directed path from the starting positions of the registers to either a register or an output crosses the cut exactly once. Let C0 be the original locations of the registers and Cmin be the set of edges in this cut. We move the registers to these edges.

Lemma 4 (Correspondence of lag function to cut). The movement between two retiming cuts C0 and C1 corresponds to a retiming lag function rc(v): there exists a retiming lag function that reproduces this register movement exactly.

Given the lag function ri(v) that generates the circuit at the start of each iteration i, let ri+1(v) be the lag function that generates the circuit at the end of the iteration. This is computed as stated in Equation 2.17. Both of these are transformations from the initial register positions; the movement of this one iteration alone is captured by ri+1(v) − ri(v).

    ri+1(v) = ri(v) + 1   if v ∈ TFO(C) ∧ backward direction
              ri(v) − 1   if v ∈ TFI(C) ∧ forward direction
              ri(v)       otherwise    (2.17)

In the case where the flow graph has been transformed to model fan-out sharing, consider a cut edge between v^receiver → v^emitter to be a cut among all structural edges {v → u : u ∈ fanout(v)}.
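The lag update of Equation 2.17 can be sketched directly; the node names and cut fan-in/fan-out sets below are invented for illustration:

```python
# Sketch of Equation 2.17: in a forward iteration, every node in the
# transitive fan-in of the new cut C has its lag decremented; in a backward
# iteration, every node in the cut's transitive fan-out is incremented; all
# other lags are unchanged.

def update_lags(r, tfi_of_cut, tfo_of_cut, direction):
    """One iteration of Equation 2.17.  r maps node -> current lag."""
    r_next = {}
    for v, lag in r.items():
        if direction == "backward" and v in tfo_of_cut:
            r_next[v] = lag + 1
        elif direction == "forward" and v in tfi_of_cut:
            r_next[v] = lag - 1
        else:
            r_next[v] = lag
    return r_next

r0 = {"v1": 0, "v2": 0, "v3": 0}
r1 = update_lags(r0, tfi_of_cut={"v1", "v2"}, tfo_of_cut=set(),
                 direction="forward")
```

Composing these per-iteration updates yields the cumulative lag function whose existence the proof relies on.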
We now show that r_{i+1}(v) − r_i(v) exactly reproduces the movement of the registers from their positions at the start of the iteration to those at the end: registers are removed from C_0, added to C_1, and the remaining edges are left untouched.

First, consider the edge u → v in the retiming graph that corresponds to each of the original register locations. If the minimum cut does not lie on exactly this edge, it must be in its fanout. Therefore, either the lag of u will be incremented or the lag of node v decremented (depending on the retiming direction of this iteration). We also know that the lag at the other end of the edge remains constant, as this marks the boundary of the combinational frame in which the cut is contained. The result is a net decrease in the retimed register weight, w_r(u → v) (from Equation 1.2), of exactly one. If the minimum cut lies on exactly this edge, the register is preserved, as expected.

Next, consider the edge u → v in the retiming graph that corresponds to each edge in C. Here, the lag of the node in the direction of the movement will remain constant, as this lies beyond the cut and cannot be in its transitive fanout/fanin. Because no path can cross the cut more than once, no other edge in C could contain the node in its transitive fanout/fanin. (This is exactly the condition that would be violated, leading to functionally incorrect solutions, if a minimum cut were computed without the single-crossing-per-path constraint.) As the other end of the edge is incremented/decremented, the net register weight is increased by exactly one.

All edges in the retiming graph that do not lie at the original register boundary or the minimum cut will have no change in register weight. We know that w_i(u → v) = 0 because it was not an original register location. This means that there is a combinational path between u and v.
Because the cut C does not lie between the two, the transitivity of the TFO/TFI operations in Equation (2.17) dictates that u is incremented if and only if v is incremented. There will be no net change on any of these edges and no new registers.

We have shown that there exists a retiming lag function that reproduces the register relocation performed in each iteration. The described function exactly (i) removes the registers from their positions at the start of the iteration, and (ii) inserts registers at exactly the new positions at the end of the iteration (and no others). We have also described how to compute the cumulative lag function that describes the total change over all iterations. From the result of [8], we can establish that this transformation is indeed a valid and functionally correct retiming.

If the transitive fan-in (fan-out) of the PIs (POs) is excluded from the minimum cut computation via the mechanism described above, the lag at the input and output nodes will remain zero. This guarantees that the sequential latency of every output is preserved.

Optimality

First, we demonstrate the converse of Lemma 4. We had proven Lemma 4 by considering combinational frames one at a time and composing the solution. The idea can be generalized to multiple simultaneous combinational frames (e.g., an unrolled circuit). For Lemma 5, we examine the unrolled circuit directly.

Consider unrolling the sequential circuit by n cycles, where n > max_v r(v) − min_v r(v). Each vertex v is replicated n times, producing a corresponding set of unrolled vertices v^0, ..., v^n. If v^0 represents the vertex in the reference frame, v^i corresponds to that vertex in the i-th unrolled frame; similarly for edges. The register inputs from time cycle i are connected to the register outputs of time cycle i + 1. An example of this is shown in Figure 2.10.

Lemma 5.
[Correspondence of cut to lag function] Every retiming lag function r_c(v) corresponds to a cut C in the unrolled circuit.

Proof. The positions of the registers of the reference cycle after any retiming r(v) can be expressed as a cut C in the edges of this unrolled circuit: the cut contains the edge e^i = u^i → v^i in exactly those frames i where the retiming places a register on the edge u → v, and the elements of C are the register positions. The unretimed cut C_init (such that r(v) = 0) lies at the base of the unrolled circuit. The size of this cut, |C|, is the number of registers post-retiming, or equivalently, the number of combinational nodes whose fan-out hyperedges cross the cut.

A cut C is a valid retiming if every path through the combinational network passes through it exactly once. This implies that for any two registers R1, R2 ∈ C, R1 ∉ TFO(R2) and vice versa. If this were not the case, additional latency would be introduced and the functionality of the circuit would be altered.

A combinational frame of the cut C with retiming function r(v) is the region in the unrolled circuit between C and C′, where C′ is generated by r′(v) = r(v) + 1. If the circuit were retimed to C, this region corresponds exactly to the register-free combinational network structure that would lie on the outputs of the register boundary.

Figure 2.10. A cut in the unrolled circuit.

Theorem 6. Algorithm 1 results in the minimum number of registers achievable by any retiming.

Proof. Consider an optimal minimum-register retiming and its corresponding cut C_min. While there may exist many such cuts, assume C_min to be the one that lies strictly forward of the initial register positions and is topologically closest to C_init. It can be shown with Lemma 7 that there is one unambiguously closest cut. Our algorithm iteratively computes the nearest cut of minimum width reachable within one combinational frame and terminates when there is no change in the result.
Let the resulting cut after iteration i be C_i. The cut C_i at termination will be identical to C_min if the following two conditions are met.

Condition 1. No register in C_i lies topologically forward of any register in C_min.

Condition 2. After each iteration, |C_{i+1}| < |C_i| unless C_i = C_min.

Figure 2.11. Retiming cut composition.

Lemma 7 (Cut composition). Let C_i and C_j be two valid retiming cuts, and {s_i, t_i} and {s_j, t_j} be partitionings of each: (s ∪ t = C) ∧ (s ∩ t = ∅). Suppose also that for any path p, (p ∩ s_i ≠ ∅) ⇔ (p ∩ s_j ≠ ∅) and (p ∩ t_i ≠ ∅) ⇔ (p ∩ t_j ≠ ∅). If this is the case, the cuts {s_i, t_j} and {s_j, t_i} are also valid retimings.

Proof. One example of such a partitioning is induced by topological order. If the points of intersection of the cuts with a path p are R_i ∈ C_i and R_j ∈ C_j, we can assign the registers to s if R_j ∈ TFO(R_i), and to t otherwise. The s sets will include the registers that are topologically closer in C_i than in C_j; the t sets will include the registers that are in both cuts or topologically closer in C_j than in C_i.

If a given p crosses C_i at some R_i ∈ s_i, it may not cross any other register in t_i (from the definition of a partition). It also may not contain any register in t_j (from the definition of the sets). The cut {s_i, t_j} therefore has no more than one register on any path. If a p does not intersect s_i, then it must cross at some R_i ∈ t_i (from the definition of a partition), and it must then also intersect some register in t_j (from the definition of the sets). The cut {s_i, t_j} therefore has at least one register on any path. Thus {s_i, t_j} is crossed by every path exactly once and is a valid retiming; similarly for {s_j, t_i}.

Proof of Condition 1. Consider a cut C_i that violates Condition 1. Let {s_i, t_i} be a partition of C_i and {s_min, t_min} be a partition of C_min such that s_i is the subset of registers in C_i that lie topologically forward of the subset s_min of the registers in C_min. This is illustrated in Figure 2.11.
By Lemma 7, we know that both {s_i, t_min} and {s_min, t_i} are valid cuts. Because a single iteration returns the nearest cut of minimum width within a frame, this C_i = {s_i, t_i} must be strictly smaller than the closer {s_min, t_i}. This implies that |s_i| < |s_min|, and hence that |{s_i, t_min}| < |{s_min, t_min}| = |C_min|, which is impossible by definition. Therefore, Condition 1 must be true.

Observation 1. Retiming by an entire combinational frame does not change any of the register positions in the resulting circuit and also represents a valid retiming cut. Because a register is moved over every combinational node, the retiming lag function is universally incremented; since the number of registers on a particular edge is a relative quantity, the result is structurally identical to the original.

Proof of Condition 2. We can use the minimum cut to generate a cut that is strictly smaller than C_i and reachable within a combinational frame. Consider the cut C′_min that is generated from C_min via Observation 1 such that its deepest point is reachable within the combinational frame of C_i. Some of the retiming lags may be temporarily negative. Let {s_i, t_i} be a partition of C_i and {s_min, t_min} be a partition of C′_min such that s_min are the deepest registers in C′_min that lie topologically forward of the subset s_i of the registers in C_i; s_min ≠ ∅ if C_i ≠ C_min. Using the reasoning from Condition 1, both {s_i, t_min} and {s_min, t_i} are valid cuts. We know that |s_min| < |s_i|; otherwise, the existence of a topologically nearer cut with |{s_i, t_min}| ≤ |C_min| would be implied. Therefore, the cut {s_min, t_i} is strictly smaller than C_i, is reachable within one combinational frame, and would be returned by a single iteration of the algorithm. Note that this does not imply that there are no other smaller cuts, only that there must exist at least one that is strictly smaller. Therefore, Condition 2 must also be true.
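The single-crossing condition that underlies these proofs is easy to check directly on a concrete graph. The sketch below is an illustrative helper of our own (not part of the thesis flow); it verifies that every source-to-sink path of a DAG crosses a candidate cut exactly once, assuming every node lies on some source-to-sink path.

```python
from collections import defaultdict

def is_valid_retiming_cut(edges, sources, sinks, cut):
    """Return True iff every source-to-sink path crosses `cut` exactly once.

    edges:   iterable of (u, v) pairs forming a DAG
    cut:     set of (u, v) edges proposed as register positions
    Assumes every node lies on at least one source-to-sink path, so the
    crossing count of each node is well defined for a valid cut.
    """
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set(sources) | set(sinks)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes |= {u, v}
    # Kahn's algorithm: process nodes in topological order.
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    # crossings[v] = number of cut edges on any source->v path; for a valid
    # retiming cut this count must be the same along every path.
    crossings = {s: 0 for s in sources}
    for u in order:
        if u not in crossings:
            continue
        for v in succ[u]:
            c = crossings[u] + (1 if (u, v) in cut else 0)
            if crossings.setdefault(v, c) != c:
                return False      # two paths disagree: some path crosses twice
    return all(crossings.get(t) == 1 for t in sinks)
```

On the diamond a → {b, c} → d, the cut {(a,b), (a,c)} is crossed exactly once by both paths and is valid, while the cut {(a,b)} leaves the path through c uncrossed.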
2.4.2 Complexity

The core of our algorithm consists of computing the maximum flow through a single combinational frame of the circuit. If the circuit has V gates and E pairwise connections between them, the corresponding flow problem will have 2V vertices and V + 2E edges. The doubling of vertices is due to the split into receiver/emitter pairs to model fan-out sharing, and the edge total is due to the structural edges, the reverse edges, and the internal edges between the receiver and emitter nodes. We assume that every vertex has at least one structural fan-out and that V < E. The mapping between the flow graph and the original circuit is therefore linear.

Based on the result of [35], the worst-case runtime of computing the maximum flow through this graph is O(VE log(V²/E)). We can then derive a minimum cut from the residual graph in O(E) time, as this is just a check for source reachability. The maximum number of iterations can be bounded by R, the number of registers, via Condition 2 in the above proof. The total worst-case runtime is therefore O(RVE log(V²/E)) using this method.

Our experience with [35] indicates that the algorithmic enhancements that improve the worst-case bound of the maximum-flow runtime do not provide much savings in average runtime for min-register retiming on our examples. We initially used the HIPR tool [36] to compute the maximum flow and determined that it was no more effective than a simple and efficient implementation that exploits the structure of our particular problem. We describe this now.

Binary Simplification

The specific nature of the flow problem constructed to solve for the minimum-register retiming within a single combinational frame (as shown in Figure 2.8) permits a simplification in the method used to solve for the maximum flow. This simplification is premised on the observation that the capacity of every edge in the flow graph is either one or infinity.
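The construction just described can be sketched as follows. This is a minimal illustration under our own naming: `'r'` and `'e'` mark the receiver and emitter poles of each gate.

```python
INF = float("inf")

def build_flow_graph(vertices, edges):
    """Build the flow problem for one combinational frame.

    Each gate v is split into a receiver pole (v, 'r') and an emitter pole
    (v, 'e') joined by a unit-capacity internal edge; this models fan-out
    sharing, since at most one unit of flow can pass through the gate.
    Structural edges run emitter -> receiver with infinite capacity, and
    each also gets an infinite-capacity reverse edge, giving 2V vertices
    and V + 2E edges in total.
    """
    cap = {}
    for v in vertices:
        cap[((v, "r"), (v, "e"))] = 1        # internal unit-capacity edge
    for u, v in edges:
        cap[((u, "e"), (v, "r"))] = INF      # structural edge
        cap[((v, "r"), (u, "e"))] = INF      # reverse (unconstrained) edge
    return cap
```

For a three-gate chain a → b → c, this yields 2V = 6 poles and V + 2E = 3 + 4 = 7 edges, matching the counts above.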
Furthermore, with fan-out sharing, there is exactly one unit of flow that may pass from the input to the output of a gate in the circuit. Instead of maintaining the residual flow on each edge, we therefore need only store, for each node: (i) whether its internal edge is at capacity, and (ii) the last internal edge in the flow path. We introduce the binary maximum-flow technique described by Algorithm 2. This technique is based on shortest-path augmentation and proves to be fast and efficient for solving minimum-register retiming problems. In the context of path augmentation, the above two per-node pieces of information make it possible to check for remaining capacity (on the unit-constrained edges) and to unwind flow segments (to redirect them along other paths). Because the model of fan-out sharing requires two (implicit) nodes in the flow graph for every node in the circuit graph, we introduce the notion of a vertex pole to differentiate between the fan-out emitter and receiver: a vertex pole v̂ ∈ V × {r, e}.

The reason that shortest-path augmentation performs favorably compared to the more sophisticated capacity-scaling preflow-push method is that both the capacities and the resulting flows are quite uniform. Scaling does not help because every path from source to sink passes through an edge of minimum capacity (i.e., 1). With few exceptions (Figure 2.5 being one such exception), the maximal flow along every edge will be zero or one. This negates much of the benefit of preflow, in which bundles of flow can be pushed along large edges in a single early step and shared amongst smaller edges in later ones.
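For intuition, the effect of shortest-path augmentation on such unit-bottleneck graphs can be reproduced with a plain breadth-first augmenting-path search over explicit residual capacities. This is a generic sketch of our own, not the pointer-based binary simplification of Algorithms 2 through 6; with unit internal edges, each augmentation adds exactly one unit of flow.

```python
from collections import deque

INF = float("inf")

def max_flow(cap, s, t):
    """Shortest-augmenting-path (BFS) max flow over a capacity dict
    {(u, v): capacity}. With at most R unit-bottleneck augmenting paths,
    each found in O(E) time, the total work is O(R * E)."""
    res = dict(cap)
    for (u, v) in list(cap):
        res.setdefault((v, u), 0)          # residual back-edges
    adj = {}
    for (u, v) in res:
        adj.setdefault(u, []).append(v)
    total = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:   # BFS finds a shortest path
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total                   # no augmenting path remains
        path, v = [], t
        while parent[v] is not None:       # trace the path back to s
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[e] for e in path)   # 1, given unit internal edges
        for (u, v) in path:
            res[(u, v)] -= bottleneck
            res[(v, u)] += bottleneck
        total += bottleneck
```

On a split-node graph in which a single gate's unit internal edge separates source from sink, the maximum flow (and hence the minimum register cut) is 1; with two such gates in parallel, it is 2.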
Algorithm 2: Binary Max-flow: MAXFLOW()
 1: Input: a combinational circuit graph G = ⟨V, E⟩
 2: Output: minimum retiming cut width
 3: define V_pole = V × {r, e}
 4: let d(v̂) : V_pole → Z be initially unassigned
 5: let f(v) : V → {0, 1} be flow markers
 6: let pred(v) : V → V be predecessor markers
 7: let H(x) : Z → Z = 0 be the histogram of sink distances
 8: let flow = 0
 9: INITDIST()
10: while ADVANCE_E(v_src, ∅) do
11:   increment flow
12: return flow

The worst-case runtime of our binary-simplified method can be bounded strictly in terms of the vertices and edges in the circuit, but it is useful to introduce an alternative bound. Let R be the number of registers in the design. Because the number of registers decreases in each iteration, we know that the min-cut will never be larger than R. This also arises structurally from the fact that the flow out of the source has no more than R paths through the initial positions of the registers. As the time to compute each of the at most R augmenting paths is O(E), the runtime of determining the maximum flow can be bounded by O(RE). Using this bound on the maximum flow, the total runtime of the algorithm is O(R²E). This is non-comparable with (neither strictly better nor strictly worse than) the runtime of the best minimum-cost flow algorithm. While the maximum number of iterations in a real circuit appears to be quite small (based on the results in Section 2.5), a bound may still be desirable to limit the worst-case runtime. This comes at the expense of optimality: with a constant iteration bound, our algorithm's runtime becomes O(RE). This is the time in which we can obtain a usable reduction in register count.

Algorithm 3: Binary Max-flow: INITDIST(v̂)
 1: Input: a vertex pole v̂
 2: let Π be a queue of V_pole
 3: d(v_sink, r) ← 0
 4: push (v_sink, r) → Π
 5: while Π do
 6:   pop Π → v̂ ≡ (v, y)
 7:   increment H(d(v̂))
 8:   if y = r then
 9:     for all û ≡ (u, e) s.t. ∃(u, v) ∈ E do
10:       if d(û) is unassigned then
11:         d(û) ← d(v̂) + 1
12:         push û → Π
13:     for all û ≡ (u, r) s.t. ∃(v, u) ∈ E do
14:       if forward and d(û) is unassigned then
15:         d(û) ← d(v̂) + 1
16:         push û → Π
17:   if y = e then
18:     if d(v, r) is unassigned then
19:       d(v, r) ← d(v̂) + 1
20:       push (v, r) → Π
21:     for all û ≡ (u, e) s.t. ∃(v, u) ∈ E do
22:       if backward and d(û) is unassigned then
23:         d(û) ← d(v̂) + 1
24:         push û → Π

Algorithm 4: Binary Max-flow: ADVANCE_R(v, v_pred)
 1: Input: a vertex v
 2: Input: a vertex v_pred, the flow predecessor
 3: Output: {true, false}
 4: if v ≡ v_sink then
 5:   return true
 6: let v̂ = (v, r); rsl = false
 7: mark v̂ visited
 8: // reverse unconstrained edges (forward)
 9: for all û ≡ (u, r) s.t. ∃(u, v) ∈ E do
10:   if forward and d(û) + 1 = d(v̂) and û unvisited then
11:     rsl ← ADVANCE_R(u, v_pred)
12: // unwinding flow to another node
13: if f(v) ∧ v_pred ≠ ∅ ∧ d(v_pred, e) + 1 = d(v̂) ∧ (v_pred, e) unvisited then
14:   rsl ← ADVANCE_E(v_pred, pred(v))
15:   pred(v) ← v_pred if rsl
16: // adding internal flow
17: if ¬f(v) and d(v, e) + 1 = d(v̂) and (v, e) unvisited then
18:   if ADVANCE_E(v, v_pred) then
19:     f(v) ← true; rsl ← true; pred(v) ← v_pred
20: mark v̂ unvisited
21: if ¬rsl then
22:   RETREAT(v̂)
23: return rsl

Algorithm 5: Binary Max-flow: ADVANCE_E(v, v_pred)
 1: Input: a vertex v
 2: Input: a vertex v_pred, the flow predecessor
 3: Output: {true, false}
 4: if v ≡ v_sink then
 5:   return true
 6: let v̂ = (v, e); rsl = false
 7: mark v̂ visited
 8: // structural edges (backward)
 9: for all û ≡ (u, r) s.t. ∃(u, v) ∈ E do
10:   if backward and d(û) + 1 = d(v̂) and û unvisited then
11:     rsl ← ADVANCE_R(u, v_pred)
12: // structural edges (forward)
13: for all û ≡ (u, r) s.t. ∃(v, u) ∈ E do
14:   if forward and d(û) + 1 = d(v̂) and û unvisited then
15:     rsl ← ADVANCE_R(u, v_pred)
16: // reverse unconstrained edges (backward)
17: for all û ≡ (u, e) s.t. ∃(v, u) ∈ E do
18:   if backward and d(û) + 1 = d(v̂) and û unvisited then
19:     rsl ← ADVANCE_E(u, v_pred)
20: // unwinding internal flow
21: if f(v) and d(v, r) + 1 = d(v̂) and (v, r) unvisited then
22:   if ADVANCE_R(v, pred(v)) then
23:     f(v) ← 0; rsl ← true; pred(v) ← ∅
24: mark v̂ unvisited
25: if ¬rsl then
26:   RETREAT(v̂)
27: return rsl

Algorithm 6: Binary Max-flow: RETREAT(v̂)
 1: Input: a vertex pole v̂ ≡ (v, y)
 2: Output: none
 3: let m = ∞
 4: if y = r then
 5:   // unwinding flow to another node
 6:   m ← min(m, d(pred(v), e)) if f(v)
 7:   // adding internal flow
 8:   m ← min(m, d(v, e)) if ¬f(v)
 9:   // reverse unconstrained edges (forward)
10:   if forward then
11:     m ← min(m, d(u, r)) ∀u s.t. ∃(u, v) ∈ E
12: if y = e then
13:   // unwinding internal flow
14:   m ← min(m, d(v, r)) if f(v)
15:   // structural edges
16:   if backward then
17:     m ← min(m, d(u, r)) ∀u s.t. ∃(u, v) ∈ E
18:   else
19:     m ← min(m, d(u, r)) ∀u s.t. ∃(v, u) ∈ E
20:   // reverse unconstrained edges (backward)
21:   if backward then
22:     m ← min(m, d(u, e)) ∀u s.t. ∃(v, u) ∈ E
23: let d′ = d(v̂)
24: decrement H(d′); d(v̂) ← m + 1; increment H(d(v̂))
25: if H(d′) = 0 then
26:   exit with maximum flow

2.4.3 Limitations

Retiming does not exist in a vacuum. It is one of many optimizations available to the designer for exchanging performance for power, area, and complexity; when the entire space of available design transformations is considered, retiming is only one of many degrees of freedom. The best solution in the joint space does not generally correspond to the power-minimal retiming of the initial solution. For example, it may be desirable to retime a circuit to maximize performance at the expense of clock power, allowing smaller combinational gates (which consume less dynamic power) along the critical paths; this may be a better overall solution than a retiming that minimizes clock power at the expense of even greater power consumption in the combinational gates.
In addition, higher speed can be traded for power by scaling back Vdd. In general, this problem is complex, though a limited exploration is discussed in Chapter 5. Instead, we focus on the design space reachable with retiming alone and on how this can be used to minimize power consumption. However, even if the choices of the other design elements are decoupled from the power-minimization problem, these details still interact with and depend upon the choice of retiming. For example, the positions of the registers in the netlist can affect the global and local placement of standard cells, the congestion of the routing problem, and the capacitance of the routed wires. Two design alternatives could be propagated in parallel through the design process and evaluated at a stage where the physical model is sufficiently detailed to make an accurate comparison. However, unpredictability remains in the process until very late in the flow (e.g., routing), and it is not practical or cost-effective to evaluate alternatives in this manner. The consequences for power consumption of retiming one or more registers are therefore greatly dependent on the model used to evaluate power and on the design details available. Despite the inaccuracy, computational requirements typically dictate that the model used to evaluate and select retiming choices is constructed without physical implementation information. Discussions of optimality are therefore confined to this view.

2.5 Experimental Results

2.5.1 Setup

The experimental setup consisted of a pool of 3.0 GHz AMD x64 machines made available by [37]. All applications were written in C/C++, compiled using GNU g++ version 4.1, and run under Linux using PBS. Several different sources of circuit benchmarks were used to evaluate the behavior and performance of flow-based retiming. These can be grouped into four sets: ISCAS/LGsynth, OpenCores, QUIP, and Intel. The basic characteristics of each of these are described in Appendix A.
The ISCAS/LGsynth benchmarks are well known to the logic synthesis community, having originated at ISCAS in 1987. The original set was extended in 1989 and then again in 1991 and 1993; these versions were obtained in BLIF format through the LGsynth suite. A large fraction of the elements therein are purely combinational and not of interest to this work; these have been excluded from consideration. The remaining sequential circuits are described in Table A.1. On average, the circuits in this set are by far the smallest, both in number of combinational and sequential elements.

The OpenCores benchmarks are examples from the OpenCores open-source hardware designs [38]. These designs were synthesized and offered as benchmarks to the synthesis community in 2005 [39]. A large number of the OpenCores designs implement hardware controllers and are excellent examples with which to evaluate the efficacy of (minimum-register) retiming on practical sequential circuits. We use versions synthesized from the RTL originals using Altera's Quartus [40] tool as a front end. It should be noted that the Quartus flow includes optimizations of both the combinational and sequential behavior. The statistics presented in Table A.3 already include this pre-optimization, and any improvements reported elsewhere are in addition to it.

The QUIP [41] benchmarks were provided in behavioral Verilog as part of the Altera University QUIP package and were also synthesized using Quartus as a front end. This set contains the single largest example, "uoft raytracer".

The final set of designs was provided by Intel and is derived from various verification problems. All of the circuits in this suite were provided as single-output AIGER files. While not synthesis examples, this set provides a means of evaluating the utility of flow-based retiming in another domain. On average, this set contains the largest and most difficult examples.
All of the above circuits were imported into the ABC logic synthesis and verification system. The original structures of the LGsynth examples were preserved, but the other three sets were further optimized: hierarchy removal, dangling node removal, structural hashing, greedy rewriting, and algebraic re-balancing were applied to minimize the number of combinational nodes. No changes were made to the sequential elements or the sequential behavior of the circuits.

2.5.2 Runtime

One of the primary contributions of the flow-based unconstrained minimum-register retiming approach is the reduction in the computational effort required to compute the optimal minimum-register solution. In this section, we contrast the runtime of our approach against the best available alternatives. Though the previous approaches to minimum-register retiming all utilize a formulation of the problem as an instance of minimum-cost network circulation, there are many methods for solving this class of linear programs, including some that are specialized to, and highly effective on, exactly this problem. To strengthen the comparison, we present results from two different tools that use two different solution methods. Both are mature, off-the-shelf, publicly available programs. We believe these are representative of the best practices available.

Name       flow time  cs2 time  cs2 perc  mcf time  mcf perc
s641       0.00       0.00      -         0.00      -
s13207     0.13       0.08      -38.5%    0.05      -61.5%
s9234      0.01       0.02      -         0.01      -
s713       0.00       0.00      -         0.00      -
s953       0.00       0.00      -         0.00      -
s38584.1   0.07       0.46      557.1%    0.23      228.6%
s400       0.00       0.00      -         0.00      -
s382       0.00       0.00      -         0.00      -
s38417     0.86       0.46      -46.5%    0.31      -64.0%
s5378      0.00       0.02      -         0.01      -
s444       0.01       0.00      -         0.00      -
AVERAGE                         157.4%              34.4%

Table 2.3. Unconstrained min-reg runtime, LGsynth benchmarks.

The first comparison is made against the CS2 software package from Andrew Goldberg's Network Optimization Library [42]. The source of CS2 is available under a restricted
academic license and was compiled using the same tool flow described in Section 2.5.1. The algorithmic core of CS2 is the cost- and capacity-scaling preflow-push method described in [43]; this method possesses one of the best known worst-case bounds for solving the minimum-cost network circulation problem.

The second comparison is made against the MCF package [44], a C++ tool from the Zuse Institute Berlin that is free of charge for academic use. MCF is an implementation of primal and dual network simplex algorithms. As this solves the minimum-cost flow problem using a different class of solution methods, it provides a second comparison point against which to evaluate our method. The algorithmic basis of the MCF tool is described in [45] and [33].

Tables 2.3, 2.4, 2.5, and 2.6 describe the results of applying the above unconstrained register minimization algorithms to all four of the benchmark suites described in Appendix A. The minimized register counts were identical across all methods and are presented in later tables; only the runtime values are given here. Only the subset of benchmarks that had a non-zero decrease in register count is presented. The first column in each section lists the total runtime of each approach. The second columns (for CS2 and MCF) give the percentage increase in runtime over our approach. Runtimes are only compared for examples where at least one runtime is greater than 0.05 seconds; the other values are not included in the totals.
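The "perc" columns are straightforward to reproduce. As a sanity check against the s38584.1 row of Table 2.3 (flow 0.07 s, CS2 0.46 s, MCF 0.23 s):

```python
def percent_increase(flow_time, other_time):
    """Percentage increase of a competitor's runtime over the flow-based
    runtime, as reported in the cs2/mcf 'perc' columns."""
    return (other_time - flow_time) / flow_time * 100.0

# s38584.1 from Table 2.3: flow 0.07s, CS2 0.46s, MCF 0.23s
cs2_perc = round(percent_increase(0.07, 0.46), 1)   # 557.1
mcf_perc = round(percent_increase(0.07, 0.23), 1)   # 228.6
```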
Name                flow time  cs2 time  cs2 perc  mcf time  mcf perc
barrel16 opt        0.00       0.00      -         0.00      -
barrel16a opt       0.00       0.00      -         0.00      -
barrel32 opt        0.00       0.00      -         0.00      -
nut 004 opt         0.00       0.00      -         0.00      -
nut 002 opt         0.00       0.00      -         0.00      -
mux32 16bit opt     0.01       0.00      -         0.00      -
mux8 64bit opt      0.00       0.00      -         0.00      -
nut 000 opt         0.01       0.01      -         0.00      -
nut 003 opt         0.01       0.01      -         0.01      -
mux64 16bit opt     0.02       0.01      -         0.00      -
mux8 128bit opt     0.02       0.01      -         0.01      -
barrel64 opt        0.01       0.01      -         0.01      -
nut 001 opt         0.03       0.02      -         0.01      -
radar12 opt         0.19       n/a       n/a       0.38      100.0%
radar20 opt         0.89       n/a       n/a       1.61      80.9%
uoft raytracer opt  3.29       4.01      21.9%     10.95     232.8%
AVERAGE                                  21.9%               232.8%

Table 2.4. Unconstrained min-reg runtime, QUIP benchmarks.

Name                 flow time  cs2 time  cs2 ratio  mcf time  mcf ratio
oc ata vhd 3 opt     0.01       0.02      -          0.01      -
oc cfft 1024x12 opt  0.16       0.05      -          0.03      -
oc dct slow opt      0.01       0.00      -          0.00      -
oc ata ocidec3 opt   0.01       0.02      -          0.01      -
oc pci opt           0.04       0.07      1.75       0.06      1.50
oc aquarius opt      0.06       0.25      4.17       0.21      3.50
oc miniuart opt      0.01       0.00      -          0.00      -
oc oc8051 opt        0.02       0.05      2.50       0.10      5.00
oc ata ocidec2 opt   0.00       0.01      -          0.01      -
oc aes core inv opt  0.01       0.03      -          0.05      -
oc aes core opt      0.01       0.02      -          0.02      -
oc vid comp sys h    0.00       0.01      -          0.01      -
os blowfish opt      0.01       0.03      3.00       0.04      4.00
oc wb dma opt        0.05       0.13      2.60       0.10      2.00
oc smpl fm rcvr opt  0.00       0.01      -          0.01      -
oc vid comp sys h    0.00       0.01      -          0.00      -
oc vid comp sys d    0.22       0.27      1.23       0.66      3.00
oc vga lcd opt       0.02       0.06      3.00       0.06      3.00
oc fpu opt           0.08       1.01      12.63      0.31      3.88
oc mem ctrl opt      0.08       0.12      1.50       0.10      1.25
oc des perf opt opt  0.24       0.05      0.21       0.04      0.17
oc ethernet opt      0.03       0.08      2.67       0.09      3.00
oc minirisc opt      0.01       0.01      -          0.01      -
oc hdlc opt          0.01       0.01      -          0.01      -
AVERAGE              1.00                 3.20                 2.75

Table 2.5. Unconstrained min-reg runtime, OpenCores benchmarks.
Name       flow time  cs2 time  cs2 ratio  mcf time  mcf ratio
intel 005  0.00       0.02      -          0.01      -
intel 001  0.01       0.00      -          0.00      -
intel 002  0.00       0.00      -          0.00      -
intel 003  0.01       0.01      -          0.00      -
intel 004  0.00       0.00      -          0.00      -
intel 028  0.97       17.55     18.09      55.90     57.63
intel 029  0.05       0.12      2.40       0.13      2.60
intel 030  0.90       8.26      9.18       34.90     38.78
intel 031  0.04       0.18      4.50       0.12      3.00
intel 013  2.12       11.46     5.41       141.92    66.94
intel 025  0.09       0.32      3.56       0.25      2.78
intel 026  0.03       0.10      3.33       0.07      2.33
intel 027  0.61       3.51      5.75       22.85     37.46
intel 036  0.90       4.82      5.36       31.68     35.20
intel 037  0.78       9.77      12.53      33.03     42.35
intel 038  1.23       22.75     18.50      65.97     53.63
intel 039  1.64       8.90      5.43       56.85     34.66
intel 032  0.09       0.39      4.33       0.36      4.00
intel 033  0.68       6.42      9.44       19.79     29.10
intel 034  0.48       1.08      2.25       2.09      4.35
intel 035  0.94       6.57      6.99       15.60     16.60
intel 014  0.48       7.36      15.33      12.97     27.02
intel 015  0.03       0.14      4.67       0.13      4.33
intel 016  0.24       1.02      4.25       1.24      5.17
intel 009  0.86       6.44      7.49       36.59     42.55
intel 010  0.05       0.16      3.20       0.13      2.60
intel 011  0.04       0.21      5.25       0.12      3.00
intel 012  0.74       9.08      12.27      35.04     47.35
intel 021  0.03       0.14      4.67       0.06      2.00
intel 022  0.04       0.20      5.00       0.12      3.00
intel 023  0.03       0.16      5.33       0.07      2.33
intel 024  0.02       0.10      5.00       0.06      3.00
intel 017  0.03       0.08      2.67       0.08      2.67
intel 018  0.03       0.08      2.67       0.08      2.67
intel 019  0.03       0.10      3.33       0.09      3.00
intel 020  0.02       0.10      5.00       0.06      3.00
intel 042  1.28       9.36      7.31       69.34     54.17
intel 041  1.28       6.00      4.69       62.66     48.95
intel 040  1.63       7.64      4.69       69.38     42.56
intel 006  0.01       0.02      -          0.03      -
intel 007  0.08       0.18      2.25       0.41      5.13
intel 043  1.06       9.58      9.04       46.34     43.72
AVERAGE    1.00                 6.42                 21.66

Table 2.6. Unconstrained min-reg runtime, Intel benchmarks.

The runtimes are presented graphically in Figure 2.12 (the large benchmarks) and Figure 2.13 (the medium-sized benchmarks).

Figure 2.12. The runtime of flow-based retiming vs. CS2 and MCF for the largest designs.
2.5.3 Characteristics

For the average circuit, the unconstrained minimum-register solution had 16.9% fewer latches than the original circuit, though the optimization potential varied greatly with the function and structure of each benchmark. For the synthesis examples (LGsynth, OpenCores, and QUIP), the average reduction was 5.6%; for the verification examples (Intel), the average reduction was 44.3%. If we restrict our attention to only the circuits that saw any improvement in latch count, the result was an average 26.1% reduction. All of the verification circuits had some reduction; only 51.0% of the synthesis examples did. The average non-zero reduction in the synthesis examples was 11.1%. The largest reduction in a verification example was 64.0% and occurred in "intel 002". The largest reduction in a synthesis example was 62.5% and occurred in the OpenCores circuit "oc fpu opt".

Figure 2.13. The runtime of flow-based retiming vs. CS2 and MCF for the medium designs.

Tables 2.7, 2.8, 2.9, and 2.10 describe the characteristics of the retimed results and of the flow-based algorithm. The final number of registers is in column Fin Regs, and the percentage decrease versus the original count (available in Appendix A) is in column % ∆ Regs. The columns # F Its and # B Its give the number of iterations in each direction that reduced the number of registers; this excludes the final dummy iteration in each direction used to detect that the termination condition has been reached. The circuits without any decrease in register count are omitted: for these examples, the number of levels remains unchanged, and both the forward and backward iteration counts were zero. These tables also present the number of levels in the longest path after retiming, Fin Levs, and the percentage change from the length of the longest path before retiming, % ∆ Levs.
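The %∆ columns report signed percentage change. As an illustration using the s382 row of Table 2.7 (18 final registers; the original counts live in Appendix A, and an original count of 21 is inferred here from the reported -14.30%, so it is an assumption of this sketch):

```python
def percent_delta(final, original):
    """Signed percentage change, as in the '%dRegs' and '%dLevs' columns."""
    return (final - original) / original * 100.0

# s382: 18 final registers vs. an assumed original count of 21
delta = round(percent_delta(18, 21), 1)   # -14.3
```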
Because the minimum-register retiming was performed without any constraint on the path length, it most often resulted in an increase in the longest path. In some cases, however, the maximum path length actually decreased: this was the case for most of the Intel verification examples as well as for two of the LGsynth designs.

Name      Fin Regs  %∆Regs   Fin Levs  %∆Levs   #F Its  #B Its
s382      18        -14.30%  17        0.00%    0       1
s400      18        -14.30%  17        0.00%    0       1
s444      18        -14.30%  19        -5.00%   0       1
s641      17        -10.50%  78        0.00%    0       1
s713      17        -10.50%  86        0.00%    0       1
s953      22        -24.10%  28        3.70%    0       1
s5378     140       -14.10%  38        15.20%   1       1
s9234     127       -5.90%   58        5.50%    1       0
s13207    466       -30.30%  54        -8.50%   3       5
s38584.1  1425      -0.10%   70        0.00%    1       0
s38417    1285      -12.30%  75        15.40%   2       2
AVERAGE             -13.70%            2.39%

Table 2.7. Unconstrained min-reg characteristics, LGsynth benchmarks w/ improv.

Name                Fin Regs  %∆Regs   Fin Levs  %∆Levs    #F Its  #B Its
barrel16 opt        32        -13.50%  9         12.50%    1       0
barrel16a opt       32        -13.50%  11        10.00%    1       0
barrel32 opt        64        -8.60%   11        10.00%    1       0
nut 004 opt         167       -9.70%   17        41.70%    2       2
nut 002 opt         195       -8.00%   19        0.00%     1       2
mux32 16bit opt     493       -7.50%   8         33.30%    1       1
mux8 64bit opt      513       -11.40%  6         50.00%    1       1
nut 000 opt         315       -3.40%   55        0.00%     1       2
nut 003 opt         234       -11.70%  40        11.10%    1       1
mux64 16bit opt     975       -6.80%   8         33.30%    1       1
mux8 128bit opt     1025      -11.30%  6         50.00%    1       1
barrel64 opt        128       -5.20%   12        9.10%     1       0
nut 001 opt         437       -9.70%   78        41.80%    2       2
radar12 opt         3767      -2.80%   44        0.00%     1       3
radar20 opt         5357      -10.70%  44        0.00%     2       1
uoft raytracer opt  11609     -11.20%  243       161.30%   3       2
AVERAGE                       -9.06%             29.01%

Table 2.8. Unconstrained min-reg characteristics, QUIP benchmarks.

Figure 2.14 charts the number of forward and backward iterations that were required to reach the optimal retiming against the total design size of each of the benchmarks in Appendix A. Each design is colored according to its benchmark source. While there does not appear to be much of a relationship between individual design size and iteration count, there is some evident correlation with the benchmark sources.
The structure of the design, in which there is some commonality within each benchmark source, is likely to be the primary determinant. For the Intel set, many of the examples were the same underlying circuit with a different safety property at the output.

Name        Fin Regs   % ∆Regs   Fin Levs   % ∆Levs   #F Its   #B Its
intel 001         17   -52.80%         18   -48.60%        1        1
intel 004         38   -56.30%         55   -36.00%        1        1
intel 002         27   -64.00%         43   -40.30%        1        1
intel 003         45   -48.30%         72   -31.40%        1        1
intel 005         67   -60.60%        114   -32.90%        1        1
intel 006        172   -50.90%        181   -48.30%        1        1
intel 024        237   -33.60%        612    -0.30%        1        1
intel 023        238   -33.50%        593    -3.30%        1        1
intel 020        232   -34.50%        622    -0.30%        1        1
intel 017        337   -45.50%        438    -0.50%        1        1
intel 021        242   -33.70%        634    -0.30%        1        1
intel 026        293   -40.40%        660    -0.30%        1        1
intel 018        318   -35.20%        738    -0.30%        1        1
intel 019        336   -34.10%        763    -0.30%        1        1
intel 015        379   -31.50%        918    -1.80%        1        1
intel 022        357   -32.60%        952    -0.20%        1        1
intel 029        388   -31.20%       1007    -0.20%        1        1
intel 031        358   -32.60%        954    -0.20%        1        1
intel 011        360   -32.50%       1001    -0.20%        1        1
intel 010        366   -32.10%        992    -0.20%        1        1
intel 007        593   -54.60%        596   -55.20%        1        1
intel 025        605   -46.00%       1098    -0.20%        1        1
intel 032        635   -33.90%       1785    -0.10%        1        1
intel 016       1306   -43.10%       2469    -0.10%        1        1
intel 034       1257   -61.90%       1297    -1.00%        1        1
intel 014       2317   -46.20%       3068    -4.00%        1        1
intel 035       2357   -46.50%       5946     0.00%        1        1
intel 033       2370   -46.30%       6105     0.00%        1        1
intel 027       2773   -46.10%       3434   -20.80%        1        1
intel 012       3096   -47.40%       3813   -20.20%        1        1
intel 037       3139   -47.00%       3949   -17.80%        1        1
intel 030       2879   -46.70%       7136     0.00%        1        1
intel 009       2881   -46.60%       7134     0.00%        1        1
intel 036       3140   -45.90%       7279     0.00%        1        1
intel 043       3755   -48.00%       4429   -26.30%        1        1
intel 028       3885   -47.80%       4585   -25.90%        1        1
intel 042       4660   -48.30%       5362   -22.00%        1        1
intel 038       4664   -48.20%       5356   -22.10%        1        1
intel 040       4820   -49.30%       5052   -28.70%        1        1
intel 041       4786   -48.40%       5507   -21.40%        1        1
intel 039       4850   -49.00%       5572   -21.40%        1        1
intel 013       7076   -47.00%       8194   -23.60%        1        1
AVERAGE                -44.29%              -13.25%

Table 2.9. Unconstrained min-reg characteristics, Intel benchmarks.

Name                 Fin Regs   % ∆Regs   Fin Levs   % ∆Levs   #F Its   #B Its
oc miniuart opt            88    -2.20%         12    20.00%        1        1
oc dct slow opt           165    -7.30%         21    23.50%        0        2
oc ata ocidec2 opt        283    -6.60%         13     0.00%        1        0
oc minirisc opt           264    -8.70%         28    21.70%        2        1
oc vid comp sys h          47   -20.30%         22    69.20%        1        0
oc vid comp sys h          60    -1.60%         20     0.00%        0        1
oc hdlc opt               374   -12.20%         16    33.30%        1        3
oc smpl fm rcvr opt       222    -1.80%         37     8.80%        1        0
oc ata ocidec3 opt        555    -6.60%         13   -13.30%        1        1
oc ata vhd 3 opt          560    -5.70%         15     0.00%        1        1
oc aes core opt           394    -2.00%         13     0.00%        0        1
oc aes core inv opt       658    -1.60%         13     0.00%        1        1
oc cfft 1024x12 opt       704   -33.00%        157   647.60%       13        1
oc vga lcd opt           1078    -2.70%         34     0.00%        2        1
os blowfish opt           827    -7.20%         36    -2.70%        1        0
oc pci opt               1308    -3.40%         46     0.00%        2        1
oc ethernet opt          1259    -1.00%         33     0.00%        1        2
oc des perf opt opt      1088   -44.90%         81  1520.00%       16        0
oc oc8051 opt             739    -2.00%         60    15.40%        1        1
oc mem ctrl opt          1812    -0.70%         32     0.00%        1        1
oc wb dma opt            1749    -1.50%         33    83.30%        1        1
oc aquarius opt          1474    -0.20%         99     0.00%        1        0
oc fpu opt                247   -62.50%       1080     4.90%        2        0
oc vid comp sys d        2305   -35.10%         38    22.60%        1        1
AVERAGE                         -11.28%              102.26%

Table 2.10. Unconstrained min-reg characteristics, OpenCores benchmarks.

For the set of benchmarks that we examined, the number of iterations required was small: the average was 2.7 (with an average of 1.5 forward and 1.2 backward). The maximum for any design was 16. Furthermore, not only is the number of iterations small, but the majority of the reduction comes in the earliest iterations. Figure 2.15 illustrates the fraction of the total register reduction that was contributed by each iteration for each benchmark source. F0 is the contribution of the first forward iteration, F1 the second forward iteration, and F2+ all other forward iterations; the backward iterations are labelled similarly. Almost all of the reduction in the number of registers has occurred after the first iteration in each direction.
The number of iterations can be bounded as necessary to control the runtime without sacrificing much of the improvement.

Figure 2.14. The distribution of design size vs. total number of iterations in the forward and backward directions.

Figure 2.15. The percentage of register savings contributed by each direction / iteration.

2.5.4 Large Artificial Benchmarks

Because the runtimes of the benchmarks in Appendix A are relatively fast, a set of larger artificial circuits was created by combining the OpenCores benchmarks in Table A.3. As the number of retiming iterations required appears to be independent of the circuit size (probably because the maximum latency around any loop or from input to output is also size-independent), the circuits "large1" and "large2" were constructed via parallel composition to preserve this property. The 2 and 4 million gate circuits, "larger5" and "larger6", were generated similarly. In contrast, the two circuits "deep3" and "deep4" were built by splitting and serially composing the components. The runtimes of our flow-based algorithm are compared to those of the minimum-cost network-flow-based solution with CS2 in Table 2.11.

                                          CS2           Flow-Based
Name       Nodes    Init Regs  Fin Regs   Time    #F Its  #B Its    Time    Incr
large1    1 006 k     72.9 k    66.9 k   147.9s      3       3     33.0s   4.48x
large2    1 005 k     82.7 k    76.9 k   131.3s      3       3     24.5s   5.36x
deep3     1 010 k     74.7 k    67.6 k   182.0s      3      21     34.2s   5.32x
deep4     1 074 k     86.4 k    82.0 k   130.3s      3       3     17.9s   7.27x
larger5   2 003 k    151.1 k   139.5 k   410.6s      3       3     67.2s   6.11x
larger6   4 008 k    300.1 k   279.0 k   818.3s      3       3    139.9s   5.85x

Table 2.11. Unconstrained min-reg runtime, large artificial benchmarks.

2.6 Summary

The contribution of this chapter is a new algorithm for computing a minimum-register retiming without any constraints. This has useful applications in physical design, verification, and synthesis. The improvements over previous techniques are:

Faster Runtime. The worst-case bound is non-comparable (i.e.
neither better nor worse) to existing minimum-cost network circulation-based approaches, but the empirical runtime comparison on both synthesis and verification examples is favorable. We measure an average improvement of 4.6x and 14.9x over two state-of-the-art solution methods. The absolute runtime is also quite fast: less than 0.10s of CPU time for 76% of the benchmarks and a maximum of 3.29s.

Scalable Effort. Each iteration of the single-frame register minimization problem produces a result that is strictly better than the previous one. The algorithm can therefore be terminated after an arbitrary number of iterations with partial improvement. This feature is important for guaranteeing scalability in runtime-limited applications. Furthermore, our experience indicates that the vast majority of the improvement comes from the first iteration in each direction. This incrementality is not an obvious feature of the alternative minimum-cost network circulation-based approaches.

Minimal Perturbation. In general, there are multiple optimal minimum-register retiming solutions. Ours returns the one with the minimum register movement. This feature is important to minimize the perturbation of the netlist and avoid unnecessary synthesis instability.

Extensible Problem Representation. The maximum-flow-based formulation provides a framework into which necessary problem constraints can be easily incorporated. We will see two examples of this in Chapters 3 (timing constraints) and 4 (initializability constraints).

Chapter 3

Timing-Constrained Min-Register Retiming

In this chapter we extend the flow-based algorithm of Chapter 2 to include constraints on both the worst-case minimum and maximum propagation delays in the problem of minimizing the number of registers in a circuit under retiming. For synthesis applications, these constraints are critical to ensure the timing correctness of the resulting circuit. Again, we assume that the retiming transformation is understood.
The reader may review Section 1.2.1 for more background on retiming. The content of this chapter also depends on the maximum-flow-based formulation of minimum-register retiming from Chapter 2. An understanding of that material is prerequisite.

The chapter begins in Section 3.1 by defining the problem of register minimization under delay constraints. The existing approaches to solving this problem are described in Section 3.2. Our new maximum-flow-based approach is described in Section 3.3, including a few examples. Analyses of the correctness, complexity, and limitations are presented in Section 3.4. Some experimental results are contained within Section 3.6.

3.1 Problem

The algorithm described in Chapter 2 finds the minimum number of registers, but without regard to the effect on the lengths of the longest and shortest delay paths. In most synthesis applications (as opposed to verification), it is necessary to introduce constraints on the minimum and maximum combinational path delays. This problem is known as timing-constrained minimum-register retiming. Its computational difficulty exceeds that of both the minimum-register and minimum-delay problems.

Let d(u, v) be the minimum combinational path delay along u ⇝ v and D(u, v) be the maximum combinational path delay along u ⇝ v. We do not make any assumptions about the timing model. Let Dk(u, v) be the maximum combinational delay along a path u ⇝ v that passes through exactly k registers. Therefore D(u, v) ≡ D0(u, v). Let the number of registers, the sequential latency, along a path u ⇝ v be δ(u, v).

The timing constraints arise from the setup and hold constraints at the register inputs. The inputs must be stable for a defined period both before and after each clock edge. Under a given clock period, the setup and hold constraints lead to limits on the worst-case propagation delays along the longest and shortest combinational paths that terminate at each register.
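To make the path-delay notation concrete, the quantity Dk(u, v) can be computed by dynamic programming over a topological order of the netlist. The sketch below is illustrative only and is not the dissertation's code; the graph encoding (fan-in lists, per-edge register counts, per-node delays) and the toy netlist are invented for the example.

```python
def max_delay_k(order, fanin, delay, regs, src, kmax=1):
    """best[v][k] = maximum delay of any path src ~> v crossing exactly k
    registers, or -inf if no such path exists. `order` is a topological
    ordering; `regs[(u, v)]` is the register count on edge u -> v."""
    NEG = float("-inf")
    best = {v: [NEG] * (kmax + 1) for v in order}
    best[src][0] = delay[src]
    for v in order:
        for u in fanin.get(v, []):
            r = regs.get((u, v), 0)        # registers on edge u -> v
            for k in range(kmax + 1 - r):  # k + r must stay <= kmax
                if best[u][k] > NEG:
                    cand = best[u][k] + delay[v]
                    if cand > best[v][k + r]:
                        best[v][k + r] = cand
    return best

# Toy chain a -> b -> [register] -> c with unit gate delays:
best = max_delay_k(["a", "b", "c"], {"b": ["a"], "c": ["b"]},
                   {"a": 1, "b": 1, "c": 1}, {("b", "c"): 1}, src="a")
# best["c"][1] == 3: the only path to c crosses one register
```

Note that, as in the two-pass analysis later in this chapter, the delay accumulates across the register: D1 measures the total combinational delay on both sides of the crossed register.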
3.2 Previous Work

3.2.1 LP Formulation

The existing timing-constrained min-register algorithms utilize an extension of the linear program described in Section 2.2.1. Recall that G = <V, E, wi> is a retiming graph, r(v) is a retiming lag function, and wi(u, v) is the initial number of registers present on each edge u → v.

To model the timing constraints, we define two matrices Wi(u, v) : V × V → Z and D(u, v) : V × V → ℜ. Wi(u, v) is the minimum total retiming weight along any path u ⇝ v, and D(u, v) is the worst-case total combinational delay along this path. Because wi(u, v) ≥ 0 along each edge and path delays are non-negative around any cycle, only acyclic paths u ⇝ v need to be considered in the generation of Wi and D. The constraint on the resulting maximum path length can then be expressed as in Equation 3.2 in terms of the desired clock period T. (Equation 3.1 is the previously introduced constraint on non-negative register weight.)

r(u) − r(v) ≤ wi(u, v)                      ∀ u → v    (3.1)

r(u) − r(v) ≤ Wi(u, v) + 1 − D(u, v)/T      ∀ u ⇝ v    (3.2)

The fundamental bottleneck of this approach lies in the enumeration and incorporation of all pair-wise delay constraints u ⇝ v. In the original algorithm, all connected pairs were examined, resulting in an O(V³) procedure. The bound can be improved slightly: an improved algorithm can accomplish this in O(V E) time.

3.2.2 Minaret

It is not actually necessary to include a delay constraint for every possible pair of connected vertices. Constraints that are redundant or will never become critical can be ignored. Path containment presents one opportunity to prune constraints. A technique to ignore redundant edges on-the-fly is described by [?]. The Minaret algorithm [46] provided a leap forward by using a retiming-skew equivalence to restrict the constraint set to those that are potentially timing-critical.
This is accomplished by computing the minimum and maximum clock skews for each register and using these values to bound the permissible retiming locations. Pair-wise delay constraints need not be calculated along paths that always contain a register. One such unnecessary constraint is depicted in Figure 3.1. The earliest and latest skew values for register R are τASAP(R) and τALAP(R), respectively, and these values can be used to derive a corresponding ASAP and ALAP retiming position of R. Because R will always lie between nodes u and v, it is not necessary to explore or incorporate a timing constraint between these nodes.

Figure 3.1. Bounding timing paths using ASAP and ALAP positions.

3.3 Algorithm

Our algorithm improves upon Minaret by only enumerating the non-conservative constraints in the subset of the circuit that is both timing- and area-critical. Only constraints that lie in the path to area improvement need to be generated. The others cannot be ignored, but are instead replaced with a fast but safe approximation. The number of constraints that must be generated and introduced is much smaller. Only the intersection between the area- and timing-critical regions must be treated in detail.

An additional advantage, critical for industrial scalability, is that the optimum solution is approached via a set of intermediate solutions, which are monotonically improving and always timing-feasible. Thus, the algorithm can be terminated at any point with an improved timing-feasible solution. While an early termination of the unconstrained problem would only become necessary for extremely large designs or very tight runtime constraints, the increased difficulty of timing-constrained minimum-register retiming makes runtime a potential concern. Finally, short-path timing constraints are also handled.

The timing constraints are defined as follows.
Consider the presence of a register R on any allowable net n in the design. Let A^max_n be the maximum allowable arrival time (such that the setup constraint of R is met) and A^min_n be the minimum allowable arrival time (such that the hold constraint of R is met). In a simple application, the maximum arrival time constraints at every net would be uniformly equal to the clock period and the minimum arrival constraints zero. In a more precise application, these values would include local variations such as the estimated local unwanted clock skew (δclk), the timing parameters of the register cell appropriate to drive the capacitive load (e.g. setup Sn and hold Hn), and the maximum period of the local clock domain (Tclk). Equations 3.3 and 3.4 suggest definitions in terms of these parameters.

A^max_n = Tclk − Sn − δclk    (3.3)

A^min_n = Hn − δclk           (3.4)

If there is physical information available, the location of net n can be used to estimate the location of a register that is retimed to its output. This can be invaluable for refining the estimates of the above parameters due to any dependence on spatial location. One example of this is the local clock skew δclk. If the register's physical location is changed, it will require either: (i) additional clock routing latency or (ii) reconnection to a different branch of the clock distribution network with a different nominal value and variation of its latency. Other spatially-dependent effects might include manufacturing variation, supply voltage variation, and budgets for wire delays due to known local placement or routing congestion.

An important prerequisite is that the initial positions of the registers meet these maximum and minimum arrival constraints. If it is desired that a higher frequency be achieved through retiming, the design would need to be retimed first by one of the many delay-minimizing retiming algorithms [8] [47] [48] [49], among which efficient exact and heuristic solutions are available.
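Equations 3.3 and 3.4 can be sketched directly; the numeric parameter values below (clock period, setup, hold, skew) are invented for illustration and would in practice come from the cell library and the clock network.

```python
# Hedged sketch of Equations 3.3 and 3.4; all parameter values are invented.
def arrival_bounds(t_clk, setup, hold, skew):
    """Allowable arrival-time window for a register retimed onto a net n."""
    a_max = t_clk - setup - skew   # Eq. 3.3: latest arrival meeting setup
    a_min = hold - skew            # Eq. 3.4: earliest arrival meeting hold
    return a_min, a_max

a_min, a_max = arrival_bounds(t_clk=10.0, setup=0.5, hold=0.2, skew=0.3)
```

A register is then only admissible on a net whose (estimated) arrival time falls inside the [a_min, a_max] window.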
3.3.1 Single Frame

Consider retiming one or more registers in either direction within one combinational frame of the circuit (as described in Section 2.4.1) to the output net of some node v in the circuit. Let Rv be the potential new register. There are four timing constraints that are affected by this move: the latest and earliest arrival times on the timing paths that start and end at the retimed register. Depending on the direction of the move, two of these constraints become potentially critical; the other two can be ignored. Observe that the degree of criticality of these constraints is strictly increasing with the distance that the registers move.

We introduce two versions of each of these constraints: conservative and exact. In the conservative version, it is assumed that the end of the timing path opposite the moving register remains fixed. This is an over-constraint: the other register may have moved also in the same direction, thereby relaxing the timing criticality. In the exact constraints, we also consider the position of the register at the other end of the timing path and specify the location to which it must be moved for the path to meet the constraint.

Conservative Constraints

The set of conservative constraints is Ccons ⊆ V. The set Ccons defines the vertices past which a register cannot be retimed without violating a delay constraint. A node v is marked as being conservatively constrained if there exists some register Ru such that D1(Ru ⇝ v) > A^max_v or d0(v ⇝ Ru) < A^min_Ru.

The entire set of conservative constraints can be computed in O(E) time with a static timing analysis (STA) of the original circuit. The short-path constraints can be identified in one pass. The long-path constraints require two passes (to capture the components of the path on either side of the moved register); register output arrivals are seeded with their input arrivals from the first pass and then those values are propagated forward. This process is illustrated in Figure 3.2.
Each gate in this example is assumed to have unit delay and is labelled with its arrival time. In Figure 3.2(i), the first pass, all register outputs have an initial arrival time of zero, and STA is used to propagate the arrival times forward to the outputs of each combinational gate. In Figure 3.2(ii), the second pass, the register outputs are seeded with the arrival times at their inputs in the previous pass, and another pass of STA is applied. The resulting labels at each node v are exactly max_{Ru ∈ R} D1(Ru, v). If the maximum delay constraint is 5, the nodes labelled with arrival times higher than 5 will be conservatively constrained.

Figure 3.2. The computation of conservative long path timing constraints.

A conservative timing constraint can be enforced by simply redirecting the fan-ins of the constrained node to the flow sink (or, equivalently, increasing the capacity of its flow constraint to infinity). Afterward, these timing-constrained nodes will not participate in the resulting minimum cut. This operation is depicted in Figure 3.3: the conservative constraints on nodes v1 and v4 in sub-figure (i) are implemented by modifying the flow graph to that of sub-figure (ii). The redirected edges are highlighted.

Figure 3.3. The implementation of conservative timing constraints.

Exact Constraints

The set of exact constraints Cexact ⊆ V × V does not presume that the other end of a timing path has remained stationary. Each exact constraint encodes the position u to which the register on the other end of a timing path would have to move for a new register Rv on the output of node v to be timing-feasible. The set of exact constraints Cexact defines the vertex pairs (u, v) that describe these dependencies. These can be computed easily: the exact constraints of vertex v are the nodes U described by Equations 3.5 and 3.6 and are exactly the base of the transitive fan-in/out cones whose combinational "depth" is the arrival bound and which cross exactly one register.
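The two-pass computation behind Figure 3.2 can be sketched as follows. This is an illustrative reimplementation, not the dissertation's code; the netlist encoding and the choice of seeding primary inputs at time zero are assumptions made for the example.

```python
def sta_pass(order, fanin, delay, seed):
    """One forward pass of static timing analysis over a topologically
    ordered netlist; seeded nodes keep their given arrival times."""
    arr = dict(seed)
    for v in order:
        if v in arr:
            continue
        arr[v] = max(arr[u] for u in fanin[v]) + delay[v]
    return arr

def conservative_nodes(order, fanin, delay, reg_in_of, pis, a_max):
    """Two-pass computation of max_{Ru} D1(Ru, v); returns the nodes that
    would be conservatively constrained. `reg_in_of` maps each register
    output node to the node driving the register's input."""
    seed = {p: 0.0 for p in pis}              # assumption: PIs arrive at 0
    seed.update({q: 0.0 for q in reg_in_of})  # pass 1: register outputs at 0
    arr1 = sta_pass(order, fanin, delay, seed)
    seed2 = {p: 0.0 for p in pis}
    # pass 2: register outputs re-seeded with their input arrivals from pass 1
    seed2.update({q: arr1[d] for q, d in reg_in_of.items()})
    arr2 = sta_pass(order, fanin, delay, seed2)
    return {v for v in order if arr2[v] > a_max}

# Toy chain: pi -> g1 -> [register q] -> g2 -> g3, unit gate delays
order = ["pi", "g1", "q", "g2", "g3"]
fanin = {"g1": ["pi"], "g2": ["q"], "g3": ["g2"]}
delay = {v: 1.0 for v in order}
over = conservative_nodes(order, fanin, delay, {"q": "g1"}, ["pi"], a_max=2.5)
```

With the bound of 2.5, only the deepest gate exceeds the constraint, mirroring how the second-pass labels become max D1(Ru, v) rather than plain arrival times.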
The total time to enumerate all possible exact constraints for all nodes is O(V E).

Umin = {u : d1(v ⇝ u) > A^min_v ∧ ∀ u′ ∈ fanin(u), d1(v ⇝ u′) ≤ A^min_v}     (3.5)

Umax = {u : D1(v ⇝ u) > A^max_v ∧ ∀ u′ ∈ fanout(u), D1(v ⇝ u′) ≤ A^max_v}    (3.6)

An example of the computation of a set of exact constraints is depicted in Figure 3.4 for node vc. The cone of D1(u ⇝ vc) ≤ 3 is colored orange. The resulting exact constraints are formed between the base (fan-ins) of this cone and vc: (v1, vc) and (v2, vc). It is helpful to note that several nodes at the end of a path of length 3 do not spawn exact constraints: vi, because the path vi ⇝ vc contains two registers, and vj and vk, because the paths to vc do not cross any registers.

Figure 3.4. The computation of exact long path timing constraints.

Enforcement of the exact timing constraints is accomplished by introducing additional unconstrained flow edges into the graph. An edge n → m is added for every exact constraint, where n is the potential new register position and m is the point to which the register boundary must also move. This is depicted in Figure 3.5, where m ≡ v4 and n ≡ v2. By Property 1, these unconstrained flow edges will restrict the resulting cut to exclude exactly the cases that are timing-infeasible; the cut will be the optimally minimum one that meets the exact constraints.

Figure 3.5. The implementation of exact long path timing constraints.

Because the depth of the cone is at least the current period, each timing arc will terminate at a node that is never topologically deeper than its source. The arcs will therefore always be in the direction from sink to source, and the flow from source to sink will remain finite. This motivates the requirement that the circuit initially meets the timing constraints.

There may be cycles within the set of constraints; this occurs whenever a critical sequential cycle is present in the netlist.
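The cut-restriction mechanism can be demonstrated with a generic max-flow/min-cut computation. The sketch below is a textbook Edmonds-Karp implementation on an invented diamond-shaped flow graph, not the dissertation's flow solver; it illustrates the Property 1 behavior that an infinite-capacity arc v → u forces the min cut to place v on the source side only if u is on the source side as well.

```python
from collections import deque, defaultdict

INF = float("inf")

def add_edge(cap, u, v, c):
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {}).setdefault(u, 0)   # residual reverse edge

def source_side_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the source side of a minimum cut
    (the nodes reachable from s in the final residual graph)."""
    flow = defaultdict(int)
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:          # BFS for an augmenting path
            u = q.popleft()
            for v, c in cap[u].items():
                if v not in parent and c - flow[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] - flow[(u, v)] for u, v in path)
        for u, v in path:                     # push flow along the path
            flow[(u, v)] += aug
            flow[(v, u)] -= aug
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v, c in cap[u].items():
            if v not in seen and c - flow[(u, v)] > 0:
                seen.add(v)
                q.append(v)
    return seen

# Invented diamond graph: the cheapest cut initially crosses s->u and v->t.
cap = {}
add_edge(cap, "s", "u", 1)
add_edge(cap, "s", "v", 3)
add_edge(cap, "u", "t", 3)
add_edge(cap, "v", "t", 1)
before = source_side_min_cut(cap, "s", "t")   # {"s", "v"}
# An unconstrained "timing arc" v -> u: the cut may place v on the source
# side only if u is too, so the minimum cut retreats to {"s"}.
add_edge(cap, "v", "u", INF)
after = source_side_min_cut(cap, "s", "t")    # {"s"}
```

In the retiming flow graph, such infinite-capacity arcs implement the exact constraints, while conservative constraints instead redirect a node's fan-ins to the sink, as described above for Figure 3.3.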
A correct result is such that moves of registers within a critical cycle are synchronized.

Timing Refinement

The overall algorithm (Algorithm 7) consists of an iterative refinement of the conservatism of the timing constraints until the optimal solution has been reached for a combinational frame. Conservative constraints are removed and replaced with exact ones. As we will see, the refinement need only be performed for regions whose timing conservatism is preventing further area improvement, i.e. that are area-critical.

Let Ccons be the current set of conservative constraints and Cexact be the current set of exact constraints. The algorithm begins by initializing Cexact to be empty and Ccons to be complete (that is, all nodes which could possibly be conservatively constrained).

Algorithm 7: Single-Frame Flow-based Min-register Retiming: FRETIME1()
  Input : a combinational circuit graph G = <V, E>
  Output: a retiming cut R
  let Ccons = {n ∈ V : ∃m s.t. d1(m ⇝ n) ≥ A^max_n}
  let Cexact = ∅ ⊆ V × V
  repeat
      compute min cut Runder under Cexact
      compute min cut Rover under Cexact ∪ Ccons
      Ntighten ← TFI(Runder) ∩ TFO(Rover) ∩ Ccons
      forall n ∈ Ntighten do
          Plong ← {m ∈ V : A^max_n − d_m < d1(m ⇝ n) ≤ A^max_n}
          Pshort ← {m ∈ V : A^min_m − d_m ≤ d1(n ⇝ m) < A^min_n}
          Ccons ← Ccons − n
          Cexact ← Cexact ∪ (n × Plong) ∪ (n × Pshort)
  until Ntighten = ∅
  return R = Rover = Runder

During an iteration, each net will be in one of three states: (i) none, if retiming a register to this net will not introduce a timing violation, (ii) conservative, or (iii) exact.

The area-critical region's refinement is accomplished as follows. We compute two minimum cuts: Runder, the minimum cut under only the (current) exact constraints, and Rover, the minimum cut under both the exact and conservative constraints. Note that Runder is under-constrained and Rover is over-constrained.
Therefore, Runder will be at least as deep as Rover, whose tighter constraints prevent the registers from being pushed as far, due to the monotonically increasing criticality of both types of timing constraints with greater register movement.

The vertices whose timing is to be tightened are those that are conservatively constrained and lie (topologically) between the two cuts. The nodes shallower than Rover are not of interest because the over-constrained solution already lies beyond them. Likewise, the nodes deeper than Runder are not of interest because the under-constrained solution does not reach them. The exact constraints are computed for each of the tightened nodes and inserted into Cexact, and the nodes are removed from Ccons. The refinement terminates after an iteration if no new constraints are introduced or, equivalently, when Runder = Rover.

3.3.2 Multiple Frames

The complete algorithm proceeds identically to the unconstrained version: the single-frame procedure is iterated in both the forward and backward directions until a fixed point is reached. To differentiate the steps of timing refinement from the iteration across multiple frames, we refer to the refinement steps as sub-iterations and the computation on each frame as an iteration.

3.3.3 Examples

Figure 3.6 illustrates the application of timing-constrained minimum-register retiming to the depicted circuit. It is assumed that each buffer has a unit delay: the maximum path length is 3 units. The maximum delay constraint A^max at all points is 3; there are no minimum delay constraints (or equivalently, they are −∞). The arrival time at the primary inputs is undefined and assumed to be −∞.

The maximum D1(u, n) values of the two-pass timing analysis are labelled below each of the combinational nodes n in Figure 3.6(i). Because the values at nodes n4 through n7 are greater than the maximum delay constraint, these nodes are marked as being conservatively constrained.
The first iteration of forward min-register retiming is performed. In the first constraint refinement sub-iteration (Figure 3.6(ii.a)), the over- and under-constrained cuts (Rover and Runder, respectively) are computed. In this case, Rover is identical to the original positions of the registers and has a cut width of 4; Runder lies at a position with a cut width of 3. Because node n7 lies between these two cuts and is conservatively constrained, it is a target for refinement. The base U of the fan-in cone with a combinational delay of A^max and a sequential delay of 1 (s.t. D1(u ⇝ n7) = A^max) is collected; in this case, U = {n4}.

Figure 3.6. An example of timing-constrained min-register forward retiming.

This introduces one new exact constraint, for which a corresponding unconstrained edge is added to the flow graph: from n7 → n4. The conservative constraint on n7 is removed. This concludes the first sub-iteration.

In the second refinement sub-iteration (Figure 3.6(ii.b)), we compute a new Runder and Rover. The addition of the exact constraint has altered the location of Runder. However, because the two cuts are not identical, we continue with refinement on the set of conservatively constrained nodes between the two cuts. There must exist another (different) conservatively constrained node that is limiting area improvement, which in this case is node n4. The one exact constraint on n4 arises from n1, and an unconstrained edge n4 → n1 is added to the flow graph. The conservative constraint on n4 is removed. This concludes the second sub-iteration.

In the third and final refinement sub-iteration (Figure 3.6(ii.c)), we compute the new Runder and Rover. In this case, we find that they have both moved and are now identical. There are still conservative timing constraints on nodes {n2, n3, n5, n6}, but these do not affect the register minimization and are not area-critical.
We can terminate the refinement with the expected minimum number of registers that meets all timing constraints.

Although not illustrated, the global minimization algorithm would perform another forward iteration and one backward iteration. As there is no further register minimization possible, it is not necessary to perform any further timing refinement. In both cases, Rover and Runder will be immediately identical and equal to the same positions of the registers as after the first forward iteration. The final solution has 3 registers and successfully meets all constraints with a maximum path delay of 3.

Figure 3.7 illustrates the application of timing-constrained minimum-register retiming to a circuit with a critical cycle. Again, it is assumed that each buffer has a unit delay: the maximum path length is 2 units. The maximum delay constraint is uniformly 2; there are no minimum delay constraints (or equivalently, they are −∞).

Figure 3.7. An example of timing-constrained min-register retiming on a critical cycle.

In the first refinement sub-iteration, the conservative constraint on node n2 in Figure 3.7(ii.a) is replaced by an exact constraint n2 → n4 in (ii.b). Then, in the second sub-iteration, the conservative constraint on n4 is replaced with the exact constraint n4 → n2. There is now a cycle in the flow graph! The effect of this is that a register cannot be retimed past any of the nodes in the cycle unless registers are retimed past all of them. In terms of the original circuit in Figure 3.7(i), the consequence is that the registers are forced to move around the cycle in "lock-step". This is the expected behavior for such a critical cycle. If the minimum retiming required the shift to extend over multiple combinational nodes, multiple disjoint cycles of unconstrained edges would be introduced into the flow graph. The final solution in Figure 3.7(ii.c) is reached in the next refinement sub-iteration.
The resulting single-frame solution has 2 registers and a maximum delay of 2. This is also the globally minimum multi-frame solution.

3.4 Analysis

3.4.1 Proof

In this section, we prove two facts about our flow-based retiming algorithm: (i) the result is correct, meeting both the constraints on functionality and timing, and (ii) the result has the optimally minimum number of registers.

Correctness: Functionality

As the timing-constrained flavor of the minimum-register retiming algorithm is a more strictly-constrained version of the approach in Chapter 2, the proof of functional correctness proceeds identically to that of Section 2.4.1. We refer to that explanation for more detail.

Correctness: Timing

Theorem 8. Algorithm 7 meets the latest arrival time constraint A^max_r and earliest arrival time constraint A^min_r at every register r.

Proof. Without loss of generality, let us consider the long-path constraints during forward retiming. The proof is similar for backward retiming and short-path constraints. Let vR be the node that drives the input of a new register R. At the end of the iteration that led R to be retimed to the output of vR, we know that vR was in one of two states: (i) it had never been constrained, or (ii) it was originally conservatively constrained but is now exactly constrained. vR must not have remained subject to a conservative constraint, because the cut cannot lie at the output of vR due to the unconstrained flow path along the timing edge in the flow graph from vR to the sink. We show that in both of the possible cases there exists no long combinational path D0(u, vR) for any u that violates the timing constraint at R.

• Case (i): Never having been constrained implies that there existed no long path D1(Ru, vR) > A^max_vR for any register Ru.
Because register R had been retimed from the transitive fan-in of vR to its output, every combinational path D0 now corresponds to a previous D1 (though the converse is not necessarily true). Because there existed no D1 with length greater than the constraint, there is now no combinational path that violates it.

• Case (ii): Having been originally conservatively constrained implies that there was some path p = u ⇝ vR such that D1(u, vR) > A^max_vR. Let the single register along this path be R1. Because R1 will be retimed forward to R, δ(p) could potentially be reduced to zero, thereby creating a long-path timing violation. We must show that δ(p) remains > 0.

Every such path must contain two consecutive nodes u′ and u′′ such that D1(u′, vR) > A^max_vR ∧ D1(u′′, vR) ≤ A^max_vR. The first node u′ must always exist, as u itself satisfies this condition. The second node u′′ must also exist: the zero-register path delay D0(R1, vR) is ≤ A^max_vR because the original circuit meets timing. There must therefore be some node between u and R1 where the length of the path is ≤ A^max_vR and crosses exactly one register (i.e. R1). The node u′ happens to be the vertex that satisfies the condition (from Equation 3.6) that generates an exact constraint when vR is tightened. An unconstrained edge will be added from vR → u′ in the flow graph. From Property 1, this imposes the condition that the cut lies beyond vR only if it lies beyond u′. That is, R is retimed to vR only if another register is retimed past u′. This adds one unit of sequential latency to p and ensures that δ(p) > 0.

Because there exists no path delay D1(u, vR) that becomes combinational and introduces a timing violation D0(u, vR) > A^max_vR, the timing at register R will therefore always be correct.

Optimality

Theorem 9. Algorithm 7 results in the minimum number of registers of any retiming that meets the given timing constraints.

Proof. Consider a counterexample.
If Cresult is the cut returned by any iteration of our algorithm, let C′ be the smaller cut that also meets all of the timing constraints. We can show that this situation will never arise. By minimum-cut maximum-flow duality, Cresult is larger than C′ because the corresponding flow graph has one or more additional paths from the source to the sink. Each one of these paths must also cross C′. Because the structural edges in the two flow graphs are identical (because the circuits are the same), the additional edge across C′ must be due to a timing constraint. This was the result of the implementation of either a conservative or an exact constraint.

If this edge were due to a conservative constraint, it must have originated at a node vcons that lies topologically between Cresult and C′. (It must be deeper than Cresult, as the unconstrained edge in question could not have been part of the finite-width cut Cresult.) However, this situation violates the termination condition of our algorithm: vcons is a conservatively constrained node that lies between the current over-constrained cut and a strictly smaller under-constrained cut Cresult.

If this edge were due to an exact constraint, let that particular timing arc be e = uS → uR. Because this adds additional flow from S → R (where S is the subgraph partition created by C′ that lies closer to the source; R is the other), this implies that uS ∈ S and uR ∈ R. Therefore, in the C′ solution a register is being retimed past uS but not uR. This explicitly violates the corresponding timing constraint and would imply that C′ is not timing-feasible.

Therefore, there can exist no such cut C′ that has fewer registers and also meets the timing constraints. The result returned by the algorithm is optimal.

3.5.1 Complexity

The complexity of the original formulation of delay-constrained minimum-register retiming is limited by the pair-wise delay constraints.
While we significantly improve upon the average runtime by identifying many cases that do not require the enumeration of these constraint pairs, the worst case is still limited by this quadratic behavior. An antagonistic circuit (with a very wide and interconnected structure) can be constructed with V nodes and V^2 critical delay constraints. In our algorithm, each delay constraint pair results in an additional edge in the flow graph. Because the complexity of computing the minimum-cut is O(RE), the addition of these delay constraints increases the worst-case runtime to O(RV^2). (The number of original edges E is at most V^2 and thus subsumed by this quantity.)

The minimum-cut needs to be computed (twice) in each timing refinement sub-iteration. An antagonistic circuit (with a very long and narrow structure) can be constructed that results in the refinement of only one node per sub-iteration. The worst-case total runtime would then be O(RV^3) per iteration. It should be noted that the structures which result in a large number of sub-iterations are apparently complementary to those that result in a quadratic number of delay constraints. It may be possible to use this fact to prove a tighter bound than O(RV^3).

Finally, the number of iterations of the main loop remains at most R. The final worst-case bound is therefore O(R^2 V^3). This is worse than the unconstrained version of the problem by a factor of V.

3.6 Experimental Results

We utilize an experimental setup identical to that described in Section 2.5.

                      Original                 Flow-Based  Minaret
Name          Gates  Init Regs  Delay  Final Regs  Runtime  Runtime
s38417        19.5k       1465     54        1288    2.08s    17.8s
b17 opt       49.3k       1414     44        1413     6.9s   227.6s
mux8 128bit    7.8k       1155     14        1149    0.07s     0.5s
oc cfft       19.5k       1051    111         874    12.1s    769.s
oc des perf   41.3k       1976     31        1920    10.2s   114.6s
oc pci        19.6k       1354     88        1311    0.10s    33.8s
oc wb dma     29.2k       1775     36        1754    0.24s    24.6s
oc vga        17.1k       1108    123        1079    0.10s    30.6s
MEAN                                                    1x     102x

Table 3.1.
Delay-constrained min-reg runtime vs. Minaret.

3.6.1 Runtime

One of the primary contributions of the flow-based timing-constrained minimum-register retiming approach is the reduction in the computational effort required to compute the optimal minimum-register solution. In this section, we contrast the runtime of our approach against the best-known available alternative.

The Minaret tool [46] serves as the primary comparison point as the best-known publicly-available software for this problem. We greatly appreciate the source being made available by the authors, which was then recompiled on our experimental platform. Minaret worked well for the packaged benchmarks, but we encountered some problems when applying it to the circuits in our benchmark suites. Because of these errors (which we were not able to correct), not all of the circuits in Appendix A are available as comparison points.

Table 3.1 compares the performance of our timing-constrained algorithm against Minaret. The test cases presented are the ones with over 1000 registers that were processed by Minaret without error. The maximum arrival constraint Amax for every node was set to the initial circuit delay, and minimum delay constraints Amin were set to negative infinity (because these are not supported by Minaret). The runtimes of both Minaret and our flow-based method are listed. The average runtime of Minaret is 102x that of our tool.

We implemented a unit timing model for comparison with Minaret (hence the integer worst-case delay values), but the algorithm can be used with one that is much more descriptive. A second implementation used a standard load- and slew-dependent interpolating table lookup to compute path delays. Because computation of timing data dominates the runtime, this extra effort increased the runtime to 5x that of the unit delay version.
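A load- and slew-dependent table lookup of this kind amounts to bilinear interpolation over a 2-D characterization table. The following is a minimal sketch of that interpolation; the axis values and delay entries are made-up illustration data, not from any real library or from our implementation:

```python
from bisect import bisect_right

def table_delay(delays, loads, slews, load, slew):
    """Bilinearly interpolate a gate delay from a 2-D characterization
    table indexed by output load and input slew."""
    def bracket(axis, x):
        # Index of the lower grid point and fractional position within
        # the cell, clamped so out-of-range points extrapolate linearly.
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        return i, (x - axis[i]) / (axis[i + 1] - axis[i])
    i, tx = bracket(loads, load)
    j, ty = bracket(slews, slew)
    return (delays[i][j] * (1 - tx) * (1 - ty)
            + delays[i + 1][j] * tx * (1 - ty)
            + delays[i][j + 1] * (1 - tx) * ty
            + delays[i + 1][j + 1] * tx * ty)

# Made-up 3x3 characterization data: delays (ns) over loads (fF), slews (ns)
loads = [1.0, 2.0, 4.0]
slews = [0.1, 0.2, 0.4]
delays = [[0.10, 0.12, 0.16],
          [0.14, 0.16, 0.20],
          [0.22, 0.24, 0.28]]
print(table_delay(delays, loads, slews, 2.0, 0.2))  # grid point -> 0.16
```

Queries off the grid blend the four surrounding table entries; this is the per-arc evaluation whose cost dominates the 5x runtime increase noted above.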
The load-aware timing analysis can be written to include not only the effects of the capacitive loads on the propagation delays through the combinational elements, but also on the potential positions of the retimed registers. The constraints Amax_v and Amin_v at each node v are those that would apply to a register at its output; these values can be adjusted to include the effects of using a register instead of the existing combinational gate to drive the capacitive load at both ends of the critical path. However, under such a non-linear delay model, the result may be more accurate but can no longer be guaranteed to be optimal.

3.6.2 Characteristics

Tables 3.2, 3.3, and 3.4 present some characteristics of the behavior of the timing-constrained retiming algorithm on the synthesis benchmarks. Again, the initial period of the circuit (column Initial Del) was used as the maximum arrival constraint of each register. The worst-case combinational delay path after retiming is listed in column Final Del; this is verifiably less than the constraint in the previous column. The resulting number of registers is in column Final Regs. Further details about the behavior are captured in the next four columns: Iters, the number of iterations with improvement; Cons, the number of initially conservatively constrained nodes; Refined, the number of the initially conservatively constrained nodes that were refined; and Exact, the number of exact constraints that resulted. For the latter three metrics, the average values of the first forward and backward iteration are presented. Finally, column Time lists the total runtime in seconds.

A closer examination of the number of conservative nodes that needed to be refined in each refinement iteration is presented in Figure 3.8. The upper graph measures the number of refined nodes; the bottom, the number of resulting exact constraints.
These quantities are averaged across all iterations and all non-verification benchmarks, though the forward and backward phases are graphed separately. From these values, we can conclude that the majority of the conservatively-constrained nodes that need to be refined are identified in the first two sub-iterations. Also, it appears that there is a strong relationship between the number of tightened nodes and the resulting number of exact constraints.

            Initial      Final
Name        Regs  Del   Del  Regs  Iters  Cons  Refined  Exact  Time
s641          19   78    78    17      1   111        0      0  0.01
s713          19   86    86    17      1   122        0      0     0
s400          21   17    17    18      2    93        8     38  0.01
s382          21   17    17    18      2    86        7     32     0
s444          21   20    19    18      1   123       25    199     0
s953          29   27    26    24      1   322       68   1100  0.02
s9234        135   55    55   127      1   715        2     14  0.01
s5378        163   33    33   149      2   317        0      0  0.02
s13207       669   59    54   466      8   279       12     71  0.46
s38584.1    1426   70    70  1425      1   849        0      0  0.24
s38417      1465   65    65  1288      4  1358      910  24786  6.37

Table 3.2. Period-constrained min-reg characteristics, LGsynth benchmarks.

                     Initial        Final
Name                 Regs   Del    Del  Regs  Iters  Cons  Refined  Exact  Time
oc miniuart            90    10     10    89      1   190        2      5  0.01
oc dct slow           178    17     17   168      2    33        3      9  0.01
oc simple fm rcvr     226    34     34   224      1   184        5     12  0.03
oc minirisc           289    23     23   269      2   308        3     37  0.02
oc ata ocidec2        303    13     13   283      1   146       19     77  0.01
oc aes core           402    13     13   394      1   828        0      0  0.01
oc hdlc               426    12     12   383      3   186        9     74  0.04
oc ata ocidec3        594    15     13   555      2   161       24     89  0.02
oc ata vhd 3          594    15     15   560      2   225       35    132  0.03
oc fpu                659  1030   1030   298      2   288      133   1172  2.51
oc aes core inv       669    13     13   658      2   512        0      0  0.03
oc oc8051             754    52     52   746      2  1793        2     17  0.06
os blowfish           891    37     36   827      1    17        0      0  0.03
oc cfft 1024x12      1051    21     21   910     10   596     1154   9881  3.27
oc vga lcd           1108    34     34  1078      3   191        0      0  0.05
oc ethernet          1272    33     33  1259      3   315        0      0  0.09
oc pci               1354    46     46  1308      3    46        0      0  0.09
oc aquarius          1477    99     99  1474      1  5567        0      0  0.09
oc wb dma            1775    18     18  1767      2  1323      191   3614  0.43
oc mem ctrl          1825    32     32  1812      2  1043        0      0  0.13
oc des perf          1976     5      5  1976      0  6077      408   6714  0.08

Table 3.3. Period-constrained min-reg characteristics, OpenCores benchmarks.

                Initial      Final
Name             Regs  Del   Del   Regs  Iters  Cons  Refined  Exact   Time
barrel16           37    8     6     36      1    17        0      0      0
barrel16a          37   10    10     34      1    18        0      0      0
barrel32           70   10     8     69      1    33        0      0   0.01
barrel64          135   11     9    134      1    65        0      0   0.01
nut 004           185   12    12    168      4    18        3     10   0.02
nut 002           212   19    19    195      3    57        0      0   0.01
nut 003           265   36    36    235      2   102        2     10   0.03
nut 000           326   55    55    315      3    62        0      0   0.01
nut 001           484   55    55    446      5   142       16     77   0.17
mux32 16bit       533    6     6    503      1   140        0      0   0.01
mux8 64bit        579    4     4    573      1   290        0      0   0.01
mux64 16bit      1046    6     6    985      1   149        0      0   0.02
mux8 128bit      1155    4     4   1149      1   578        0      0   0.02
radar12          3875   44    44   3767      4    21        0      0   0.45
radar20          6001   44    44   5357      3    23        0      0   2.48
uoft raytracer  13079   93    93  12030      4  7126     4261  57401  58.91

Table 3.4. Period-constrained min-reg characteristics, QUIP benchmarks.

Figure 3.9 illustrates the effect of the refinement of conservative constraints into exact constraints on the over- and under-constrained cuts. In each refinement sub-iteration, the removal of conservative constraints decreases the number of registers in the over-constrained cut, while the addition of exact constraints increases the number in the under-constrained one. Eventually, the number of registers (and structural location) of these two cuts converges. In this graph, the number of registers in these two cuts is presented relative to the size of the final one, and the progressing convergence of their sizes is captured over time. Again, these are the average values over all iterations and all non-verification benchmarks; the forward and backward phases are graphed separately. Not all of the benchmarks required so many iterations, and almost all of the refinement occurred in the last one or two iterations before convergence. Figure 3.9 can be used to get an idea of the optimality that is lost through early termination with an over-constrained cut.
Even after only the first sub-iteration of timing refinement, the over-conservative cut is only suboptimal by less than 6% on average.

Figure 3.8. Average fraction of conservative nodes refined in each iteration.

We also examine the relationship between the tightness of the maximum delay constraints and the ability of the minimum-register retiming to decrease the registers in the design. This is primarily a characteristic of the target circuits and not the minimization algorithm, but it gives an idea of the tradeoffs involved in timing-constrained register minimization. Five designs are described in detail in Figure 3.10. The delay values on the x-axis are normalized to the delay of the unconstrained minimum-register solution. (Points toward the left are more tightly constrained.) The numbers of registers on the y-axis are normalized to the number of registers in the unconstrained minimum-register solution. Even within this small sample of designs, it can be observed that the relationship between delay and register count is quite variable and highly design-dependent.

The relative tightness of the imposed delay constraints affects not only the result but also the behavior of the algorithm. Tighter constraints result in more of the nodes being initially constrained, more of the nodes being refined, and more exact constraints per tightened node. On the other hand, the mobility of the registers is more limited.

Figure 3.9. Registers in over-constrained cut vs. under-constrained cut over time relative to final solution.

To quantify these tradeoffs, we took all of the designs in Tables 3.2, 3.3, and 3.4 and applied the heuristic minimum-delay retiming algorithm of [47]. The resulting circuits had smaller delays and (typically) an increased register count. We then applied the delay-constrained min-register algorithm using the optimized worst-case delays as the global constraints. The characteristics of the resulting run are presented in Tables 3.5, 3.6, and 3.7.
The meaning of the table columns is identical to those in the original-period-constrained tables. One design, "uoft raytracer", is omitted due to problems in the min-delay optimization. There were examples for which the runtime increased and examples for which it decreased. While the decreases outnumbered the increases, the average runtime rose 10.0x.

As an experiment to verify the correctness of our result, we compared the result of applying our algorithm to both the original circuits and the delay-minimized versions using the same original delay constraints. The expected result of an identical minimized register count was observed. Also, as expected, the resulting circuits were generally not isomorphic, due to the differing initial positions of the registers at the entry to our algorithm.

Figure 3.10. Registers after min-reg retiming vs. max delay constraint for selected designs.

            Initial      Final
Name         Regs  Del   Del  Regs  Iters  Cons  Refined  Exact   Time
s382           30   17    11    26      1   131        9     61      0
s400           32   17    11    27      1   142       11     76      0
s444           36   20    12    30      4    88       20    108   0.01
s953           33   27    22    31      1   358       81   1898   0.03
s5378         177   33    29   166      1   785        6     28   0.02
s9234         135   55    51   127      1   816        8     62   0.02
s13207        690   59    46   455      8   469       12     91   0.49
s38417       1629   65    45  1356      4  3173     1837  46937  11.33
s38584.1     1428   70    63  1427      1   974        0      0   0.22
s641           19   78    78    17      1   111        0      0      0
s713           19   86    86    17      1   122        0      0      0

Table 3.5. Min-delay-constrained min-reg characteristics, LGsynth benchmarks.
                     Initial        Final
Name                 Regs   Del    Del  Regs  Iters  Cons  Refined  Exact   Time
oc des perf          1976     5      5  1976      0  6077      408   6714   0.08
oc miniuart            90    10     10    89      1   190        2      5      0
oc hdlc               431    12      9   397      2   581       29    286   0.03
oc ata ocidec2        313    13     10   295      2   184       21    156   0.03
oc aes core           402    13     13   394      1   828        0      0   0.01
oc aes core inv       669    13     13   658      2   512        0      0   0.03
oc ata ocidec3        616    15     11   566      3   380       58    393   0.07
oc ata vhd 3          619    15     11   575      3   405       55    397   0.07
oc dct slow           204    17      7   192      2   145       40    244   0.01
oc wb dma            1870    18     13  1793      2  1370      122   2670   0.39
oc cfft 1024x12      1580    21     13  1088      6   720     1000   7185   1.57
oc minirisc           292    23     21   273      2   348        7     64   0.01
oc mem ctrl          1835    32     31  1813      2  1051        3     19   0.25
oc ethernet          1298    33     28  1274      3   500       27    213   0.18
oc simple fm rcvr     254    34     22   241      1   236      110    910   0.04
oc vga lcd           1121    34     23  1091      3   415        0      0   0.06
os blowfish           899    37     33   834      1    48        0      2   0.03
oc pci               1421    46     24  1371      3   584       24     62   0.15
oc oc8051             770    52     49   765      2  1960        3     57   0.06
oc aquarius          1531    99     92  1504      2  4217       37   2096   0.29
oc fpu               1992  1030    217   856      5  2457     5217  29768  12.23

Table 3.6. Min-delay-constrained min-reg characteristics, OpenCores benchmarks.

            Initial      Final
Name         Regs  Del   Del  Regs  Iters    Cons  Refined  Exact  Time
mux8 64bit    767    4     2   767      0     955       95    159  0.02
mux8 128bit  1535    4     2  1535      0    1915      191    319  0.03
mux32 16bit   653    6     3   586      2     300       68    108  0.02
mux64 16bit  1244    6     3  1171      2   518.5       98  263.5  0.05
barrel16       86    8     3    86      0      82       32    232     0
barrel16a      99   10     4    87      2    48.5     69.5    292  0.01
barrel32      198   10     4   183      1     137       96    736  0.02
barrel64      647   11     4   647      0     514      320   3712   0.1
nut 004       242   12     5   204      2   163.5     64.5    288  0.02
nut 002       228   19    13   210      3   104.5       13     56  0.02
nut 003       323   36    19   280      2   194.5     44.5  114.5  0.04
radar12      3970   44    23  3840      2   847.5      0.5     24  0.59
radar20      6862   44    23  5893      3  1746.5     1548  11181  4.63
nut 000       414   55    24   367      3     161       47    268  0.04
nut 001       579   55    33   485      4   351.5       52    354  0.11

Table 3.7. Min-delay-constrained min-reg characteristics, QUIP benchmarks.
3.7 Summary

The contribution of this chapter is a new algorithm for computing a minimum-register retiming under both minimum and maximum path delay constraints. The improvements over previous techniques include:

Faster Runtime. We measured our technique against the best-known published solution to the delay-constrained minimum-register retiming problem, implemented in the academic Minaret tool. Our runtime was 102x faster. The worst-case bound is difficult to compare to other techniques, but the empirical improvement for real-life examples is substantial.

Intermediate timing feasibility. In every iteration of both the outer algorithm and the inner timing refinement, there exists a solution that is strictly smaller than the original solution and meets all of the timing constraints. The algorithm can therefore be terminated at any point, and the tradeoff of runtime versus quality can be adjusted as desired.

Flexible timing model. Our formulation considers both long- and short-path constraints. While Minaret uses a unit-delay timing model, we allow arbitrary values. The timing analysis can also be replaced with other more descriptive models.

A common problem formulation. This approach to delay-constrained register minimization introduces timing constraints into the problem structure described in Chapter 2 but does not alter the underlying algorithm. Other constraint types (such as those from the subsequent chapter) are completely compatible with this optimization.

Chapter 4

Guaranteed Initializability Min-Register Retiming

In this chapter we extend the algorithm in Chapter 2 to include constraints on the initializability of the circuit. The initializability requirement is a guarantee that upon reset, the registers can be initialized to a set of values that results in identical behavior. Again, we assume that the retiming transformation is understood. The reader may review Section 1.2.1 for more background.
The content of this chapter also depends on the maximum-flow-based formulation of minimum-register retiming from Chapter 2. An understanding of that material is a prerequisite.

The chapter begins in Section 4.1 by defining the problem of guaranteeing initializability after retiming. The existing approaches to solving this problem are described in Section 4.2. Our new maximum-flow-based approach is described in Section 4.3, including a few examples. An analysis of the correctness, complexity, and limitations is presented in Section 4.4. Some experimental results are given in Section 4.6.

4.1 Problem

The majority of sequential devices contain a mechanism for bringing the state to a known value. This is useful at initialization, upon the detection of an unexpected error, or when a restart is desired by the user. The typical implementation of this mechanism is through a reset signal that is distributed to all sequential components. When the reset is asserted, all state elements revert to a known initial state or initialization value. There do exist other procedures for bringing the system to a known state, but we address only resetting the registers to specific values at this time.

The initial state of each register bit is specified as part of the design. It may be required to be either zero or one, or allowed to be either (often referred to as x-valued). In the unspecified case, the register may still require a reset to drive its output to a legal logic value (to guarantee correct electrical functionality by removing any metastability present in the physical device), but the particular logic value is unimportant. This affords additional freedom that can be exploited at different points of the design flow, including during retiming.

When a circuit is transformed by retiming, the initialization values of the new retimed registers must be assigned to maintain the output functionality of the circuit.
From the point of reset onward, the output must be identical under any possible input trace. A set of initial values that satisfies this requirement is known as an equivalent initial state. There may not exist any such equivalent initial state; the retiming must then be rejected, adjusted, or the circuit structure altered. The difficulty of efficiently modifying the retiming and/or the circuit so that the circuit has identical initialization behavior is one of the central challenges to retiming. It has even been suggested that this is one of the primary obstacles to its further industrial adoption [50], though this does not conform to our experience. It is the problem of retiming under the constraint that an equivalent initial state must exist that we turn to now.

4.2 Previous Work

The effect of retiming on reset behavior can be conceptualized using the equivalent finite state machine representation of the sequential circuit. Each register corresponds to a bit in the set of possible state representations, and the initial value (or multiple possible values, if unspecified) of each register dictates the one or more initial states from which the state machine progresses. There may be multiple states that belong to an initialization sequence and are not reachable from later states. The remaining set of states comprise the cyclic core. While retiming is guaranteed to maintain the cyclic core of the state machine, the transformation may alter the initialization sequences [51]. If the retimed circuit has an equivalent initial state, then transitively, it is known that all of the subsequent initialization sequences must also exist.

The problem of finding an equivalent initial state is known as the initial state computation. In the next subsection, Section 4.2.1, we discuss a method for computing an equivalent initial state or determining that none exists.
In Section 4.2.2, we give an overview of previous solutions for excluding particular retimings that do not have such a state.

4.2.1 Initial State Computation

The problem of computing an equivalent initial state can be solved by finding a set of register assignments that reproduces the same logic values on the outputs of all combinational gates. While this is sufficient to guarantee output equivalence, it is not strictly necessary; however, this formulation confines the scope of the computation to the retimed logic and is the problem commonly solved in practice. Determining such an equivalent initial state is substantially different for combinational nodes with a positive retiming lag (over which registers were retimed in the direction of signal propagation) than for ones with a negative retiming lag.

For registers that were retimed in the forward direction, an equivalent initial state can be computed by logically propagating the initial states forward through the combinational logic to the new register locations. This process is identical to functional simulation. With unspecified x values, three-valued simulation should be used. In either case, an equivalent initial state is guaranteed to exist.

Figure 4.1. A circuit with eight registers and their initial states.

For registers that were retimed in the backward direction, it is necessary to find a set of values that, when propagated forward, is identical to the original initial state. This problem can be solved using SAT. As the registers are moved backwards, the initialization problem can be simultaneously constructed by unrolling the logic over which the registers were moved. If necessary, this may result in some or all of the circuit being replicated multiple times.
The SAT problem then consists of finding an assignment to the base of this unrolled cone (corresponding to the new positions of the registers) such that the values at the leaves (corresponding to the original locations of the registers) are identical to their original initial values. Any registers that do not have a reset value specified are omitted as constraints at the top of the cone.

Consider the example of initial state computation after retiming the circuit depicted in Figure 4.1. The original circuit contains eight registers r1 through r8, each labeled with its value at initialization. We will apply minimum-register retiming in separate forward and backward phases (as is done in the algorithm in Chapter 2).

The forward minimum-register retiming phase will replace the four registers r1 through r4 with one register r1-4 in the location shown in Figure 4.2. If the initial values of the original registers are propagated forward, it can be seen that the logical state of the net on which the new register lies is 0. If the new register r1-4 initializes to 0, the retimed circuit will behave identically to the original.

Figure 4.2. Computing the initial states after a forward retiming move.

Figure 4.3. Computing the initial states after a backward retiming move.

The backward minimum-register retiming phase will replace the four registers r5 through r8 with the one register r5-8 in the location shown in Figure 4.3(i). Here, we must find an initial value assignment to the new register that, when propagated forward, results in the specified initial values on the nets on which the original registers lie. In this particular example, no such single satisfying assignment exists; this is not an initializable retiming.

One could accept the solution after forward retiming and abandon the backward phase completely, though this clearly sacrifices significant optimization potential. This is overkill for a conflict that may be confined to a very local portion of the circuit.
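The two phases of this example can be mimicked in a few lines. The sketch below is illustrative only: the gate list, net names, and the brute-force enumeration standing in for a SAT solver are assumptions for the example, not the actual circuit of Figure 4.1. Forward retiming propagates initial values with three-valued simulation; backward retiming searches for base values that justify the original ones.

```python
from itertools import product

def not3(a):
    # three-valued NOT over {'0', '1', 'x'}
    return {'0': '1', '1': '0', 'x': 'x'}[a]

def and3(a, b):
    # three-valued AND: a controlling '0' dominates an unknown 'x'
    if a == '0' or b == '0':
        return '0'
    return '1' if a == '1' and b == '1' else 'x'

def simulate3(gates, values):
    """Forward-propagate register initial values through a topologically
    ordered list of (output, op, inputs) gates; values are '0'/'1'/'x'."""
    for out, op, ins in gates:
        if op == 'NOT':
            values[out] = not3(values[ins[0]])
        elif op == 'AND':
            v = '1'
            for i in ins:
                v = and3(v, values[i])
            values[out] = v
    return values

def justify_backward(base_vars, eval_leaves, targets):
    """Enumerate assignments to the base of the unrolled cone (the new
    register positions); return one whose forward propagation matches
    the original initial values at the leaves, or None if none exists.
    A SAT solver replaces this enumeration in practice."""
    for bits in product('01', repeat=len(base_vars)):
        assign = dict(zip(base_vars, bits))
        if eval_leaves(assign) == targets:
            return assign
    return None

# Forward phase: g = AND(r1, r2) with r1 = '1', r2 = 'x' gives g = 'x'.
vals = simulate3([('g', 'AND', ['r1', 'r2'])], {'r1': '1', 'r2': 'x'})
print(vals['g'])  # -> x

# Backward phase: a new register b drives leaves r5 = b, r6 = NOT(b).
# Requiring r5 = r6 = '1' is an uninitializable conflict, like Fig. 4.3(i).
leaves = lambda a: {'r5': a['b'], 'r6': not3(a['b'])}
print(justify_backward(['b'], leaves, {'r5': '1', 'r6': '1'}))  # -> None
print(justify_backward(['b'], leaves, {'r5': '1', 'r6': '0'}))  # -> {'b': '1'}
```

The `None` result corresponds to the rejected backward retiming; the constrained algorithm of Section 4.3 exists precisely to steer the cut away from such regions.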
backward retiming solutions that improve the objective and yet possess an equivalent initial state. One such example is illustrated in Figure 4.3(ii). The question at hand is how to constrain the retiming algorithm to avoid only the uninitializable solutions.

4.2.2 Constraining Retiming

It is always possible to restrict the retiming transformation such that the resulting circuit will have a set of feasible initial states. The most straightforward such restriction is that the registers only be moved in the forward direction. However, this comes at the cost of a loss in optimization potential. Our experiments indicate that almost half of the average reduction in register count is achieved only through movement in the backward direction (Figure 2.15); it is clearly desirable to capture as much of this improvement as possible.

One solution is to transform a general retiming solution into an equivalent forward-only one. The lag function of the minimal forward-only retiming, r′(v) : V → Z^{0,+}, can be derived from any other lag function r(v) : V → Z by subtracting the minimum element in the range of r(v) (Equation 4.1). There is no loss in generality in this procedure.

    r′(v) := r(v) − min_{u∈V} r(u)    ∀v ∈ V    (4.1)

This offset also applies to the lags of primary inputs (and outputs) and will result in non-zero lags on these nodes if there exists any node v such that r(v) < 0 (i.e., that was backward retimed). The registers that were "passed through the environment" may not have a corresponding single initialization value; in general, additional combinational logic is required to handle their initialization. The synthesis of this logic is described in [52].
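Equation 4.1 is a one-line normalization; the following sketch shows it in Python (the node names and lag values are illustrative, not from any benchmark):

```python
def forward_only_lags(r):
    """Equation 4.1: normalize a lag function r(v) into the equivalent
    forward-only lag function r'(v) = r(v) - min over V of r."""
    offset = min(r.values())                      # most negative lag
    return {v: lag - offset for v, lag in r.items()}

# 'a' was retimed backward by 2; the offset shifts every lag to be >= 0.
print(forward_only_lags({'a': -2, 'b': 0, 'c': 1}))  # -> {'a': 0, 'b': 2, 'c': 3}
```

After normalization the minimum lag is zero, which is exactly the forward-only property; the cost, as noted above, is that the offset spills onto the primary inputs and outputs.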
There are limitations to this approach in both analysis and implementation: the construction of the circuit's state machine is necessary to perform the analysis (which can only be explicitly built for small control circuits), and the resulting changes to the combinational netlist may be unpredictable and/or extensive. It is highly desirable for retiming to leave the structure of the rest of the netlist mostly intact.

The ability to retime in the reverse direction while still maintaining initial state feasibility was first addressed by [53]. This is an improvement over the forward-only retiming of [52], as it allows individual registers to be moved backwards without having to be pushed through the environment. However, the method used to restore initializability is heuristic and not well-targeted to the conflict. A constraint on the minimum lag is introduced to gradually reduce backward movement.

For minimum-delay retiming, [54] describes an elegant method that finds the provably "most-initializable" co-optimal solution and then relaxes the target period until it is found. The relaxation of the target period allows the magnitude of the retiming lags in all portions of the circuit to be scaled back simultaneously. This approach is unfortunately not applicable to the min-register problem. Here, there is no equivalent global objective to be relaxed: an incremental relaxation of the objective (an additional register) can be assigned to any number of locations to improve the initializability and improves only a local portion of the circuit. Choosing the optimal assignments in the min-register problem is significantly more difficult.

For the min-register problem, the work of [55] maintains the guarantee of optimality in the number of registers. Complete sets of feasible initial states are generated by justifying the original initial values backward through the circuit; there are many such sets, as the justification process is non-unique.
The choice of initial states then imposes constraints on the retiming, which are incorporated by only allowing registers with identical values to be merged. If this constrained problem produces a result that is identical to the lower bound on the number of registers with retiming, a feasible optimal solution has been identified; if not, additional sets of initial values are generated until all have been enumerated.

This approach, however, is not scalable to large circuits. The constrained retiming problem (when fan-out sharing is considered) is formulated as a general MILP (mixed-integer linear program). Without leveraging any particular structure of the problem, the scalability of the approach disappears into the poor performance of a general solver on a large problem. Furthermore, the number of initial states is exponential in the worst case, and this approach may require as many iterations if the feasible optimal solution is even slightly worse than the lower bound.

4.3 Algorithm

We propose a technique to generate a minimum-register retiming with a known equivalent initial state that is both optimal under this constraint and empirically scalable in its runtime. This is accomplished by using the formulation of minimum-register retiming introduced in Chapter 2. While the problem remains NP-hard, the algorithm appears to be efficient for real circuits.

The general procedure consists of generating a set of feasibility constraints to incorporate into the maximum-flow problem to bias the registers against being retimed into local portions of the circuit that are known to introduce conflicts. These constraints are introduced incrementally and in such a way that every iteration adds exactly one additional register to the final solution.

In this section, we use the notion of depth to express the distance by which the registers have been retimed from their original locations.
As the initializability algorithm only operates during the backward phase of retiming, a deeper retiming refers to a cut that has a more negative lag function and lies further backward relative to the direction of signal propagation. In terms of the corresponding flow problem, however, a deeper cut lies closer to the sink and further forward relative to the direction of flow.

4.3.1 Feasibility Constraints

It is important to draw the distinction between the original circuit and the initialization circuit, which is used to test for and compute a new set of initial register values. The circuit on which the initial state computation is performed consists of the unrolled logic between the original and retimed locations of the registers. This initialization circuit is not necessarily a subset of the original design, as each of the original nodes may be replicated zero or more times. Let this initialization circuit be the graph Ginit = <W, E>, where W ⊆ V × Z: V are the nodes in the original design, and each initialization node w corresponds to a vertex in the original problem and a lag value at the time that copy w was added to the problem.

Let a feasibility constraint γ be a subset of the problem variables W that has the following property: it is sufficient for a retiming to be as topologically deep as γ to imply infeasibility. Correspondingly, a retiming must be at least partially shallower than γ to be feasible. We require γ to be a partial cut in the initialization circuit: a set of vertices such that there exist no i, j ∈ γ with i ∈ TFO(j). Lemma 10 implies that infeasibility is monotonically increasing with topological depth, and that such a partial cut γ must exist.

Lemma 10. If a particular retiming is initial state infeasible, all strictly deeper retimings are also infeasible.

Proof. Consider a feasible assignment at some strictly deeper cut. The forward propagation of these initial states implies a set of initial values at the location of the shallower cut.
This set of values comprises a feasible initial state for a retiming at the shallower cut, contradicting the assumption.

We find such a partial cut γ using the procedure described by Algorithm 8, as follows. The initialization circuit is ordered topologically from the retimed locations of the registers to the initial locations. Binary search is then used to find the shallowest complete cut that results in an UNSAT initial-state problem. In each step at node w, the implied cut lies between the nodes whose ordering label is ≤ that of w and the nodes whose ordering label is > that of w. The subset of the variables in the SAT problem beyond w is excluded (i.e., all of the clauses that contain them are removed). This reduces the problem to the exact point in the topological order at which it is sufficient to imply infeasibility. Note that this point is a result of the particular ordering chosen amongst the multiple partial orderings implied by the topology. The last variable w that was required to produce UNSAT is then added to γ.

However, in the new test for SAT, only the variables and constraints in the transitive fan-in of γ are included in the problem. This is done to ignore the variables that were included not because of their topological relationship to w but because of the particular ordering. Because of the exclusion of this region, the initialization circuit may be SAT once again, requiring the addition of more variables. The procedure is repeated until the transitive fan-in of γ by itself is sufficient to imply UNSAT. In subsequent binary searches, the variables in TFO(γ) are always included in the problem. The final γ is the new feasibility constraint.

Figure 4.4. Binary search for variables in feasibility constraint.

Figure 4.4 illustrates the identification of a variable w through binary search on the topological ordering. The green region represents the ordered initialization circuit Ginit in which w is a node.
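The binary search above is a standard monotone-predicate search, justified by Lemma 10: once a prefix of the topological order is sufficient to imply UNSAT, every longer prefix is as well. A minimal sketch in Python, where `unsat_with_prefix(k)` is a stand-in for the actual SAT call on the problem restricted to the first k variables in topological order (the predicate below is illustrative, not the real CNF machinery):

```python
def find_threshold(n, unsat_with_prefix):
    """Smallest k in [1, n] such that the problem restricted to the first k
    variables in topological order is already UNSAT.  Relies on Lemma 10:
    once infeasible, every deeper prefix stays infeasible (monotone)."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if unsat_with_prefix(mid):
            hi = mid        # already UNSAT at mid: threshold is here or earlier
        else:
            lo = mid + 1    # still SAT: the conflict lies deeper
    return lo

# Stand-in predicate: pretend the conflict appears once variable 7 is in scope.
w = find_threshold(10, lambda k: k >= 7)
```

Each probe costs one SAT call, so only O(log |W|) calls are needed per variable added to γ, which is where the log factor in the complexity bound of Section 4.4.2 originates.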
The cut Ctopo>w excludes this particular variable while the cut Ctopo≥w includes it. The transitive fan-out cone of w that is tested as being sufficient to imply UNSAT is highlighted in orange.

If it is available, the UNSAT core can be used to generate a feasibility constraint. The UNSAT core localizes the variables (e.g., the nets) in the problem that resulted in the conflict that prevented the existence of a satisfying assignment. This eliminates the need for a binary search of the problem variables.

Algorithm 8: Construction of a feasibility constraint: FIND_FEAS_CONST()
  Input : an initialization circuit graph Ginit = <W, E>
  Output: a feasibility constraint γ
  let γ be an empty subset of W
  assert SAT(Ginit) = false
  repeat
      topologically order W
      binary search for w ∈ W such that
          ¬SAT(Ginit w/o vars TFO(γ) ∪ {u : topo(u) ≥ topo(w)})
          ∧ SAT(Ginit w/o vars TFO(γ) ∪ {u : topo(u) > topo(w)})
      add w to γ
  until SAT(Ginit w/o vars TFO(γ)) = true
  return γ

4.3.2 Incremental Bias

To implement each constraint γ, a penalty structure is added to the flow graph to bias it against any cuts that lie further from the initial positions of the registers than γ. This is accomplished using the graph feature in Figure 4.5. A new node nbias is added: its flow fan-ins are the nodes v ∈ γ, and its fan-out is the sink node vsink. The effect of this structure is to add exactly one additional unit of flow from γ to vsink. Without the model for fan-out sharing, the edge from nbias → vsink would constrain the flow; with fan-out sharing, the internal edge in nbias with unit capacity constrains the flow. The sum total of the width of the edges crossing Cmin is increased by one, and this cut may no longer be the minimum-width cut in the graph. The next iteration has been incrementally biased against selecting Cmin and any other cut that lies beyond γ and cuts this additional flow path.

Figure 4.5. Feasibility bias structure.

The feasibility constraint γ is in the space of W, but it must be applied to the flow graph in the space of V. This is accomplished by delaying the implementation of γ's bias until all of the nodes (with the appropriate lags) come into the scope of the current combinational frame. Until this happens, the cut is also prevented from prematurely passing any of the nodes already in scope: their fan-outs are temporarily redirected to the sink. This temporary delay does not affect the result.

Each feasibility constraint introduces exactly one register. As the register count is increased, one of two cases will occur: (i) the minimum cut is now shallower than γ and the result is initializable, or (ii) the minimum cut is still as deep as γ and another penalty is necessary. In this manner, the minimum cut is "squeezed forward" out of the conflict region, and the register count is incremented until it first becomes possible to find an equivalent initial state.

The overall algorithm consists of the repeated identification of a feasibility constraint γ and its addition to the cumulative set Cfeas. The bias structure for every constraint in Cfeas is added to the graph, and the new (larger) backward minimum cut is computed. The iteration terminates when the minimum cut has an equivalent initial state. This is summarized in Algorithm 9.

If optimality is desired and multiple penalties with overlapping elements are generated, search is required to check for the cases where confining the biases to a single subset of the constraint variables is sufficient to push the resulting cut forward beyond that subset of the constraint. The solution may be feasible due to the Boolean relationships between the various overlapping elements and constraints; this cannot be addressed with any strictly topological analysis. As the search process is exponential, the expected NP-hard worst-case complexity of the initialization problem is contained within this case. However, if optimality is not necessary, the problem can be simplified.
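The effect of the bias structure of Section 4.3.2 on the minimum cut can be reproduced on a toy flow graph: adding a unit-capacity path from a node in γ to the sink raises the width of every cut deeper than γ by exactly one. A self-contained sketch using a small Edmonds-Karp max-flow; the three-node graph and capacities are invented for illustration:

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow; cap maps u -> {v: capacity}."""
    residual = defaultdict(lambda: defaultdict(int))
    for u in cap:
        for v, c in cap[u].items():
            residual[u][v] += c
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow            # no augmenting path: flow = min cut width
        # Collect the path edges and push the bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push

# Toy flow graph: the only bottleneck (min cut) is the width-2 edge a -> b.
cap = {'s': {'a': 5}, 'a': {'b': 2}, 'b': {'t': 5}}
before = max_flow(cap, 's', 't')   # min cut width 2

# Bias structure for gamma = {a}: a unit-capacity path a -> n_bias -> sink
# adds exactly one unit of flow that every cut deeper than 'a' must also cut.
cap['a']['n_bias'] = 1
cap['n_bias'] = {'t': 1}
after = max_flow(cap, 's', 't')    # min cut width is now 3
```

Before the bias, the minimum cut has width 2; afterward, any cut that separates `a` from the sink must also cut the bias path and therefore costs 3, while cuts shallower than `a` are unaffected.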
The biases can be added in a straightforward manner until the minimum cut is pushed forward beyond the feasibility constraints. Alternatively, the initialization circuit can be chopped at the feasibility cut (similarly to the manner in which conservative timing constraints were added in Section 3.3) and the cut made feasible in a single step.

We can also consider the duplication of registers on the same net to provide both initial states (if this is allowed). This corresponds to the case where some portion of the minimum cut was pushed to exactly w ∈ γ, the point at which register duplication is capable of resolving the conflict. If the fan-out of w is greater than 2, this technique may be more efficient than biasing the cut until it is pushed past w. Both duplication and structural bias can be integrated into the search for the case requiring multiple registers.

4.4 Analysis

4.4.1 Proof

Correctness: Functionality

As the guaranteed-initializable minimum-register retiming algorithm is a more strictly constrained version of the approach in Chapter 2, the proof for functional correctness proceeds identically to that of Section 2.4.1. We refer to that explanation for more detail.

Algorithm 9: Guaranteed Initializable Retiming: INIT_RETIME()
  Input : a combinational circuit graph G = <V, E>
  Output: an initializable retiming cut R
  let init state problem variables W be ⊆ V × Z
  let Cfeas be an empty list of subsets of W
  // forward retiming phase
  repeat
      nprev ← |G|
      G ← MIN_REG_forward(G)
  until |G| = nprev
  // backward retiming phase
  Gsaved ← G
  repeat
      G ← Gsaved
      Ginit ← ∅
      repeat
          nprev ← |G|
          G ← MIN_REG_backward(G)
          build Ginit
      until |G| = nprev
      if G has initial state then return G
      γ ← FIND_FEAS_CONST(Ginit)
      Cfeas ← Cfeas ∪ γ
  until forever

Correctness: Initializability

The initializability of the final circuit is implicit in the termination condition of the outermost loop of the algorithm.
The equivalent initial state is computed, and if one exists, we have an example that proves the initializability of the resulting circuit. If no such state exists, another iteration is necessary to modify the retiming solution.

4.4.2 Complexity

Because the problem of computing an equivalent initial state after backward retiming (let alone transforming that retiming) is already NP-hard, it is not possible to establish a polynomial upper bound on the runtime of this algorithm. However, this in no way precludes its speed and scalability on the class of circuits typically seen in the real world. Our experience has shown that, for the circuits that we examined, the check for an equivalent initial state via SAT is extremely fast. The total number of calls to the SAT solver is bounded by O(FR log R), where R is the original number of registers in the design and F is the number of additional registers that are required to ensure initial state feasibility. F is quite small in all of the examined circuits.

4.5 Experimental Results

We utilize an experimental setup identical to that described in Section 2.5. First, unconstrained minimum-register retiming was applied to all of the non-verification benchmarks. The initial state was preserved in the overwhelming majority of cases; only one design in the entire suite ("s400") did not have an equivalent initial state. This may be an unrepresentatively low rate of non-initializability; industrial collaborators have reported experiencing a rate of approximately 10%.

To provide a more thorough evaluation of the initial-state-feasible minimum-register retiming, the initializable benchmarks were modified to create initialization conflicts. The original reset values of the registers were replaced with random bits, though in many cases, even with multiple sets of random values, the result was still initializable. The results of applying our algorithm to both "s400" and the initial-state-randomized designs are described in Table 4.1.

Name         Nodes   Init Regs   Infeas Regs   Addl Feas Regs   Avg |γ|   Runtime
s400          0.3k          21            18                1         8      0.08
oc aes core  16.6k         402           395                3         2      2.55
oc vga lcd   17.1k        1108          1087                1         1      1.09
nut 003       6.6k         484           450                3         1      1.41
radar12      71.1k        3875          3771               27       2.3     108.3
oc wb dma    29.2k        1775          1757                2       3.5       5.7

Table 4.1. Guaranteed-initializability retiming applied to benchmarks.

The number of initial registers in each design is listed in column Init Regs; after unconstrained minimum-register retiming, this value was reduced to the number of registers in column Infeas Regs. There existed no equivalent initial states for these solutions. The guaranteed-initializable version was then applied, and the number of additional registers (or, equivalently, the number of iterations) is listed in the column Addl Feas Regs. Column Avg |γ| is the average number of nodes in each of the feasibility constraints. Runtime is the total runtime in seconds.

The randomization of the initial states likely results in more difficult problems than would be generated in any actual design, and yet the optimal feasible retiming can be found in a median runtime of a little over a second. The circuit "radar12" is the outlier and presents a challenge due to its particular arithmetic structure. The small average size of the feasibility constraints indicates that the conflicts that prevent the existence of an equivalent initial state are indeed very local.

4.6 Summary

The contribution of this chapter is a new algorithm for computing a minimum-register retiming that is guaranteed to have an equivalent initial state. This is a requirement for the correct initialization behavior of the retimed design. The solution can be computed either optimally or heuristically, and the approach appears to scale well to moderately-sized industrial designs.
Chapter 5

Min-Cost Combined Retiming and Skewing

In this chapter we discuss algorithms for simultaneously minimizing both the number of registers in a circuit and the number of clock skew buffers under a maximum path delay constraint. We assume that the reader is familiar with both retiming and clock skew scheduling, which are introduced in Sections 1.2.1 and 1.2.2, respectively. This chapter also utilizes minimum-register retiming, such as was discussed in Chapter 2, though the material is not predicated on that particular algorithm.

The chapter begins in Section 5.1 by defining the problem of joint register and skew buffer minimization. We introduce a combined cost function that can be generalized to different objectives, including dynamic power consumption. In Section 5.2, we discuss previous work. Section 5.3 introduces a formulation that solves the joint optimization exactly under linear cost functions. Because the solution of this problem is not scalable to larger designs, we instead turn to the new heuristic algorithm described in Section 5.4. Experimental results are presented in Section 5.5.

5.1 Problem

5.1.1 Motivation

Both retiming [8] and skew scheduling [9] are sequential optimizations with different means of implementation that have the same objective: balancing the delay along long combinational paths with adjacent shorter ones. Retiming relocates the structural position of the registers in a design, and skew scheduling inserts intentional delays into the clock distribution network to move the temporal position of the registers. The optimal minimum delay achievable by both techniques is bounded by the maximum mean cycle time of the worst-case register-to-register delays.

There are costs associated with applying each of the two techniques. Retiming alters the number of registers (in either direction), affecting the area, dynamic power, and the other design metrics discussed in Section 2.1.
Clock skewing requires the implementation of a particular set of relative delays. These specific and non-uniform clock path delay requirements impose a real challenge to clock network design. The implementation is usually accomplished with carefully-planned additional wiring, buffers, or delay elements, and each of these elements consumes additional power and area.

We use the notion of a cost function to describe the value of a particular implementation choice. Any or all of these metrics could be included in such a function, either quantitatively or heuristically. For this reason, we treat cost as a very general and user-definable concept. However, our focus is primarily on the dynamic power consumption of the registers and skew elements that must be driven by the clock tree. Separately and secondarily, we also consider the area of these cells. Whenever the concept of cost is visited, these two metrics could be used by the reader as a concrete example of its potential use.

The problem that we consider in this chapter is how to select a combination of simultaneous retiming and skewing to meet the given delay constraint in such a way that the desired cost function is minimized. This approach is motivated by the observation that the costs of both sequential optimization techniques have a strong dependence on the circuit topology, but in different and often complementary ways. Within a single design, there are critical elements where performance can be improved more efficiently through retiming and others where skewing is more suitable. When the two are used in combination, a performance improvement can be obtained with less implementation cost than either in isolation.

Figure 5.1. Costs of moving register boundary with retiming and skew on different topologies.

Figure 5.1 describes a pair of examples illustrating this difference in cost.
There are two circuits in parts (i) and (ii) with different topologies; in both of these we desire to move the register boundary forward in time and/or structure (such that the slacks on the fan-out paths are increased). The graphs below each circuit demonstrate the approximate clock tree (power or area) cost of implementing this re-balancing with either retiming or skew. In Figure (i), a circuit with a narrowing fan-out can be retimed forward with an outright reduction in cost; skewing is expensive. In Figure (ii), the transitive fan-out width grows outward from the latch boundary; in this case, retiming is expensive but skewing is relatively cheap. Moving the register boundary forward by some amount of time requires fewer skew buffers in (ii) than in (i) due to the smaller initial number of registers.

The simultaneous application of retiming and intentional skew also has the advantage of avoiding extreme solutions and the associated problems of either. While both clock skewing and retiming are present in several commercial design tools, their role is typically confined to small incremental resynthesis. The scope of the allowed change is local and limited. Global optimization is specifically avoided because of the unpredictability in the difficulty of implementing the extreme (i.e., globally optimal) solutions. While the algorithms for pursuing optimal solutions are well understood, strategies for backing off from extreme solutions to feasible intermediates are less developed. An example is delay-constrained minimum-area retiming. Even if this technique were computationally tractable for large designs, it gives no information about the shape of the cost curve or the quality of nearby alternatives (such as was presented in Figure 3.10, at the expense of significant computational effort).
Especially in the context of a complete design flow, the designer is left with little to no information about how to balance the extent of retiming with other means of meeting the design specifications. Combining multiple techniques provides exactly such a mechanism to back off the extreme solutions of either. Because the cost of a retiming movement can be negative when registers are shared, it is possible to reduce the cost of a set of registers with retiming below their cost in the original design. This allows relaxing the performance constraint, where possible, with the aim of recovering registers. Even with an aggressive performance constraint, it may still be beneficial to introduce timing violations with retiming and then correct for them with intentional skew, if the cost of the additional skew buffers is outweighed by the reduction in registers. Figure 5.1 (i) could be one such example of this.

5.1.2 Definitions

First, we consider the exact formulation of a cost function to measure and minimize the dynamic power consumption of the skew elements and registers in the clock network. We assume that either or both retiming and clock skew scheduling have been applied. Let Creg be the clock input capacitance of a register. The (weighted) number of registers between two combinational nodes is wr(e); this is exactly the retimed register weight in Section 1.2.1 and is defined in terms of the original register weight and some retiming lag function r(v) that completely characterizes the retiming transformation. We assume that skews are implemented at each register r's clock input with a string of delay buffers that produce the required relative skew τ(r). The input capacitance of each one of these buffers is Cbuf and its delay is dbuf. The periodicity of the clock can be used to reduce the required delay to the fractional component of a clock cycle T. The power after retiming (with a lag function r(v)) is expressed by Equation 5.1.
The power to implement a clock skew schedule τ(r) is expressed by Equation 5.2. Ptot is the sum of these two quantities.

Pret = Creg · Σ_{∀e∈E} wr(e)   (5.1)

Pskew = (Cbuf / dbuf) · Σ_{∀r∈R} ( τ(r) − T·⌊τ(r)/T⌋ )   (5.2)

Ptot = Pret + Pskew   (5.3)

A similar set of cost functions can be described to capture the change in area. Let Areg be the area of a register and Abuf be the area of a buffer. The total area cost Atot of applying a (simultaneous) retiming transformation and skew is expressed by Equation 5.6.

Aret = Areg · Σ_{∀e∈E} wr(e)   (5.4)

Askew = (Abuf / dbuf) · Σ_{∀r∈R} ( τ(r) − T·⌊τ(r)/T⌋ )   (5.5)

Atot = Aret + Askew   (5.6)

If we ignore the utilization of periodicity, each of the above cost functions is linear in the number of registers and the applied skew schedule.

5.2 Previous Work

The parallels between clock skew scheduling and retiming have been recognized and utilized by others. The work of [56] describes the technique of continuous retiming, whereby the sequential arrival times are calculated and then converted into a retiming lag function. The real-valued arrival times can be computed quite efficiently using Bellman-Ford [57], and while this does not provide an optimal solution to the discrete retiming problem, it often provides a very good solution.

The Minaret algorithm [58] (discussed in depth in Chapter 3) and its min-delay variant ASTRA [49] both utilize retiming-skew equivalence to bound the number of pair-wise delay constraints that must be enumerated. This is accomplished by computing the as-late-as-possible (ALAP) and as-soon-as-possible (ASAP) skews, again using a propagation of sequential arrival times via Bellman-Ford. No register can be retimed over gates whose total delay exceeds these skews in either direction and still yield a valid solution.
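The sequential arrival-time propagation via Bellman-Ford used by [56] and by the ASTRA/Minaret bounds can be sketched concretely. The relaxation below uses the recurrence R(v) ≥ R(u) + d(v) − wr(e)·T together with R(v) ≥ d(v); the three-node chain, delays, and register weights are invented for illustration, and the skew recovery follows the mapping S = R(u) − kT developed later in this chapter:

```python
def arrival_times(nodes, edges, d, wr, T):
    """Relax R(v) >= d(v) and R(v) >= R(u) + d(v) - wr(e)*T to a fixpoint
    (Bellman-Ford style).  Returns None if the bounds keep tightening past
    |V| passes, i.e. the period T is infeasible."""
    R = {v: d[v] for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for (u, v) in edges:
            bound = R[u] + d[v] - wr[(u, v)] * T
            if bound > R[v] + 1e-9:
                R[v] = bound
                changed = True
        if not changed:
            return R
    return None  # a cycle keeps tightening: no feasible schedule at period T

def skew(R, u, k, T):
    """Skew of the k-th register on an edge leaving node u."""
    return R[u] - k * T

# Invented three-node chain: a -> b (one register) -> c (no register).
nodes = ['a', 'b', 'c']
edges = [('a', 'b'), ('b', 'c')]
d = {'a': 2.0, 'b': 3.0, 'c': 2.0}
wr = {('a', 'b'): 1, ('b', 'c'): 0}
R = arrival_times(nodes, edges, d, wr, T=4.0)
s = skew(R, 'a', k=1, T=4.0)   # the register on a -> b is skewed by -2
```

This is a sketch of the propagation idea only; the actual ALAP/ASAP bound computations in [49, 58] run this relaxation in both temporal directions.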
While these techniques utilize the similarities between skewing and retiming to simplify the computational effort, as far as we are aware, this is the first work that motivates and explores the implementation benefits of the simultaneous use and joint optimization of retiming and skew.

5.3 Algorithm: Exact

We now construct an exact formulation to minimize the cost of the simultaneous application of retiming and skew under a set of simple longest-path delay constraints.

Retiming Component

In Section 1.2.1, we discussed a network ILP formulation of retiming that required the enumeration of all pair-wise timing paths to incorporate delay constraints. There is an MILP formulation of the delay-constrained problem that, while not as efficient to solve directly, describes an equivalent problem. For this formulation, we introduce a real-valued quantity R(v) : V → ℝ that describes the sequential arrival time at each combinational node v. Note that our definition of the real-valued function R(v) is different from the similar problem described by [8] but is linearly related. This change will be motivated shortly. The constraints on R(v) and the retiming lag function r(v) are captured in Equations 5.7 to 5.9. As before, wi(e) is the initial register weight of each edge, and wr(e) is the retimed weight. d(v) is the worst-case delay of combinational node v, and T is the overall period constraint. With these constraints, the retimed circuit will satisfy: correct timing propagation (Equation 5.7), non-negative register count (Equation 5.8), and correct setup timing (Equation 5.9).

R(v) − R(u) ≥ d(v) − wr(e)·T   ∀e = (u → v)   (5.7)

r(u) − r(v) ≤ wi(u → v)   ∀e = (u → v)   (5.8)

d(v) ≤ R(v) ≤ T   ∀v ∈ V   (5.9)

Clock Skew Component

Clock skew scheduling is typically formulated in terms of constraints and variables on the registers in the circuit. Enumerating the registers is convenient, because one independent variable (i.e., its skew) can be created for each.
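The constraint system of Equations 5.7-5.9 can be checked mechanically for any candidate assignment of lags and arrival times. A sketch of such a feasibility check; the two-node chain, delays, and assignments below are invented for illustration:

```python
def feasible(edges, wi, d, r, R, T, eps=1e-9):
    """Check a candidate retiming lag r(v) and sequential arrival times R(v)
    against Equations 5.7-5.9 for a period constraint T."""
    for (u, v) in edges:
        wr = wi[(u, v)] + r[v] - r[u]          # retimed register weight
        if R[v] - R[u] < d[v] - wr * T - eps:  # Eq. 5.7: timing propagation
            return False
        if wr < 0:                             # Eq. 5.8: non-negative weight
            return False
    return all(d[v] - eps <= R[v] <= T + eps for v in d)   # Eq. 5.9: setup

# Invented chain u -> v carrying one register, with node delays 2 and 3.
edges = [('u', 'v')]
wi = {('u', 'v'): 1}
d = {'u': 2.0, 'v': 3.0}
r = {'u': 0, 'v': 0}
R = {'u': 2.0, 'v': 3.0}
ok = feasible(edges, wi, d, r, R, T=4.0)    # satisfiable at period 4
bad = feasible(edges, wi, d, r, R, T=2.0)   # violates the setup bound, Eq. 5.9
```

In the MILP itself, these same inequalities become the constraint rows, with r(v) integer and R(v) continuous.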
We depart from this traditional formulation and instead introduce one that is compatible with the MILP retiming above. The MILP formulation can be reduced to the traditional skew scheduling problem by 1) assuming all retiming lags r(v) = 0 and 2) removing the constraint of Equation 5.9. Because all of the integer variables are fixed to zero, the resulting problem is a linear program. Furthermore, determining the minimum feasible T is an instance of the maximum mean cycle problem, for which several efficient algorithms exist [13].

The resulting solution will comprise a set of feasible sequential arrival times R(v), and while there is not a one-to-one correspondence between these variables and the skews of the registers in the design, the mapping is trivial. Given the sequential arrival time R(u) at the output of node u, the necessary skew S on the k-th register (ordered topologically on edge u → v) is as in Equation 5.10.

S(u → v, k) = R(u) − kT   (5.10)

In general, there are many possible skew schedules that meet a target period. Analogously to delay-constrained minimum-area retiming, a schedule can be chosen that minimizes the total amount of skew by adding an appropriate optimization objective. However, compared to retiming, this is much less difficult; the minimum-cost schedule can be generated on the graph using the Bellman-Ford algorithm [57].

Combined Formulation

The problems of retiming and skewing can be combined into a single MILP to minimize the cost when simultaneously employing both retiming and skewing. In this combined problem, because the number and location of registers will vary with the retiming, the skew component of the total cost is not a straightforward quantity. Instead, we describe a set of variables cs(e) to capture the total sum of skews along any edge e in the retiming graph. This is described in Equation 5.11.

cs(e) = Σ_{k=1..wr(e)} S(e, k)   (5.11)

This is a piecewise linear function of R(u) and wr(u → v).
First, consider the case where there are either zero or one registers present on the edge. If zero, the skew cost should also be zero, and if one, the cost should be S(e, 1). This is realized in the linear constraint of Equation 5.12, where β is an arbitrary constant larger than any R(v).

cs(u → v) ≥ β(wr(u → v) − 1) + R(u) − T   (5.12)

In our experience, there seems to be little to no loss in optimality from restricting wr(e) to be at most one, but for completeness, it is possible to relax this restriction by introducing a set of M ordered binary indicator variables wr1..M(e) to represent wr(e) via the constraints of Equations 5.13 and 5.14. The general expression of cs(e) that allows up to an arbitrary maximum of M registers to be retimed along each edge is Equation 5.15.

Σ_{j=1..M} wrj(e) = wr(e)   (5.13)

0 ≤ wrM(e) ≤ ... ≤ wr2(e) ≤ wr1(e) ≤ 1   (5.14)

cs(u → v) ≥ Σ_{j=1..M} [ β(wrj(u → v) − 1) + R(u) − jT ]   (5.15)

Coupled with an objective that is a linear function of the variables (such as the ones from Section 5.1.2), the program becomes a mixed-integer linear program (MILP). The result will be the optimal combination of retiming and skewing to minimize the given linear cost. While complete, this formulation for minimum cost is of little practical use: it is computationally intractable for all but the smallest circuits. A better approach is needed.

5.4 Algorithm: Heuristic

We describe a heuristic technique to minimize the cost of the joint application of retiming and clock skew to meet a period target Ttarg. We refer to this general process as end-to-end retiming in [59] because it visits:

1. both the min-delay retiming solution and the min-register retiming solution, and

2. a continuous set of solutions between them.

In contrast to retiming to a single solution with a specific goal (e.g. minimum-delay, delay-constrained minimum-area), end-to-end retiming explores an entire spectrum of performance possibilities.
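At each point that end-to-end retiming explores, the cost function being minimized must be evaluated; for the dynamic power objective this is a direct transcription of Equations 5.1-5.3, including the reduction of each skew modulo the clock period. A sketch, with illustrative (not library-derived) parameter values:

```python
import math

def clock_power(wr, skews, T, C_reg=1.0, C_buf=2.0, d_buf=0.25):
    """Total dynamic clock-load cost of a retiming plus skew schedule
    (Equations 5.1-5.3).  Parameter values here are illustrative only."""
    p_ret = C_reg * sum(wr.values())                       # Eq. 5.1
    p_skew = (C_buf / d_buf) * sum(                        # Eq. 5.2
        tau - T * math.floor(tau / T) for tau in skews)
    return p_ret + p_skew                                  # Eq. 5.3

# Three registers' worth of retimed weight on one edge; two registers with
# skews 2.5 and 0.5 against T = 1.0 each need only 0.5 time units of buffer
# delay after reduction modulo the clock period.
p = clock_power({('u', 'v'): 3}, [2.5, 0.5], T=1.0)
```

The area variant (Equations 5.4-5.6) is identical with Areg and Abuf substituted for the capacitances.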
Other than the endpoints, no guarantee is made that any single point is exactly optimal in either register count or delay; in general, the path that is generated will be suboptimal to varying degrees. Each of these retiming solutions defines the retiming component of the joint skew-retiming solution. The remaining skew necessary to meet the performance target Ttarg can be efficiently computed using Burns' algorithm [60]. The motivation behind this approach is the value in having a complete performance/cost curve available and the information it gives to optimize the desired cost function. A heuristic solution chosen with knowledge of its alternatives is often more valuable than meeting the performance target with a fixed but blindly chosen combination of the two optimizations, even if each is applied optimally.

5.4.1 Incremental Retiming

The process of incremental retiming for delay is the primary engine for generating the sequence of possible retimings. In incremental retiming, the registers are only retimed over a single gate at a time. A heuristic incremental retiming for minimum delay that produces near-optimal results but allows a wide variety of design constraints to be included in the problem is described by [47]. Because the decision-making is not premised on a simplified timing model, the timing information can be as accurate as needed, even including wire delays and other physical information. The flexibility in including constraints is particularly powerful. The authors of [47] concentrate on excluding solutions that violate physical constraints, but any move that leads to a blow-up of any implementation cost can be similarly blocked or delayed.

If the optimal minimum-delay retiming is required, [48] proposes an elegant incremental algorithm for finding the exact minimum-delay solution, even with non-uniform gate delays. The drawback of this approach is its simplistic timing model.
We intentionally do not specify an exact recipe for choosing incremental moves, because both of the above solutions offer different but useful tradeoffs of timing accuracy, computational effort, and optimality. It is also possible to devise other customized alternatives as the application sees fit. Our only requirements are that:

1. Each register is retimed over no more than one gate per iteration.

2. The solution is legal after every iteration.

3. The objective of each move is to minimize the worst-case path delay.

5.4.2 Overview

The overall algorithm is described in Algorithm 10 and illustrated graphically in Figure 5.2. The general progression consists of multiple applications of the supplied incremental retiming engine from a few important starting points (which will be described shortly). At each solution, the required skew is computed to meet the performance target Ttarg, and the implied total cost is evaluated. This combination gives us a new point along a performance/cost trade-off curve, along which the best solution(s) are retained for later use. After a fixed number of incremental retiming steps, the optimal minimum-delay retiming is (optionally) computed and examined. Then, the solution is retimed to the minimum-area solution, and incremental delay retiming is then applied for a number of iterations.

Starting from the original design, incremental delay retiming is applied. Once the first application of incremental retiming for delay has reached its limit at the minimum-delay solution, the set of points with better performance than the initial design has been fully explored; however, this is only half of the space. Next, minimum-register retiming is applied to generate a solution that has the exact minimum number of registers of any retiming.
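The single-gate moves demanded by requirements 1 and 2 above can be made concrete. A minimal sketch of one legal forward move on an edge-weight representation of the retiming graph; the gate and its edges are invented for illustration:

```python
def retime_forward(w, fanin, fanout, v):
    """One incremental forward move: take one register off every fan-in edge
    of gate v and place one on every fan-out edge.  Refused (returns False)
    if it would drive any edge weight negative, keeping the solution legal
    after every iteration (requirement 2)."""
    if any(w[e] < 1 for e in fanin[v]):
        return False
    for e in fanin[v]:
        w[e] -= 1
    for e in fanout[v]:
        w[e] += 1
    return True

# Invented gate 'g' with two fan-in edges and one fan-out edge: the move
# replaces two registers with one, the sharing effect of Figure 5.1(i).
w = {('a', 'g'): 1, ('b', 'g'): 1, ('g', 'c'): 0}
fanin = {'g': [('a', 'g'), ('b', 'g')]}
fanout = {'g': [('g', 'c')]}
moved = retime_forward(w, fanin, fanout, 'g')
```

A min-delay engine would select which gate to move by timing analysis (requirement 3); only the legality bookkeeping is shown here.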
The minimum-register retiming algorithm described in Chapter 2 is used for this step; it is scalable even to the largest circuits. Because it is also canonical in the number of registers, the result is not dependent upon the particular retiming supplied as an input.

Algorithm 10: End-to-end retiming: END2END()
  Input : a sequential circuit G, a target period Ttarg, number of retiming steps k
  Output: a retiming lag function r(v) and a clock skew schedule τ(r)
  // Original solution
  Gcur ← G
  rcur(v), rbest(v) : V → Z = 0
  τcur(r), τbest(r) : R → ℝ = computeskew(Gcur, Ttarg)
  costcur, costbest = cost(τcur, rcur)
  // Incremental retiming
  repeat
      Gcur, rcur = incremental retiming of Gcur
      τcur = computeskew(Gcur, Ttarg)
      costcur = cost(τcur, rcur)
      if costcur < costbest then costbest, rbest, τbest ← costcur, rcur, τcur
  until k times
  // Min-register solution
  Gcur, rcur = min-register retiming of G
  τcur = computeskew(Gcur, Ttarg)
  costcur = cost(τcur, rcur)
  if costcur < costbest then costbest, rbest, τbest ← costcur, rcur, τcur
  repeat
      Gcur, rcur = incremental retiming of Gcur
      τcur = computeskew(Gcur, Ttarg)
      costcur = cost(τcur, rcur)
      if costcur < costbest then costbest, rbest, τbest ← costcur, rcur, τcur
  until k times
  return τbest, rbest

Figure 5.2. Overall progression of retiming exploration.

A second phase of incremental retiming is then applied until a period at least as small as that of the original circuit has been recovered, at which point the algorithm terminates. The netlist at exit does not generally correspond to the original, and no guarantee can be made about the relative numbers of registers or the total cost. Empirically, after retiming across the entire performance axis, a heuristic incremental method is typically not able to reproduce the quality of the original and exits with a slight increase in register count.
This motivates the exploration in two segments, starting first from the initial netlist, as shown in Figure 5.2.

The smoothness of the resulting curve follows from the restriction on the incremental moves to be one gate at a time. The distance between two adjacent points on the retiming curve can thus be guaranteed to be within a certain delay granularity g. If the delay model is load-independent, then g can be bounded by Equation 5.16: the largest gate delay. If the delay model is load-dependent, that is, d(v) ≈ dintrinsic(v) + dload(v)Cload(v), the bound becomes Equation 5.17.

g = max_{∀v∈V} d(v)   (5.16)

g = max_{∀v∈V} [ dintrinsic(v) + dload(v)∆Cload(v) ]   (5.17)

In almost all cases, the total change in load capacitance along any path is very small; however, if necessary, the maximum change in capacitive load can be fixed by limiting the number of registers that can be retimed in any iteration.

5.5 Experimental Results

End-to-end retiming is applied to a set of industry-supplied and academic designs. All benchmarks were first pre-optimized using the ABC logic synthesis package [3]. The timing data was extracted using a full table-based slew- and load-aware timing analysis, and this model was used in an incremental min-delay retiming algorithm similar to [18]. The maximum mean cycle times were measured and used as the target performances. The following experiments were conducted on a set of 64-bit 2.33GHz Pentium Xeon computers.

First, the combined retiming/skewing obtained from end-to-end retiming is compared against the optimum solutions obtained from the exact formulation to illustrate the tradeoff between optimality and runtime. Because of the limitations of the exact method, only the smallest of the ISCAS benchmarks, solvable as a MILP within an hour of runtime, were used. The results of four of the largest designs are presented in Table 5.1.
In half of these cases, the optimal minimum-cost solution was found by heuristic end-to-end retiming, while the average runtime was over two orders of magnitude faster. Next, end-to-end retiming was applied to a set of larger benchmarks [38]. Our technique was used to minimize both the dynamic power consumption of the clock endpoints (Equation 5.3) and the total area of the required registers and buffers (Equation 5.6).

Table 5.1. Runtime and quality of exact and heuristic approaches.

            Exact Min-Power       Heuristic Min-Power
Name        Area     Runtime      Area     Runtime
s349        2.80e3   49.0s        3.00e3   0.07s
s526n       3.70e3   34.0s        4.14e3   0.03s
s1196       2.92e3   1.5s         2.92e3   0.04s
s1423       1.25e4   2.3s         1.25e4   0.14s

Table 5.2. Power-driven combined retiming/skew optimization.

Name            Orig   Sk-Only  Ret-Only  Comb   %Improv
mux32 16bit     7.7    8.3      10.5      8.3    0.0%
mux64 16bit     15.1   15.9     23.2      15.2   4.4%
mux8 128bit     16.6   20.1     24.6      15.2   24.4%
mux8 64bit      8.3    10.1     12.3      7.9    22.3%
nut 000         4.7    7.5      5.7       5.6    2.6%
nut 001         7      19.2     13.8      10.8   21.7%
nut 002         2.4    3.4      3.3       3.1    4.6%
nut 003         3.8    4.2      4.4       3.8    7.9%
oc ata ocidec2  4.4    4.7      4.7       4.5    4.0%
oc ata v 3      2.3    3        2.8       2.7    4.6%
oc cordir p2r   10.4   57.7     38.6      22     43.0%
oc hdlc         6.1    6.1      6.2       5.5    10.1%
oc minirisc     4.2    4.3      4.3       4.2    0.7%
oc mips         18.1   41.7     25.5      20.8   18.4%
oc oc8051       10.9   10.9     11.1      10.7   1.8%
oc pavr         17.7   34.1     26.5      23.3   12.1%
oc pci          19.5   21.8     21.2      20.3   4.2%
oc vga lcd      16     16.4     18.1      15.9   3.0%
oc wb dma       25.6   27.7     27.4      26.1   4.7%
os blowfish     12.8   28.3     20.2      14     30.7%
radar12         55.8   74.8     59.5      59.4   0.2%
AVERAGE                                          10.7%

The power, area, and skew buffer delay values were taken from the GSC 0.13um standard cell library provided with [39]. Short-path timing was not an issue in these designs. The power-driven results are presented in Table 5.2. The dynamic clock power consumption of the original design (before delay optimization) is listed in column Orig.
The sequential elements in the circuit were then optimized to meet the minimum feasible delay using three different methods: only skewing, only retiming, and a combined application of retiming and skewing computed with our algorithm. These results are listed under columns Sk-Only, Ret-Only, and Comb, respectively. In the few cases where retiming alone was not able to meet the target (due to the discrete delays of the gates), the difference was corrected with a small amount of skew and included in that cost. Finally, the column %Improv indicates the improvement in the power using our technique over the best of either the skew-only or retiming-only solution. On average, the combined solution results in 10.7% less dynamic clock power consumption.

Table 5.3. Area-driven combined retiming/skew optimization.

Name            Orig      Sk-Only   Ret-Only  Comb      %Improv
mux32 16bit     8.64E+4   8.88E+4   1.14E+5   8.68E+4   2.3%
mux64 16bit     1.70E+5   1.73E+5   2.58E+5   1.64E+5   5.2%
mux8 128bit     1.87E+5   2.02E+5   2.72E+5   1.68E+5   16.8%
mux8 64bit      9.38E+4   1.01E+5   1.36E+5   8.55E+4   15.3%
nut 000         5.28E+4   6.43E+4   6.44E+4   6.24E+4   3.0%
nut 001         7.84E+4   1.29E+5   1.15E+5   1.03E+5   10.4%
nut 002         2.76E+4   3.14E+4   3.66E+4   2.90E+4   7.6%
nut 003         4.29E+4   4.36E+4   4.98E+4   3.88E+4   11.0%
oc ata ocidec2  4.91E+4   9.64E+4   5.30E+4   4.89E+4   7.7%
oc ata v 3      2.54E+4   9.64E+4   3.06E+4   2.82E+4   7.8%
oc cordir p2r   1.17E+5   3.14E+5   2.90E+5   2.53E+5   12.8%
oc hdlc         6.90E+4   6.89E+4   7.00E+4   6.11E+4   11.3%
oc minirisc     4.68E+4   4.72E+4   4.72E+4   4.72E+4   0.0%
oc mips         2.04E+5   3.02E+5   2.87E+5   2.30E+5   19.9%
oc oc8051       1.22E+5   1.22E+5   1.24E+5   1.21E+5   0.8%
oc pavr         2.00E+5   2.68E+5   2.98E+5   2.55E+5   4.9%
oc pci          2.19E+5   2.29E+5   2.38E+5   2.24E+5   2.2%
oc vga lcd      1.80E+5   1.81E+5   2.04E+5   1.76E+5   2.8%
oc wb dma       2.88E+5   2.96E+5   3.08E+5   2.92E+5   1.4%
os blowfish     1.44E+5   2.09E+5   1.79E+5   1.43E+5   20.1%
radar12         6.28E+5   7.05E+5   6.54E+5   6.53E+5   0.2%
AVERAGE                                                 7.8%

The area-driven results are presented in Table 5.3.
The columns are identical in meaning to those of the power-driven version, but here we measure the total number of layout unit squares required for the registers and clock skew buffers. The average reduction in area using the combined optimization is 7.8% better than the best solution of either retiming or skewing alone.

Table 5.4. Results summary.

Name            Ttarg/Torig  Power %Improv  Area %Improv  Runtime (s)
mux32 16bit     0.37         0.0%           2.3%          2.8
mux64 16bit     0.36         4.4%           5.2%          9.3
mux8 128bit     0.38         24.4%          16.8%         10.4
mux8 64bit      0.38         22.3%          15.3%         2.7
nut 000         0.46         2.6%           3.0%          2.3
nut 001         0.49         21.7%          10.4%         16.7
nut 002         0.41         4.6%           7.6%          1.2
nut 003         0.65         7.9%           11.0%         2.6
oc ata ocidec2  0.93         4.0%           7.7%          5.1
oc ata v 3      0.93         4.6%           7.8%          5.8
oc cordir p2r   0.71         43.0%          12.8%         149.8
oc hdlc         0.93         10.1%          11.3%         16.3
oc minirisc     0.92         0.7%           0.0%          3.9
oc mips         0.95         18.4%          19.9%         380.9
oc oc8051       0.95         1.8%           0.8%          164.9
oc pavr         0.82         12.1%          4.9%          292
oc pci          0.55         4.2%           2.2%          90.1
oc vga lcd      0.71         3.0%           2.8%          293.4
oc wb dma       0.87         4.7%           1.4%          614.7
os blowfish     0.53         30.7%          20.1%         165.9
radar12         0.51         0.2%           0.2%          2337
AVERAGE                      10.7%          7.8%

This is not an entirely fair method of comparison, however. As the size of the original design is fixed by functionality and timing, it is really only the increase required to meet the delay target that we are targeting for improvement. Alternatively, we can compare the difference in additional power or area required to improve the worst-case delays from the original values to the tighter targets. On average, meeting the delay target with combined retiming/skewing required over 131% less additional area and 79% less additional dynamic power than the best of either technique in isolation. In many cases, the faster period was met using less area and/or power than the slower original design (resulting in the >100% reduction in average additional area). This is due to the improvement in area from the application of the technique of Chapter 2. Table 5.4 summarizes both the power- and area-driven results.
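To make the "additional area" comparison concrete, here is a small worked example with made-up numbers (not taken from the tables): when the combined optimization meets the tighter target using less area than the original design, its additional area is negative, and the reduction relative to the single-technique baseline exceeds 100%.

```python
# Illustrative numbers only: how a >100% reduction in *additional* area arises.
orig       = 100.0   # area of the original, slower design
best_alone = 130.0   # best of skew-only / retiming-only at the tight target
combined   = 95.0    # combined retiming+skew meets the target *below* orig

extra_alone    = best_alone - orig   # +30 units of additional area
extra_combined = combined - orig     # -5 units: negative additional area
reduction = (extra_alone - extra_combined) / extra_alone
# reduction = 35/30, i.e. about 117%: more than the baseline's entire increase
```

The same arithmetic applied to additional power explains the 79% figure reported above.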
Here, we also specify the target delay constraint that was used. Column Ttarg/Torig expresses this as a fraction of the original period, indicating the relative aggressiveness of the delay optimization. The column Runtime measures the total runtime of the heuristic in seconds; this was identical for both the area- and power-driven optimizations.

Figure 5.3 illustrates the cost curve generated by end-to-end retiming in more detail for two benchmarks. The two regions represent the relative contributions of skew buffers and registers to the total dynamic power consumed on the leaves of the clock tree. The circuit “nut 003” is a case where the minimum-cost solution lies in the portion of the retiming curve revealed by minimum-register retiming. The aggressive target timing (Ttarg = 24.0) can be met with the minimum dynamic power consumption by first retiming to a slower period (T = 49.0) and then recovering the performance with clock skew. The increase in skew buffers is outweighed by the decrease in the number of registers.

The balance between retiming and skewing in the minimum-cost solution was highly design dependent. In some of the benchmarks, the result consisted of either maximally retiming or maximally skewing (as is the case in “oc pavr” and “nut 003”, respectively, in Figure 5.3). However, in general, the minimum-cost solution utilized instances of both.

5.6 Summary

The technique efficiently explores a smooth set of retiming solutions between the minimum-register and minimum-delay retiming solutions. Because the cost of retiming is unpredictable a priori, combined retiming and skewing allows an informed decision about the best balance between the two optimizations. This was used to minimize the dynamic power in the clock tree. For the set of benchmarks examined, the total dynamic power consumption of the clock tree endpoints was reduced by an average of 10.7%.

Figure 5.3. Dynamic power of two designs over course of optimization.
Chapter 6

Clock Gating

In this chapter we examine an algorithm for introducing clock gating, a power- and area-optimization technique whereby the clock signal is selectively propagated to subgroups of registers in the design. Because clock gating represents a fundamentally different optimization technique than retiming and clock skew scheduling, the content of this chapter is fairly self-contained. The necessary background will be introduced herein.

The chapter begins by introducing and motivating the technique of clock gating. The existing approaches to computing clock gating conditions are summarized in Section 6.2. Our new simulation- and SAT-based approach is described in Section 6.3. Experimental results are presented in Section 6.5.

6.1 Problem

Clock gating inserts conditions on the propagation of a clock transition to one or more registers in the design. By limiting unnecessary switching, the dynamic power required to charge and discharge the capacitive load of the register inputs is reduced. The capacitance of a large group of registers is thereby shielded behind the smaller capacitance of a single clock gate.

The condition under which a clock transition is inhibited is known as the gating condition, clock disable, or activation function. In general, this function G may be sequential and dependent on variables from previous time frames. Architectural implementations of clock gating often implicitly make use of this property to implement more powerful conditions (e.g. through the use of a dedicated low-power controller). In this work, we restrict the problem to the combinational version. We also focus on the application of clock gating to a netlist-level circuit. All of the functionality is assumed to have been decomposed into a set of standard cells. Physical information may be available.
The problem is how to find a function G that preserves functionality, maximizes the power savings, minimizes the perturbation to the netlist, and can be identified and synthesized in a manner that is scalable to large designs.

6.1.1 Motivation

The dynamic switching of the clock network typically accounts for 30-50% of the total power consumption of a modern design, and with the proliferation of low-power requirements and thermal limitations, minimizing this total is imperative. One of the most effective and widely adopted techniques is clock gating, whereby the clock signal is selectively blocked for registers in the design that are inactive or do not otherwise need to be switched.

6.1.2 Implementation

Two example implementations of clock gates are depicted in Figure 6.1. If the gating condition is true (here, G = g1 ∨ g2), the clock will be blocked from passing through the clock gate. If G is monotonic or its transitions are strictly confined to one half of the clock cycle, the gating can be implemented with only one logic gate (e.g. Figure 6.1(i)). Otherwise, the gating condition must be latched as in Figure 6.1(ii) to prevent glitches from being propagated onto the clock line. Glitches are undesirable because of both the extra dynamic power required and the potential change in the sequential behavior of the circuit.

In the circuit of Figure 6.1(ii), any glitch is filtered either by the controlling logic value of the clock or by the non-transparency of the latch; each cycle of the clock will be either fully propagated or completely constant. Many standard cell libraries include a merged gate and latch, often referred to as a clock gating integrated cell (CGIC).

Figure 6.1. Clock gating circuits.

6.2 Previous Work

The most common clock gating approach is to identify architectural components that can be deactivated and to explicitly design the control logic of the gating signal.
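As an aside, the glitch-filtering behavior of the latch-based clock gate of Figure 6.1(ii) can be modeled in a few lines. This is a behavioral sketch only, with an active-high enable (the complement of the gating condition G) and illustrative signal names.

```python
# Behavioral model of a latch-based clock gate (CGIC-style): a latch that is
# transparent while the clock is low feeds an AND gate with the clock.
def gated_clock(clk_trace, enable_trace):
    out, latched = [], 0
    for clk, en in zip(clk_trace, enable_trace):
        if clk == 0:
            latched = en          # latch is transparent only while clk is low
        out.append(latched & clk) # gated clock = latched enable AND clock
    return out
```

Because the latch is opaque while the clock is high, an enable glitch during the high phase cannot reach the gated clock: each clock cycle is either fully propagated or held constant, as described above.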
However, the benefits of gating can also be extended to very local sections of the circuit and small clusters of registers. Utilizing clock gating at this finer level of design abstraction requires automatic synthesis to be practical.

6.2.1 Structural Analysis

The most straightforward automatic synthesis of useful clock gating conditions relies solely on structural analysis. A typical implementation of structural gating involves either using existing synchronous enables or detecting multiplexors at the input of a register that implement synchronous-enable-like behavior. These two structures are illustrated in Figure 6.2(i) and (ii), respectively.

Figure 6.2. Opportunities for structural gating.

The advantage of structural methods is that runtime is quite fast, growing only linearly with the number of registers in the circuit. Only a small region local to each register is examined for specific patterns that imply the ability to gate the register. The disadvantage is that the limitation on the utilized gating functions is unnecessarily strict and may miss significant potential for additional savings. A simple example is illustrated by Figure 6.3: each of the pair of registers may be gated by the function g. This is demonstrated in the truth table (ii). The columns G^max_R1 and G^max_R2 describe the cases where it is safe to gate each of the respective registers. Note that there is no self-feedback loop from the register outputs to the register inputs. These missed gating opportunities can be caught by using a stronger functional analysis.

6.2.2 Symbolic Analysis

Even if there is not a physical signal whose structure indicates that a register can be gated, it is possible to compute and analyze the next state function of a register to generate a functional description of when it is safe to gate it. The methods of [61] [62] use symbolic representations to directly compute the conditions under which a register does not switch.
This requires generating a BDD [63] for the next state function of a register. Unfortunately, the limits of such symbolic functional manipulation are often below the size of even moderate designs. This problem is further compounded by the need to find conditions that are able to gate multiple registers simultaneously, requiring that multiple next state functions be kept in memory. The grouping of gated registers may not even be known a priori; it may be desirable to consider all of them simultaneously. Constructing the BDDs for the entire circuit is expensive and often impossible.

Figure 6.3. Non-structural gating.

Once (and if) a gating condition can be derived symbolically, it must be implemented in the netlist. This requires a general synthesis method to implement the BDDs as mapped logic. A strong disadvantage of this technique is that this general synthesis may result in an unknown amount of additional logic, as is suggested by Figure 6.4. Even if the physical design is not disrupted by the additional area and wire requirements of this hardware, its dynamic power consumption eats away at the power-saving benefits that it seeks to provide. It is therefore necessary to prune the coverage of the function to save on implementation cost, but determining a good balance is a difficult synthesis problem. When timing must also be considered, the complexity increases further.

Figure 6.4. Unknown relationship between BDDs and post-synthesis logic.

6.2.3 RTL Analysis

There are also clock-gating techniques that target higher-level descriptions of a system than a netlist. One example is described in [64] and operates on an RTL description of a system. A second example is the industrial tool PowerPro from Calypto [65]. Because the design representation is abstract, an RTL-level analysis may facilitate a functional analysis that is more powerful and deeper than is possible with a finer abstraction.
For example, it is often possible to identify idle cycles at the beginning of a pipelined computation (sometimes known as pipeline “bubbles”) and use these to gate the subsequent registers. This analysis requires reaching across multiple clock cycles and complex arithmetic components and is not suited to a gate-level approach.

The critical disadvantage of an RTL-based approach is the lack of information about design timing, placement, or logic implementation. It is not possible to back-annotate or predict this information with much accuracy and still reorganize the RTL netlist. This makes the consequences of clock gating on important metrics such as timing and area (and even power) difficult to predict. The same problem suffered by symbolic gating techniques is faced to an even greater degree by RTL-based methods. Ideally, both RTL- and netlist-level clock gating have an important place in a low-power design flow. The new algorithm described in this chapter is not meant to replace other levels of analysis so much as to complement them.

6.2.4 ODC-Based Gating

The work of [62] leverages another opportunity to gate unnecessary clock transitions. Gating can be applied not only when the state of a register does not change but also when a change that does occur is never observable at a primary output. This presents an opportunity to gate a register that is independent of whether it is switching or not. The ODC-based techniques can be applied in parallel with ones that predict switching. A combined algorithm for capturing both observability and switching is of interest and in development, though it will not be discussed within the scope of this work.

6.3 Algorithm

We examine the automatic synthesis of combinational gating logic for netlist-level circuits and propose an approach that addresses the dual problems of gating condition selection and synthesis by constructing these functions out of signals in the existing logic network.
While this is less flexible than the synthesis of an arbitrary function, the result is still quite good and, importantly, scalable to large designs with very predictable results. We also discuss how necessary constraints on the placement and timing of the design can be included in the problem. These allow control over the resulting netlist perturbation. In particular, the algorithm introduced in this chapter seeks to maximize the power savings by finding a gating condition GR for each register R such that GR is the disjunction of up to M literals, as described in Equation 6.1.

GR = ∨_{j=1..M} gj(x)   (6.1)

6.3.1 Definitions

We model a circuit to be clock gated as a hypergraph whose nodes are either single-bit registers or single-output combinational logic nodes. The combinational nodes may implement arbitrary functions. If there are multiple clock domains, each group of similarly-clocked registers must be gated separately. While the same gating conditions can be used in multiple clock domains, the clock gates themselves can not be shared. The net costs of implementing these gating opportunities are therefore independent, and it is desirable from a complexity perspective to treat the problems independently.

Let x be the set of external inputs and current state variables. xR is the current state of register R, and FR(x) is its next state function. Let fn(x) be a function of the external inputs and current state variables that is implemented at some circuit node n’s output. We define a literal gn(x) to be either fn or its complement. The set of literals is the set of functions implemented at node outputs and their complements.

The support of function fn(x), support(fn), is the subset of x (the primary inputs and current state variables) on which the function has an algebraic dependence. Structurally, this implies that node n lies in the transitive fan-out of x: support(fn) = {x : n ∈ TFO(x)}.
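The structural support just defined can be computed by a transitive fan-in traversal. The sketch below assumes a toy netlist encoding, where `fanins` maps each node to its fan-in tuple and nodes with no fan-ins are primary inputs or current-state variables; it is not tied to any particular netlist data structure.

```python
# Structural support of a node: the set of leaves (primary inputs and state
# variables) reachable in its transitive fan-in cone.
def support(node, fanins):
    seen, stack, leaves = set(), [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        ins = fanins.get(n, ())
        if ins:
            stack.extend(ins)     # keep walking toward the leaves
        else:
            leaves.add(n)         # a leaf: primary input or state variable
    return leaves
```

Note that this is the structural over-approximation of the support: a function may be algebraically independent of a variable that nonetheless appears in its fan-in cone.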
To maintain the functional correctness of the circuit, each register’s gating condition GR(x) must only be active when the register does not change state. This functional correctness condition is described by Equation 6.2.

GR(x) ⇒ ¬(FR(x) ⊕ xR)   (6.2)

If GR(x) = ¬(FR(x) ⊕ xR), then it is the unique maximal complete gating condition; otherwise it is incomplete. The complete condition captures every set of inputs for which the register doesn’t switch. Because the timing requirements of the clock gate typically necessitate that GR is available earlier than FR, it is desirable to find an incomplete gating condition that can be generated early in the clock cycle with maximal coverage and minimal implementation cost. Furthermore, the maximal gating condition is typically not useful for gating multiple registers, and power considerations typically dictate that a condition that is incomplete but correct for multiple registers is chosen.

We define two probabilistic quantities of interest. Given a set of simulation vectors v_n^1..i for net n, the signal probability P^signal(n) and switching probability P^switch(n) are defined by Equation 6.3 and Equation 6.4, respectively.

P^signal(n) = (1/i) Σ_{j=1..i} [ 1 if v_n^j, 0 otherwise ]   (6.3)

P^switch(n) = (1/(i−1)) Σ_{j=1..i−1} [ 1 if v_n^j ⊕ v_n^{j+1}, 0 otherwise ]   (6.4)

The concepts of signal and switching probability can be extended to functions that are present in the netlist but describe combinations of physical nets.

6.3.2 Power Model

The power that is saved by implementing a set of gating signals G for some set of registers RG ⊆ R (where Gr is the signal used to gate each register r) is approximated by Equation 6.5. This quantity is a function of (i) the probability that Gr disables a given clock, P(Gr), (ii) the number of registers gated, and (iii) the relative capacitances of the register clock inputs and the clock gate, Cr and Ccg respectively.
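The probabilistic quantities of Equations 6.3 and 6.4 and the savings estimate of Equation 6.5 translate directly into code. This is a minimal sketch over 0/1 traces; the name `gating` (a map from each unique gating signal to its estimated probability and the registers it gates) is an illustrative assumption, not the dissertation's data structure.

```python
def signal_probability(v):
    # Equation 6.3: fraction of simulated cycles in which the net is 1
    return sum(v) / len(v)

def switching_probability(v):
    # Equation 6.4: fraction of adjacent cycle pairs in which the net toggles
    return sum(a ^ b for a, b in zip(v, v[1:])) / (len(v) - 1)

def power_savings(gating, C_reg, C_cg):
    # Equation 6.5: gated clock-input capacitance weighted by P(G), minus
    # one clock gate's capacitance per unique gating signal G
    saved = sum(p * sum(C_reg[r] for r in regs)
                for p, regs in gating.values())
    return saved - C_cg * len(gating)
```

The negative term makes the sharing pressure explicit: a gating signal pays for one clock gate regardless of how many registers it covers.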
It is assumed that the clocks of each register are always switching, but if this is not the case (perhaps due to existing clock gating), these cases can be probabilistically excluded from P(Gr). The registers in the design need not all have identical clock input capacitances.

Psavings = Σ_{∀r∈RG} Cr P(Gr) − Σ_{∀ unique G} Ccg   (6.5)

Typical values of Ccg and Cr imply that the gated clock signals must be shared amongst multiple registers, for each of which the corresponding gating condition must be valid. This motivates a global approach to the clock gating synthesis problem, where multiple registers are considered at the same time. Additional dynamic power may be dissipated in the logic network by increasing the fan-out loads of the literals used to generate a gating condition, but this is typically much smaller than the power saved in the clock network. We restrict our power analysis to only the clock network.

6.3.3 Overview

Our approach is based on the combination of simulation and SAT-checking. This algorithmic duo has proven to be incredibly powerful in several contexts, and it is useful in clock gating synthesis as well. Random simulation quickly identifies most of the signals in the design as being useless for gating a given register, and a SAT solver is used to conclusively prove the functional correctness of those that remain. The overall steps of our technique are summarized in Algorithm 11 and described in the following sections in sequential order.
Algorithm 11: Simulation and SAT-based Clock Gating: SIMSAT_CLG()
  Input : a sequential circuit graph <V, E>
  Output: a correct gating condition GR for each register R, if one exists
  let R be the set of registers
  forall r ∈ R do collect candidate literal set Vcand(r)
  repeat
    prune every literal v ∈ Vcand(r) for which v ∧ (Fr(x) ⊕ xr) is observed
  until k simulation steps
  prove for each remaining literal v that v ⇒ ¬(FR(x) ⊕ xR)
  create disjunctive candidate sets ⊂ Vcand(R)
  select a subset of disjunctive sets G to cover registers R
  return selected gating conditions

6.3.4 Literal Collection

The first step consists of extracting a set of candidate nodes for the inputs of the gating signal GR, for each register R. Inclusion in the candidate set is not (yet) meant to imply correctness. All node functions and their complements could initially be considered as candidates for each register, but it is useful and necessary to immediately narrow the set by removing nodes that violate timing, physical, or structural constraints.

Timing Constraints

The added delay of the clock gating logic and the clock gate itself dictates that every candidate literal be available earlier than the latest register input. This can be expressed by a timing constraint as in Equation 6.6: ag is the latest arrival time at g, dgate is the delay through the clock gate, and rclk is the required time at which the clock must be gated. If all of the times are relative to the same clock, this can be expressed in terms of the period T, as in Equation 6.7.

ag + dgate ≤ rclk   (6.6)

ag + dgate ≤ T − Sclk   (6.7)

We also introduce a term Sclk, the setup time of the clock gate. The meaning of this quantity is identical to the setup time at a register input. While delay is typically measured between the mid-points of two transitions (which is what has been assumed here), it is not sufficient to have the clock gating condition still in transition (either at its input or internally within the clock gate) when the clock edge arrives.
The partially-switched enable transistor would introduce additional slew onto the clock line; this is typically unacceptable. The quantity Sclk therefore pads the slack to ensure that the clock gating condition is fully “setup”. The consequence of the timing constraints is that it is only necessary to select from amongst nodes that will be available early enough to meet the timing requirements. The literals can be further subdivided into groups that restrict exactly how they can be used in a gating condition (e.g. only directly, complemented, in a disjunctive set, etc.). This is illustrated in Figure 6.5.

Figure 6.5. Timing constraints based upon usage.

Physical Constraints

It is undesirable to route gated clock signals over large distances, as a long wire propagation delay may result in a late-arriving gating control signal and a timing violation. These long wires may also unnecessarily complicate routing. Constraints on the proximity of the candidate gating literal and the gated registers are therefore necessary. To this end, we introduce distance constraints of the form of Equation 6.8: (xgr, ygr) is the placed location of the candidate literal and (xr, yr) the location of a register. dmax is the proximity constraint, here in terms of L1 distance.

distL1(gr, r) = |xgr − xr| + |ygr − yr| ≤ dmax   (6.8)

The result is to limit the region from which the literals used in gating conditions are selected to one that is local to the register(s) to be gated. This is illustrated in Figure 6.6.

Figure 6.6. Distance constraints.

An important effect of the distance constraints is to bound the number of literals that are considered for any register in the design. In practice, the size of the die is sufficiently larger than the maximum allowable separation, and this is a strong constraint on the literal count. This permits a linear worst-case bound on the runtime of the literal collection: O(|R|), where R is the set of registers to be gated.
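The timing and proximity filters of Equations 6.7 and 6.8 can be combined into a single candidate test, sketched below. The dictionary fields (an arrival time plus a placed location) are hypothetical names chosen for illustration.

```python
# Candidate literal filter combining the timing constraint (Equation 6.7)
# and the L1 proximity constraint (Equation 6.8).
def candidate_ok(lit, reg, T, S_clk, d_gate, d_max):
    # timing: gated clock must settle before the edge, padded by S_clk
    if lit["arrival"] + d_gate > T - S_clk:
        return False
    # proximity: L1 distance between literal and register placements
    dist = abs(lit["x"] - reg["x"]) + abs(lit["y"] - reg["y"])
    return dist <= d_max
```

Applying this filter per register before any functional analysis is what keeps the candidate count (and hence the overall literal-collection runtime) bounded.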
With a model to estimate wire delays from pin locations, it is possible to implicitly constrain the proximity of the gating logic and the gated register using the timing constraints. A routing estimator may provide a function dwire(pdriver, pload, P′load), whose inputs are pdriver, the location of the driving pin, pload, the location of the load pin of interest (e.g. the clock gate), and P′load, the locations of the other load pins. Including this information in the timing constraint of Equation 6.6, we have something of the form of Equation 6.9. This physically-aware timing constraint is now a function of location.

agr + dgate + dwire(pgr, pgate, P′fanout(gr)) ≤ rclk   (6.9)

Structural Constraints

The candidate literals can be restricted to those whose structural support is partially common to the next state function or includes the register output. This is implied by Equation 6.2. If this were not the case for a candidate literal, it could not possibly have any predictive value in determining whether the register might switch. (Unless the register always switches.)

A tighter structural constraint can be applied if it is known whether the register contains its output in the support of its next state function (i.e. xR ∈ support(FR(x))). If runtime considerations require bounds to be placed on structural traversal, such that it can not be determined whether a register has any self-feedback path, it is conservative to assume that no such loop exists.

(support(g) ∩ support(FR(x)) ≠ ∅) ∨ (xR ∈ support(g))   (6.10)

{ support(g) ∩ support(FR(x)) ≠ ∅, if xR ∈ support(FR(x)); (support(g) ∩ support(FR(x)) ≠ ∅) ∧ (xR ∈ support(g)), otherwise }   (6.11)

Signal and Switching Probability Constraints

As it is only possible to gate a register when it does not switch state (assuming there is no knowledge of its future observability), registers that are frequently switching are not likely to offer many possibilities to gate their clocks.
To reduce the runtime of the gating optimization, it may be desirable to exclude such registers from consideration. Given the switching probability of a register output RQ, we can ignore those that exceed some maximum frequency smax, as in Equation 6.12.

P^switch(RQ) > smax   (6.12)

Similarly, literals that are not often true are not likely to gate the clock with much frequency. While these would probably never be selected by the optimization technique, it may be beneficial to exclude them from the beginning to reduce runtime. Given the signal probability of a literal g, we ignore those that fall below some threshold pmin, as in Equation 6.13.

P^signal(g) < pmin   (6.13)

6.3.5 Candidate Pruning

Because the sum of a set of terms satisfies the correctness condition (Equation 6.2) only if each term satisfies it, a literal is only useful in a gating condition if it is itself a correct gating signal. Therefore, each literal is only kept as a candidate if it is not inconsistent with this condition. Simulation is applied in several passes to prune the set of candidate literal/register pairs. The pruning passes are quite fast and effective, and if any literal is found to violate the correctness condition, it is immediately removed from consideration.

Besides generating counterexamples to the correctness condition for candidate literals, simulation provides a probabilistic estimate of the number of unnecessary clock transitions that each legal literal g will block, P(g). This provides more accurate information than assuming that the size of the Boolean ON-set of a gating condition correlates to its actual ON-probability.

6.3.6 Candidate Proof

Once the set of candidates has been reduced with pruning to literal/register pairs that are reasonably likely to be legal, these are proved to satisfy the correctness condition using a satisfiability solver. The test structure is depicted in Figure 6.7.
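For intuition, the check embodied by the test structure can be written as a brute-force enumeration over all input assignments; a SAT solver performs the same test without explicit enumeration. The function names below are illustrative, not part of the actual implementation.

```python
from itertools import product

# Brute-force stand-in for the SAT check: verify g(x) => not(F_R(x) XOR x_R)
# for every assignment. Tractable only for tiny cones; SAT replaces it at scale.
def proves_gating(g, F_R, xR_index, n_vars):
    for bits in product([0, 1], repeat=n_vars):
        if g(bits) and F_R(bits) != bits[xR_index]:
            return False  # counterexample: gated while the register switches
    return True
```

A usage example, with x = (state, enable, data) and a mux-style register that holds its value when the enable is low:

```python
F = lambda x: x[2] if x[1] else x[0]          # next state function
proves_gating(lambda x: not x[1], F, 0, 3)    # "enable low" is a valid gate
```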
If the output is satisfiable, then there exists an input that violates the correctness condition; otherwise, g is now known to be a valid gating condition for register R. There are two ways that the features of a modern SAT solver can be leveraged to speed up the repeated proving of candidate gating conditions. A single problem structure for the circuit can be constructed and reused for repeated queries to the same solver running in incremental mode. Because the portion of the problem describing the circuit functionality does not change, learned clauses are kept to speed up future runs. Alternatively, the structure that is generated can be restricted to only the transitive fan-in cones of the registers and gating conditions under test. As these regions may comprise only a tiny fraction of the total circuit, the overall size of the SAT problem is dramatically reduced.

Figure 6.7. Proving a candidate function.

Our experience indicates that the latter is the faster of the two methods for CNF-based SAT solvers that generate assignments to all circuit variables. (This was the case with our use of MiniSat [66].) This may not hold true for circuit-based SAT. The counterexamples returned by the SAT solver are also of immense utility. Because these exercise corners of the Boolean space that were not reached by simulation alone, they are likely to serve as counterexamples for other not-yet-disproven candidate literals. We use these results to further prune the candidates.

6.3.7 Candidate Grouping

From simulation, an estimate of the probability that each node inhibits the switching of each register is known. However, assuming that it is not possible to keep the full set of simulation traces for every node, we lack any information about the correlations between these probabilities. The minimum-cost set of activation functions cannot be constructed without information about their overlap.
Instead, an intermediate set of candidate groups is generated, each a set of 1 to M unique nodes. A candidate set describes a potential gating condition G of the form of Equation 6.1 for one or more registers. If M is sufficiently small, it is feasible to enumerate all such subsets, but we also propose the following heuristic to extract a number of useful candidate sets. This method is motivated by its guarantee that there will exist at least one cover for each register that contains each legal term. A set is constructed from a "seed", consisting of a node/register pair. The seed node is inserted into the set, and the set is incrementally expanded by adding candidates to maximize the sum of the gating probabilities under the constraint that the set will continue to gate the seed register. The initial set can legally gate exactly the registers of the initial candidate. Each addition may simultaneously (i) shrink the set of registers for which the set is valid, because if one term is not a legal gating condition for a register, the set is no longer a legal gating function for that register, and (ii) increase the probability that the clock transitions of others will be gated. Again, because information about the correlation between the multiple elements in the set is not known, the probabilities are heuristically summed (and may therefore be greater than one). The process terminates when there are no more candidates to be greedily added or the maximum set size M has been reached. Non-unique results can be discarded.

Figure 6.8. Heuristic candidate grouping.

The candidate grouping process is illustrated in Figure 6.8 for the seed pair (g3, R2). The columns of the table are the gateable registers, and the rows of the table are the candidate gating functions.
The value at each entry is the probability that the corresponding function will gate the corresponding register; empty entries indicate that the resulting gated register would not be functionally correct. Consisting of only the seed function, the initial set {g3} can gate the registers R1, R2, and R3 with a summed gating probability of 0.6. This combination is contained within the box labelled 1. We then proceed to greedily add other signals to the set. While adding g5 would increase the sum of the set by the largest net amount (1.6), it would remove the seed register and is therefore ignored. Function g1 represents the next best improvement: the estimated gating probability of registers R1 and R2 is increased by 1.6, but register R3 must be dropped from the set, resulting in a net increase of 1.4. The new combination is depicted with the box labelled 2. Lastly, we add the function g4 for the final candidate set of {g1, g3, g4}, with a summed gating probability of 2.4. There are no other candidates whose greedy addition would increase this total.

6.3.8 Covering

The circuit is simulated again (with actual simulation traces, if available), and probabilistic information is collected about the candidate sets, thereby capturing the correlated probabilities. The correlation between sets is not pertinent: each register can only be switched by a single gated clock, generated by the one gating condition that is chosen for it. (This restriction can be relaxed by amending our technique to employ hierarchical clock gating.) The problem now reduces to the weighted maximum set covering problem, where the weight of each element set is exactly its net contribution to the power objective (as in Equation 6.5), the total dynamic power. If an insufficient number of registers or clock transitions are gated, the net weight of an element may be negative; these will never be selected. The maximum set covering problem is NP-hard, but there exist good heuristics.
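The grouping heuristic of Section 6.3.7 can be sketched as follows. The probability table below is hypothetical, chosen so the greedy trace mirrors the Figure 6.8 walkthrough (g5 is excluded despite its +1.6 because it drops the seed register, g1 nets +1.4, and the final set sums to 2.4); missing entries mark illegal node/register pairs.

```python
def grow_candidate_set(seed_node, seed_reg, table, max_size):
    """Greedily expand a candidate set from a seed node/register pair,
    maximizing the summed gating probability while never dropping the seed
    register. table[node][reg] is the estimated probability that node gates
    reg; absent entries are illegal pairs."""
    group = {seed_node}
    regs = set(table[seed_node])            # registers the set can legally gate

    def score(regset, nodes):
        return sum(table[n][r] for n in nodes for r in regset)

    while len(group) < max_size:
        best, best_gain = None, 0.0
        for cand in table:
            if cand in group:
                continue
            new_regs = regs & set(table[cand])
            if seed_reg not in new_regs:    # would drop the seed register
                continue
            gain = score(new_regs, group | {cand}) - score(regs, group)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break
        group.add(best)
        regs &= set(table[best])
    return group, regs, score(regs, group)

# Hypothetical table in the spirit of Figure 6.8.
table = {
    'g1': {'R1': 0.8, 'R2': 0.8},
    'g3': {'R1': 0.2, 'R2': 0.2, 'R3': 0.2},
    'g4': {'R1': 0.2, 'R2': 0.2},
    'g5': {'R1': 0.9, 'R3': 0.9},           # never added: cannot gate seed R2
}
group, regs, total = grow_candidate_set('g3', 'R2', table, max_size=3)
```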
The problem is also less difficult for practical circuits because of the relatively small number of partially overlapping sets. We utilize the greedy addition heuristic [67]. Once a subset of candidate sets has been selected, the disjunction of the signals in each set is used as the disable of a clock gate to produce a single gated clock signal. This gated clock is then connected to the covered registers.

6.4 Circuit Minimization

The insertion of a clock gating condition creates a set of observability don't cares (ODCs) for the next state function F_R(x) at the input of register R. When the gated clock signal is inactive, the value of the next state function is irrelevant; the output of the register will remain constant. This fact can be used to minimize the logic implementation of the next state function. In general, the task of reducing a large logic network with ODCs is difficult, but in this specific case, a structural simplification can be immediately applied. Let h be an immediate fan-out of the node of literal g. If the combinational transitive fan-out of any such h does not include any (i) primary outputs, (ii) clock gate inputs, or (iii) register inputs gated by a signal G such that g ∉ G, this connection can be replaced with a constant. The inserted constants are then propagated forward in the network, and any dangling portions are dropped. In many instances, the function G_R is constructed of terms entirely from within R's fan-in cone. Multiple ODC-based simplifications cannot generally be applied simultaneously, but in this case, their mutual compatibility is guaranteed because the structure of all G_R signals is perfectly preserved. An example of this simplification is illustrated in Figure 6.9. The signal g has been selected to gate the two registers. Because the propagation of the clock is disabled when g is 1, this introduces don't cares. Using the structural simplification described above, we can replace the connection from g → h with a constant 0.
If this constant is propagated forward, we reduce the next state function of both registers to i. This also leaves a dangling input and allows the fan-out-free fan-in of node h (the OR-gate) to be dropped.

Figure 6.9. ODC-Based Circuit Simplification after Gating.

6.5 Experimental Results

6.5.1 Setup

We used OpenAccess Gear [68] as the platform upon which the algorithm described in this chapter was implemented. OpenAccess Gear is "an open source software initiative intended to provide pieces of the critical integration and analysis infrastructure that are taken for granted in proprietary tools." [69] It is built upon the OpenAccess database, an industry-standard database and API for manipulating and utilizing designs. The combination provides a powerful set of tools to explore new VLSI design techniques and software. In this work, the Func [70], Aig, and Dd components were utilized.

The logic representation in OpenAccess Gear is a sequential AND/INVERTER graph (AIG). For a modern treatment of the features and properties of AIGs, we refer the reader to [71] or [34]. The sequential extension is described in [72].

The experimental setup consisted of a 2.66 GHz AMD x64 machine running our tools under Linux. They were all written in C/C++ and compiled using GNU g++ version 4.1. All benchmarks were pre-optimized using ABC. Greedy AIG rewriting was applied until a fixed point was reached (i.e., no further reduction in the number of nodes was seen). For the purposes of power optimization, the relative capacitances of the clock gate and register clock inputs were assumed identical. An additional constraint was added to further constrain the insertion of clock gates: every gated clock signal was required to drive a minimum of three registers. This requirement was applied to both the structural and simulation-and-SAT-based algorithms.

6.5.2 Structural Analysis

Our algorithm is compared against a purely structural analysis.
The goal of this technique is to identify particular circuit structures (e.g., Figure 6.2) that can be used as clock gating conditions. Because there was not a readily-available algorithm or tool to perform this optimization, we implemented our own version. As our circuit representation is based entirely on AND/INVERTER graphs with simple sequential elements, it was not possible to explicitly identify either multiplexors or synchronous enable signals. However, we describe a technique that can detect many of their resulting representations in the corresponding AIG. As we have knowledge of the synthesis flow, it is known that the initial implementations (before the rewriting passes) of both of these situations conform to the structure detected by this method. In general, however, AIGs are non-canonical representations of functionality, and there are a multitude of possible structures to represent a global or local function. There is also the potential to catch additional circuit structures that are functionally identical to the two previously mentioned. For example, alternative mappings of the multiplexor in Figure 6.2(ii) will be detected. The generality is further increased when structural hashing is applied during the construction of the AIG.

Figure 6.10. Four-cut for structural check.

Our structural analysis proceeds as follows. For every register, we seek to identify two signals: one that implies that the state of the register does not change, and one that indicates the next state if it does. These are functionally equivalent to the synchronous enable and D input, respectively, of the enable-DFF depicted in Figure 6.2. The minimum AIG structure required to implement this synchronous-enable-like behavior contains three AND nodes: this is exactly the size of the simplest representation of a 2-input multiplexor. With fewer AND nodes, it is not possible to both (i) drive the next state to both 0 and 1 and (ii) accomplish it with the same input signal.
However, because of the input permutations and non-canonical edge complementation, there are still many graph structures that meet this description. To identify the three-AND structures that describe synchronous-enable-like behavior, we collect the fan-in cone of depth 2 at the input of each register. This cone is depicted in Figure 6.10; the large vertices are AND nodes, and the small hashed circles represent complementation that may or may not be present on the edge. If it does not contain 3 AND nodes, the register is skipped. Otherwise, the resulting cut at its base is tested to see if it contains any two signals that satisfy the above criteria. This is done by testing the properties described in Equation 6.14. Here, ≡ denotes structural equivalence and = functional equivalence. In our tool, functional equivalence was checked with BDDs.

∃ (s0, s1, loop) ∈ {0..3} such that    (6.14)
    s0 ≠ s1 ≠ loop
    i_s0 ≡ i_s1
    i_loop ≡ q
    ((i_s0 ⇒ d = q) = 1) ∨ ((¬i_s0 ⇒ d = q) = 1)

Table 6.1 describes the results of applying the structural gating to the OpenCores and QUIP benchmark suites. The column Tot is the total number of registers in the design. Match is the number of these for which a gating condition was detected structurally; because these did not necessarily result in a power savings, only the number in column Shared were sufficiently common to be implemented. % Shared is the percentage of the total registers with shared enable signals. This is the fraction that is gateable. The number of enables that were used to gate the registers is listed in column # En. On average, each one of these enables gated the number of registers in column Regs/En. As expected, the average ratio (as well as the individual ratio of each selected enable) satisfies the aforementioned constraint on the minimum number of gated registers per clock gate.
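The existential check of Equation 6.14 can be illustrated over a 4-cut, with functional equivalence decided by truth-table enumeration rather than BDDs. The mux-style example at the bottom is hypothetical.

```python
from itertools import permutations, product

def find_enable(cut_funcs, d_func, q_func, n_vars):
    """Search the cut for two positions s0, s1 carrying the same signal (the
    candidate enable) and a position whose signal equals the register output
    q, such that fixing the enable to one polarity provably forces the next
    state d to equal q (i.e., the register holds its value)."""
    def equal(f, g):
        return all(f(x) == g(x) for x in product((0, 1), repeat=n_vars))

    for s0, s1, loop in permutations(range(len(cut_funcs)), 3):
        en = cut_funcs[s0]
        if not equal(en, cut_funcs[s1]) or not equal(cut_funcs[loop], q_func):
            continue
        for polarity in (0, 1):
            if all(d_func(x) == q_func(x)
                   for x in product((0, 1), repeat=n_vars)
                   if en(x) == polarity):
                return s0, loop
    return None

# Hypothetical mux over variables (en, a, q): d = en & a | ~en & q, so the
# register holds its value whenever en = 0. The enable appears twice in the
# 4-cut of the depth-2 cone, and q feeds back.
cut = [lambda x: x[0], lambda x: x[1], lambda x: x[0], lambda x: x[2]]
d = lambda x: (x[0] & x[1]) | ((1 - x[0]) & x[2])
q = lambda x: x[2]
found = find_enable(cut, d, q, 3)
```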
The last two columns provide estimates of both the number of clock transitions avoided at the register inputs and the total estimated power savings.

6.5.3 Power Savings

The results of applying the simulation-and-SAT-based gating approach to the same set of benchmarks as Table 6.1 are presented in Table 6.2. The column Gated is the number of registers out of Total that were clock gated. Again, # En is the number of enable signals used. The percentages of the total clock transitions and power saved are given in columns Clks and Pow, respectively. The power savings is also compared against the purely structural approach in the next column. Here, the † symbol indicates the cases where gating was found with our method but none was found with the structural analysis. Although they represent a clear improvement, these instances are not reflected in the average additional power savings. The remaining average power reduction was 27% greater than that of the purely structural method, and as much as 2.2× greater.

Name  Tot  Match  Shared  %Shared  #En  Regs/En  Clks (% saved)  Pow (% saved)
fip cordic cla  55  0  0  0.0%  0  -  0.0%  0.0%
fip cordic rca  55  0  0  0.0%  0  -  0.0%  0.0%
oc vid com enc  59  28  24  40.7%  1  24.0  20.5%  18.8%
oc vid com dec  61  28  26  42.6%  2  13.0  11.3%  8.0%
oc miniuart  90  34  26  28.9%  3  8.7  11.1%  7.8%
oc ssram  95  0  0  0.0%  0  -  0.0%  0.0%
oc gpio  100  83  0  0.0%  0  -  0.0%  0.0%
oc sdram  112  74  54  48.2%  4  13.5  26.8%  23.3%
oc rtc  114  59  27  23.7%  1  27.0  11.8%  11.0%
oc i2c  129  89  86  66.7%  9  9.6  48.1%  41.1%
os sdram16  147  105  82  55.8%  4  20.5  13.2%  10.5%
oc ata v  157  79  79  50.3%  3  26.3  22.1%  20.2%
oc dct slow  178  38  38  21.3%  4  9.5  15.4%  13.1%
nut 004  185  15  0  0.0%  0  -  0.0%  0.0%
nut 002  212  21  0  0.0%  0  -  0.0%  0.0%
oc correlator  219  0  0  0.0%  0  -  0.0%  0.0%
oc simple fm rcvr  226  0  0  0.0%  0  -  0.0%  0.0%
nut 003  265  45  0  0.0%  0  -  0.0%  0.0%
oc ata ocidec1  269  191  191  71.0%  7  27.3  34.0%  31.4%
oc minirisc  289  106  74  25.6%  6  12.3  7.9%  5.8%
oc ata ocidec2  303  192  191  63.0%  7  27.3  30.1%  27.8%
nut 000  326  15  0  0.0%  0  -  0.0%  0.0%
oc aes core  402  128  128  31.8%  1  128.0  15.9%  15.7%
oc hdlc  426  138  62  14.6%  7  8.9  4.9%  3.2%
nut 001  484  50  0  0.0%  0  -  0.0%  0.0%
oc ata vhd 3  594  373  298  50.2%  13  22.9  22.4%  20.2%
oc ata ocidec3  594  365  290  48.8%  12  24.2  21.8%  19.7%
oc fpu  659  8  8  1.2%  1  8.0  0.9%  0.8%
oc aes core inv  669  132  132  19.7%  2  66.0  10.0%  9.7%
oc oc8051  754  481  236  31.3%  24  9.8  14.2%  11.0%
os blowfish  891  373  372  41.8%  10  37.2  11.5%  10.3%
oc cordic r2p  1015  0  0  0.0%  0  -  0.0%  0.0%
oc cfft 1024x12  1051  163  143  13.6%  7  20.4  4.4%  3.7%
oc vga lcd  1108  672  659  59.5%  30  22.0  29.2%  26.4%
fip risc8  1140  1097  62  5.4%  8  7.8  3.0%  2.3%
oc pavr  1231  1102  822  66.8%  40  20.6  49.7%  46.4%
oc mips  1256  1204  211  16.8%  9  23.4  7.6%  6.9%
oc ethernet  1272  653  254  20.0%  19  13.4  6.2%  4.7%
oc pci  1354  768  470  34.7%  21  22.4  9.1%  7.5%
oc aquarius  1477  1378  801  54.2%  27  29.7  20.9%  19.0%
oc wb dma  1775  1489  936  52.7%  42  22.3  12.4%  10.1%
oc mem ctrl  1825  1382  901  49.4%  38  23.7  25.3%  23.2%
oc vid com d  3549  12  12  0.3%  1  12.0  0.1%  0.1%
radar12  3875  2222  2019  52.1%  46  43.9  17.5%  16.4%
oc vid com j  3972  88  88  2.2%  5  17.6  0.8%  0.7%
radar20  6001  2922  2679  44.6%  59  45.4  16.0%  15.1%
uoft raytracer  13079  3926  2216  16.9%  70  31.7  5.8%  5.3%
AVERAGE  -  -  -  26.9%  -  25.1  9.1%  8.0%

Table 6.1. Structural clock gating results.

6.5.4 Circuit Minimization

The post-gating optimization described in Section 6.4 was then applied, and the resulting netlist was optimized with the ABC package using the same procedure of repeated rewriting. This was done primarily to perform the constant propagation and to leverage any resulting simplifications. The results for a selected subset of the benchmarks are in Table 6.3. The final number of AND nodes and the percentage improvement are reported in the last two columns. On average, the size of the combinational logic was reduced by 7.0%. In five of the benchmarks, the depth of the combinational logic was also reduced, resulting in a potential performance improvement.
The correctness of both the gating conditions (modeled as synchronous enables) and the logic optimization was successfully verified using combinational equivalence checking.

6.6 Summary

We have introduced a method for clock gate synthesis that constructs the gating condition out of the disjunction of functions that are already present in the existing logic and their complements. Applied to a set of industry-supplied benchmarks, the dynamic clock power consumption is reduced over synchronous methods alone. The gating condition can also be utilized in a straightforward structural optimization to reduce the area of the circuit.

Name  Total  Gated  #En  Clks (% saved)  Pow (% saved)  vs Struct  Runtime(s)
fip cordic cla  55  3  1  2.3%  0.5%  †  2.50
fip cordic rca  55  3  1  2.3%  0.4%  †  2.48
oc vid com enc  59  28  2  23.9%  20.5%  9.0%  2.84
oc vid com dec  61  26  2  11.3%  8.0%  0.0%  2.28
oc miniuart  90  44  6  21.2%  14.5%  85.8%  1.56
oc ssram  95  32  2  15.8%  13.7%  †  0.82
oc gpio  100  0  0  0.0%  0.0%  0.0%  1.76
oc sdram  112  91  10  43.3%  34.4%  47.8%  2.56
oc rtc  114  27  1  11.8%  11.0%  0.0%  4.47
oc i2c  129  86  9  48.1%  41.1%  0.0%  3.13
os sdram16  147  122  11  26.0%  18.5%  76.5%  5.92
oc ata v  157  79  3  22.1%  20.2%  0.0%  3.12
oc dct slow  178  47  5  17.9%  15.1%  14.9%  2.65
nut 004  185  8  2  3.0%  1.9%  †  1.05
nut 002  212  19  5  4.0%  1.7%  †  1.31
oc correlator  219  0  0  0.0%  0.0%  0.0%  4.38
oc simple fm rcvr  226  3  1  0.7%  0.2%  †  4.19
nut 003  265  17  5  3.6%  1.7%  †  4.23
oc ata ocidec1  269  191  7  34.0%  31.4%  0.0%  8.23
oc minirisc  289  86  8  9.1%  6.4%  8.7%  5.72
oc ata ocidec2  303  191  7  30.1%  27.8%  0.0%  9.22
nut 000  326  13  4  1.6%  0.4%  †  3.05
oc aes core  402  128  1  15.9%  15.7%  0.0%  15.90
oc hdlc  426  133  16  12.9%  9.2%  183.6%  5.08
nut 001  484  23  6  3.0%  1.8%  †  9.39
oc ata vhd 3  594  298  13  22.4%  20.2%  0.0%  19.08
oc ata ocidec3  594  293  13  22.0%  19.8%  0.5%  19.07
oc fpu  659  56  2  2.8%  2.5%  223.7%  40.79
oc aes core inv  669  132  2  10.0%  9.7%  0.0%  17.52
oc oc8051  754  339  34  22.4%  17.9%  62.5%  58.91
os blowfish  891  378  11  11.7%  10.5%  1.4%  30.34
oc cordic r2p  1015  0  0  0.0%  0.0%  0.0%  40.94
oc cfft 1024x12  1051  279  16  8.5%  7.0%  89.2%  26.33
oc vga lcd  1108  707  35  31.6%  28.4%  7.6%  44.09
fip risc8  1140  68  10  3.3%  2.4%  4.9%  44.94
oc pavr  1231  841  43  50.2%  46.7%  0.6%  97.34
oc mips  1256  243  10  8.9%  8.1%  16.7%  78.21
oc ethernet  1272  367  35  10.8%  8.0%  72.5%  40.59
oc pci  1354  615  33  12.4%  10.0%  33.1%  48.33
oc aquarius  1477  816  29  21.2%  19.3%  1.2%  72.36
oc wb dma  1775  971  44  13.3%  10.8%  7.4%  82.16
oc mem ctrl  1825  1064  45  27.8%  25.4%  9.3%  93.92
oc vid com d  3549  16  2  0.1%  0.1%  33.3%  141.35
radar12  3875  2209  64  20.4%  18.7%  14.6%  188.64
oc vid com j  3972  93  6  0.8%  0.7%  3.0%  227.75
radar20  6001  2968  81  18.8%  17.4%  15.8%  476.36
uoft raytracer  13079  2355  102  6.5%  5.7%  8.3%  1695.09
AVERAGE  -  -  -  14.7%  12.4%  27.2%  -

Table 6.2. New clock gating results.

Name  Init ANDs  % Gated Regs  Final ANDs  % Change
oc ssram  274  33.68%  179  34.70%
oc sdram  894  81.25%  720  19.50%
oc hdlc  1873  31.22%  1734  7.40%
oc vga lcd  6923  63.81%  6555  5.30%
oc ethernet  8926  28.85%  8890  0.40%
oc cfft  9177  26.55%  9124  0.60%
oc 8051  9746  44.96%  9622  1.30%
oc fpu  16260  8.50%  16179  0.50%
radar20  60835  49.46%  60576  0.40%
uoft raytracer  138895  18.01%  138542  0.30%
AVERAGE  -  38.63%  -  7.04%

Table 6.3. ODC-based simplification results.

Chapter 7

Conclusion

We have described a set of sequential optimization techniques that are novel and useful when applied to low power digital design. In this chapter, we review and highlight the main contributions of this work. We have focused on reducing the dynamic power that is dissipated due to capacitive switching. This quantity is proportional to the total capacitance switched, the square of the voltage swing, and the switching frequency. Because one particular class of signal, the clock, both drives more capacitance and switches more frequently (by at least a factor of two) than any other synchronous signal in the design, it accounts for approximately 30%-50% of the total power consumed in a modern integrated circuit.
The clock and the synchronizing circuit elements that it drives are exactly the targets of sequential optimization. We believe this strongly motivates the application of this class of synthesis transformations for reducing clock dynamic power consumption. Experimental results have been presented throughout this work to quantify the successful minimization of the clock's fraction of the total power. This is accomplished through two different avenues: reducing the total clock capacitance and reducing the effective switching frequency. We now summarize the algorithms described to improve each of these objectives and review their main features.

7.1 Minimizing Total Clock Capacitance

The sinks of the clock distribution network are the sequential components of the design. For the purposes of this work, this is the set of registers in the design. Chapter 2 introduces a new formulation of the unconstrained minimum-register retiming problem. The number of registers in the synthesis examples in our benchmark suite can be reduced on average by 11% using retiming. The contribution of our algorithm is a speedup in runtime of approximately 5 times over the fastest available method, which formulates the problem as an instance of minimum-cost network circulation (MCC). The algorithm has the desirable property of minimizing the relocation of registers within the circuit. The solution returned is the optimal one that moves the registers over the fewest number of gates, or equivalently, minimizes the sum of the absolute values of the retiming lag function. Another important feature, which is not available when using MCC-based solution techniques, is that the result after each iteration is both legal and monotonically decreasing in the number of registers. It is possible to terminate early with a result that is still better than the original. This is important for industrial scalability and runtime-limited applications.
We have shown in Chapter 3 how to incorporate constraints on the maximum and minimum combinational path delays into the min-register retiming problem. Timing constraints are critical for performance-constrained synthesis applications. The resulting algorithm runs an average of 102 times faster than the best available academic tool. Our timing-constrained min-register retiming algorithm also has the property that termination in either its inner or outer loop results in an improved and timing-feasible result. This is again a feature not present in competing methodologies. Chapter 4 extends the constraint set to include the requirement that the resulting retiming has an equivalent initial state. While this is an NP-hard problem in the worst case, we show that the runtime of our method is quite tractable for all of our example circuits. The median benchmark required the addition of only 2 extra registers to restore initializability. Finally, Chapter 5 explores the simultaneous combination of retiming and clock skew scheduling to minimize the overall dynamic power of the clock network under a delay constraint. The topology of the circuit makes skewing a more power-effective means of improving worst-case delay in some cases and retiming a more power-effective means in others. Our experiments show that a heuristically determined combination of both can improve the total dynamic power consumption of the clock endpoints by 11%.

7.2 Minimizing Effective Clock Switching Frequency

Clock gating allows the effective frequency at which some register clock inputs must be switched to be reduced. The clock is selectively disabled in cases where it is not necessary for a register to latch a new value. Chapter 6 describes a new method for synthesizing clock gating logic. We employ a combination of simulation and SAT to identify existing signals within the logic network that are functionally correct gating conditions.
This technique is more general and powerful than strictly structural methods but does not have the scalability problems associated with symbolic methods. The result is an average reduction in the dynamic power of the clock of 12% for the sets of synthesis benchmarks that were examined. The synthesis of the gating logic is done so as to control the perturbation to the netlist. Unlike the RTL or symbolic methods, the implementation of a gating condition does not require the general synthesis of a function, which can have unpredictable effects and result in an unknown amount of additional logic. Our method requires only the OR and possible complementation of a set of signals. Furthermore, it is quite easy to incorporate important constraints on the resulting clock gates. Physical constraints can be applied to limit the maximum distance between the driver of a gating signal and the register which it gates. Timing constraints can be used to ensure that there is no violation in the resulting netlist. Finally, the use of disjunctions of circuit signals allows a straightforward logic simplification to be applied. In certain cases, the signal used in the gating condition can be replaced with a constant, and the surrounding logic simplified. We have shown an average 7% decrease in the number of AIG AND nodes for a subset of the benchmarks. This is likely to result in additional power savings.

Bibliography

[1] Energy Information Administration, U.S. Department of Energy, "Annual energy review," tech. rep., Washington, DC, United States, 2006.
[2] A. Stepin et al., "Various benchmarking reports," 2004-2008, http://www.xbitlabs.com.
[3] A. V. Goldberg, "Recent developments in maximum flow algorithms," Tech. Rep. 98045, NEC Research Institute, Inc., 1998.
[4] K. S. Chatha, "Talk on thermal system design," Berkeley, CA, USA, 2008.
[5] I. Buchmann, "The future battery," tech. rep., Cadex Electronics, Inc, 2005.
[6] J. P. Research, "Jon Peddie's Market Watch," tech.
rep., 2007-2008.
[7] U.S. Department of Energy and U.S. Environmental Protection Agency, "Carbon dioxide emissions from the generation of electric power in the United States," tech. rep., Washington, DC, United States, 2000.
[8] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, pp. 5-35, 1991.
[9] J. P. Fishburn, "Clock skew optimization," IEEE Trans. on Computers, vol. 39, pp. 945-951, July 1990.
[10] A. P. Hurst, P. Chong, and A. Kuehlmann, "Physical placement driven by sequential timing analysis," in Proceedings of the International Conference on Computer Aided Design, 2004.
[11] L. G. Khachiyan, "A polynomial algorithm in linear programming," Soviet Mathematics Doklady, 1979.
[12] N. Karmarkar, "A new polynomial time algorithm for linear programming," Combinatorica, vol. 4, pp. 373-395, 1984.
[13] A. Dasdan, S. Irani, and R. K. Gupta, "Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems," in Proceedings of the Design Automation Conference, pp. 37-42, June 1999.
[14] I. S. Kourtev and E. G. Friedman, "Synthesis of clock tree topologies to implement nonzero clock skew schedule," in IEEE Proceedings on Circuits, Devices, Systems, vol. 146, pp. 321-326, December 1999.
[15] K. Ravindran, A. Kuehlmann, and E. Sentovich, "Multi-domain clock skew scheduling," in Proceedings of the International Conference on Computer Aided Design, 2003.
[16] Q. K. Zhu, High-Speed Clock Network Design. Springer, 2002.
[17] E. G. Friedman, "Clock distribution networks in synchronous digital integrated circuits," in Proceedings of the IEEE, vol. 89, pp. 665-692, May 2001.
[18] I. E. Sutherland, "Micropipelines," Commun. ACM, vol. 32, no. 6, pp. 720-738, 1989.
[19] S. Hauck, "Asynchronous design methodologies: an overview," in Proceedings of the IEEE, vol. 83, (Seattle, WA), pp. 69-83, 1995.
[20] C. J. Myers, Asynchronous Circuit Design. Wiley-Interscience, 1 ed., 2001.
[21] J. Cortadella, M. Kishinevsky, A.
Kondratyev, L. Lavagno, and A. Yakovlev, Logic Synthesis of Asynchronous Controllers and Interfaces. Springer, 2002.
[22] J. Woods, P. Day, S. Furber, J. Garside, N. Paver, and S. Temple, "AMULET1: An asynchronous ARM microprocessor," IEEE Transactions on Computers, vol. 46, pp. 385-398, Apr. 1997.
[23] U. Cummings, "FocalPoint: A low-latency, high-bandwidth Ethernet switch chip," in Proceedings of Hot Chips 18 Symposium, (Palo Alto, CA, USA), 2006.
[24] M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing and Testable Design. Wiley-IEEE Press, 1994.
[25] C. Kern and M. R. Greenstreet, "Formal verification in hardware design: a survey," Transactions on Design Automation of Electronic Systems, vol. 4, no. 2, pp. 123-193, 1999.
[26] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and L. J. Hwang, "Symbolic Model Checking: 10^20 States and Beyond," in Proceedings of the Fifth Annual IEEE Symposium on Logic in Computer Science, (Washington, D.C.), pp. 1-33, IEEE Computer Society Press, 1990.
[27] E. Clarke, E. Emerson, and A. Sistla, "Automatic verification of finite-state concurrent systems using temporal logic specifications," Transactions on Programming Languages and Systems, vol. 8, pp. 244-263, 1986.
[28] O. Coudert, C. Berthet, and J. C. Madre, "Verification of synchronous sequential machines based on symbolic execution," in Proceedings of the International Workshop on Automatic Verification Methods for Finite State Systems, June 1989.
[29] G. Cabodi, S. Quer, and F. Somenzi, "Optimizing sequential verification by retiming transformations," in Proceedings of the Design Automation Conference, 2000.
[30] J. Baumgartner, H. Mony, V. Paruthi, R. Kanzelman, and G. Janssen, "Scalable sequential equivalence checking across arbitrary design transformations," in Proceedings of the International Conference on Computer Design, 2006.
[31] A. Goldberg, "An efficient implementation of a scaling minimum-cost flow algorithm," Journal of Algorithms, vol. 22, no. 1, pp.
1-29, 1992.
[32] É. Tardos, "A strongly polynomial minimum cost circulation algorithm," Combinatorica, vol. 5, no. 3, pp. 247-255, 1985.
[33] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications. Englewood Cliffs, NJ, USA: Prentice-Hall, Inc., 1993.
[34] J. Baumgartner and A. Kuehlmann, "Min-area retiming on flexible circuit structures," in Proceedings of the ICCAD, 2001.
[35] B. V. Cherkassy and A. Goldberg, "On implementing push-relabel method for the maximum flow problem," Algorithmica, vol. 19, pp. 390-410, 1997.
[36] A. V. Goldberg, "HIPR," version 3.5, http://www.avglab.com/andrew/soft.html.
[37] "UC Berkeley clustered computing," Berkeley, CA, http://www.millennium.berkeley.edu/.
[38] "OpenCores project," http://www.opencores.org.
[39] C. Albrecht, "IWLS benchmarks," 2005.
[40] Altera Corporation, "Quartus II," 101 Innovation Drive, San Jose, CA.
[41] M. Hutton and J. Pistorius, "Altera QUIP benchmarks," http://www.altera.com/education/univ/research/unv-quip.html.
[42] IG Systems, "CS2," version 3.9, http://www.igsystems.com/cs2/index.html.
[43] B. V. Cherkassky and A. Goldberg, "On implementing push-relabel method for the maximum flow problem," Algorithmica, vol. 19, pp. 390-410, 1997.
[44] A. Löbel, "MCF - a network simplex implementation," version 1.3, http://www.zib.de/Optimization/Software/Mcf/.
[45] R. V. Helgason and J. L. Kennington, Primal Simplex Algorithms for Minimum Cost Network Flows, ch. 2, pp. 85-133. 1993.
[46] S. Sapatnekar, "Minaret source code," 2007. Personal Communication.
[47] D. Singh, V. Manohararajah, and S. D. Brown, "Incremental retiming for FPGA physical synthesis," in Proceedings of the Design Automation Conference, pp. 433-438, 2005.
[48] H. Zhou, "Deriving a new efficient algorithm for min-period retiming," in Proceedings of ASPDAC, pp. 990-993, 2005.
[49] S. S. Sapatnekar and R. B.
Deokar, “Utilizing the retiming skew equivalence in a practical algorithm for retiming large circuits,” IEEE Transactions on Computer-Aided Design, vol. 15, pp. 1237–1248, 10 1996. [50] J. Cong and C. Wu, “Optimal fpga mapping and retiming with efficient initial state computation,” in IEEE Transactions on the Computer-Aided Design of Integrated Circuits and Systems, vol. 18, pp. 330–335, 11 1999. 171 [51] J. Jiang and R. Brayton, “Retiming and resynthesis: A complexity perspective,” IEEE Transactions on CAD, 12 2006. [52] H. J. Touati and R. K. Brayton, “Computing the initial states of retimed circuits,” IEEE Transactions on the Computer-Aided Design of Integrated Circuits and Systems, vol. 12, pp. 157–162, 1 1993. [53] L. Stok, I. Spillinger, and G. Even, “Improving initialization through reversed retiming,” in Proceedings of the European conference on Design and Test, (Washington, DC, USA), p. 150, IEEE Computer Society, 1995. [54] P. Pan and G. Chen, “Optimal retiming for initial state computation,” in Proc. Conf. on VLSI Design, 1999. [55] N. Maheshwari and S. S. Sapatnekar, “Minimum area retiming with equivalent initial states,” in Proceedings of the International conference on Computer-aided design, (San Jose, CA, USA), pp. 216–219, IEEE Computer Society, 1997. [56] P. Pan, “Continuous retiming: Algorithms and applications,” in Proceedings of the ICCD, pp. 116–121, 1997. [57] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms. The MIT Press, 2 ed., 2001. [58] N. Maheshwari and S. Sapatnekar, “Efficient minarea retiming of large level-clocked circuits,” in Proceedings of DATE, pp. 840–845, 1998. [59] A. P. Hurst, A. Mishchenko, and R. K. Brayton, “Minimizing implementation costs with end-to-end retiming,” in Proceedings of the International Workshop on Logic Synthesis, 2007. [60] S. Burns, Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, California Institute of Technology, Pasadena, CA, USA, December 1991. [61] W. 
Qing, M. Pedram, and W. Xunwei, “Clock-gating and its application to low power design of sequential circuits,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 47, Mar 2000. [62] L. Benini, G. D. Micheli, E. Macii, M. Poncino, and R. Scarsi, “Symbolic synthesis of clock-gating logic for power optimization of synchronous controllers,” ACM Transactions on Design Automation of Electronic Systems, vol. 4, no. 4, pp. 351–375, 1999. [63] R. E. Bryant, “Graph-based algorithms for boolean function manipulation,” IEEE Trans. Comput., vol. 35, no. 8, pp. 677–691, 1986. [64] M. Donno, A. Ivaldi, L. Benini, and E. Macii, “Clock-tree power optimization based on rtl clock-gating,” in DAC ’03: Proceedings of the 40th conference on Design automation, (Anaheim, CA, USA), pp. 622–627, ACM, 2003. [65] Calypto Design Systems, 2933 Bunker Hill Lane, Suite 202, Santa Clara, CA, PowerPro CG Datasheet. [66] N. Een and N. S¨orensson, “An extensible SAT-solver,” in Proceedings of International Conference on Theory and Applications of Satisfiability Testing, 2003. 172 [67] S. Khuller, A. Moss, , and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, 1999. [68] C. D. S. Inc., “OpenAccess Gear,” http://openedatools.si2.org/oagear/. [69] Z. Xiu, D. A. Papa, P. Chong, C. Albrecht, A. Kuehlmann, R. A. Rutenbar, and I. L. Markov, “Early research experience with openaccess gear: an open source development environment for physical design,” in ISPD ’05: Proceedings of the 2005 international symposium on Physical design, (San Francisco, CA, USA), pp. 94–100, ACM, 2005. [70] A. P. Hurst, “Openaccess gear functionality - a platform for functional representation, synthesis, and verification,” tech. rep., Cadence Design Systems, 2006. Invited Presentation. [71] A. Kuehlmann and F. Krohm, “Equivalence checking using cuts and heaps,” in Proc. of the 34th Design Automation Conference, pp. 263–268, June 1997. [72] A. P. 
Hurst, “Representing sequential functionality,” tech. rep., Cadence Design Systems, Berkeley, CA, Aug. 2005.

Appendix A

Benchmark Characteristics

Name              # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
daio                   23            4          6         2          2
s208.1                140            8         16        11          1
mm4a                  146           12         19         8          4
traffic               176           13         11         6          8
s344                  238           15         28        10         11
s349                  242           15         28         4          6
s382                  242           21         17        10         11
s400                  256           21         17         4          6
mult16a               258           16         43        18          1
s526n                 261           21         14         4          6
s420.1                282           16         18        19          1
s444                  284           21         20        18          1
mult16b               284           30         11         4          6
s526                  287           21         14         4          6
s641                  399           19         78        36         23
s713                  437           19         86        36         23
mult32a               498           32         75        34          1
s838.1                566           32         22        35          1
mm9a                  582           27         55        13          9
s838                  595           32         94        36          2
s953                  661           29         27        17         23
s1196                 716           18         34        15         14
s1238                 719           18         32        15         14
mm9b                  729           26         80        13          9
s1423                 819           74         67        18          5
gcd                  1011           59         33        19         25
sbc                  1024           27         22        41         56
ecc                  1477          115         19        12         14
phase decoder        1602           55         33         4         10
daio receiver        1796           83         45        17         46
mm30a                1905           90        139        34         30
parker1986           2558          178         61        50          9
s5378                2828          163         33        36         49
s9234                2872          135         55        37         39
bigkey               2977          224          9       263        197
dsip                 3429          224         21       229        197
s13207               8027          669         59        31        121
s38584.1            18734         1426         70        39        304
s38417              22821         1465         65        29        106
clma                25124           33         78       383         82

Table A.1. Benchmark Characteristics: LGsynth

Name              # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
ts mike fsm            48            3          6         5         10
xbar 16x16            178           32          4        81         16
barrel16              263           37          8        22         16
barrel16a             293           37         10        22         16
barrel32              712           70         10        39         32
nut 004               714          185         12        31        202
nut 002               760          212         19        34         78
mux32 16bit           887          533          6        38         16
mux8 64bit            965          579          4        12         64
fip cordic rca        983           55         45        19         34
fip cordic cla       1044           55         49        19         34
nut 000              1160          326         55        27        237
nut 003              1507          265         36       106        102
mux64 16bit          1704         1046          6        71         16
mux8 128bit          1925         1155          4        12        128
barrel64             1933          135         11        72         64
nut 001              2620          484         55        76         59
fip risc8            2951         1140         39        30         83
radar12             11005         3875         44      2769       1870
radar20             29552         6001         44      3292       1988
uoft raytracer      71598        13079         93      4364       4033

Table A.2.
Benchmark Characteristics: QUIP

Name                # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
oc ssram                238           95          2       110         88
oc miniuart             381           90         10        16         11
oc gpio                 385          100         10        74         67
oc dct slow             513          178         17         6         18
oc ata v                516          157         13        65         60
oc i2c                  598          129         19        19         14
oc correlator           613          219         21        83          2
oc ata ocidec1          697          269         13        65         60
oc sdram                731          112         13        95         90
oc ata ocidec2          842          303         13        65         60
oc rtc                  887          114         37        58         35
os sdram16              947          147         22        19         80
oc minirisc            1062          289         23        48         83
oc vid comp sys h      1162           59         13        13         10
oc des area            1192           64         14       125         64
oc vid comp sys h      1328           61         20        19         19
oc hdlc                1341          426         12        61         84
oc smpl fm rcvr        1377          226         34        18         23
oc des des3area        1784           64         22       240         64
oc ata ocidec3         1858          594         15        99        103
oc ata vhd 3           1919          594         15        99        103
oc aes core            2806          402         13       387        258
oc cordic p2r          3175          719         21        50         32
oc aes core inv        3523          669         13       516        395
oc cfft 1024x12        3977         1051         21        52         86
oc vga lcd             4039         1108         34       223        284
os blowfish            4118          891         37       840        282
oc cordic r2p          4495         1015         22        34         40
oc pci                 5061         1354         46       304        385
oc ethernet            5633         1272         33       192        239
oc des perf            5818         1976          5       121         64
oc oc8051              5945          754         52       166        176
oc mem ctrl            6792         1825         32       115        152
oc mips                7036         1256         72        54        178
oc wb dma              7680         1775         18       226        218
oc pavr                8051         1231         58        35         55
oc aquarius           11825         1477         99       464        283
oc fpu                15431          659       1030       262        232
oc vid comp sys d     20480         3549         31      1903       1077
oc vid comp sys j     22798         3972         30      1720        985
oc des des3perf       24582         5850          7       346        187

Table A.3.
Benchmark Characteristics: OpenCores

Name         # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
intel 001        240           36         35        31          1
intel 004        618           87         86        82          1
intel 002        720           75         72        72          1
intel 003        848           87        105        82          1
intel 005       1538          170        170       165          1
intel 006       2738          350        350       345          1
intel 024       5212          357        614       352          1
intel 023       5226          358        613       353          1
intel 020       5248          354        624       349          1
intel 017       5337          618        440       613          1
intel 021       5373          365        636       360          1
intel 026       5462          492        662       486          1
intel 018       5871          491        740       486          1
intel 019       6100          510        765       505          1
intel 015       7739          553        935       548          1
intel 022       8042          530        954       525          1
intel 029       8045          564       1009       559          1
intel 031       8095          531        956       523          1
intel 011       8233          533       1003       528          1
intel 010       8267          539        994       534          1
intel 007      11387         1307       1329      1302          1
intel 025      11550         1120       1100      1112          1
intel 032      14268          961       1787       890          1
intel 016      24869         2297       2471      2232          1
intel 034      25637         3297       1310      3292          1
intel 014      51042         4309       3197      4293          1
intel 035      64406         4404       5948      4407          1
intel 033      65090         4416       6107      4419          1
intel 027      65271         5143       4337      5127          1
intel 012      71108         5884       4776      5874          1
intel 037      72345         5927       4806      5911          1
intel 030      77923         5397       7138      5400          1
intel 009      77932         5399       7136      5400          1
intel 036      83547         5805       7281      5807          1
intel 043      86086         7223       6006      7213          1
intel 028      88692         7436       6186      7426          1
intel 042      99495         9005       6876      8994          1
intel 038      99712         9010       6876      8992          1
intel 040     101201         9510       7084      9499          1
intel 041     101826         9271       7009      9261          1
intel 039     103147         9501       7086      9493          1
intel 013     159957        13354      10725     13284          1

Table A.4. Benchmark Characteristics: Intel
