Overview
Application-Specific Superscalar Processors
An aggressive, state-of-the-art superscalar processor achieves good performance across a wide range of applications.
For a given application, however, its generic design is not the highest-performing one and consumes excessive power.
In contrast, a superscalar processor that is tailored to an application, a class of applications, or a class of application behavior
will achieve the highest performance possible and the least power consumption for this level of performance.
Customizing cores to workloads is an exciting new direction for continuing to scale performance,
as conventional technology and microarchitecture scaling slows.
Why Different Workloads Favor Different Cores: The Issue of Propagation Delay
Some interesting questions are why different applications achieve their highest performance with different superscalar processors,
why there is no obvious rank-ordering among these superscalar processors,
and why even the most aggressive state-of-the-art superscalar processor is not universally the highest-performing superscalar processor.
The answer to these questions is that there is a tradeoff between accelerating the execution of independent instructions and reducing the latency of dependent instructions.
This tradeoff is illustrated in the figure below.
Extracting more instruction-level parallelism (ILP) requires increasing the complexity of ILP-extracting units (instruction fetch unit, instruction scheduling unit, etc.),
increasing their propagation delays.
These longer delays can be hidden to a large extent by deeper pipelining but might be exposed by inter-instruction dependencies (which exercise pipeline loops).
Different applications have different arrangements of independent and dependent instructions,
and interact differently with various pipeline configurations.
This causes different applications, classes of applications, and classes of application behavior to be characterized by different optimal pipeline designs.
For example, for App. #1, the speedup of independent instructions outweighs the slowdown of dependent instructions,
leading to higher overall performance with the 4-way superscalar processor. (For simplicity, the two are shown as separable components; in reality they are commingled in complex ways.)
The opposite is true for App. #2 which favors the 2-way superscalar processor.
App. #2 is very sensitive to longer propagation delays.
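The tradeoff can be made concrete with a toy model. The sketch below is illustrative only: it assumes execution time splits cleanly into an independent part that shrinks with issue width and a dependent part that grows with pipeline-loop delay. The loop delays and the two application profiles are invented numbers, not measurements, and the function names are hypothetical.

```python
# Toy model of the ILP-versus-latency tradeoff. All numbers are hypothetical.

# Assumed relative delay of a dependent-instruction loop for each issue width:
# the wider machine has more complex units and therefore slower loops.
LOOP_DELAY = {2: 1.0, 4: 1.4}


def exec_time(independent_work, dependent_work, width):
    """Estimate execution time on a core of the given issue width.

    Independent work is spread across the issue width; dependent work is
    serialized and scaled by the loop delay of that width.
    """
    return independent_work / width + dependent_work * LOOP_DELAY[width]


def best_width(independent_work, dependent_work, widths=(2, 4)):
    """Return the issue width with the lowest estimated execution time."""
    return min(widths, key=lambda w: exec_time(independent_work, dependent_work, w))


# Hypothetical profiles: App. #1 is dominated by independent instructions,
# App. #2 by chains of dependent instructions.
app1 = (80.0, 20.0)
app2 = (20.0, 80.0)

print(best_width(*app1))  # 4
print(best_width(*app2))  # 2
```

Under these assumed numbers, App. #1 takes 48 units on the 4-way core versus 60 on the 2-way core, while App. #2 takes 117 versus 90, so neither core is best for both.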
The Importance of Whole-Pipeline Customization
The optimal configuration of a pipeline unit depends strongly on the subset of workload characteristics directly relevant to that unit,
e.g., the instruction scheduling unit depends on register data-flow characteristics.
Local workload characteristics are not the only factors affecting a unit's optimal configuration, however.
Different pipeline units are not entirely independent of each other.
First, they are bound by the same clock period.
This raises the issue of imbalances among pipeline stages.
Imbalances inflate the propagation delays of pipeline loops.
Imbalances can be addressed to some extent by pipelining longer stages,
but this is not a perfect solution because of residual imbalances and latch overheads.
Second, even ignoring the issue of imbalances, multiple pipeline units may contribute to the total propagation delay of a pipeline loop.
Since the total propagation delay affects performance, the optimal configurations of different pipeline units that contribute to this
delay cannot be determined independently.
We conclude that the optimal configurations of different pipeline units depend on one another and, by implication, on each other's local workload characteristics.
In other words, the design of different pipeline units should not be based on local decisions;
rather, whole-pipeline customization is required in order to capture global interactions.
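The two coupling effects described in this section, a shared clock period and pipeline loops that span several units, can be illustrated with a small calculation. This is a sketch under stated assumptions: the unit names, their logic delays, the latch overhead, and the choice of which units lie on the loop are all hypothetical.

```python
LATCH_OVERHEAD_NS = 0.05  # assumed latch overhead added to every stage

# Hypothetical logic delay (ns) of each pipeline unit before sub-pipelining.
UNIT_DELAY_NS = {"fetch": 0.40, "rename": 0.30, "issue": 0.90, "execute": 0.50}


def clock_period(depths):
    """The shared clock period is set by the slowest sub-stage of any unit."""
    return max(
        UNIT_DELAY_NS[unit] / depths[unit] + LATCH_OVERHEAD_NS
        for unit in UNIT_DELAY_NS
    )


def loop_delay(loop_units, depths):
    """Delay of a pipeline loop that passes through `loop_units`.

    Each sub-stage on the loop costs one full clock period, even if its own
    logic is faster than the slowest sub-stage elsewhere in the pipeline.
    """
    cycles = sum(depths[unit] for unit in loop_units)
    return cycles * clock_period(depths)


loop = ("issue", "execute")  # hypothetical loop spanning two units
raw_logic = sum(UNIT_DELAY_NS[unit] for unit in loop)

unpipelined = {"fetch": 1, "rename": 1, "issue": 1, "execute": 1}
deeper_issue = dict(unpipelined, issue=2)

print(f"raw logic on the loop: {raw_logic:.2f} ns")
print(f"one stage per unit:    {loop_delay(loop, unpipelined):.2f} ns")
print(f"issue split in two:    {loop_delay(loop, deeper_issue):.2f} ns")
```

With these assumed numbers, the loop contains 1.40 ns of logic but costs 1.90 ns with one stage per unit and 1.65 ns after the issue unit is split in two. The remaining gap comes from residual imbalance and latch overhead. After the split, the execute unit sets the clock period, so the best depth for the issue unit depends on the execute unit.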
Architecture Research Topics
- Providing Real Designs of Arbitrary Superscalar Processors.
Fundamentally, core customization captures the interplay among workload characteristics, the microarchitecture, and the physical implementation.
With regard to the latter, the above discussion makes it clear that accounting for propagation delay (and power and area) and performing whole-pipeline customization
are central to any research agenda based on customizing cores.
An important question is how to provide real designs of arbitrary superscalar processors.
This goal is challenging in two ways.
First, there is no openly available fully-synthesizable HDL model of an aggressive superscalar processor.
Second, even more daunting is the fact that customizing cores to workloads requires the availability of real designs for arbitrarily many and arbitrarily different superscalar processors
in a huge design space.
We are currently developing FabScalar, a tool-set that will enable researchers and designers to automatically compose the physical designs of
arbitrary superscalar processors.
Within our own group, FabScalar provides an infrastructure for exploring a rich research agenda,
which is partially enumerated below.
We plan to publicly release FabScalar tools when they have sufficient capabilities.
The current status of the tools is described in the following workshop paper and presentation:
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, S. S. Navada, H. Hashemi Najaf-abadi, and E. Rotenberg.
FabScalar.
4th Workshop on Architectural Research Prototyping (WARP'09), in conjunction with ISCA-36,
June 2009.
[paper: pdf]
[presentation: pps]
- Fast Exploration of a Huge Design Space.
The design space for a superscalar processor is very large.
Searching this space needs to be accelerated while ensuring that the final design is truly optimal.
We are pursuing two research topics in this area.
First, we are exploring ways to provide fast and cycle-accurate simulation (cycle-accurate with respect to the RTL generated by FabScalar) of a single point in the space.
Our first approach is to co-design C++ and RTL representations that are cycle-accurate at the granularity of composable units.
This approach is described in the FabScalar workshop paper.
Eventually, we would like to synthesize whole pipelines to sophisticated FPGA platforms such as BEE3 modules.
Second, we are developing intelligent search techniques that very quickly converge on the best core design.
Longer term, we are interested in developing a closed-form analytical model that outputs the best core design.
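To make the search problem concrete, the sketch below contrasts a generic hill-climbing search with exhaustive enumeration over a tiny design space. It is not the search technique under development in this project. The parameter names, their legal values, and the `score` function (a stand-in for a simulator) are all assumptions made for illustration.

```python
import itertools

# Hypothetical design space: each parameter has a short list of legal values.
SPACE = {
    "width": [1, 2, 4, 8],
    "rob": [32, 64, 128, 256],
    "depth": [8, 12, 16, 20],
}
NAMES = list(SPACE)


def config(idx):
    """Map a tuple of value indices to a concrete configuration."""
    return {name: SPACE[name][i] for name, i in zip(NAMES, idx)}


def score(cfg):
    """Stand-in for a simulator; higher is better. Purely illustrative."""
    ipc = min(cfg["width"], cfg["rob"] / 32)
    period = 1.0 + 0.1 * cfg["width"] + cfg["rob"] / 512 + 4.0 / cfg["depth"]
    return ipc / (period * (1.0 + 0.01 * cfg["depth"]))


def neighbors(idx):
    """Yield points that differ by one step in exactly one parameter."""
    for pos, name in enumerate(NAMES):
        for step in (-1, 1):
            i = idx[pos] + step
            if 0 <= i < len(SPACE[name]):
                yield idx[:pos] + (i,) + idx[pos + 1:]


def hill_climb(start):
    """Move to the best neighbor until no neighbor improves the score."""
    evaluated = {}

    def cached(idx):
        if idx not in evaluated:
            evaluated[idx] = score(config(idx))
        return evaluated[idx]

    current = start
    while True:
        best = max(neighbors(current), key=cached)
        if cached(best) <= cached(current):
            return current, cached(current), len(evaluated)
        current = best


def exhaustive():
    """Evaluate every point; feasible only because this space is tiny."""
    all_idx = itertools.product(*(range(len(SPACE[n])) for n in NAMES))
    best = max(all_idx, key=lambda idx: score(config(idx)))
    return best, score(config(best))


found, found_score, evaluations = hill_climb((0, 0, 0))
best, best_score = exhaustive()
print(config(found), found_score, evaluations)
print(config(best), best_score)
```

Hill climbing evaluates only the points near its path, but it is guaranteed to stop only at a local optimum. Comparing its result against the exhaustive answer on small spaces is one way to check whether a faster search still reaches the true optimum.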
- Designing Optimal Heterogeneous Multi-core Architectures.
While an application-specific superscalar processor is optimal for its intended application, it is severely suboptimal for other applications.
Thus, a single application-specific superscalar processor is not viable for general-purpose computing systems that support many diverse workloads.
Customization can be extended to general-purpose computing systems by incorporating many differently-designed workload-customized superscalar processors
in an overall heterogeneous multi-core architecture.
Heterogeneous multi-core opens up several interesting research topics.
(i) One issue is how to divide up the workload space for customizing a limited number of core types to subsets of the space.
The quality of workload subsetting strongly influences the optimality of the resulting heterogeneous multi-core architecture and the speedup extracted from customization.
The paper "Configurational Workload Characterization" attempts to make a case for characterizing and subsetting workloads based on
their performance on each other's customized cores rather than raw workload characteristics, due to global interactions among different pipeline units.
H. Hashemi Najaf-abadi and E. Rotenberg.
Configurational Workload Characterization.
Proceedings of the
2008 IEEE International Symposium on
Performance Analysis of Systems and Software
(ISPASS'08),
pp. 147-156,
April 2008.
[pdf]
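As a rough illustration of subsetting workloads by their performance on each other's customized cores, the sketch below picks a limited number of core types from a cross-performance matrix and then groups the workloads by the chosen core each one prefers. It is not the method of the paper above. The matrix values, the workload names, and the exhaustive selection over combinations are assumptions made for this example.

```python
from itertools import combinations

# PERF[app][core]: hypothetical performance of `app` on the core customized
# for workload `core`, normalized to the app's own customized core (1.00).
PERF = {
    "A": {"A": 1.00, "B": 0.95, "C": 0.60, "D": 0.55},
    "B": {"A": 0.93, "B": 1.00, "C": 0.65, "D": 0.60},
    "C": {"A": 0.58, "B": 0.62, "C": 1.00, "D": 0.96},
    "D": {"A": 0.50, "B": 0.57, "C": 0.94, "D": 1.00},
}


def coverage(cores):
    """Total performance when every app runs on its best core in `cores`."""
    return sum(max(PERF[app][core] for core in cores) for app in PERF)


def pick_cores(k):
    """Choose the k core types that maximize total performance."""
    return max(combinations(PERF, k), key=coverage)


def subsets(cores):
    """Group apps by the chosen core on which each performs best."""
    groups = {core: [] for core in cores}
    for app in PERF:
        best = max(cores, key=lambda core: PERF[app][core])
        groups[best].append(app)
    return groups


chosen = pick_cores(2)
print(chosen)           # ('B', 'D')
print(subsets(chosen))  # {'B': ['A', 'B'], 'D': ['C', 'D']}
```

In this invented matrix, workloads A and B tolerate each other's cores well, as do C and D, so two core types recover most of the benefit of four.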
(ii) There is the question of whether to customize a core for an application, a class of applications, or a class of application behavior.
(iii) Incomplete knowledge of the workload space is perhaps inevitable.
This makes the task of designing a heterogeneous multi-core architecture that much more difficult.
The architect must strike a balance between tuning cores to known workloads for maximum performance and ensuring robust performance for unforeseen workloads.
Item (ii) above is relevant here.
It would seem that customizing different cores to different behaviors would provide the most robust performance both within and across applications.
(iv) In a heterogeneous multi-core architecture composed of a limited number of core types,
the pattern of task arrivals influences the optimal choice of core types.
For example, if tasks of the same type arrive in bursts (characteristic of multithreaded workloads), homogeneity is preferable.
For non-bursty arrivals of a given task type (characteristic of multiprogrammed workloads), heterogeneity is very beneficial to performance.
We developed a generalized model of task arrival characteristics and determined the influence that each characteristic has on the optimal selection of core types.
The conclusion is that task arrival characteristics need to be accounted for in the design of processing cores, not just conventional workload characteristics.
Another finding is that, in the context of multiple task arrival behaviors, using a single averaged behavior to guide the design of processing cores
leads to a suboptimal architecture.
The study culminates in two recommendations based on these findings:
first, that benchmark vendors include task arrival characteristics in future benchmark specifications and,
second, that architects account for different task arrival characteristics in the design.
Our paper "Core-Selectability in Chip Multiprocessors" is relevant to the latter recommendation.
Core-selectability in a chip multiprocessor provides a selection of homogeneous and heterogeneous configurations.
H. Hashemi Najaf-abadi and E. Rotenberg.
The Importance of Accurate Task Arrival Characterization in the Design of Processing Cores.
Proceedings of the
2009 IEEE International Symposium on Workload Characterization (IISWC'09),
pp. ??-??,
October 2009. (to appear)
[pdf]
H. Hashemi Najaf-abadi, N. K. Choudhary, and E. Rotenberg.
Core-Selectability in Chip Multiprocessors.
Proceedings of the
18th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-18),
pp. 113-122,
September 2009.
[pdf]
(v) The following study explores the problem of providing the best instantaneous architectural configuration for a running program.
We find that the best configuration varies at a fine granularity, too fine to be exploited by conventional adaptive pipelines or heterogeneous multi-core architectures.
Architectural contesting is a novel approach that automatically and rapidly switches effective execution to the best core for the instantaneous workload behavior.
This study is a specific instance of a broader research topic: how to schedule tasks on a heterogeneous multi-core architecture.
H. Hashemi Najaf-abadi and E. Rotenberg.
Architectural Contesting.
Proceedings of the
15th IEEE International Symposium on High-Performance Computer Architecture (HPCA-15),
pp. 189-200,
February 2009.
[pdf]
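The potential of fine-grain switching can be bounded with a simple calculation. The sketch below compares the best single core against an oracle that runs every program phase on whichever core is faster for that phase. It ignores switching costs and does not model the contesting mechanism itself. The core names and per-phase times are hypothetical.

```python
# Hypothetical execution time (arbitrary units) of four consecutive program
# phases on two differently customized cores.
PHASE_TIME = {
    "core_a": [4, 9, 3, 8],
    "core_b": [7, 5, 6, 4],
}


def best_single_core():
    """Core with the lowest total time when the whole program runs on it."""
    return min(PHASE_TIME, key=lambda core: sum(PHASE_TIME[core]))


def oracle_time():
    """Total time if each phase runs on its fastest core, at no switch cost."""
    return sum(min(times) for times in zip(*PHASE_TIME.values()))


single = best_single_core()
single_time = sum(PHASE_TIME[single])
print(single, single_time)           # core_b 22
print(oracle_time())                 # 16
print(single_time / oracle_time())   # 1.375
```

Here neither core is best in every phase, so the oracle beats the best single core by 1.375x. How much of that bound survives depends on how quickly and cheaply execution can move between cores.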
- Understanding the Value of Core Customization.
With the FabScalar framework, we will gain a much better understanding of the value of core customization.
(i) How much additional performance does customization actually provide?
(ii) How different are application-customized cores?
Does a given application perform substantially differently on these cores?
Is there one core among them that achieves close to the highest performance for every benchmark?
(iii) Can a fully reconfigurable core approximate the performance of arbitrary customized cores?
In future work, we aim to design a fully reconfigurable core (a valuable contribution in its own right) and compare its configurations with corresponding customized cores.
The paper "Core-Selectability in Chip Multiprocessors" discusses three overheads of reconfigurability:
overhead at the logic level (additional propagation delay),
overhead at the pipeline level (the difficulty in approximating whole-pipeline customization with locally reconfigurable units),
and overhead at the verification level (the difficulty in verifying all valid configurations of the pipeline and the configuration logic itself).
The paper alludes to a unique advantage of using multiple customized cores:
the microarchitecture design effort is partitioned and focused on specific types of workload behavior,
rather than attempting to pack everything into one complex design --
a single fixed or reconfigurable core that performs well for all types of workload behavior.
The partitioned approach both simplifies design-for-performance and achieves a better, uncompromised result.
H. Hashemi Najaf-abadi, N. K. Choudhary, and E. Rotenberg.
Core-Selectability in Chip Multiprocessors.
Proceedings of the
18th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-18),
pp. 113-122,
September 2009.
[pdf]
- Better Baselines in Microarchitecture Research.
Choosing a baseline microarchitecture for evaluating a proposed microarchitecture enhancement is tricky.
Because of intrinsic imbalances in the baseline, the choice of baseline configuration can either exaggerate or obscure the perceived speedup of the enhancement.
Customization is a solution to the baseline problem.
In one approach, the baseline microarchitecture is the customized core for the benchmark (which is unusual in that different benchmarks use different baselines).
In a second approach, the baseline microarchitecture is the customized core for the benchmark, as before, but this baseline is compared to
a recustomized core with the microarchitecture enhancement.
Recustomizing the core with the enhancement accounts for the fact that there is a new global optimum due to the new dynamic introduced by the enhancement.
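The difference between these ways of reporting speedup can be sketched with a small table. Everything here is hypothetical: the three configurations, their performance with and without the enhancement, and the use of a fixed "small" configuration as the arbitrary baseline.

```python
# Hypothetical performance of one benchmark on three configurations,
# without ("base") and with ("enh") a proposed enhancement.
PERF = {
    "small":  {"base": 1.00, "enh": 1.30},
    "medium": {"base": 1.20, "enh": 1.32},
    "large":  {"base": 1.10, "enh": 1.45},
}


def customized(variant):
    """Configuration with the highest performance for the given variant."""
    return max(PERF, key=lambda cfg: PERF[cfg][variant])


def speedup_fixed(cfg):
    """Speedup measured on an arbitrary fixed baseline configuration."""
    return PERF[cfg]["enh"] / PERF[cfg]["base"]


def speedup_customized_baseline():
    """First approach: add the enhancement to the customized baseline core."""
    return speedup_fixed(customized("base"))


def speedup_recustomized():
    """Second approach: recustomize the core with the enhancement included."""
    return PERF[customized("enh")]["enh"] / PERF[customized("base")]["base"]


print(f"fixed 'small' baseline: {speedup_fixed('small'):.2f}")
print(f"customized baseline:    {speedup_customized_baseline():.2f}")
print(f"recustomized core:      {speedup_recustomized():.2f}")
```

With these invented numbers, the fixed baseline reports 1.30, the customized baseline reports 1.10, and recustomization reports about 1.21. The fixed baseline exaggerates the benefit because that configuration was poorly balanced to begin with, while the first approach understates it because the enhancement shifts the optimum to a different configuration.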
- Reconsidering Microarchitecture Techniques in Heterogeneous Multi-Core Architectures.
The 1990s were a golden age of microarchitecture research: many microarchitecture optimizations were proposed during that time.
Many of them have not been put into practice.
One plausible reason is that many techniques are not universally beneficial, i.e., they provide significant benefit only in limited circumstances (and may even degrade performance in other circumstances).
There is good reason to reconsider previously proposed microarchitecture optimizations in the context of heterogeneous multi-core architectures,
because universal applicability is relaxed in these architectures by design.
Techniques such as trace caches, value prediction, instruction/trace/computation reuse, speculative multithreading, control independence, etc.,
may be highly beneficial under certain workload behavior and very cost-effective (area and power efficient) in the context of whole-pipeline customization.
Tools
FabScalar:
- Memory compiler for specialized highly-ported RAMs (coming soon)
- Fully-synthesizable HDL model of a superscalar processor (coming soon)
- CVS repository of the Standard Superscalar Library (SSL) and high-level superscalar synthesis tool
XPScalar: Our initial superscalar processor exploration framework that uses CACTI to model propagation delays.
(Currently, the tables containing customized core configurations for various benchmark suites can only be viewed with Internet Explorer.)
Publications
Conference and Journal Papers
H. Hashemi Najaf-abadi and E. Rotenberg.
The Importance of Accurate Task Arrival Characterization in the Design of Processing Cores.
Proceedings of the
2009 IEEE International Symposium on Workload Characterization (IISWC'09),
pp. ??-??,
October 2009. (to appear)
[pdf]
H. Hashemi Najaf-abadi, N. K. Choudhary, and E. Rotenberg.
Core-Selectability in Chip Multiprocessors.
Proceedings of the
18th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT-18),
pp. 113-122,
September 2009.
[pdf]
H. Hashemi Najaf-abadi and E. Rotenberg.
Architectural Contesting.
Proceedings of the
15th IEEE International Symposium on High-Performance Computer Architecture (HPCA-15),
pp. 189-200,
February 2009.
[pdf]
H. Hashemi Najaf-abadi and E. Rotenberg.
Configurational Workload Characterization.
Proceedings of the
2008 IEEE International Symposium on
Performance Analysis of Systems and Software
(ISPASS'08),
pp. 147-156,
April 2008.
[pdf]
Workshop Papers
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, S. S. Navada, H. Hashemi Najaf-abadi, and E. Rotenberg.
FabScalar.
4th Workshop on Architectural Research Prototyping (WARP'09), in conjunction with ISCA-36,
June 2009.
[pdf]
H. Hashemi Najaf-abadi and E. Rotenberg.
Exploiting Detachability: A Non-Silicon Approach to Polymorphism.
4th Workshop on Non-Silicon Computing (NSC-4), in conjunction with ISCA-34,
June 2007.
[pdf]
H. Hashemi Najaf-abadi and E. Rotenberg.
Architectural Contesting: Exposing and Exploiting Temperamental Behavior.
- Reconfigurable and Adaptive Architecture Workshop (RAAW), in conjunction with MICRO-39,
December 2006.
- Also appears in
ACM SIGARCH Computer Architecture News (CAN),
35(3):28-35,
June 2007.
[pdf]
Student Theses
N. K. Choudhary.
A Synthesizable HDL Model for Out-of-Order Superscalar Processors.
M.S. Thesis,
Department of Electrical and Computer Engineering,
North Carolina State University,
August 2009.
[NCSU library: on-line thesis]
Talks
FabScalar.
Presented at WARP-2009 (held in conjunction with ISCA-36) by E. Rotenberg.
[pps]
Funding
This project is supported by NSF grant No. CCF-0811707 (CPA-CSA: FabScalar: A Standard Superscalar Library for Fabricating Heterogeneous Chip Multiprocessors),
Intel, and IBM.
Any opinions, findings, and conclusions or recommendations
expressed in this website and publications herein are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.