Andrew Lukefahr - Research Statement

Today’s world increasingly depends upon computation. From data centers, which consume megawatts of power, to mobile phones, whose battery life is precious to their users, to wireless sensors, which survive entirely on harvested energy, the efficiency of computation is critical to continued innovation in computing. My past research has focused on improving the efficiency of mobile-class CPU architectures by introducing heterogeneity within the processor itself, enabling it to dynamically customize the execution path to maximize energy efficiency.

While smartphones allowed us to bring computing wherever we wanted, the world of tomorrow will include computing everywhere and in everything. Yet, if done correctly, this computing will pass largely unnoticed, silently monitoring our world and bringing more information and control to our fingertips. No power cords, no huge batteries, just simple devices hiding in everyday life. These devices must be efficient enough to survive on energy harvested from their surroundings. However, this dream has been slow to become reality due to the complexity of designing and deploying such systems. Toward that end, my future work will focus on improving the efficiency, and easing the development hurdles, of these Internet-of-Things systems.

Current Research: Composite Cores

Smartphones are powerful devices capable of bringing the entire Internet to your fingertips, and streaming video calls from practically anywhere. They accomplish these tasks while fitting neatly in the palm of your hand. However, once the battery is depleted, these super-devices become nothing more than paperweights. Future generations of these devices must continue to provide the high-performance capabilities we have come to expect while further pushing energy efficiency to maximize battery life.

One successful approach to improving energy efficiency is to include a heterogeneous mix of high-performance (but high-power) processors and low-power (but lower-performing) processors in the same system. These architectures attempt to detect execution regions where the high-performance processors are underutilized or unnecessary, and map that execution onto the more efficient processors [7]. This idea has been successfully commercialized as big.LITTLE by ARM Ltd [5]. Because the processors in these designs communicate only through memory, migrating between processors incurs significant performance overhead. To amortize these overheads, migration occurs only once every hundreds of millions of instructions.

My work argues that this coarse-grained interval approach overlooks many opportunities to improve efficiency. Even regions that effectively utilize the high-performance processor still experience short periods, or phases, of low utilization. Successfully capturing these phases has the potential to triple the utilization of a low-power processor with no impact on performance [11]. However, these low-utilization phases are too brief to be detected and exploited by traditional techniques, wasting precious energy reserves. Rather than relying on heterogeneity across processors, my research brings heterogeneity within a processor [11, 10]. To accomplish this goal, I introduced a novel processor architecture, called a Composite Core, which incorporates both a high-performance pipeline and an energy-efficient pipeline. The architecture is designed to exploit these brief fine-grained phases by allowing low-overhead migration to and from the energy-efficient pipeline.

For fine-grained migration to be feasible, the migration overheads must be negligible. Applications have considerable state associated with them beyond simply register contents. Instruction and data caches, TLBs, branch predictors, etc. all contain state that must be explicitly migrated or implicitly rebuilt after migration. Transferring this state requires tens of thousands of cycles, limiting migration granularity.

In contrast, a Composite Core is a single processor that shares many of these stateful structures, enabling it to migrate between pipelines without stopping the in-flight instruction stream. This seamless transition requires correctly handling the transient state contained within both pipelines. It is particularly challenging for memory operations split across pipelines, where ordering violations can cause incorrect execution. A Composite Core relies on a shared register file and a lightweight memory-alias detection mechanism to enable seamless transitions between pipelines. These features enable a Composite Core to migrate between high-performance and energy-efficient modes as often as every fifty instructions, a granularity seven orders of magnitude finer than traditional systems, and allow it to double the utilization of the energy-efficient mode without impacting performance.

Traditional heterogeneous architectures rely on sampling performance on both processor types to determine a region's characteristics, then fix the application to one processor type for millions of cycles. However, fine-grained phases are too unstable for this sampling approach to be effective. I therefore designed a controller that relies on simple hardware-level performance modeling to detect and react to constant fine-grained performance changes. My colleagues and I further refined the controller to predict the performance of the upcoming trace, enabling a Composite Core to preemptively migrate between pipelines [16].
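As a rough illustration of the decision such a controller makes, consider the simplified C sketch below. It is not the hardware implementation; the counter fields, model coefficients, and slowdown threshold are hypothetical placeholders for values that the hardware would derive and apply.

    #include <stdbool.h>
    #include <stdio.h>

    /* Performance counters gathered over one fine-grained phase while
     * running on the energy-efficient pipeline (fields are illustrative). */
    struct phase_counters {
        unsigned long cycles;
        unsigned long cache_misses;
        unsigned long branch_mispredicts;
    };

    /* Estimate the cycles the same phase would need on the high-performance
     * pipeline using a simple linear model (coefficients are made up here;
     * in practice they would be fit offline or learned in hardware). */
    static double estimate_big_cycles(const struct phase_counters *c)
    {
        return 0.6 * (double)c->cycles
             + 20.0 * (double)c->cache_misses
             + 14.0 * (double)c->branch_mispredicts;
    }

    /* Stay on the efficient pipeline unless doing so would exceed the
     * allowed slowdown target (e.g., 5%). */
    static bool should_migrate_to_big(const struct phase_counters *c,
                                      double max_slowdown)
    {
        double big = estimate_big_cycles(c);
        return ((double)c->cycles - big) / big > max_slowdown;
    }

    int main(void)
    {
        struct phase_counters phase = { 100000, 1200, 800 };
        printf("migrate to big pipeline: %d\n",
               should_migrate_to_big(&phase, 0.05));
        return 0;
    }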

This architecture enables a Composite Core to reduce energy consumption by an average of 18%. For a typical smartphone, this translates into several additional hours of battery life. The approach achieves higher energy efficiency than both traditional heterogeneous multicore designs, such as big.LITTLE, and competing technologies such as fine-grained DVFS [8]. The resulting technologies have been patented and are licensed by ARM Ltd [9, 12, 14, 15].

Future Research Directions

The world we live in is increasingly filled with computation. We are already accustomed to computation in our offices, pockets, and cars. After many years of promises, we are beginning to see connected computation creep into our everyday surroundings. This ongoing connection revolution has already brought us everything from medicine bottles that track when we take medications, to trash cans that send alerts when full, to bridges that monitor their own structural integrity. By the year 2020, our world will include over 20 billion of these connected devices working right under our noses [2]. This Internet of Things (IoT) poses challenges in both the architectural and the development space.

Achieving computing that fades into the background of life will require future IoT devices to live without the cord. Unlike smartphones, which need only survive a day disconnected, practical IoT devices must survive indefinitely, powered only by energy harvested from their surroundings. Yet we still demand that they silently and tirelessly monitor, collect information, and respond to our commands. These devices will have tiny energy budgets but big requirements, and every last (milli)joule will count.

What is exciting about the Internet of Things is the possibility of tackling grand problems using simple devices. For example, it is no secret that modern agribusiness can be hard on the environment. One need look no further than the more than 6,000-square-mile “dead zone” in the Gulf of Mexico (caused largely by excessive agricultural fertilizer runoff) or the 9% drop in the Ogallala Aquifer (caused largely by agricultural irrigation). Yet everyone needs to eat, and farmers have families to support. Now imagine an army of IoT sensors embedded directly into the ground. They would monitor available nutrients and moisture content, enabling farmers to apply fertilizer and irrigation only where needed, saving them money and reducing the environmental impact.

Yet these devices are not available today. Why? Complexity. Building these IoT devices is difficult, requiring expert-level skills in multiple fields, e.g., signal processing, wireless communication, circuit layout, and software design. Even when leveraging existing hardware, deploying these devices at scale brings further complications. How do these devices automatically determine their location within the field? How do they communicate their sensor information to the interested user? How do you detect and debug malfunctioning sensors? What are the security implications if one of these devices becomes compromised? IoT devices must do all of this on an energy budget of a millijoule [4]. My goal is to aid IoT development with hardware and software techniques that automatically reduce energy consumption, freeing developers to worry more about correctness and less about energy.

To cope with limited energy reserves, many IoT processors are aggressively duty-cycled, often spending 99-99.9% of their operational time in standby mode. Hence, optimizing standby power is critical. To achieve this, many IoT systems incorporate only a tiny processor, which offers single-digit-microwatt standby power but is incapable of significant computation [1]. Instead, these systems offload computation to specialized accelerators, painstakingly designed and customized for a single application. This increases development time and leaves these systems incapable of adapting to new applications. One solution I want to explore is introducing processor heterogeneity into the IoT space. Here a tiny processor would continue to control the system, enabling low standby power. A second, more capable processor would share the memory space and could be powered on only when increased computation is needed, decreasing the reliance on accelerators and enabling support for a broader range of applications. However, as these systems rely on harvested (and hence intermittent) energy, the designer must now decide which processor is required and ensure sufficient energy reserves to power it. My goal is to relieve some of this burden by designing a system that automatically and seamlessly maps computation to the most appropriate processor, matching the energy reserves and computational requirements of the system.
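The following C sketch illustrates the flavor of such a runtime decision. The energy-measurement and power-control routines are hypothetical stubs standing in for whatever a real platform would provide, and the per-task cost estimates are assumed to come from offline profiling.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical platform hooks; a real system would read a fuel gauge or
     * harvester circuit and toggle power-management registers. */
    static uint32_t harvested_reserve_uj(void) { return 5000; } /* stub value */
    static void big_core_power_on(void)  { puts("big core on");  }
    static void big_core_power_off(void) { puts("big core off"); }

    /* Per-task estimates, e.g., produced by profiling. */
    struct task {
        void (*run)(void);
        uint32_t tiny_runtime_ms;   /* runtime on the tiny core            */
        uint32_t deadline_ms;       /* latency the application can accept  */
        uint32_t big_cost_uj;       /* energy to run on the larger core    */
    };

    /* Keep work on the tiny core when it meets the deadline; wake the larger
     * core only when the tiny core is too slow and energy reserves allow it. */
    static void dispatch(const struct task *t)
    {
        bool tiny_ok        = t->tiny_runtime_ms <= t->deadline_ms;
        bool can_afford_big = harvested_reserve_uj() >= t->big_cost_uj;

        if (!tiny_ok && can_afford_big) {
            big_core_power_on();
            t->run();                       /* runs on the larger core */
            big_core_power_off();
        } else {
            t->run();                       /* runs on the tiny core */
        }
    }

    static void filter_samples(void) { puts("filtering samples"); }

    int main(void)
    {
        struct task filter = { filter_samples, /*tiny*/ 400, /*deadline*/ 100,
                               /*big cost*/ 3000 };
        dispatch(&filter);
        return 0;
    }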

In addition to processors, IoT designers must also cope with performance and standby energy consumption within the memory system. Volatile memories (e.g., SRAM) are fast and low energy, but consume energy to retain data while in standby mode. Non-volatile memories (e.g., Flash, PCM, STT-RAM, RRAM) offer energy-free data retention at the cost of write energies and latencies that are orders of magnitude higher [13]. Today, IoT systems often include both memory types, and the designer is responsible for managing all data movement to and from each. Worse, the available (harvested) energy can drop below even standby levels at unexpected times, destroying all data not stored in non-volatile memory [3]. Imagine trying to debug such systems!

One solution for reducing standby power is to automatically power off unused portions of volatile memory (losing their contents). Here we can re-use profiling, i.e., running the application with a sample input to determine its runtime behavior, a technique originally designed to enable a compiler to perform additional performance optimizations [6]. In the energy-constrained world of IoT, the same profiling capability can be leveraged to improve energy efficiency. An energy-profiling compiler could determine how much volatile memory is required to preserve critical data, and automatically power off unnecessary memory during standby, reducing energy consumption. It could also automatically insert periodic backups of critical data to non-volatile memory to protect against complete power loss.
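A sketch of the code such a compiler might emit around a standby transition appears below. The SRAM-bank and sleep routines are hypothetical placeholders (real microcontrollers expose similar, vendor-specific controls), and the choice of which data is critical is assumed to come from profiling.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Critical state that profiling identified as live across standby. */
    struct app_state {
        uint32_t sample_count;
        int16_t  last_readings[8];
    };

    static struct app_state live_state;    /* resides in volatile SRAM          */
    static struct app_state nvm_backup;    /* stands in for a non-volatile
                                              region (e.g., FRAM or flash)      */

    /* Hypothetical platform hooks for bank power gating and sleep entry. */
    static void sram_bank_power_off(unsigned bank) { printf("bank %u off\n", bank); }
    static void enter_standby(void)                { puts("entering standby");     }

    /* Compiler-inserted sequence: checkpoint critical data to non-volatile
     * memory, power off SRAM banks holding only dead data, then sleep. */
    static void prepare_for_standby(void)
    {
        memcpy(&nvm_backup, &live_state, sizeof live_state);
        sram_bank_power_off(2);   /* profiling showed no live data in banks 2-3 */
        sram_bank_power_off(3);
        enter_standby();
    }

    int main(void)
    {
        live_state.sample_count = 42;
        prepare_for_standby();
        return 0;
    }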

However, energy-profiling can do even more. To service an interrupt, many of today’s IoT architectures require the programmer to manually power up any required peripherals. Here, an energy-profiling compiler or runtime system can push energy savings further by automatically powering up only the peripherals needed for the given interrupt. In these cases, instructions for enabling the energy savings can be generated automatically by the compiler, freeing the programmer to focus on correctness rather than energy savings.
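For example, a timer interrupt that samples one sensor might end up looking like the sketch below, where the ADC power-control calls are hypothetical stand-ins for the compiler-generated code a real platform would require.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical peripheral hooks; on real hardware these would gate the
     * ADC's clock and power domain. */
    static void     adc_power_on(void)      { puts("ADC on");  }
    static void     adc_power_off(void)     { puts("ADC off"); }
    static uint16_t adc_read(unsigned ch)   { (void)ch; return 512; }
    static void     log_sample(uint16_t v)  { printf("sample: %u\n", v); }

    /* Timer interrupt: only the ADC is powered, and only for the read itself;
     * the radio and other sensors remain off. */
    static void timer_isr(void)
    {
        adc_power_on();            /* inserted: enable only what this handler needs */
        uint16_t moisture = adc_read(0);
        adc_power_off();           /* inserted: return to minimal standby state */
        log_sample(moisture);
    }

    int main(void)
    {
        timer_isr();               /* simulate the interrupt firing once */
        return 0;
    }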

This world of IoT is not without its challenges, requiring a constant focus on minimizing energy consumption throughout the system. My goal is to enable hardware and software techniques to automatically reduce energy consumption, freeing developers to worry more about application correctness and less about energy. While I plan to explore a multitude of ideas, this statement discusses three concrete directions: heterogeneity of IoT processors, energy-aware data orchestration for memory, and minimizing energy consumption for interrupt handling. With access to a stable, energy-worry-free platform, multitudes of creative people will be able to leverage these devices to take an idea and quickly turn it into truly ubiquitous computing.



References

[1] Atmel. SAM L22 datasheet (complete). http://www.atmel.com/Images/Atmel-42402-SAM-L22_Datasheet.pdf.
[2] Gartner. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015. http://www.gartner.com/newsroom/id/3165317.
[3] A. Colin, G. Harvey, B. Lucia, and A. P. Sample. An energy-interference-free hardware-software debugger for intermittent energy-harvesting systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 577–589. ACM, 2016.
[4] S. DeBruin, B. Campbell, and P. Dutta. Monjolo: An energy-harvesting energy meter architecture. In Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems (SenSys), page 18. ACM, 2013.
[5] P. Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, Sept. 2011. http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf.
[6] W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, et al. The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 7(1-2):229–248, 1993.
[7] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), 2003.
[8] A. Lukefahr, S. Padmanabha, R. Das, R. Dreslinski, T. F. Wenisch, and S. Mahlke. Heterogeneous microarchitectures trump voltage scaling for mobile cores. In Proceedings of the 23rd Annual Conference on Parallel Architectures and Compilation Techniques (PACT-23), 2014.
[9] A. Lukefahr, S. Padmanabha, R. Das, and S. Mahlke. Heterogeneity within a processor core. US Patent 14/093090, filed November 29, 2013.
[10] A. Lukefahr, S. Padmanabha, R. Das, F. Sleiman, R. Dreslinski, T. Wenisch, and S. Mahlke. Exploring fine-grained heterogeneity with composite cores. IEEE Transactions on Computers, 2015.
[11] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke. Composite cores: Pushing heterogeneity into a core. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), 2012.
[12] A. Lukefahr, S. Padmanabha, J. Yu, R. Das, and S. Mahlke. Controlling transition between using first and second processing circuitry. US Patent 15/063651, filed March 8, 2016.
[13] S. Mittal, J. S. Vetter, and D. Li. A survey of architectural approaches for managing embedded DRAM and non-volatile on-chip caches. IEEE Transactions on Parallel and Distributed Systems, 26(6):1524–1537, 2015.
[14] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke. Control of switching between executed mechanisms. US Patent 14/060393, filed October 14, 2014.
[15] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke. Trace based phase prediction for tightly-coupled heterogeneous cores. US Patent 14/093042, filed November 29, 2013.
[16] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke. Trace based phase prediction for tightly-coupled heterogeneous cores. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 2013.