Publications
2025
- MicroSampler: A Framework for Microarchitecture-Level Leakage Detection in Constant Time Execution. Moein Ghaniyoun, Kristin Barber, Yinqian Zhang, and Radu Teodorescu. In The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2025.
Constant-time programming is a principal line of defense against timing side-channel attacks. It involves hardening software so that execution time is uncorrelated with sensitive data values, and is now broadly employed in most cryptographic and other security-critical kernels. However, constant-time programming relies on necessary assumptions about the underlying microarchitectural implementation, which are frequently incorrect or incomplete, leading to exploits. Consequently, devising methodologies for joint leakage detection in high assurance applications, compiler optimizations and microarchitectural implementations is an increasingly important problem. This paper presents MicroSampler, a dynamic leakage detection framework to identify secret-dependent microarchitectural behavior that can lead to side-channel leakage in security-critical software. MicroSampler runs the constant-time code to be verified on a cycle-accurate register-transfer level (RTL) simulation of the target system and builds a comprehensive and detailed representation of microarchitectural state captured at cycle granularity. MicroSampler then uses statistical analysis to measure any existing association between microarchitectural state and data values that are identified as sensitive (e.g., encryption keys). We demonstrate the utility of the proposed leakage detection framework through multiple case studies. We show MicroSampler is able to reveal vulnerabilities in constant-time encryption code in diverse cases where the vulnerabilities originate in the algorithm design, compiler optimizations or the microarchitectural implementation.
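The statistical analysis described above can be illustrated with a minimal sketch (not MicroSampler's actual implementation): a Welch's t-test, as used in standard leakage-assessment practice, flags any simulated signal whose per-cycle samples differ significantly between two classes of secret inputs. The signal layout, sample counts, and the 4.5 threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def leakage_t_test(traces_a, traces_b, threshold=4.5):
    """Welch's t-test per signal: |t| above the threshold flags a
    statistically significant dependence on the secret class."""
    ma, mb = traces_a.mean(axis=0), traces_b.mean(axis=0)
    va, vb = traces_a.var(axis=0, ddof=1), traces_b.var(axis=0, ddof=1)
    t = (ma - mb) / np.sqrt(va / len(traces_a) + vb / len(traces_b) + 1e-12)
    return np.abs(t) > threshold

# Toy data: 3 simulated microarchitectural signals sampled over 100 runs
# per secret class; only signal 2 carries a secret-dependent shift.
rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, size=(100, 3))
class_b = rng.normal(0.0, 1.0, size=(100, 3))
class_b[:, 2] += 3.0
flags = leakage_t_test(class_a, class_b)
print(flags)  # flags only the secret-dependent signal
```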
@inproceedings{ghaniyoun2025microsampler, title = {MicroSampler: A Framework for Microarchitecture-Level Leakage Detection in Constant Time Execution}, author = {Ghaniyoun, Moein and Barber, Kristin and Zhang, Yinqian and Teodorescu, Radu}, booktitle = {The 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)}, pages = {1--15}, year = {2025}, }
2024
- Voltage Noise-Based Adversarial Attacks on Machine Learning Inference in Multi-Tenant FPGA Accelerators. Saikat Majumdar and Radu Teodorescu. In 2024 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 2024.
Deep neural network (DNN) classifiers are known to be vulnerable to adversarial attacks, in which a model is induced to misclassify an input into the wrong class. These attacks affect virtually all state-of-the-art DNN models. While most adversarial attacks work by altering the classifier input, recent variants have also targeted the model parameters. This paper focuses on a new attack vector on DNN models that leverages computation errors, rather than memory errors, deliberately introduced during DNN inference to induce misclassification. In particular, it examines errors introduced by voltage noise into FPGA-based accelerators as the attack mechanism. In an advancement over prior work, the paper demonstrates that targeted attacks are possible, even when randomly occurring faults are used. It presents an approach for precisely characterizing the distribution of faults under noise of individual input devices, by examining classification errors in select inputs. It then shows how, by fine-tuning the parameters of the attack (noise levels and target DNN layers) the attacker can produce the desired misclassification class, without altering the original input. We demonstrate the attack on an FPGA device and show the attack success rate ranges between 80% and 99.5% depending on the DNN model and dataset.
@inproceedings{majumdar_host2024, author = {Majumdar, Saikat and Teodorescu, Radu}, booktitle = {2024 IEEE International Symposium on Hardware Oriented Security and Trust (HOST)}, title = {Voltage Noise-Based Adversarial Attacks on Machine Learning Inference in Multi-Tenant FPGA Accelerators}, year = {2024}, volume = {}, number = {}, pages = {80-85}, keywords = {Computational modeling;Noise;Input devices;Artificial neural networks;Voltage;Machine learning;Vectors}, doi = {10.1109/HOST55342.2024.10545401}, }
2023
- TEESec: Pre-Silicon Vulnerability Discovery for Trusted Execution Environments. Moein Ghaniyoun, Kristin Barber, Yuan Xiao, Yinqian Zhang, and Radu Teodorescu. In Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023.
Trusted execution environments (TEE) are CPU hardware extensions that provide security guarantees for applications running on untrusted operating systems. The security of TEEs is threatened by a variety of microarchitectural vulnerabilities, which have led to a large number of demonstrated attacks. While various solutions for verifying the correctness and security of TEE designs have been proposed, they generally do not extend to jointly verifying the security of the underlying microarchitecture. This paper presents TEESec, the first pre-silicon framework for discovering microarchitectural vulnerabilities in the context of trusted execution environments. TEESec is designed to jointly and systematically test the TEE and underlying microarchitecture against data and metadata leakage across isolation boundaries. We implement TEESec in the Chipyard framework and evaluate it on two open-source RISC-V out-of-order processors running the Keystone TEE. Using TEESec we uncover 10 distinct vulnerabilities in these processors that violate TEE security principles and could lead to leakage of enclave secrets.
@inproceedings{ghaniyoun2023teesec, title = {TEESec: Pre-Silicon Vulnerability Discovery for Trusted Execution Environments}, author = {Ghaniyoun, Moein and Barber, Kristin and Xiao, Yuan and Zhang, Yinqian and Teodorescu, Radu}, booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture}, pages = {1--15}, year = {2023}, }
2022
- ENCLYZER: Automated Analysis of Transient Data Leaks on Intel SGX. Jiuqin Zhou, Yuan Xiao, Radu Teodorescu, and Yinqian Zhang. In 2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED), 2022.
Trusted Execution Environment (TEE) is the cornerstone of confidential computing. Among TEEs, Intel® Software Guard Extensions (Intel® SGX) is the most prominent solution and is frequently used in the public cloud to provide confidential computing services. Intel® SGX provides runtime confidentiality and integrity for enclaves with minimal modifications to existing CPU microarchitectures. However, transient execution attacks such as L1 Terminal Fault (L1TF), Microarchitectural Data Sampling (MDS), and TSX Asynchronous Abort (TAA) have exposed certain vulnerabilities within the Intel® SGX solution. Over the past few years, Intel has developed various countermeasures against most of these vulnerabilities via microcode updates and hardware fixes. However, arguably, no existing tools or studies can measurably verify the effectiveness of these countermeasures. In this paper, we introduce an automated analysis tool, called ENCLYZER, to evaluate transient execution vulnerabilities on Intel® SGX. We leverage ENCLYZER to comprehensively analyze a set of processors, with multiple versions of their microcode, to verify the correctness of these countermeasures. Our empirical analysis suggests that most countermeasures are effective in preventing attacks initiated from the same CPU hyperthread, but less effective against cross-thread attacks. Therefore, applying the latest microcode patches and disabling hyperthreading is warranted to enhance the security of Intel® SGX-enabled systems. Security configurations such as hyperthreading enabled/disabled are attestable on the Intel® SGX platform, giving users increased confidence when assessing system trustworthiness. Note that these security configurations cannot be modified without a system reboot.
@inproceedings{zhou_seed2022, author = {Zhou, Jiuqin and Xiao, Yuan and Teodorescu, Radu and Zhang, Yinqian}, booktitle = {2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED)}, title = {ENCLYZER: Automated Analysis of Transient Data Leaks on Intel SGX}, year = {2022}, volume = {}, number = {}, pages = {145-156}, keywords = {Cloud computing;Microarchitecture;Runtime;Program processors;Software;Hardware;Security}, doi = {10.1109/SEED55351.2022.00020}, }
- DNNShield: Dynamic Randomized Model Sparsification, A Defense Against Adversarial Machine Learning. Mohammad Hossein Samavatian, Saikat Majumdar, Kristin Barber, and Radu Teodorescu. arXiv preprint, 2022.
DNNs are known to be vulnerable to so-called adversarial attacks that manipulate inputs to cause incorrect results that can be beneficial to an attacker or damaging to the victim. Recent works have proposed approximate computation as a defense mechanism against machine learning attacks. We show that these approaches, while successful for a range of inputs, are insufficient to address stronger, high-confidence adversarial attacks. To address this, we propose DNNSHIELD, a hardware-accelerated defense that adapts the strength of the response to the confidence of the adversarial input. Our approach relies on dynamic and random sparsification of the DNN model to achieve inference approximation efficiently and with fine-grain control over the approximation error. DNNSHIELD uses the output distribution characteristics of sparsified inference compared to a dense reference to detect adversarial inputs. We show an adversarial detection rate of 86% when applied to VGG16 and 88% when applied to ResNet50, which exceeds the detection rate of state-of-the-art approaches, with a much lower overhead. We demonstrate a software/hardware-accelerated FPGA prototype, which reduces the performance impact of DNNSHIELD relative to software-only CPU and GPU implementations.
@misc{samavatian2022dnnshield, title = {DNNShield: Dynamic Randomized Model Sparsification, A Defense Against Adversarial Machine Learning}, author = {Samavatian, Mohammad Hossein and Majumdar, Saikat and Barber, Kristin and Teodorescu, Radu}, year = {2022}, eprint = {2208.00498}, archiveprefix = {arXiv}, primaryclass = {cs.CR}, url = {https://arxiv.org/abs/2208.00498}, }
- Characterizing Side-Channel Leakage of DNN Classifiers through Performance Counters. Saikat Majumdar, Mohammad Hossein Samavatian, and Radu Teodorescu. In 2022 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 2022.
Rapid advancements in Deep Neural Networks (DNN) have led to their deployment in a wide range of commercial applications. DNN classifiers are powerful tools that drive a broad spectrum of important applications, from image recognition to autonomous vehicles. Like other applications, they have been shown to be vulnerable to side-channel information leakage. There have been several proof-of-concept attacks demonstrating the extraction of their model parameters and input data. However, no prior study has examined the possibility of using side-channels to extract the DNN classifier’s decision or output. In this initial study, we aim to understand if there exists a correlation between the output class selected by a classifier and side-channel information collected while running the inference process on a CPU. Our initial evaluation shows that with the proposed approach it is possible to accurately recover the output class for model inputs via multiple side-channels: primarily power, but also branch mispredictions and cache misses.
@inproceedings{majumdar_host2022, author = {Majumdar, Saikat and Samavatian, Mohammad Hossein and Teodorescu, Radu}, booktitle = {2022 IEEE International Symposium on Hardware Oriented Security and Trust (HOST)}, title = {Characterizing Side-Channel Leakage of DNN Classifiers though Performance Counters}, year = {2022}, volume = {}, number = {}, pages = {45-48}, keywords = {Deep learning;Image recognition;Correlation;Neural networks;Hardware;Data models;Security}, doi = {10.1109/HOST54066.2022.9839882}, }
- A Systematic Look at Ciphertext Side Channels on AMD SEV-SNP. Mengyuan Li, Luca Wilke, Jan Wichelmann, Thomas Eisenbarth, Radu Teodorescu, and Yinqian Zhang. In 2022 IEEE Symposium on Security and Privacy (SP), 2022.
Hardware-assisted memory encryption offers strong confidentiality guarantees for trusted execution environments like Intel SGX and AMD SEV. However, a recent study by Li et al., presented at USENIX Security 2021, demonstrated the CipherLeaks attack, which monitors ciphertext changes in the special VMSA page. By leaking register values saved by the VM during context switches, they broke state-of-the-art constant-time cryptographic implementations, including RSA and ECDSA in OpenSSL. In this paper, we perform a comprehensive study of ciphertext side channels. Our work suggests that while the CipherLeaks attack targets only the VMSA page, a generic ciphertext side-channel attack may exploit ciphertext leakage from any memory page, including those holding kernel data structures, stacks and heaps. As such, AMD’s existing countermeasure to the CipherLeaks attack, a firmware patch that introduces randomness into the ciphertext of the VMSA page, is clearly insufficient. The root cause of the leakage in AMD SEV’s memory encryption, the use of a stateless yet unauthenticated encryption mode and the unrestricted read access to the ciphertext of encrypted memory, remains unfixed. Given the challenges faced by AMD in eradicating the vulnerability from the hardware design, we propose a set of software countermeasures to ciphertext side channels, including patches to the OS kernel and cryptographic libraries. We are working closely with AMD to merge these changes into affected open-source projects.
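The root cause named above (a stateless encryption mode combined with attacker-readable ciphertext) can be illustrated with a toy model. SHA-256 stands in here for the real address-tweaked block cipher, and all names, keys and addresses are hypothetical; the point is only that deterministic encryption lets an observer detect when a memory value repeats.

```python
import hashlib

def encrypt_block(key: bytes, address: int, plaintext: bytes) -> bytes:
    """Toy model of stateless, address-tweaked memory encryption: the
    ciphertext depends only on (key, address, plaintext), so the same
    value written to the same address always encrypts the same way.
    (SHA-256 is an illustrative stand-in for the real block cipher.)"""
    tweak = address.to_bytes(8, "little")
    return hashlib.sha256(key + tweak + plaintext).digest()

KEY = b"vm-memory-encryption-key"   # hypothetical per-VM key
ADDR = 0x1000                       # hypothetical address of a secret

c1 = encrypt_block(KEY, ADDR, b"secret-bit=1")
c2 = encrypt_block(KEY, ADDR, b"secret-bit=0")
c3 = encrypt_block(KEY, ADDR, b"secret-bit=1")

# An attacker who only reads ciphertext still learns that the third
# write restored the first value: a ciphertext side channel.
print(c1 == c3, c1 == c2)  # True False
```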
@inproceedings{mengyuan_sp2022, author = {Li, Mengyuan and Wilke, Luca and Wichelmann, Jan and Eisenbarth, Thomas and Teodorescu, Radu and Zhang, Yinqian}, booktitle = {2022 IEEE Symposium on Security and Privacy (SP)}, title = {A Systematic Look at Ciphertext Side Channels on AMD SEV-SNP}, year = {2022}, volume = {}, number = {}, pages = {337-351}, keywords = {Privacy;Systematics;Side-channel attacks;Data structures;Libraries;Encryption;Registers}, doi = {10.1109/SP46214.2022.9833768}, }
- A Pre-Silicon Approach to Discovering Microarchitectural Vulnerabilities in Security Critical Applications. Kristin Barber, Moein Ghaniyoun, Yinqian Zhang, and Radu Teodorescu. IEEE Computer Architecture Letters, 2022.
Paper selected as one of the best papers of IEEE Computer Architecture Letters for 2022, with an invited presentation in a special session at HPCA 2023.
Microarchitectural vulnerabilities have become an increasingly effective attack vector. This is especially problematic for security critical applications, which handle sensitive data and may employ software-level hardening in order to thwart data leakage. These strategies rely on necessary assumptions about the underlying microarchitectural implementation, which may be (and in some instances have proven to be) incorrect, leading to exploits. Consequently, devising early-stage design tools for reasoning about and verifying the correctness of high assurance applications with respect to a given hardware design is an increasingly important problem. This letter presents a principled dynamic testing methodology to reveal and analyze data-dependent microarchitectural behavior with the potential to violate assumptions and requirements of security critical software. A differential analysis is performed of the microarchitectural state space explored during register transfer-level (RTL) simulation to reveal internal activity that correlates with sensitive data used in computation. We demonstrate the utility of the proposed methodology through its ability to identify secret data leakage from selected case studies with known vulnerabilities.
@article{barber_cal2022, author = {Barber, Kristin and Ghaniyoun, Moein and Zhang, Yinqian and Teodorescu, Radu}, journal = {IEEE Computer Architecture Letters}, title = {A Pre-Silicon Approach to Discovering Microarchitectural Vulnerabilities in Security Critical Applications}, year = {2022}, volume = {21}, number = {1}, pages = {9-12}, keywords = {Microarchitecture;Hardware;Software;Registers;Computer architecture;Codes;Testing;Hardware security;verification}, doi = {10.1109/LCA.2022.3151256}, }
2021
- A Fused Inference Design for Pattern-Based Sparse CNN on Edge Devices. Jia Guo, Radu Teodorescu, and Gagan Agrawal. In 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2021.
Weight pruning approaches for Convolutional Neural Networks (CNN) have been well developed in the past years. Compared with traditional unstructured and structured pruning, the new state-of-the-art sparse convolution pattern (SCP) based pruning uses certain patterns that lead to both a high pruning rate and low accuracy loss. This paper introduces a novel inference scheme to accelerate the execution of SCP-pruned models on IoT devices with limited resources. This inference scheme applies and combines ideas from direct sparse convolution and layer fusion. To fully utilize the power of modern IoT processors, the inference is also mapped to all available cores and optimized with SIMD instructions. The experimental results show good performance improvement as well as scalability of our scheme on an edge device.
@inproceedings{guo_hipc2021, author = {Guo, Jia and Teodorescu, Radu and Agrawal, Gagan}, booktitle = {2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC)}, title = {A Fused Inference Design for Pattern-Based Sparse CNN on Edge Devices}, year = {2021}, volume = {}, number = {}, pages = {424-429}, keywords = {Performance evaluation;Program processors;Convolution;Scalability;High performance computing;Conferences;Neural networks;Deep Neural Networks;Edge Computing}, doi = {10.1109/HiPC53243.2021.00060}, }
- Using Undervolting as an on-Device Defense Against Adversarial Machine Learning Attacks. Saikat Majumdar, Mohammad Hossein Samavatian, Kristin Barber, and Radu Teodorescu. In 2021 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 2021.
Deep neural network (DNN) classifiers are powerful tools that drive a broad spectrum of important applications, from image recognition to autonomous vehicles. Unfortunately, DNNs are known to be vulnerable to adversarial attacks that affect virtually all state-of-the-art models. These attacks make small imperceptible modifications to inputs that are sufficient to induce the DNNs to produce the wrong classification. In this paper we propose a novel, lightweight adversarial correction and/or detection mechanism for image classifiers that relies on undervolting (running a chip at a voltage that is slightly below its safe margin). We propose using controlled undervolting of the chip running the inference process in order to introduce a limited number of compute errors. We show that these errors disrupt the adversarial input in a way that can be used either to correct the classification or detect the input as adversarial. We evaluate the proposed solution in an FPGA design and through software simulation. We evaluate 10 attacks and show average detection rates of 77% and 90% on two popular DNNs.
@inproceedings{majumdar_host2021, author = {Majumdar, Saikat and Samavatian, Mohammad Hossein and Barber, Kristin and Teodorescu, Radu}, booktitle = {2021 IEEE International Symposium on Hardware Oriented Security and Trust (HOST)}, title = {Using Undervolting as an on-Device Defense Against Adversarial Machine Learning Attacks}, year = {2021}, volume = {}, number = {}, pages = {158-169}, keywords = {Low voltage;Image recognition;Neural networks;Process control;Voltage;Software;Hardware;undervolting;machine learning;defense}, doi = {10.1109/HOST49136.2021.9702287}, }
- IntroSpectre: A pre-silicon framework for discovery and analysis of transient execution vulnerabilities. Moein Ghaniyoun, Kristin Barber, Yinqian Zhang, and Radu Teodorescu. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021.
Transient execution vulnerabilities originate in the extensive speculation implemented in modern high-performance microprocessors. Identifying all possible vulnerabilities in complex designs is very challenging. One of the challenges stems from the lack of visibility into the transient micro-architectural state of the processor. Prior work has used covert channels to identify data leakage from transient state, which limits the systematic discovery of all potential leakage sources. This paper presents INTROSPECTRE, a pre-silicon framework for early discovery of transient execution vulnerabilities. INTROSPECTRE addresses the lack of visibility into the micro-architectural processor state by integrating into the register transfer level (RTL) design flow, gaining full access to the internal state of the processor. Full visibility into the processor state enables INTROSPECTRE to perform a systematic leakage analysis that includes all micro-architectural structures, allowing it to identify potential leakage that may not be reachable with known side channels. We implement INTROSPECTRE on an RTL simulator and use it to perform transient leakage analysis on the RISC-V BOOM processor. We identify multiple transient leakage scenarios, most of which had not been highlighted on this processor design before.
@inproceedings{ghaniyoun2021introspectre, title = {IntroSpectre: A pre-silicon framework for discovery and analysis of transient execution vulnerabilities}, author = {Ghaniyoun, Moein and Barber, Kristin and Zhang, Yinqian and Teodorescu, Radu}, booktitle = {2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)}, pages = {874--887}, year = {2021}, organization = {IEEE}, }
- HASI: Hardware-Accelerated Stochastic Inference, A Defense Against Adversarial Machine Learning Attacks. Mohammad Hossein Samavatian, Saikat Majumdar, Kristin Barber, and Radu Teodorescu. arXiv preprint, 2021.
Deep Neural Networks (DNNs) are employed in an increasing number of applications, some of which are safety critical. Unfortunately, DNNs are known to be vulnerable to so-called adversarial attacks that manipulate inputs to cause incorrect results that can be beneficial to an attacker or damaging to the victim. Multiple defenses have been proposed to increase the robustness of DNNs. In general, these defenses have high overhead; some require attack-specific re-training of the model or careful tuning to adapt to different attacks. This paper presents HASI, a hardware-accelerated defense that uses a process we call stochastic inference to detect adversarial inputs. We show that by carefully injecting noise into the model at inference time, we can differentiate adversarial inputs from benign ones. HASI uses the output distribution characteristics of noisy inference compared to a non-noisy reference to detect adversarial inputs. We show an adversarial detection rate of 86% when applied to VGG16 and 93% when applied to ResNet50, which exceeds the detection rate of state-of-the-art approaches, with a much lower overhead. We demonstrate two software/hardware-accelerated co-designs, which reduce the performance impact of stochastic inference to 1.58X-2X relative to the unprotected baseline, compared to 15X-20X overhead for a software-only GPU implementation.
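The detection idea, rerunning inference under random perturbation and thresholding on output stability, can be sketched with a toy linear classifier. This is illustrative only: HASI injects noise into a real DNN and compares full output distributions, whereas this sketch merely shows why inputs near a decision boundary (where adversarial examples tend to sit) are far less stable under noise. All values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-class linear classifier: one unit prototype direction per class.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

def agreement_under_noise(x, sigma=0.2, n_runs=200):
    """Rerun inference with random weight noise and report how often
    the noisy prediction matches the clean one. Inputs close to a
    decision boundary flip far more often than benign inputs."""
    clean = int(np.argmax(W @ x))
    hits = 0
    for _ in range(n_runs):
        noisy_W = W + rng.normal(0.0, sigma, W.shape)
        hits += int(np.argmax(noisy_W @ x)) == clean
    return hits / n_runs

benign = np.array([1.0, 0.0])                  # far from any boundary
suspect = np.array([1.0, 1.0]) / np.sqrt(2.0)  # right on a boundary

print(agreement_under_noise(benign))   # close to 1.0 -> accept
print(agreement_under_noise(suspect))  # near 0.5 -> flag as adversarial
```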
@misc{samavatian2021hasi, author = {Samavatian, Mohammad Hossein and Majumdar, Saikat and Barber, Kristin and Teodorescu, Radu}, title = {{HASI:} Hardware-Accelerated Stochastic Inference, {A} Defense Against Adversarial Machine Learning Attacks}, journal = {CoRR}, volume = {abs/2106.05825}, year = {2021}, url = {https://arxiv.org/abs/2106.05825}, eprinttype = {arXiv}, eprint = {2106.05825}, timestamp = {Tue, 15 Jun 2021 16:35:15 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-2106-05825.bib}, bibsource = {dblp computer science bibliography, https://dblp.org}, }
- Fused DSConv: Optimizing Sparse CNN Inference for Execution on Edge Devices. Jia Guo, Radu Teodorescu, and Gagan Agrawal. In 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2021.
Accelerating CNN on resource-constrained edge devices is becoming an increasingly important problem with the emergence of IoT and edge computing. This paper proposes an execution strategy and an implementation for efficient execution of CNNs. Our execution strategy combines two previously published, but not widely used, ideas – direct sparse convolution and fusion of two convolution layers. Together with a scheme for caching intermediate results, this results in a very efficient mechanism for speeding up inference after the model has been sparsified. We also demonstrate an efficient implementation that uses both multi-core and SIMD parallelism. Our experimental results demonstrate that our scheme significantly outperforms existing implementations on an edge device, while also scaling better in a server environment.
@inproceedings{guo_ccgrid2021, author = {Guo, Jia and Teodorescu, Radu and Agrawal, Gagan}, booktitle = {2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)}, title = {Fused DSConv: Optimizing Sparse CNN Inference for Execution on Edge Devices}, year = {2021}, volume = {}, number = {}, pages = {545-554}, keywords = {Cloud computing;Convolution;Parallel processing;Servers;Optimization;Edge computing;Context modeling}, doi = {10.1109/CCGrid51090.2021.00064}, }
2020
- RNNFast: An Accelerator for Recurrent Neural Networks Using Domain-Wall Memory. Mohammad Hossein Samavatian, Anys Bacha, Li Zhou, and Radu Teodorescu. ACM Journal on Emerging Technologies in Computing Systems (JETC), Sep 2020.
Recurrent Neural Networks (RNNs) are an important class of neural networks designed to retain and incorporate context into current decisions. RNNs are particularly well suited for machine learning problems in which context is important, such as speech recognition and language translation. This work presents RNNFast, a hardware accelerator for RNNs that leverages an emerging class of non-volatile memory called domain-wall memory (DWM). We show that DWM is very well suited for RNN acceleration due to its very high density and low read/write energy. At the same time, the sequential nature of input/weight processing of RNNs mitigates one of the downsides of DWM, which is the linear (rather than constant) data access time. RNNFast is very efficient and highly scalable, with flexible mapping of logical neurons to RNN hardware blocks. The basic hardware primitive, the RNN processing element (PE), includes custom DWM-based multiplication, sigmoid and tanh units for high density and low energy. The accelerator is designed to minimize data movement by closely interleaving DWM storage and computation. We compare our design with a state-of-the-art GPGPU and find 21.8× higher performance with 70× lower energy.
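The access-time argument above can be illustrated with a toy cost model of a DWM track, where a single access port reaches bit i at a cost proportional to its distance from the current head position. This is a deliberate simplification of real DWM designs (the class name and track length are illustrative), but it shows why the sequential weight streaming of an RNN amortizes the shift cost while random access does not.

```python
import random

class DWMTrack:
    """Toy cost model of a domain-wall memory track: one access port,
    and reading bit i costs |i - head| shift operations."""
    def __init__(self, length):
        self.length, self.head, self.shifts = length, 0, 0

    def read(self, i):
        self.shifts += abs(i - self.head)
        self.head = i

seq, rnd = DWMTrack(64), DWMTrack(64)
for i in range(64):                      # streaming weights in order
    seq.read(i)

random.seed(0)
for i in random.sample(range(64), 64):   # same bits, random order
    rnd.read(i)

print(seq.shifts, rnd.shifts)  # sequential: 63 shifts; random: far more
```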
@article{samavatian_jetc2020, author = {Samavatian, Mohammad Hossein and Bacha, Anys and Zhou, Li and Teodorescu, Radu}, title = {RNNFast: An Accelerator for Recurrent Neural Networks Using Domain-Wall Memory}, year = {2020}, issue_date = {October 2020}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {16}, number = {4}, issn = {1550-4832}, url = {https://doi.org/10.1145/3399670}, doi = {10.1145/3399670}, journal = {ACM Journal on Emerging Technologies in Computing Systems (JETC)}, month = sep, articleno = {38}, numpages = {27}, keywords = {LSTM, Recurrent neural networks, accelerator, domain-wall memory}, }
- A Pattern-Based API for Mapping Applications to a Hierarchy of Multi-Core Devices. Jia Guo, Radu Teodorescu, and Gagan Agrawal. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020.
Recent years have witnessed an evolution of Internet of Things (IoT) devices. This has led to the emergence of the (related) paradigms of Edge/Fog computing, where the goal is to exploit the power of interconnected heterogeneous devices together with distributed/cloud computing. In Edge/Fog computing, one of the challenges is automatically distributing the work between different devices to reduce application latency. At the same time, with increasing transistor density and the end of Dennard scaling, even small edge devices have parallelism. Thus, we need a programming model that can help distribute the work between different devices and yet parallelize operations on each device. Motivated by the popularity of MapReduce(-like) frameworks, we develop a pattern-based high-level programming API targeting computer vision applications for the Edge/Fog paradigm with parallelism within devices. Based on this API, parallelization, workload distribution, and optimizations that account for resource limitations of IoT devices are implemented. Our evaluation with three image processing applications shows that while using a single device, we achieve 17-45% speedup over OpenCV, one of the most popular frameworks for image processing. In addition, we further gain benefits from distributing the work between multiple devices.
@inproceedings{guo_ccgrid2020, author = {Guo, Jia and Teodorescu, Radu and Agrawal, Gagan}, booktitle = {2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)}, title = {A Pattern-Based API for Mapping Applications to a Hierarchy of Multi-Core Devices}, year = {2020}, volume = {}, number = {}, pages = {11-20}, keywords = {Parallel processing;Image edge detection;Programming;Histograms;Optimization;Transforms}, doi = {10.1109/CCGrid49817.2020.00-92}, }
- SPEECHMINER: A Framework for Investigating and Measuring Speculative Execution Vulnerabilities. Yuan Xiao, Yinqian Zhang, and Radu Teodorescu. In Network and Distributed System Security Symposium (NDSS), 2020.
SPEculative Execution side Channel Hardware (SPEECH) Vulnerabilities have enabled the notorious Meltdown, Spectre, and L1 terminal fault (L1TF) attacks. While a number of studies have reported different variants of SPEECH vulnerabilities, they are still not well understood. This is primarily due to the lack of information about microprocessor implementation details that impact the timing and order of various micro-architectural events. Moreover, to date, there is no systematic approach to quantitatively measure SPEECH vulnerabilities on commodity processors. This paper introduces SPEECHMINER, a software framework for exploring and measuring SPEECH vulnerabilities in an automated manner. SPEECHMINER empirically establishes the link between a novel two-phase fault handling model and the exploitability and speculation windows of SPEECH vulnerabilities. It enables testing of a comprehensive list of exception-triggering instructions under the same software framework, which leverages covert-channel techniques and differential tests to gain visibility into the micro-architectural state changes. We evaluated SPEECHMINER on 9 different processor types, examined 21 potential vulnerability variants, confirmed various known attacks, and identified several new variants.
@article{xiao_ndss2020speechminer, title = {SPEECHMINER: A Framework for Investigating and Measuring Speculative Execution Vulnerabilities}, author = {Xiao, Yuan and Zhang, Yinqian and Teodorescu, Radu}, journal = {Network and Distributed System Security Symposium (NDSS)}, year = {2020}, }
2019
- Adaptive parallel execution of deep neural networks on heterogeneous edge devices. Li Zhou, Mohammad Hossein Samavatian, Anys Bacha, Saikat Majumdar, and Radu Teodorescu. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing (SEC), Arlington, Virginia, 2019.
New applications such as smart homes, smart cities, and autonomous vehicles are driving an increased interest in deploying machine learning on edge devices. Unfortunately, deploying deep neural networks (DNNs) on resource-constrained devices presents significant challenges. These workloads are computationally intensive and often require cloud-like resources. Prior solutions attempted to address these challenges by either introducing more design efforts or by relying on cloud resources for assistance. In this paper, we propose a runtime adaptive convolutional neural network (CNN) acceleration framework that is optimized for heterogeneous Internet of Things (IoT) environments. The framework leverages spatial partitioning techniques through fusion of the convolution layers and dynamically selects the optimal degree of parallelism according to the availability of computational resources, as well as network conditions. Our evaluation shows that our framework outperforms state-of-the-art approaches by improving the inference speed and reducing communication costs while running on wirelessly-connected Raspberry Pi 3 devices. Experimental evaluation shows speedups of 1.9x–3.7x using 8 devices for three popular CNN models.
@inproceedings{zhou_sec2019, author = {Zhou, Li and Samavatian, Mohammad Hossein and Bacha, Anys and Majumdar, Saikat and Teodorescu, Radu}, title = {Adaptive parallel execution of deep neural networks on heterogeneous edge devices}, year = {2019}, isbn = {9781450367332}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3318216.3363312}, doi = {10.1145/3318216.3363312}, booktitle = {Proceedings of the 4th ACM/IEEE Symposium on Edge Computing}, pages = {195–208}, numpages = {14}, keywords = {deep learning, edge devices, inference, parallel execution}, location = {Arlington, Virginia}, }
- CALIsolating Speculative Data to Prevent Transient Execution AttacksKristin Barber, Li Zhou, Anys Bacha, Yinqian Zhang, and Radu TeodorescuIEEE Computer Architecture Letters, 2019
Paper selected as one of the best papers of Computer Architecture Letters for 2019, with an invited presentation in a special session at HPCA 2020.
Hardware security has recently re-surfaced as a first-order concern to the confidentiality protections of computing systems. Meltdown and Spectre introduced a new class of exploits that leverage transient state as an attack surface and have revealed fundamental security vulnerabilities of speculative execution in high-performance processors. These attacks derive benefit from the fact that programs may speculatively execute instructions outside their legal control flows. This insight is then utilized for gaining access to restricted data and exfiltrating it by means of a covert channel. This study presents a microarchitectural mitigation technique for shielding transient state from covert channels during speculative execution. Unlike prior work that has focused on closing individual covert channels used to leak sensitive information, this approach prevents the use of speculative data by downstream instructions until doing so is determined to be safe. This prevents transient execution attacks at a cost of 18 percent average performance degradation.
@article{barber_cal2019, author = {Barber, Kristin and Zhou, Li and Bacha, Anys and Zhang, Yinqian and Teodorescu, Radu}, journal = {IEEE Computer Architecture Letters}, title = {Isolating Speculative Data to Prevent Transient Execution Attacks}, year = {2019}, altpages = {1-1}, keywords = {Registers;Transient analysis;Pipelines;Delays;Security;Law;hardware security;transient execution attacks;covert timing channels}, doi = {10.1109/LCA.2019.2916328}, issn = {1556-6056}, }
- PACTSpecShield: Shielding Speculative Data from Microarchitectural Covert ChannelsKristin Barber, Anys Bacha, Li Zhou, Yinqian Zhang, and Radu TeodorescuIn 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019
Selected as one of the IEEE MICRO Top Picks Honorable Mentions for 2019.
Hardware security has recently re-surfaced as a first-order concern to the confidentiality protections of computing systems. Meltdown and Spectre introduced a new class of microarchitectural exploits which leverage transient state as an attack vector, revealing fundamental security vulnerabilities of speculative execution in high-performance processors. These attacks profit from the fact that, during speculative execution, programs may execute instructions outside their legal control flows. This is used to gain access to restricted data, which is then exfiltrated through a covert channel. This paper proposes SpecShield, a family of microarchitectural mitigation techniques for shielding speculative data from covert channels used in transient execution attacks. Unlike prior work that has focused on closing individual covert channels used to leak sensitive information, SpecShield prevents the use of speculative data by downstream instructions until doing so is determined to be safe, thus isolating it from any covert channel. The most secure version of SpecShield eliminates transient execution attacks at a cost of 21% average performance degradation. A more aggressive version of SpecShield, which prevents the propagation of speculative data to known or probable covert channels, provides only slightly relaxed security guarantees with an average of 10% performance impact.
@inproceedings{barber_pact2019specshield, title = {SpecShield: Shielding Speculative Data from Microarchitectural Covert Channels}, author = {Barber, Kristin and Bacha, Anys and Zhou, Li and Zhang, Yinqian and Teodorescu, Radu}, booktitle = {2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)}, pages = {151--164}, year = {2019}, }
- SIGSPATIALAccident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and InsightsSobhan Moosavi, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv RamnathIn Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Chicago, IL, USA, 2019
Reducing traffic accidents is an important public safety challenge; therefore, accident analysis and prediction has been a topic of much research over the past few decades. Existing studies suffer from important shortcomings: they use small-scale datasets with limited coverage, depend on extensive sets of data, and are not applicable for real-time purposes. To address these challenges, we propose a new solution for real-time traffic accident prediction using easy-to-obtain, but sparse data. Our solution relies on a deep-neural-network model (which we have named DAP, for Deep Accident Prediction); which utilizes a variety of data attributes such as traffic events, weather data, points-of-interest, and time. DAP incorporates multiple components including a recurrent (for time-sensitive data), a fully connected (for time-insensitive data), and a trainable embedding component (to capture spatial heterogeneity). To fill the data gap, we have - through a comprehensive process of data collection, integration, and augmentation - created a large-scale publicly available database of accident information named US-Accidents. By employing the US-Accidents dataset and through an extensive set of experiments across several large cities, we have evaluated our proposal against several baselines. Our analysis and results show significant improvements in predicting rare accident events. Further, we have shown the impact of traffic information, time, and points-of-interest data for real-time accident prediction.
@inproceedings{moosavi_sigspatial2019, author = {Moosavi, Sobhan and Samavatian, Mohammad Hossein and Parthasarathy, Srinivasan and Teodorescu, Radu and Ramnath, Rajiv}, title = {Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights}, year = {2019}, isbn = {9781450369091}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3347146.3359078}, doi = {10.1145/3347146.3359078}, booktitle = {Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems}, pages = {33–42}, numpages = {10}, keywords = {Accident Prediction, Heterogeneous Data, US-Accidents}, location = {Chicago, IL, USA}, series = {SIGSPATIAL '19}, }
- HotEdgeDistributing Deep Neural Networks with Containerized Partitions at the EdgeLi Zhou, Hao Wen, Radu Teodorescu, and David H.C. DuIn 2nd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 19), Jul 2019
Deploying machine learning on edge devices is becoming increasingly important, driven by new applications such as smart homes, smart cities, and autonomous vehicles. Unfortunately, it is challenging to deploy deep neural networks (DNNs) on resource-constrained devices. These workloads are computationally intensive and often require cloud-like resources. Prior solutions attempted to address these challenges by either sacrificing accuracy or by relying on cloud resources for assistance. In this paper, we propose a containerized partition-based runtime adaptive convolutional neural network (CNN) acceleration framework for Internet of Things (IoT) environments. The framework leverages spatial partitioning techniques through convolution layer fusion to dynamically select the optimal partition according to the availability of computational resources and network conditions. By containerizing each partition, we simplify the model update and deployment with Docker and Kubernetes to efficiently handle runtime resource management and scheduling of containers.
@inproceedings{zhou_hotedge2019, author = {Zhou, Li and Wen, Hao and Teodorescu, Radu and Du, David H.C.}, title = {Distributing Deep Neural Networks with Containerized Partitions at the Edge}, booktitle = {2nd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 19)}, year = {2019}, address = {Renton, WA}, url = {https://www.usenix.org/conference/hotedge19/presentation/zhou}, publisher = {USENIX Association}, month = jul, }
2018
- ICCDNVCool: When Non-Volatile Caches Meet Cold Boot AttacksXiang Pan, Anys Bacha, Spencer Rudolph, Li Zhou, Yinqian Zhang, and Radu TeodorescuIn 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct 2018
Non-volatile memories (NVMs) are expected to replace traditional DRAM and SRAM for both off-chip and on-chip storage. It is therefore crucial to understand their security vulnerabilities before they are deployed widely. This paper shows that NVM caches are vulnerable to so-called "cold boot" attacks, which involve physical access to the processor’s cache. SRAM caches have generally been assumed invulnerable to cold boot attacks, because SRAM data is only persistent for a few milliseconds even at cold temperatures. Our study explores cold boot attacks on NVM caches and defenses against them. In particular, this paper demonstrates that hard disk encryption keys can be extracted from the NVM cache in multiple attack scenarios. We demonstrate a reproducible attack with very high probability of success. This paper also proposes an effective software-based countermeasure that can completely eliminate the vulnerability of NVM caches to cold boot attacks with a reasonable performance overhead.
@inproceedings{pan_iccd2018nvcool, author = {Pan, Xiang and Bacha, Anys and Rudolph, Spencer and Zhou, Li and Zhang, Yinqian and Teodorescu, Radu}, booktitle = {2018 IEEE 36th International Conference on Computer Design (ICCD)}, title = {NVCool: When Non-Volatile Caches Meet Cold Boot Attacks}, year = {2018}, volume = {}, number = {}, pages = {439-448}, keywords = {Nonvolatile memory;Random access memory;Encryption;Schedules;Registers;cold boot attack;non volatile memory;caches;security}, doi = {10.1109/ICCD.2018.00072}, }
- ICPPC-Graph: A Highly Efficient Concurrent Graph Reachability Query FrameworkLi Zhou, Ren Chen, Yinglong Xia, and Radu TeodorescuIn Proceedings of the 47th International Conference on Parallel Processing, Eugene, OR, USA, Aug 2018
Many big data analytics applications explore a set of related entities, which are naturally modeled as a graph. However, graph processing is notorious for its performance challenges due to random data access patterns, especially for large data volumes. Solving these challenges is critical to the performance of industry-scale applications. In contrast to most prior works, which focus on accelerating a single graph processing task, in industrial practice we consider multiple graph processing tasks running concurrently, such as a group of queries issued simultaneously to the same graph. In this paper, we present an edge-set based graph traversal framework called C-Graph (i.e. Concurrent Graph), running on a distributed infrastructure, that achieves both high concurrency and efficiency for k-hop reachability queries. The proposed framework maintains global vertex states to facilitate graph traversals, and supports both synchronous and asynchronous communication. In this study, we decompose a set of graph processing tasks into local traversals and analyze their performance on C-Graph. More specifically, we optimize the organization of the physical edge-set and explore the shared subgraphs. We experimentally show that our proposed framework outperforms several baseline methods.
@inproceedings{zhou_icpp2018, author = {Zhou, Li and Chen, Ren and Xia, Yinglong and Teodorescu, Radu}, title = {C-Graph: A Highly Efficient Concurrent Graph Reachability Query Framework}, year = {2018}, isbn = {9781450365109}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3225058.3225136}, doi = {10.1145/3225058.3225136}, booktitle = {Proceedings of the 47th International Conference on Parallel Processing}, articleno = {79}, numpages = {10}, keywords = {K-Hop Reachability, Graph Processing, Distributed System, Concurrent Queries}, location = {Eugene, OR, USA}, series = {ICPP '18}, }
2017
- IPDPSRespin: Rethinking Near-Threshold Multiprocessor Design with Non-volatile MemoryXiang Pan, Anys Bacha, and Radu TeodorescuIn 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017
Near-threshold computing is emerging as a promising energy-efficient alternative for power-constrained environments. Unfortunately, aggressive reduction in supply voltage to the near-threshold range, albeit effective, faces a host of challenges. This includes higher relative leakage power and high error rates, particularly in dense SRAM structures such as on-chip caches. This paper presents an architecture that rethinks the cache hierarchy in near-threshold multiprocessors. Our design uses STT-RAM to implement all on-chip caches. STT-RAM has several advantages over SRAM at low voltages including low leakage, high density, and reliability. The design consolidates the private caches of near-threshold cores into shared L1 instruction/data caches organized in clusters. We find that our consolidated cache design can service more than 95% of incoming requests within a single cycle. We demonstrate that eliminating the coherence traffic associated with private caches results in a performance boost of 11%. In addition, we propose a hardware-based core management system that dynamically consolidates virtual cores into variable numbers of physical cores to increase resource efficiency. We demonstrate that this approach can save up to 33% in energy.
@inproceedings{pan_ipdps2017respin, author = {Pan, Xiang and Bacha, Anys and Teodorescu, Radu}, booktitle = {2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, title = {Respin: Rethinking Near-Threshold Multiprocessor Design with Non-volatile Memory}, year = {2017}, volume = {}, number = {}, pages = {265-275}, keywords = {Clocks;Random access memory;Power demand;Reliability;Registers;System-on-chip;Low voltage}, doi = {10.1109/IPDPS.2017.109}, }
2016
- MICROSnatch: Opportunistically reassigning power allocation between processor and memory in 3D stacksDimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa-Podgurski, Ulya R. Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep TorrellasIn 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2016
The pin count largely determines the cost of a chip package, which is often comparable to the cost of a die. In 3D processor-memory designs, power and ground (P/G) pins can account for the majority of the pins. This is because packages include separate pins for the disjoint processor and memory power delivery networks (PDNs). Supporting separate PDNs and P/G pins for processor and memory is inefficient, as each set has to be provisioned for the worst-case power delivery requirements. In this paper, we propose to reduce the number of P/G pins of both processor and memory in a 3D design, and dynamically and opportunistically divert some power between the two PDNs on demand. To perform the power transfer, we use a small bidirectional on-chip voltage regulator that connects the two PDNs. Our concept, called Snatch, is effective. It allows the computer to execute code sections with high processor or memory power requirements without having to throttle performance. We evaluate Snatch with simulations of an 8-core multicore stacked with two memory dies. In a set of compute-intensive codes, the processor snatches memory power for 30% of the time on average, speeding-up the codes by up to 23% over advanced turbo-boosting; in memory-intensive codes, the memory snatches processor power. Alternatively, Snatch can reduce the package cost by about 30%.
@inproceedings{skarlatos_micro2016snatch, author = {Skarlatos, Dimitrios and Thomas, Renji and Agrawal, Aditya and Qin, Shibin and Pilawa-Podgurski, Robert and Karpuzcu, Ulya R. and Teodorescu, Radu and Kim, Nam Sung and Torrellas, Josep}, booktitle = {2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, title = {Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks}, year = {2016}, volume = {}, number = {}, pages = {1-12}, keywords = {Pins;System-on-chip;Three-dimensional displays;Voltage control;Resource management;Multicore processing}, doi = {10.1109/MICRO.2016.7783757}, }
- ISPASSEmerGPU: Understanding and mitigating resonance-induced voltage noise in GPU architecturesRenji Thomas, Naser Sedaghati, and Radu TeodorescuIn 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2016
This paper characterizes voltage noise in GPU architectures running general purpose workloads. In particular, it focuses on resonance-induced voltage noise, which is caused by workload-induced fluctuations in power demand that occur at the resonance frequency of the chip’s power delivery network. A distributed power delivery model at functional unit granularity was developed and used to simulate supply voltage behavior in a GPU system. We observe that resonance noise can lead to very large voltage droops and protecting against these droops by using voltage guardbands is costly and inefficient. We propose EmerGPU, a solution that detects and mitigates resonance noise in GPUs. EmerGPU monitors workload activity levels and detects oscillations in power demand that approach resonance frequencies. When such conditions are detected, EmerGPU deploys a mitigation mechanism implemented in the warp scheduler that disrupts the resonance activity pattern. EmerGPU has no impact on performance and a small power cost. Reducing voltage noise improves system reliability and allows for smaller voltage margins to be used, reducing overall energy consumption by an average of 21%.
@inproceedings{thomas_ispass2016, author = {Thomas, Renji and Sedaghati, Naser and Teodorescu, Radu}, booktitle = {2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, title = {EmerGPU: Understanding and mitigating resonance-induced voltage noise in GPU architectures}, year = {2016}, volume = {}, number = {}, pages = {79-89}, keywords = {Graphics processing units;Resonant frequency;Power demand;Computer architecture;Impedance;Integrated circuit modeling;Instruction sets}, doi = {10.1109/ISPASS.2016.7482076}, }
- HPCACore tunneling: Variation-aware voltage noise mitigation in GPUsRenji Thomas, Kristin Barber, Naser Sedaghati, Li Zhou, and Radu TeodorescuIn 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Mar 2016
Nominated by the Program Committee for a Best Paper Award at HPCA 2016.
Voltage noise and manufacturing process variation represent significant reliability challenges for modern microprocessors. Voltage noise is caused by rapid changes in processor activity that can lead to timing violations and errors. Process variation is caused by manufacturing challenges in low-nanometer technologies and can lead to significant heterogeneity in performance and reliability across the chip. To ensure correct execution under worst-case conditions, chip designers generally add operating margins that are often unnecessarily conservative for most use cases, which results in wasted energy. This paper investigates the combined effects of process variation and voltage noise on modern GPU architectures. A distributed power delivery and process variation model at functional unit granularity was developed and used to simulate supply voltage behavior in a multicore GPU system. We observed that, just like in CPUs, large changes in power demand can lead to significant voltage droops. We also note that process variation makes some cores much more vulnerable to noise than others in the same GPU. Therefore, protecting the chip against large voltage droops by using fixed and uniform voltage guardbands is costly and inefficient. This paper presents core tunneling, a variation-aware solution for dynamically reducing voltage margins. The system relies on hardware critical path monitors to detect voltage noise conditions and quickly reacts by clock-gating vulnerable cores to prevent timing violations. This allows a substantial reduction in voltage margins. Since clock gating is enabled infrequently and only on the most vulnerable cores, the performance impact of core tunneling is very low. On average, core tunneling reduces energy consumption by 15%.
@inproceedings{thomas_hpca2016, author = {Thomas, Renji and Barber, Kristin and Sedaghati, Naser and Zhou, Li and Teodorescu, Radu}, booktitle = {2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)}, title = {Core tunneling: Variation-aware voltage noise mitigation in GPUs}, year = {2016}, volume = {}, number = {}, pages = {151-162}, keywords = {Graphics processing units;Tunneling;Histograms;Power demand;Monitoring;Delays;Kernel}, doi = {10.1109/HPCA.2016.7446061}, }
- USENIX SecurityOne Bit Flips, One Cloud Flops: Cross-VM Row Hammer Attacks and Privilege EscalationYuan Xiao, Xiaokuan Zhang, Yinqian Zhang, and Radu TeodorescuIn 25th USENIX Security Symposium (USENIX Security 16), Aug 2016
Row hammer attacks exploit electrical interactions between neighboring memory cells in high-density dynamic random-access memory (DRAM) to induce memory errors. By rapidly and repeatedly accessing DRAMs with specific patterns, an adversary with limited privilege on the target machine may trigger bit flips in memory regions that he has no permission to access directly. In this paper, we explore row hammer attacks in cross-VM settings, in which a malicious VM exploits bit flips induced by row hammer attacks to crack memory isolation enforced by virtualization. To do so with high fidelity, we develop novel techniques to determine the physical address mapping in DRAM modules at runtime (to improve the effectiveness of double-sided row hammer attacks), methods to exhaustively hammer a large fraction of physical memory from a guest VM (to collect exploitable vulnerable bits), and innovative approaches to break Xen paravirtualized memory isolation (to access arbitrary physical memory of the shared machine). Our study also suggests that the demonstrated row hammer attacks are applicable in modern public clouds where Xen paravirtualization technology is adopted. This shows that the presented cross-VM row hammer attacks are of practical importance.
@inproceedings{xiao_usenix2016, author = {Xiao, Yuan and Zhang, Xiaokuan and Zhang, Yinqian and Teodorescu, Radu}, title = {One Bit Flips, One Cloud Flops: {Cross-VM} Row Hammer Attacks and Privilege Escalation}, booktitle = {25th USENIX Security Symposium (USENIX Security 16)}, year = {2016}, isbn = {978-1-931971-32-4}, address = {Austin, TX}, pages = {19--35}, url = {https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/xiao}, publisher = {USENIX Association}, month = aug, }
2015
- MICROAuthenticache: Harnessing cache ECC for system authenticationAnys Bacha, and Radu TeodorescuIn Proceedings of the 48th International Symposium on Microarchitecture, Waikiki, Hawaii, Dec 2015
Selected as one of the twelve IEEE MICRO Top Picks Honorable Mentions for 2015.
Hardware-assisted security is emerging as a promising avenue for protecting computer systems. Hardware based solutions, such as Physical Unclonable Functions (PUF), enable system authentication by relying on the physical attributes of the silicon to serve as fingerprints. A variety of PUF designs have been proposed by researchers, with some gaining commercial success. Virtually all of these systems require dedicated PUF hardware to be built into the processor or System-on-Chip (SoC), increasing the cost of deployment in the field. This paper presents Authenticache, a novel, low-cost PUF design that does not require dedicated hardware support. Instead, it leverages on-chip error correction logic already built into many processor caches. As a result, Authenticache can be deployed and used by many off-the-shelf processors with minimal costs. We prototype, evaluate, and test the design on a real system, in addition to conducting extensive simulations. We find Authenticache to have high identifiability, as well as excellent resilience to measurement and environmental noise. Authenticache can withstand up to 142% of noise while maintaining a misidentification rate that is below 1 ppm.
@inproceedings{bacha_micro2015authenticache, author = {Bacha, Anys and Teodorescu, Radu}, title = {Authenticache: Harnessing cache ECC for system authentication}, year = {2015}, isbn = {9781450340342}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2830772.2830814}, doi = {10.1145/2830772.2830814}, booktitle = {Proceedings of the 48th International Symposium on Microarchitecture}, pages = {128–140}, numpages = {13}, location = {Waikiki, Hawaii}, series = {MICRO-48}, }
- TACOOn Using the Roofline Model with Lower Bounds on Data MovementVenmugil Elango, Naser Sedaghati, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, Radu Teodorescu, and P. SadayappanACM Transactions on Architecture and Code Optimization (TACO), Jan 2015
The roofline model is a popular approach for “bound and bottleneck” performance analysis. It focuses on the limits to the performance of processors because of limited bandwidth to off-chip memory. It models upper bounds on performance as a function of operational intensity, the ratio of computational operations per byte of data moved from/to memory. While operational intensity can be directly measured for a specific implementation of an algorithm on a particular target platform, it is of interest to obtain broader insights on bottlenecks, where various semantically equivalent implementations of an algorithm are considered, along with analysis for variations in architectural parameters. This is currently very cumbersome and requires performance modeling and analysis of many variants. In this article, we address this problem by using the roofline model in conjunction with upper bounds on the operational intensity of computations as a function of cache capacity, derived from lower bounds on data movement. This enables bottleneck analysis that holds across all dependence-preserving semantically equivalent implementations of an algorithm. We demonstrate the utility of the approach in assessing fundamental limits to performance and energy efficiency for several benchmark algorithms across a design space of architectural variations.
@article{elango_taco2015, author = {Elango, Venmugil and Sedaghati, Naser and Rastello, Fabrice and Pouchet, Louis-No\"{e}l and Ramanujam, J. and Teodorescu, Radu and Sadayappan, P.}, title = {On Using the Roofline Model with Lower Bounds on Data Movement}, year = {2015}, issue_date = {January 2015}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {11}, number = {4}, issn = {1544-3566}, url = {https://doi.org/10.1145/2693656}, doi = {10.1145/2693656}, journal = {ACM Transactions on Architecture and Code Optimization (TACO)}, month = jan, articleno = {67}, numpages = {23}, keywords = {architecture design space exploration, algorithm-architecture codesign, Operational intensity upper bounds, I/O lower bounds}, }
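The classic roofline bound the abstract refers to can be sketched in a few lines; this is a minimal illustration of the standard formula (attainable performance = min(peak compute, operational intensity × memory bandwidth)), not code from the article, and the peak-throughput and bandwidth figures below are hypothetical machine parameters chosen for illustration.

```python
def roofline_bound(intensity_flops_per_byte: float,
                   peak_gflops: float,
                   bandwidth_gb_per_s: float) -> float:
    """Upper bound on attainable performance (GFLOP/s) at a given
    operational intensity (FLOP per byte moved from/to memory)."""
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gb_per_s)

# Illustrative machine: 500 GFLOP/s peak compute, 50 GB/s memory bandwidth.
# The "ridge point" where the memory roof meets the compute roof is
# 500 / 50 = 10 FLOP/byte; kernels below that intensity are memory-bound.
if __name__ == "__main__":
    for oi in (0.25, 1.0, 10.0, 64.0):
        bound = roofline_bound(oi, 500.0, 50.0)
        print(f"OI = {oi:6.2f} FLOP/byte -> bound {bound:6.1f} GFLOP/s")
```

The article's contribution is to replace the measured operational intensity in this formula with an upper bound derived from data-movement lower bounds, so the resulting roof holds for all semantically equivalent implementations of an algorithm.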
2014
- MICROUsing ECC Feedback to Guide Voltage Speculation in Low-Voltage ProcessorsAnys Bacha, and Radu TeodorescuIn 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Cambridge, UK, Dec 2014
Low-voltage computing is emerging as a promising energy-efficient solution to power-constrained environments. Unfortunately, low-voltage operation presents significant reliability challenges, including increased sensitivity to static and dynamic variability. To prevent errors, safety guard bands can be added to the supply voltage. While these guard bands are feasible at higher supply voltages, they are prohibitively expensive at low voltages, to the point of negating most of the energy savings. Voltage speculation techniques have been proposed to dynamically reduce voltage margins. Most require additional hardware to be added to the chip to correct or prevent timing errors caused by excessively aggressive speculation. This paper presents a mechanism for safely guiding voltage speculation using direct feedback from ECC-protected cache lines. We conduct extensive testing of an Intel Itanium processor running at low voltages. We find that as voltage margins are reduced, certain ECC-protected cache lines consistently exhibit correctable errors. We propose a hardware mechanism for continuously probing these cache lines to fine tune supply voltage at core granularity within a chip. Moreover, we demonstrate that this mechanism is sufficiently sensitive to detect and adapt to voltage noise caused by fluctuations in chip activity. We evaluate a proof-of-concept implementation of this mechanism in an Itanium-based server. We show that this solution lowers supply voltage by 18% on average, reducing power consumption by an average of 33% while running a mix of benchmark applications.
@inproceedings{bacha_micro2014, author = {Bacha, Anys and Teodorescu, Radu}, booktitle = {2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, title = {Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors}, year = {2014}, volume = {}, number = {}, pages = {306-318}, keywords = {Monitoring;Error correction codes;Hardware;Program processors;Error analysis;Low voltage;Timing}, doi = {10.1109/MICRO.2014.54}, location = {Cambridge, UK}, }
- ICCDNVSleep: Using non-volatile memory to enable fast sleep/wakeup of idle coresXiang Pan, and Radu TeodorescuIn 2014 IEEE 32nd International Conference on Computer Design (ICCD), Oct 2014
Spin-transfer torque random access memory (STT-RAM) is an emerging memory technology with several attractive properties including non-volatility, high density, low leakage, and high endurance. These characteristics make it a potential candidate for replacing SRAM structures on processor chips. This paper presents NVSleep, a low-power microprocessor framework that leverages STT-RAM to implement fast checkpointing that enables near-instantaneous shutdown of cores without loss of the execution state. NVSleep stores almost all processor state in STT-RAM structures that do not lose content when power-gated. Memory structures that require low-latency access are implemented in SRAM and backed-up by “shadow” STT-RAM structures that are used to implement fast checkpointing. This enables rapid shutdown of cores and low-overhead resumption of execution, which allows cores to be turned off frequently and for short periods of time to take advantage of idle execution phases and save power. We present two implementations of NVSleep: NVSleepMiss which turns cores off when last level cache misses cause pipeline stalls and NVSleepBarrier which turns cores off when blocked on barriers. Evaluation of a simulated 64-core system shows average energy savings of 21% for NVSleepMiss for SPEC2000 benchmarks and 34% for NVSleepBarrier in high barrier count multi-threaded workloads from PARSEC and SPLASH2 benchmarks.
@inproceedings{pan_iccd2014nvsleep, author = {Pan, Xiang and Teodorescu, Radu}, booktitle = {2014 IEEE 32nd International Conference on Computer Design (ICCD)}, title = {NVSleep: Using non-volatile memory to enable fast sleep/wakeup of idle cores}, year = {2014}, volume = {}, number = {}, pages = {400-407}, keywords = {Random access memory;Benchmark testing;Checkpointing;Pipelines;Registers;Message systems;Nonvolatile memory}, doi = {10.1109/ICCD.2014.6974712}, }
- PACTUsing STT-RAM to enable energy-efficient near-threshold chip multiprocessorsXiang Pan, and Radu TeodorescuIn Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT), Edmonton, AB, Canada, Jan 2014
Near-threshold computing is gaining traction as an energy-efficient solution for power-constrained systems. This paper proposes a novel near-threshold chip multiprocessor design that uses non-volatile spin-transfer torque random access memory (STT-RAM) technology to implement all on-chip caches. This technology has several advantages over SRAM that are particularly useful in near-threshold designs. Primarily, STT-RAM has very low leakage, saving a substantial fraction of the power consumed by near-threshold chips. In addition, the STT-RAM components run at a higher supply voltage to speed up write operations. This has the effect of making cache reads very fast to the point where L1 caches can be shared by several cores, improving performance. Overall, the proposed design saves 11-33% energy compared to an SRAM-based near-threshold system.
@inproceedings{pan_pact2014, author = {Pan, Xiang and Teodorescu, Radu}, title = {Using STT-RAM to enable energy-efficient near-threshold chip multiprocessors}, year = {2014}, isbn = {9781450328098}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2628071.2628132}, doi = {10.1145/2628071.2628132}, booktitle = {Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT)}, pages = {485–486}, numpages = {2}, keywords = {near-threshold computing, non-volatile memory, stt-ram}, location = {Edmonton, AB, Canada}, series = {PACT '14}, }
2013
- ISCADynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processorsAnys Bacha, and Radu TeodorescuIn Proceedings of the 40th Annual International Symposium on Computer Architecture, Tel-Aviv, Israel, Jan 2013
Lowering supply voltage is one of the most effective approaches for improving the energy efficiency of microprocessors. Unfortunately, technology limitations, such as process variability and circuit aging, are forcing microprocessor designers to add larger voltage guardbands to their chips. This makes supply voltage increasingly difficult to scale with technology. This paper presents a new mechanism for dynamically reducing voltage margins while keeping the chip's operating frequency constant. Unlike previous approaches that rely on special hardware to detect and recover from timing violations caused by low-voltage execution, our solution is firmware-based and does not require additional hardware. Instead, it relies on error correction mechanisms already built into modern processors. The system dynamically reduces voltage margins and uses correctable error reports raised by the hardware to identify the lowest safe operating voltage. The solution adapts to core-to-core variability by tailoring supply voltage to each core’s safe operating level. In addition, it exploits variability in workload vulnerability to low-voltage execution. The system was prototyped on an HP Integrity Server that uses Intel’s Itanium 9560 processors. Evaluation using SPECjbb2005 and SPEC CPU2000 workloads shows core power savings ranging from 18% to 23%, with minimal performance impact.
@inproceedings{bacha_isca2013, author = {Bacha, Anys and Teodorescu, Radu}, title = {Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors}, year = {2013}, isbn = {9781450320795}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2485922.2485948}, doi = {10.1145/2485922.2485948}, booktitle = {Proceedings of the 40th Annual International Symposium on Computer Architecture}, pages = {297–307}, numpages = {11}, location = {Tel-Aviv, Israel}, series = {ISCA '13}, }
- CCRuntime failure rate targeting for energy-efficient reliability in chip microprocessorsTimothy N. Miller, Nagarjuna Surapaneni, and Radu TeodorescuConcurrency and Computation: Practice and Experience, Jan 2013
Technology scaling is having an increasingly detrimental effect on microprocessor reliability, with increased variability and higher susceptibility to errors. At the same time, as integration of chip multiprocessors increases, power consumption is becoming a significant bottleneck. To ensure continued performance growth, microprocessors require the development of powerful and energy-efficient solutions to reliability challenges. This paper presents a reliable multicore architecture that provides targeted error protection by adapting to the characteristics of individual cores and workloads, with the goal of providing reliability with minimum energy. The user can specify an acceptable reliability target for each chip, core, or application. The system then adjusts a range of parameters, including replication and supply voltage, to meet that reliability goal. In this multicore architecture, each core consists of a pair of pipelines that can run independently (running separate threads) or in concert (running the same thread and verifying results). Redundancy is enabled selectively, at functional unit granularity. The architecture also employs timing speculation to mitigate variation-induced timing errors and to reduce the power overhead of error protection. On-line control based on machine learning dynamically adjusts multiple parameters to minimize energy consumption. Evaluation shows that dynamic adaptation of voltage and redundancy can reduce the energy-delay product of a chip multiprocessor by 30-60% compared with static dual modular redundancy.
@article{miller_cc2013, author = {Miller, Timothy N. and Surapaneni, Nagarjuna and Teodorescu, Radu}, title = {Runtime failure rate targeting for energy-efficient reliability in chip microprocessors}, journal = {Concurrency and Computation: Practice and Experience}, year = {2013}, volume = {25}, number = {6}, pages = {790--807}, }
2012
- ISCAVRSync: characterizing and eliminating synchronization-induced voltage emergencies in many-core processorsTimothy N. Miller, Renji Thomas, Xiang Pan, and Radu TeodorescuIn Proceedings of the 39th Annual International Symposium on Computer Architecture, Portland, Oregon, Jan 2012
Power consumption is a primary concern for microprocessor designers. Lowering the supply voltage of processors is one of the most effective techniques for improving their energy efficiency. Unfortunately, low-voltage operation faces multiple challenges going forward. One such challenge is increased sensitivity to voltage fluctuations, which can trigger so-called "voltage emergencies" that can lead to errors. These fluctuations are caused by abrupt changes in power demand, triggered by processor activity variation as a function of workload. This paper examines the effects of voltage fluctuations on future many-core processors. With the increase in the number of cores in a chip, the effects of chip-wide activity fluctuation – such as that caused by global synchronization in multithreaded applications – overshadow the effects of core-level workload variability. Starting from this observation, we developed VRSync, a novel synchronization methodology that uses emergency-aware scheduling policies that reduce the slope of load fluctuations, eliminating emergencies. We show that VRSync is very effective at eliminating emergencies, allowing voltage guardbands to be significantly lowered, which reduces energy consumption by an average of 33%.
@inproceedings{miller_isca2012, author = {Miller, Timothy N. and Thomas, Renji and Pan, Xiang and Teodorescu, Radu}, title = {VRSync: characterizing and eliminating synchronization-induced voltage emergencies in many-core processors}, year = {2012}, isbn = {9781450316422}, publisher = {IEEE Computer Society}, address = {USA}, booktitle = {Proceedings of the 39th Annual International Symposium on Computer Architecture}, pages = {249–260}, numpages = {12}, location = {Portland, Oregon}, series = {ISCA '12}, }
- Tech Rep.Parameter variation at near threshold voltage: The power efficiency versus resilience tradeoffJosep Torrellas, Nam Sung Kim, and Radu TeodorescuJan 2012
The strongest lever that we have to improve the power efficiency of CMOS devices is to reduce their supply voltage (VDD). When VDD is only a bit higher than the threshold voltage (VTH), we attain a fairly optimal operation point, where energy per operation is low and switching delay is not too high. This region, known as Near Threshold Voltage (NTV), is attractive because it can offer a one to two orders of magnitude reduction in power relative to conventional Super Threshold Voltage (STV) operation [6, 10, 32]. While NTV is appealing, it has three major limitations: lower speed, higher share of static power, and higher impact of process variations. The first two are well known and likely to be worked around, especially for highly parallel workloads. The third one, however, is less known and harder to tame. Process variations are the deviation of the values of device parameters (such as a transistor’s VTH) from their nominal specifications. They lead to circuits and chips with lower speed, higher power consumption, and lower resilience. The negative impact of process variations is much higher at NTV: small changes in VTH induce large changes in transistor speed and power consumption due to an intrinsic effect coming from having VDD so close to VTH.
@techreport{torrellas_tech2012, author = {Torrellas, Josep and Kim, Nam Sung and Teodorescu, Radu}, title = {Parameter variation at near threshold voltage: The power efficiency versus resilience tradeoff}, institution = {University of Illinois}, year = {2012}, }
- HPCABooster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chipsTimothy N. Miller, Xiang Pan, Renji Thomas, Naser Sedaghati, and Radu TeodorescuIn IEEE International Symposium on High-Performance Computer Architecture, Jan 2012
Lowering supply voltage is one of the most effective techniques for reducing microprocessor power consumption. Unfortunately, at low voltages, chips are very sensitive to process variation, which can lead to large differences in the maximum frequency achieved by individual cores. This paper presents Booster, a simple, low-overhead framework for dynamically rebalancing performance heterogeneity caused by process variation and application imbalance. The Booster CMP includes two power supply rails set at two very low but different voltages. Each core can be dynamically assigned to either of the two rails using a gating circuit. This allows cores to quickly switch between two different frequencies. An on-chip governor controls the timing of the switching and the time spent on each rail. The governor manages a “boost budget” that dictates how many cores can be sped up (depending on the power constraints) at any given time. We present two implementations of Booster: Booster VAR, which virtually eliminates the effects of core-to-core frequency variation in near-threshold CMPs, and Booster SYNC, which additionally reduces the effects of imbalance in multithreaded applications. Evaluation using PARSEC and SPLASH2 benchmarks running on a simulated 32-core system shows an average performance improvement of 11% for Booster VAR and 23% for Booster SYNC.
@inproceedings{miller_hpca2012, author = {Miller, Timothy N. and Pan, Xiang and Thomas, Renji and Sedaghati, Naser and Teodorescu, Radu}, booktitle = {IEEE International Symposium on High-Performance Computer Architecture}, title = {Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips}, year = {2012}, volume = {}, number = {}, pages = {1-12}, keywords = {Rails;Synchronization;Instruction sets;Switches;Reactive power;Voltage control;Regulators}, doi = {10.1109/HPCA.2012.6168942}, }
2011
- PACTStVEC: A Vector Instruction Extension for High Performance Stencil ComputationNaser Sedaghati, Renji Thomas, Louis-Noël Pouchet, Radu Teodorescu, and P. SadayappanIn 2011 International Conference on Parallel Architectures and Compilation Techniques, Jan 2011
Stencil computations comprise the compute-intensive core of many scientific applications. The data access pattern of stencil computations often requires several adjacent data elements of arrays to be accessed in innermost parallel loops. Although such loops are vectorized by current compilers like GCC and ICC that target short-vector SIMD instruction sets, a number of redundant loads or additional intra-register data shuffle operations are required, reducing the achievable performance. Thus, even when all arrays are cache resident, the peak performance achieved with stencil computations is considerably lower than machine peak. In this paper, we present a hardware-based solution for this problem. We propose an extension to the standard addressing mode of vector floating-point instructions in ISAs such as SSE, AVX, and VMX. We propose an extended mode of paired-register addressing and its hardware implementation, to overcome the performance limitation of current short-vector SIMD ISAs for stencil computations. Further, we present a code generation approach that can be used by a vectorizing compiler for processors with such an instruction set. Using an optimistic as well as a pessimistic emulation of the proposed instruction extension, we demonstrate the effectiveness of the proposed approach on top of SSE and AVX capable processors. We also synthesize parts of the proposed design using a 45nm CMOS library and show minimal impact on processor cycle time.
@inproceedings{sedaghati_pact2011, author = {Sedaghati, Naser and Thomas, Renji and Pouchet, Louis-Noël and Teodorescu, Radu and Sadayappan, P.}, booktitle = {2011 International Conference on Parallel Architectures and Compilation Techniques}, title = {{StVEC: A} Vector Instruction Extension for High Performance Stencil Computation}, year = {2011}, volume = {}, number = {}, pages = {276-287}, keywords = {Vectors;Registers;Hardware;Decoding;Program processors;Arrays;Stencil Computation;High Performance;Vector ISA}, doi = {10.1109/PACT.2011.59}, }