Computer Science - Architecture Publications (50)


The emerging hybrid DRAM-NVM architecture is challenging the existing memory management mechanisms in operating systems. In this paper, we introduce memos, which can schedule memory resources simultaneously over the entire memory hierarchy, including cache, channels, and main memory comprising both DRAM and NVM. Powered by our newly designed kernel-level monitoring module and page migration engine, memos can dynamically optimize data placement across the memory hierarchy according to online memory access patterns, current resource utilization, and the characteristics of each memory medium. Read More

Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve huge computational complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN implementation. With massively parallel computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Read More

Affiliations: 1, 2, 3, 4: Systems Group, Department of Computer Science, ETH Zurich

The hardware/software boundary in modern heterogeneous multicore computers is increasingly complex, and diverse across different platforms. A single memory access by a core or DMA engine traverses multiple hardware translation and caching steps, and the destination memory cell or register often appears at different physical addresses for different cores. Interrupts pass through a complex topology of interrupt controllers and remappers before delivery to one or more cores, each with specific constraints on their configurations. Read More

In this paper, we offer and discuss three efficient structural solutions for the hardware-oriented implementation of basic operations of the discrete quaternion Fourier transform with reduced implementation complexity. The first solution is a scheme for calculating the sq product, the second a scheme for calculating the qt product, and the third a scheme for calculating the sqt product, where s is a so-called i-quaternion, t is a j-quaternion, and q is a usual quaternion. The direct multiplication of two usual quaternions requires 16 real multiplications (or two-operand multipliers in the case of fully parallel hardware implementation) and 12 real additions (or binary adders). Read More
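
As a point of reference for the operation counts quoted in this abstract, here is a minimal sketch of the direct (Hamilton) product of two general quaternions. The function name `quat_mul` and the `(w, x, y, z)` tuple layout are illustrative assumptions; the reduced-complexity schemes proposed in the paper are not shown here.

```python
def quat_mul(a, b):
    """Direct Hamilton product of two quaternions given as (w, x, y, z) tuples.

    Hypothetical helper for illustration only. Each of the four output
    components uses 4 real multiplications and 3 real additions or
    subtractions, giving 16 multiplications and 12 additions in total.
    """
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,  # real part
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,  # i component
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,  # j component
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,  # k component
    )
```

For example, `quat_mul((0, 1, 0, 0), (0, 0, 1, 0))` multiplies the units i and j and yields k, while swapping the operands yields the negated result, which reflects the non-commutativity of the product.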

An ultra-high throughput low-density parity check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature. The decoder benefits from a serial message-transfer approach between the decoding stages to alleviate the well-known routing congestion problem in parallel LDPC decoders. Furthermore, a finite-alphabet message passing algorithm is employed to replace the variable node update rule of standard min-sum decoders with optimized look-up tables, designed in a way that maximizes the mutual information between decoding messages. Read More

Time variation during program execution can leak sensitive information. Time variations due to program control flow and hardware resource contention have been used to steal encryption keys in cipher implementations such as AES and RSA. A number of approaches to mitigate timing-based side-channel attacks have been proposed including cache partitioning, control-flow obfuscation and injecting timing noise into the outputs of code. Read More

Deep convolutional neural networks (CNNs) have shown good performance in many computer vision tasks. However, the high computational complexity of CNNs involves a huge amount of data movement between the computational processor core and the memory hierarchy, which accounts for the majority of the power consumption. This paper presents Chain-NN, a novel energy-efficient 1D chain architecture for accelerating deep CNNs. Read More

Checkpoint-restart is now a mature technology. It allows a user to save and later restore the state of a running process. The new plugin model for the upcoming version 3. Read More

FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Read More

Massive multiuser (MU) multiple-input multiple-output (MIMO) will be a core technology in fifth-generation (5G) wireless systems as it offers significant improvements in spectral efficiency compared to existing multi-antenna technologies. The presence of hundreds of antenna elements at the base station (BS), however, results in excessively high hardware costs and power consumption, and requires high interconnect throughput between the baseband-processing unit and the radio unit. Massive MU-MIMO that uses low-resolution analog-to-digital and digital-to-analog converters (DACs) has the potential to address all these issues. Read More

A silicon physically unclonable function (PUF) using ring oscillators (ROs) has the advantage of easy application in both an application-specific integrated circuit (ASIC) and a field-programmable gate array (FPGA). Here, we provide an RO-PUF using the initial waveform of the ROs based on 65 nm CMOS technology. Compared with the conventional RO-PUF, the number of ROs is greatly reduced and the time needed to generate an ID is within a couple of system clocks. Read More

To avoid packet loss and deadlock scenarios that arise due to faults or power gating in multicore and many-core systems, the network-on-chip needs to possess resilient communication and load-balancing properties. In this work, we introduce the Fashion router, a self-monitoring and self-reconfiguring design that allows for the on-chip network to dynamically adapt to component failures. First, we introduce a distributed intelligence unit, called Self-Awareness Module (SAM), which allows the router to detect permanent component failures and build a network connectivity map. Read More

We describe the computing tasks involved in autonomous driving and examine existing autonomous driving computing platform implementations. To enable autonomous driving, the computing stack needs to simultaneously provide high performance, low power consumption, and low thermal dissipation, at low cost. We discuss possible approaches to design computing platforms that will meet these needs. Read More

The discrete cosine transform (DCT) is the key step in many image and video coding standards. The 8-point DCT is an important special case, with several widely investigated low-complexity approximations. However, the 16-point DCT has energy compaction advantages. Read More
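
To make the transform concrete, the following is a minimal sketch of the exact orthonormal N-point DCT-II, written as a direct sum. It is not one of the low-complexity approximations the abstract refers to; the function name `dct_ii` and the orthonormal scaling are assumptions chosen for illustration.

```python
import math


def dct_ii(x):
    """Orthonormal N-point DCT-II of a sequence of real samples.

    Illustrative reference implementation (O(N^2) direct sum), not a fast
    or approximate transform.
    """
    n = len(x)
    out = []
    for k in range(n):
        # Scale factors that make the transform matrix orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * acc)
    return out
```

Passing 8 samples gives the 8-point transform and passing 16 samples gives the 16-point transform. For a constant input, all of the signal energy lands in the first coefficient, which is a simple instance of energy compaction.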

In recent years, we have observed a clear trend in the rapid rise of autonomous vehicles, robotics, virtual reality, and augmented reality. The core technology enabling these applications, Simultaneous Localization And Mapping (SLAM), imposes two main challenges: first, these workloads are computationally intensive and often have real-time requirements; second, they run on battery-powered mobile devices with a limited energy budget. In short, the essence of these challenges is that performance must be improved while energy consumption is reduced, two goals that conventional wisdom regards as contradictory. Read More

A decision feedback circuit with integrated offset compensation is presented in this paper. The circuit is built around the sense amplifier comparator. The feedback loop is closed around the first stage of the comparator resulting in minimum loop latency. Read More

Polar codes are a new class of block codes with an explicit construction that provably achieve the capacity of various communications channels, even with the low-complexity successive-cancellation (SC) decoding algorithm. Yet, the more complex successive-cancellation list (SCL) decoding algorithm is gathering more attention lately as it significantly improves the error-correction performance of short- to moderate-length polar codes, especially when they are concatenated with a cyclic redundancy check code. However, as SCL decoding explores several decoding paths, existing hardware implementations tend to be significantly slower than SC-based decoders. Read More

I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms. New architectures are discussed and recent performance results are displayed. I also review recent progress in multilevel solver and integration algorithms. Read More

A neuromorphic vision processor is an electronic implementation of a vision algorithm processor on a semiconductor. To image the world, the vision processor requires a low-power CMOS image sensor array. The image sensor array is typically formed from photodiodes and an analog-to-digital converter (ADC). Read More

In the context of embedded systems design, two important challenges are still under investigation. The first is to improve the real-time data processing, reconfigurability, scalability, and self-adjusting capabilities of hardware components. The second is to reduce power consumption through low-power design techniques such as clock gating, logic gating, and dynamic partial reconfiguration (DPR) capabilities. Read More

Affiliations: 1, 2, 3, 4: Rutgers University

To improve system performance, modern operating systems (OSes) often undertake activities that require modification of virtual-to-physical page translation mappings. For example, the OS may migrate data between physical frames to defragment memory and enable superpages. The OS may migrate pages of data between heterogeneous memory devices. Read More

Portable computing devices, which include tablets, smartphones, and various types of wearable sensors, have developed rapidly in recent years. One of the most critical limitations of these devices is power consumption, as they rely on batteries for their power supply. However, the bottleneck for power-saving schemes in both hardware design and software algorithms is the huge variability in power consumption. Read More

High-performance computing systems are moving toward 2.5D as in High Bandwidth Memory (HBM) and 3D integration of memory and logic as in Hybrid Memory Cube (HMC) to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. Read More

This paper describes the design and implementation of an audio interface for the Patmos processor, which runs on an Altera DE2-115 FPGA board. The board includes an audio codec, the WM8731. The interface described in this work makes it possible to receive audio from and send audio to the WM8731, and to synthesize, store, or manipulate audio signals by writing C programs for Patmos. Read More

This paper implements a real-time 105-channel FPGA-based data acquisition platform for mm-wave imaging systems. A PC platform is also realized for monitoring the imaging results. Mm-wave imaging expands our vision by letting us see things under poor visibility conditions. Read More

This paper describes HoLiSwap, a method to reduce L1 cache wire energy, a significant fraction of total cache energy, by swapping hot lines to the cache way nearest to the processor. We observe that (i) a small fraction (<3%) of cache lines (hot lines) serve over 60% of the L1 cache accesses and (ii) the difference in wire energy between the nearest and farthest cache subarray can be over 6×. Our method exploits this difference in wire energy to dynamically identify hot lines and swap them to the nearest physical way in a set-associative L1 cache. Read More

Timing and power consumption play an important role in the design of embedded systems. Furthermore, both properties are directly related to the safety requirements of many embedded systems. With regard to availability requirements, power considerations are of utmost importance for battery-operated systems. Read More

Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. Read More

L1 caches are critical to the performance of modern computer systems. Their design involves a delicate balance between fast lookups, high hit rates, low access energy, and simplicity of implementation. Unfortunately, constraints imposed by virtual memory make it difficult to satisfy all these attributes today. Read More

SRAM-based FPGAs are increasingly popular in the aerospace industry due to their field programmability and low cost. However, they suffer from cosmic radiation induced Single Event Upsets (SEUs). In safety-critical applications, the dependability of the design is a prime concern since failures may have catastrophic consequences. Read More

The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, become the main reasons for an increased number of competitive cache misses and performance decline. Inevitably, the development of modern processor architectures leads to an increased number of cache misses. In this paper, we make an attempt to implement a technique for decreasing the number of competitive cache misses in the first level of cache memory. Read More

High-energy particles from cosmic rays or packaging materials can generate a glitch or a current transient (single-event transient, or SET) in a logic circuit. This SET can eventually be captured in a register, flipping the register content, which is known as a soft error or single-event upset (SEU). A soft error is typically modeled with a probabilistic single bit-flip model. Read More
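
The probabilistic single bit-flip model mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `maybe_upset`, the per-call upset probability `p`, and the uniform choice of bit position are all hypothetical and are not taken from the paper.

```python
import random


def maybe_upset(value, width, p, rng):
    """Return `value` with one bit flipped with probability `p`.

    value: integer contents of a `width`-bit register.
    p:     assumed probability that an upset occurs on this call.
    rng:   a random.Random instance, passed in so runs are reproducible.
    """
    if rng.random() < p:
        bit = rng.randrange(width)  # assume every bit is equally likely
        return value ^ (1 << bit)   # XOR flips exactly that one bit
    return value
```

A fault-injection loop could call `maybe_upset(reg, 32, 1e-6, random.Random(0))` once per simulated cycle; passing a seeded generator keeps an injection campaign repeatable.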

Flexibility at the hardware level is the main driving force behind adaptive systems whose aim is to realise microarchitecture deconfiguration 'online'. This feature allows the software/hardware stack to tolerate drastic changes of the workload in data centres. With the emergence of FPGA reconfigurability, this technology is becoming a mainstream computing paradigm. Read More

The design of a parallel computing system using several thousand or even up to a million processors calls for processing units that are simple and thus small, so that as many processing units as possible fit on a single die. The design presented here is far from optimised; it is not meant to compete with industry performance devices. Its main purpose is to allow for a prototypical implementation of a dynamic software system as a proof of concept. Read More

Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load. However, performing feature extraction or classification directly on the end-nodes poses security concerns, as valuable data, "distilled" with application knowledge, is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. Read More

In the last decade we have witnessed a rapid growth in data center systems, requiring new and highly complex networking devices. The need to refresh networking infrastructure whenever new protocols or functions are introduced, and the increasing costs that this entails, are of concern to all data center providers. New generations of Systems on Chip (SoC), integrating microprocessors and higher bandwidth interfaces, are an emerging solution to this problem. Read More

This paper starts with a comprehensive survey on RTL ATPG. It then proposes a novel RTL ATPG model based on "Gate Inherent Faults" (GIF). These GIF are extracted from each complex gate (adder, case-statement, etc. Read More

Affiliations: 1: Faculty of Information Technology, Brno University of Technology; 2, 3: IT4Innovations Centre of Excellence, FIT, Brno University of Technology

HADES is a fully automated verification tool for pipeline-based microprocessors that aims at flaws caused by improperly handled data hazards. It focuses on single-pipeline microprocessors designed at the register transfer level (RTL) and deals with read-after-write, write-after-write, and write-after-read hazards. HADES combines several techniques, including data-flow analysis, error pattern matching, SMT solving, and abstract regular model checking. Read More
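
For readers unfamiliar with the three hazard classes named in this abstract, the sketch below classifies the data hazards between two instructions from their register read and write sets. It only illustrates the textbook definitions; it is not the HADES analysis, and the function name and dictionary layout are assumptions.

```python
def classify_hazards(first, second):
    """Return the set of data hazards between two instructions.

    Each instruction is a dict with 'reads' and 'writes' sets of register
    names (a hypothetical representation); `first` precedes `second` in
    program order.
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # second reads a register that first writes
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both instructions write the same register
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # second overwrites a register that first reads
    return hazards
```

For instance, an `add r1, r2, r3` followed by `sub r4, r1, r5` gives `{"RAW"}`, because the second instruction reads `r1` before the first is guaranteed to have written it in a pipelined design.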

We present the design of a low-power 4-bit 1GS/s folding-flash ADC with a folding factor of two. The design of a new unbalanced double-tail dynamic comparator affords an ultra-low power operation and a high dynamic range. Unlike the conventional approaches, this design uses a fully matched input stage, an unbalanced latch stage, and a two-clock operation scheme. Read More

Continuous improvement in silicon process technologies has made possible the integration of hundreds of cores on a single chip. However, power and heat have become dominant constraints in designing these massive multicore chips causing issues with reliability, timing variations and reduced lifetime of the chips. Dynamic Thermal Management (DTM) is a solution to avoid high temperatures on the die. Read More

In this paper, we describe the design and implementation of a high precision real time NAND simulator called Copycat that runs on a commodity multi-core desktop environment. This NAND simulator facilitates the development of embedded flash memory management software such as the flash translation layer (FTL). The simulator also allows a comprehensive fault injection for testing the reliability of the FTL. Read More

Application trends, device technologies and the architecture of systems drive progress in information technologies. However, the former engines of such progress - Moore's Law and Dennard Scaling - are rapidly reaching the point of diminishing returns. The time has come for the computing community to boldly confront a new challenge: how to secure a foundational future for information technology's continued progress. Read More

For decades, advances in electronics were directly related to the scaling of CMOS transistors according to Moore's law. However, both the CMOS scaling and the classical computer architecture are approaching fundamental and practical limits, and new computing architectures based on emerging devices, such as non-volatile memories e.g. Read More

This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing the computations and optimizing the bit-width. Read More

Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. Read More
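
One reason binarized networks map well to hardware is that a dot product over values in {-1, +1} reduces to an XNOR followed by a population count. The sketch below shows this commonly used formulation; it is not taken from the FINN paper, and the bit encoding (+1 as bit 1, -1 as bit 0) and function names are assumptions.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, +1 -> bit 1, -1 -> bit 0."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors stored as bit masks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")     # population count
    # Each agreement contributes +1 and each disagreement contributes -1.
    return 2 * matches - n
```

With `a = [1, -1, 1, 1]` and `b = [1, 1, -1, 1]`, `binary_dot(pack(a), pack(b), 4)` equals the ordinary dot product of the two vectors.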

Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation hardware (i. Read More

This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate: 1) Several recently proposed memory scheduling policies are not a good match for these scale-out workloads. Read More

This paper proposes an architecture for a partial sum generator for constituent-code-based polar code decoders. Such decoders have the advantage of low latency. However, no purpose-built partial sum generator design exists that can yield the desired timing for the decoder. Read More

Electronic circuits and systems used in mission- and safety-critical applications usually employ redundancy in the design to overcome arbitrary faults or failures and guarantee correct operation. In this context, the distributed minority and majority voting based redundancy (DMMR) scheme forms an efficient alternative to the conventional N-modular redundancy (NMR) scheme for implementing mission- and safety-critical circuits and systems, significantly reducing their weight, design cost, and design metrics whilst providing a similar degree of fault tolerance. This article presents the first FPGA-based implementation of example DMMR circuits and compares it with counterpart NMR circuits on the basis of area occupancy and critical path delay viz. Read More
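
As background for the comparison above, here is a minimal sketch of the majority voter used in conventional N-modular redundancy, where an odd number of identical modules compute the same single-bit output. It does not model the DMMR scheme, whose details are not given in the abstract; the function name is an assumption.

```python
def majority_vote(outputs):
    """Majority value of an odd number of single-bit module outputs (0 or 1).

    Illustrative NMR voter: with N = 2m + 1 modules, the voted result is
    correct as long as at most m modules are faulty.
    """
    if len(outputs) % 2 == 0:
        raise ValueError("NMR voting needs an odd number of modules")
    return 1 if sum(outputs) > len(outputs) // 2 else 0
```

With three modules (triple modular redundancy), `majority_vote([1, 0, 1])` still returns the correct value 1 even though one module has failed.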