Joseph Devietti
my email address: my last name at cis dot upenn dot edu
(215) 746-4223
Levine Hall 572
3330 Walnut Street
Philadelphia, PA 19104-3409

I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.

I am eagerly looking for new PhD students interested in improving the programmability and performance of multicore architectures. If you are interested in these topics please apply to our PhD program and drop me an email as well.


I'm currently teaching CIS 601: Special Topics in Computer Architecture: GPGPU Programming.


I'm lucky to be working with the following great students:

Former students

  • Ariel Eizenberg (Master's 2016)
  • Brooke Fugate (Master's 2015, co-advised with André DeHon)
  • Liang Luo (Master's 2015, now a PhD student at the University of Washington)
  • Akshitha Sriraman (Master's 2015, now a PhD student at the University of Michigan)

Recent Publications full list

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

  • GPUDrano: Detecting Uncoalesced Accesses in GPU ProgramsGPUDrano: Detecting Uncoalesced Accesses in GPU Programs
    International Conference on Computer-Aided Verification (to appear at CAV '17), July 2017

    Graphics Processing Units (GPUs) have become widespread and popular over the past decade. Fully utilizing the parallel compute and memory resources that GPUs present remains a significant challenge, however. In this paper, we describe GPUDrano: a scalable static analysis that detects uncoalesced global memory accesses in CUDA programs. Uncoalesced global memory accesses arise when a GPU program accesses DRAM in an ill-structured way, increasing latency and energy consumption. We formalize the GPUDrano static analysis and compare it empirically against a dynamic analysis to demonstrate that false positives are rare for most programs. We implement GPUDrano in LLVM and show that it can run on GPU programs of over a thousand lines of code. GPUDrano finds 133 of the 143 uncoalesced static memory accesses in the popular Rodinia GPU benchmark suite, demonstrating the precision of our implementation. Fixing these bugs leads to real performance improvements of up to 25%.

  • BARRACUDA: Binary-level Analysis of Runtime RAces in CUDA programsBARRACUDA: Binary-level Analysis of Runtime RAces in CUDA programs
    Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky and Joseph Devietti
    ACM SIGPLAN Conference on Programming Language Design and Implementation (to appear at PLDI '17), June 2017

    GPU programming models enable and encourage massively parallel programming with over a million threads, requiring extreme parallelism to achieve good performance. Massive parallelism brings significant correctness challenges by increasing the possibility for bugs as the number of thread interleavings balloons. Conventional dynamic safety analyses struggle to run at this scale.

    We present Barracuda, a data race detector for GPU programs written in Nvidia’s CUDA language. Barracuda handles a wider range of parallelism constructs than previous work, including branch operations, low-level atomics and memory fences, which allows Barracuda to detect new classes of races. Barracuda operates at the binary level for increased compatibility with existing code, leveraging a new binary instrumentation framework that is extensible to other dynamic analyses. Barracuda incorporates a number of novel optimizations that are crucial for scaling data race detection to over a million threads.

  • Verifying Dynamic Race DetectionVerifying Dynamic Race Detection
    Certified Programs and Proofs (CPP '17), co-located with POPL 2017, January 2017
  • Remix: Online Detection and Repair of Cache Contention for the JVMRemix: Online Detection and Repair of Cache Contention for the JVM
    ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16), June 2016
    [abstract][paper][slides: pdf pptx]

    As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have had to resort to highly-invasive runtime program modification. We observe that managed language runtimes, with garbage collection and JIT code compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs.

    We present Remix, a modified version of the Oracle HotSpot JVM which can detect cache contention bugs and repair false sharing at runtime. Remix’s detection mechanism leverages recent performance counter improvements on Intel platforms, which allow for precise, unobtrusive monitoring of cache contention at the hardware level. Remix can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand-optimization. Remix also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. Remix incurs no statistically-significant performance overhead on other benchmarks that do not exhibit cache contention, making Remix practical for always-on use.

  • LASER: Light, Accurate Sharing dEtection and RepairLASER: Light, Accurate Sharing dEtection and Repair
    Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Gilles Pokam, Chris Newburn and Joseph Devietti
    IEEE International Symposium on High Performance Computer Architecture (HPCA '16), March 2016
    [abstract][paper][slides: pdf pptx][data]

    Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment.

    This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.

  • Co-Design of Anytime Computation and Robust ControlCo-Design of Anytime Computation and Robust Control
    IEEE Real-Time Systems Symposium (RTSS '15), December 2015
    Control software of autonomous robots has stringent real-time requirements that must be met to achieve the control objectives. One source of variability in the performance of a control system is the execution time and accuracy of the state estimator that provides the controller with state information. This estimator is typically perception-based (e.g., Computer Vision-based) and is computationally expensive. When the computational resources of the hardware platform become overloaded, the estimation delay can compromise control performance and even stability. In this paper, we define a framework for co-designing anytime estimation and control algorithms, in a manner that accounts for implementation issues like delays and inaccuracies. We construct an anytime perception-based estimator from standard off-the-shelf Computer Vision algorithms, and show how to obtain a trade-off curve for its delay vs estimate error behavior. We use this anytime estimator in a controller that can use this trade- off curve at runtime to achieve its control objectives at a reduced energy cost. When the estimation delay is too large for correct operation, we provide an optimal manner in which the controller can use this curve to reduce estimation delay at the cost of higher inaccuracy, all the while guaranteeing basic objectives are met. We illustrate our approach on an autonomous hexrotor and demonstrate its advantage over a system that does not exploit co-design.
  • SlimFast: Reducing Metadata Redundancy in Sound & Complete Dynamic Data Race DetectionSlimFast: Reducing Metadata Redundancy in Sound & Complete Dynamic Data Race Detection
    Yuanfeng Peng and Joseph Devietti
    PLDI Student Research Competition (PLDI SRC '15), held in conjunction with PLDI '15, June 2015
    Dynamic race detectors that are sound and complete often incur too much memory overhead to be practical in many scenarios; to address this issue, we present SlimFast, a sound-and-complete race detector that significantly reduces the memory overhead, with comparable performance to a state-of-the-art race detector. According to experiments on DaCapo and Grande benchmarks, SlimFast brings up to 72% space savings and a 21% speedup over FastTrack.
  • High-Performance Determinism with Total Store Order ConsistencyHigh-Performance Determinism with Total Store Order Consistency
    European Conference on Computer Systems (EuroSys '15), April 2015

    We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synch operations is based on a deterministic clock, and store buffering is implemented using version-controlled memory.

    Recent work on deterministic concurrency has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default".