United States Department of Energy Office of Science

ASCR Machine Learning Workshop

Sponsored by the U.S. Department of Energy,
Office of Advanced Scientific Computing Research
Hilton Washington DC/Rockville Executive Meeting Center
Rockville, MD
January 5-7, 2015

Research Topic Descriptions

  1. Self-aware Runtime and Operating Systems: The purpose of leadership-class HPC systems is to support large-scale experiments, simulations, observation, and analysis. However, emerging extreme-scale HPC systems bring new challenges in resilience, fault tolerance, heterogeneity, and hierarchy. Machine learning approaches could provide runtime and operating systems that are dynamic, adaptive, self-protective, or self-healing on behalf of the applications and services they support. New approaches could leverage hierarchical methods of asynchronous communication that coordinate, reduce, or develop communication requirements; in particular, one may look at current work in software-defined networking to see which aspects or ideas may be applicable to extreme-scale HPC systems. In addition, operating system and runtime resource management today is mostly static. Could we exploit power, I/O, or error and fault management for enhanced performance? Is it possible to develop constructs for scheduling and coordinating workflows? Could we feed system sensor data back to a control module for effective resource allocation? Can we leverage event-driven tasks and performance feedback from application profiling to establish resource allocation guidelines, depending on whether communication patterns are predictable or regular?
  2. Deep Learning for Big Data: Deep learning algorithms are typically used to train networks with extremely large numbers of parameters (e.g., 1 billion). Computer vision is one application domain that exploits deep learning. ASCR's HPC domain could play an important role in this arena. Is it possible to generate knowledge from the data, and to enable smart analytics and smart rendering/usability?
  3. Resiliency and Trust: This topic concerns correct computations and results in the presence of faults, including dynamic adaptive thread configuration, fault isolation, recovery, and self-healing on exascale-and-beyond, extreme-performance leadership-class computing platforms. Machine learning can be used to discover new patterns in networks and data, perform anomaly detection, achieve data fusion from multiple sources, analyze social and behavioral networks to identify anomalous behavior, and perhaps fingerprint HPC programs.
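
As a purely illustrative sketch of the anomaly detection mentioned under topics 1 and 3, the Python fragment below flags outlying readings from a hypothetical node sensor using a modified z-score based on the median absolute deviation. The function name, the 3.5 threshold, and the sample temperatures are assumptions made for this example; they are not part of the workshop material, and a real system would likely use a learned model rather than this fixed rule.

```python
from statistics import median


def flag_anomalies(readings, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds threshold.

    Hypothetical helper: uses the median absolute deviation (MAD), which is
    less sensitive to the outliers themselves than mean and standard deviation.
    """
    med = median(readings)
    mad = median([abs(x - med) for x in readings])
    if mad == 0:
        # All readings (or at least half of them) are identical;
        # this simple rule cannot rank deviations, so flag nothing.
        return []
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # for normally distributed data.
    return [i for i, x in enumerate(readings)
            if 0.6745 * abs(x - med) / mad > threshold]


# Made-up node temperatures in degrees Celsius; one reading is clearly off.
temps = [61.0, 62.5, 60.8, 61.7, 95.2, 62.1, 61.3]
print(flag_anomalies(temps))
```

In a self-aware runtime, indices flagged this way could be passed to a control module that decides whether to migrate work or throttle the affected node; that coupling is left open here.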