Scheduling and Communication
           
Event Type   Start Time   End Time   Rm #    Chair

Paper        10:30AM      11:00AM    40-41   Allan Snavely (San Diego Supercomputer Center)
  Title: Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System
  Speakers/Presenter: Terry Jones (LLNL), William Tuel (IBM), Larry Brenner (IBM), Jeff Fier (IBM), Patrick Caffrey (IBM), Shawn Dawson (LLNL), Rob Neely (LLNL), Robert Blackmore (IBM), Brian Maskell (AWE), Paul Tomlinson (AWE), Mark Roberts (AWE)

Paper        11:00AM      11:30AM    40-41   Allan Snavely (San Diego Supercomputer Center)
  Title: BCS-MPI: a New Approach in the System Software Design for Large-Scale Parallel Computers
  Speakers/Presenter: Juan Fernandez (LANL), Eitan Frachtenberg (LANL), Fabrizio Petrini (LANL)

Paper        11:30AM      12:00PM    40-41   Allan Snavely (San Diego Supercomputer Center)
  Title: Scalable NIC-based reduction on Large-scale Clusters
  Speakers/Presenter: Adam Moody (Ohio State University), Juan Fernandez (LANL), Fabrizio Petrini (LANL), Dhabaleswar K. Panda (Ohio State University)

  Session: Scheduling and Communication
  Title: Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System
  Chair: Allan Snavely (San Diego Supercomputer Center)
  Time: Thursday, November 20, 10:30AM - 11:00AM
  Rm #: 40-41
  Speaker(s)/Author(s):  
  Terry Jones (LLNL), William Tuel (IBM), Larry Brenner (IBM), Jeff Fier (IBM), Patrick Caffrey (IBM), Shawn Dawson (LLNL), Rob Neely (LLNL), Robert Blackmore (IBM), Brian Maskell (AWE), Paul Tomlinson (AWE), Mark Roberts (AWE)
   
  Description:
  A parallel application benefits from scheduling policies that include a global perspective of the application. As the interactions among cooperating processes increase, mechanisms to ameliorate waiting within one or more of the processes become more important. Collective operations such as barriers and reductions are extremely sensitive to even usually harmless events such as context switches. For the last 18 months, we have been researching the impact of random short-lived interruptions such as timer-decrement processing and periodic daemon activity, and developing strategies to minimize their impact on large processor-count SPMD bulk-synchronous programming styles. We present a novel co-scheduling scheme for improving performance of fine-grain collective activities such as barriers and reductions, describe an implementation consisting of operating system kernel modifications and run-time system, and present a set of results comparing the technique with traditional operating system scheduling. Our results indicate a speedup of over 300% on synchronizing collectives.

This paper has been nominated for the Best Paper of SC2003 award.
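The core observation of the abstract, that a barrier finishes only when its slowest process arrives, so rare per-process interruptions are amplified at scale, can be illustrated with a toy simulation. This is a sketch of the general idea, not the authors' kernel implementation; the function names, probabilities, and costs below are all invented for illustration.

```python
import random

def simulate_barrier(nprocs, work=1.0, noise_prob=0.01, noise_cost=0.5, seed=0):
    """Model one bulk-synchronous step ending in a barrier.

    Each process does `work` units of computation, but with probability
    `noise_prob` it is hit by a short OS interruption (timer-decrement
    processing, daemon activity) costing `noise_cost`. The barrier
    completes only when the slowest process arrives, so one unlucky
    process delays everyone. All numbers are illustrative.
    """
    rng = random.Random(seed)
    arrival = [work + (noise_cost if rng.random() < noise_prob else 0.0)
               for _ in range(nprocs)]
    return max(arrival)

def coscheduled_barrier(nprocs, work=1.0):
    # Idealized co-scheduling: interruptions are aligned (or deferred)
    # across processes, so no single straggler inflates the barrier time.
    return work

slow = simulate_barrier(1024)
ideal = coscheduled_barrier(1024)
# With many processes, the chance that at least one of them is
# interrupted approaches 1, so `slow` typically exceeds `ideal`.
```

The amplification is why noise that is "usually harmless" on a workstation matters at large processor counts: the probability that no process in a step is interrupted shrinks exponentially with the number of processes.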
   

 

     
  Session: Scheduling and Communication
  Title: BCS-MPI: a New Approach in the System Software Design for Large-Scale Parallel Computers
  Chair: Allan Snavely (San Diego Supercomputer Center)
  Time: Thursday, November 20, 11:00AM - 11:30AM
  Rm #: 40-41
  Speaker(s)/Author(s):  
  Juan Fernandez (LANL), Eitan Frachtenberg (LANL), Fabrizio Petrini (LANL)
   
  Description:
  Buffered CoScheduled (BCS) MPI proposes a new approach to designing communication libraries for large-scale parallel machines. The emphasis of BCS-MPI is on the global coordination of a large number of processes rather than on the traditional optimization of the local performance of a pair of communicating processes. BCS-MPI delays interprocessor communication in order to schedule the communication pattern globally, and it is designed on top of a minimal set of collective communication primitives. In this paper we describe a prototype implementation of BCS-MPI and its communication protocols. The experimental results, obtained on a set of scientific applications representative of the ASCI workload, show that BCS-MPI is only marginally slower than the production-level MPI, but much simpler to implement, debug, and analyze.
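The buffering-plus-global-scheduling idea can be sketched in a few lines: sends are not transmitted eagerly, but queued until a global "strobe" delivers every node's pending traffic in one coordinated phase. This is a toy model of the concept only, assuming invented names (`BCSNode`, `global_strobe`); the real BCS-MPI protocols and primitives differ.

```python
from collections import defaultdict

class BCSNode:
    """Toy model of one process under Buffered CoScheduling."""
    def __init__(self, rank):
        self.rank = rank
        self.outbox = []   # (dest, payload) pairs buffered until the strobe
        self.inbox = []

    def send(self, dest, payload):
        # Communication is delayed, not performed immediately.
        self.outbox.append((dest, payload))

def global_strobe(nodes):
    """One global timeslice: collect every pending message (in a real
    system this uses a small set of collective primitives, e.g. a total
    exchange), then deliver all of them in a globally scheduled phase."""
    pending = defaultdict(list)
    for node in nodes:
        for dest, payload in node.outbox:
            pending[dest].append((node.rank, payload))
        node.outbox.clear()
    for node in nodes:
        node.inbox.extend(pending[node.rank])

nodes = [BCSNode(r) for r in range(4)]
nodes[0].send(2, "a")
nodes[1].send(2, "b")
# Nothing is delivered until the strobe fires:
global_strobe(nodes)
```

Because all traffic moves in these coordinated phases, the system reasons about one global communication pattern per timeslice instead of many independent point-to-point exchanges, which is what makes the design simpler to analyze.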
   

 

     
  Session: Scheduling and Communication
  Title: Scalable NIC-based reduction on Large-scale Clusters
  Chair: Allan Snavely (San Diego Supercomputer Center)
  Time: Thursday, November 20, 11:30AM - 12:00PM
  Rm #: 40-41
  Speaker(s)/Author(s):  
  Adam Moody (Ohio State University), Juan Fernandez (LANL), Fabrizio Petrini (LANL), Dhabaleswar K. Panda (Ohio State University)
   
  Description:
  Over the last few decades, researchers have developed many efficient reduction algorithms. However, all these algorithms assume that the reduction processing takes place on the host CPU. Modern Network Interface Cards (NICs) sport programmable processors and thus introduce a fresh variable into the equation. This raises the following interesting challenge: Can we take advantage of modern NICs to implement fast reduction operations? In this paper, we take on this challenge in the context of large-scale clusters. Through experiments on a 960-node, 1920-processor cluster we show that NIC-based reductions indeed perform with reduced latency and improved consistency and scalability over host-based algorithms for the common case. In the largest configuration tested (1,812 processors), our NIC-based algorithm can sum a single-element vector in 73 microseconds with 32-bit integers and in 118 microseconds with 64-bit floating-point numbers, improvements of 121% and 39%, respectively, over the production-level MPI library.
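A reduction offloaded to NIC processors is typically organized as a tree: each round, a parent combines the partial results of its children, so the operation completes in logarithmically many rounds. The sketch below shows only this tree structure (the function name and fanout are assumptions for illustration); the paper's actual NIC-resident protocol differs in detail.

```python
def nic_tree_reduce(values, fanout=4, op=lambda a, b: a + b):
    """Combine `values` in a k-ary tree, as a programmable NIC could do
    without involving the host CPU (illustrative sketch only).

    Each round, groups of `fanout` partial results are merged by their
    parent, so the reduction finishes in ceil(log_fanout(n)) rounds.
    Returns the reduced value and the number of rounds taken.
    """
    rounds = 0
    while len(values) > 1:
        parents = []
        for i in range(0, len(values), fanout):
            group = values[i:i + fanout]
            acc = group[0]
            for v in group[1:]:
                acc = op(acc, v)   # combine on the parent "NIC"
            parents.append(acc)
        values = parents
        rounds += 1
    return values[0], rounds

total, rounds = nic_tree_reduce(list(range(64)), fanout=4)
# 64 leaves with fanout 4 finish in 3 rounds; latency grows only with
# tree depth, which is why such reductions scale to large clusters.
```

Running the reduction on the NIC also decouples it from host scheduling, which is one reason the measured latencies are more consistent than host-based ones.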