Poster Reception
           
Event Type Start Time End Time Rm # Chair  

 

Poster 5:00PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Poster Reception
  Speakers/Presenter:

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

A Performance Comparison of Sorting Algorithms in Unified Parallel C
  Speakers/Presenter:
Ronald Brightwell (Sandia National Laboratories), Jonathan Leighton Brown (Sandia National Laboratories), Sue Goudy (Sandia National Laboratories), Zhaofang Wen (Sandia National Laboratories)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

A Proposed Standard for Matrix Metadata
  Speakers/Presenter:
Victor Eijkhout (University of Tennessee), Erika Fuentes (University of Tennessee)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Parallel Implementation of the WASH123D Code for Surface and Subsurface Hydrologic Interactions
  Speakers/Presenter:
Jing-Ru C. Cheng (Major Shared Resource Center (MSRC), U.S. Army Engineer Research & Development Center (ERDC)), Hwai-Ping Cheng (Coastal & Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), Robert M. Hunter (Major Shared Resource Center, U.S. Army Engineer Research & Development Center), Hsin-Chi J. Lin (Coastal & Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), David R. Richards (Information Technology Laboratory, U.S. Army Engineer Research & Development Center), Earl V. Edris (Coastal and Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), Gour-Tsyh Yeh (Department of Civil & Environmental Engineering, University of Central Florida)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

ZENTURIO: An Experiment Management Tool for Cluster and Grid Computing
  Speakers/Presenter:
Radu Prodan (Institute for Software Science, University of Vienna), Thomas Fahringer (Institute for Computer Science, University of Innsbruck)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Analysing and Enhancing Parallel I/O in mpiBLAST
  Speakers/Presenter:
Fernando Moraes (Los Alamos National Laboratory), Wu-chun Feng (Los Alamos National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Performance Analysis of On-Host Communication in the Los Alamos-Message Passing Interface
  Speakers/Presenter:
Timothy Brian Prins (Advanced Computing Laboratory Los Alamos National Laboratory), Richard L. Graham (Advanced Computing Laboratory Los Alamos National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Hardware Performance Counters: Is What You See, What You Get?
  Speakers/Presenter:
Alonso Bayona (The University of Texas at El Paso), Manuel Nieto (The University of Texas at El Paso), Patricia J. Teller (The University of Texas at El Paso)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Public Server for High-Throughput Genome Analysis
  Speakers/Presenter:
Dinanath Sulakhe (Argonne National Laboratory), Alex Rodriguez (Argonne National Laboratory), Natalia Maltsev (Argonne National Laboratory), Elizabeth Marland (Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Dynamic Sessions for the Grid Environment
  Speakers/Presenter:
Karl Doering (University of California, Riverside), Kate Keahey (Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

NAS Parallel Benchmarks: new territory, new releases.
  Speakers/Presenter:
Rob F. Van der Wijngaart (Computer Sciences Corporation)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Parallel landscape fish model for South Florida ecosystem simulation
  Speakers/Presenter:
Dali Wang (The Institute for Environmental Modeling, University of Tennessee, Knoxville), Louis Gross (The Institute for Environmental Modeling, University of Tennessee, Knoxville), Michael Berry (Department of Computer Science, University of Tennessee, Knoxville)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

GMIT: A Tool for Indexing Self-describing Scientific Datasets
  Speakers/Presenter:
Beomseok Nam (University of Maryland), Alan Sussman (University of Maryland)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Parallel hybrid optimization approaches for the solution of large-scale groundwater inverse problems on the TeraGrid
  Speakers/Presenter:
Kumar Mahinthakumar (North Carolina State University), Mohamed Sayeed (North Carolina State University), Dongju Choi (San Diego Supercomputer Center), Nicholas Karonis (Northern Illinois University/Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

GHS: A Performance Prediction and Task Scheduling System for Grid Computing
  Speakers/Presenter:
Xian-He Sun (IIT), Ming Wu (IIT)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

The OurGrid Project: Running Bag-of-Tasks Applications on Computational Grids
  Speakers/Presenter:
Walfredo Cirne (UFCG), Nazareno Andrade (UFCG), Lauro Costa (UFCG), Glaucimar Aguiar (UFCG), Daniel Paranhos (UFCG), Francisco Brasileiro (UFCG)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

High-level Language Support for Dynamic Data Distributions
  Speakers/Presenter:
Steven J. Deitz (University of Washington), Bradford L. Chamberlain (University of Washington/Cray Inc.), Lawrence Snyder (University of Washington)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Strong scalability analysis and performance evaluation of a CCA-based hydrodynamic simulation on structured adaptively refined meshes.
  Speakers/Presenter:
Sophia Lefantzi (Sandia National Laboratories), Jaideep Ray (Sandia National Laboratories), Sameer Shende (University of Oregon)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Dynamic Partitioning for High-Speed Network Intrusion Detection
  Speakers/Presenter:
Jens Mache (Lewis & Clark College), Jason Gimba (Lewis & Clark College), Thierry Lopez (Lewis & Clark College)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Benchmarking Clusters vs. SMP Systems by Analyzing the Trade-Off Between Extra Calculations vs. Communication
  Speakers/Presenter:
Anne Cathrine Elster (Norwegian Univ. of Science & Technology, Trondheim, Norway), Robin Holtet (Norwegian Univ. of Science & Technology, Trondheim, Norway)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Benchmarking a Parallel Coupled Model
  Speakers/Presenter:
J. Walter Larson (Argonne National Laboratory), Robert L. Jacob (Argonne National Laboratory), Everest T. Ong (Argonne National Laboratory), Anthony Craig (National Center for Atmospheric Research), Brian Kauffman (National Center for Atmospheric Research), Thomas Bettge (National Center for Atmospheric Research), Yoshikatsu Yoshida (Central Research Institute of the Electrical Power Industry), Junichiro Ueno (Fujitsu Limited), Hidemi Komatsu (Fujitsu Limited), Shin-ichi Ichikawa (Fujitsu Limited), Clifford Chen (Fujitsu America, Inc.), Patrick H. Worley (Oak Ridge National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

GASNet: A High-performance, portable compilation target for parallel compilers
  Speakers/Presenter:
Dan O Bonachea (U.C. Berkeley), Paul H Hargrove (LBNL)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

MPH: a Library for Coupling Climate Component Models on Distributed Memory Architectures
  Speakers/Presenter:
Chris H.Q. Ding (Lawrence Berkeley National Laboratory), Yun He (Lawrence Berkeley National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost
  Speakers/Presenter:
Surendra Byna (Dept. of Computer Science, Illinois Institute of Technology, Chicago), William Gropp (Math. and Comp. Science Division, Argonne National Laboratory), Xian-He Sun (Dept. of Computer Science, Illinois Institute of Technology, Chicago), Rajeev Thakur (Math. and Comp. Science Division, Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

MPICH-V3: A hierarchical fault tolerant MPI for Multi-Cluster Grids
  Speakers/Presenter:
Pierre Lemarinier (LRI), Aurélien Bouteiller (LRI), Franck Cappello (INRIA-CNRS-LRI)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Performance Modeling of HPC Applications
  Speakers/Presenter:
Allan Snavely (SDSC), Laura Carrington (SDSC), Nicole Wolter (SDSC)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

1 TFLOPS achieved with distributed Linux cluster
  Speakers/Presenter:
Craig A. Stewart (University Information Technology Services, Indiana University), George Turner (UITS, Indiana University), Peng Wang (UITS, Indiana University), David Hart (UITS, Indiana University), Steven Simms (UITS, Indiana University), Daniel Lauer (UITS, Indiana University), Mary Papakhian (UITS, Indiana University), Matthew Allen (UITS, Indiana University), Jeff Squyres (Open Systems Laboratory, Indiana University), Andrew Lumsdaine (Open Systems Laboratory, Indiana University)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Distributed Intelligent RAS System for Large Computational Clusters
  Speakers/Presenter:
J. M. Brandt (Sandia National Laboratories), N. M. Berry (Sandia National Laboratories), R. B. Yao (Sandia National Laboratories), B. M. Tsudama (Sandia National Laboratories), A. C. Gentile (Sandia National Laboratories)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Prophesy: An Automated Modeling System for Parallel and Grid Applications
  Speakers/Presenter:
Xingfu Wu (Texas A&M University), Valerie Taylor (Texas A&M University), Rick Stevens (Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Real-Time Rendering of Attributed LIC-Volumes Using Programmable GPUs
  Speakers/Presenter:
Yasuko Suzuki (Information Technology R & D Center, Mitsubishi Electric Corp., Japan), Tokunaga Momoe (Office Imaging Products Operations, Canon Inc., Japan), Shoko Ando (Graduate School of Humanities and Sciences, Ochanomizu Univ., Japan), Shigeru Muraki (Collaborative Research Team of Volume Graphics, AIST, Japan), Yuriko Takeshima (Institute of Fluid Science, Tohoku Univ., Japan), Shigeo Takahashi (Graduate School of Arts and Sciences, University of Tokyo, Japan), Issei Fujishiro (Graduate School of Humanities and Sciences, Ochanomizu Univ., Japan)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

SX-6 Benchmark Results and Comparisons
  Speakers/Presenter:
Thomas J. Baring (Vector Specialist), Andrew C. Lee (User Consultant)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Performance of SGI's Altix
  Speakers/Presenter:
Michael Lang (LANL), Scott Pakin (LANL), Darren Kerbyson (LANL), Fabrizio Petrini (LANL), Harvey Wasserman (LANL), Adolfy Hoisie (LANL)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Computational Steering in RealityGrid
  Speakers/Presenter:
John M Brooke (Manchester Computing, University of Manchester), Peter V Coveney (Centre for Computational Science, University College London), Jens Harting (Centre for Computational Science, University College London), Shantenu Jha (Centre for Computational Science, University College London), Stephen M Pickles (Manchester Computing, University of Manchester), Robin L Pinning (Manchester Computing, University of Manchester), Andrew R Porter (Manchester Computing, University of Manchester)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

An Adaptive Informatics Infrastructure Enabling Multi-Scale Chemical Science
  Speakers/Presenter:
Thomas C. Allison (NIST, Gaithersburg, MD 20899-8381), Kaizar Amin (Argonne National Laboratory, Argonne, IL 60439-4844 ), William Barber (Los Alamos National Laboratory, Los Alamos, NM 87545), Sandra Bittner (Argonne National Laboratory, Argonne, IL 60439-4844 ), Brett Didier (Pacific Northwest National Laboratory, Richland, WA 99352), Michael Frenklach (University of California, Berkeley, CA 94720-1740), William H. Green, Jr. (Massachusetts Institute of Technology, Cambridge, MA 02139), John Hewson (Sandia National Laboratories, Livermore, CA 94551-0969), Wendy Koegler (Sandia National Laboratories, Livermore, CA 94551-0969), Carina Lansing (Pacific Northwest National Laboratory, Richland, WA 99352), Gregor von Laszewski (Argonne National Laboratory, Argonne, IL 60439-4844 ), David Leahy (Sandia National Laboratories, Livermore, CA 94551-0969), Michael Lee (Sandia National Laboratories, Livermore, CA 94551-0969), James D. Myers (Pacific Northwest National Laboratory, Richland, WA 99352), Renata McCoy (Sandia National Laboratories, Livermore, CA 94551-0969), Michael Minkoff (Argonne National Laboratory, Argonne, IL 60439-4844 ), David Montoya (Los Alamos National Laboratory, Los Alamos, NM 87545), Sandeep Nijsure (Argonne National Laboratory, Argonne, IL 60439-4844 ), Carmen Pancerella (Sandia National Laboratories, Livermore, CA 94551-0969), Reinhardt Pinzon (Argonne National Laboratory, Argonne, IL 60439-4844 ), William Pitz (Lawrence Livermore National Laboratory, Livermore, CA 94551), Larry A. Rahn (Sandia National Laboratories, Livermore, CA 94551-0969), Branko Ruscic (Argonne National Laboratory, Argonne, IL 60439-4844 ), Karen Schuchardt (Pacific Northwest National Laboratory, Richland, WA 99352), Eric Stephan (Pacific Northwest National Laboratory, Richland, WA 99352), Al Wagner (Argonne National Laboratory, Argonne, IL 60439-4844 ), Baoshan Wang (Argonne National Laboratory, Argonne, IL 60439-4844 ), Theresa Windus (Pacific Northwest National Laboratory, Richland, WA 99352), Lili Xu (Los Alamos National Laboratory, Los Alamos, NM 87545), Christine Yang (Sandia National Laboratories, Livermore, CA 94551-0969)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

The Advisor: Diagnosing End-To-End Network Problems
  Speakers/Presenter:
Tanya Brethour (NLANR/DAST), Jim Ferguson (NLANR/DAST)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Repartitioning Unstructured Meshes for the Parallel Solution of Engine Combustion
  Speakers/Presenter:
PeiZong Lee (Institute of Information Science, Academia Sinica), Chih-Hsueh Yang (Institute of Information Science, Academia Sinica), Jeng-Renn Yang (Institute of Information Science, Academia Sinica)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Ontology-based Resource Matching in the Grid
  Speakers/Presenter:
Hongsuda Tangmunarunkit (USC-ISI), Stefan Decker (USC-ISI), Carl Kesselman (USC-ISI)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Grid performance prediction with performance skeletons
  Speakers/Presenter:
Sukhdeep Sodhi (University of Houston), Jaspal Subhlok (University of Houston)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Inca Test Harness and Reporting Framework
  Speakers/Presenter:
Shava Smallen (San Diego Supercomputer Center), Pete Beckman (Argonne National Laboratory), Michael Feldmann (California Institute of Technology), Tim Kaiser (San Diego Supercomputer Center), Catherine Olschanowsky (San Diego Supercomputer Center), Jennifer Schopf (Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

PAPI Version 3
  Speakers/Presenter:
Jack Dongarra (University of Tennessee), Kevin London (University of Tennessee), Shirley Moore (University of Tennessee), Philip Mucci (University of Tennessee), Daniel Terpstra (University of Tennessee), Haihang You (University of Tennessee), Min Zhou (University of Tennessee)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

A Parallel High Performance Inductance Extraction Algorithm
  Speakers/Presenter:
Hemant Mahawar (Texas A&M University), Vivek Sarin (Texas A&M University)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Ouroboros: A Tool for Building Generic, Hybrid, Divide & Conquer Algorithms
  Speakers/Presenter:
John R. Johnson (University of Chicago / Lawrence Livermore National Laboratory), Ian Foster (University of Chicago / Argonne National Laboratory)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Ultra-Large-Scale Molecular-Dynamics Simulation of 19 Billion Interacting Particles
  Speakers/Presenter:
Kai Kadau (Theoretical Division, Los Alamos National Laboratory, MS B262, Los Alamos, New Mexico 87545, U.S.A., E-mail: kkadau@lanl.gov), Timothy C. Germann (Applied Physics Division, Los Alamos National Laboratory, MS F699, Los Alamos, New Mexico 87545, U.S.A., E-mail: tcg@lanl.gov), Peter S. Lomdahl (Theoretical Division, Los Alamos National Laboratory, MS B262, Los Alamos, New Mexico 87545, U.S.A., E-mail: pxl@lanl.gov)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Large-scale Molecular Dynamics Simulations of Hybrid Inorganic-Organic Systems on Multi-Teraflop Linux Clusters and High-end Parallel Supercomputers
  Speakers/Presenter:
Satyavani Vemparala (Louisiana State University), Bijaya B. Karki (Louisiana State University), Hideaki Kikuchi (Louisiana State University), Rajiv K. Kalia (University of Southern California), Aiichiro Nakano (University of Southern California), Priya Vashishta (University of Southern California), Shuji Ogata (Nagoya Institute of Technology), Subhash Saini (NASA Ames Research Center)

 

Poster 5:01PM 7:00PM Lobby 2 Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
 
Title:

Data Mining on the TeraGrid
  Speakers/Presenter:
Helen Conover (University of Alabama in Huntsville), Sara J. Graves (University of Alabama in Huntsville), Rahul Ramachandran (University of Alabama in Huntsville), Sandra Redman (University of Alabama in Huntsville), John Rushing (University of Alabama in Huntsville), Steven Tanner (University of Alabama in Huntsville), Robert Wilhelmson (National Center for Supercomputing Applications)
             

 

     
  Session: Poster Reception
  Title: Poster Reception
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:00PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
 
   
  Description:
  Posters at SC2003 showcase the latest innovations and are prominently displayed throughout the conference. This evening Poster Reception, sponsored by AMD, features the posters and gives conference attendees time to discuss the displays with the poster presenters. The poster session is a means of presenting timely research in a more casual setting, with the chance for greater in-depth, one-on-one dialogue that can proceed at its own pace.
  Link: --
   

 

     
  Session: Poster Reception
  Title: A Performance Comparison of Sorting Algorithms in Unified Parallel C
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Ronald Brightwell (Sandia National Laboratories), Jonathan Leighton Brown (Sandia National Laboratories), Sue Goudy (Sandia National Laboratories), Zhaofang Wen (Sandia National Laboratories)
   
  Description:
  In this poster, we provide a performance comparison of sorting algorithms designed for and implemented in the Unified Parallel C (UPC) programming language. We analyze six different algorithms and discuss how they are implemented in UPC. In particular, we describe the data distribution and data movement characteristics that impact the performance of the different sorting methods. We expect these results to provide insight into domain decomposition methods and load balancing issues with UPC and the Global Address Space (GAS) programming model in general. Algorithms are both adapted from literature and designed specifically for GAS. Notably for our radix sort, novel algorithmic techniques are employed to great effect. In the following sections, we describe our motivation and the details behind this analysis.

Sorting is a fundamental problem in theory and has a wide variety of practical applications. The problem inherently requires both communication and computation on parallel computers. Sorting is a first problem for our study of the usability and performance of UPC. Further, as the UPC library sort has not been implemented yet, our work will help identify the best implementation.

Sequential and parallel sorting algorithms have been extensively studied. Parallel sorting algorithms have generally been designed for PRAM-based models or for models assuming a particular processor interconnection topology. Efficient parallel sorting networks have also been designed [Batcher68]. These algorithms have not been shown to provide efficient parallel implementations on real-world machines, mainly due to their assumptions about communication costs. Sorting methods have also been designed for specific parallel machines. Thus far, experimental data suggests that sample sort [Blelloch91] is best in practice.

Our work was to design sorting algorithms for GAS and implement them in UPC. We selected six sorts: bubble sort, merge sort, Batcher's odd-even sort, Quicksort, sample sort, and radix sort. These algorithms were chosen for their standing in the literature, different sorting strategies, and communication patterns.

We briefly outline each algorithm. In bubble sort, the list is scanned repeatedly, moving the largest number to one end each time. For our parallel version, each thread sorts an equally-sized partition of the list, and then partitions are pairwise merged and split until the sequence is sorted. Merge sort is a divide-and-conquer algorithm where each half of the list is sorted and then the halves are merged. We use a bottom up implementation in our parallel version starting from equally-sized, locally-sorted list partitions. Odd-even sort is specifically designed for sorting networks. Our implementation is a direct translation of this with UPC parallel loops. Quicksort is a divide-and-conquer sort that uses a random pivot to split the list for recursion. The threads split sublists in parallel in our implementation for UPC. Sample sort is a parallel sort where the list is first partitioned into groups that are relatively ordered, and then each group is independently sorted. Our implementation is a direct translation. Radix sort uses bucketing to sort numbers digit by digit, starting from the least significant. Our radix sort is dynamic in that the radix is determined at runtime based on the number of threads, input size, and input range. Bucketing is distributed among the threads at each iteration, and load balancing is also done at each iteration.
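
For illustration only, the sequential C sketch below shows the least-significant-digit bucketing at the heart of a radix sort, with the radix left as a runtime parameter in the spirit of the dynamic scheme described above. This is not the UPC implementation discussed in this poster; in the UPC version the per-digit histograms are distributed across threads and rebalanced at every pass.

```c
/*
 * Illustrative sketch only (not the authors' UPC code): a sequential
 * least-significant-digit radix sort whose radix is chosen at run time.
 * In a UPC version the per-digit bucketing below would be distributed
 * across THREADS and rebalanced at every pass.
 */
#include <stdlib.h>
#include <string.h>

/* Sort n unsigned keys in place; radix_bits would be derived from the
 * thread count, input size, and input range in the dynamic scheme. */
void radix_sort(unsigned *keys, size_t n, unsigned radix_bits)
{
    unsigned radix = 1u << radix_bits;
    unsigned mask  = radix - 1;
    unsigned *tmp  = malloc(n * sizeof *tmp);
    size_t  *count = malloc(radix * sizeof *count);
    if (!tmp || !count) { free(tmp); free(count); return; }

    for (unsigned shift = 0; shift < 32; shift += radix_bits) {
        memset(count, 0, radix * sizeof *count);

        /* Histogram of the current digit (per-thread histograms in UPC). */
        for (size_t i = 0; i < n; i++)
            count[(keys[i] >> shift) & mask]++;

        /* Exclusive prefix sum gives each bucket's starting offset. */
        size_t sum = 0;
        for (unsigned b = 0; b < radix; b++) {
            size_t c = count[b];
            count[b] = sum;
            sum += c;
        }

        /* Stable scatter into buckets, then copy back for the next pass. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & mask]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    free(count);
}
```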

Tests were conducted on a Cray T3E with 60 processing elements. The Cray T3E has cache-coherent, physically distributed memory that is globally addressable.

Preliminary test results show clear delineations in algorithm performance. As expected, the odd-even sort is much slower than the rest of the algorithms due to its fine-grained communication pattern. Quicksort is also very slow because it can become unbalanced, which is inherent in the algorithm. Because of their basic schemes, bubble and merge sorts do not scale well: due to communication overhead, performance deteriorates with an increasing number of threads. The two top performers, sample sort and our dynamic radix sort, substantially outperform the other algorithms, and both exhibit good scalability. The dynamic radix sort exhibits super-linear scaling with thread count (within the range of physically available processors) because the radix is chosen based on the three input parameters described above. Our dynamic radix sort overtakes sample sort when the thread count reaches 16, making the dynamic radix sort a good candidate for the UPC library sort.

NOTE: We will present our work in the poster through printed pages of high-level algorithm descriptions as well as runtime performance charts. We can also provide a PowerPoint presentation if necessary.
  Link: --
   

 

     
  Session: Poster Reception
  Title: A Proposed Standard for Matrix Metadata
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Victor Eijkhout (University of Tennessee), Erika Fuentes (University of Tennessee)
   
  Description:
  Statement:
We propose a standard for storing metadata describing numerical matrix data. The standard consists of an XML file format and an internal data format. We give an abstract description of the XML storage format, APIs (Application Programmer Interfaces) to access the stored data inside a program, and a core set of categories of data to be stored. The format is open-ended, allowing other parties to define additional metadata categories to be stored within this framework.

Justification:
We can associate with matrix data any number of derived properties, such as norms, spectral properties, or graph properties of sparse matrices. There is no standard way of storing such data, which makes interoperability between software modules written by different authors hard. Such interoperability would be valuable in a number of contexts. For instance, linear algebra algorithms often need, or at least have a use for, difficult-to-compute matrix statistics, such as condition number estimates. While any ad hoc fit can be made between such analysis-producing and analysis-consuming software, a more general solution would make componentization of such software possible.

The existence of a metadata standard for numerical data -- we limit ourselves here to matrix data, though extension of these ideas to other fields is natural -- makes the following software functionalities possible.
* First of all, it allows numerical algorithms to request metadata, not easily derivable from more traditional inputs, that will assist in the computation process.
* Secondly, it allows numerical data processing program components to annotate data with information that normally gets lost in the interface to the numerics.
* Thirdly, it makes it possible to encode two sorts of expert knowledge: the mapping of application-oriented data to numerics, and the decision making process based on wider aspects of the numerical data.

In two other applications, furthermore, we need a standard format not only for storing such data programmatically but also for more permanent file storage. The first application is the development of the Intelligent Agent of a Self-Adapting Numerical Software system (Dongarra and Eijkhout, 2003). Here, exhaustive analyses of the properties of a number of matrices are stored in a database for subsequent analysis, together with performance results, for instance from solving linear systems with these matrices. New matrices can then be matched against this database for recommendations as to preferred solution methods.

The second application where a metadata file format can be beneficial is in matrix collections, such as Matrix Market. A standard format makes it easier to automate the insertion and analysis of matrices, as well as enabling complicated database queries. On the retrieval side of the collection, it means that any dataset extracted from the collection comes with substantial standardized information.

Proposed standard: Our proposal for a metadata standard comprises the following.
* First of all, we propose a standard for storing metadata. The standard has a combined sequential/recursive structure that gives easy access to a number of commonly needed elements, as well as extensibility that allows third parties to add metadata categories.
* We propose a number of metadata categories, picked to cover the most common applications in numerical matrix analysis.
* We propose an XML file format for permanent storage of the matrix metadata. We have written software that converts from this XML format to internal data structures; for software availability, see the conclusion to this article.
* We define an API (Application Programmer Interface) that allows access to the metadata data structures from inside application codes. The API allows both retrieval as well as insertion of data.

Poster presentation:
We will graphically present usage scenarios that motivate the proposal of numerical metadata. We will display sample XML code, as well as the XML Schema and XSL style sheet (excerpted where appropriate). We will display the proposed core set of metadata.
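
For illustration only, the C sketch below shows the kind of write-side access an application might use to emit such a metadata record. The struct fields and XML element names are hypothetical and do not reproduce the proposed schema or API.

```c
/*
 * Illustrative sketch only: hypothetical element names showing the kind of
 * XML metadata record and write-side API the proposed standard envisions.
 */
#include <stdio.h>

struct matrix_metadata {
    const char *name;        /* collection identifier                  */
    int         rows, cols;  /* structural metadata                    */
    long        nonzeros;    /* sparsity information                   */
    double      norm_1;      /* an example derived numerical property  */
};

/* Serialize one metadata record to an XML file; returns 0 on success. */
int write_matrix_metadata(const char *path, const struct matrix_metadata *m)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "<?xml version=\"1.0\"?>\n");
    fprintf(f, "<matrix name=\"%s\">\n", m->name);
    fprintf(f, "  <structure rows=\"%d\" cols=\"%d\" nonzeros=\"%ld\"/>\n",
            m->rows, m->cols, m->nonzeros);
    fprintf(f, "  <norms><norm type=\"1\">%g</norm></norms>\n", m->norm_1);
    fprintf(f, "</matrix>\n");
    return fclose(f);
}
```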
  Link: --
   

 

     
  Session: Poster Reception
  Title: Parallel Implementation of the WASH123D Code for Surface and Subsurface Hydrologic Interactions
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Jing-Ru C. Cheng (Major Shared Resource Center (MSRC), U.S. Army Engineer Research & Development Center (ERDC)), Hwai-Ping Cheng (Coastal & Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), Robert M. Hunter (Major Shared Resource Center, U.S. Army Engineer Research & Development Center), Hsin-Chi J. Lin (Coastal & Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), David R. Richards (Information Technology Laboratory, U.S. Army Engineer Research & Development Center), Earl V. Edris (Coastal and Hydraulics Laboratory, U.S. Army Engineer Research & Development Center), Gour-Tsyh Yeh (Department of Civil & Environmental Engineering, University of Central Florida)
   
  Description:
  This research is supported by the Department of Defense (DoD) Common High Performance Computing Software Support Initiative (CHSSI) to explore scalable computing systems for the WASH123D code. WASH123D is a computer code that simulates water flow and contaminant and sediment transport in coupled surface water and groundwater systems. It simulates 1-dimensional (1-D) streams and rivers, 2-dimensional (2-D) overland regimes, and 3-dimensional (3-D) subsurface media. It is designed to assist water resource managers in conducting surface water and groundwater restoration activities. Typical applications include cleanup of military unique compounds in groundwater, evaluations of chemical and biological threats to water supply systems, and hydrologic restoration of large wetland systems such as the Everglades.

WASH123D is a first-principle, physics-based model, where 1-D channel flow and 2-D overland flow are described with the St. Venant equations, 3-D subsurface flow is governed by the Richards equation, and contaminant transport and sediment transport are computed by solving respective mass continuity equations. The Richards equation is solved with the Eulerian finite element method, while the St. Venant equation is solved with either the Semi-Lagrangian or the Eulerian finite element method. All of the transport equations are solved with the Lagrangian-Eulerian finite element method. An in-element particle-tracking (PT) algorithm that accounts for unsteady flow is applied to backtrack fictitious particles from global nodes to determine the so-called Lagrangian values when the Semi-Lagrangian or the Lagrangian-Eulerian method is used.
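
For illustration only, the C sketch below shows a first-order backtracking step for a single fictitious particle, which conveys the idea behind the Semi-Lagrangian treatment. The actual in-element particle tracking in WASH123D accounts for unsteady velocity fields and element crossings, which are omitted here.

```c
/*
 * Illustrative sketch only: first-order backtracking of one fictitious
 * particle from a global node to its departure point, where the
 * Lagrangian value would be evaluated.
 */
typedef struct { double x, y; } point2d;

/* velocity(p, t) would normally interpolate the finite element solution;
 * here it is simply a function pointer supplied by the caller. */
typedef point2d (*velocity_fn)(point2d p, double t);

/* Backtrack from a global node over one time step dt using nsub substeps. */
point2d backtrack(point2d node, double t, double dt, int nsub, velocity_fn vel)
{
    point2d p = node;
    double  h = dt / nsub;
    for (int k = 0; k < nsub; k++) {
        double  tk = t - k * h;     /* march backwards in time  */
        point2d v  = vel(p, tk);
        p.x -= h * v.x;             /* explicit Euler backtrack */
        p.y -= h * v.y;
    }
    return p;
}
```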

In dealing with such a sophisticated system, two tasks are required to make the simulation time reasonable: numerical algorithm improvement and code parallelization. The code parallelization, which enables the code to run efficiently on the scalable parallel high performance computers provided by the DoD High Performance Computing Modernization Program, is the central focus of this poster.

To manage such a large computational code, the original programming paradigm was changed to take advantage of modern software design principles, particularly object-oriented programming. The software design therefore targets the four elements of Booch's object model: abstraction, encapsulation, modularity, and hierarchy. This poster presents how this goal is achieved, including the data structure design and the development and incorporation of parallel software tools, and what performance has been achieved. To account for problem domains that may include a 1-D river-stream network, a 2-D overland regime, and 3-D subsurface media, where various partial differential equations are derived to mathematically characterize flow and transport behaviors, three mesh objects are constructed to describe the three subdomains. Moreover, the common phenomena are described by a global object, the subdomain-to-subdomain interactions are managed by a couple object, and the parallel environmental context is set up by a parallel object. Each subdomain is partitioned to processors, based on its preferred partitioning criteria, by DBuilder, a by-product of the project. DBuilder encapsulates all of the MPI implementation and parallel data management so that different parallel paradigms can be implemented without breaking the software that uses DBuilder. The parallel PT software developed by the first author is incorporated to implement the Lagrangian approach. Experimental results will be provided to detail the scalability on different application problems. The performance metrics are in terms of wall-clock time, algorithm efficiency, and parallel efficiency. Two hardware architectures in the ERDC MSRC are chosen for performance comparison and portability demonstration. Coding examples are also shown to demonstrate the simplicity of the building blocks in the application code.

This poster is of interest to researchers in the following fields: (1) EQM (Environmental Quality Modeling and Simulation) applications, (2) PT (particle tracking) applications, (3) software tool development, and (4) the DoD user community. This research produces not only a scalable parallel computation code but also a parallel software tool, which serves as a domain partitioner and a parallel data manager. Such software development will benefit applications in the communities mentioned above.

The complete poster is organized as follows: Section 1 gives a brief introduction of the WASH123D model; Section 2 reviews the numerical methods and algorithms implemented in the model, where color graphs are used to describe the interaction between 1-, 2-, and 3-D objects; Section 3 states the software design, software tool development, and PT software incorporation, where color charts and color graphs are used to show the software conceptual model and software architecture; Section 4 provides experimental results to demonstrate the performance of the implementation on two architectures in the ERDC MSRC, where color graphs and tables are used to describe the application problem, show the results, and compare them on different machines; Section 5 concludes the current work and describes the future tasks.
  Link: --
   

 

     
  Session: Poster Reception
  Title: ZENTURIO: An Experiment Management Tool for Cluster and Grid Computing
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Radu Prodan (Institute for Software Science, University of Vienna), Thomas Fahringer (Institute for Computer Science, University of Innsbruck)
   
  Description:
  The development of performance-oriented applications involves many cycles of code editing, execution, testing, performance analysis, and tuning. Some performance metrics, such as scalability or speedup, normally require the testing of numerous problem and machine sizes for different compiler options and target architectures. On the other hand, large-scale parameter studies necessitate repeated execution of the same application with various input parameters, which have to be correlated with the output results. To date, scientists and engineers still need to manually manage many different I/O data sets, launch large numbers of program compilations and executions, invoke performance analysis tools to derive performance metrics, relate performance data back to experiments and code regions, etc.

We have developed the ZENTURIO experiment management tool for multi-experiment parameter and performance studies of parallel and distributed applications on cluster and Grid architectures.

We have designed a directive-based language (HPF and OpenMP-like) called ZEN to annotate arbitrary files and specify value ranges for arbitrary application parameters, including program variables, I/O file names, compiler options, target machines, machine sizes, scheduling strategies, data distributions, communication networks, etc. Through performance behaviour directives, the user can request a large variety of performance metrics (e.g. execution time, cache misses, communication time, synchronisation time, floating-point operations per second) for any code region in the program. The advantage of the directive-based approach over an external global script for describing experiments (see Nimrod, ILAB) is the ability to specify experiments at a more detailed level (e.g. associating local scopes with directives, restricting parametrisation to specific local variables, or evaluating different scheduling alternatives for individual loops).

Through a graphical User Portal, the user inputs the application, together with the compilation and execution commands, and optionally the target machine on which to execute the experiments. An Experiment Generator service parses the ZEN-annotated files and generates the corresponding set of experiments. ZEN performance behaviour directives are translated into calls to the SCALEA performance instrumentation system, which provides a full-fledged HPF, OpenMP, and MPI Fortran90 front-end and unparser based on the Vienna Fortran Compiler (VFC). An Experiment Executor service submits and monitors the execution of experiments on the target execution site. Interfaces to various batch schedulers have been developed, including PBS, LSF, LoadLeveler, Condor, GRAM, and DUROC. An Experiment Monitor portlet of the User Portal displays a continuous on-line view of the experiments and their status as they progress. After starting a suite of experiments, the user can disconnect from the User Portal (e.g. when starting experiments overnight, or from a temporary Internet/Grid location) and leave the experiments running in the background on the target HPC machine. The user can later reconnect to the User Portal (e.g. in the morning, or from a different Internet/Grid location) and analyse the updated status of the experiments. After each experiment has completed, the performance and output data are automatically stored in a relational PostgreSQL-based data repository. An Application Data Visualiser portlet of the User Portal can be employed to automatically issue SQL queries against the data repository and display visualisation diagrams (line charts, bar charts, surfaces) that show the evolution of performance and output data across multiple sets of experiments. Flexibility is given for mapping ZEN-annotated parameters to arbitrary visualisation axes, generating meaningful diagrams as needed by the user. Examples of such visualisation diagrams include: the variation of execution time with machine size (scalability study); the impact of different interconnection networks (and TCP buffer sizes), loop scheduling strategies, or array distributions on the overall execution time; and the variation of an output result as a function of various input parameters.
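
As a purely illustrative aside (ZEN's actual directive syntax is not reproduced here), the C sketch below shows the cross-product enumeration an experiment generator performs over a handful of hypothetical annotated parameter value ranges.

```c
/*
 * Illustrative sketch only: enumerating the cross product of hypothetical
 * parameter value ranges, as an Experiment Generator must do for the
 * annotated parameters of an application.
 */
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical value ranges for a machine-size parameter, a compiler
     * flag, and a problem size. */
    const int   nprocs[]   = { 1, 2, 4, 8 };
    const char *optflags[] = { "-O2", "-O3" };
    const int   sizes[]    = { 128, 256 };

    int id = 0;
    for (size_t i = 0; i < sizeof nprocs / sizeof *nprocs; i++)
        for (size_t j = 0; j < sizeof optflags / sizeof *optflags; j++)
            for (size_t k = 0; k < sizeof sizes / sizeof *sizes; k++)
                printf("experiment %2d: nprocs=%d flags=%s size=%d\n",
                       ++id, nprocs[i], optflags[j], sizes[k]);
    return 0;
}
```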

ZENTURIO has been designed as a Grid application on top of existing state-of-the-art Grid middleware technologies, such as Web services standards, the Open Grid Services Architecture (OGSA), and the Globus Toolkit. We employ the Grid Security Infrastructure (GSI) together with message-level security (based on XML digital signatures and XML encryption) for authentication and secure communication, GridFTP for experiment transfer from the user to the execution sites, and GRAM and DUROC for job submission on the Grid. We proposed a novel use of the UDDI Service Repository to accommodate implementations of transient Grid services. Factory and Registry services are employed to create and register, respectively, Grid services on the fly on arbitrary Grid sites.

We have successfully applied ZENTURIO on various scientific codes developed in the context of the AURORA multi-disciplinary Austrian special research program, including a material science kernel, a particle-in-cell simulation, various FFT benchmarks, and a financial application.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Analysing and Enhancing Parallel I/O in mpiBLAST
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Fernando Moraes (Los Alamos National Laboratory), Wu-chun Feng (Los Alamos National Laboratory)
   
  Description:
  BLAST, a family of sequence database-search algorithms, serves as the foundation for bioinformatics. The BLAST algorithms search for similarities between a query sequence and a large, infrequently changing database of DNA or amino-acid sequences. For a newly discovered sequence, similarities between the new sequence and a gene of known function can help to identify the function of the new sequence.

Unfortunately, BLAST has proven to be too slow to keep up with the current rate of sequence acquisition. As a result, many solutions to parallelize BLAST have been proposed. Arguably, the most successful parallelization is mpiBLAST, an open-source parallelization based on a new technique called database segmentation. mpiBLAST is the first and only parallel implementation of BLAST that delivers super-linear speedup. It achieves such performance by reducing (or even eliminating) extraneous disk I/O. For a more in-depth discussion on mpiBLAST the interested reader is referred to "The Design, Implementation, and Evaluation of mpiBLAST" by Darling, Carey, and Feng, Clusterworld Conference & Expo, June 2003.

Although the disk I/O issues in the aforementioned paper were thoroughly addressed, the network I/O issues between the master node and worker nodes in mpiBLAST were largely ignored because the authors assumed that the sequence database was being used in a dedicated environment and would therefore not change between consecutive queries. However, in a shared environment (e.g., different research groups within the same institution doing BLAST sequencing), such an assumption may not always hold true. Different sequence databases may be required between consecutive queries. Consequently, we must take the distribution of the database into account when executing queries in a shared environment.

In its current incarnation, all the worker nodes simultaneously attempt to access a shared disk location, creating a "hot spot" in the parallel computing system. As the number of workers increases, the disk contention at the shared location rises dramatically. For instance, a 2-GB database fragmented into 25 parts takes roughly 1000 seconds to transfer. This is higher than expected, so an effort was made to improve I/O performance.

The idea behind the I/O enhancement was to pipeline the parallel disk access so as to disperse the "hot spot," thus eliminating I/O contention. This provides a 40% improvement in I/O time and allows any given worker node to begin computation immediately upon receipt of its database fragment, e.g., the first worker node begins computation 40 seconds into the I/O round rather than 1000 seconds into it. Furthermore, this new implementation creates a greater dispersal in the computation end-times of the workers, and thus leaves very little 'queue-waiting' in subsequent communication rounds.
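
For illustration only (this is not the mpiBLAST source), the MPI sketch below conveys the pipelined idea: the master streams one fragment at a time, so only one transfer touches shared storage at a time and each worker starts its search as soon as its fragment arrives.

```c
/*
 * Illustrative sketch only, not mpiBLAST itself: the master hands out
 * database fragments one worker at a time, dispersing the "hot spot" on
 * shared storage; each worker computes as soon as its fragment arrives.
 */
#include <mpi.h>
#include <stdlib.h>

#define FRAG_BYTES (1 << 20)   /* stand-in for one database fragment */

static void search_fragment(const char *frag, int bytes) { (void)frag; (void)bytes; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *frag = malloc(FRAG_BYTES);

    if (rank == 0) {
        /* Master: read fragments from storage one at a time and stream
         * them to the workers, so only one transfer is in flight. */
        for (int w = 1; w < size; w++) {
            /* ... read fragment w from disk into frag ... */
            MPI_Send(frag, FRAG_BYTES, MPI_CHAR, w, 0, MPI_COMM_WORLD);
        }
    } else {
        /* Worker: begin the BLAST search immediately on receipt instead
         * of waiting for every other worker's copy to finish. */
        MPI_Recv(frag, FRAG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        search_fragment(frag, FRAG_BYTES);
    }

    free(frag);
    MPI_Finalize();
    return 0;
}
```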

If accepted, the poster will provide an in-depth analysis and comparison of the I/O in both implementations.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Performance Analysis of On-Host Communication in the Los Alamos-Message Passing Interface
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Timothy Brian Prins (Advanced Computing Laboratory Los Alamos National Laboratory), Richard L. Graham (Advanced Computing Laboratory Los Alamos National Laboratory)
   
  Description:
  In order to attain high performance in cluster computing using the Message Passing Interface (MPI) environment it is necessary to utilize efficient on-host communication methods for passing local messages. We first examine four different methods of on-host communication as they apply to an MPI implementation. We look at pipes, sockets, shared memory, and the Shmem library, detailing their advantages and disadvantages as well as providing latency and bandwidth data.
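
For illustration only (not LA-MPI internals), the C sketch below shows the kind of POSIX shared-memory segment an on-host transport can use so that two local processes exchange a message without touching the network stack; a peer process would open and map the same named segment.

```c
/*
 * Illustrative sketch only: creating and writing a POSIX shared-memory
 * segment as an on-host message buffer.  A peer process would shm_open()
 * the same name and mmap() it to read the message.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/onhost_msg_demo"   /* hypothetical segment name */
#define SHM_SIZE 4096

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SHM_SIZE) != 0) {
        perror("shm setup");
        return 1;
    }

    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* "Send": copy the payload into the shared segment.  A real transport
     * adds flags or fences so the receiver knows the data is ready. */
    strcpy(buf, "hello from the sender process");

    munmap(buf, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);   /* remove the segment when done */
    return 0;
}
```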

Next we examine the shared memory approach to on-host communication implemented in the Los Alamos Message Passing Interface (LA-MPI), a high-performance, fault tolerant MPI designed for terascale clusters. We compare the performance of LA-MPI to other MPI implementations including MPICH and Compaq's MPI (Alaska), as well as several simple benchmarks which measure the latency and bandwidth of shared memory and the Shmem library. Based on these results, we suggest several methods of improving the on-host performance of LA-MPI.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Hardware Performance Counters: Is What You See, What You Get?
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Alonso Bayona (The University of Texas at El Paso), Manuel Nieto (The University of Texas at El Paso), Patricia J. Teller (The University of Texas at El Paso)
   
  Description:
  Hardware performance counters have become an invaluable asset for application performance tuning. However, lack of standards and documentation can cause pitfalls and misinterpretations when using hardware counters for characterization of application performance. Thus, although hardware performance monitors can provide valuable data to the application programmer, they must be understood! In this poster we outline the research that we have been doing with respect to understanding the data generated by hardware performance counters on a variety of platforms. Via the format of a "quiz," i.e., questions and answers, we demonstrate the need for counter data calibration, and illustrate how some hardware metrics can be misleading due to incorrect interpretation and how some hardware metrics can be "downright puzzling."

Performance counters are a set of registers that count events (e.g., cache hits) that occur during the execution of a program. Interfaces provided by vendors or by third-party APIs like PAPI, developed by the University of Tennessee-Knoxville (UTK), make the counters accessible to application programmers. Event counts can be useful to application programmers for evaluating the performance of sections of code, analyzing the effect of compiler options on program execution, or deriving performance metrics for an application. Performance counters have gained acceptance in the high-performance computing (HPC) community because they can help application programmers identify and correct sources of performance degradation and, thus, attain better utilization of today's superscalar microprocessors. Because of the design complexity of these microprocessors, performance counter data can be difficult to interpret for some events. By interpreting data generated by performance counters and determining under what circumstances performance counter data can be used with confidence, our work can assist application developers in their use of hardware performance counters to decrease the execution time of their applications.

The methodology used to study the hardware performance counts generated for an event consists of seven phases, which are repeated as is necessary: micro-benchmark, prediction, data collection-1, data collection-2, comparison, statistical analysis, and implementation of an alternative verification approach. The micro-benchmark phase involves designing and implementing a validation micro-benchmark that is simple and small enough to permit event count prediction. In the prediction phase event counts are estimated using tools and/or mathematical models. The data collection-1 phase consists of collecting hardware-reported event count data using a high-level API such as PAPI, the HPM Toolkit, or the PM Toolkit. The data collection-2 phase, which is not always necessary or possible, makes use of a simulator to collect predicted event count data. In the comparison phase predicted and hardware-reported event counts are compared, after which statistical analysis is performed to identify and possibly quantify differences, and detect cases requiring further investigation. When analysis indicates that prediction is not possible, an alternate approach is implemented to either verify reported event count accuracy or demonstrate that the reported event count seems reasonable.
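
For illustration only, the C sketch below shows a validation micro-benchmark in the spirit of this methodology: the loop performs a predictable number of floating-point operations, which is then compared against the hardware-reported count obtained through PAPI (assuming the PAPI_FP_OPS preset event is available on the platform).

```c
/*
 * Illustrative sketch only: a validation micro-benchmark that compares a
 * predicted floating-point operation count with the value reported by the
 * hardware counters through PAPI.
 */
#include <papi.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[1];
    volatile double x = 1.0;   /* volatile keeps the loop from being optimized away */

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT ||
        PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_FP_OPS) != PAPI_OK) {
        fprintf(stderr, "PAPI setup failed\n");
        return 1;
    }

    PAPI_start(event_set);
    for (int i = 0; i < N; i++)    /* predicted: roughly N floating-point ops */
        x = x * 1.000001;
    PAPI_stop(event_set, counts);

    printf("predicted ~%d FP ops, counter reported %lld\n", N, counts[0]);
    return 0;
}
```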

Micro-benchmarks have been designed and implemented to monitor events such as: number of loads and stores completed, floating-point instructions executed, instruction and data cache hits, data TLB misses, and square root instructions completed. As the poster will illustrate, our findings range from concluding that performance counter data agrees with our predictions to concluding that counter data may be misleading and a better understanding of what is being counted is needed to properly use this information for performance tuning. We are studying a variety of platforms, e.g., Intel's Itanium and Pentium, IBM's Power Series, and SGI's MIPS R10K and R12K microprocessors.

With Luiz DeRose of IBM, we also are using performance counters to evaluate the effect of compiler options on program execution. For example, some experiments are being used to analyze the implications of using the -qrealsize=8 Fortran compiler option, which has been found to degrade the performance of the studied application when combined with D0 constants in the source code. In collaboration with UTK, we worked with DoD applications programmers to help them better understand performance counter data on the IBM Power3 platform.

The results of these experiments are being archived in an on-line repository called RIB (Repository In a Box) so that the community can avail itself of them. The RIB contains scripts, micro-benchmarks, reports, spreadsheets, and explanations of our experiments, making the experiments reproducible.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Public Server for High-Throughput Genome Analysis
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Dinanath Sulakhe (Argonne National Laboratory), Alex Rodriguez (Argonne National Laboratory), Natalia Maltsev (Argonne National Laboratory), Elizabeth Marland (Argonne National Laboratory)
   
  Description:
  During the past decade, the scientific community has witnessed an unprecedented accumulation of gene sequence data and data related to the physiology and biochemistry of organisms. To date, more than 149 genomes have been sequenced and genomes of 747 organisms are at various levels of completion [GOLD http://wit.integratedgenomics.com/GOLD/]. In order to exploit the enormous scientific value of this information for understanding biological systems, the information should be integrated, analyzed, graphically displayed, and ultimately modeled computationally. The emerging systems biology approach requires the development of high-throughput computational environments that integrate (i) large amounts of genomic and experimental data and (ii) powerful tools and algorithms for knowledge discovery and data mining. Most of these tools and algorithms are very CPU-intensive and require substantial computational resources that are not always available to researchers. The large-scale, distributed computational and storage infrastructure of the TeraGrid offers an ideal platform for mining such large volumes of biological information. Indeed, the targeted performance of the NSF Distributed Terascale Facility and the DOE Science Grid amounts to trillions of floating-point operations per second (teraflops) and storage of hundreds of terabytes of data. In addition to using the large-scale distributed infrastructure, it is desirable to provide a secure collaborative environment where scientists and researchers can discuss, share, and analyze data.

In order to address the issues presented above, we have developed a public server, GNARE, and an automated analysis pipeline, GADU. The Genome Analysis and Databases Update system, GADU, was presented at the Supercomputing Conference 2002 (SC2002). GADU is an automated, high-performance, scalable computational pipeline for data acquisition and analysis of newly sequenced genomes. GADU allows efficient automation of the major steps of genome analysis: data acquisition and data analysis by a variety of tools and algorithms, as well as data storage and annotation.

Analysis of biological data on various levels of organization (e.g. sequence level, genome architecture level, analysis of gene networks) requires integration of data, scientific tools and tools for collaborative research into one analytical environment. In order to provide this environment, we are developing GNARE, Genome Analysis Research Environment.

GNARE is a public genome analysis server that will provide: (i) an integrated computational environment containing tools and algorithms developed at ANL and by other groups for analysis of the data; (ii) pre-defined as well as customized scientific workflow pipelines for efficient analysis of biological data; (iii) the TeraGrid infrastructure for performing CPU-intensive computational tasks (via the GADU gateway), using technologies such as Condor, Globus, Chimera, and the Java CoG Kit; and (iv) a portal-technology-based user interface.

We are using the Jakarta Jetspeed technology to build this portal. Jetspeed provides the basic building blocks needed to design and develop a portal, including database user authentication, template-based layouts (JSP and Velocity), custom default home page configuration, WML support, a web application development infrastructure, database support (Oracle, DB2, MySQL, Postgres, Sybase), and many more features.

In addition to these features, the portal will also provide collaborative tools such as calendars, discussion boards, chat, and file upload and download.

A significant amount of work has already been done on the server. The user interface is ready, and a few of the tools can be accessed through the server. At present all of the tools are accessed as standalone tools; no workflows or pipelines are included yet. The existing CGI scripts have been ported into the server, and portlets will soon be written for all of the tools and for building workflows. Security issues for job submission will be handled using Globus certification.

The poster will portray the need for such a system by demonstrating the enormous amounts of data produced by the sequenced genomes. It will show the flow a scientist would follow in further characterizing and analyzing a protein sequence through the use of systems biology. The poster will also show the advantages of using portal technology by illustrating the different levels of user participation and collaboration, and it will demonstrate the various tools that can be used through GNARE.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Dynamic Sessions for the Grid Environment
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Karl Doering (University of California, Riverside), Kate Keahey (Argonne National Laboratory)
   
  Description:
  Problem Statement
Grid computing, defined as flexible, secure, coordinated resource sharing within virtual organizations (VOs), is becoming increasingly widespread. This mode of computing seeks to create a model whereby VOs authorize the sharing of resources according to (potentially) dynamic policies. Unfortunately, the implementation of this concept in current computational Grids is generally limited to mapping a Grid identity to a local identity on a specific resource. These static local identities (typically taking the form of Unix user accounts) are intended for the enforcement of site-specific policies on the local resource, not for enforcing dynamically-changing fine-grained VO policies.

What is lacking is a set of abstractions and mechanisms that would allow users to obtain and manage local protection environments (such as traditional Unix accounts, virtual machines, or custom sandboxes) based on fine-grained policies specified by the VO or other stakeholders, rather than obtaining a preconfigured account directly from the resource owner. Such an abstraction should allow for interfacing with multiple implementations suited to different circumstances, rather than confining the resource owner or VO to a specific technology. Furthermore, the abstraction should be generalizable to composite environments spanning multiple resources.

Approach
In our work, we introduce the abstraction of a dynamic session, which combines protection, fine-grained resource management, and session state in the Grid. Dynamic sessions allow users to manage session properties (such as termination time or other characteristics of the embodied resources) based on agreements and authorization policies. Formalizing the notion of a session standardizes interactions so that sessions can be implemented using a diverse set of technologies. In addition, standardizing session creation alleviates the administrative burden involved in providing users with fine-grained access to VO resources.

We model dynamic sessions as Grid services, allowing them to be acquired and manipulated via standard Open Grid Services Infrastructure (OGSI) mechanisms. Specifically, we have implemented an OGSI-compliant Dynamic Session Factory Service, which securely instantiates new sessions, and a Dynamic Session Service for session management. As part of the creation process, the session is mapped to an implementing technology, and policy is written governing session management. Dynamic sessions expose basic properties such as disk space usage, CPU time, and session termination time as Service Data Elements (SDEs). Session properties may be queried and manipulated by authorized users dynamically as permitted by the defined policies. The properties are designed to be platform-independent and extensible. Sessions can be combined to form complex sessions. In order to experiment with session use, we have extended the Globus Resource Allocation Manager (GRAM) architecture to allow job submissions against dynamic sessions.

Status
We have implemented the infrastructure described above in the Globus Toolkit 3 (GT3). We are currently investigating the properties of a variety of sandboxing technologies (ranging from Unix accounts to software fault isolation, system call interposition, runtime monitoring, virtual machines, and software dynamic translation) that may be employed to implement dynamic sessions. We are in the process of evaluating their qualitative and quantitative tradeoffs and their applicability in a distributed computing environment like the Grid, considering feature desirability and performance characteristics.

Presentation
Our poster submission presents the new architecture we are using. It also includes charts and figures depicting qualitative and quantitative differences among different sandboxing technologies available for dynamic session implementation and enforcement, illustrating their applicability for use in a Grid environment. Planned future work includes examining how to efficiently checkpoint sessions in the Grid, providing fault tolerance as well as the ability to migrate sessions across machines.
  Link: --
   

 

     
  Session: Poster Reception
  Title: NAS Parallel Benchmarks: new territory, new releases.
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Rob F. Van der Wijngaart (Computer Sciences Corporation)
   
  Description:
  Problem: Gauging the capabilities of high-performance computing (HPC) systems is becoming increasingly difficult with the introduction of more complex and more demanding applications and the addition of new architectural features. More and better tools are needed to cover the architecture and application space adequately. The NAS Parallel Benchmarks (NPB) were among the first widely used application-level HPC benchmark suites. They distinguish themselves from other benchmark suites by their realistic specifications yet compact implementations. However, they have some limitations. Except for Integer Sort, all NPB concern floating-point-intensive applications whose memory accesses are identical across many iterations. Furthermore, except for a sparse matrix problem, memory strides are very regular, which can be exploited by specialized hardware. The benchmarks were constructed such that they would feature fairly easily load-balanced, loop-level parallelism, which was exploited in the reference implementations based on the Message Passing Interface (MPI). Finally, no significant I/O was performed by any of the benchmarks, and problem sizes that were challenging at the time the NPB were created are no longer considered substantial.

Approach: We address all these issues through several major extensions of the NPB suite:
1) Release of source code implementations of the NPB using parallelization techniques and paradigms other than message passing: OpenMP, Java multi-threading, and High Performance Fortran.
2) Increase in memory footprint of the previous largest NPB class by a factor of 16, and of the computational work by a factor of 20. This is roughly in keeping with the increase in computational power and in the growth of available (cache) memory per processor of current HPC systems.
3) Addition of an I/O intensive benchmark, derived from the previously unreleased BTIO benchmark. Parallel reference implementations are provided using MPI and a variety of techniques for collecting distributed data into a single output file. The special significance of this benchmark is that it allows users to measure the overhead of parallel I/O within the context of a realistic application that allows overlap of write/communicate and compute operations.
4) Addition of a suite of multizone benchmarks that feature multilevel exploitable parallelism, suited for hybrid and multilevel parallelization techniques. The problems are formulated such that both coarse-grain and loop-level parallelism must be exploited to obtain good performance results. They are composed of subproblems, each a single mesh-based NPB, that communicate boundary values after each iteration. Using different-sized subproblems within a multizone benchmark stresses the load-balancing capabilities of the parallelization paradigm. A hybrid reference implementation (MPI + OpenMP) is also provided; a skeleton of this hybrid structure is sketched after this list.
5) Introduction of a completely new benchmark, featuring an irregular, adaptive mesh and the concomitant irregular, dynamically changing memory accesses. This type of application stresses many aspects of HPC systems, most notably memory traffic and dynamic load balancing. A reference implementation in OpenMP is provided.
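
The following is a skeleton of the hybrid MPI + OpenMP structure targeted by the multizone benchmarks (zone counts, sizes, and the per-zone kernel are hypothetical placeholders, not the NPB-MZ reference code): MPI ranks own groups of zones (coarse-grain parallelism) while OpenMP threads parallelize loops within each zone (loop-level parallelism).

    /* Sketch of a hybrid MPI + OpenMP program structure: zones are distributed
     * across MPI ranks, and loops within a zone are parallelized with OpenMP.
     * Zone sizes, counts, and the "solve" kernel are placeholders. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define ZONES_PER_RANK 4
    #define ZONE_POINTS    10000
    #define STEPS          10

    static void solve_zone(double *u, int n)
    {
        /* loop-level (OpenMP) parallelism inside one zone */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            u[i] = u[i] * 0.5 + 1.0;   /* stand-in for the real per-zone solver */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *zones[ZONES_PER_RANK];
        for (int z = 0; z < ZONES_PER_RANK; z++)
            zones[z] = calloc(ZONE_POINTS, sizeof(double));

        for (int step = 0; step < STEPS; step++) {
            for (int z = 0; z < ZONES_PER_RANK; z++)   /* coarse-grain: zones per rank */
                solve_zone(zones[z], ZONE_POINTS);
            /* boundary-value exchange between zones on different ranks would go
             * here (e.g., MPI_Sendrecv); omitted in this sketch */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        for (int z = 0; z < ZONES_PER_RANK; z++)
            free(zones[z]);
        MPI_Finalize();
        return 0;
    }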

Status: Specification and reference implementations of all new NPB are publicly available from NAS at www.nas.nasa.gov/Software/NPB

Presentation: Details of the features of the new NPB will be shown, along with graphs of a number of interesting performance results obtained by NAS and by other research groups and vendors that compare different platforms and techniques, highlighting scalability and absolute performance.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Parallel landscape fish model for South Florida ecosystem simulation
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Dali Wang (The Institute for Environmental Modeling, University of Tennessee, Knoxville), Louis Gross (The Institute for Environmental Modeling, University of Tennessee, Knoxville), Michael Berry (Department of Computer Science, University of Tennessee, Knoxville)
   
  Description:
  Introduction
The landscape of South Florida is a complex environment that has been subjected to years of environmental stress. Disruptions in the natural water flows have been the catalyst for profound changes in the vegetation and animal life of the region. Attempts are now being made to reduce the devastating effects of alterations of the natural hydrology on South Florida ecosystems. The effects of modifications to the hydrology need to be modeled to ensure that changes do not further harm the natural systems and that they provide benefits appropriate to the planned large expenditures of funds. An effective way to evaluate the effects of hydrologic modifications is through computer modeling. The Across Trophic Level System Simulation (ATLSS), a family of models, was developed to address regional environmental problems that span a wide variety of spatial, temporal, and organismal scales. The ATLSS Landscape Fish Model (ALFISH) is one component of ATLSS. One objective of the ALFISH model is to compare the relative effects of alternative hydrologic scenarios on freshwater fish densities across South Florida. Another objective is to provide a measure of the dynamic, spatially explicit food resources available to wading birds.

ALFISH model components
The ALFISH model contains four basic components: a landscape component, a hydrology component, lower trophic components, and a fish component. (1) Landscape: the total area of the Everglades is represented as a landscape matrix (419 by 264). Each landscape cell in the ALFISH model represents a 500m by 500m region, which includes areas of marsh and pond. The difference between the marsh and pond areas is that the latter is always considered wet (contains water), while the marsh area dries as a function of the hydrology. (2) Hydrology: data is provided by an external hydrology model. (3) Lower trophic component: the values of the lower trophic level (food resources for fish) are a function of the hydrology data. (4) Fish component: the fish population simulated by ALFISH is size-structured and is divided into two functional groups, small and large fishes. Both groups can be present in the marsh and pond areas. Each functional group in each area is further divided into age categories. The population that occupies a cell area is represented as the fish density (biomass) within that cell. Basic fish behaviors are simulated in the model, including movement within cells, movement between cells, mortality, growth, and reproduction. The total simulation time of the ALFISH model is typically 31 years with a timestep of 5 days.

Parallelization Methodology

Landscape partition: In the parallel ALFISH model, the hydrology landscape was duplicated on all processors, reducing the time needed for data transfer and making the model well suited to dynamic load balancing. Because of the sequential I/O operation needed in each timestep, one processor was dedicated to collecting the data from the other processes and storing the data on disk. Each of the other processors simulates the fish behaviors in one row-wise block-striped partition of the landscape. The computation of fish mortality, growth, and reproduction is local, but the movement of fish depends on and influences adjacent cells, so data exchange and explicit synchronization are required.
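
A minimal sketch of this decomposition, assuming a 1-D row-wise block partition, a dedicated I/O rank, at least two MPI ranks, and no more workers than landscape rows (the fish dynamics and the actual ALFISH data structures are omitted):

    /* Sketch of a row-wise block-striped partition with one dedicated I/O rank,
     * in the spirit of the parallel ALFISH decomposition described above. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NROWS 419
    #define NCOLS 264

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) { MPI_Finalize(); return 0; }   /* needs I/O rank + workers */

        int io_rank = 0;                /* dedicated to collecting and writing data */
        int workers = size - 1;

        if (rank != io_rank) {
            int w = rank - 1;                          /* worker index 0..workers-1 */
            int rows = NROWS / workers + (w < NROWS % workers ? 1 : 0);
            /* local block plus one ghost row above and below */
            double *block = calloc((size_t)(rows + 2) * NCOLS, sizeof(double));

            /* exchange ghost rows with neighboring workers (fish movement couples
             * adjacent rows); edge workers talk to MPI_PROC_NULL */
            int up   = (w == 0)           ? MPI_PROC_NULL : rank - 1;
            int down = (w == workers - 1) ? MPI_PROC_NULL : rank + 1;
            MPI_Sendrecv(&block[1 * NCOLS],          NCOLS, MPI_DOUBLE, up,   0,
                         &block[(rows + 1) * NCOLS], NCOLS, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&block[rows * NCOLS],       NCOLS, MPI_DOUBLE, down, 1,
                         &block[0],                  NCOLS, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* ... local fish dynamics (mortality, growth, reproduction, movement) ... */

            /* ship the local block to the I/O rank for output */
            MPI_Send(&block[1 * NCOLS], rows * NCOLS, MPI_DOUBLE, io_rank, 2,
                     MPI_COMM_WORLD);
            free(block);
        } else {
            /* I/O rank: receive one block per worker and write it to disk (omitted) */
            for (int w = 0; w < workers; w++) {
                int rows = NROWS / workers + (w < NROWS % workers ? 1 : 0);
                double *buf = malloc((size_t)rows * NCOLS * sizeof(double));
                MPI_Recv(buf, rows * NCOLS, MPI_DOUBLE, w + 1, 2,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                free(buf);
            }
        }

        MPI_Finalize();
        return 0;
    }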

Dynamic load balancing: The computational intensity of the fish behavior component depends on the hydrology data, especially water depth and cell hydrologic type, as well as on transitions from wet to dry or vice versa. The computational cost associated with each landscape cell is also related to the fish density, distribution, and structure. At each timestep, the change in hydrology data shifts the computational intensity among processors. In order to better balance the workloads, dynamic partitioning is achieved by redistributing the partition boundaries among all processors, which in turn incurs extra data transfer during the model simulation.

Performance comparison and analysis on different parallel architectures
(1) Model performance comparison on an SMP and a cluster of workstations using MPI: The ALFISH model has been parallelized using MPI on both a symmetric multiprocessor (SMP) and clusters of workstations linked with high-bandwidth (1 Gb/s) connections. This experiment is designed to compare the performance of the message passing model on two typical kinds of computational platform.
(2) Model performance comparison on the SMP using MPI and Pthreads: The ALFISH model has also been parallelized using Pthreads on the SMP, which reduces communication time and context-switch time. This comparison shows how much of the performance improvement comes from the reduction of data communication and context switching on the SMP. Since ALFISH is a data-intensive computational model, the performance improvement from adopting Pthreads is significant.
  Link: --
   

 

     
  Session: Poster Reception
  Title: GMIT: A Tool for Indexing Self-describing Scientific Datasets
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Beomseok Nam (University of Maryland), Alan Sussman (University of Maryland)
   
  Description:
  To help navigate through large scientific datasets, we have built a generic multi-dimensional indexing library (GMIL) for self-describing scientific datasets. The most common type of retrieval pattern for such multi-dimensional datasets is a spatial range query, which reads a contiguous subset of one or more multi-dimensional arrays within the given query range. Using spatial indexing techniques, such as R-trees or their variants, our indexing tool allows direct access to subsets of a dataset, which can greatly improve I/O performance. Our generic indexing tool targets scientific data formats such as netCDF, HDF, and SILO, which contain structural metadata. Data stored in such self-describing formats may be easily accessed across heterogeneous platforms using the runtime library API for each format. Data stored in these formats also may contain application-specific semantic information about the contents of the file, so that no other information is necessary to interpret the data.

Our current implementation supports four different data formats and libraries (netCDF, HDF4, HDF5, SILO), but additional scientific data formats can utilize the tool by adding small amounts of code. The generic multi-dimensional scientific indexing library has an index creation module, an index search module, a resolution interpolation module, and a filtering module to facilitate performing range queries with various application-specific semantics. Some sensor datasets may contain data elements at different resolutions, even though they share the same spatial coordinate data; one part of a dataset may therefore have a higher resolution than the corresponding coordinate dataset. To address the problem of efficiently generating spatial coordinate information for such parts of a multi-resolution dataset, the indexing library supports resolution interpolation, providing spatial coordinate information as needed. In addition, the data returned by the indexing library for a range query may contain some elements that are not within the submitted query range, because the spatial index operates on blocks of data elements, not individual elements, for efficiency both in searching the index and in minimizing its size. The library therefore provides a filtering module to optionally remove the extra data elements on behalf of the client.
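
The block-level pruning and element-level filtering described above can be illustrated with a small self-contained example (the data, block size, and on-the-fly bounding-box test below are toy stand-ins; in the real library the bounding boxes would be stored in an R-tree and the data would come from netCDF/HDF files):

    /* Self-contained sketch of why a filtering module is needed: the spatial
     * index returns whole blocks whose bounding box intersects the query range,
     * so some returned elements fall outside the range and must be filtered out
     * on behalf of the client. */
    #include <stdio.h>

    #define NELEM 1000
    #define BLOCK 100          /* elements per indexed block */

    typedef struct { double x, y; } point;

    /* returns 1 if the block's bounding box intersects the query rectangle
     * (in a real index these boxes are precomputed and stored in the R-tree) */
    static int block_intersects(const point *blk, int n,
                                double xlo, double xhi, double ylo, double yhi)
    {
        double bxlo = blk[0].x, bxhi = blk[0].x, bylo = blk[0].y, byhi = blk[0].y;
        for (int i = 1; i < n; i++) {
            if (blk[i].x < bxlo) bxlo = blk[i].x;
            if (blk[i].x > bxhi) bxhi = blk[i].x;
            if (blk[i].y < bylo) bylo = blk[i].y;
            if (blk[i].y > byhi) byhi = blk[i].y;
        }
        return !(bxhi < xlo || bxlo > xhi || byhi < ylo || bylo > yhi);
    }

    int main(void)
    {
        static point data[NELEM];
        for (int i = 0; i < NELEM; i++) {          /* synthetic coordinates */
            data[i].x = i * 0.1;
            data[i].y = (i % 50) * 0.2;
        }

        double xlo = 5.0, xhi = 8.0, ylo = 1.0, yhi = 4.0;
        int candidates = 0, exact = 0;

        for (int b = 0; b < NELEM; b += BLOCK) {
            if (!block_intersects(&data[b], BLOCK, xlo, xhi, ylo, yhi))
                continue;                           /* index prunes this block */
            candidates += BLOCK;
            for (int i = b; i < b + BLOCK; i++)     /* filtering module */
                if (data[i].x >= xlo && data[i].x <= xhi &&
                    data[i].y >= ylo && data[i].y <= yhi)
                    exact++;
        }
        printf("candidate elements from index: %d, after filtering: %d\n",
               candidates, exact);
        return 0;
    }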

Experimental results, on NASA earth observing satellite and DOE astronomy datasets, have shown that the generic scientific indexing library greatly improves the performance of range queries, as compared to using the format-specific runtime libraries.

Ongoing research focuses on extracting and interpreting application-specific semantic metadata semi-automatically. Users will be provided with a GUI tool that enables specifying information that cannot be extracted from structural metadata, such as which data fields should be used to index a dataset. That information, combined with the generic indexing library, will allow for fast, easy generation of spatial indexes for scientific self-describing datasets. Additional ongoing work will provide several spatial indexing options, including multiple R-tree and kD-tree variants.

The poster will describe the overall structure of our library and tool, and display performance results on real scientific datasets over different classes of range queries.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Parallel hybrid optimization approaches for the solution of large-scale groundwater inverse problems on the TeraGrid
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Kumar Mahinthakumar (North Carolina State University), Mohamed Sayeed (North Carolina State University), Dongju Choi (San Diego Supercomputer Center), Nicholas Karonis (Northern Illinois University/Argonne National Laboratory)
   
  Description:
  Solution of groundwater inverse problems is an important, if not essential, step in developing efficient groundwater remediation and management strategies, but the application of groundwater inverse modeling to field problems has been limited by the lack of efficient and/or flexible algorithms and by immense computational requirements. We have developed an efficient optimization-based framework for solving three-dimensional groundwater inverse problems in a parallel computing environment. Our implementation is based on a hybrid genetic algorithm - local search (GA-LS) optimizer that drives a parallel finite-element groundwater simulator. The MPI (Message Passing Interface) communication library is employed to exploit data parallelism within the groundwater simulator and task parallelism within the optimizer. Data parallelism in the groundwater simulator is achieved through a domain decomposition strategy, and task parallelism in the optimizer is achieved through a dynamic self-scheduling algorithm. In particular, the use of MPI communicators in a hierarchical fashion has enabled us to achieve multiple levels of parallelism with limited code modification.
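
A minimal sketch of how hierarchical MPI communicators can provide two levels of parallelism of this kind (the group size, the leader communicator, and the evaluation step are illustrative assumptions, not the actual code):

    /* Sketch of hierarchical communicators: the optimizer farms candidate
     * solutions out to groups of ranks (task parallelism), and each group runs
     * the domain-decomposed simulator on its own sub-communicator (data
     * parallelism). */
    #include <mpi.h>
    #include <stdio.h>

    #define RANKS_PER_GROUP 4      /* ranks running one forward simulation */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* split the world into groups; each group evaluates one GA individual */
        int group = world_rank / RANKS_PER_GROUP;
        MPI_Comm sim_comm;
        MPI_Comm_split(MPI_COMM_WORLD, group, world_rank, &sim_comm);

        int sim_rank, sim_size;
        MPI_Comm_rank(sim_comm, &sim_rank);
        MPI_Comm_size(sim_comm, &sim_size);

        /* a second communicator linking the group leaders to the optimizer
         * (color 0 for leaders, MPI_UNDEFINED for everyone else) */
        MPI_Comm leader_comm;
        MPI_Comm_split(MPI_COMM_WORLD, sim_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        /* ... forward groundwater simulation on sim_comm; fitness values are
         * gathered over leader_comm by the optimizer ... */
        printf("world rank %d: group %d, simulator rank %d of %d\n",
               world_rank, group, sim_rank, sim_size);

        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&sim_comm);
        MPI_Finalize();
        return 0;
    }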

Two types of GAs, (i) a binary/integer-encoded GA (BGA/IGA) and (ii) a real-encoded GA (RGA), and four local search approaches, (i) the Nelder-Mead simplex method (NMS), (ii) the Hooke and Jeeves pattern search method (HKJ), (iii) Powell's method of conjugate directions (PWL), and (iv) Fletcher-Reeves conjugate gradients (FRCG), have been implemented. A flexible interface allows any combination of GA and LS approaches to be chosen, as well as standalone application of each approach. Our implementation also permits the simultaneous application of multiple local searches in parallel, starting from multiple initial guesses provided by the GA. Three types of groundwater inverse problems are tested: multiple biological activity zone identification, contaminant source identification (location and concentration), and multiple-source release history reconstruction.

The combination of parallel computing and efficient optimization algorithms has allowed us to test problems of a magnitude and complexity that have not been attempted before. It is worth emphasizing that many of the problems tested required only a few hours of computation time on an IBM SP3 supercomputer using 256 processors; similar problems would have required over a month of computation time on a high-end PC. In almost all problems tested, solutions were achieved with over 90% accuracy, even in the presence of moderate noise in the observations.

We are now in the process of extending this implementation to the NSF TeraGrid through the use of Globus-enabled MPI (MPICH-G2). The code suite has been built with MPICH-G2, and limited linked test runs have been completed between the Origin2000 at NCSA (soon to be retired) and the IBM SP3 (Blue Horizon) at SDSC. More recently, the code has been built and run with MPICH-G2 on the Linux cluster (Chiba City) at ANL. We are currently testing the code on the TeraGrid node at SDSC. We anticipate more extensive TeraGrid results involving multiple sites (SDSC, NCSA, PSC, and ANL) by the time of the poster presentation. Our TeraGrid tests will include the following: (i) scalability tests comparing a single cluster with multiple clusters, (ii) determination of the best processor configuration for a typical run (e.g., the optimal number of processors at each site, the number of fine-grained processors at each site, and the location of the master node), and (iii) the algorithmic performance of different optimization approaches on the TeraGrid. We also anticipate sharing our experiences in the following areas: (i) scheduling runs across multiple platforms, (ii) the overall TeraGrid environment, and (iii) suggestions for the future TeraGrid (including suggestions for middleware and hardware improvements).

Collaborations are also underway with the North Carolina Department of Environment and Natural Resources (NCDENR) to apply our methodology to field problems in which multiple-source release history reconstructions are needed to identify responsible parties in several contamination incidents in North Carolina. We envision this as an ideal real-world TeraGrid application in the near future.
  Link: --
   

 

     
  Session: Poster Reception
  Title: GHS: A Performance Prediction and Task Scheduling System for Grid Computing
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Xian-He Sun (IIT), Ming Wu (IIT)
   
  Description:
  Grid computing introduces a great challenge in task scheduling: how to partition and schedule tasks in a large, available but shared, heterogeneous computing system. Conventional parallel-processing scheduling methods do not apply to a Grid environment, where computing resources are shared and the Grid scheduler has no control over a virtual organization's local jobs. The key to Grid task scheduling, therefore, is to estimate the availability of the computing resources and to determine its influence on application performance. Some recent Grid tools have been developed to meet this need. However, these tools address short-term resource availability. For instance, the well-known Network Weather Service (NWS) only predicts the availability of a computing and communication resource for the next five minutes, and its satisfactory prediction range is in general much less than this five-minute upper bound.

In this study, we present a prototype implementation of a long-term, application-level performance prediction and task scheduling system for Grid computing, namely the Grid Harvest Service (GHS) system. Here, long-term means applications requiring hours or more of sequential execution time, in contrast to minutes; application-level performance means application turnaround time, in contrast to resource availability. GHS consists of a performance monitor component, a prediction component, a partition and scheduling component, and a user interface. The prediction component is built on a new performance model that combines stochastic analysis and direct numerical simulation. The model considers the computing capacity, utilization, and service distributions of heterogeneous resources. Unlike other stochastic models, this model individually identifies the effects of machine utilization, computing power, local job service, and task allocation on the completion time of a parallel application. It is theoretically sound and practically feasible.

Utilizing the performance prediction, various partition and scheduling algorithms are developed and adopted in the partition and scheduling component, for a single sequential task, a single parallel task with a given number of sub-tasks, optimal parallel processing, as well as a meta-task composed of a group of independent tasks. Heuristic task scheduling algorithms are proposed to find an acceptable solution at a reasonable cost. To reduce the performance loss caused by machine "mutation", a self-adaptive rescheduling algorithm is also developed to decide when and where to reschedule tasks if an abnormal usage pattern occurs on some computing resources. GHS uses an adaptive measurement methodology to monitor resource usage patterns, in which the measurement frequency is dynamically updated according to the previous measurement history. This method reduces the monitoring overhead considerably.
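
One possible form of such adaptive monitoring is sketched below (the update rule, thresholds, and load probe are hypothetical placeholders, not the GHS algorithm): the measurement interval grows while usage is stable and shrinks when usage changes abruptly.

    /* Sketch of adaptive measurement frequency: compare each new load sample
     * with the previous one and lengthen or shorten the polling interval
     * accordingly. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MIN_INTERVAL   10.0     /* seconds */
    #define MAX_INTERVAL  600.0

    static double measure_load(void)       /* stand-in for a real resource probe */
    {
        return (double)rand() / RAND_MAX;
    }

    int main(void)
    {
        double interval = 60.0, prev = measure_load();

        for (int step = 0; step < 20; step++) {
            double cur = measure_load();
            double change = cur > prev ? cur - prev : prev - cur;

            if (change < 0.05)              /* stable: back off, poll less often */
                interval *= 2.0;
            else if (change > 0.20)         /* abrupt change: poll more often */
                interval *= 0.5;

            if (interval < MIN_INTERVAL) interval = MIN_INTERVAL;
            if (interval > MAX_INTERVAL) interval = MAX_INTERVAL;

            printf("step %2d: load %.2f, next measurement in %.0f s\n",
                   step, cur, interval);
            prev = cur;
            /* a real monitor would sleep for `interval` before the next sample */
        }
        return 0;
    }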

A prototype of GHS has been developed for feasibility testing; it consists of components for performance monitoring, system-level and application-level prediction, task partition and scheduling, and the user interface. Initial experimental testing was conducted on Grid nodes at Argonne National Laboratory, Oak Ridge National Laboratory, and IIT. Experimental results show that GHS adequately captures the dynamic nature of grid computing. For large jobs, eight hours of runtime or more, its prediction error is less than 8% on a single machine. The prediction error is even smaller on multiprocessing systems. Argonne's Pitcairn is a multiprocessor with 8 UltraSPARC II processors and a local scheduler. Experiments on Pitcairn show that our method works even better in the presence of a local scheduler, with a prediction error of about 4%. Short-term prediction cannot be used for long-term prediction, and the performance of NWS in these situations is abysmal: in the same experimental environment, predictions with NWS show an average error of more than 100%.

Experiments were also conducted on task scheduling. The optimal scheduling consistently provides significantly better performance than random scheduling, while the heuristic scheduling consistently provides nearly the same performance as the optimal scheduling. The AppLeS scheduling system is probably the best-known Grid scheduling system; it uses NWS as its prediction component. For a fair comparison, we modified AppLeS to let it access GHS's predictions and compared the scheduling algorithms only. In scheduling large applications on a non-dedicated heterogeneous environment, the GHS scheduling system decreases task completion time by 10%-20% compared with AppLeS, while using only about half of the machines used by AppLeS. The reason is that GHS scheduling considers the effect of machine availability on parallel processing while AppLeS does not. We are currently integrating the GHS components with other Grid middleware to provide on-line performance prediction and task scheduling services in the Grid environment. In this poster presentation, we introduce the GHS system architecture, its functionality, and performance results, and we show the graphical user interface to demonstrate how to evaluate and schedule a large application in a shared computation environment.
  Link: --
   

 

     
  Session: Poster Reception
  Title: The OurGrid Project: Running Bag-of-Tasks Applications on Computational Grids
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Walfredo Cirne (UFCG), Nazareno Andrade (UFCG), Lauro Costa (UFCG), Glaucimar Aguiar (UFCG), Daniel Paranhos (UFCG), Francisco Brasileiro (UFCG)
   
  Description:
  Bag-of-Tasks (BoT) applications (those parallel applications whose tasks are independent) are both relevant and amenable to execution on computational grids. In spite of this suitability, few users currently execute their BoT applications on grids. This state of affairs is cause for concern. After all, if grids cannot deal with BoT applications, they are not going to be of much use.

The OurGrid project (http://dsc.ufcg.edu.br/ourgrid) encompasses our findings from investigating this very issue. We have identified four key features needed for the execution of BoT applications on grids: (i) a suitable grid working environment, (ii) automatic access granting, (iii) efficient application scheduling, and (iv) good fault-handling mechanisms. Unfortunately, no grid infrastructure in current use addresses all four issues in a comprehensive way. In this poster, we describe our efforts to provide such functionality.

Having a suitable grid working environment means providing the user with a set of abstractions that enable her to use the grid conveniently. To provide such an environment for BoT applications, we have developed the MyGrid grid broker. MyGrid provides simple abstractions on which the user can rely when using and programming for grids. It schedules the application over whatever resources the user has access to, whether this access is through some grid infrastructure (e.g., Globus) or via simple remote login (such as ssh). MyGrid was designed to be simple (the less the user has to get involved in grid details, the better) and complete (it must cover the whole production cycle, from development to execution).

MyGrid is built on two main virtualizations: the GridMachine, which provides the basic services needed to use a machine (i.e., remote task execution and file transfer), and the GridMachineProvider, which abstracts resource managers (e.g., supercomputer schedulers). This design simplifies interoperating with new resources: to use a new middleware or provider, it is enough to implement a new GridMachine or GridMachineProvider, respectively.
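
As a rough illustration of this kind of virtualization (MyGrid is not written in C, and the interface below, including all names and signatures, is purely hypothetical): any back end that can run a task and move files can be plugged in behind the same abstract interface.

    /* Illustrative sketch of a "grid machine" virtualization as a table of
     * function pointers; the broker only ever sees the abstract interface. */
    #include <stdio.h>

    typedef struct grid_machine {
        const char *name;
        int (*run_task)(const char *command);                    /* remote execution */
        int (*put_file)(const char *local, const char *remote);  /* file transfer */
        int (*get_file)(const char *remote, const char *local);
    } grid_machine;

    /* a trivial back end that just logs what it would do */
    static int ssh_run(const char *cmd) { printf("ssh: run '%s'\n", cmd); return 0; }
    static int ssh_put(const char *l, const char *r) { printf("ssh: put %s -> %s\n", l, r); return 0; }
    static int ssh_get(const char *r, const char *l) { printf("ssh: get %s -> %s\n", r, l); return 0; }

    int main(void)
    {
        grid_machine m = { "ssh-backend", ssh_run, ssh_put, ssh_get };

        m.put_file("input.dat", "/tmp/input.dat");
        m.run_task("./simulate /tmp/input.dat");
        m.get_file("/tmp/output.dat", "output.dat");
        return 0;
    }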

The second functionality, automatic access granting, is commonly left out of the scope of current grid middleware, despite the obvious fact that a user can make no use of grid middleware if she has no grid to use. Letting the user and/or system administrators "manually" assemble the grid typically produces a grid of somewhat modest size, limiting the gains that can be achieved. To address this issue, we are developing the OurGrid community, a P2P resource-sharing system targeted at BoT applications. With the OurGrid community, we aim to allow users of BoT applications to easily obtain access to and use large amounts of computational resources, dynamically forming large-scale grids on demand. OurGrid works as a network of favors: each site offers access to its idle resources to the community, expecting to later gain access to the idle resources of other participants. Moreover, using autonomous accounting, the network of favors ensures equity in resource sharing, providing an incentive for resource donations.

Third, although scheduling BoT applications may appear easy, grids introduce two issues that complicate matters. First, it is difficult to obtain information about the application (e.g. estimated execution time) and the grid (resource load, bandwidth, etc). Second, since many important BoT applications are also data-intensive, considering data transfers is paramount to achieve good performance.

We developed the Workqueue with Replication (WQR) algorithm to deal with the scheduling of compute-intensive applications when no such information is available. WQR delivers good performance using no information about the resources or tasks; instead, it replicates running tasks on idle machines. We investigated the performance of WQR under a variety of scenarios, comparing it with well-known algorithms.
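
A minimal sketch of the workqueue-with-replication idea, assuming a fixed replication limit and simple bookkeeping (this illustrates the scheduling policy only, not the MyGrid implementation): unstarted tasks go to idle machines first; once the queue is empty, idle machines receive replicas of still-running tasks, and whichever copy finishes first wins.

    /* Conceptual sketch of Workqueue with Replication (WQR). */
    #include <stdio.h>

    #define NTASKS       8
    #define NMACHINES    4
    #define MAX_REPLICAS 2

    enum state { WAITING, RUNNING, DONE };

    int main(void)
    {
        enum state task_state[NTASKS] = { WAITING };   /* all tasks start WAITING */
        int replicas[NTASKS] = { 0 };
        int next_task = 0;

        /* each loop iteration simulates one "a machine became idle" event */
        for (int event = 0; event < NTASKS + NMACHINES; event++) {
            int assigned = -1;

            if (next_task < NTASKS) {                    /* plain workqueue phase */
                assigned = next_task++;
            } else {                                     /* replication phase */
                for (int t = 0; t < NTASKS; t++)
                    if (task_state[t] == RUNNING && replicas[t] < MAX_REPLICAS) {
                        assigned = t;
                        break;
                    }
            }

            if (assigned >= 0) {
                task_state[assigned] = RUNNING;
                replicas[assigned]++;
                printf("idle machine gets task %d (replica %d)\n",
                       assigned, replicas[assigned]);
            }
            /* when any replica of task t reports completion:
             * task_state[t] = DONE, and the remaining replicas of t are cancelled */
        }
        return 0;
    }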

In order to support BoT applications that process large amounts of data, we have designed schedulers based on Storage Affinity, a metric that takes previously transferred data into account in order to minimize the application makespan. In contrast with other heuristics, Storage Affinity uses no resource load or application information to make scheduling decisions. Even so, experimental results show that it is possible to reach good performance levels using Storage Affinity with Replication.

Finally, grids are complex systems built on layered software abstractions. This is all fine and good when the abstractions work. However, when they fail (and they do fail), the transparency they provide is compromised, and the user needs to drill down to lower levels of abstraction in order to diagnose the failure. This requires understanding many different technologies, which is just too much for any single person. We argue that automated tests can help diagnose where the problem really is (application, configuration, middleware, or infrastructure), thus enabling better cooperation among the various people involved in the grid (users, application developers, system administrators, and middleware developers).
  Link: --
   

 

     
  Session: Poster Reception
  Title: High-level Language Support for Dynamic Data Distributions
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Steven J. Deitz (University of Washington), Bradford L. Chamberlain (University of Washington/Cray Inc.), Lawrence Snyder (University of Washington)
   
  Description:
  The tradeoff between global view of computation and explicit control of communication severely limits many of today's parallel programming languages. In global-view languages, programmers write their algorithm as a whole, largely disregarding the multiple processors that will implement the program. Consequently, explicit control of communication is often compromised; the compiler manages the parallel implementation details. In contrast, local-view languages require programmers to write their algorithm on a per-processor basis. Thus, communication is explicit in local-view languages but often difficult to control in global-view languages.

Traditionally, a global-view language can only offer explicit control of communication by placing severe limitations on expressibility. For example, writing an efficient sample sort algorithm would be difficult because a global-view language with explicit control of communication would likely lack facilities to specify data distributions and write a per-processor sort. New capabilities of ZPL, a global-view parallel programming language, allow for specifying data distributions and local-view coding without significant loss of either global-view of computation or explicit control of communication.

ZPL has proved to be portable, easy to program, and capable of achieving high performance. The basic construct underlying ZPL is the region. A region is an index set, with no associated data, used to declare parallel arrays and specify indices of a computation. Regions allow explicit communication because the programmer controls the induction of communication with a set of parallel array operators.

Our poster introduces two language level constructs, grids and domains, that let programmers specify and dynamically change the distribution of regions and arrays. A grid specifies a processor layout. A domain is bound to a grid and specifies a distribution on that grid. A region is bound to a domain and thus has its distribution specified. By allowing regions, domains, and grids to be reassigned, ZPL supports dynamic data distributions. Through a combination of static and dynamic checks, the compiler ensures communication is explicit because arrays declared over regions from different domains may not interact implicitly.

In addition, our poster introduces local variables and special array dimension types. These mechanisms are useful for writing algorithms, like sample sort, in which each processor is programmed to compute on its portion of a distributed array. The integrity of the array is maintained because, throughout the rest of the program, the array can still be viewed in a global context. Thus programmers can write per-processor code without losing sight of the global view of computation.

These powerful features are demonstrated in the context of the NAS parallel benchmarks. Preliminary performance results are shown.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Strong scalability analysis and performance evaluation of a CCA-based hydrodynamic simulation on structured adaptively refined meshes.
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Sophia Lefantzi (Sandia National Laboratories), Jaideep Ray (Sandia National Laboratories), Sameer Shende (University of Oregon)
   
  Description:
  Problem Statement:
Simulations on structured adaptively refined meshes (SAMR) pose unique problems in the context of performance evaluation and modeling. Adaptively refined meshes aim to concentrate grid points in regions of interest while leaving the bulk of the domain sparsely tessellated. Structured adaptively refined meshes achieve this by overlaying grids of different refinement. Numerical algorithms employing explicit multi-rate time-stepping methods apply a computational "kernel" to the finer meshes at a higher frequency than to the coarser meshes. Each application of the kernel at a given level of refinement is followed by a communication step in which data is exchanged with neighboring subdomains.
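
A minimal sketch of multi-rate time stepping on a refinement hierarchy (the refinement ratio, number of levels, and the kernel/exchange bodies are placeholders): each finer level takes several sub-steps for every step of the level above it, with a boundary exchange after each kernel application.

    /* Sketch of multi-rate sub-cycling over a SAMR level hierarchy. */
    #include <stdio.h>

    #define REFINE  2      /* refinement ratio between adjacent levels */
    #define LEVELS  3

    static void apply_kernel(int level)        { printf("kernel   on level %d\n", level); }
    static void exchange_boundaries(int level) { printf("exchange on level %d\n", level); }

    /* advance one step at `level`, recursing so that finer levels are sub-cycled */
    static void advance(int level)
    {
        apply_kernel(level);
        exchange_boundaries(level);

        if (level + 1 < LEVELS)
            for (int sub = 0; sub < REFINE; sub++)
                advance(level + 1);
        /* a real SAMR code would now average fine-level results back onto the
         * coarse level (restriction); omitted here */
    }

    int main(void)
    {
        advance(0);        /* one coarse time step drives the whole hierarchy */
        return 0;
    }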

The SAMR approach is adaptive, i.e. its characteristics change as the simulation evolves in time. Thus, scalability depends on the number of processors and the time-integrated effect of the physics of the problem. The time-integrated effect renders the estimation of a general metric of scalability difficult and often impossible. Generally, as reported in the literature, for realistic problems and configurations, SAMR simulations do not scale well.

For this work we analyzed two different hydrodynamic problems and present how communication costs scale with various aspects of the domain decomposition.

Approach:
The codes that we analyzed solve PDEs to simulate reactive flows and flows with shock waves. The codes were run until the incremental decrease in run times (with increasing processors) approached zero. We found that the nature of the problem changed vastly during a run: even runs that showed poor scaling had periods of evolution where the domain decomposition showed "good" scaling characteristics, i.e., compute loads were higher than communication loads. The computational load was found to be evenly balanced across the processors; the lack of scalability was due to the dominance of communication and synchronization costs over computational costs.

We identified and analyzed phases in the evolution of the problem where the simulation exhibited good and bad scaling. Communication costs were analyzed with respect to the levels of refinement of the grid as well as the data-exchange radius for each of the runs. This is a thorough performance analysis of SAMR hydrodynamics codes, performed for the first time in CCA-compliant codes, tackling the time-dependent nature of the communication overheads.

Both codes that we analyzed employ the Common Component Architecture (CCA) paradigm and were run within the CCAFFEINE framework. The adaptive mesh package used (which performs the bulk of the communication) was GrACE (Rutgers, The State University of New Jersey). The measurements were performed using the CCA version of TAU (Tuning and Analysis Utilities). The tests were performed on "platinum" at NCSA (University of Illinois, Urbana-Champaign), a Linux cluster of dual-processor 1 GHz Pentium III nodes connected via a Myrinet interconnect.

Visual:
As a part of the visual presentation, we will present a color poster with our performance analysis results and hold a demonstration of the composition and execution of CCA codes. Animations of the adaptively refined grid will also be shown.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Dynamic Partitioning for High-Speed Network Intrusion Detection
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Jens Mache (Lewis & Clark College), Jason Gimba (Lewis & Clark College), Thierry Lopez (Lewis & Clark College)
   
  Description:
  As the speed of networks continues to increase (for instance, 9.6 Gbps for OC-192), the process of detecting intrusive or malicious network behavior becomes more difficult. Network-based intrusion detection systems (NIDS) detect attacks such as denial of service or exploitation of program vulnerabilities by statefully analyzing packets and matching them against attack signatures. Unfortunately, a conventional single-sensor architecture can be overloaded by high-speed traffic. If not all packets can be analyzed, some attacks will go undetected.

In 2002, researchers at the University of California, Santa Barbara's Reliable Software Group proposed the following solution [1]: run several NIDS sensors in parallel and divide the overall network traffic into subsets that contain all the evidence necessary to detect each and every attack. This architecture consists of a traffic scatterer that forwards incoming packets in a round-robin fashion to machines designated as slicers. The slicers then methodically pass the packets to the reassemblers, which run a NIDS sensor, e.g., Snort.

We set out to improve this architecture with dynamic load balancing. Depending on the network traffic, some NIDS sensors may be overloaded while others are underutilized, and attacks may be missed because of this misallocation of computing resources. The novelty of our architecture is an additional manager that monitors the NIDS sensors and instructs the slicers to change the partitioning scheme if necessary. By dynamically changing the partitioning scheme, our architecture conducts its analysis with greater precision and fewer dropped packets.
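
A minimal sketch of the dynamic partitioning idea (the partition table, hash, load numbers, and rebalancing rule below are illustrative assumptions, not our actual slicer or manager code): a slicer picks the sensor for each packet from a table keyed by hashed destination address, and the manager rewrites table entries when one sensor's load grows too high.

    /* Sketch of a slicer partition table that a manager can rebalance
     * (two sensors for simplicity). */
    #include <stdio.h>

    #define NSENSORS 2
    #define NBUCKETS 8                        /* partition table entries */

    static int partition[NBUCKETS];           /* bucket -> sensor id */
    static int load[NSENSORS];                /* packets seen per sensor */

    static int bucket_of(unsigned dst_addr) { return (int)(dst_addr % NBUCKETS); }

    /* slicer side: pick the sensor for one packet */
    static int route_packet(unsigned dst_addr)
    {
        int s = partition[bucket_of(dst_addr)];
        load[s]++;
        return s;
    }

    /* manager side: if the busier sensor carries far more load, move one
     * bucket over to the other sensor */
    static void rebalance(void)
    {
        int hot = load[0] > load[1] ? 0 : 1;
        int cold = 1 - hot;
        if (load[hot] > 2 * (load[cold] + 1)) {
            for (int b = 0; b < NBUCKETS; b++)
                if (partition[b] == hot) { partition[b] = cold; break; }
            printf("manager: moved one bucket from sensor %d to sensor %d\n", hot, cold);
        }
    }

    int main(void)
    {
        for (int b = 0; b < NBUCKETS; b++) partition[b] = b % NSENSORS;

        /* skewed synthetic traffic: most packets hit even-numbered buckets */
        for (unsigned pkt = 0; pkt < 1000; pkt++)
            route_packet((pkt % 10 < 8) ? 2 * (pkt % 4) : pkt);

        printf("load before rebalance: sensor0=%d sensor1=%d\n", load[0], load[1]);
        rebalance();
        return 0;
    }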

Our poster will first clearly describe the problem and our solution of load-balanced high-speed network intrusion detection. We will show important pieces of our code (sockets and threads). We will then show our experiments and conclusions.

Our main performance evaluation testbed consisted of one scatterer, two slicers, and two NIDS sensors. One computer acted as the manager, and one computer generated traffic by modeling packets. In the interest of creating a realistic simulation of traffic, the latter could optionally skew data towards certain subnets. In our main experiment, we compared static partitioning without a manager to dynamic partitioning with a manager. Our results show that with a manager the load between NIDS sensors can be balanced dynamically and that up to 16% more packets could be analyzed.

Reference: [1] C. Kruegel, F. Valeur, G. Vigna, and R.A. Kemmerer, "Stateful Intrusion Detection for High-Speed Networks," in Proceedings of the IEEE Symposium on Research on Security and Privacy, Oakland, CA, May 2002.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Benchmarking Clusters vs. SMP Systems by Analyzing the Trade-Off Between Extra Calculations vs. Communication
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Anne Cathrine Elster (Norwegian Univ. of Science & Technology, Trondheim, Norway), Robin Holtet (Norwegian Univ. of Science & Technology, Trondheim, Norway)
   
  Description:
  Clusters of PCs have a high processing-power/cost ratio compared to traditional supercomputers, which makes them increasingly interesting alternatives for HPC applications. However, low-cost clusters also generally have network interconnects with higher latency and lower bandwidth than typical supercomputers. This makes such cluster systems potentially unsuitable for several communication-intensive applications.

So how does one (a) benchmark various types of clusters against other HPC systems and (b) optimize an application for clusters? Traditional supercomputers (typically SMP based) have well-defined architectures. Consequently, supercomputer vendors like SGI and Cray have their own implementations of MPI, optimized mathematical libraries, and optimizing compilers to get maximum performance from the hardware. However, there is no standard hardware solution for building clusters, since they are generally put together from a conglomerate of hardware choices regarding main boards, processors, and interconnect solutions. In addition, cluster systems tend to evolve and expand more rapidly than traditional supercomputers, making it even more challenging to provide optimized codes and benchmarks for them.

A recent paradigm has emerged called adaptive software or Automated Empirical Optimization of Software (AEOS), which is aimed at making new generations of performance-critical libraries more adaptive to different hardware. Several methods of automatic optimization are being actively researched. Some methods aim at automatically generating optimal code, while others use fixed code with parameters whose values need to be optimized. This work adopts the fixed-code paradigm, with a focus on generating meaningful benchmarks for comparing various HPC systems, including both clusters and SMP supercomputers.

The specific class of applications focused on in this work is domain decomposition problems that exchange border information between iterations, since these problems are naturally communication intensive. Examples of such computations arise, for instance, when using finite difference methods to solve partial differential equations (PDEs). These algorithms find approximate solutions by performing iterations that look at local/neighbor data (often called stencils), with a communication step between iterations that exchanges border information. However, by performing extra calculations, several iteration steps may be done between communication steps, thus saving communication. This can be beneficial since clusters tend to have very fast processors but relatively slow interconnects.
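
A minimal sketch of this trade-off for a 1-D three-point stencil, assuming a ghost (border) width of K cells and K iterations between exchanges (sizes and the stencil are placeholders, not our benchmark code): the wider border is recomputed redundantly, but only one exchange is needed per K iterations.

    /* Sketch of trading extra computation for less communication on a 1-D stencil. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOCAL 1024      /* interior cells per rank */
    #define K      4         /* ghost width = iterations per exchange */
    #define STEPS  100       /* total iterations (a multiple of K) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = rank == 0        ? MPI_PROC_NULL : rank - 1;
        int right = rank == size - 1 ? MPI_PROC_NULL : rank + 1;

        int n = NLOCAL + 2 * K;                 /* interior plus K ghosts each side */
        double *u = calloc(n, sizeof(double));
        double *v = calloc(n, sizeof(double));

        for (int step = 0; step < STEPS; step += K) {
            /* one exchange refreshes K ghost cells on each side */
            MPI_Sendrecv(&u[K],         K, MPI_DOUBLE, left,  0,
                         &u[n - K],     K, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[n - 2 * K], K, MPI_DOUBLE, right, 1,
                         &u[0],         K, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* K iterations without communication: the valid region shrinks by one
             * cell per side per iteration, so the interior stays correct */
            for (int it = 1; it <= K; it++) {
                for (int i = it; i < n - it; i++)
                    v[i] = 0.5 * (u[i - 1] + u[i + 1]);    /* simple 3-point stencil */
                double *tmp = u; u = v; v = tmp;
            }
        }

        free(u); free(v);
        MPI_Finalize();
        return 0;
    }

Increasing K reduces the number of exchanges in proportion, at the cost of 2*K extra cells of redundant computation per boundary per block of K iterations; the benchmark searches for the K that minimizes total time on a given system.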

Our code (which will be made available via the web by SC2003) finds the optimal amount of extra calculation to perform for a given problem size and dimension on a given cluster and/or SMP system.

Running our code on different types of machines gives the expected results: the cluster systems benefit greatly from the reduced communication, whereas the supercomputer systems more quickly start to suffer from the added calculations. This is expected, since the supercomputer system we tested on (a current Top 500 SGI system) has a much faster interconnection network than the cluster systems we tested, which had the 100 Mbit Ethernet interconnects typical for such systems.

The optimal border widths for clusters are generally larger than those for supercomputers. Also, for the largest problem sizes on the supercomputer, the method shows no particular performance gain. Variable runtimes caused by system noise such as other running processes have more impact on performance than the saved communication achieved through added computations.

Whether the grid is divided along one or two dimensions also proves to be relevant to performance. Division along one dimension gives the best results when using 4 processors; when using 16 processors, two-dimensional division is the better choice. This holds for both the cluster systems and the supercomputers tested.

For some systems there is not necessarily just one optimal amount of overlap for a given problem size, since the execution time goes up and down as the overlap is increased. We assume this to be an effect of compiler optimization combined with changing cache usage patterns as the border width changes. This also illustrates how hard it is to manually optimize code using these techniques.

Our poster presentation will include a couple of figures describing the partitioning schemes and C++ class structures used in our benchmarking code, as well as how to use it. Several graphs will show results from our benchmarks of several current HPC systems, including NTNU's 512-node production SGI Origin 3800 (up to 256 nodes tested), SGI's Altix SuperCluster with Itanium 2 processors, and a 32-node 100 Mbps Ethernet cluster of 1.46 GHz AMD Athlon nodes with 2 GB RAM each, as well as a small dual-Pentium SMP cluster.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Benchmarking a Parallel Coupled Model
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  J. Walter Larson (Argonne National Laboratory), Robert L. Jacob (Argonne National Laboratory), Everest T. Ong (Argonne National Laboratory), Anthony Craig (National Center for Atmospheric Research), Brian Kauffman (National Center for Atmospheric Research), Thomas Bettge (National Center for Atmospheric Research), Yoshikatsu Yoshida (Central Research Institute of the Electrical Power Industry), Junichiro Ueno (Fujitsu Limited), Hidemi Komatsu (Fujitsu Limited), Shin-ichi Ichikawa (Fujitsu Limited), Clifford Chen (Fujitsu America, Inc.), Patrick H. Worley (Oak Ridge National Laboratory)
   
  Description:
  The dramatic increases in computational power that parallel computing offers are enabling researchers to shift focus from the simulation of individual subsystems in isolation toward more realistic simulations comprising numerous mutually interacting subsystems. We call these more complex systems parallel coupled models. Parallel coupled models present numerous challenges in parallel coupling, including, but not limited to, parallel data transfer, intermesh interpolation, variable transformations, time synchronization of exchanged data, and merging of multiple data streams, all while maximizing overall performance.

A climate system model [1-3] is an excellent example of a coupled model, typically comprising atmosphere and ocean general circulation models (GCMs), a dynamic-thermodynamic sea ice model, a land-surface model, and a river transport model. Interactions between these component models are often managed by a special component called a flux coupler [4].

Parallel coupled models may be classified using the following characteristics: number of executable images (single vs. multiple), component scheduling (sequential vs. concurrent), and resource allocation (shared, disjoint, or overlapping pools of processors). NCAR's CCSM is a coupled model with concurrent component execution on disjoint pools of processors, using multiple executables [1]. The Parallel Climate Model (PCM) is a single executable with components executed sequentially on a shared pool of processors [2].

Benchmarking these models can be difficult. For coupled models with sequentially executed components sharing a common pool of processors, measurement of the coupled model's end-to-end throughput and a profile of the code executed on each processor are sufficient to guide performance tuning. Benchmarking coupled models with concurrently executing components is much more difficult, requiring a more comprehensive approach that includes benchmarking of the component models and coupling mechanisms, together with measurements of how intermodel interactions affect performance. Benchmarking of component models is a mature area of work [5], offering some estimate of a model's performance in a coupled environment. Benchmarking of coupling mechanisms is less mature, but fairly straightforward. Quantifying how intermodel interactions can impede overall performance is the most difficult part of the coupled-model benchmarking problem.

We are in the process of benchmarking a new version of CCSM that features a new, distributed-memory parallel coupler (CPL6) [4] built using the Model Coupling Toolkit (MCT) [6-8]. We are developing a set of benchmarks for MCT and CPL6 to gauge how CCSM's coupling mechanisms perform. These benchmarks measure important processes found in CPL6, such as intergrid interpolation of multiple fields, parallel data exchange between the component models and the coupler, and the global integrals and averages used in computing diagnostics. Finally, we are using a version of CCSM instrumented with MPE calls to analyze the coupler's performance in situ.

We will describe in detail our experiences in benchmarking the CCSM coupler. We will present MCT benchmark results for various commodity-based multiprocessors and vector platforms such as the Earth Simulator and the Cray X-1. We will also describe in detail how the interactions between the components in the CCSM affect its overall performance.

References
[1] National Center for Atmospheric Research (NCAR) Community Climate System Model (CCSM), http://www.ccsm.ucar.edu/models .
[2] Bettge, T., Craig, A., James, R., Wayland, V., and Strand, G. (2001): "The DOE Parallel Climate Model (PCM): The Computational Highway and Backroads," Proceedings of the International Conference on Computational Science (ICCS) 2001, V.N. Alexandrov, J.J. Dongarra, B.A. Juliano, R.S. Renner, and C.J.K. Tan (eds), Springer-Verlag LNCS Volume 2073, pp 149-158.
[3] Jacob, R.L., Schafer, C., Foster, I., Tobis, M., and Anderson, J., (2001): "Computational Design and Performance of the Fast Ocean-Atmosphere Model, Version One," Proceedings of the International Conference on Computational Science (ICCS) 2001, V.N. Alexandrov, J.J. Dongarra, B.A. Juliano, R.S. Renner, and C.J.K. Tan (eds), Springer-Verlag LNCS Volume 2073, pp 175-184.
[4] For examples, see the Web sites for the CCSM coupler (http://www.ccsm.ucar.edu/models/cpl6) and the European Centre for Research and Advanced Training in Scientific Computation (CERFACS) Ocean-Atmosphere-Sea-Ice-Surface (OASIS) coupler (http://www.cerfacs.fr/globc/software/oasis).
[5] For example, see the Fifth-generation Penn State/NCAR Mesoscale Model (MM5) benchmark Web site (http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20030305.html).
[6] Larson, J.W., Jacob, R.L., Foster, I.T., and Guo, J. (2001): "The Model Coupling Toolkit," Proceedings of the International Conference on Computational Science (ICCS) 2001, V.N. Alexandrov, J.J. Dongarra, B.A. Juliano, R.S. Renner, and C.J.K. Tan (eds), Springer-Verlag LNCS Volume 2073, pp 185-194.
[7] MCT Web site (http://www.mcs.anl.gov/mct).
[8] Ong, E.T., Larson, J.W., and Jacob, R.L. (2002): "A Real Application of the Model Coupling Toolkit," Proceedings of the International Conference on Computational Science (ICCS) 2002, C.J.K. Tan, J.J. Dongarra, A.G. Hoekstra, and P.M.A. Sloot (Eds.), LNCS Volume 2330, Springer-Verlag, pp. 748-757.
  Link: --
   

 

     
  Session: Poster Reception
  Title: GASNet: A High-performance, portable compilation target for parallel compilers
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Dan O Bonachea (U.C. Berkeley), Paul H Hargrove (LBNL)
   
  Description:
  GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages such as UPC, Titanium and Co-Array Fortran. The interface is primarily intended as a compilation target and for use by runtime library writers, and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".

The design of GASNet is partitioned into two layers to maximize porting ease without sacrificing performance: the lower level is a narrow but very general interface called the GASNet core API - the design is based heavily on Active Messages, and is implemented directly on top of each individual network architecture. The upper level is a wider and more expressive interface called the GASNet extended API, which provides high-level operations such as remote memory access and various collective operations.
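
The following toy, single-process C sketch (not GASNet code; all names here are hypothetical) illustrates the layering described above: a narrow Active-Message-style "core" on which a higher-level put operation is built. In real GASNet the handler dispatch happens across a network; here it is simulated with a local function call.

    /* Toy illustration of a two-layer design: an AM-style core plus a
       higher-level put built on top of it. Not GASNet code. */
    #include <stdio.h>
    #include <string.h>

    /* ---- "core API": register handlers, deliver a request ---- */
    typedef void (*am_handler_t)(void *payload, size_t nbytes, void *arg);

    #define MAX_HANDLERS 8
    static am_handler_t handler_table[MAX_HANDLERS];

    static void am_register(int index, am_handler_t h) { handler_table[index] = h; }

    static void am_request(int index, void *payload, size_t nbytes, void *arg)
    {
        handler_table[index](payload, nbytes, arg);   /* "network" delivery */
    }

    /* ---- "extended API": a blocking put built on the core ---- */
    static void put_handler(void *payload, size_t nbytes, void *dest)
    {
        memcpy(dest, payload, nbytes);                /* land data at target */
    }

    static void extended_put(void *dest, void *src, size_t nbytes)
    {
        am_request(0, src, nbytes, dest);             /* ship data via the core */
    }

    int main(void)
    {
        double remote_buf = 0.0, value = 3.14;
        am_register(0, put_handler);
        extended_put(&remote_buf, &value, sizeof value);
        printf("remote_buf = %g\n", remote_buf);
        return 0;
    }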

We have implemented GASNet on a number of high-performance networks, including Quadrics Elan, Myrinet GM, InfiniBand VAPI, and IBM LAPI, as well as a portable implementation on MPI 1.1. We present microbenchmark results demonstrating GASNet's high-bandwidth, low-overhead, and low-latency performance over these networks. We also present application performance results from the NAS parallel benchmarks compiled using the Berkeley UPC compiler, an open-source UPC compiler that targets the GASNet interface.

We find that GASNet successfully provides a portable, expressive and high-performance communication interface for compilers, which efficiently exposes the high-performance features of many supercomputing networks and systems.
  Link: --
   

 

     
  Session: Poster Reception
  Title: MPH: a Library for Coupling Climate Component Models on Distributed Memory Architectures
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Chris H.Q. Ding (Lawrence Berkeley National Laboratory), Yun He (Lawrence Berkeley National Laboratory)
   
  Description:
  The Problem
With the rapid increase in the computing power of distributed-memory computers and clusters of Symmetric Multi-Processors (SMPs), application problems are also growing rapidly in both scale and complexity. Effectively organizing large and complex simulation programs so that they are maintainable, reusable, shareable, and high-performance at the same time has become an important task in high performance computing.

A growing trend in developing large and complex applications on today's Teraflops computers is to integrate stand-alone and/or semi-independent program components into a comprehensive simulation package.

A multi-component approach to organizing software is a natural evolution for many large-scale simulations, such as climate modeling and engine combustion simulations. For example, in modeling long-term global climate, NCAR's Community Climate System Model (CCSM) consists of an atmosphere model, an ocean model, a sea-ice model, and a land-surface model. These model components interact with each other through a flux coupler component, and they require a general-purpose handshaking library to set up the distributed multi-component environment.

The Approach
a) Novel contribution
We study how such a multi-component, multi-executable application can run effectively on distributed-memory architectures. We have developed a library with a simple user interface for the initial handshaking: it lets different executables recognize each other and sets up a registry of executable names and communication channels among the components. We identify five effective execution modes, and the MPH library supports application development using any of them. MPH performs component-name registration, resource allocation, and initial inter-component communication in a flexible way.

On distributed-memory computers, a multi-component user application may consist of several executables, each of which may contain a number of program components. In MPH, we systematically study the following combinations: (1) Single-Component Executable, Single-Executable application (SCSE); (2) Multi-Component Executable, Single-Executable application (MCSE); (3) Single-Component Executable, Multi-Executable application (SCME); (4) Multi-Component Executable, Multi-Executable application (MCME); and (5) Multi-Instance Executable, Multi-Executable application (MIME).

A unified interface is provided for all of these software integration modes. We also support component joining, inter-component communication, inquiries about the multi-component environment, and input/output redirection.

One design goal of MPH is complete flexibility. The number of components and executables, the name of each component, and the processor allocation are all determined by a component registration file that is read in when the multi-executable job starts on different subsets of processors. One can trivially insert additional components into, or delete components from, the application system; we have found this to be an important feature in climate model development.
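
As a rough illustration of the handshaking idea only (C/MPI, not MPH's Fortran interface), one common way to give each named component its own communicator is to split MPI_COMM_WORLD by a component identifier; a real registry would read the component-to-processor mapping from a registration file rather than hard-coding it.

    /* Illustrative sketch only: per-component communicators via MPI_Comm_split. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, world_size, color, comp_rank;
        MPI_Comm comp_comm;
        const char *names[] = { "atmosphere", "ocean" };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* First half of the ranks play "atmosphere", second half "ocean";
           a real registry would read this mapping from a file. */
        color = (world_rank < world_size / 2) ? 0 : 1;

        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comp_comm);
        MPI_Comm_rank(comp_comm, &comp_rank);

        printf("world rank %d -> component '%s', local rank %d\n",
               world_rank, names[color], comp_rank);

        MPI_Comm_free(&comp_comm);
        MPI_Finalize();
        return 0;
    }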

b) Status of Work
MPH is an application-driven software development effort. MPH version 1 was first developed to support the single-component, multi-executable mode for the CCSM model. MPH version 2 was then developed for the multi-component, single-executable mode for the PCM model. MPH version 3 supports the multi-component, multi-executable mode and provides a unified user interface covering MPH1 and MPH2. Multi-instance components and command-line argument passing are implemented in MPH4 to support climate ensemble simulations, an emerging approach for ascertaining the uncertainty in climate predictions.

The code is written in Fortran 90. All MPH functionality currently works on the IBM SP, SGI Origin, Compaq AlphaSC, and Linux clusters. Source code and instructions on how to compile and run on all these platforms are publicly available on our MPH web site.

MPH has been adopted in CCSM development; CCSM is the flagship U.S. coupled climate model system and the one most widely used for long-term climate system modeling in the U.S. and worldwide. MPH has also been adopted in NCAR's Weather Research and Forecasting (WRF) model, the new generation of the mesoscale model MM5; several dozen countries use MM5 for their routine regional mid-range weather and climate forecasts. MPH is also used in Colorado State University's icosahedral-grid coupled model. Others that have shown interest include SGI (for a coupled model), a group in Germany (for a coupled climate model), and a group in the UK (for ensemble simulations). The Model Coupling Toolkit, which handles communication between different model components, also uses MPH.

How to Present at SC2003
This work will be presented as a normal paper poster. The motivation, software environment, implementation, status, examples and functionalities will all be presented.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Surendra Byna (Dept. of Computer Science, Illinois Institute of Technology, Chicago), William Gropp (Math. and Comp. Science Division, Argonne National Laboratory), Xian-He Sun (Dept. of Computer Science, Illinois Institute of Technology, Chicago), Rajeev Thakur (Math. and Comp. Science Division, Argonne National Laboratory)
   
  Description:
  The MPI Standard supports derived datatypes, which allow users to describe non-contiguous memory layout and communicate noncontiguous data with a single communication function. This feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, few MPI implementations implement derived datatypes in a way that performs better than what the user can achieve by manually packing data into a contiguous buffer and then calling an MPI function.

In this poster, we present a technique for improving the performance of derived datatypes by automatically choosing packing algorithms that are optimized for memory-access cost. The packing algorithms use memory-optimization techniques that the user cannot easily apply without advanced knowledge of the memory architecture. The range of performance improvement is predicted from the architectural details and the data-access pattern before the optimizations are automatically applied to derived datatypes. We present performance results for a matrix-transpose example demonstrating that our implementation of derived datatypes significantly outperforms both manual packing by the user and the existing derived-datatype code in the MPI implementation (MPICH on the SGI Origin 2000 at NCSA and IBM's MPI on the IBM SP at the San Diego Supercomputer Center). Future work includes utilizing various loop-optimization techniques and incorporating this work into MPICH2.
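
For readers unfamiliar with derived datatypes, the sketch below shows the kind of noncontiguous transfer being optimized: sending one column of a row-major matrix with an MPI_Type_vector, versus the manual-packing alternative mentioned above. It is a generic illustration, not the authors' benchmark code.

    /* Sending one column of a row-major N x N matrix with a derived datatype. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 4

    int main(int argc, char **argv)
    {
        double a[N][N], col[N];
        int i, j, rank;
        MPI_Datatype column;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 10.0 * i + j;

        /* Derived datatype: N blocks of 1 double, stride N doubles apart. */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* Send column 2 to ourselves; the manual-packing alternative would
           copy a[i][2] into a contiguous buffer and send N MPI_DOUBLEs. */
        MPI_Sendrecv(&a[0][2], 1, column, rank, 0,
                     col, N, MPI_DOUBLE, rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0)
            for (i = 0; i < N; i++) printf("col[%d] = %g\n", i, col[i]);

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }
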
  Link: --
   

 

     
  Session: Poster Reception
  Title: MPICH-V3: A hierarchical fault tolerant MPI for Multi-Cluster Grids
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
Pierre Lemarinier (LRI), Aurélien Bouteiller (LRI), Franck Cappello (INRIA-CNRS-LRI)
   
  Description:
  Introduction
Among the programming models envisioned for the Grid, explicit message passing following the SPMD paradigm is one of the most studied. In particular, there have been many attempts to design and develop MPI implementations for the Grid (MPICH-G2, PACX-MPI, etc.), making vendor MPI implementations interoperable. However, very little attention has been paid to fault tolerance (the MPICH-GF, FT-MPI, and MPICH-V projects are some exceptions). While there is much previous work on fault-tolerant message passing, none of it has addressed the fundamental question of hierarchy: in a Grid, faults may occur within each Grid node or between Grid nodes. In this poster, we present the architecture of MPICH-V3, the third version of MPICH-V, designed to run on hierarchical systems such as clusters of clusters in a Grid context, and the performance of its key mechanisms.

MPICH-V3 global architecture
MPICH-V3 follows several design constraints: a) providing automatic and transparent fault tolerance (detection and restart); b) allowing the choice of fault-tolerance protocol (coordinated checkpointing or message logging) within each Grid node (cluster); c) allowing partial checkpointing of an MPI execution; and d) avoiding a coordinated checkpoint approach at the Grid level. To satisfy these constraints, MPICH-V3 relies on a hierarchical fault-tolerance design involving several protocols and runtime layers. At the cluster level, fault tolerance is provided by the MPICH-V-CL [1] or MPICH-V2 [1] protocols, featuring coordinated (Chandy-Lamport) or uncoordinated (message-logging) checkpointing, respectively, at the cluster administrator's discretion. The MPICH-V runtime is compatible with batch schedulers and provides fault tolerance transparently; a batch job simply requests more nodes than the application needs, as spares in case of faults. At the Grid level, the fault-tolerance protocol relies on uncoordinated checkpointing associated with pessimistic message logging. The Channel Memories developed for the MPICH-V1 [1] protocol are used along with checkpoint servers storing the states of all the MPI processes running on the Grid. If a Grid node fails, or if it becomes unreachable because of network failures, the MPI processes that ran on the faulty cluster are restarted on another Grid node (cluster). This feature requires three mechanisms: 1) a global runtime capable of launching MPI processes on Grid nodes (clusters, through their batch schedulers), detecting faults, and relaunching a set of faulty MPI processes on available nodes; 2) a system allowing the migration of a set of MPI processes between heterogeneous clusters; and 3) an MPI execution controller ensuring that a set of restarting MPI processes reaches a state coherent with the full system. The first mechanism will rely on an augmented Grid scheduler. The second relies on a novel organization of the lowest MPICH layer, using a generic device linked with the application and network-specific communication daemons. The third mechanism logs, on a remote reliable medium, causal-dependency information about intra-cluster messages; this last mechanism is necessary even with coordinated checkpointing, to ensure that a failed process can replay the same execution as its initial one and reach a globally coherent execution.
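
As a minimal illustration of where a message-logging hook sits (not MPICH-V code, which implements logging inside the MPICH device layer and ships payloads to remote Channel Memories), the following C wrapper uses the standard MPI profiling interface to intercept MPI_Send and record the message envelope; linked ahead of the MPI library, it applies to any MPI program.

    /* Message interception via the standard MPI profiling interface (PMPI).
       Prototype matches MPI-3; older MPI versions declare buf without const. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rank;
        PMPI_Comm_rank(comm, &rank);
        fprintf(stderr, "[logger] rank %d -> dest %d, tag %d, count %d\n",
                rank, dest, tag, count);
        /* A pessimistic protocol would block here until the event (and, for
           sender-based logging, the payload) is safely stored on a reliable
           remote medium before handing the message to the real send. */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }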

Experimentations
Several parameters govern the performance and fault-tolerance properties of the hierarchical design. In this poster we present the evaluation of three basic mechanisms. First, we measure the performance penalty due to message logging at the highest hierarchical level, where all messages are logged to reliable media using the MPICH-V1 Channel Memories [1]. The results clearly show that the latency for small messages is increased compared to the reference protocol (P4); for long messages, however, the bandwidth is similar to that of P4. Second, we consider one of the Grid clusters running a coordinated-checkpoint-based fault-tolerance protocol augmented with a logging mechanism for storing causal-dependency information about messages. A figure presents the performance degradation due to the overhead of remote logging of causality information, using the NAS benchmarks. The evaluation clearly demonstrates that the overhead is not significant for applications with a high computation-to-communication ratio, which is a necessary property of parallel applications running on Grids. Finally, we measure the cost of migrating a set of MPI processes from one cluster to another. A figure presents the cost of checkpoint/migration/restart for different kinds of networks. This cost encompasses the traditional overhead of checkpoint/restart, plus the cost of relaunching the MPI runtime on the target cluster and of transferring the MPI process states from one cluster to another.

The poster will present: introduction, motivations, related work, general architecture, protocol definition, theoretical foundations, implementation, early performance evaluation, perspectives.

[1] www.lri.fr/~gk/MPICH-V/ (SC2002 and SC2003 papers)
  Link: --
   

 

     
  Session: Poster Reception
  Title: Performance Modeling of HPC Applications
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Allan Snavely (SDSC), Laura Carrington (SDSC), Nicole Wolter (SDSC)
   
  Description:
  Problem Statement:
To develop a flexible general framework capable of modeling and/or predicting the performance of scientific applications on current and future High Performance Computers (HPC).

Description:
In previous work [1] we presented an instantiation of a performance modeling framework that proved effective for modeling different HPC kernels on several architectures. Detailed descriptions of the PERC/PMaC performance modeling framework can be found in references [1-4]; these papers are all online at http://www.sdsc.edu/pmac/Papers/papers.html. The framework is faster than cycle-accurate simulation, more comprehensive than simple benchmarking, and has proven useful for performance investigations via proof-of-principle studies applied to short-running kernels. The poster will show methods that extend the framework to encompass longer-running, complex scientific applications and that add flexibility to incorporate new architectures and novel concepts.

The framework provides an automated means for carrying out performance modeling investigations. It combines tools for gathering machine profiles and application signatures with convolution methods. Machine profiles are measurements of the rates at which HPC platforms can carry out fundamental operations (for example, floating-point operations, memory accesses, message transfers). Application signatures are summaries of the operations to be carried out on behalf of an application to accomplish its computation. Convolution methods are techniques for mapping signatures to profiles with reasonable time complexity to predict and understand performance.

Models for the scientific applications POP (Parallel Ocean Program), NLOM (Navy Layered Ocean Model), and Cobalt60 will be run with exemplary data inputs at various processor counts (4, 16, 32, and 128) for at least three different architectures. The results from the models show an average error of 7.9%, where % error = (real time - predicted time) / real time x 100. The model will further be used to orchestrate a sensitivity study for one of the applications. This involves manipulating the performance model's machine parameters to quantify how the application's performance is influenced by upgrades to the processor and/or network.
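
Schematically (our simplified restatement, not the framework's exact formulation), the convolution combines per-operation counts c_i from the application signature with measured rates r_i from the machine profile, and the error metric quoted above is

\[ T_{\mathrm{pred}} \;\approx\; \sum_i \frac{c_i}{r_i}, \qquad \%\,\mathrm{error} = \frac{T_{\mathrm{real}} - T_{\mathrm{pred}}}{T_{\mathrm{real}}} \times 100, \]

where the c_i count fundamental operations (memory accesses, floating-point operations, messages) and the r_i are the rates at which the target machine performs them.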

POP is an ocean circulation model used for eddy-resolving simulations of the world's oceans and for climate simulations as the ocean component of coupled climate models. The model solves the three-dimensional primitive equations for fluid motion on the sphere under the hydrostatic and Boussinesq approximations, with spatial derivatives computed using finite-difference discretizations.

NLOM, the Navy's hydrodynamic (isopycnal) nonlinear primitive-equation layered ocean circulation model, has been used at NOARL for more than 10 years for simulations of ocean circulation. The model retains the free surface and uses semi-implicit time schemes that treat all gravity waves implicitly. It makes use of vertically integrated equations of motion and their finite-difference discretizations on a C-grid.

Cobalt60 is an HPC application that solves the compressible Navier-Stokes equations using an unstructured-grid solver. It uses Detached-Eddy Simulation (DES), a combination of Reynolds-averaged Navier-Stokes (RANS) modeling and Large Eddy Simulation (LES).

Description of Poster:
The poster will contain an image of the performance prediction framework, describing each component and how the components combine to produce a performance prediction. The poster will report the modeled results for the three applications; these tables will include predictions for different machines and various processor counts, the real runtime, the predicted runtime, and the % error for each prediction. The poster will also report on the sensitivity study; these tables and graphs will highlight the influential aspects of an architecture with respect to the given application.

Reference:
1. A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia, A. Purkayastha, "A Framework for Performance Modeling and Prediction", Proceedings of SC2002, Baltimore, Nov. 2002
2. L. Carrington, A. Snavely, N. Wolter, X. Gao, "A Performance Prediction Framework for Scientific Applications", Workshop on Performance Modeling and Analysis - ICCS, Melbourne, June 2003
3. A. Snavely, N. Wolter, and L. Carrington, "Modeling Application Performance by Convolving Machine Signatures with Application Profiles", IEEE 4th Annual Workshop on Workload Characterization, Austin, Dec. 2001
4. L. Carrington, N. Wolter, and A. Snavely, "A Framework for Application Performance Prediction to Enable Scalability Understanding", Scaling to New Heights Workshop, Pittsburgh, May 2002.
  Link: --
   

 

     
  Session: Poster Reception
  Title: 1 TFLOPS achieved with distributed Linux cluster
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Craig A. Stewart (University Information Technology Services, Indiana University), George Turner (UITS, Indiana University), Peng Wang (UITS, Indiana University), David Hart (UITS, Indiana University), Steven Simms (UITS, Indiana University), Daniel Lauer (UITS, Indiana University), Mary Papakhian (UITS, Indiana University), Matthew Allen (UITS, Indiana University), Jeff Squyres (Open Systems Laboratory, Indiana University), Andrew Lumsdaine (Open Systems Laboratory, Indiana University)
   
  Description:
  The most recent Top500 list released in June 2003 includes, for the first time ever, a persistent, distributed Linux cluster with an achieved performance of more than 1 TFLOPS on the Linpack benchmark - the IU AVIDD facility. The purposes of this report are to describe this distributed cluster, describe the results of our performance analyses and tuning efforts, and consider the costs and benefits of this distributed approach in terms of robustness and performance.

Indiana University's AVIDD facility (Analysis and Visualization of Instrument-Driven Data) includes two identical Linux clusters, each with 208 2.4 GHz Prestonia processors. One is located in Indianapolis, the other in Bloomington (IN). They are connected via 53 miles of the I-light network (www.i-light.org). Communication over the I-light network is achieved via two Force10 E600 switches, configured with 1000Base-TX (Gigabit Ethernet over copper) ports for connections to local machines, and with 10GBase-ER modules to achieve communication between Indianapolis and Bloomington. Each local cluster also includes a Myrinet interconnect.

Methods and Materials:
Performance of the AVIDD facility was investigated with the High Performance Linpack (HPL) benchmark (http://www.netlib.org/benchmark/hpl/) using a beta version of LAM 7.0 (CVS snapshot date 03/23/2003) (http://www.lam-mpi.org/). Benchmarks were run under three conditions. HPL was run on each local cluster using the Myrinet interconnect. This test was repeated using gigabit Ethernet running over the local Force10 switch. The HPL benchmark was also run across the combined Bloomington and Indianapolis clusters, using gigabit Ethernet, the Force10 switches, and two dedicated fibers of the I-light network. For the benchmarks we used 192 processors in each cluster, for a peak theoretical capacity per cluster of 0.922 TFLOPS, and a combined peak theoretical capacity of 1.843 TFLOPS.
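
The quoted peak figures are consistent with assuming two double-precision floating-point operations per cycle per Prestonia (Xeon) processor; the per-cycle rate is our assumption, not a statement from the poster:

\[ 192 \times 2.4\ \mathrm{GHz} \times 2\ \mathrm{flops/cycle} = 921.6\ \mathrm{GFLOPS} \approx 0.922\ \mathrm{TFLOPS}, \qquad 2 \times 921.6\ \mathrm{GFLOPS} \approx 1.843\ \mathrm{TFLOPS}. \]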

Results:
With one cluster, connected locally via Myrinet, the HPL benchmark achieved 0.614 TFLOPS, or 66.7% of peak theoretical capacity. The peak theoretical capacity of the cluster used for this benchmark was 0.921 TFLOPS. This benchmark was performed using MPICH-GM 1.2.4.8 and Myrinet GM 1.6.3. Problem size was set to 150,000.

With one cluster, connected locally via gigabit Ethernet and a Force10 E600 switch, the HPL benchmark achieved 0.575 TFLOPS, or 62.4% of peak theoretical capacity. This benchmark was performed using Force10 FTOS v4.3.2.0b. Problem size was again 150,000.

The benchmarks run on one local cluster were duplicated, which verified that the two clusters were operating identically.

Running across the combined Bloomington and Indianapolis clusters, the HPL benchmark achieved 1.058 TFLOPS, or 57.4% of the combined peak theoretical capacity of the two clusters. The problem size was set to 220,000. Network parameters were tuned as follows: jumbo frames were enabled (packet size 9252); adaptive packet tuning was disabled and packet-handling latencies were minimized on the compute nodes' NICs; and the default and maximum read and write TCP buffer sizes were optimized for the latencies (distances) involved. The ability of the gigabit Ethernet NICs to perform the segmentation and checksumming work was critical to the performance achieved.

Conclusion:
The AVIDD facility provides an unusual opportunity to directly consider the costs and benefits of a distributed, grid-based approach as compared to a single installation. The upper bound on the performance cost was approximately 9% below the HPL result expected for a single-location installation. (This 9% figure is based on doubling the size of one local Myrinet-connected cluster and assuming perfect scaling.) The benefits of the distributed approach have to do primarily with facilities and robustness. The distributed installation reduced the impact on IU's existing machine rooms in terms of electrical power and cooling. This was important: neither of the two existing machine rooms had electrical power and cooling facilities sufficient to house the entire system. In addition, the distributed approach creates resilience in the face of a disaster striking one of the two machine rooms.

There are many other reasons for a grid-based approach to computing infrastructure, but overall the grid approach has proven useful to IU in practice in this installation.

NOTE: Prior to presentation of our poster, all of the benchmarks will be re-run with the latest versions of the software. We also plan two additional variants of the benchmark tests. First, we plan to run the benchmark on the cluster using the Force10 switches and gigabit Ethernet, but connected over a shared network. Second, we plan to run a performance analysis using multi-protocol support in LAM/MPI, which will permit us to use a hybrid of Myrinet locally and gigabit Ethernet between the two clusters.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Distributed Intelligent RAS System for Large Computational Clusters
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  J. M. Brandt (Sandia National Laboratories), N. M. Berry (Sandia National Laboratories), R. B. Yao (Sandia National Laboratories), B. M. Tsudama (Sandia National Laboratories), A. C. Gentile (Sandia National Laboratories)
   
  Description:
The viability of large computational clusters rests upon their "RAS" - Reliability (infrequency of problems), Availability (usability during failures or maintenance), and Serviceability (ease of maintenance and problem diagnosis). Due to non-negligible component failure rates, node failures in large clusters are inevitable: Sandia Livermore's CPlant cluster of 488 nodes has roughly one node failure a day. A post-mortem analysis to determine the cause of problems is difficult to impossible, involving correlation of large amounts of distributed data on nodes, processes, log files, etc.

Node problems are particularly costly to users, even with application checkpointing, during computations of long duration (hours-months) on many processors. Users must carefully balance the granularity of checkpointing with the probability of node death as they may lose far more time to the overhead of checkpointing than by re-running their applications in the event of failure. Even with a valid checkpoint just before a node death, on our system, users must still re-queue their jobs and wait until the required number of nodes becomes available in order to carry on.
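
For context only (this rule of thumb is not part of the poster), a standard first-order estimate of the checkpoint interval that balances checkpoint overhead against expected rework is Young's approximation,

\[ \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}, \]

where \(\delta\) is the time to write one checkpoint and \(M\) is the mean time between failures; reliable failure prediction, as proposed here, largely sidesteps this trade-off by checkpointing on demand.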

In our opinion, current RAS approaches are not well suited to large clusters: they push/pull notification of pre-defined and pre-selected events up to a single, central controlling management node, where scripts may be invoked in response. These approaches don't scale well because the workload of the monitor grows with the cluster size. Additionally, this design handles scenarios of only limited complexity and requires pre-existing knowledge of failure modes and mechanisms.

We have developed a RAS approach in which the overall problem is decomposed into small parts locally managed at each node. Our RAS system is comprised of intelligent, pro-active decision making (e.g., failure prediction and recovery strategy) entities distributed throughout the cluster on a per-node basis. These entities coordinate information from their specific tasks to form collective beliefs and generate solutions for the overall global system. They learn what is normal and react to abnormalities in their environment. If a RAS entity dies, that entity and its associated node are mapped out until fixed and the aggregate system goes on unimpaired.

Each RAS entity maintains an equivalency table of nodes that have similar geographic, job, power, etc., attributes. Rather than each RAS entity polling every other RAS entity in its entire node equivalency group(s), or having a hierarchical approach that involves building some tree structure for processing and communicating appropriate data (this would result in an unbalanced processing load), each RAS entity bases its "global" view on a sample of the parameter set which is randomly pushed to it from entities within its equivalency group(s).
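
The following toy C sketch (ours, not the Sandia implementation) shows the shape of that randomized push: each entity periodically picks one random peer from its equivalency table and pushes its latest sensor sample, so every entity sees only a bounded, random sample of its group. Network delivery is stubbed out with a print statement.

    /* Toy sketch of a randomized push within an equivalency group. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define GROUP_SIZE 8

    struct sample { int node_id; double temperature; };

    static void push_to_peer(int peer, struct sample s)
    {
        /* In the real system this would be a message on the RAS network. */
        printf("node %d -> peer %d : temp %.1f C\n", s.node_id, peer, s.temperature);
    }

    int main(void)
    {
        int self = 3;                                   /* this entity's node id */
        int group[GROUP_SIZE] = {0, 1, 2, 3, 4, 5, 6, 7}; /* equivalency table */
        struct sample s = { self, 41.5 };               /* latest local reading */
        int peer;

        srand((unsigned)time(NULL));
        do {                                            /* pick a random peer != self */
            peer = group[rand() % GROUP_SIZE];
        } while (peer == self);

        push_to_peer(peer, s);
        return 0;
    }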

Our RAS entities consist of three basic pieces: 1) sensor monitor, 2) decision making, and 3) fault tolerance and system interface. The latter piece ensures that the aggregate knows if it or the other two pieces are not operational. When there is imminent failure of a node hosting a job, this piece can also invoke checkpointing abilities of the application and notify the scheduler to, if possible, re-run the job with the current set of good nodes and replacement(s) for the failing node(s).

In order for this system to scale, each processor must have a bounded workload independent of the cluster or job size. In order to accomplish this, we have adopted a distributed approach where we use a secondary processor, not the main CPU, at each node to perform the RAS functionality. (For our proof of concept system, we are running our RAS software on PDA units each attached to its respective node, thus mimicking the use of a secondary "service processor" on the motherboard.) By doing this we can distribute the information gathering, statistical processing, and troubleshooting over all the nodes. Thus as the cluster grows in size, so does the aggregate RAS processing power.

There are several hardware prerequisites for our system: 1) a separate, sufficiently powerful processor which lives on the motherboard and has access to sensor and system related data, 2) a network which has no connection/contention with message passing resources, and 3) network access to the job scheduler.

This presentation will illustrate how our RAS system detects abnormalities and predicts failure of cluster nodes through intelligent evaluation of indicators such as temperature, etc. from embedded sensors, as well as historical data. We will show that reliable failure prediction can boost application performance when compared to traditional checkpointing strategies. This is accomplished by checkpointing primarily when failure is imminent, and only secondarily as a preemptive measure.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Prophesy: An Automated Modeling System for Parallel and Grid Applications
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
Xingfu Wu (Texas A&M University), Valerie Taylor (Texas A&M University), Rick Stevens (Argonne National Laboratory)
   
  Description:
  Efficient execution of applications requires insight into how the system features impact the performance of the application. For parallel and grid systems, the task of gaining this insight is complicated by the complexity of the system features. This insight generally results from significant experimental analysis and possibly the development of performance models. Performance models provide significant insight into the performance relationships between an application and the system used for execution. In particular, models can be used to predict the relative performance of different systems used to execute an application or to explore the performance impact of the use of different algorithms to solve a given task. This poster discusses Prophesy, our effort to automate the process of developing analytical models for parallel and grid applications, so as to significantly decrease the time required for model development.

The Prophesy framework (Prophesy web site: http://prophesy.mcs.anl.gov) is a performance analysis and modeling infrastructure that consists of three major components: data collection, data analysis, and three central databases. The data collection component focuses on automated instrumentation and application code analysis at the level of basic blocks, procedures, and/or functions. The resultant performance data is automatically stored in the performance database. The data analysis component, which is the distinguishing component of Prophesy, produces analytical performance models with coefficients at the granularity specified by the user. The models are developed based upon performance data from the performance database, model templates from the template database, and system characteristics from the systems database. The Prophesy automated modeling system allows a developer to quickly gain insight into the performance of an application code, or of functions within a code, so that the developer can cut down on the time required for developing efficient applications, locate potential bottlenecks, and approximate the runtime on different systems.

Currently, the Prophesy system provides three different modeling methods: curve fitting, parameterization, and coupling (composition). The first two are well-established techniques for model generation, which Prophesy facilitates by automation; the last is made possible by the amount of information Prophesy archives about an application. Curve fitting is a least-squares fit of the empirical data. This method is the most accessible, as it uses only the application information contained within the Prophesy database. Parameterization requires manual code analysis, which exposes system parameters in the model; it is assumed that the manual analysis involves kernels, not full applications. The parameterization method allows users to use the information in the system database to explore the performance of an application on different systems, including distributed systems. The coupling (composition) method focuses on developing an application model in terms of its kernels; the kernel models can themselves be developed using the curve-fitting or parameterization methods. Each model type has its own set of options and parameters that a user can easily modify to produce a large number of different performance functions.
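
As a minimal illustration of the curve-fitting method (a generic sketch, not Prophesy code), the following C program performs a least-squares fit of runtime against a single problem-size term using the simple linear model T = a + b*x; the (x, T) observations are made up for the example. Prophesy generates such fits automatically from data in its performance database.

    /* Least-squares fit of T = a + b*x from a handful of measurements. */
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical (problem size, runtime) observations. */
        double x[] = { 100, 200, 400, 800 };
        double t[] = { 2.1, 3.9, 8.2, 15.8 };
        int n = 4, i;
        double sx = 0, st = 0, sxx = 0, sxt = 0, a, b;

        for (i = 0; i < n; i++) {
            sx += x[i]; st += t[i]; sxx += x[i] * x[i]; sxt += x[i] * t[i];
        }
        b = (n * sxt - sx * st) / (n * sxx - sx * sx);  /* slope */
        a = (st - b * sx) / n;                          /* intercept */

        printf("fitted model: T(x) = %.4f + %.4f * x\n", a, b);
        return 0;
    }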

The Prophesy model development uses a combination of script generation in PHP and Octave, which is a tool with similar functionality to Matlab. Octave executes the generated script and returns a graph of the model with desired prediction along with the actual experimental data.

This poster will include a description of the components of Prophesy, with emphasis placed on the three modeling techniques. We will illustrate the use of the modeling techniques with some of the applications stored in the Prophesy database. These applications include the NAS Parallel Benchmarks, a high energy physics application called ATLFAST, a cosmology application and others. If possible, we would also like to include demonstrations of Prophesy from our laptop.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Real-Time Rendering of Attributed LIC-Volumes Using Programmable GPUs
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Yasuko Suzuki (Information Technology R & D Center, Mitsubishi Electric Corp., Japan), Tokunaga Momoe (Office Imaging Products Operations, Canon Inc., Japan), Shoko Ando (Graduate School of Humanities and Sciences, Ochanomizu Univ., Japan), Shigeru Muraki (Collaborative Research Team of Volume Graphics, AIST, Japan), Yuriko Takeshima (Institute of Fluid Science, Tohoku Univ., Japan), Shigeo Takahashi (Graduate School of Arts and Sciences, University of Tokyo, Japan), Issei Fujishiro (Graduate School of Humanities and Sciences, Ochanomizu Univ., Japan)
   
  Description:
  Line Integral Convolution (LIC) [1] is a popular method for visualizing 2D dense flow fields. However, the volume rendering of direct 3D extensions of LIC (LIC-volumes) cannot produce useful images, since optical integration along the viewing direction tends to obscure the coherent structures along local streamlines. In our previous research [2], we proposed selective LIC-volume rendering using a significance map (S-map) derived from the characteristics of flow structures and a specific illumination model for streamlines to make the volume rendering of LIC-volume more useful. In addition, we have achieved real-time animation of LIC-volume rendering using a VolumePro500 (TeraRecon, Inc.) board. With this board, however, we could only visualize relatively small datasets because of the limited memory size, and could not visualize multiple field values, such as an LIC-volume and an S-map, simultaneously.

Some other techniques for 3D LIC and hardware LIC rendering have been reported [3, 4]. However, these techniques cannot provide general-purpose visualization for a given 3D flow dataset. In this poster, we present a scalable PC cluster system that performs real-time animation of LIC-volume of arbitrary size, allowing interactive control of the transfer function (TF) by using an extra attribute field value.

First, we used a phase-shift animation technique [1] to generate a time-varying LIC-volume, which dramatically improved the visualization of a 3D flow structure. Since this technique generates large-volume data sets, we implemented it on our latest VG cluster [5], which is a scalable visual computing system consisting of three image-compositing devices (Mitsubishi Precision Co., Ltd.) and 18 commodity PCs equipped with the latest graphics processing units (GPUs). This system generates a time series (stream) of phase-shifted LIC-volumes in parallel and distributes the series to each PC to perform high-speed sort-last volume rendering. Because of this distribution, each GPU can store the stream of partial LIC-volumes in the relatively small 3D texture memory. The texture-based volume-renderer of each PC visualizes the stream of partial LIC-volumes in real-time, and the resultant stream is sent to the image-compositing devices via the interface board on the PCI bus. The image-compositing devices blend the streams in front-to-back order according to their depth priorities and send the resultant stream to the host PC via the PCI interface board for the final display.
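
For reference, front-to-back blending of the partial images follows the standard "over" recurrence (stated here for context; the poster does not spell out the compositing devices' exact equations), accumulating color C and opacity A as each partial image with color C_i and opacity alpha_i arrives in depth order:

\[ C \leftarrow C + (1 - A)\,\alpha_i C_i, \qquad A \leftarrow A + (1 - A)\,\alpha_i. \]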

Second, we used an extra attribute field (e.g., velocity) in addition to the LIC-volume to ensure comprehensive and convenient exploration of 3D flow fields. For this purpose, we developed a TF editor adopting a feature-extraction technique called the Volume Skeleton Tree (VST) [6]. The VST is an efficient representation of the global field topology connecting critical field points, where the topology of the isosurface changes. By using the VST of the attribute field, we can automatically obtain a TF that reflects the global characteristic structure of the flow field and provides flexible control of the TF for the LIC-volume. This TF editor reveals the inner structure of flow fields more clearly and efficiently.

Third, we used the programmable features of the commodity GPU (nVIDIA GeForce FX 5900 Ultra) to implement the volume-renderer for this system. By writing a fragment program for each GPU, we can flexibly control color, translucency and lighting of the flow field by considering both the LIC-volume and the attribute field.

The current system allows us to generate a phase-shift animation of attributed LIC-volumes in real-time (more than 45 fps for a 512 x 512 image size). We can also change the TF and the viewing direction interactively. Furthermore, this system is highly scalable. One image-compositing device can connect up to eight PCs, and multiple image-compositing devices can be connected as an octree structure. In principle, we should be able to visualize 3D flow data of any size at real-time rates by scaling up the system.

In the poster session, we will show a video that demonstrates 3D flow visualization. We will also present a more detailed description of our system and user interfaces.

[1] Cabral, B. and Leedom, L., Proc. SIGGRAPH 93, 263-270, 1993.
[2] Suzuki, Y., et al., Proc. IEEE Visualization 2002, 485-488, 2002.
[3] Rezk-Salama, C., et al., Proc IEEE Visualization 99, 233-240, 1999.
[4] Bordoloi, U. and Shen, H-W., Computer Graphics Forum, 21(3), 2002.
[5] Muraki, S., et al., Proc. ACM/IEEE SC2001 (CD-ROM), 2001.
[6] Takeshima, Y., et al., Proc. Visual Computing 2001, 79-84, 2001. (in Japanese)
  Link: --
   

 

     
  Session: Poster Reception
  Title: SX-6 Benchmark Results and Comparisons
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Thomas J. Baring (Vector Specialist), Andrew C. Lee (User Consultant)
   
  Description:
  Problem Statement: The Arctic Region Supercomputing Center (ARSC) installed a single 8-CPU Cray SX-6 node, and made it available in August 2002 to the broader U.S. HPC community for benchmarking and testing. This poster compares the performance of several codes run on the SX-6 versus other HPC computing platforms.

Description of Approach: Accounts on the SX-6 were granted to more than two dozen U.S. researchers, allowing them to port and benchmark their codes on a platform novel to the U.S. HPC community. In exchange for this access, account holders agreed to provide ARSC with benchmarking results for at least one code run on the SX-6 and comparison data from at least one other HPC platform.

Status of Work: At this time ARSC is hosting 30 users on the SX-6, not including ARSC staff. We have received results for almost a dozen codes that have been ported to the SX-6. Results from seven codes have already been presented in papers. A paper at the 2003 Cray User Group meeting ("SX-6 Comparisons and Contrasts," available at http://www.arsc.edu/support/technical/pdf/200305.SX6CompareAndContrast.pdf) presented five codes, and a paper by Guy Robinson at the 2003 DoD HPCMPO User Group meeting included two additional codes. Benchmark results from more codes are expected between now and the poster deadline.

Description of Poster: Our poster will present, in a familiar manner, comparative performance data from several codes run on the SX-6 and on other HPC platforms. Several graphs will be used to present individual and/or aggregate benchmark results. Metrics compared will be wall-clock time, percentage of CPU peak performance, GFLOPS, and other metrics when available, applicable, and of interest. Comparison of vectorization and scalability on platforms will be presented when applicable and of interest. We will describe salient issues in porting, using, and achieving performance on the SX-6.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Performance of SGI's Altix
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Michael Lang (LANL), Scott Pakin (LANL), Darren Kerbyson (LANL), Fabrizio Petrini (LANL), Harvey Wasserman (LANL), Adolfy Hoisie (LANL)
   
  Description:
SGI's latest entry into the HPC market is the Itanium-based Altix. Its unique features include a ccNUMA architecture, a single-system-image Linux with many additions from SGI, and Itanium processors. We assess its performance using two MPI codes of interest to Los Alamos, Sage and Sweep3D. Sage is a hydrodynamics code, while Sweep3D is an Sn-transport kernel using wavefront propagation that is sensitive to latency. Using these applications, we also compare performance with our local test-bed cluster. The comparison is interesting because our cluster is configured with the same 1.0 GHz Itanium 2 processors, but in a two-processor-per-node configuration and with Linux running on each node rather than a single system image. The cluster system also has a high-performance Quadrics Elan-3 interconnect.

We will present:
1) Overview of the Altix shared memory architecture.
2) Basic MPI network benchmarks (bandwidth, latency, bi-directional bandwidth, bi-directional latency, hot-spot, barrier, broadcast); a minimal ping-pong sketch of this kind of test follows the list.
3) Sweep3D performance
4) Sage performance
5) Comparisons to a Linux cluster
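
The sketch below is ours, for illustration only (the actual benchmark codes are not given in the abstract): a minimal MPI ping-pong of the kind listed in item 2, reporting round-trip latency and bandwidth for one fixed message size. Run it with at least two processes.

    /* Minimal MPI ping-pong between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20, reps = 100;   /* arbitrary size and count */
        int rank, i;
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("round trip: %.1f us, bandwidth: %.1f MB/s\n",
                   (t1 - t0) / reps * 1e6,
                   2.0 * nbytes * reps / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }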

This work will be revisited when the system is refreshed with Madison processors and NUMAlink 4 hardware upgrades.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Computational Steering in RealityGrid
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  John M Brooke (Manchester Computing, University of Manchester), Peter V Coveney (Centre for Computational Science, University College London), Jens Harting (Centre for Computational Science, University College London), Shantenu Jha (Centre for Computational Science, University College London), Stephen M Pickles (Manchester Computing, University of Manchester), Robin L Pinning (Manchester Computing, University of Manchester), Andrew R Porter (Manchester Computing, University of Manchester)
   
  Description:
PROBLEM STATEMENT: Problems associated with demanding high-performance computational science are not confined merely to finding resources with larger numbers of processors or more memory. Simulations that require greater computational resources also require increasingly sophisticated and complex tools for the analysis and management of their output. For a range of scientific applications, controlling the evolution of a computation based upon real-time analysis and visualisation of the status of a simulation alleviates many of the limitations of a simple simulate-then-analyse approach. Computational steering is the functionality that enables the user to influence the otherwise sequential simulation and analysis phases by merging and interspersing them.

OUR APPROACH: A central theme of RealityGrid (http://www.realitygrid.org) is the facilitation of distributed and interactive exploration of physical models and systems through computational steering of parallel simulation codes and simultaneous on-line, high-end visualisation.

We implement our steering framework as a library, callable from a variety of languages, programming paradigms, and scientific application types (we have working examples for fluid dynamics, Monte Carlo, molecular dynamics, and event-driven codes). The steering library itself consists of two parts: an application side and a client side. The client side is intended to be used in the construction of a steering client (of which the RealityGrid steering client is just one implementation). Communications between the application and client are routed through an intermediate steering grid service (SGS). The SGS provides the public interface through which clients can steer the application. In our architecture, the visualisation and application components appear on an equal footing, and a visualisation can possess its own SGS. Each SGS publishes information about itself in a registry, which is used by clients to discover and bind to running applications, and is instrumental in bootstrapping the communications between the application and visualisation components.
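
To make the application-side role concrete, here is an illustrative C sketch only (the function names steer_register and steer_poll are hypothetical and are not the RealityGrid API): the simulation registers a steerable parameter and polls for client-issued changes once per timestep.

    /* Illustrative application-side steering hook; library calls are stubbed. */
    #include <stdio.h>

    static double viscosity = 0.1;     /* the steerable parameter */

    /* Hypothetical library calls, stubbed for this sketch. */
    static void steer_register(const char *name, double *value) { (void)name; (void)value; }
    static int  steer_poll(void) { return 0; }   /* nonzero if a client changed something */

    int main(void)
    {
        int step;
        steer_register("viscosity", &viscosity);

        for (step = 0; step < 1000; step++) {
            /* ... advance the simulation one timestep here ... */
            if (steer_poll())
                printf("step %d: viscosity steered to %g\n", step, viscosity);
        }
        return 0;
    }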

The UK e-Science community has constructed an ambitious "Level 2 Grid" that aims to provide the user community with a persistent grid. We narrate our experiences in deploying RealityGrid applications involving computational steering, using Globus on this Level 2 Grid, which has made us one of the first groups to use a persistent grid for routine science requiring high-performance computing and computational steering.

PRESENTATION: The poster will give design and implementation details of the RealityGrid steering library and toolkit. The capabilities of the library will be outlined, and a diagram showing the service-orientated architecture of the latest implementation, based on the Open Grid Services Infrastructure, will be included. The poster will show, using a specific scientific application, the advantages that a powerful, flexible computational steering framework provides to the physical scientist, i.e., how it enables more effective utilization of computational resources and enhances a scientist's productivity. Facilities permitting, we will have an adjacent laptop to guide attendees through the steps involved in steering a real application on the grid, with the aim of elucidating the concepts discussed in the poster.
  Link: --
   

 

     
  Session: Poster Reception
  Title: An Adaptive Informatics Infrastructure Enabling Multi-Scale Chemical Science
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Thomas C. Allison (NIST, Gaithersburg, MD 20899-8381), Kaizar Amin (Argonne National Laboratory, Argonne, IL 60439-4844 ), William Barber (Los Alamos National Laboratory, Los Alamos, NM 87545), Sandra Bittner (Argonne National Laboratory, Argonne, IL 60439-4844 ), Brett Didier (Pacific Northwest National Laboratory, Richland, WA 99352), Michael Frenklach (University of California, Berkeley, CA 94720-1740), William H. Green, Jr. (Massachusetts Institute of Technology, Cambridge, MA 02139), John Hewson (Sandia National Laboratories, Livermore, CA 94551-0969), Wendy Koegler (Sandia National Laboratories, Livermore, CA 94551-0969), Carina Lansing (Pacific Northwest National Laboratory, Richland, WA 99352), Gregor von Laszewski (Argonne National Laboratory, Argonne, IL 60439-4844 ), David Leahy (Sandia National Laboratories, Livermore, CA 94551-0969), Michael Lee (Sandia National Laboratories, Livermore, CA 94551-0969), James D. Myers (Pacific Northwest National Laboratory, Richland, WA 99352), Renata McCoy (Sandia National Laboratories, Livermore, CA 94551-0969), Michael Minkoff (Argonne National Laboratory, Argonne, IL 60439-4844 ), David Montoya (Los Alamos National Laboratory, Los Alamos, NM 87545), Sandeep Nijsure (Argonne National Laboratory, Argonne, IL 60439-4844 ), Carmen Pancerella (Sandia National Laboratories, Livermore, CA 94551-0969), Reinhardt Pinzon (Argonne National Laboratory, Argonne, IL 60439-4844 ), William Pitz (Lawrence Livermore National Laboratory, Livermore, CA 94551), Larry A. Rahn (Sandia National Laboratories, Livermore, CA 94551-0969), Branko Ruscic (Argonne National Laboratory, Argonne, IL 60439-4844 ), Karen Schuchardt (Pacific Northwest National Laboratory, Richland, WA 99352), Eric Stephan (Pacific Northwest National Laboratory, Richland, WA 99352), Al Wagner (Argonne National Laboratory, Argonne, IL 60439-4844 ), Baoshan Wang (Argonne National Laboratory, Argonne, IL 60439-4844 ), Theresa Windus (Pacific Northwest National Laboratory, Richland, WA 99352), Lili Xu (Los Alamos National Laboratory, Los Alamos, NM 87545), Christine Yang (Sandia National Laboratories, Livermore, CA 94551-0969)
   
  Description:
The ultimate impact of chemical science relies upon the flow of information across many physical scales and scientific disciplines, ranging from subatomic quantum chemistry to predictive simulations of chemical processes such as combustion. New experimental methods, modern supercomputers, and complex theories have increased the pace of scientific advances, but their impact is increasingly limited by barriers to the needed communication and collaboration across scales and disciplines. Going well beyond the increasing need for collaboration among geographically distributed researchers, these barriers include incompatible data formats and undocumented metadata that limit the accessibility of complex validated data across scales, the lack of important data pedigree information, the time required to communicate through the peer-reviewed literature, and the manual process of post-publication annotation.

The Collaboratory for Multi-scale Chemical Science (CMCS) project is enabling chemical scientists to overcome barriers by collaborating with them to develop and deploy an adaptive informatics infrastructure that integrates a set of key collaboration tools and chemistry-specific applications, data resources, and services. The CMCS infrastructure provides XML data/metadata management services enabling annotation and data discovery, a Chemical Science Portal enabling data-centric project- and community-level collaboration, and tools for security, notification, and collaboration. The CMCS team is piloting these tools in the combustion research community to provide rapid exchange of multi-scale data and data pedigree and the integration of chemical science tools that generate, use and archive metadata.

Several innovative features of the CMCS infrastructure directly address the barriers described above. The web-based data archive implements Web Distributed Authoring and Versioning (WebDAV), an HTTP-based digital content management protocol that enables resource annotation and search via "properties." Further enhancements are provided by middleware that automatically annotates and translates data using XML standards and tools. The data is accessed using a data explorer available through a web portal. The portal also integrates collaboration tools, web services, notification services, security, and other features. A novel feature of the data explorer is the ability to "browse" views of the metadata/pedigree of the stored files and to easily view data using the different available translations registered with the middleware. Provenance and other metadata can be stored separately from very large data files and thus be manipulated, searched, and browsed, enabling cataloguing and/or discovery of such data without directly accessing it. Applications beyond chemical science are also of interest and will benefit from planned co-development agreements and the planned open-sourcing of the CMCS software infrastructure.

The first production version of this infrastructure has been developed to meet the requirements of specific chemical science use cases and was released to pilot user groups in May 2003. An updated production version with changes motivated by application requirements will be released in August 2003. User agreements that protect the intellectual property of data providers have been negotiated among the CMCS project team, external users, and DOE legal representatives. Three user groups that involve significant participation from scientists outside the CMCS project team are now actively involved in piloting the CMCS infrastructure. A number of chemical science applications have been integrated. Active Thermochemical Tables (ATcT; http://www-unix.globus.org/cog/projects/cmcs/cmcs-paper.pdf), a new approach to providing thermochemical data, is available through the CMCS portal as a web service. The Extensible Computational Chemistry Environment (Ecce; http://ecce.emsl.pnl.gov/) is an example that directly accesses the CMCS data store through defined application programming interfaces. Metadata generators and translators have allowed CMCS to capture data and display data provenance information spanning five chemical science disciplines. Publicly available chemical science data, along with tutorials and other information, is now accessible at the CMCS Portal at http://cmcs.org/. At SC03, a wall poster will display the features of the CMCS infrastructure and examples of its use to enable chemical science. The infrastructure, chemical science data, and applications will also be demonstrated live using available wireless access and a portable computer.

The Comprehensive collaborativE Framework (CHEF; http://www.chefproject.org), Scientific Annotation Middleware (SAM; http://collaboratory.emsl.pnl.gov/docs/collab/sam/), and the Java COG Kit Project (http://www.cogkits.org/) have partnered with CMCS to develop software infrastructure. The CMCS is a National Collaboratories Program project sponsored by the U.S. Department of Energy's Office of Mathematical, Information, and Computational Sciences. Participating organizations include Sandia National Laboratories, Pacific Northwest National Laboratory, Argonne National Laboratory, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, National Institute of Standards and Technology, Massachusetts Institute of Technology, and University of California, Berkeley. Parts of this work were also supported by the U.S. Department of Energy, Division of Chemical Sciences, Geosciences and Biosciences of the Office of Basic Energy Sciences under Contract No. W-31-109-ENG-38 (Argonne National Laboratory).
  Link: --
   

 

     
  Session: Poster Reception
  Title: The Advisor: Diagnosing End-To-End Network Problems
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Tanya Brethour (NLANR/DAST), Jim Ferguson (NLANR/DAST)
   
  Description:
  Tools to enhance network performance are not a novel concept, yet many users still struggle to achieve the full potential of their networks. This is usually a result of the abundance of information such tools produce and a lack of expertise to interpret it and pinpoint the performance problems. Consulting a network engineer is the ideal solution, but not always a possibility. It is clear that an automated way of providing advice to end users is needed.

The NLANR/DAST Network Performance Advisor ('The Advisor') is a simple, yet sophisticated, open source application that measures, displays, and analyzes network metrics. It uses existing diagnostic tools such as ping, traceroute, Iperf, and Web100 and integrates them into a common framework. This framework attempts to emulate a networking expert, and allows users to troubleshoot their own networking problems.

Developed by the NLANR Distributed Applications Support Team, the Advisor should appeal to researchers and network engineers alike. The Advisor is rapidly approaching its first official release, scheduled for Fall 2003.

Our poster will focus on two main aspects of the Advisor: its design and its use. The Advisor is one of many new projects focusing on network performance issues. What makes the Advisor unique is its design: it uses existing network diagnostic tools and analyzes their output in order to provide plain-text advice to end users.

The poster will describe the three main design components of the Advisor: Performance Data Collector (PDC), Performance Data Historical Archiver (PDHA), and Analysis Engine (AE). All components of the architecture communicate through XML Remote Procedure Calls (XML-RPC).
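
As a rough illustration of this style of component wiring (the Advisor's actual interfaces are not described here), the sketch below uses Python's standard xmlrpc modules to stand up a toy PDC-like server and query it from an Analysis-Engine-like client; the method and metric names are invented for the example.

    # Sketch of XML-RPC glue between components; the method and metric names
    # are hypothetical, not the Advisor's actual interface.
    from xmlrpc.server import SimpleXMLRPCServer
    import threading, xmlrpc.client

    def measure(metric, target):
        # A real PDC would invoke ping, Iperf, etc.; return a canned value here.
        return {"metric": metric, "target": target, "value": 42.0}

    server = SimpleXMLRPCServer(("localhost", 8888), logRequests=False, allow_none=True)
    server.register_function(measure, "measure")
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # An Analysis-Engine-like client requesting a measurement:
    pdc = xmlrpc.client.ServerProxy("http://localhost:8888")
    print(pdc.measure("rtt_ms", "example.org"))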

The PDC is responsible for measuring the desired metrics and exposing them through a consistent interface. The PDC does not measure these metrics itself, but uses existing network diagnostic tools (e.g., ping, traceroute, Iperf) to perform the measurement. The PDC has been designed so that it is easy to integrate new tools through the use of Application Definition Files (ADFs). It is also independent of the other Advisor components and can be deployed separately to provide a uniform way of collecting performance data.

The PDHA stores the metrics obtained from the PDC. Its main purpose is to serve as a cache for the PDC, which minimizes the number of metrics the PDC must re-measure. It is also useful for observing discrepancies in particular measurements over a period of time. The AE uses this information to diagnose performance problems.

The AE is a sophisticated component that attempts to duplicate the analysis steps that a network engineer would typically take. It uses Test Definition Files (TDF) to describe network problems. By querying the PDHA for measurements, it systematically determines which problems are occurring, and offers advice to the user to remedy the situation. This advice can either be used by the user to solve the problem, or the advice and actual measurements may be passed on to the network engineer responsible for that segment of the network.

The second half of the poster will focus on the user interface and examples of network problems the Advisor can diagnose. The Advisor GUI provides a simple interface for all users. There are three main displays that form the GUI: Expert, Map, and Analysis.

The Expert display will appeal to network engineers who want to examine the data themselves. It provides a single display for all metrics, which may have come from different tools. The Map display is the most visual: it plots the path between the source and destination and highlights trouble spots, and the user may "drill down" into subsequent layers to see what the problems are. Lastly, the Analysis display is for the impatient user. It simply displays the results of the Analysis Engine, including what the problems are and potential solutions, in an easy-to-understand plain-text format rather than network jargon. All three displays will be illustrated through a series of annotated screen shots.

Additionally, the poster will illustrate a sample of network problems the Advisor can diagnose. Through text and illustrations we will explain each problem, the required network metrics, the time needed to diagnose the problem, and what the GUI displays upon completion of the analysis. We expect to include two or more such scenarios, covering issues such as duplex mismatch, connectivity failures, and limited achievable bandwidth.

To supplement our poster, we hope to provide a number of attendees with a free CD distribution of the Advisor.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Repartitioning Unstructured Meshes for the Parallel Solution of Engine Combustion
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  PeiZong Lee (Institute of Information Science, Academia Sinica), Chih-Hsueh Yang (Institute of Information Science, Academia Sinica), Jeng-Renn Yang (Institute of Information Science, Academia Sinica)
   
  Description:
  This paper is concerned with designing effective algorithms for partitioning and repartitioning unstructured meshes for the parallel solution of engine combustion. Many practical applications are time-varying (transient), have complex geometries, and may adopt more than one mesh in the numerical simulations. The internal combustion engine, which involves chemical reactions, moving valves and pistons, and fuel injection, is typically a transient problem with changing shapes. For engine combustion simulation, a period of 128 separate operations is used to simulate the processes of fuel and air intake, compression of the fuel-air mixture, ignition and combustion of the charge, expansion of gases, and the removal of waste. For this type of transient (shape-changing) application, it is practical to generate a separate unstructured mesh for each frame of the object geometry within a period of operations.

When dealing with parallel processing, we have to partition the computing domain so that each processing element (PE) contains a part of an unstructured mesh. During parallel computation, each PE then calculates its local data and occasionally exchanges data with other PEs. Since computing an optimal partitioning is NP-complete, it is not affordable in practice. The goal of our algorithm is to partition and repartition consecutive unstructured meshes subject to the requirements of load balance, a small number of cut edges among adjacent partitions, and low inter-frame communication overhead. Our algorithm is based on refining successive partitionings.

We use a regular background mesh, which can be treated as a complete quadtree, as a coarse graph. A node in the coarse graph represents a cell of the regular background mesh. The weight of each node represents the number of triangles for an unstructured mesh whose gravity centers fall within the territory of the corresponding cell in the regular background mesh. An edge connects two nodes if there exists at least one pair of adjacent triangles whose gravity centers fall within each territory of these two corresponding cells in the regular background mesh, respectively. The weight of an edge represents the number of pairs of these adjacent triangles.

We define a changing factor for each cell of the regular background mesh based on the number of triangles (falling within the cell) changed between two consecutive meshes. If the shape of a successive mesh is changed so that the sum of all changing factors is larger than a threshold value, then we repartition the new mesh based on a static algorithm [Lee2002]. Otherwise, we adopt the quadtree-based multi-level refinement.

Multi-level refinement includes three phases: coarsening, partitioning, and uncoarsening refinement. The regular background mesh can be treated as a complete quadtree, where each quadtree leaf represents a cell in the background mesh. The quadtree representation implicitly defines a hierarchical (parent-child) relationship for a series of coarsened graphs.

Therefore, we can repeatedly apply coarsening, partitioning, and uncoarsening refinement, until no improvement can be made. We call this W-cycle refinement. W-cycle means a series of V-cycles and a V-cycle represents a cycle of coarsening, partitioning, and uncoarsening refinement. Note that METIS only applies V-cycle refinement, where its coarsening phase is based on a maximal matching among adjacent nodes [Karypis1998]. We have found that, for many cases, the partitioning results obtained by our method are better than those obtained by METIS. Perhaps this is because our method adopts W-cycle refinement.
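
To make the adaptive strategy concrete, the toy Python sketch below shows the decision between full static repartitioning and quadtree-based refinement; the cell weights and threshold are invented, and the sketch illustrates only the control flow, not the authors' implementation.

    # Toy sketch of the repartition-or-refine decision; cell weights are the
    # number of triangle centroids falling in each background cell (invented).
    prev_weights = {(0, 0): 120, (0, 1): 95, (1, 0): 80, (1, 1): 105}
    new_weights  = {(0, 0): 118, (0, 1): 140, (1, 0): 30, (1, 1): 112}

    def changing_factor(cell):
        # Fraction of triangles in this cell that changed between frames.
        a, b = prev_weights[cell], new_weights[cell]
        return abs(a - b) / max(a, b)

    total_change = sum(changing_factor(c) for c in new_weights)
    THRESHOLD = 0.5   # hypothetical tuning parameter

    if total_change > THRESHOLD:
        print("shape changed too much: repartition with the static algorithm")
    else:
        print("reuse previous partitioning: apply quadtree-based W-cycle refinement")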

After performing multi-level refinement, we can apply either a dynamic diffusion refinement algorithm [Yang2000] or a prefix code matching refinement algorithm [Chung2000] to achieve a completely load-balanced partitioning. Finally, we apply the Kernighan-Lin algorithm to further reduce the number of cut edges among partitions.

Experimental results for a simulation of the internal combustion engine that includes 128 unstructured meshes in a periodic cycle are reported. We implement the decomposition of consecutive unstructured meshes for 8 and 16 partitions. If we use only a static algorithm for partitioning each unstructured mesh, it incurs a large volume of inter-frame communication. If we use only a dynamic diffusion algorithm or a prefix code matching algorithm to refine the partitioning of successive unstructured meshes, the number of cut edges among partition boundaries increases. However, when we further apply quadtree-based multi-level refinement, we reduce the number of cut edges and achieve a completely load-balanced partitioning. In terms of the number of cut edges, our experiments show that our approach performs best and is even superior to using a static algorithm to partition each mesh independently.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Ontology-based Resource Matching in the Grid
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Hongsuda Tangmunarunkit (USC-ISI), Stefan Decker (USC-ISI), Carl Kesselman (USC-ISI)
   
  Description:
  The Grid is an emerging technology for enabling resource sharing and coordinated problem solving in dynamic multi-institutional virtual organizations. In the Grid environment, shared resources and users typically span different organizations. The resource matching problem in this environment involves assigning resources to tasks in order to satisfy task requirements and resource policies. These requirements and policies are often expressed in disjoint application and resource models, forcing a resource selector to perform semantic matching between the two. In this poster, we propose a flexible and extensible approach to resource matching in the Grid using semantic web technologies. Instead of exact syntax matching, our approach performs semantic matching between resources and requests. We have designed and prototyped an ontology-based resource selector that exploits ontologies, background knowledge, and rules to solve resource matching in the Grid.

Grids are used to join various geographically distributed computational and data resources, and deliver these resources to heterogeneous user communities. These resources may belong to different institutions, have different usage policies and pose different requirements on acceptable requests. Grid applications, at the same time, may have different constraints that can only be satisfied by certain types of resources with specific capabilities. Before resources can be allocated to run an application, a user or agent must select resources appropriate to the requirements of the application. We call this process of selecting resources based on application requirements "resource matching". In a dynamic Grid environment, where resources may come and go, it is desirable and sometimes necessary to automate the resource matching to robustly meet application requirements.

Existing resource description and resource selection in the Grid are highly constrained. Traditional resource matching, as exemplified by the Condor Matchmaker or the Portable Batch System, is based on symmetric, attribute-based matching. In these systems, the values of attributes advertised by resources are compared with those required by jobs. For the comparison to be meaningful and effective, the resource providers and consumers have to agree upon attribute names and values. The exact matching and coordination required between providers and consumers make such systems inflexible and difficult to extend to new characteristics or concepts. Moreover, in a heterogeneous multi-institutional environment such as the Grid, it is difficult to enforce the syntax and semantics of resource descriptions.

In this poster, we propose a novel, flexible, and extensible approach to Grid resource selection using an ontology-based matchmaker. Unlike traditional Grid resource selectors that describe resource/request properties with symmetric flat attributes (which become unmanageable as the number of attributes grows), separate ontologies (i.e., semantic descriptions of domain models) are created to declaratively describe resources and job requests in an expressive ontology language. Instead of exact syntax matching, our ontology-based matchmaker performs semantic matching using terms defined in those ontologies. The loose coupling between resource and request descriptions removes the tight coordination requirement between resource providers and consumers. In addition, our matchmaker can easily be extended, by adding vocabularies and inference rules, to include new concepts about resources and applications (e.g., Unix compatibility) and to adapt resource selection to changing policies.

We have designed and prototyped our matchmaker using existing semantic web technologies to exploit ontologies and rules (based on Horn logic and F-Logic) for resource matching. In our approach, resource and request descriptions are asymmetric. Resource descriptions, request descriptions, and usage policies are all independently modeled and syntactically and semantically described using a semantic markup language, RDF Schema. Domain background knowledge (e.g., "SunOS and Linux are types of Unix operating system"), captured in terms of rules, is added for conducting further deduction (e.g., a machine with a "Linux" operating system is a candidate for a request for a "Unix" machine). Finally, matchmaking procedures written as inference rules are used to reason about the characteristics of a request, the available resources, and usage policies to find a resource that satisfies the request's requirements. Additional rules can also be added to automatically infer resource requirements from the characteristics of domain-specific applications (e.g., 3D finite difference wave propagation simulation) without explicit statements from the user.
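
The flavor of this rule-based deduction can be illustrated with a small, self-contained sketch; the toy "ontology" and the resource and request descriptions below are invented, and the actual matchmaker uses RDF Schema with Horn/F-Logic rules rather than Python.

    # Toy illustration of semantic matching via background knowledge; the tiny
    # ontology and resource/request descriptions are invented examples.
    subclass_of = {"Linux": "Unix", "SunOS": "Unix", "Unix": "OperatingSystem"}

    def is_a(concept, target):
        # Walk the subclass hierarchy: Linux is_a Unix, Unix is_a OperatingSystem.
        while concept is not None:
            if concept == target:
                return True
            concept = subclass_of.get(concept)
        return False

    resources = [{"name": "hpc01", "os": "Linux", "cpus": 64},
                 {"name": "hpc02", "os": "SunOS", "cpus": 8}]
    request   = {"os": "Unix", "min_cpus": 32}

    matches = [r["name"] for r in resources
               if is_a(r["os"], request["os"]) and r["cpus"] >= request["min_cpus"]]
    print(matches)   # ['hpc01']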

Our work on the ontology-based matchmaker is an ongoing effort. The poster will present the architectural design of the matchmaker, its components, and usage examples. If accepted, we will use the poster during the session to illustrate this new technology and its pros and cons compared to the traditional approach. We will also show a brief demo of our prototype and how it is being used with a real application.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Grid performance prediction with performance skeletons
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Sukhdeep Sodhi (University of Houston), Jaspal Subhlok (University of Houston)
   
  Description:
  Problem statement:
The goal of this work is to enable rapid estimation of the expected performance of a distributed application on shared execution environments such as a computational grid or a shared compute cluster. This is a hard problem since the availability of resources changes dynamically due to network and CPU sharing, but it is also an important problem that must be tackled for good resource selection and resource management.

Approach:
Our framework for performance estimation is based on the novel concept of "performance skeletons". A performance skeleton is a synthetically generated short-running program that has the same fundamental execution characteristics as the application it represents (but with no semantic relevance). The execution time of a performance skeleton program on a given set of nodes under existing resource availability is expected to reflect the execution time of the corresponding full application under the same conditions, but scaled down by one or more orders of magnitude. For illustration, if the performance skeleton executes in 1/20th of the time it takes to execute the corresponding application on a dedicated testbed (no resource sharing), then the execution time of the skeleton should always be 1/20th of the application's under any kind of sharing of network, CPU, or other resources.

The key challenge in this research is automatic construction of the performance skeleton of an application. This skeleton must be representative of the application's execution behavior, yet execute for a short time. Our framework for performance skeleton construction consists of the following major steps:
1. The application is executed on a controlled testbed and all system activity such as CPU and communication usage during execution is recorded.
2. The application execution information collected above is analyzed and summarized by identifying repeated execution patterns. The result is a compact representation of the execution activity history.
3. The skeleton program source code is automatically generated. This skeleton program mimics the execution activity of the application as recorded above except that repeated actions are scaled down by a constant factor.

In order to estimate application execution time under existing system and network conditions, the performance skeleton is executed and the execution time of the skeleton program is multiplied by the compaction factor to obtain an estimate of the application execution time under those conditions.
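
A worked example of this estimation step, with purely illustrative numbers:

    # Worked example of the estimation step; all numbers are illustrative only.
    compaction_factor = 20          # skeleton runs ~20x shorter than the application
    skeleton_time_dedicated = 30.0  # seconds on a dedicated testbed
    skeleton_time_shared = 48.0     # seconds measured under current resource sharing

    app_time_dedicated = skeleton_time_dedicated * compaction_factor
    predicted_app_time = skeleton_time_shared * compaction_factor
    print(app_time_dedicated)       # 600.0 seconds on the dedicated testbed
    print(predicted_app_time)       # 960.0 seconds predicted under current sharing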

To the best of our knowledge this is the first effort in the specific direction of estimating application performance by running a representative program. The state of the art is to estimate performance based on resource availability information, provided by tools like NWS. We argue and demonstrate that our approach is fundamentally more accurate and eliminates estimation errors due to inaccurate network measurements and predictions.

Status:
We have implemented our framework for the MPICH implementation of distributed memory message passing MPI programs. This implementation has been tested using the NAS benchmark suite under a variety of network and node sharing scenarios. Our results show that the skeleton's execution time mirrors the corresponding application's execution time fairly accurately. The average prediction error for different resource sharing scenarios was in the range 4-12%. Across all scenarios the absolute prediction error was always less than 20% with an average of 9%.

Our current implementation targets predicting the change in performance due to competing compute-bound and communication-bound jobs; it does not adequately predict performance across different system architectures. We are in the process of modeling memory access behavior in the construction of skeletons, which is an important consideration if a prediction is to be made for a system with a memory hierarchy different from that of the prototyping system. Also, the current system generates a fixed program skeleton, but ongoing work is developing procedures that can manage the tradeoff between accuracy and execution time of skeletons.

Presentation:
The poster will consist of schematics and short descriptions of the following:
1) An explanation of the concept of a performance skeleton, its significance for performance estimation, and the challenge of automatically constructing performance skeletons.
2) A block-diagrammatic representation of our framework for constructing performance skeletons along with a running example.
3) A suite of results (bar charts) that compare the performance estimates obtained with automatically constructed skeletons in diverse conditions with the actual performance measured under the same conditions.
4) Illustrations of factors that can lead to inaccurate estimates and how they are being addressed in ongoing work.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Inca Test Harness and Reporting Framework
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Shava Smallen (San Diego Supercomputer Center), Pete Beckman (Argonne National Laboratory), Michael Feldmann (California Institute of Technology), Tim Kaiser (San Diego Supercomputer Center), Catherine Olschanowsky (San Diego Supercomputer Center), Jennifer Schopf (Argonne National Laboratory)
   
  Description:
  One fundamental challenge for Grid users today is the lack of a stable and consistent software environment throughout the participating sites of their virtual organization. This lack of consistency imposes a high overhead on adapting an application for use on a Grid; users must manually log into each site, learn the environment, verify what is working, and then configure their application specifically for that site. Efforts including NMI and NPACKage are addressing part of this problem by providing automated installation of many popular Grid software packages, thereby encouraging more consistent software environments. However, given the autonomous and heterogeneous nature of each participating site, it is critical to supplement this with testing, verification, and monitoring at each site to verify installation and compliance to policies, e.g., all necessary packages are installed, updates and patches are deployed on schedule, etc. It is also critical that data be accessible in a format appropriate to various target groups from system operators, who will be monitoring the system, to users, who may want to troubleshoot problems specific to their application.

The TeraGrid is a national-scale effort to build a production Grid for large-scale scientific applications. Built on top of a 40 Gb/s backbone, the initial 5 sites (ANL, CACR, NCSA, PSC, and SDSC) provide 20 TF of computing power and 1 PB of data storage. The TeraGrid Hosting Environment refers to the specific list of packages and environment variables that are deployed and supported at every participating TeraGrid site. One TeraGrid goal is to increase the usability of TeraGrid resources by providing a single environment that users can adapt their application to and then expect to be available at all TeraGrid sites, also called TeraGrid roaming.

The purpose of the Inca Test Harness and Reporting Framework (or Inca) is to provide a flexible framework for automated testing, verification, and monitoring of the TeraGrid Hosting Environment. Inca is composed of a set of reporters that interact with the system and report status information, a harness that provides basic control of the reporters and collection of information, and clients that provide an interface to the information collected by the reporters, for example through a graphical interface or command line tools.

A reporter is a pluggable component implemented as a script or executable that performs a test, benchmark, or query and outputs a result in XML. Reporters are grouped together into suites to provide specific sets of data needed by TeraGrid users or operators. For example, a user can define a suite of reporters that check the basic functionality needed for a specific application run, or an operator can define a suite to verify the successful installation of all packages defined in a software stack.
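
The sketch below shows what such a reporter might look like: it runs a simple check and emits a small XML status record. The element names and the check itself are hypothetical; Inca's actual reporter schema and APIs may differ.

    # Sketch of a reporter: run a check and emit a small XML status record.
    # The element and attribute names are invented, not Inca's actual schema.
    import shutil, socket, sys
    import xml.etree.ElementTree as ET

    report = ET.Element("report", name="globus_client_check",
                        host=socket.gethostname())
    found = shutil.which("globus-url-copy") is not None
    ET.SubElement(report, "result").text = "pass" if found else "fail"
    ET.SubElement(report, "detail").text = ("globus-url-copy on PATH" if found
                                            else "globus-url-copy not found")

    sys.stdout.write(ET.tostring(report, encoding="unicode"))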

The harness is the control framework of Inca and handles the execution and data collection details of a reporter suite. The harness has two basic modes of operation: one-shot and continuous. One-shot mode is used to run a set of reporters all at once for a specific purpose. For example, an operator or user would run the above mentioned reporter suites in the one-shot mode. The harness is also capable of executing in a continuous monitoring mode (i.e., executing reporters on a periodic schedule) useful for proactively verifying service availability and checking system performance at each site over a period of time. The results are currently centrally collected, cached, published into the MDS, and optionally archived using RRDTool. Future improvements include multiple collection sites, a hierarchical framework, and more advanced archiving capabilities.

The third component of Inca is a set of clients, which are enabled through a publicly accessible interface to the reporter data. Currently, a graphical web page shows current system status, with links to performance graphs and other data. In the future, notification mechanisms will be added to inform system administrators of problems or unexpected changes in status enabling quicker response times.

Inca is novel in that it is the first system we know of that is specifically designed for testing, verifying, and monitoring a Grid software environment. It provides a modular and extensible framework that is used to collect status information from the system, ultimately contributing to a more stable environment for users. Furthermore, its applicability is not limited to TeraGrid.

This poster will describe the architecture of Inca and how it is deployed on the TeraGrid including a description of the reporter APIs and an overview of the currently available reporters. In addition, we will discuss the harness operation modes, archiving capabilities, and display the web page front-end.
  Link: --
   

 

     
  Session: Poster Reception
  Title: PAPI Version 3
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Jack Dongarra (University of Tennessee), Kevin London (University of Tennessee), Shirley Moore (University of Tennessee), Philip Mucci (University of Tennessee), Daniel Terpstra (University of Tennessee), Haihang You (University of Tennessee), Min Zhou (University of Tennessee)
   
  Description:
  PAPI is a cross-platform interface to the hardware performance counters that are available on most modern microprocessors. PAPI has been implemented for most major HPC platforms and has been widely adopted by application developers as well as by developers of third-party performance analysis tools. Experiences with PAPI have necessitated a major redesign to address problems encountered with the current version and to implement new features requested by users and tool developers.

The PAPI counter interface necessarily has some overhead. Part of this overhead comes from the underlying native counter interface used by PAPI and part of it comes from PAPI itself. Version 3 of PAPI is being streamlined to reduce the runtime overhead and memory footprint of PAPI and thus cause as little runtime overhead and perturbation as possible for the application being measured. The poster will present performance measurements of the prototype version of PAPI 3 on several platforms and illustrate how much of the overhead is due to the underlying native counter interface and how much is due to PAPI itself.

Previous versions of PAPI have provisions for calculating derived events from multiple native events and for accessing native events, in addition to the PAPI standard events, through the PAPI interface. Restrictions in the previous implementation have led to a redesign of the PAPI counter allocation scheme and derived event calculation to make them more efficient and powerful. The native event interface has also been redesigned to be easier to use. The poster will illustrate the new capabilities available.

Software multiplexing, which uses time-slicing to allow more counters to be accessed simultaneously than are supported in hardware, has been improved in PAPI 3 to use all available physical counters and to use the platform's native high-resolution virtual cycle counter for calibration where possible. These improvements will enable more accurate counter values when using multiplexing. An option is also being implemented that will enforce monotonically increasing counter values when using multiplexing. The poster will illustrate the improved accuracy of the new multiplexing scheme.
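
The extrapolation idea behind software multiplexing can be illustrated with a back-of-the-envelope sketch: each event is resident on a physical counter for only part of the run, so its raw count is scaled by the fraction of time it was actually counted. The numbers below are invented, and the calculation is a simplification of what PAPI does internally.

    # Back-of-the-envelope illustration of software multiplexing: each event is
    # only resident on a physical counter for part of the run, so its raw count
    # is scaled up by the fraction of time it was actually being counted.
    # All numbers are invented for the example.
    total_time = 10.0                         # seconds of measured execution
    events = {                                # raw count, seconds spent on a counter
        "PAPI_TOT_INS": (4.0e9, 5.0),
        "PAPI_L2_DCM":  (1.2e7, 2.5),
        "PAPI_FP_OPS":  (9.0e8, 2.5),
    }

    for name, (raw, resident) in events.items():
        estimate = raw * (total_time / resident)
        print(f"{name}: ~{estimate:.3g} (extrapolated from {resident}s of counting)")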

In addition to aggregate counting mode, PAPI 3 will support event sampling and tracing through new API calls. Sampling will read event buffers at time intervals, tracing at buffer fill. The poster will illustrate how the new API calls work and show sample output from using them.
  Link: --
   

 

     
  Session: Poster Reception
  Title: A Parallel High Performance Inductance Extraction Algorithm
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Hemant Mahawar (Texas A&M University), Vivek Sarin (Texas A&M University)
   
  Description:
  Problem Statement
The next-generation VLSI circuits will be designed with billions of components and millions of interconnect segments. The signal delays due to parasitic resistance (R), inductance (L), and capacitance (C) must be taken into account for timing verification and signal integrity checks. At frequencies in the gigahertz range, signal delays will be dominated by parasitic inductance. As a result, fast and accurate inductance extraction techniques are going to be critical to the design of future VLSI circuits.

This work presents ParIS, a fast, memory-efficient, object-oriented, parallel inductance extraction software package capable of solving the inductance extraction problem for modest-sized VLSI circuits. On a set of benchmark problems, a serial implementation of this software is up to 5 times faster than FastHenry, a popular inductance extraction package, with only one-fifth of the memory requirements.

Approach
The inductance extraction problem consists of determining the effective impedance of a set of conductors in a particular geometric configuration. Each conductor is discretized with a two-dimensional uniform grid whose edges represent current-carrying filaments. The potential drop across a filament is due to the inductive effect of currents in other filaments as well as to its own resistance. At the nodes of the grid, the net outgoing current is zero in accordance with Kirchhoff's law. These conditions give rise to complex symmetric linear systems, which are solved with multiple right-hand sides to obtain the effective impedance matrix.
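
For illustration, the toy sketch below solves a small dense complex symmetric system for multiple right-hand sides with NumPy; ParIS itself relies on iterative solvers with multipole-accelerated matrix-vector products rather than a dense factorization, and the matrix here is random rather than derived from any circuit.

    # Toy dense example of a complex symmetric system with multiple right-hand
    # sides; not the ParIS solver, which is iterative and multipole-accelerated.
    import numpy as np

    n, nrhs = 6, 2
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = A + A.T                 # complex symmetric (A == A^T, not Hermitian)
    B = rng.standard_normal((n, nrhs)) + 1j * rng.standard_normal((n, nrhs))

    X = np.linalg.solve(A, B)   # one solve per right-hand-side column
    print(np.allclose(A @ X, B))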

In ParIS, each conductor contributes a local transformed system to the coefficient matrix, comprising both sparse and dense sub-matrices. Entries of the sparse sub-matrix represent filament resistances, and the dense sub-matrix represents the influence of each filament on all the filaments present in the system. Hierarchical multipole-based algorithms are used to compute matrix-vector products with the dense matrices in linear time. In our algorithm, the filaments in each conductor are represented in a tree structure. Conductors are distributed across processors, and all computations except the mutual interactions between conductors are done locally. For mutual interactions, each conductor takes the contribution of other filaments from the corresponding root, or from some level down the tree as defined by the error criteria. This step alone accounts for the communication cost, which decreases parallel efficiency. Local computations are parallelized using OpenMP directives, and MPI is used for interprocessor communication.

Status
Our parallel code has demonstrated high parallel efficiency on a set of benchmark problems on the IBM p690 and SGI Origin for problems with several hundred thousand unknowns. We are in the process of developing the software in C++ with capabilities for handling more complex problems and expect to obtain similar speedups.

Presentation
The poster will describe in detail the algorithm used in ParIS and its parallel implementation. We will present the results of several benchmark problems solved on the IBM p690 multiprocessor. The poster will provide a detailed analysis of the parallel performance and present comparisons with existing software packages.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Ouroboros: A Tool for Building Generic, Hybrid, Divide & Conquer Algorithms
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  John R. Johnson (University of Chicago / Lawrence Livermore National Laboratory), Ian Foster (University of Chicago / Argonne National Laboratory)
   
  Description:
  A hybrid divide and conquer algorithm is one that switches from a divide and conquer to an iterative strategy at a specified problem size. Such algorithms can provide significant performance improvements relative to alternatives that use a single strategy. However, the identification of the optimal problem size at which to switch for a particular algorithm and platform can be challenging. We describe an automated approach to this problem that first conducts experiments to explore the performance space on a particular platform and then uses the resulting performance data to construct an optimal hybrid algorithm on that platform. We implement this technique in a tool, Ouroboros, that automatically constructs a high-performance hybrid algorithm from a set of registered algorithms. We present results obtained with this tool for several classical divide and conquer algorithms, including matrix multiply and sorting, and report speedups of up to six times achieved over non-hybrid algorithms.
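
The switch-over idea can be illustrated with a hybrid mergesort that falls back to an iterative insertion sort below a crossover size, together with a crude timing loop that probes candidate crossover values; this sketch is only an illustration of the technique, not the Ouroboros tool itself.

    # Hybrid divide-and-conquer sketch: mergesort that switches to insertion sort
    # below a crossover size, plus a crude empirical probe of crossover values.
    import random, time

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    def hybrid_mergesort(a, crossover):
        if len(a) <= crossover:
            return insertion_sort(a)          # iterative strategy below the crossover
        mid = len(a) // 2
        left = hybrid_mergesort(a[:mid], crossover)
        right = hybrid_mergesort(a[mid:], crossover)
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    data = [random.random() for _ in range(20000)]
    for crossover in (1, 8, 32, 128):
        t0 = time.perf_counter()
        hybrid_mergesort(list(data), crossover)
        print(crossover, round(time.perf_counter() - t0, 4), "s")
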
  Link: --
   

 

     
  Session: Poster Reception
  Title: Ultra-Large-Scale Molecular-Dynamics Simulation of 19 Billion Interacting Particles
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Kai Kadau (Theoretical Division, Los Alamos National Laboratory, MS B262, Los Alamos, New Mexico 87545, U.S.A., E-mail: kkadau@lanl.gov), Timothy C. Germann (Applied Physics Division, Los Alamos National Laboratory, MS F699, Los Alamos, New Mexico 87545, U.S.A., E-mail: tcg@lanl.gov), Peter S. Lomdahl (Theoretical Division, Los Alamos National Laboratory, MS B262, Los Alamos, New Mexico 87545, U.S.A., E-mail: pxl@lanl.gov)
   
  Description:
  In the early 1990s, Lomdahl, Beazley, and coworkers did pioneering work on large-scale molecular-dynamics (MD) simulations(1) with up to 131 million particles on a variety of parallel platforms, including the Connection Machine 5. Ever since, large-scale MD simulations have been used intensively to gain insight into atomic processes on the picosecond time scale for various solid-state materials science issues. It has been demonstrated that on today's largest parallel supercomputers MD simulations with one billion atoms are feasible(2), and in 2000 Roth et al. ran some five billion atoms for five integration time steps on a Cray T3E using 512 processors(3).

However, such ultra-large-scale simulations present an analysis problem due to the enormous datasets that are produced(4). For example, the memory needed to save one configuration of one billion atoms in single precision is more than 15 GByte, which makes it problematic and inconvenient to save all the raw data and analyze it in a postprocessing fashion, even on modern platforms with TBytes of hard-disk capacity. Rather, an on-the-fly analysis during the simulation run is desired.

The MD code SPaSM (Scalable Parallel Short-range Molecular dynamics) has multiple tools to analyze the simulation results during the course of the simulation(1,4). The methods implemented include the generation of pictures of configurations (particle as well as cell based) and the computation of local and global quantities such as stress tensor elements, temperature, energy, displacement, and the centrosymmetry parameter. It also allows for calculating histograms and correlation functions for the whole system as well as for subsets of the system.

We have performed parallel ultra-large-scale molecular-dynamics simulations on the QSC machine at Los Alamos. The good scalability of the SPaSM code is demonstrated together with its capability for efficient data analysis at enormous system sizes of up to 19,000,416,964 particles. To the best of our knowledge this is the largest atomistic simulation reported to date(5). Furthermore, we introduce a newly developed graphics package that renders, in a very efficient parallel way, the huge number of spheres necessary for the visualization of atomistic simulations. This capability was realized by implementing an in-house graphics library in ANSI C, using MPI for communication between processing nodes. Since no (hardware-oriented) graphics library is needed, the software is portable and easy to implement on various architectures, including large parallel platforms. An example of the use of a prototype of the new graphics code together with the SPaSM library can be found in (6).

These abilities pave the way for future atomistic large-scale simulations of physical problems with system sizes on the micrometer scale, which have so far only been investigated theoretically by continuum methods. Moreover, the flexibility of the SPaSM library can be exploited in other ways, for instance by opening up the possibility of large-scale agent-based simulations. Smaller-scale agent-based models for the spread of epidemics(7) could now be scaled up to encompass the entire world population of approximately 6.3 billion people.

The poster presentation will be assisted by a laptop showing movies and additional material.

Literature:
1.) P. S. Lomdahl, P. Tamayo, N. Gronbech-Jensen, and D. M. Beazley, in Proceedings of Supercomputing 93, ed. G. S. Ansell (IEEE Computer Society Press, Los Alamitos, CA, 1993), p. 520.
2.) F. F. Abraham, R. Walkup, H. Gao, M. Duchaineau, T. D. De La Rubia, and M. Seager, PNAS 99, 5783 (2002).
3.) J. Roth, F. Gaehler, and H.-R. Trebin, Int. J. Mod. Phys. C 11, 317 (2000).
4.) D. M. Beazley and P. S. Lomdahl, Computers in Physics 11, 230 (1997).
5.) K. Kadau, P. S. Lomdahl, and T. C. Germann, submitted to Int. J. Mod. Phys. C (2003).
6.) K. Kadau, T. C. Germann, P. S. Lomdahl, and B. L. Holian, Science 296, 1681 (2002).
7.) M. E. Holloran, I. M. Longini Jr., A. Nizam, and Y. Yang, Science 298, 1428 (2002).
  Link: --
   

 

     
  Session: Poster Reception
  Title: Large-scale Molecular Dynamics Simulations of Hybrid Inorganic-Organic Systems on Multi-Teraflop Linux Clusters and High-end Parallel Supercomputers
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Satyavani Vemparala (Louisiana State University), Bijaya B. Karki (Louisiana State University), Hideaki Kikuchi (Louisiana State University), Rajiv K. Kalia (University of Southern California), Aiichiro Nakano (University of Southern California), Priya Vashishta (University of Southern California), Shuji Ogata (Nagoya Institute of Technology), Subhash Saini (NASA Ames Research Center)
   
  Description:
  Problem Statement:
Recent advances in biomedical nanotechnologies have made it imperative to perform very large-scale parallel molecular dynamics (MD) simulations of hybrid inorganic-organic systems involving millions to billions of atoms. An example is the self-assembly of semiconductor quantum-dot nanostructures on a 2D array of proteins. Spatial decomposition, combined with spatiotemporal multiresolution algorithms, can achieve high scalability in large-scale hybrid inorganic-organic MD simulations, but this requires nontrivial handling of constrained dynamics and bonded interactions for organic macromolecules.

Approach:
We have developed a scalable parallel MD algorithm for organic macromolecules coupled with inorganic materials, based on a novel spatial-decomposition scheme and space-time multiresolution algorithms. The hybrid inorganic-organic molecular dynamics (HIO-MD) algorithm employs: 1) a novel parallel implementation of the fast multipole method (FMM), which computes the electrostatic contribution to the atomic stress tensors in O(N) time based on a complex-charge method (CCM); 2) penalty-function-based constrained MD combined with a symplectic multiple time-scale (MTS) method; 3) dynamic management of distributed data structures for bonded force computations involving atomic n-tuples; and 4) local-topology-preserving computational-space decomposition for latency minimization and load balancing.

1) The fast multipole method (FMM) developed by Greengard and Rokhlin calculates electrostatic energies and forces for a collection of N charged particles with O(N) operations and predictable error bounds. In MD simulations of realistic materials, microscopic stress tensors are often required for constant-pressure simulations and local stress analysis. However, no general O(N) method has been reported in the existing literature to calculate stress tensors due to Coulomb interactions. We have developed the complex-charge method (CCM) to calculate the Coulomb contribution to the stress tensor field. The CCM attaches information about the particle position to a complex-valued charge. The stress tensor is computed by finite differencing of the resulting complex-valued electrostatic field, without the need for multipole translation operators, which are unknown for the stress tensor. Our parallel FMM code is applicable to various boundary conditions, including periodic boundary conditions in two and three dimensions, corresponding to slab and bulk geometries, respectively.

2) To avoid the sequential procedures in constrained MD simulations, the bond-length constraints are replaced by penalty functions. To efficiently handle the stiff penalty forces, we employ a multiple time-scale (MTS) method, which reduces the number of force computations significantly by separating the fast (small time-scale) modes from the slow (large time-scale) modes. Our MTS algorithm has doubly nested loops, in which the outer loop computes the non-bonded forces while the inner loop computes the bonded forces (a schematic sketch follows this list). The MTS algorithm is reversible and symplectic, resulting in excellent long-time stability of solutions.

3) To remove the need for global connectivity information in conventional parallel organic MD algorithms, we have developed dynamic list management techniques for the computation of bonded interactions involving atomic n-tuples. Time complexities for computing bond-stretching, bond-bending, and torsion forces are O(N) for N atoms with small prefactors that are low-order polynomials of the vertex degree (or the number of bonded atoms).

4) Our parallelization framework preserves the local 3D mesh topology so that message passing is performed in a structural way in only 6 steps, thus minimizing the latency. Load balancing is achieved by relaxing the global 3D mesh topology, such that the union of subsystems compactly covers the simulated system, and introducing a computational space, which shrinks in a region of high workload density.
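
As a schematic illustration of the doubly nested MTS loop mentioned in item 2, the toy Python sketch below integrates a single 1D particle with a stiff "bonded/penalty-like" force on a small inner time step and a soft "non-bonded-like" force on a larger outer time step; the forces and parameters are invented, and this is not the HIO-MD integrator.

    # Schematic reversible multiple time-scale (RESPA-style) loop for one 1D
    # particle: a stiff penalty-like force on a small inner step, a soft
    # non-bonded-like force on a larger outer step. Toy forces and parameters.
    def fast_force(x):     # stiff, bonded/penalty-like force
        return -1000.0 * x

    def slow_force(x):     # soft, non-bonded-like force
        return -1.0 * x

    x, v, m = 1.0, 0.0, 1.0
    dt_outer, n_inner = 0.01, 10
    dt_inner = dt_outer / n_inner

    for step in range(1000):
        v += 0.5 * dt_outer * slow_force(x) / m        # half kick, slow force
        for _ in range(n_inner):                       # inner loop, fast force
            v += 0.5 * dt_inner * fast_force(x) / m
            x += dt_inner * v
            v += 0.5 * dt_inner * fast_force(x) / m
        v += 0.5 * dt_outer * slow_force(x) / m        # half kick, slow force

    print(round(x, 4), round(v, 4))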

Status of Work:
Scalability of the HIO-MD algorithm has been tested on the 1024-processor Intel Xeon-based Linux cluster at Louisiana State University, the 1184-processor IBM SP4 system at the Naval Oceanographic Office (NAVO) Major Shared Resource Center (MSRC), and the 512-processor Compaq AlphaServer at the U.S. Army Engineer Research and Development Center (ERDC) MSRC. The parallel efficiency of the FMM code is 0.98 for 512 million charged particles on 512 IBM SP4 processors. The parallel efficiency of the HIO-MD code, which incorporates the FMM code, is 0.87 for a 21.6 million-atom system on 1024 SP4 processors. The scalability tests also show significant effects of the sharing of memory and L2 cache on the performance.

Short Description of Presentation:
The poster will contain a concise description of FMM and HIO-MD algorithms as well as graphs and tables showing performance results on various architectures. We will also present graphics and movies showing the application of the HIO-MD code to the design of electrically-switching organic nano-films.
  Link: --
   

 

     
  Session: Poster Reception
  Title: Data Mining on the TeraGrid
  Chair: Michelle Hribar (Pacific University) and Karen L. Karavanic (Portland State University)
  Time: Tuesday, November 18, 5:01PM - 7:00PM
  Rm #: Lobby 2
  Speaker(s)/Author(s):  
  Helen Conover (University of Alabama in Huntsville), Sara J. Graves (University of Alabama in Huntsville), Rahul Ramachandran (University of Alabama in Huntsville), Sandra Redman (University of Alabama in Huntsville), John Rushing (University of Alabama in Huntsville), Steven Tanner (University of Alabama in Huntsville), Robert Wilhelmson (National Center for Supercomputing Applications)
   
  Description:
  Researchers at the Information Technology and Systems Center (ITSC) at the University of Alabama in Huntsville are currently working on the Modeling Environment for Atmospheric Discovery (MEAD) Expedition, part of the National Center for Supercomputing Applications (NCSA) TeraGrid Alliance program. Led by NCSA, the Alliance program is a partnership of organizations working together to build and prototype a grid infrastructure with applications and tools for specific scientific disciplines. The goal of the MEAD Expedition is the development and adaptation of Grid-enabled cyberinfrastructure for mesoscale storm and hurricane research and education, allowing a MEAD user to configure and integrate model simulations, manage resulting model and derived data, and analyze, mine, and visualize large model data suites in a research (not predictive) context. As a member of this team, ITSC is selecting data mining algorithms targeting the expedition's datasets and problems, and enhancing the algorithms for the TeraGrid's parallel Linux environment. ITSC is also working with the Alliance Portal team and NCSA's data mining group to integrate these services into the MEAD workflow architecture and portal interface to facilitate distributed mining capabilities.

The ITSC-developed Algorithm Development and Mining (ADaM) system [http://datamining.itsc.uah.edu/adam/index.html] is being adapted to help researchers analyze data from tens to hundreds of model simulations (ensembles and parameter studies) in a distributed grid environment. ADaM consists of modular and programmatically scriptable components that can be integrated into the grid operational environment to provide grid mining services for MEAD. Current capabilities of the ADaM system are being extended to offer a comprehensive repository of mining and image processing services where the operations cover a variety of data types. ADaM's flexible, lightweight architecture allows the operations to work with several different data models. Application developers are free to use the databases, queuing systems, schedulers, scripting languages or other tools that are appropriate for their applications. This also makes ADaM an ideal candidate for a grid application in many other disciplines.

Current data mining research for MEAD is exploring the use of texture features for image classification and segmentation. Wrapper-based feature subset selection methods are being used to eliminate noisy, irrelevant, or redundant attributes or features from the dataset in order to improve runtime performance and maximize classifier accuracy. The feature subset selection methods evaluate how well the classifier performs for a number of different feature subsets, and then choose the subset that exhibits the greatest performance. For problems with small numbers of features, it is possible to consider all possible subsets. However, since the number of possible subsets for a problem with N features is 2^N - 1, this is not feasible for problems with large numbers of features. In such cases, it is necessary to use a search or optimization procedure to drive the feature subset selection process.
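
The wrapper idea can be illustrated with a small, self-contained sketch that exhaustively scores every non-empty subset of a four-feature dataset by cross-validation and keeps the best one; it uses scikit-learn and the Iris dataset as stand-ins (this is not the ADaM interface), and exhaustive enumeration of the 2^N - 1 subsets is only feasible for small N.

    # Sketch of wrapper-based feature subset selection: score every non-empty
    # subset of a small feature set by cross-validation and keep the best.
    # scikit-learn and the Iris data are stand-ins, not the ADaM system.
    from itertools import combinations
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    n_features = X.shape[1]

    best_score, best_subset = -1.0, None
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            score = cross_val_score(KNeighborsClassifier(),
                                    X[:, list(subset)], y, cv=5).mean()
            if score > best_score:
                best_score, best_subset = score, subset

    print("best subset:", best_subset, "accuracy: %.3f" % best_score)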

For MEAD, an application has been constructed to automatically select feature subsets for supervised classification problems. The application consists of a set of feature selection/optimization techniques, a set of trainable classifiers, and a set of scripts for constructing and evaluating classifiers. The application works by trying out all the combinations of feature selection techniques and trainable classifiers for the given problem, and reporting the best performance achieved for each configuration. For each combination of techniques, the classifier is trained and tested on several thousand feature subsets. Application scripts were developed in Python and tested on Linux, and then modified for the Globus environment by writing a simple Globus RSL file. The application scripts were modified to run each combination of tools in parallel on a different node on the grid, in order to take advantage of distributed resources for this computationally demanding and data intensive application.

Another mining example in support of one of MEAD's science objectives is the investigation of different cell interactions favorable for subsequent tornado formation. Weather Research and Forecasting (WRF) model simulations with different initial conditions have created a large parameter space of thunderstorm cell interactions and consequent storm development. The team has begun initial efforts to mine this search space for patterns and trends. This study focuses on developing a feature detection algorithm for mesocyclones in the WRF model data and tracking the evolution of these features in time. The team has also begun to investigate the use of unsupervised classification techniques to summarize differences between various WRF model simulations.

This poster will introduce the ADaM system, detail its use in a large-scale, collaborative grid environment, and address the problems described above, with their large volumes of distributed data and computationally intensive analysis requirements.
  Link: --