Yolanda Gil, PhD
Associate Division Director, Information Sciences Institute of the University of Southern California
and Research Associate Professor in the Computer Science Department

Friday, March 13, 2009
12:00 pm - 1:00 pm
5275 MSC

Semantics and Problem Solving for Planning Computational Workflows:
Large-Scale Distributed Data Analysis for e-Science

Computational workflows have emerged as an important paradigm in large-scale, distributed scientific data analysis. Workflows are abstract representations of interrelated data retrieval and processing tasks and of their mapping to the underlying computational environment. They are used to coordinate the potentially thousands of distributed operations that may be required to obtain a scientific result from raw experimental or simulation data.

Because workflows are declarative representations of computations and the data flow among them, scientists see them as a key enabler for reproducibility and repeatability of results in computational science. Formally, a workflow is a directed graph whose nodes correspond to computations and whose links represent the data flow among them. Creating valid workflows is very challenging, since many algorithms and code implementations may be available to carry out a computation, and each algorithm may have constraints regarding the kind of data it was designed to process. In this talk, I will present our results to date on using Artificial Intelligence (AI) techniques to assist users in specifying valid workflows and to automate the generation of executable workflows that can be submitted to distributed resources.

Our workflow generation techniques exploit semantic metadata to reason about data characteristics and data processing algorithms, assisting users in conducting a systematic exploration of the space of alternative valid workflow designs. This reasoning also results in rich characterizations of planned data products and their provenance, enabling data reuse and other automated optimizations. In a workflow application for seismic hazard analysis, our Wings/Pegasus workflow system exploited AI planning techniques and semantic representations to generate workflows of more than 8,000 computations, create more than 100,000 data products with automatically generated rich metadata descriptions, and manage the execution of the workflow for a total of 1.9 CPU years. In a more recent collaboration, we explored accuracy/quality tradeoffs for biomedical image analysis for neuroscience and cancer research by automatically selecting application parameters. We are now investigating how to assist users in creating data analysis protocols for genomics by suggesting valid workflows that implement the intended protocol while complying with the constraints of the algorithms used to realize the protocol steps. I will conclude with an overview of the research challenges that lie ahead and the broader benefits of having scientific workflows more widely adopted.

Brief Biography
Dr. Yolanda Gil is Associate Division Director at the Information Sciences Institute of the University of Southern California, and Research Associate Professor in the Computer Science Department. She received her M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University. Dr. Gil leads a group that conducts research on various aspects of Interactive Knowledge Capture. Her research interests include intelligent user interfaces, knowledge-rich problem solving, scientific and grid computing, and the semantic web. An area of recent interest is large-scale distributed data analysis through knowledge-rich computational workflows.

Dr. Gil was Program Chair of the Intelligent User Interfaces (IUI) Conference in 2002, was co-founder and co-chair of the First International Conference on Knowledge Capture (K-CAP) in 2001, and was program co-chair of the International Semantic Web Conference (ISWC) in 2005. She was elected to the Council of the American Association for Artificial Intelligence (AAAI), and was program co-chair of the AAAI conference in 2006. She serves on the Advisory Committee of the Computer Science and Engineering Directorate of the National Science Foundation.

