Plan for LWS/SEC Data Environment
D. A. Roberts, April 2004
Consider the following as a starting point for discussion. The idea is to give people a framework in which to proceed efficiently to a working environment, one with natural upgrade paths, and to make it possible to write more focused calls for proposals.
LWS, and increasingly SEC in general, seeks a system-level understanding of the coupled Sun-Heliosphere-Magnetosphere-Ionosphere-Thermosphere-Mesosphere-Atmosphere system. Many definition teams, MOWGs, and other planning groups have stated that the data environment needed to make such studies possible should be implemented soon, both to show feasibility and to accomplish as much LWS science as possible before the specific missions are launched.
The new scientific challenges will require integrated analysis across spacecraft, regions, models, and disciplines. Of course, all success is predicated on the availability of high-quality data from many agencies and countries, and efforts need to continue to identify gaps and to fund the provision of currently inaccessible datasets.
Equally important is a long-term active archive plan that will ensure the correct, independent use of data, software, and models after the PI-team expertise is no longer available. This will be addressed, in principle, through PDMPs for new missions, but legacy data will also be important.
What follows focuses on the easy discovery, retrieval, and use of data from multiple observatories/spacecraft and models. This is the core of the data environment. I believe we now have a de facto community consensus on the framework of a distributed LWS (and by implication, SEC) data environment that will accomplish these goals. There are no significant technical problems in the way of implementation, and the environment has natural built-in growth paths to make it expandable. Security issues have been dealt with in various ways in each area of the environment. Limited bandwidth would be a problem we could only hope to have, but if it does arise there are known means (grids, MPI, P2P, multiple brokers) of introducing more parallelism, and we can improve the efficiency of data discovery by, for example, the clever use of event catalogues.
The biggest challenge will be to make the data environment sufficiently useful for science that it will be widely used before the community loses interest or funding goes away. (The functionality of such existing services as CDAWeb will also provide metrics against which success can be measured.) This will involve both coordinated development by many people and the tedious work of populating product registries and ensuring that each product can be used through various means. Assuming success, a second challenge will be maintaining and upgrading the environment to meet new scientific needs.
The de facto consensus architecture involves the following elements (to be modified by community input in a manner TBD):
Web-based, machine/application-accessible ("queryable") repositories of data. The three most obvious ways to make data accessible are (1) through a well-organized ftp site, (2) using extended URLs as is done in OPeNDAP, and (3) with a SOAP service, as is now implemented by CDAWeb (and used by VSPO). Other means can be supported by the data environment outlined here.
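As a rough illustration of option (2), the sketch below (in Python; the host, path, and parameter names are purely hypothetical) shows how an application might pull a subset of a dataset by encoding the constraints directly in the URL. The actual syntax would follow whatever convention the repository adopts.

    # Minimal sketch of extended-URL data access; the host, path, and
    # parameter names are hypothetical, not an actual service.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    base = "http://repository.example.gov/data/ace_mag"   # hypothetical repository
    query = urlencode({
        "start": "2003-10-28T00:00:00",   # requested time range
        "stop":  "2003-10-29T00:00:00",
        "parameters": "Bx,By,Bz",         # requested variables
        "format": "ascii",                # let the server do the format conversion
    })
    with urlopen(base + "?" + query) as response:
        data = response.read().decode()   # plain-text table returned by the server
    print(data[:500])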
Registries of products and services. These will provide a uniform description of products and services across the community, and will be based on a Data Model (dictionary) such as those of SPASE, VSO, LWS, and EGSO. (Efforts are underway to unify these at some level.) The ideal is to have all LWS/SEC products registered, and to this end a simple Registration GUI is needed. (This is being worked on.)
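To make the idea concrete, a registry entry might look something like the following sketch (expressed here as a Python dictionary; the field names only loosely follow the spirit of the SPASE/VSO-style dictionaries and are not any actual schema, which the community Data Model would fix).

    # Illustrative product registry entry; keys and values are hypothetical.
    product_entry = {
        "ResourceID": "spase://example/NumericalData/ACE/MAG/L2",
        "ResourceName": "ACE Magnetometer Level 2 Data",
        "Description": "Averages of the interplanetary magnetic field.",
        "ObservedRegion": "Heliosphere.NearEarth",
        "TimeSpan": {"Start": "1997-09-02", "Stop": "ongoing"},
        "AccessURL": "http://repository.example.gov/data/ace_mag",
        "AccessMethod": "extended-URL",    # ftp | extended-URL | SOAP | ...
        "Format": "CDF",
        "Contact": "instrument.team@example.gov",
    }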
Front-end applications. These are the portals through which users access the data and services. The simplest are Web-browser-based extensions of existing repository-specific tools that allow the user to narrow a search for products based on values of keywords. It will also be useful, where feasible, to have direct streaming of data into applications such as IDL, as is now done in some cases by SolarSoft, OPeNDAP, and SDDAS. The front end is what the users see, and thus is crucial to success.
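The keyword narrowing behind such a search form need not be elaborate; the sketch below (in Python, with field names matching the illustrative entry above and a toy in-memory registry) shows the kind of filtering a browser-based front end would perform behind the scenes.

    # Narrow a registry (here just a small in-memory list) by keyword constraints;
    # the entries and field names are illustrative only.
    registry = [
        {"ResourceName": "ACE Magnetometer Level 2",
         "ObservedRegion": "Heliosphere.NearEarth", "Format": "CDF"},
        {"ResourceName": "TIMED/SABER Temperatures",
         "ObservedRegion": "Earth.Mesosphere", "Format": "netCDF"},
    ]

    def narrow(entries, **constraints):
        """Return the entries whose fields match every supplied keyword=value pair."""
        return [e for e in entries
                if all(e.get(k) == v for k, v in constraints.items())]

    for entry in narrow(registry, ObservedRegion="Heliosphere.NearEarth"):
        print(entry["ResourceName"])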
Brokers to connect the repositories to front ends using the information in the registries. Having this extra layer between the user and data or services makes it possible to give the user a uniform means of access. With easily Web-accessible data and good registries, it will be possible to have many of these with varying scope ("VxOs"), some of which will use others to extend their reach. Brokers translate uniform queries into requests to specific repositories in the language of the repository.
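In code, the translation step is little more than a table of adapters, one per repository. The sketch below (repository names, URL patterns, and query fields are all hypothetical) is meant only to show where the uniformity lives: the user supplies one query, and the broker speaks each repository's dialect.

    # A toy broker: one uniform query, per-repository translators (all hypothetical).
    def to_extended_url(query):
        """Translate a uniform query into an OPeNDAP-style extended URL."""
        return ("http://repo-a.example.gov/data/{product}?start={start}&stop={stop}"
                .format(**query))

    def to_soap_body(query):
        """Translate the same query into the body of a SOAP-style request."""
        return ("<getData><product>{product}</product>"
                "<start>{start}</start><stop>{stop}</stop></getData>").format(**query)

    TRANSLATORS = {"repo_a": to_extended_url, "repo_b": to_soap_body}

    def broker(query, repositories):
        """Fan a uniform query out to each repository in its own dialect."""
        return {name: TRANSLATORS[name](query) for name in repositories}

    uniform_query = {"product": "ace_mag", "start": "2003-10-28", "stop": "2003-10-29"}
    for name, request in broker(uniform_query, ["repo_a", "repo_b"]).items():
        print(name, ":", request)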
Services that will, at a minimum, allow the user to rapidly and easily put data into useful form, no matter what the underlying format. Ideally, there will be generic software that can interpret a variety of formats and produce listings and plots (as in CDAWeb or SPIDR, for example, but generalized) as well as a library of routines that will provide value-added features for each data product (as is done by SolarSoft for many solar observations). These services can be Web-enabled and chained, as done by CoSEC.
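The generic-software piece amounts to a dispatch on format. The sketch below uses placeholder readers (the real ones would wrap the CDF, FITS, netCDF, etc. libraries) just to show the shape of the interface: one call, a common in-memory result, whatever the underlying file type.

    # Format-agnostic read service; the non-ASCII readers are placeholders only.
    import os

    def read_cdf(path):
        raise NotImplementedError("would wrap a CDF library")

    def read_fits(path):
        raise NotImplementedError("would wrap a FITS library")

    def read_ascii(path):
        # a plain-text table, skipping comment lines
        with open(path) as f:
            return [line.split() for line in f if not line.startswith("#")]

    READERS = {".cdf": read_cdf, ".fits": read_fits, ".txt": read_ascii, ".dat": read_ascii}

    def read_any(path):
        """Return the data in a common form regardless of the underlying format."""
        ext = os.path.splitext(path)[1].lower()
        if ext not in READERS:
            raise ValueError("no reader registered for %r" % ext)
        return READERS[ext](path)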
Higher-order search capabilities. Ultimately we would like to be able to pose questions such as "When was there a spacecraft at the magnetopause, gathering data, while a CME was arriving?" One important approach to this type of problem is to provide uniform access to Event Lists (or Catalogues; EGSO is beginning to do this) that will include detailed data availability listings as well. Other important aspects will be model-based spacecraft location determinations (as in SSCWeb) and the use of survey-level data (as in OMNIWeb) to find, for example, regions of low solar wind density by direct examination of the data. Again, these capabilities can be Web-service enabled such that intervals found from event catalogues can be passed to the brokers.
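Chaining an event catalogue to a broker is then just a matter of passing intervals. The sketch below (catalogue contents and the broker function are illustrative stand-ins, not any existing service) shows the pattern: select events of interest, then hand their time ranges to the broker.

    # Pass intervals found in an event catalogue on to a (hypothetical) broker.
    catalogue = [
        {"type": "CME_arrival", "start": "2003-10-28T06:00", "stop": "2003-10-28T12:00"},
        {"type": "substorm",    "start": "2003-11-20T17:00", "stop": "2003-11-20T19:00"},
    ]

    def query_broker(product, start, stop):
        """Stand-in for a real broker call; here it just echoes the request."""
        return "request %s for %s .. %s" % (product, start, stop)

    # find CME arrivals, then ask the broker for data in each interval
    for event in catalogue:
        if event["type"] == "CME_arrival":
            print(query_broker("magnetopause_crossings", event["start"], event["stop"]))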
Success will also require users' guides for each product, designed to help cross-disciplinary users understand the typical applications of the product and how to use the data for each application. These will also serve as long-term guides for use when the PI teams are no longer available for consultation; the long-term guides may need extra information on data reduction, etc. Of course, the above technical plan must also be complemented by plans for long-term (active) archiving of datasets and software.
Given the above, the most important things to be done now are:
* Identifying and developing needed but as-yet inaccessible or incomplete data products.
* Automating data reduction for both on-the-fly and preprocessed products.
* Deciding on a Data Model to facilitate Registry construction and data access.
* Identifying and registering all relevant products. Making this easy.
* Enabling repositories to be queryable. Developing a cookbook for this.
* Developing and making available means to use the datasets, independent of format.
* Refining and extending the current prototypes for brokers.
* Refining/reinventing and extending current front-ends, based on user feedback and comparison of many existing services.
Longer term efforts will develop more capable services and search capabilities, as well as APIs to allow tailored product/service access.