ADS Dataset Verification and Resolution Services

This document contains information about the Dataset Verification and Linking efforts underway among the NASA Archives and Data Centers and the University of Chicago Press (publisher of ApJ, AJ and PASP). This activity has taken place under the auspices and guidance of the NASA Astrophysics Data Centers Executive Council (ADEC) and aims at fulfilling the promise of further integrating astronomical literature and the on-line data it is based upon.

The NASA Astrophysics Data System (ADS) will provide publishers and users at large with the tools necessary for both dataset verification and linking, through stable, top-level services that can be maintained for the foreseeable future. Links created to datasets from on-line manuscripts will always refer to a dataset via a URI created using a well-defined identifier, and the URI will be turned into one or more URLs in real time by a central resolver provided by ADS. This will give the links a high level of reliability and persistence, and will provide an upgrade path into any future VO efforts in this direction.

Overview

Dataset citation, verification and linking will work as follows:
  1. Astronomy data centers and archives will start attaching permanent dataset identifiers to the data they distribute.
  2. Astronomers will write papers referencing the dataset they have used. As per the instructions given to them by the AAS, they will start using the appropriate markup to identify datasets in the papers.
  3. During the publishing pipeline, UCP will extract the identifiers and send a query to a central dataset identifier verification service (hosted by ADS) to find out whether (a) the dataset is valid and (b) a URL can be associated with it.
  4. The central dataset identifier verification service will query a number of (relevant) datacenters using its own protocol, will cache the results, and will return a status flag to the UCP query, indicating if a dataset is known or not.
  5. For the dataset identifiers that are known, URLs can be built from the base URL of a dataset identifier resolver and the dataset identifier itself, e.g.
     http://vo.ads.harvard.edu/dv/DataResolver.cgi?ADS/Sa.ROSAT#X/701576n00
     If the verification is successful, UCP should include such a URL in its on-line article.
  6. When the article goes on-line, a user clicking on the link associated with the dataset will be taken initially to the URL above. What happens next depends on whether ADS has one or more datacenters claiming to have data for this dataset (there could even be different mirror sites for a given data center). If only one final URL is available for the dataset in question, our CGI script can simply forward the user to it. If more than one URL is available, we could display a simple menu listing all the information we have about the available links.
ADS will take the responsibility of maintaining services that are aware of all relevant datacenters that may have datasets available on-line, and which datasets are available from which data centers.
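
Steps 5 and 6 above can be sketched in a few lines of Python. This is an illustration only: the base URL is the one shown in the example of step 5, while the function names, the return values and the "links" list are hypothetical and do not describe the actual ADS CGI script. The optional percent-encoding of the "#" character is likewise an assumption, added because many clients treat "#" as the start of a URL fragment.

```python
from urllib.parse import quote

# Base URL of the resolver, taken from the example in step 5.
RESOLVER_BASE = "http://vo.ads.harvard.edu/dv/DataResolver.cgi"


def resolver_url(identifier, encode=False):
    """Build a resolver URL for a verified dataset identifier.

    With encode=False the identifier is appended verbatim, as in step 5.
    With encode=True the '#' is percent-encoded so that clients do not
    treat it as a fragment separator; whether the resolver expects this
    form is an assumption of this sketch.
    """
    query = quote(identifier, safe="/") if encode else identifier
    return RESOLVER_BASE + "?" + query


def resolve(links):
    """Decide what to do with the final URLs known for a dataset.

    `links` is a hypothetical list of final URLs, one per datacenter
    (or mirror site) claiming to hold the dataset.
    """
    if not links:
        return ("unknown", [])
    if len(links) == 1:
        return ("redirect", links)  # forward the user directly
    return ("menu", links)  # let the user choose among the links
```

A call such as resolver_url("ADS/Sa.ROSAT#X/701576n00") reproduces the URL shown in step 5.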

Dataset Identifiers

In order to allow easy integration of this effort into the emerging VO framework, the ADEC has decided to adopt a syntax for the dataset identifiers which is consistent with the current IVOA Identifier Proposed Recommendation (Plante et al. 2003). This adoption will facilitate integration of these identifiers, and of the tools that manipulate them, into the VO.

IVOA Identifiers

According to the IVOA Identifiers Draft, the general URI format for an individual dataset identifier is a string of the kind:
ivo://AuthorityId/ResourceKey#PrivateId
While we refer the reader to the recommendation for a full explanation of the syntax, a few things are worth pointing out:

Using Dataset Identifiers in the Literature

Given the fact that much of the VO infrastructure is still under design and development, the ADEC has decided on a specific recommendation for referring to dataset identifiers in the astronomical literature. The general form of these identifiers is:
ADS/FacilityId#PrivateId
Comparing these identifiers with the general IVOA syntax we can make the following observations:

Generating Dataset Identifiers

All Data Centers and Archives which provide public access to their data should structure their databases and interfaces so that when a particular dataset is released to the public, it is uniquely tagged by an identifier created as discussed above. Users who download one of these datasets should be made aware of the identifier associated with it and of how it should be referenced in the published literature.

In order for a datacenter to ensure that the identifiers it is generating comply with the syntax endorsed by the ADEC, the following conditions must be met:

  1. The identifier is in the form ADS/FacilityId#PrivateId
  2. The FacilityId has been registered with ADS and is listed in the table of known facilities
  3. The PrivateId is a unique identifier within the FacilityId, and its association with the dataset will not change.
  4. A profile for the datacenter has been registered with the ADS, and in it FacilityId has been listed as one of the resources that the center has data for.
  5. The datacenter provides a dataset verification service which will be used to verify the validity and location of identifiers published in the literature.
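
The first two conditions above lend themselves to an automated check. What follows is a minimal sketch, not part of any ADS tool: the regular expression encodes our assumption about which characters may appear in each part (the syntax itself is defined by the recommendation), and KNOWN_FACILITIES is a hypothetical stand-in for the table of known facilities, seeded with the FacilityId from the ROSAT example.

```python
import re

# Assumed character sets: the FacilityId contains no '#', '/' or whitespace,
# and the PrivateId is any non-empty run of non-whitespace characters.
DATASET_ID = re.compile(r"ADS/(?P<facility>[^#/\s]+)#(?P<private>\S+)")

# Hypothetical stand-in for the ADS table of known facilities.
KNOWN_FACILITIES = {"Sa.ROSAT"}


def parse_dataset_id(identifier):
    """Split 'ADS/FacilityId#PrivateId' into (FacilityId, PrivateId).

    Returns None when the string does not match the general form.
    """
    match = DATASET_ID.fullmatch(identifier)
    if match is None:
        return None
    return match.group("facility"), match.group("private")


def check_dataset_id(identifier, facilities=KNOWN_FACILITIES):
    """Check conditions 1 and 2: correct syntax and a registered FacilityId."""
    parsed = parse_dataset_id(identifier)
    if parsed is None:
        return False
    facility, _private = parsed
    return facility in facilities
```

Conditions 3 to 5 concern the datacenter's own bookkeeping and services, so they cannot be checked from the identifier string alone.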

Once a datacenter has published a dataset ID, it should provide access to the corresponding dataset. Ideally this will be a human-readable page on its web server which displays the dataset's relevant metadata and offers the user the option to download the dataset itself in some form or fashion. It is left up to the datacenter to decide what to do if and when a revised version of a particular dataset is published. In general, however, it is understood that access to the latest revision of a dataset should be an option, if not the default.

Registering a Datacenter Profile

In order for ADS to coordinate the verification and linking of dataset identifiers to the appropriate datacenters, it is necessary for each datacenter to provide some basic metadata about its data holdings and services. While it is expected that the appropriate metadata will one day be made available by a public VO registry, its format and access methods are not available at this time. As an intermediate solution to the problem, we require that the data centers maintain a simple profile which will provide ADS with the necessary metadata.

The data center profile is a simple XML document that lists the data center name and description, the name and email address of the person responsible for the maintenance of the profile, the URL of the web service to be used for dataset verification, and the list of facilities that the datacenter has data for. For more information, please see this simple example.
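
To give an idea of the information such a profile carries, the following Python sketch assembles one from the fields listed above. Please note that every element name used here (datacenter, maintainer, verificationService and so on) is a hypothetical placeholder chosen for illustration, as are the values in the usage note; the actual format is the one defined by ADS.

```python
import xml.etree.ElementTree as ET


def build_profile(name, description, maintainer, email, service_url, facilities):
    """Return a datacenter profile as an XML string.

    All element names are hypothetical placeholders, not the ADS schema.
    """
    root = ET.Element("datacenter")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "description").text = description
    # Person responsible for maintaining the profile.
    contact = ET.SubElement(root, "maintainer")
    ET.SubElement(contact, "name").text = maintainer
    ET.SubElement(contact, "email").text = email
    # URL of the web service used for dataset verification.
    ET.SubElement(root, "verificationService").text = service_url
    # Facilities the datacenter has data for.
    facility_list = ET.SubElement(root, "facilities")
    for facility_id in facilities:
        ET.SubElement(facility_list, "facility").text = facility_id
    return ET.tostring(root, encoding="unicode")
```

For instance, build_profile("Example Archive", "A made-up archive", "Jane Doe", "jane@example.org", "http://archive.example.org/verify", ["Sa.ROSAT"]) returns a one-line XML string that could be saved and served at a stable URL.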

Two options are available for creating and maintaining such a profile document:

  1. Create the appropriate XML document and make it available at a stable URL on your web site.
  2. Install the ITWG SOAP data verification toolkit that has a built-in option to generate such a profile when invoked with the proper syntax (please see below for more details).
Once the datacenter profile document has been created, the person responsible for it should let ADS know of the profile's location by submitting its URL to the Datacenter profile registration form. ADS will review the profile and merge it into its list of datacenters to be used for the verification of dataset identifiers. Also, ADS will periodically harvest the URL corresponding to the datacenter's profile and will update its list of datacenters and supported facilities accordingly. Once registered, a datacenter can update its profile without further intervention from the ADS.

Providing Data Verification Capabilities

In order to promote an open framework that can be used for the distributed verification of dataset identifiers across data centers, the ADEC ITWG (Interoperability Technical Working Group) has created the specification for a SOAP-based web service. The corresponding WSDL file can be used to generate client and server interfaces to the service. Each datacenter providing data verification services should provide and maintain a service that abides by this specification.

ADS provides a central verification service that fans out queries to the appropriate datacenters. A diagram showing the architecture of the system and the datacenters currently providing the verification services (as of mid-2003) is available.
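
The fan-out and caching behaviour of the central service can be sketched as follows. This is a simplified model under stated assumptions, not the ADS implementation: each datacenter is represented by a plain Python callable standing in for its SOAP verification call, and the names make_central_verifier and verify are hypothetical.

```python
def make_central_verifier(datacenters):
    """Build a verify(identifier) function over the registered datacenters.

    `datacenters` maps a datacenter name to a pair
    (set of FacilityIds it holds, callable identifier -> bool).
    The callable stands in for that datacenter's SOAP verification call.
    """
    cache = {}

    def verify(identifier):
        """Return the names of the datacenters that know the identifier."""
        if identifier in cache:
            return cache[identifier]
        # 'ADS/FacilityId#PrivateId' -> FacilityId
        facility = identifier.partition("#")[0].partition("/")[2]
        # Query only the datacenters that list this facility in their profile.
        holders = [
            name
            for name, (facilities, check) in datacenters.items()
            if facility in facilities and check(identifier)
        ]
        cache[identifier] = holders
        return holders

    return verify
```

An empty result means that no relevant datacenter recognizes the identifier; a repeated query for the same identifier is answered from the cache without contacting the datacenters again.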

To facilitate the deployment of verification services, ADS has also developed a Perl toolkit that greatly simplifies the creation of a compliant web service. Among other things, by defining a few variables and installing a simple CGI script based on this toolkit you will be able to automatically generate the site profile described above. For more information, please see the README file.

Miscellanea

What follows is a list of additional resources and information regarding this effort. Please bear in mind that this is work in progress, so some of the things listed below may not yet be working the way they are intended to.

The following resources are mostly of interest to Data Center maintainers:

A number of things still remain to be done:

If you have comments or questions about the contents of this document, please send me an email.


Alberto Accomazzi -- Last modified: 14 Nov. 2004