Dear fdsn-wg2,
Currently, the working group is discussing several minor updates to the StationXML specification. The proposals aim at fixing some issues; however, notable structural shortcomings of the base FDSN StationXML 1.0 [1] and of the "station availability" extension are not addressed.
While some of the current proposals aim at homogenizing the representation of similar concepts [2], others increase the heterogeneity of the representation of similar information [3].
Further, several topics are currently under vigorous discussion, which would lead to the expansion of the scope of stationXML: e.g. data ownership & provenance, station characterization, different connotations of station, as well as waveform availability & quality.
As a result, the community risks
- entering a series of updates, resulting in instance files of multiple versions of StationXML floating around in parallel and leading to issues with conversion and interpretation;
- ending up with a large and complicated standard that is difficult to cover fully in software implementations, and for which the level of partial coverage is difficult to specify clearly. Such standards rarely make well-accepted exchange formats (see e.g. SensorML);
- at the same time, continuing to follow some concepts of full SEED from before the 1990s, when response information aimed at describing "the data in a specific file" rather than "a seismic station".
In other words, we fear that even if we implement the current proposals, the next version of StationXML will continue to fall short.
To avoid this, we would like to propose setting up a working group to thoroughly review the representation of data referring to seismic stations, and to build a new data model from scratch, based on a broad public discussion/RFC process and taking into account the needs of all: network maintainers, data centers, and researchers.
As contributions to key design goals, we would propose the following:
0. First create a conceptual data model (e.g., in UML); then implement it as XML, JSON, SQL, code classes, or whatever the community may need. If all those are derived from the same concept, mapping between them will be easy.
1. Split the format definition into three different packages covering different content:
a) response (covering networks, stations, streams, their instrumentation, location and properties; answering the question: "how does available time series data translate to the observed physical property (typically ground motion)?"),
b) station characterization (covering housing, relief, velocity profiles etc.; answering the question: "What are the permanent properties of the measurement site potentially influencing the data measured?"), and
c) data availability/quality (covering the availability of data in general or fulfilling specific requirements or levels of review; answering the question: "what data is available and adequate for my purpose? (or: available where . for my purpose?)").
Separating station information into three independent but linked packages allows users and service providers with a limited scope to work with well-defined, simple subsets, and it allows communities to develop standards at their own pace, without interfering with services and applications for other domains.
2. Model the data according to natural entities, not specific use cases. E.g., model response as a property of a physical (or conceptual) instrument rather than of a located data stream (although for a specific use case, it may be of interest just as such); model coordinates not as properties of a stream, but of the place where the sensor (or multiple sensors, possibly deployed sequentially) producing the streams is deployed. This allows versatile usage of the format in multiple, possibly unexpected contexts, and fosters its use as an exchange format.
3. Use unique identification and references rather than data duplication. This leads to smaller files and avoids inconsistencies.
4. Do not build business logic into the format. E.g., while in your community sensors may typically be deployed at registered stations belonging uniquely to a single network, they may be deployed without this hierarchy for a PhD student's field work, or just reside on the shelf at a manufacturer. However, all three should be able to use the same standard for instrumentation and response description.
5. Clarify transactions. StationXML 1.0 contains some features describing transactions, but not enough to describe who is updating what, how, and from which version to which. Our take: this belongs to the domain of business logic. Don't make it part of the standard, as business logic varies between application cases. Better to keep the data format stateless, and define more specific services using it for more specific purposes (in WG3).
6. Use standardized data types, e.g. for resource metadata and description of uncertainties.
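To make points 2 to 4 (and, in passing, point 0) more tangible, here is a minimal sketch in Python. It is not a proposal for the actual model: all class names, field names, and identifiers below are hypothetical, and the response is reduced to a single illustrative gain value. The only intent is to show entities carrying unique ids and referring to each other by id, so that a response is stored once, coordinates belong to a site, and an instrument can exist without a deployment or a network.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass(frozen=True)
class Response:
    id: str
    sensitivity: float          # illustrative single gain value only
    input_unit: str


@dataclass(frozen=True)
class Instrument:
    id: str
    serial_number: str
    response_id: str            # reference by id, not an embedded copy


@dataclass(frozen=True)
class Site:
    id: str
    latitude: float             # coordinates belong to the site,
    longitude: float            # not to a stream
    elevation_m: float


@dataclass(frozen=True)
class Deployment:
    id: str
    instrument_id: str
    site_id: str
    start: str                  # ISO 8601 timestamp
    end: Optional[str] = None   # None means still deployed


@dataclass(frozen=True)
class Stream:
    id: str
    deployment_id: str
    network_code: Optional[str] = None  # no network hierarchy required


class Inventory:
    """Flat registry of entities, keyed by their unique id."""

    def __init__(self) -> None:
        self._objects: Dict[str, object] = {}

    def add(self, obj):
        if obj.id in self._objects:
            raise ValueError(f"duplicate id: {obj.id}")
        self._objects[obj.id] = obj
        return obj

    def get(self, obj_id: str):
        return self._objects[obj_id]

    def response_of(self, stream_id: str) -> Response:
        """Follow stream -> deployment -> instrument -> response."""
        deployment = self.get(self.get(stream_id).deployment_id)
        instrument = self.get(deployment.instrument_id)
        return self.get(instrument.response_id)

    def site_of(self, stream_id: str) -> Site:
        """Follow stream -> deployment -> site."""
        deployment = self.get(self.get(stream_id).deployment_id)
        return self.get(deployment.site_id)

    def to_json(self) -> str:
        """One possible serialization, derived mechanically from the model."""
        records = [{"type": type(o).__name__, **asdict(o)}
                   for o in self._objects.values()]
        return json.dumps(records, indent=2)


# Usage: two instruments share one response; only one of them is deployed.
inv = Inventory()
inv.add(Response("resp/1", sensitivity=1.0e9, input_unit="m/s"))
inv.add(Instrument("inst/1", serial_number="A001", response_id="resp/1"))
inv.add(Instrument("inst/2", serial_number="A002", response_id="resp/1"))  # on the shelf
inv.add(Site("site/1", latitude=47.0, longitude=8.0, elevation_m=500.0))
inv.add(Deployment("dep/1", "inst/1", "site/1", start="2020-01-01T00:00:00Z"))
inv.add(Stream("stream/1", deployment_id="dep/1"))  # field work, no network

response = inv.response_of("stream/1")
site = inv.site_of("stream/1")
```

In this sketch the use-case view ("response of a located stream") is obtained by following references rather than by storing a copy with each stream, and the JSON output is just one of several representations that could be derived from the same classes.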
Obviously, these principles are influenced by the QuakeML experience. You could also call this proposal: "create QuakeML packages for response, station information, and data availability" (and yes, for the second of these, there is already a draft proposal for QuakeML 2.0). However, we do not care about names - it is more important that the community has a clean and thus long-lasting representation of seismic instrumentation metadata.
We would like to seek feedback on whether the community agrees with this approach or would prefer to continue implementing incremental patches to StationXML without changing the core structure.
Best regards,
Philipp Kästli, Fabian Euchner, John Clinton
Swiss Seismological Service
----------------------------------
[1] E.g.: duplication of instrumentation data with streams, inability to describe non-deployed instruments, missing concept of sensor location (beyond fuzzy matching of coordinates), impossibility of identifying streams independently of networks, undefined relationship between station instrumentation and stream instrumentation, missing match between calibration process and calibration result, etc.
[2] E.g.: representation of the sampling rate
[3] E.g.: data availability: one gappy vs. multiple continuous DataAvailabilitySpans vs. an intermittently available Channel; open-ended channels described by mandatorily closed-ended DataAvailabilityExtents.