Thread: Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression

Started: 2016-08-12 00:30:56

Last activity: 2016-08-20 03:47:27

Topics: FDSN Working Group II FDSN Working Group III

This thread is from a mailing list that has moved to Google Groups. Use the following links to browse the updated archives.

FDSN Working Group II

FDSN Working Group III

Chad Trabant

Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression

2016-08-12 00:30:56

Hi all,

Change proposal #17 to the 2016-3-30 straw man (iteration 1) is attached: General compression.

Please use this thread to provide your feedback on this proposal by Wednesday August 24th.

thanks,
Chad

View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/51C5F6A4-99E1-4C9C-8F2E-561603B6EF72%40iris.washington.edu.

Philip Crotwell

Re: Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression

2016-08-12 20:45:49

Hi

Reading this, I wonder if there is a meaningful distinction between
data compressed with a "general compressor" and simply "Opaque data".
It may be better to not have encodings that are in any way generic, and
only add specific ones as they come into existence. I would presume
that an FDSN update to miniseed3 to assign the next code number to
"32-bit IEEE floats, Brotli compression, bla bla bla" would be
possible. In the meantime, individuals that wish to experiment with
other compression types can do so by using the opaque data code 100
and using a, perhaps standardized, optional header to specify
information about how the opaque data is supposed to be extracted.

Don't add a code until there is a specific implementation, and that
code is tied to a single specific algorithm.

thanks
Philip
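
The distinction Philip draws above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoding codes 50 and 51 are hypothetical, zlib stands in for Brotli because it ships with the Python standard library, and the function name decode_payload is invented for this example. Only the opaque data code 100 comes from the thread.

```python
import struct
import zlib

OPAQUE = 100  # opaque data code mentioned in the thread

# Hypothetical registry: each code is tied to exactly one sample type and
# one algorithm. Codes 50/51 and the use of zlib are assumptions.
ENCODINGS = {
    50: ("<f", zlib.decompress),  # 32-bit IEEE floats, zlib-compressed
    51: ("<i", zlib.decompress),  # 32-bit integers, zlib-compressed
}


def decode_payload(code, payload):
    """Decode a payload into a list of samples using only its encoding code."""
    if code == OPAQUE:
        # Nothing in the code itself says how to interpret the bytes;
        # a reader would need out-of-band hints (e.g. an optional header).
        raise ValueError("opaque data: the code alone does not say how to decode it")
    fmt, decompress = ENCODINGS[code]
    raw = decompress(payload)
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, raw, i)[0] for i in range(0, len(raw), size)]


# Values chosen to be exactly representable as 32-bit floats.
samples = [0.0, 1.5, -2.25, 1024.0]
raw = b"".join(struct.pack("<f", s) for s in samples)
decoded = decode_payload(50, zlib.compress(raw))
```

With a specific code the reader needs nothing beyond the registry; with code 100 the same bytes cannot be decoded without extra information, which is the sense in which a generic "general compressor" code behaves like opaque data.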

View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/CAGFrVcVg6S7LBKrSgMScqU3yy6riyL8dBahdbiVt%2BpShxuUwEg%40mail.gmail.com.
- Chad Trabant
  
  Re: Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression
  
  2016-08-19 22:17:18
  
  Hi Philip and all,
  
  Yes, I think there is a meaningful distinction between the two. Generic compressed data that requires interpretation of a string, as described in the proposal, would either need to be controlled or we leave open the possibility of getting lots of different ones. If we control which ones are allowed then we might as well assign them encoding values, and we do not want 100s of those. So while I can appreciate the concept expressed in the proposal to allow lots of flexibility for future compressors, I don't think we actually want that much flexibility.
  
  When this was added to the straw man it was the intention to ultimately have an encoding that is explicit, similar to "32-bit IEEE floats, Brotli compression". What was left for discussion is if Brotli is the right choice or if some other algorithm (or small number of algorithms) is/are better.
  
  Some lengthy background to explain the motivations for adding a generic compression encoding in the straw man follows.
  
  The main advantages are to provide a single encoding that can be used with all sample types (including floats/doubles for which we have no compression) and to leverage the extensive work done by those outside of seismology.
  
  In my opinion the most important guidelines for a general compressor are:
  1) general and usable for any sample type,
  2) efficient at very small payloads (not a common scenario in the compression world),
  3) broad support in languages and environments and freely usable and
  4) a realistic possibility of integrating with existing miniSEED libraries/processors.
  
  Obviously it also needs to be a documented standard (whether FDSN does it or adopts it).
  
  The reasons Brotli was raised as a potential candidate are:
  
  1) It is designed for and efficient at small payload sizes. For example, many formats store the "dictionary" with the payload, whereas Brotli has a default, static dictionary. Even though the static dictionary is designed for text it works well on binary data.
  
  2) The format has been on the IETF standard track for a while and reached RFC (7932) recently: https://datatracker.ietf.org/doc/rfc7932/.
  
  3) It is a general compressor. Ints, floats, doubles, whatever sample type. We can always get more compression out of tailoring a compressor for seismological time series, but we'd probably have to invent it and support it (as with the Steim encodings).
  
  4) There is already quite broad support in many languages.
  
  5) It is designed to be efficiently decoded, with more of the cost going into encoding. This fits the seismological data use case well, where data is decompressed much more often than compressed.
  
  6) There is a reference encoder and decoder from Google. This C language code is simpler and more portable than the high performance, complicated general-purpose compressors (gzip, lzham, etc.), which would dwarf libmseed and qlib2 in size/complexity.
  
  Between the RFC'd format definition and the MIT-licensed reference library, Brotli is about as open as it gets and cannot be revoked. More in-depth technical evaluation is needed to ensure that Brotli's performance on seismic data is acceptable.
  
  The KMI change proposal raises a good point about efficiency. We should be mindful of resource limitations in field recorders, etc. Then again, there would still be value in an encoding that is only used once data reaches a center.
  
  Chad
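
The first two guidelines listed above (usable for any sample type, workable on small payloads) can be checked with a short round-trip sketch. It rests on assumptions: zlib from the Python standard library stands in for Brotli, the 100-sample sine wave is synthetic rather than real seismic data, and the helper names pack_samples and round_trip are invented here. The printed sizes say nothing about how Brotli itself would perform.

```python
import math
import struct
import zlib


def pack_samples(fmt, samples):
    """Pack samples as little-endian values of the given struct type."""
    return struct.pack(f"<{len(samples)}{fmt}", *samples)


def round_trip(raw):
    """Compress and decompress raw bytes; return (raw size, compressed size)."""
    compressed = zlib.compress(raw, 9)
    # A general compressor is lossless on bytes, whatever the sample type.
    assert zlib.decompress(compressed) == raw
    return len(raw), len(compressed)


# Small synthetic payloads: 100 samples of a slow sine wave.
floats = [math.sin(i / 10) for i in range(100)]
ints = [int(1000 * x) for x in floats]

payloads = {
    "int32": pack_samples("i", ints),
    "float32": pack_samples("f", floats),
    "float64": pack_samples("d", floats),
}

sizes = {name: round_trip(raw) for name, raw in payloads.items()}
for name, (n_raw, n_comp) in sizes.items():
    print(f"{name}: {n_raw} bytes raw, {n_comp} bytes compressed")
```

The same byte-level compressor handles all three sample types without any type-specific logic. Swapping in a Brotli binding and real waveform payloads would be the next step for the technical evaluation mentioned above.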
  
  View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/B3150438-0711-4DA3-A583-4F33A865CFBF%40iris.washington.edu.
  - Philip Crotwell
    
    Re: Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression
    
    2016-08-20 03:47:27
    
    Hi
    
    I think we agree here. I am in favor of a code for a specific general
    compression but not for a code for a generic general compression, as it
    is pretty much opaque data at that point.
    
    ...and yes that is a confusing sentence.
    Philip
    
    View this message in Google Groups at https://groups.google.com/a/fdsn.org/d/msgid/fdsn-wg2-data/CAGFrVcUz2ziCGkXFKGdu4SsZ9%2BnCi3%3DM5sxnNXnj0X47Tp4hMQ%40mail.gmail.com.
