International Federation of Digital Seismograph Networks

Thread: Next generation miniSEED - 2016-3-30 straw man change proposal 17 - General compression

Started: 2016-08-12 00:30:56
Last activity: 2016-08-20 03:47:27

Hi all,

Change proposal #17 to the 2016-3-30 straw man (iteration 1) is attached: General compression.

Please use this thread to provide your feedback on this proposal by Wednesday August 24th.

thanks,
Chad



  • Hi

    Reading this, I wonder if there is a meaningful distinction between
    data compressed with a "general compressor" and simply "Opaque data".
    It may be better not to have encodings that are in any way generic, and
    to add only specific ones as they come into existence. I would presume
    that an FDSN update to miniseed3 to assign the next code number to
    "32-bit IEEE floats, Brotli compression, bla bla bla" would be
    possible. In the meantime, individuals who wish to experiment with
    other compression types can do so by using the opaque data code 100
    and an optional, perhaps standardized, header to specify
    information about how the opaque data is supposed to be extracted.

    Don't add a code until there is a specific implementation, and that
    code is tied to a single specific algorithm.

    thanks
    Philip
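
A minimal sketch of the "opaque data plus descriptive header" idea from the message above. Only the opaque encoding code 100 comes from the thread; the record layout, the header keys (`compression`, `sample_type`) and the function names are hypothetical, and zlib stands in for whichever experimental compressor is being tried.

```python
import json
import zlib

OPAQUE_ENCODING = 100  # opaque data code named in the message above


def wrap_opaque(raw, sample_type):
    """Compress raw sample bytes and record how to undo it in a side header."""
    header = {  # hypothetical keys, not defined by any FDSN document
        "compression": "zlib",
        "sample_type": sample_type,
    }
    return {
        "encoding": OPAQUE_ENCODING,
        "extra_header": json.dumps(header),
        "payload": zlib.compress(raw),
    }


def unwrap_opaque(record):
    """Return (sample_type, raw bytes), or fail if the recipe is unknown."""
    header = json.loads(record["extra_header"])
    if record["encoding"] != OPAQUE_ENCODING:
        raise ValueError("not an opaque record")
    if header.get("compression") != "zlib":
        raise ValueError("unknown compression: %r" % header.get("compression"))
    return header["sample_type"], zlib.decompress(record["payload"])
```

A reader that does not recognise the header value can only skip the record, which is the sense in which a generic "general compression" code behaves like opaque data.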


    In reply to Chad Trabant <chad<at>iris.washington.edu>, Thu, Aug 11, 2016 at 8:32 PM.




    ----------------------
    Posted to multiple topics:
    FDSN Working Group II
    (http://www.fdsn.org/message-center/topic/fdsn-wg2-data/)
    FDSN Working Group III
    (http://www.fdsn.org/message-center/topic/fdsn-wg3-products/)

    Sent via IRIS Message Center (http://www.fdsn.org/message-center/)
    Update subscription preferences at http://www.fdsn.org/account/profile/


    • Hi Philip and all,

      Yes, I think there is a meaningful distinction between the two. Generic compressed data that requires interpretation of a string, as described in the proposal, would either need to be controlled or would leave open the possibility of getting lots of different ones. If we control which ones are allowed, then we might as well assign them encoding values, and we do not want hundreds of those. So while I can appreciate the concept expressed in the proposal to allow lots of flexibility for future compressors, I don't think we actually want that much flexibility.

      When this was added to the straw man, the intention was to ultimately have an encoding that is explicit, similar to "32-bit IEEE floats, Brotli compression". What was left for discussion is whether Brotli is the right choice or whether some other algorithm (or small number of algorithms) would be better.

      Some lengthy background to explain the motivations for adding a generic compression encoding in the straw man follows.

      The main advantages are to provide a single encoding that can be used with all sample types (including floats/doubles for which we have no compression) and to leverage the extensive work done by those outside of seismology.

      In my opinion the most important guidelines for a general compressor are:
      1) general and usable for any sample type,
      2) efficient at very small payloads (not a common scenario in the compression world),
      3) broad support in languages and environments, and freely usable, and
      4) a realistic possibility of integrating with existing miniSEED libraries/processors.
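
Guideline 1 can be illustrated with a short sketch: a byte-level compressor is applied to the packed samples and never needs to know the sample type. zlib from the Python standard library is used here only as a stand-in for whichever compressor is eventually chosen; the function names and the little-endian packing are assumptions made for the example.

```python
import struct
import zlib

# struct format characters for the sample types named in the thread
SAMPLE_FORMATS = {"int32": "i", "float32": "f", "float64": "d"}


def encode(samples, sample_type):
    """Pack samples as little-endian binary, then compress the raw bytes."""
    code = SAMPLE_FORMATS[sample_type]
    raw = struct.pack(f"<{len(samples)}{code}", *samples)
    return zlib.compress(raw)  # the compressor never looks at the sample type


def decode(payload, sample_type):
    """Decompress, then reinterpret the bytes using the declared sample type."""
    code = SAMPLE_FORMATS[sample_type]
    raw = zlib.decompress(payload)
    count = len(raw) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", raw))


if __name__ == "__main__":
    for sample_type, samples in [
        ("int32", [1, -2, 300000]),
        ("float32", [0.5, -1.25, 8.0]),
        ("float64", [0.1, 2.7, -3.3]),
    ]:
        assert decode(encode(samples, sample_type), sample_type) == samples
```

The same two functions serve all three sample types, which is the property a single general-compression encoding would rely on.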

      Obviously it also needs to be a documented standard (whether FDSN writes it or adopts it).

      The reasons Brotli was raised as a potential candidate are:

      1) It is designed for and efficient at small payload sizes. For example, many formats store the "dictionary" with the payload, whereas Brotli has a default, static dictionary. Even though the static dictionary is designed for text it works well on binary data.

      2) The format has been going through the IETF process for a while and was recently published as RFC 7932: https://datatracker.ietf.org/doc/rfc7932/.

      3) It is a general compressor. Ints, floats, doubles, whatever sample type. We can always get more compression out of tailoring a compressor for seismological time series, but we'd probably have to invent it and support it (as with the Steim encodings).

      4) There is already quite broad support in many languages.

      5) It is designed to be efficiently decoded, with more of the cost going into encoding. This fits the seismological data use case well, where data is decompressed much more often than compressed.

      6) There is a reference encoder and decoder from Google. This C language code is simpler and more portable than the high-performance, complicated general-purpose compressors (gzip, lzham, etc.), which would dwarf libmseed and qlib2 in size/complexity.
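
Point 1 above, a dictionary that is shared in advance rather than stored with each payload, can be sketched with zlib's preset-dictionary feature. This is a stand-in technique and not Brotli itself (Brotli is not in the Python standard library); the dictionary contents and function names are made up for the illustration.

```python
import zlib


def compress_with_dictionary(payload, dictionary):
    """Compress using a dictionary both sides already hold (not transmitted)."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(payload) + comp.flush()


def decompress_with_dictionary(blob, dictionary):
    """Decompress; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    dictionary = bytes(range(200))   # made-up shared dictionary
    payload = bytes(range(50, 120))  # small payload that overlaps it
    plain = zlib.compress(payload, 9)
    with_dict = compress_with_dictionary(payload, dictionary)
    assert decompress_with_dictionary(with_dict, dictionary) == payload
    print(len(payload), len(plain), len(with_dict))
```

For a payload this small, the plain stream has no earlier data to refer back to, while the dictionary version can; how much that helps on real seismic samples would need the technical evaluation mentioned in the message.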

      Between the RFC'd format definition and the MIT-licensed reference library, Brotli is about as open as it gets and cannot be revoked. More in-depth technical evaluation is needed to ensure that Brotli's performance on seismic data is acceptable.

      The KMI change proposal raises a good point about efficiency. We should be mindful of resource limitations in field recorders, etc. Then again, there would still be value in an encoding that is only used once data reaches a center.

      Chad


      In reply to Philip Crotwell <crotwell<at>seis.sc.edu>, Aug 12, 2016, at 10:46 AM.


      • Hi

        I think we agree here. I am in favor of a code for a specific general
        compression but not for a code for a generic general compression as it
        is pretty much opaque data at that point.

        ...and yes that is a confusing sentence.
        Philip

        In reply to Chad Trabant <chad<at>iris.washington.edu>, Fri, Aug 19, 2016 at 6:17 PM.