[HN Gopher] Kaitai Struct: A new way to develop parsers for bina...
___________________________________________________________________
 
Kaitai Struct: A new way to develop parsers for binary structures
 
Author : marcodiego
Score  : 74 points
Date   : 2022-03-17 20:13 UTC (2 hours ago)
 
web link (kaitai.io)
 
| kangalioo wrote:
| There's also Wuffs, a safe and fast programming language made by
| Google specifically for decoding and encoding file formats
| https://github.com/google/wuffs
| 
| Paired with C FFI available in most languages, this seems like
| the nicer solution. It's simpler than generating code for a bunch
| of high level languages, and more performant
 
  | layer8 wrote:
  | Not for managed environments like client-side JS, JVM, .NET,
  | ...
 
    | jmgao wrote:
    | This appears to just allow you to parse binary formats to the
    | represented fields. (Not that that's not extremely useful,
    | doing this in managed languages is generally a giant pain in
    | the ass!)
    | 
    | wuffs is much more powerful: it's essentially a safe C
    | dialect that compiles to C, that lets you write an entire
    | codec and know that there aren't any overflows.
 
  | eesmith wrote:
  | How much of a future should I expect for Wuffs?
  | 
  | The linked-to page says: "Version 0.2. The API and ABI aren't
  | stabilized yet. The compiler undoubtedly has bugs."
  | 
  | There are not many recent commits, and mostly by one developer.
 
| secondcoming wrote:
| Interesting, but now you have to add in the possibility of having
| bugs in your YAML file. The YAML is probably less readable than
| the spec for the binary format itself.
| 
| Looking at the code-gen for utf8_string [0], it's a case of
| 'thanks, but no thanks'
| 
| > std::unique_ptr<std::vector<std::unique_ptr<utf8_codepoint_t>>>
| m_codepoints;
| 
| This is a solution looking for a problem, but I bet it was fun to
| write.
| 
| [0] https://formats.kaitai.io/utf8_string/cpp_stl_11.html
 
| asadawadia wrote:
| Great library - too bad it only allows reading
 
  | ctoth wrote:
  | If you're working in Python and need to write as well as read,
  | check out Construct [0], which is also a declarative parser
  | builder.
  | 
  | [0]: https://construct.readthedocs.io/en/latest/intro.html
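  | 
  | A minimal sketch of the read-and-write idea, using only Python's
  | stdlib struct module rather than Construct's own API. The header
  | layout and field names are made up for illustration: one
  | declarative description drives both parsing and building.

```python
import struct

# Hypothetical format: 4-byte magic, u2 version, u4 payload length,
# all little-endian with no padding.
HEADER = struct.Struct("<4sHI")
FIELDS = ("magic", "version", "length")


def parse_header(data: bytes) -> dict:
    """Decode the fixed-size header at the start of data into a dict."""
    return dict(zip(FIELDS, HEADER.unpack_from(data)))


def build_header(values: dict) -> bytes:
    """Encode a dict with the same field names back into bytes."""
    return HEADER.pack(*(values[name] for name in FIELDS))


raw = build_header({"magic": b"DEMO", "version": 2, "length": 16})
header = parse_header(raw)
```

  | Because both directions share HEADER and FIELDS, a round trip
  | through build_header and parse_header should return the same
  | values.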
 
| CGamesPlay wrote:
| As a code generator, I guess this may be nice. It seems like a
| DSL like the Nom [0] API is more natural and expressive, though.
| I imagine you can hit limits to expressiveness in Yaml pretty
| quickly.
| 
| [0] https://github.com/Geal/nom
 
| mturk wrote:
| Kaitai is a really great system, with an awesome WebIDE. At work
| we have just started a project to use it for astrophysics
| simulations and data from dark matter detectors, and one of my
| hobby projects is to use it to explore retro game data formats.
 
| jll29 wrote:
| Kudos - this is neat - I especially love the library of
| pre-existing descriptions, which helps me learn about the tool as
| well as about an abundance of file formats without wasting time
| on reverse engineering.
| 
| This is somewhat akin to ASN.1.
| 
| My personal feature wish list:
| 
| - support writing as well as reading;
| 
| - support generating Rust, Julia and Swift code;
| 
| - an upload button to let users add to a contrib/ folder of
| existing format descriptions.
 
| dhx wrote:
| I contributed a number of file formats a few years ago (and
| attempted numerous others) but ran into a number of problems with
| certain file formats:
| 
| 1. It's not possible to read from the file up to a multi-byte
| termination sequence. [1]
| 
| 2. You can't read sections of a file where the termination
| condition is the presence of a sequence of bytes denoting the
| next unrelated section of the file (and you don't want to
| consume/read these bytes) [2]
| 
| 3. The WebIDE at the time couldn't handle very large file format
| specifications such as Photoshop (PSD) [3]
| 
| 4. Files containing compressed or encrypted sections require a
| compression/encryption algorithm to be hardcoded into Kaitai
| struct libraries for each programming language it can output to.
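| 
| To make points 1 and 2 concrete, here is a rough hand-written
| Python sketch of the behaviour I was after. It is not Kaitai
| Struct code or its runtime API; read_until and its consume flag
| are hypothetical names, and it assumes a seekable stream.

```python
import io


def read_until(stream, terminator: bytes, consume: bool = True) -> bytes:
    """Read bytes from a seekable stream up to a multi-byte terminator.

    With consume=False the stream is left positioned at the start of the
    terminator, so the next parser can read it as its own section marker.
    Raises EOFError (and rewinds) if the terminator never appears.
    """
    if not terminator:
        raise ValueError("terminator must not be empty")
    start = stream.tell()
    buf = bytearray()
    while True:
        chunk = stream.read(4096)
        if not chunk:
            stream.seek(start)
            raise EOFError("terminator not found")
        # Re-scan only the tail that could hold a terminator split across chunks.
        search_from = max(0, len(buf) - len(terminator) + 1)
        buf += chunk
        idx = buf.find(terminator, search_from)
        if idx != -1:
            end = idx + (len(terminator) if consume else 0)
            stream.seek(start + end)
            return bytes(buf[:idx])


stream = io.BytesIO(b"first section\r\n\r\nNEXTrest")
body = read_until(stream, b"\r\n\r\n")               # terminator is consumed
section = read_until(stream, b"rest", consume=False)  # terminator left in place
leftover = stream.read()
```

| Here body holds the bytes before the blank-line terminator, and
| leftover still starts with the marker that ended the second read.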
| 
| The WebIDE I particularly liked as it makes it easy to get
| started and share results. I also liked how Kaitai Struct allows
| easy definition of constraints (simple ones at least) into the
| file format specification so that you can say "this section of
| the file shall have a size not exceeding header.length * 2
| bytes".
| 
| Some alternative binary file format specification tools for
| those interested, each with its own set of pros and cons:
| 
| 1. 010 Editor [4]
| 
| 2. Synalysis [5]
| 
| 3. hachoir [6]
| 
| 4. DFDL [7]
| 
| [1] https://github.com/kaitai-io/kaitai_struct/issues/158
| 
| [2] https://github.com/kaitai-io/kaitai_struct/issues/156
| 
| [3]
| https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...
| 
| [4] https://www.sweetscape.com/010editor/repository/templates/
| 
| [5] https://github.com/synalysis/Grammars
| 
| [6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser
| 
| [7] https://github.com/DFDLSchemas/
 
| gigel82 wrote:
| Ugh, I wish I'd found this a couple of years ago, before
| hand-writing a Unity asset parser in node.js for a hobby project
| (big/little-endian mixes, byte alignment, versioned header
| format, different compression algos, etc.).
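| 
| For a flavour of what that hand-written code looks like, here is
| a small Python sketch. The layout is invented for illustration
| and is not the Unity asset format: a big-endian version field, an
| optional flag byte in newer versions, and a body that starts on a
| 4-byte boundary.

```python
import struct


def align(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def parse_header(data: bytes) -> dict:
    """Parse a made-up versioned header with mixed byte order."""
    # The version field is always big-endian in this hypothetical layout.
    (version,) = struct.unpack_from(">I", data, 0)
    offset = 4
    header = {"version": version}
    if version >= 2:
        # Newer versions add a one-byte flag; bit 0 selects a little-endian body.
        header["little_endian"] = bool(data[offset] & 1)
        offset += 1
    else:
        header["little_endian"] = False
    # The body starts at the next 4-byte boundary.
    offset = align(offset, 4)
    order = "<" if header["little_endian"] else ">"
    (header["count"],) = struct.unpack_from(order + "I", data, offset)
    return header


v1 = parse_header(struct.pack(">II", 1, 7))
v2 = parse_header(struct.pack(">IB3x", 2, 1) + struct.pack("<I", 7))
```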
 
| sidpatil wrote:
| This looks really cool! This would have been really useful to me
| a couple years ago.
 
  | lpapez wrote:
  | It was available a few years ago, and I found it very useful.
 
| neonsunset wrote:
| As far as the .NET implementation goes, it is _really bad_:
| 
| - Very old and currently obsolete project target
| 
| - As a result, does not use modern data types such as Span<T>
| 
| - No utilisation of ArrayPool which is important for things
| like serialisers where you expect to deal with buffers a lot
| 
| - Appears to be a blind Java port, judging by the code style
| 
| This is not acceptable when working with low-level and binary
| structures which this standard is focused on. Yes, I know, this
| is an OSS project and therefore instead of complaining here I
| should have been working on contributing a PR to fix those
| issues. However, my main concern is that this standard and set of
| libraries in the current form work against the performance-
| sensitive nature of working with binary data.
 
| imglorp wrote:
| Erlang got this right: for the narrow case of packets
| in/mangle/out, described like an RFC bit-field diagram, it was
| very clean and simple.
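| 
| Erlang's bit syntax lets you pattern-match fields by width
| directly. A rough Python approximation of the same idea, with a
| hypothetical unpack_bits helper and the first 32-bit word of an
| IPv4 header as the example layout:

```python
def unpack_bits(data: bytes, fields):
    """Split data into named big-endian bit fields, most significant bit first.

    fields is a list of (name, width_in_bits) pairs whose widths must add up
    to exactly len(data) * 8.
    """
    total = sum(width for _, width in fields)
    if total != len(data) * 8:
        raise ValueError("field widths do not match the data length")
    value = int.from_bytes(data, "big")
    result = {}
    remaining = total
    for name, width in fields:
        # Shift the field down to bit 0, then mask off everything above it.
        remaining -= width
        result[name] = (value >> remaining) & ((1 << width) - 1)
    return result


# First 32-bit word of an IPv4 header: version, header length,
# type of service, total length.
IPV4_WORD0 = [("version", 4), ("ihl", 4), ("tos", 8), ("total_length", 16)]
packet = unpack_bits(bytes([0x45, 0x00, 0x00, 0x54]), IPV4_WORD0)
```

| The field list reads much like the RFC diagram, which is most of
| the appeal.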
 
| renewiltord wrote:
| Seems rather well designed actually. Appears that you can even
| use length-delimited lists and stuff. I like it. I have a project
| where we have a compact binary encoding and I have to write
| documentation _and_ serde for it. This works for docs and
| deserialization so that's good. I understand why serialization
| isn't supported but I feel like there's probably a clever API
| that allows inserting your own ser in. We'll see. I might switch
| our internal thing this weekend to it.
| 
| Would be cool if you could generate a protocol diagram from this.
 
___________________________________________________________________
(page generated 2022-03-17 23:00 UTC)