Msgpack can't differentiate between raw binary data and text strings

  • Bikeshedding at its finest.

    I ran into this issue a few months ago, on a cross-platform project involving four languages, each of which takes a distinctly different view of strings from the other three. Although this kind of situation is a common objection in the issue thread to supporting strings, it took just a couple of hours to extend msgpack to support strings in a reasonable-enough-for-me way on each platform.

    The proposals in the thread are a lot better than mine. And I suppose it's pretty antisocial / arrogant for me to just roll my own implementation without consulting anybody. But in three years[0] of talking about the problem, nothing had gotten done. Meanwhile, my code shipped a long time ago.

    I do this a lot--fork people's projects to solve my problems and don't merge back changes--and I feel guilty for not being more participatory with the project maintainers. But the fact is that the expected cost of getting embroiled in a flamewar like this is high (whether it is over architecture, whitespace convention, "behavior by design", "Jim's already working on that", etc.), whereas the benefit to me of getting my changes merged upstream is essentially zero. So my antisocial behavior continues to be positively reinforced.

    Does anyone else have this problem? Or do people just enjoy flamewars more than I do, or have the persuasive skills to avoid them?

    [0] https://github.com/msgpack/msgpack/issues/13

  • ... by design. Because there's no "string" type. This is a bug report about high level implementations that don't encode and decode in reversible ways, contrary to the msgpack protocol. http://wiki.msgpack.org/display/MSGPACK/Format+specification

    Really, for a protocol that values minimal space usage, not defining a string type is probably a good thing. Use the one that produces the fewest bytes in your application - it may not be UTF-8.
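
    To make that concrete: under the Raw family in the linked spec, a UTF-8 string and arbitrary binary data produce byte-for-byte identical output, which is exactly why decoders can't tell them apart. A minimal sketch of the Raw encoder (`pack_raw` is a name of my own choosing, not part of any msgpack library):

    ```python
    import struct

    def pack_raw(data: bytes) -> bytes:
        """Encode bytes using msgpack's Raw family (fixraw / raw 16 / raw 32)."""
        n = len(data)
        if n <= 31:
            return bytes([0xA0 | n]) + data               # fixraw: 101XXXXX
        if n <= 0xFFFF:
            return b"\xda" + struct.pack(">H", n) + data  # raw 16
        return b"\xdb" + struct.pack(">I", n) + data      # raw 32

    # Indistinguishable on the wire: text-as-UTF-8 vs. raw binary.
    pack_raw("hi".encode("utf-8"))  # b'\xa2hi'
    pack_raw(b"hi")                 # b'\xa2hi' -- same bytes, no string tag
    ```

    The format itself carries only a length and the octets; any notion of "this is text" has to live in the application.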

    Also:

    >For instance, the objective C wrapper is currently broken because it tries to decode all raw bytes into high-level strings (through UTF-8 decoding) because using a text string (NSString) is the only way to populate a NSDictionary (map).

    Well, there's your problem: https://github.com/msgpack/msgpack-objectivec/blob/master/Me... It's a buggy wrapper that's trying to be convenient. NSString keys are by no means the only way to populate an NSDictionary, and it doesn't look like the Objective-C wrapper actually requires them: https://github.com/msgpack/msgpack-objectivec/blob/master/Me...

  • I'm fine with the fact that Msgpack does not differentiate between binary data and text strings. Sure, it requires a schema, but if you're concerned with data size and parsing speed, you should choose an encoding appropriate for your task anyway.

    The bigger problem is that Msgpack is advertised as being "like JSON, but fast and small." To me, that makes it sound like I can replace JSON messages with Msgpack messages and be done, and that's not at all the case, because I need to add a schema layer. I think the "like JSON" comparison is what is really causing this frustration with the format.

  • The conflict here seems to be between people who think any arbitrary valid msgpack stream should be decodable into a specific object graph, and those who assume msgpack will be used to implement a protocol where only messages of a predefined format should be allowed - hence the decoding app will know beforehand what should be a string and what shouldn't.

    The conflict is unresolvable until the participants agree on which of these two distinct things msgpack should be.

  • msgpack always seemed an odd thing to me. Compressed json (gzip, lz*) is small and fast (see: http://news.ycombinator.com/item?id=4091051). If you need structure, use protocol buffers or thrift.

    I actually like tnetstrings for backend messaging, but I don't see it used very often. json is pretty damn ubiquitous these days.
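
    For reference, tnetstrings are simple enough to encode in a few lines. This sketch follows the published tnetstrings format ("LENGTH:PAYLOAD" plus a one-byte type tag); `tnet` is a name of my own choosing:

    ```python
    def tnet(value) -> bytes:
        """Encode a value as a tnetstring: length prefix, payload, type tag."""
        if isinstance(value, bool):          # check bool before int (bool is an int)
            payload, tag = (b"true" if value else b"false"), b"!"
        elif isinstance(value, int):
            payload, tag = str(value).encode(), b"#"
        elif isinstance(value, str):
            payload, tag = value.encode("utf-8"), b","
        elif isinstance(value, list):
            payload, tag = b"".join(tnet(v) for v in value), b"]"
        elif isinstance(value, dict):
            payload = b"".join(tnet(k) + tnet(v) for k, v in value.items())
            tag = b"}"
        else:
            raise TypeError(f"unsupported type: {type(value)}")
        return str(len(payload)).encode() + b":" + payload + tag

    tnet("hello")   # b'5:hello,'
    tnet([1, 2])    # b'8:1:1#1:2#]'
    ```

    Note that tnetstrings do tag strings explicitly (the `,` type), which is the very distinction msgpack's Raw type omits.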

  • Well, then the information of whether some sequence of bytes is a string needs to be communicated out of band. That's a perfectly acceptable design decision, but one that may lead potential users to favor alternatives that include that information in band.

    The discussion is pointless if the objectives of the participants differ and none is willing to compromise.

  • Reading TFA, there is apparently no string type, so nothing in it is a string. It's all binary data: a byte array, or whatever that means in your language of choice. If an application, or a library the application uses, converts a string into binary data before encoding and converts it back after decoding, that's none of the format's business.

  • The solution to these problems is for everyone to be completely ignorant of any character encoding and just deal with octets. If the octets represent UTF-8 text, then UTF-8 decoding happens only when the text needs to be presented or interpreted in some way. Any automatic encoding or decoding of UTF-8 (such as what Python 3 does) is stupid.

    EDIT: A common example of implicit and wrong handling of character encoding is when a file gets created with invalid characters, and your Linux file manager is unable to delete it. This can happen because the file manager assumes the file names it gets from the OS are text, which it decodes lossily. When it wants to delete the file, it encodes the text back, but the result differs from the original file name bytes. The error happens because the file manager tries to decode the name as text too early; it should keep the original octets as a reference to the file, and only decode them when it needs to display a file name.
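
    A sketch of this "decode late" approach in Python, which on POSIX systems lets you pass and receive file names as raw bytes. The Latin-1 file name here is an assumption for illustration, and this only runs on filesystems that accept non-UTF-8 names (e.g. ext4):

    ```python
    import os
    import tempfile

    d = tempfile.mkdtemp().encode()
    bad = os.path.join(d, b"caf\xe9.txt")   # Latin-1 bytes: invalid as UTF-8
    open(bad, "wb").close()

    names = os.listdir(d)                   # bytes in, bytes out: no decoding
    label = names[0].decode("utf-8", errors="replace")  # decode for display only
    os.remove(os.path.join(d, names[0]))    # delete using the original octets
    ```

    Because the original bytes are kept as the file reference, the delete succeeds even though the name can't be decoded as UTF-8; decoding happens only to produce the (lossy) display label.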

  • I can very easily froth at the mouth when it comes to character encoding problems. It's one of those things that should never even be a problem, but ends up consuming hours upon hours of consulting arcane, cobwebby specs.