You generated pretty much all of this with Claude (cf. the ASCII diagrams with emojis on every line to "prove" various not-even-wrong claims it was told to justify), and the work is mediocre enough that it's worth full-throatedly criticizing both the work's quality and the fact that you inflicted it upon the world.
Look how many confused comments there are because the page claims features you don't have, don't understand, and that don't make sense on their own terms (what's an "attention map"? with maximum charity, if we had some sort of attention-as-in-LLMs structure precached, how would it apply beyond one model? how big would the image be? is it possible to fit that in the 2 bits we claim to fit in every 4 bytes?)
I don't want you to take it personally, at all, but I never, ever, want to see something like this on the front page again.
You've reinvented EXIF and JPEG metadata in the voice of a diligent teenager desiring to create something meaningful, but with 0 understanding of the computing layers, 4 hours with Wikipedia, and 0 intellectual humility - though, as with youth, born not of obstinacy but of naiveté.
Some warning signs you should have taken heed of:
- Metadata is universally desirable, yet somehow unexplored until now?
- Your setup instructions use UNIX commands up until they require running a Windows batch file
- The example of hiding data hides it in 2 bits of one channel, then "demonstrates" this is visually lossless with a version that hides 1 bit across 2 channels (it isn't - if it were, how would we determine which 2 of the channels?) ("visually lossless" also confuses "lossless", a technical term meaning no information was lost, with the weaker claim of being lossy-but-not-detectably-so)
I'll leave it here; I think you get the idea. There's a difference between being firm and honest and being cruel, and to a casual observer, length will dictate a lot of that.
Reality check:
Your extra data is a big JSON blob. Okay, fine.
File formats dating back to Targa (https://en.wikipedia.org/wiki/Truevision_TGA) support arbitrary text blobs if you're weird enough.
PNG itself has both EXIF data and a more general text chunk mechanism (both compressed and uncompressed, https://www.libpng.org/pub/png/spec/1.2/PNG-Chunks.html#C.An... , section 4.2.3, you probably want iTXt chunks).
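For what it's worth, writing those chunks is a few lines of Pillow (a sketch with hypothetical filenames and keys; zip=True on add_text emits a compressed zTXt chunk, add_itxt the UTF-8 iTXt chunk):

    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    meta = PngInfo()
    meta.add_text("Comment", '{"labels": ["cat"]}', zip=True)         # zTXt (compressed)
    meta.add_itxt("ml:annotations", '{"boxes": [[12, 34, 56, 78]]}')  # iTXt (UTF-8)

    img = Image.open("photo.png")
    img.save("photo_tagged.png", pnginfo=meta)
    print(Image.open("photo_tagged.png").text)  # text chunks read straight back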
exiftool will already let you do all of this, by the way. There's no reason to summon a non-standard file format into the world (especially when you're just making a weird version of PNG that won't properly survive resizing or quantization).
Maybe I'm jaded, but I fail to see how a bespoke file format is a better solution than bundling a normal image and a JSON/XML document containing metadata that adheres to a defined specification.
It feels like creating a custom format with backwards PNG compatibility and using steganography to cram metadata inside is an inefficient and over-engineered alternative to a .tar.gz with "image.png" and "metadata.json"
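For comparison, the whole bundle approach is a few lines of stdlib Python (hypothetical filenames):

    import tarfile

    # Pack the image and its sidecar metadata into one archive.
    with tarfile.open("sample.tar.gz", "w:gz") as bundle:
        bundle.add("image.png")
        bundle.add("metadata.json")

    # Consumers unpack and get both files back, byte-for-byte.
    with tarfile.open("sample.tar.gz", "r:gz") as bundle:
        print(bundle.getnames())  # ['image.png', 'metadata.json']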
Why not simply JXL? It has multiple channels, can store any metadata, and supports both lossy and lossless encoding.
No one runs edge detection first, then sends the image as a screenshot, and then trains AI on it. That's an absurd workflow.
Maybe your format could have some use, but I don't find your motivation convincing.
I don't understand the purpose of effectively hard-coding things like edge-detection and attention-weight maps into the image. There are various ways to do edge detection and various ways to focus attention, so fixing one choice and encoding it into the image, instead of synthesizing it on demand to suit your particular ends, seems suboptimal.
Wouldn't the most useful kind of metadata be things that can't be synthesized, like labels or (for AI-generated images) the prompt used to generate the image?
You have invented essentially an _incredible way_ to poison AI image datasets.
Step 1: Create .meow images of vegetables, with "per-pixel metadata" instead encoded to represent human faces. Step 2: Get your images included in the data set of a generative image model. Step 3: Laugh uproariously as every image of a person has vaguely-to-profoundly vegetal features.
It would be better to put this in an additional extension in front of the normal one, the way other tools that embed extra metadata do.
For example, Draw.io can embed the original diagram in .svg and .png files, and the resulting suffix is .drawio.svg or .drawio.png.
You're adding metadata, but what problems does this added metadata solve, exactly? If your converter can automatically compute these image features, then AI training and inference pipelines can trivially do the same, so I don't see the point of a new file format that contains them.
Moreover, models and techniques get better over time, so these stored precomputed features are guaranteed to become obsolete. Even if they're present, simple to use, and everybody adopts the file format, pipelines still won't use features that were precomputed years ago when state-of-the-art techniques give more accurate ones.
So converting the file to a lossy format, or resizing the image as a PNG, will destroy the encoded information? I see why one would want to use it, but I think it can only be useful in a controlled environment. As soon as someone else has access to the file, the information can easily get lost. Just like metadata.
This approach tries to combine the pixel stream with a metadata stream, but in my opinion that's not a very elegant solution.
Being perceptually consistent and being information-lossless are different things. https://github.com/Kuberwastaken/meow/blob/60339a764a2365c4a... shows that the library simply truncates the lower bits of each pixel, applying a lossy transformation to the carrier. This could lead to (at best) inconsistencies in later sample processing, or (at worst) the sample being pulled far away from its original location in the sample/embedding space.
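A minimal sketch of why this is lossy (not the project's actual code, but the same overwrite-the-low-bits idea):

    import numpy as np

    rng = np.random.default_rng(0)
    pixels = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)    # stand-in carrier
    payload = rng.integers(0, 4, size=pixels.shape, dtype=np.uint8)  # 2 bits per sample

    # Clearing the two low bits and writing payload bits is a quantization step:
    stego = (pixels & 0b11111100) | payload
    print(np.count_nonzero(pixels != stego), "of", pixels.size, "samples altered")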
Steganography is generally used to carry information covertly -- you try to keep your extra bits in-band with the carrier. That's why anti-piracy schemes use it to carry identifiers, watermarks, and the like without disrupting the perceived consistency/quality of the carrier.
Metadata, on the other hand, does not need to be transmitted covertly. It is public information and can be included in the container format itself. Many image formats already have facilities for these out-of-band data streams. So I think this is reinventing the wheel, in a crude and complicated way.
Modifying the image in any way (cropping, resizing, etc) destroys the metadata. This is necessary in basically every application that interacts with any kind of model that uses images, either for token count reasons, file size reasons, model limits, etc. (Source: I work at a genai startup)
At inference time, you don't control the inputs, so this is moot. At training time, you've already got lots of other metadata to store and preserve that almost certainly won't fit in a steganographically encoded format, and you often have to manipulate the image before feeding it into your training pipeline. Most pipelines don't simply take arbitrary images (nor do you want them to: plenty of images need to be modified to, for instance, remove letterboxing).
The other consideration is that steganography is actively introducing artifacts to your assets. If you're training on these images, you'll quickly find that your image generation model, for instance, cannot generate pure black. If you're adding what's effectively visual noise to every image you train on, the model will generate images with noise.
> it gets stripped way too easily when sharing
that's not a bug, that's a (security) feature
Nice work!
Though I have one question: once 2 bits/channel are used for MEOW-specific data, leaving 6 bits/channel, I doubt it can still retain perfect image quality, since either (if everything's re-encoded) the dynamic range is reduced by 75%, or the LSB changes introduce noise into the original image. Not too much noise, but still.
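A back-of-envelope check (assuming the embedded bits are uniform and independent of the image) puts the damage around 44 dB PSNR:

    import numpy as np

    rng = np.random.default_rng(0)
    orig = rng.integers(0, 256, size=1_000_000, dtype=np.uint8)
    payload = rng.integers(0, 4, size=orig.size, dtype=np.uint8)

    stego = (orig & 0b11111100) | payload              # overwrite the 2 LSBs
    mse = np.mean((orig.astype(float) - stego.astype(float)) ** 2)
    print(10 * np.log10(255**2 / mse))                 # ~44 dB: subtle, but real noise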
I do like the idea of storing it steganographically, which also serves as a watermark.
But it requires a ton of redundancy and error correction, perhaps enough to survive a few rounds of not-too-lossy reencoding. I dunno how much bandwidth is available before it damages the image.
Great idea and insight. If I understand correctly, it will allow you to embed metadata such as bounding-box coordinates and class names, something I have also been working on[0] -- embedding computer vision annotation data directly into an image's EXIF tags rather than storing it in separate sidecar text files. The idea is to simplify the dataset's file structure. It could offer unexpected advantages, especially for smaller or proprietary datasets, or for fine-tuning tasks where managing separate annotation files adds unnecessary overhead.
[0] https://github.com/VoxleOne/XLabel
Edited for clarity
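The EXIF route needs no custom format at all; a sketch with Pillow, using the ImageDescription tag as one convenient string field (filenames hypothetical; XLabel's actual tag layout may differ):

    import json
    from PIL import Image

    img = Image.open("sample.jpg")
    exif = img.getexif()
    exif[0x010E] = json.dumps({"boxes": [[12, 34, 56, 78]], "classes": ["cat"]})
    img.save("sample_labeled.jpg", exif=exif)

    # Reading the annotations back:
    print(Image.open("sample_labeled.jpg").getexif().get(0x010E))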
> Python-based image file format
This is one of the first lines of the README. But this is PNG with some metadata encoded using the most naive steganographic technique (just thrown into the LSBs of pixels -- no redundancy, no error correction, no compensation for downsampling, etc.). Even ignoring everything else, this is just... nonsensical.
I am very very pro-AI. But this is slop.
Why not store the metadata, along with a checksum of the PNG, in myPublicPhoto.png.meow?
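That would be a dozen lines of stdlib Python (hypothetical filename and fields; the checksum ties the sidecar to one exact PNG):

    import hashlib, json, pathlib

    png = pathlib.Path("myPublicPhoto.png")
    digest = hashlib.sha256(png.read_bytes()).hexdigest()

    # Sidecar file next to the image, named after it.
    sidecar = pathlib.Path(str(png) + ".meow")
    sidecar.write_text(json.dumps({"sha256": digest, "labels": ["example"]}))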
Labeling and metadata are separate concerns. "Edge detection maps" etc. are implementation details of whatever you are doing with the image data, and quite likely to be non-portable.
And the non-removability of steganographic extra metadata is not a selling point at all?
So my thoughts are: this violates separation of concerns and seems badly thought out.
It also muddles together labeling, metadata, and technical details, and attempts to predict future requirements.
I don't see the potential utility.
Cool idea. I can see it being useful in a pipeline, where you mutate the image as you go. Losing referenced data can be a pain. Are you able to extract the original image?
The amount of information you can encode using EXIF/IPTC doesn't have an upper bound, whereas steganography is inherently capped by the resolution of the image. What happens when you want to encode more information in the MEOW format than you have pixels for (which seems like a very real possibility with thumbnail-sized or smaller pictures)?
How about a format that will break AI instead?
This is not a good idea in practice. Why not bundle the metadata as JSON or Protobuf via an aux file?
Metadata gets stripped by most websites.
Embedding metadata into the pixels using the least significant bits of RGB won't cut it; that stuff is gone the moment the file becomes a JPEG.
But there do exist methods of embedding data in pixels that can survive JPEG compression.
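For instance, the classic trick works in the DCT domain, where JPEG itself operates. A toy sketch for one 8x8 grayscale block, assuming scipy is available (real schemes add spreading, synchronization, and error correction):

    import numpy as np
    from scipy.fft import dctn, idctn

    def embed_bit(block, bit, coeff=(3, 2), strength=12.0):
        # Pin the sign of a mid-frequency DCT coefficient; mild JPEG
        # quantization tends to preserve the sign even as values shift.
        d = dctn(block.astype(float), norm="ortho")
        d[coeff] = strength if bit else -strength
        return np.clip(idctn(d, norm="ortho"), 0, 255).astype(np.uint8)

    def extract_bit(block, coeff=(3, 2)):
        return int(dctn(block.astype(float), norm="ortho")[coeff] > 0)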
I think that this is interesting research. As LLMs are becoming an important part of building stuff, I suspect that we will find that embedding context close to where it’s needed will yield better results in longer or more complex workflows.
In my AI assisted coding I’ve started experimenting with embedding hyper-relevant context in comments; for example I embed test related context directly into test files, so it’s immediately available and fresh whenever the file is read.
Extrapolating, I’ve been thinking recently about whether it might be useful to design a programming language optimized for LLM use. Not a language to create LLMs, but a language optimized for LLMs to write in and to debug. The big obstacle would seem to be bootstrapping since LLMs are trained by analyzing large amounts of human created code.
Does this survive resizing images or converting from png to jpg? (or worse, taking jpg screenshots of resized pngs). Because that also happens a lot when sharing images.
using LSB to store structured metadata inside PNG is clever: it survives format conversion, stays invisible to standard viewers, and doesn't break compression. but the space is tight. even at 1 bit per channel, that's just 3 bits per pixel on RGB.
given that, how are you handling the tradeoff between spatial fidelity (like masks or edges) and scalar data (like complexity scores)? is there a priority system, or does it just truncate when it runs out of room?
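back-of-envelope, at 1 bit per channel on 8-bit RGB:

    width, height = 1920, 1080
    capacity_bytes = width * height * 3 // 8
    print(capacity_bytes)         # 777600 bytes, ~759 KiB for a 1080p image
    print(128 * 128 * 3 // 8)     # 6144 bytes, ~6 KiB for a 128x128 thumbnail

a thumbnail leaves almost no room once you start storing maps at image resolution.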
please. no more chatgpt-generated readmes upvoted to the front page
if you couldn't be bothered to write it, why should anyone read it? and what does that say about your view of the potential users you're trying to attract?
> PNG on steroids
You mean PNG on steganoroids.
Is the main goal here just to have a cool-looking file extension?
Why not use PNG’s built-in zTXt chunks to store metadata? That seems like a more standard and less fragile approach.
I can see the case for using LSB steganography to watermark or cryptographically sign an image—but using it to embed all the metadata you're describing is likely to introduce a lot of visual noise.
Also worth considering: this approach could be used to poison models by embedding deliberately misleading metadata. Depending on your perspective, that might be a feature or a bug.
So you do a bunch of the network's job for it?
I also remember when I discovered steganography and tried putting it in everything. I was 13. Seriously, what's the point of that?
Wow, so much hate on this article.
Instead of the vitriol and downvotes, maybe next time just point out “you can put arbitrary data in exif”.
But you missed half the point of the article, which was the EXTRA DATA to make the image more LLM-useful.
You do know that EXIF exists?
You now have X+1 problems …
> Instead of storing metadata alongside the image where it can be lost, MEOW ENCODES it directly inside the image pixels using LSB steganography
That makes the data much more fragile than metadata fields, though? Any kind of image alteration or re-encoding (which almost all sites do to ensure better compression — Discord, Imgur, et al.) is going to trash the metadata or make it utterly useless.
I'll be honest, I don't see the need for synthesizing a "new image format" because "these formats are ancient (1995 and 1992) - it's about time we get an upgrade" and "metadata [...] gets stripped way too easily", when the replacement you are advocating is not only the exact same format as PNG, but has a metadata embedding scheme that is far more fragile - far more likely to be stripped at random when uploaded somewhere. This seems very bizarre and ill-thought-out to me.
Anyway, if you want a "new image format" because "the old ones were developed 30 years ago", there's a plethora of new image formats to choose from that all support custom metadata, including WebP, JPEG 2000, HEIF, JPEG XL, and farbfeld (the one the suckless guys made).
I'll be honest... this is one of the most irritating parts of the new AI trend. Everyone is an "ideas guy" when they start programming; it's fine and normal to come up with "new ideas" that "nobody else has ever thought of" when you're a green beginner and utterly inexperienced. The irritating part is what happens after the ideas phase.
What used to happen was you'd talk about this cool idea on IRC and people would either help you make it or explain why it wasn't necessarily a great idea, and either way you would learn something in the process. When I was 12 and new to programming, I had the "genius idea" that if we could only "reverse the hash algorithm's output to its input data" we would have the ultimate compression format - anyone with an ounce of knowledge will smirk at this proposition! And so I learned from experts why this was impossible, and, not believing them, I did my own research and learned some more :)
Nowadays, an AI will just run with whatever you say - "why yes, if it were possible to reverse a hash algorithm to its input, we would have the ultimate compression format" - and if you bully it further, it will even write (utterly useless!) code for you to do it, and no real learning happens because there's nobody there to step in and explain why it's a bad idea. The AI will absolutely hype you up, and if it doesn't, you learn to go to an AI that does. Within a day or two you can go from having a useless idea to advertising that useless idea to other people, and soon I imagine you'll be able to go from advertising it to manufacturing it IRL, at no point learning or growing as a person or as a programmer. But you are wasting your own time and everyone else's in the process (whereas before, no time was wasted, because you would learn something before investing a lot of time and effort rather than after).