So if I'm understanding this correctly:
The SAM paper from this past April (which enabled zero-shot segmentation on any image, seemingly better than even OpenAI's CLIP) used a ~600M-parameter ViT model to generate image embeddings. And to make generating those embeddings less computationally expensive, EfficientSAM replaces that model with a smaller ViT encoder that was pre-trained with a masked-autoencoder (MAE) style objective?
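To make the scale concrete, here's a minimal sketch using Meta's segment_anything package: the ViT-H image encoder alone is roughly 630M parameters, and set_image() is the expensive embedding pass that EfficientSAM (and MobileSAM before it) try to shrink. The checkpoint filename is the one Meta released; adjust to wherever you downloaded it.

```python
# Sketch: load the original SAM ViT-H, count the image-encoder params,
# and run a single point prompt. Assumes segment_anything is installed
# and the ViT-H checkpoint is in the working directory.
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
encoder_params = sum(p.numel() for p in sam.image_encoder.parameters())
print(f"image encoder params: {encoder_params / 1e6:.0f}M")  # ~630M for ViT-H

predictor = SamPredictor(sam)
image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)                        # the expensive encoder pass
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),                   # 1 = foreground point
)
```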
https://github.com/ChaoningZhang/MobileSAM was the previous attempt at reducing the size of the large image encoder used by SAM.
It's called EfficientSAM, and it appears to be on par with or better than FastSAM, but did I miss a memory or speed comparison?
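One rough way to answer the speed/memory question yourself: time the image encoder directly. This is only a sketch; it assumes you pass in whichever encoder module you want to compare (e.g. sam.image_encoder from the snippet above, or the EfficientSAM/FastSAM backbones built from their own repos, whose builder names aren't shown here).

```python
# Rough latency / peak-memory harness for any image encoder module.
import time
import torch

def benchmark_encoder(encoder: torch.nn.Module, size: int = 1024, runs: int = 20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    encoder = encoder.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)

    with torch.no_grad():
        for _ in range(3):                     # warm-up iterations
            encoder(x)
        if device == "cuda":
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            encoder(x)
        if device == "cuda":
            torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / runs * 1e3
    mem = torch.cuda.max_memory_allocated() / 2**20 if device == "cuda" else float("nan")
    print(f"{ms:.1f} ms/image, peak {mem:.0f} MiB")

# e.g. benchmark_encoder(sam.image_encoder)   # from the snippet above
```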
Can't wait for the "everywhere all at once" function.
Is what?
Excited to play with this more! Forked the repo and added the models into the repo itself (migrated from Dropbox): https://github.com/xetdata/EfficientSAM