Both of those image description features are likely powered by a LLaVA model: https://llava-vl.github.io/ https://arxiv.org/abs/2304.08485