Is there a better page to link to? I cannot even see "Kosmos" on this page!
Edit: Ah, looks like this is the link to the paper: https://arxiv.org/abs/2302.14045
It was discussed yesterday: https://news.ycombinator.com/item?id=34965326
It can even solve IQ tests... I mean, how much further are we going to move the goalposts?
Is there a model that can solve differential equations, both symbolically and numerically? Most of modern engineering boils down to differential equations, whether ordinary or partial; they're our current best method for reasoning about systems and controlling them.
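For reference on what "symbolically and numerically" means here, classical non-ML tools already do both halves; a minimal sketch (the library choice and the toy ODE y' = -2y are my own) with SymPy and SciPy:

    # Symbolic solution with SymPy: closed form y(t) = C1*exp(-2*t).
    import sympy as sp
    from scipy.integrate import solve_ivp

    t = sp.symbols("t")
    y = sp.Function("y")
    ode = sp.Eq(y(t).diff(t), -2 * y(t))
    print(sp.dsolve(ode, y(t)))  # Eq(y(t), C1*exp(-2*t))

    # Numeric solution with SciPy: integrate y' = -2y, y(0) = 1, over [0, 5].
    sol = solve_ivp(lambda t, y: -2 * y, (0, 5), [1.0])
    print(sol.y[0][-1])  # ~exp(-10)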
I like this feature they are working on:
https://arxiv.org/abs/2212.10554
as I'd say the most obvious limitation of today's transformers is the limited attention window. If you want ChatGPT to do a good job of summarizing a topic based on the literature, the obvious thing is to feed a bunch of articles into it and ask it to summarize (how can you cite a paper you didn't read?), and that requires looking at maybe 400,000 - 4,000,000 tokens.
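A back-of-the-envelope check on that range (the per-paper token count is my own assumption, not from the comment):

    # Rough arithmetic behind the 400k-4M token estimate, assuming
    # ~8,000 tokens per article (my guess for a typical paper).
    tokens_per_paper = 8_000
    for n_papers in (50, 500):
        print(n_papers, "papers ->", n_papers * tokens_per_paper, "tokens")
    # 50 papers -> 400000 tokens
    # 500 papers -> 4000000 tokens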
Similarly, there is a place for a word embedding, a sentence embedding, a paragraph embedding, a chapter embedding, a book embedding, etc., but these have to be scalable: the book embedding is obviously bigger, yet I ought to be able to turn a query into a sentence embedding and somehow match it against the larger document embeddings.
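A minimal sketch of that kind of cross-scale matching, assuming everything is embedded into one shared vector space (the vectors below are stand-ins, not output from a real encoder):

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hypothetical embeddings in a shared 4-d space; real ones would be
    # hundreds of dimensions and come from an encoder model.
    query_sentence = np.array([0.9, 0.1, 0.0, 0.2])
    doc_embeddings = {
        "paragraph_7": np.array([0.8, 0.2, 0.1, 0.3]),
        "chapter_2":   np.array([0.1, 0.9, 0.4, 0.0]),
        "whole_book":  np.array([0.5, 0.5, 0.3, 0.2]),
    }

    # Rank the larger-scale units by similarity to the sentence-level query.
    for name, emb in sorted(doc_embeddings.items(),
                            key=lambda kv: cosine(query_sentence, kv[1]),
                            reverse=True):
        print(name, round(float(cosine(query_sentence, emb)), 3))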
I don't trust any report of model performance from papers unless there is a publicly accessible demo. It is way too easy to test on things the model has already trained on, only for the model to fall completely flat when used by people in the real world.
Another one that looks even more compelling:
Multimodal Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2302.00923
By building in chain-of-thought reasoning and multimodal learning, this sub-1B-parameter model beats the 175B-parameter GPT-3.5.
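As I read the paper, the core idea is a two-stage pipeline: generate a free-text rationale from text + image first, then condition the answer on that rationale. A schematic sketch (the model class and its methods are hypothetical placeholders, not the paper's actual API):

    class StubModel:
        """Hypothetical stand-in for the fine-tuned models; not the paper's API."""
        def generate_rationale(self, question, image_features):
            # A stage-1 model would decode a free-text rationale here.
            return "A magnet attracts the paperclip because ..."
        def infer_answer(self, question, image_features, rationale):
            # A stage-2 model conditions on the rationale to pick an answer.
            return "(a)"

    def multimodal_cot(question, image_features, model):
        # Stage 1: generate a rationale from text + vision features.
        rationale = model.generate_rationale(question, image_features)
        # Stage 2: infer the answer from text + vision features + rationale.
        return model.infer_answer(question, image_features, rationale)

    print(multimodal_cot("Which object is magnetic?", image_features=None,
                         model=StubModel()))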
It's worth noting that this is a comparatively small model (1.6B parameters, from memory).
It'll be interesting to see what capabilities emerge as they grow the model's capacity.
Hmm... LLMs / MLLMs might truly be the unified input/output interface of a would-be AGI, I think.
At Microsoft:
"Hey, why don't we call our new LLM Cosmos?"
"That's taken by the Azure Cosmos DB guys."
"Damn it... how about Kosmos-1?"
Did anyone else initially read that as `Kosmos~1`, and wonder what the full name of the project was?
Anyone know if this will be an openly available model?
The examples in the paper are pretty impressive. There is an example with an image of a Windows 11 dialog: the model can figure out which button to press given the user's desired outcome. If one were to take this model and scale it up, I can see an advanced bot in <5 years navigating the web and doing work from a human's text input, purely by visual means. Interesting times.