Nice work, I was looking for something like this for a while and had no time to do it myself. I would say it's probably a good idea to make it AI-assisted rather than fully autonomous; many things you can do faster yourself by saying 'click h2', fill in text 'hello world', etc., instead of having the LLM figure it out. So a combination of both, basically. But a very good start!
Edit: it would also probably be good, in case it is not sure, to open the browser and try it there.
Take a look at this related work https://arxiv.org/abs/2310.11441
I wonder if this can be optimized by letting GPT provide multiple instructions per screenshot instead of just one.
For example, in the Twitter screenshot, it could have used just the one image.
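Something like asking the model for a short list of actions per screenshot and executing them in order before taking the next one, e.g. (just a sketch of a possible format, not anything the project actually uses):

```python
# Hypothetical response format if GPT returned several actions per screenshot;
# the runner would execute all of them, then take a fresh screenshot.
actions = [
    {"action": "click", "target": "3"},          # box label from the annotated screenshot
    {"action": "type", "text": "hello world"},
    {"action": "press", "key": "enter"},
]
```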
Hey there everyone, now that AI can "see" very well with GPT-V, I was wondering if it can interact with a computer like we do, just by looking at it. Well, one of the shortcomings of GPT-V is that it cannot really pinpoint the x,y coordinates of something on the screen very well, but I worked around that by combining it with simple OCR and annotating the screenshot so GPT-V can tell where it wants to click.
Turns out that with very few lines of code the results are already impressive: GPT-V can really control my computer super well, and I can ask it to do whatever task by itself; it clicks around, types stuff, and presses buttons to navigate.
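Roughly the idea, as a simplified Python sketch (not the actual repo code; the model name, the example task, and the helper names here are just placeholders):

```python
# Sketch: OCR the screenshot, draw numbered boxes around detected text,
# and let GPT-V pick a box number instead of raw x,y coordinates.
import base64, io
import pyautogui                # screenshots + mouse control
import pytesseract              # simple OCR
from PIL import ImageDraw
from openai import OpenAI

client = OpenAI()

def annotated_screenshot():
    img = pyautogui.screenshot()
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(img)
    boxes = {}  # label -> center point we can click later
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        label = str(len(boxes))
        boxes[label] = (x + w // 2, y + h // 2)
        draw.rectangle([x, y, x + w, y + h], outline="red")
        draw.text((x, y - 12), label, fill="red")
    return img, boxes

def ask_gpt(img, task):
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"Task: {task}. Reply with only the number of the box to click."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content.strip()

img, boxes = annotated_screenshot()
label = ask_gpt(img, "Open the compose tweet button")
if label in boxes:
    pyautogui.click(*boxes[label])  # click the center of the chosen box
```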
Would love to hear your thoughts on it!