I've been blown away by the support and use cases people have been coming up with, thank you!
To address a few questions:
aaron77 wrote: Does the "Optimize images for size" checkbox set the detail parameter to low when calling the Vision API?
This was the intention, yes. It was also supposed to take the screenshots at a lower resolution and then compress them to cut down on upload time. I'm still considering how that piece should actually work, because, unsurprisingly, snapping a screenshot, shrinking its resolution, and then asking the API to drop the detail level again doesn't actually result in any performance improvement--sometimes the opposite, even.
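For anyone who wants to see where that knob lives: the detail setting is attached to each image in the request, not set globally. Here's a minimal sketch of a low-detail call with the openai Python package (illustrative only, not the add-on's actual code):

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image for someone who is blind."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    # The checkbox maps to this parameter: "low" vs. "auto"/"high".
                    "detail": "low",
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```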
Aaron77 wrote: Have you considered allowing users to reply to the image descriptions? That would open up many more use cases, I think.
The thought has certainly crossed my mind, and once I can ensure the rest of the features are stable, I might work on it. The problem is that, unlike the other APIs, vision preview doesn't seem to remember the conversation, which means I'd have to re-upload the image with every follow-up question. This would get crazy expensive, and fast.
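To make the cost problem concrete: the endpoint is stateless, so a follow-up has to carry the whole history, image included, and the image tokens get billed again on every turn. Continuing the sketch above (reusing client, image_b64, and response; the follow-up question is just an example):

```python
# Hypothetical follow-up turn: the image has to ride along again,
# so its tokens are paid for a second time.
first_description = response.choices[0].message.content

followup = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image for someone who is blind."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        },
        {"role": "assistant", "content": first_description},
        {"role": "user", "content": "What color is the jacket on the left?"},
    ],
)
print(followup.choices[0].message.content)
```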
To be honest, I was kinda holding out hope that Sama and team would address this limitation, but given the shitshow going on at OpenAI right now, I've gone from bullish to just being glad developers haven't lost the ability to build on the models in a weekend.
Aaron77 wrote: Could you possibly make the prompt field a multiline field?
Easily. One caveat to be aware of: the NVDA settings dialog appears to intercept the enter, control+enter, and shift+enter keys. I'll have to find a way to override this behavior, but it'll probably be a part of the next release.
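The control itself is just a wxPython style flag; the interesting part is stopping the dialog from acting on enter. One possible shape for it (assuming NVDA's wxPython GUI; a sketch, not the actual settings panel code):

```python
import wx

class PromptPanel(wx.Panel):
    """Hypothetical settings panel fragment with a multiline prompt field."""

    def __init__(self, parent):
        super().__init__(parent)
        sizer = wx.BoxSizer(wx.VERTICAL)
        # TE_MULTILINE turns the single-line prompt field into a multiline one.
        self.prompt = wx.TextCtrl(self, style=wx.TE_MULTILINE, size=(-1, 100))
        sizer.Add(self.prompt, proportion=1, flag=wx.EXPAND | wx.ALL, border=5)
        self.SetSizer(sizer)
        # Try to claim enter before the dialog's default button sees it.
        self.prompt.Bind(wx.EVT_KEY_DOWN, self.onKeyDown)

    def onKeyDown(self, event):
        if event.GetKeyCode() in (wx.WXK_RETURN, wx.WXK_NUMPAD_ENTER) and not event.ControlDown():
            self.prompt.WriteText("\n")  # keep the newline in the field
        else:
            event.Skip()  # let everything else behave normally
```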
Aaron77 wrote: Could you possibly look into having the function that calls the Vision API be called by a separate thread?
Good catch! It's already doing this when you snap an object from the menu, but other recognitions are indeed performed on the main thread. This will be done for the next release as well.
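For context, "a separate thread" in NVDA terms means running the blocking HTTP call on a worker thread and then queueing the result back to the main thread to be spoken. Roughly like this, where describeImage() is a hypothetical stand-in for whatever function actually hits the API:

```python
import threading
import queueHandler  # NVDA core: lets us hop back onto the main thread
import ui

def describeInBackground(imageData, prompt):
    def worker():
        try:
            description = describeImage(imageData, prompt)  # blocking network call
        except Exception as e:
            description = f"Image description failed: {e}"
        # Speech has to come from the main thread, so queue it there.
        queueHandler.queueFunction(queueHandler.eventQueue, ui.message, description)

    # daemon=True so a hung request can't keep NVDA from shutting down.
    threading.Thread(target=worker, daemon=True).start()
```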
Speaking of the menu, there is in fact a fairly annoying bug when we try to describe focus or navigator objects without using their defined keystrokes. It occurs because popping up the menu changes both the focus and navigator positions, and somehow simply caching and setting them again is insufficient. I'll continue to play around with some of the more obscure pieces of NVDA to figure out why this is happening. It's 100% possible, just a matter of finding and reciting the right incantations.
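For anyone wanting to poke at it, the naive fix that isn't quite sticking looks roughly like this (method names hypothetical, though api.getFocusObject and friends are real NVDA calls): cache both objects before the menu opens, then restore them before recognizing.

```python
import api

def onMenuOpening(self):
    # Remember where the user was before the menu steals focus.
    self._cachedFocus = api.getFocusObject()
    self._cachedNavigator = api.getNavigatorObject()

def describeFromMenu(self):
    # Restoring the cached objects ought to be enough, but in practice
    # the recognition still ends up pointed at the menu itself.
    api.setFocusObject(self._cachedFocus)
    api.setNavigatorObject(self._cachedNavigator)
    self.describeNavigatorObject()  # hypothetical helper that snaps and uploads
```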
As a tip, if you add "for someone who is blind" to the end of the prompt, it appears to add greater detail to the explanations - undoubtedly a byproduct of the collab with Be My Eyes.
@Defender
Thanks. It's been a while; I hope you are well. Out of curiosity, what frontend? Is it the LLaVA demo that's been making its rounds recently? Or MiniGPT-4?