2023-11-20 23:22:36

Hey gamers,

I started hacking on this a few days ago and, seeing as it runs fairly smoothly on my machine, figured now was as good a time as any to reach out for feedback.

github link
latest release

readme wrote:

This add-on makes it possible to obtain detailed descriptions for images and other visually inaccessible content.
Leveraging the multimodal capabilities of the GPT-4 large language model, we aim to deliver best-in-class content descriptions.

readme wrote:

Features include:

• Describe the focus object, navigator object, or entire screen
• Describe any image that has been copied to the clipboard, be it a picture from an email or a file path in Windows Explorer
• Supports a wide variety of formats including PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif)
• Optionally caches responses to preserve API quota
• For advanced use, customize the prompt and token count to tailor information to your needs

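A quick aside on the caching bullet: conceptually, responses are just keyed on a hash of the image bytes plus the prompt, so asking about the same image twice only costs one API call. The snippet below is a simplified illustration of that idea rather than the add-on's actual code (the helper names and cache location are made up):

import hashlib
import json
import os

# Hypothetical cache location, for illustration only.
CACHE_PATH = os.path.join(os.path.expanduser("~"), "describer_cache.json")

def describe_with_cache(image_bytes, prompt, fetch_description):
    # fetch_description(image_bytes, prompt) stands in for the real API call.
    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "r", encoding="utf-8") as cache_file:
            cache = json.load(cache_file)
    key = hashlib.sha256(image_bytes + prompt.encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]  # cache hit: no API quota spent
    description = fetch_description(image_bytes, prompt)
    cache[key] = description
    with open(CACHE_PATH, "w", encoding="utf-8") as cache_file:
        json.dump(cache, cache_file)
    return description
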
There were a few primary motivations behind this project.
NVDA is capable of performing optical character recognition (OCR) out of the box, which is a game changer. If you are trying to get text out of an image or PDF document, this is what you're looking for.
However, OCR can only analyze data that might be text. It falls short when it comes to the context, objects, and relationships conveyed in those images. And the internet is full of them. Logos, portraits, memes, icons, charts, diagrams, bar/line graphs... You name it. They're everywhere, and usually not in a format that screen reader users can interpret. Until recently, there has been an unwavering reliance on content authors providing alternative text descriptions. While that is still a must, it's hard to get around the fact that high-quality descriptions are the exception, not the rule.
Now, the possibilities are almost endless. You might:
• Visualize the desktop or a specific window to understand the placement of icons when training others
• Get detailed info about the status of games, virtual machines, etc. when sound is insufficient or unavailable
• Figure out what is displayed in a graph
• Demystify screenshots
• Ensure your face is clearly visible and looking at the camera before recording videos or participating in online meetings

I've personally had a great deal of success in each of these contexts, and also with geometric and even higher-level math content, which has always been of interest but somewhat out of reach due to challenges I won't get into here.

I just got these descriptions (IMO they're on the low end in quality relative to what I can usually generate):

The windows recycle bin:

This is a digital illustration of a recycle bin icon, typically used on a computer interface to represent where deleted files are stored temporarily. The icon features a translucent white bin filled with crumpled pieces of paper. On the side of the bin, there is the universally recognized recycling symbol composed of three chasing arrows forming a triangular shape, colored in blue. The background is a muted dark blue or gray, and there is a subtle shadow below the bin indicating light coming from the top. Below the image, the words "Recycle Bin" label the icon's function.

My travel laptop's desktop:

The image shows a computer desktop with a variety of icons against a dark background. In the top row, from left to right, there are icons labeled "Recycle Bin," "Mush-Z," "foobar2000," "REAPER (X64)" and "FSReader 3.0." Beneath that, there are two more icons labeled "alter aeon" and "NVDA," followed by three additional icons without visible labels, but with distinctive logos: the "Steam" gaming platform logo, a shiny, yellowish 'J' for what might be the "Jarte" word processor, and a blue and white 'JAWS 2023' logo, possibly for the JAWS screen reader software.
In the bottom left corner of the screen, there's a taskbar. On it, there's a Windows Start button, an area that appears to say "Type here to search," some pinned or running applications with the icons of Edge, an Office application, an app with a blue logo, another with a green chat bubble and finally the Recycle Bin. In the bottom right corner, the system tray shows a speaker icon, a network icon, the date "11/20/2023," and the time "2:08 PM"

Please consult the links above to learn how to get this running. Happy to help fix any issues that might arise, and I hope you find this little tool as useful as I have!

2023-11-20 23:36:56

This is cool! Will definitely check this out when I have a minute!

Discord: clemchowder633

2023-11-21 00:08:12

This is fantastic. I was wondering when something like this would come up. Thank you for your work.
I have a question for a friend who might be interested, but doesn't speak English.

Normally, GPT-4 can provide quite decent descriptions in other languages too, but customizing the prompt would of course be necessary to tell it to respond in that language.

Does the addon support customizing the default prompt, to allow this use case?
If not, would that be something you would consider doing?

Thanks.

2023-11-21 00:52:32

It does; there's a text field in the settings dialog (Settings -> AI Content Describer -> Prompt).

If I can get confirmation that people are able to use it effectively in languages other than English, I'll consider pre-creating a list of prompts that can get automatically set in accordance with the user's preferences.

2023-11-21 01:48:10

This is fantastic!
I was planning on building something just like this during my winter break, but I guess I no longer have to. Neat!

A couple questions:
1. Does the "Optimize images for size" checkbox set the detail parameter to "low" when calling the Vision API? (See the sketch at the end of this post for what I mean.)
2. have you considered allowing users to reply to the image descriptions? That would open up many more use cases, I think.
3. Could you possibly make the prompt field a multiline field? Properly formatted text, especially if you're using Markdown, really does improve the quality of responses.
Anyway, that's it from me for now. Thanks again for building this!
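
To clarify what I mean in question 1: going off OpenAI's public Vision docs (not the add-on's source, which I haven't read), the request body has a "detail" field on each image part, and "low" is the cheap, downscaled mode. Something roughly like this, using only the standard library:

# Rough sketch of a Vision request with detail set to "low".
# This reflects my reading of the public API docs; the add-on's internals may differ.
import base64
import json
import os
import urllib.request

def describe_image(path, prompt="Describe this image in detail.", detail="low"):
    with open(path, "rb") as image_file:
        image_b64 = base64.b64encode(image_file.read()).decode("ascii")
    payload = {
        "model": "gpt-4-vision-preview",
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 # The MIME type should match the actual file; PNG is just an example.
                 "image_url": {"url": "data:image/png;base64," + image_b64,
                               "detail": detail}},  # "low" trades fidelity for fewer tokens
            ],
        }],
    }
    request = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["choices"][0]["message"]["content"]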

I'm probably gonna get banned for this, but...

2023-11-21 02:04:26

Oh, one last thing:
I'm not super familiar with Python or NVDA add-on development for that matter, so I'm not sure how difficult this would be, but could you possibly look into having the function that calls the Vision API run on a separate thread? The system prompt I have at the moment means requests take anywhere from 15-20 seconds, and not being able to do anything during that time is... unfortunate.
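
In case a concrete sketch helps, this is the sort of thing I'm imagining. The function names are invented since I don't know how the add-on is structured internally, but queueHandler is, as far as I understand, the usual way NVDA add-ons hand results back to the main thread:

# Hypothetical sketch: run the slow Vision request on a worker thread
# so NVDA stays responsive, then speak the result from the main thread.
import threading

import queueHandler  # NVDA core module
import ui


def describe_in_background(fetch_description, image_bytes, prompt):
    # fetch_description stands in for whatever blocking call the add-on makes today.
    def worker():
        try:
            description = fetch_description(image_bytes, prompt)
        except Exception as error:
            description = "Error retrieving description: %s" % error
        # ui.message should run on NVDA's main thread, so queue it there.
        queueHandler.queueFunction(queueHandler.eventQueue, ui.message, description)

    ui.message("Retrieving description...")
    threading.Thread(target=worker, daemon=True).start()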

I'm probably gonna get banned for this, but...

2023-11-21 02:28:26

@4 Thank you for the answer. I'll let you know how it goes, and I wouldn't mind helping by providing a translation of the English prompt that can be used as a preset, when the time comes.

2023-11-21 02:35:45

Hi,
I don't think this is a good idea. The bot has the most training data in English. So, it should express itself in English and then be translated into another language by DeepL or something in a separate API call. I believe there is some research somewhere suggesting that vision performs worse in other languages, which would not surprise me.

2023-11-21 02:57:54

@8: That won't really work.
Say it picks up on text in Spanish. If we specifically tell it to *only* respond in English, we might get weird results, like it deciding to translate the text from Spanish to English on its own. That would be unfortunate, because the translation service would then be translating all that text back into Spanish, and the more times text is translated, the more likely inaccuracies are to creep in.

Anyway, the OpenAI docs do talk about it struggling with other languages, specifically non-Latin alphabets.

OpenAI wrote:

Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

My guess is this is specifically referring to it struggling to OCR non-Latin alphabets, not that it can't describe images in those languages.

I'm probably gonna get banned for this, but...

2023-11-21 03:07:42

No no. Most of its "image description" fine-tuning is probably in English, so it would do better giving objective info in English. No one knows how the image part of it actually works, though. There have been rumours -- some think it's embedded in the model, some think it calls some sort of API. I don't know.

2023-11-21 03:30:48

I have an OpenAI account with an API key, but do I need to own the pro version to use this?

Blindgoofball
Founder and lead developer at Nibble Nerds
Where gamers are united!
https://nibblenerds.com

2023-11-21 03:34:40

You would have to upgrade your account to a paid plan when the trial runs out, but I think GPT-4 is generally available now.

2023-11-21 04:12:24

@11: Are you referring to ChatGPT Plus?
No. They are completely different services and are billed separately.

@10: Your argument doesn't make much sense to me.
Yes, the image-description pairs were likely written in English, but given that OpenAI is an American company, it's probably safe to assume that RLHF for this model was also done in English.
If your argument were true, GPT-4 would struggle to communicate in languages other than English, and it does not.

Anyway, it doesn't matter what language you're using. Accuracy will vary regardless. People should know that before using this.

I'm probably gonna get banned for this, but...

2023-11-21 05:20:57

@8 It is completely irrelevant.
The add-on just uses the relevant GPT-4 API. The question now is: does the API allow you to ask for a description in your language?
The answer to that is yes.
Everything else would just be an artificial limitation.
Feel free to limit yourself to using the model in English, or any other language if that's what you feel works for you, but from my experience, it works perfectly fine in other languages.
I think the decision of what language the AI should express itself in is up to OpenAI, and it seems that decision has already been made.

2023-11-21 07:13:41

Is there a monthly subscription for this? Do you have unlimited usage of the add-on?

2023-11-21 07:34:07

As a golden rule, I usually say things that are indeed relevant.
You have a fundamental misunderstanding of how it works. OpenAI has released benchmarks that prove that, yes, it does indeed communicate worse in all other languages. Since communication is also the stage at which its answer is computed, I believe such a step needs to be taken.

2023-11-21 08:12:30

It's not a subscription. You need to add card details and buy credits. Once you have credits, they are slowly used up by each request you make. You can buy more at any time, and when they run out you will probably be unable to use the add-on. It's kind of like prepaid cell phone service, where you buy airtime or data and then it's there until you use it all up.

2023-11-21 08:26:35

So I don't know if I'm doing something wrong, but I can't get it to describe an image on a web page. I move the navigator object to the image with NVDA+numpad minus, and even the mouse with NVDA+numpad slash, but both commands in the menu just tell me the object is not visible. The Cloud Vision add-on seems to be able to get the image, though. Any ideas?

2023-11-21 09:07:33

@16: Of course it won't perform equally well in all languages. That's not what I'm trying to say here.
Its competence with any given language is primarily determined by the language's representation within the model's training corpus.
Common languages which are strongly represented in the corpus will perform better, while those with sparser representations will perform worse.
More concerning, though, is the fact that tokenizing non-English languages tends to be more expensive, because more tokens are required to represent each word.
All that being said, it's going to do really well with the most commonly spoken languages. I talk to it in Spanish all the time and have never noticed a degradation in performance.

Anyway, I don't disagree with you here. This is a given.
Where I disagree with you is the argument that RLHF and other fine-tuning being done in English is the reason English is the best performing language.
Models like this are really, really good at transfer learning. Given knowledge of how to converse in one language, and given that they are basically a walking, talking encyclopedia in a good chunk of the languages out there, they can figure out how to talk and answer questions. Can they handle complex domain-specific stuff in said languages? Probably not, but that's not a fine-tuning problem. That's a "hey, nobody wrote two billion math textbooks in Swahili, so I don't know how to predict the next token!" problem.
Unless OpenAI open-sources the base GPT-4 model tomorrow, which wouldn't surprise me given the insanity and drama going on over there, there's no way for us to test this either way.

Something that we can absolutely test, though, is how well GPT-4V handles requests in other languages.
I, for my part, have had conversations about images in Spanish already, and it does great!
I can't test domain-specific stuff because my Spanish isn't good enough for me to be able to follow complex technical things, but I am curious to know how native Spanish speakers find it.

Anyway, I guess what I'm trying to say in this rambly passage of text is that I understand what you're saying, but unless you can cite several sources indicating that GPT-4V doesn't work in languages other than English, we can't assume that it won't.

People assumed they understood what LLMs were capable of way back in 2022 (pre-ChatGPT), and then this paper practically shook the AI world to its foundations by demonstrating that adding "let's think step by step" to prompts massively improved performance on many tasks.
LLMs are not a solved problem. There is still a lot to discover, but we can't discover new things by immediately making the assumption that something isn't gonna work.

I'm probably gonna get banned for this, but...

2023-11-21 09:18:46

@18, Make sure that the program you opened is maximized.

2023-11-21 09:20:44 (edited by DJWolfy 2023-11-21 09:21:26)

@19:
Sorry, but your anecdotes are moot. Show me benchmarks or I'm not interested.
I'm sorry to be so harsh, but if it's the difference between a correct description and a hallucinated one, then that difference is enough.

2023-11-21 09:43:41 (edited by zakc93 2023-11-21 09:47:40)

Maximising unfortunately doesn't work. As a test, if you go to google.com and press g to go to the logo that reads as "Google graphic", can you get the add-on to describe it? I can copy the image from the context menu and then NVDA+Shift+Y works to describe it from the clipboard, but I can't get it to describe it with either the current focus or navigator object options from the menu.
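
If it helps with debugging, the NVDA Python console (NVDA+Control+Z) can show whether NVDA even knows where the image sits on screen while it's the navigator object. I'm only guessing that a missing location is what triggers the "object is not visible" message:

# Run in the NVDA Python console with the image as the current navigator object.
import api
obj = api.getNavigatorObject()
# A location of None, or a zero-sized rectangle, would explain "object is not visible".
print(obj.name, obj.role, obj.location)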

2023-11-21 10:38:48

@21, why exactly are you so against it? Have you actually used it in another language? People who have actually used it in other languages are telling you it works fine. And DeepL? That doesn't support all the languages that GPT-4 does.

2023-11-21 10:39:27

This will be my last response to this particular line of discussion, though if anyone is interested in testing this / working toward a solution to this problem, feel free to reach out!

@21: I'm convinced at this point that you are trolling and not interested in having a constructive discussion.
Given your knowledge on the subject, you should know why I can't provide you with benchmarks. I'll list just a couple reasons for you, in case you are actually approaching this discussion with good intentions.

Rate Limits

Right now, the gpt-4-vision model is in preview. This means that everyone, whether a tier 1 API user like me or a rich tier 5 user, may only make 100 requests per day. 100 requests per day is hardly enough data to come to a benchmark-level conclusion.

Lack of Evaluators

Most benchmarks use AI systems to evaluate other AI systems. This is because, in most situations, finding enough humans to rate even a small percentage of the possible input/output pairs just isn't going to happen for most of us. Not to mention, LLMs are non-deterministic, meaning the same input won't always produce the same output, so you often want to test the same input multiple times.
So how does an LLM evaluate GPT-4V if GPT-4V is the best multimodal model in existence at the moment? Other models, even inferior ones, can sometimes still evaluate a superior model's text-generating capabilities, but last I checked, GPT-3.5 is not multimodal and can't even begin to understand what might be in an image.
There's PaLM 2, but PaLM 2, as far as I can tell, hallucinates far more often than GPT-4... about everything.

Conclusion

This is why, short of the GPT-4V system card, there is very little concrete data about what it's capable of that isn't anecdotal.
Anyway, hallucinations are a given right now, even when you're using English.
You absolutely should not be using this for any mission-critical system right now and nobody's advocating for that.
I personally use it to describe my cars in Forza, game screenshots, charts and graphs where mistakes won't ruin my life, UML diagrams, pictures of my pets, etc.
If it happens to think my dog has a tentacle growing out of her, it's not going to ruin my day or cost me thousands of dollars.
So yeah, if hallucination is something you're concerned about, you shouldn't be using this. Not even in English.

I'm probably gonna get banned for this, but...

2023-11-21 10:48:02 (edited by omer 2023-11-21 10:54:07)

21, you're literally either trying to troll or trying to steer this discussion into some other topic. If you want to discuss how the API and models work, go open up another topic; what you're doing is beyond pointless and doesn't serve much purpose aside from cluttering up this one. One would expect to see discussion about usage of the add-on; instead I only see constant bickering that doesn't make sense at all.
Back to the topic: thanks for creating this add-on. I've been wondering whether such a thing would be possible, and it seems it is. Would it be possible to make it describe a video as well, in case an image description isn't enough? Maybe a few seconds of screen recording.