Hey gamers,
I started hacking on this a few days ago and, seeing as it runs fairly flawlessly on my machine, figured now was as good a time as any to reach out for feedback.
This add-on makes it possible to obtain detailed descriptions for images and other visually inaccessible content.
Leveraging the multimodal capabilities of the GPT-4 large language model, it aims to deliver best-in-class content descriptions.
Features include:
• Describe the focus object, navigator object, or entire screen
• Describe any image that has been copied to the clipboard, be it a picture from an email or a path in Windows Explorer
• Supports a wide variety of formats including PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif)
• Optionally caches responses to preserve API quota
• For advanced use, customize the prompt and token count to tailor information to your needs
There were a few primary motivations behind this project.
NVDA is capable of performing optical character recognition (OCR) out of the box, which is a game changer. If you are trying to get text out of an image or PDF document, this is what you're looking for.
However, OCR is only able to analyze data that might be text. It falls short at considering the context, objects and relationships conveyed in those images. And the internet is full of them. Logos, portraits, memes, icons, charts, diagrams, bar/line graphs... You name it. They're everywhere, and usually not in a format that screen reader users can interpret. Until recently, there has been an unwavering reliance on content authors providing alternative text descriptions. While this is still a must, it's difficult to change the fact that a high standard of quality happens to be the exception, not the rule.
Now, the possibilities are almost endless. You might:
• Visualize the desktop or a specific window to understand the placement of icons when training others
• Get detailed info about the status of games, virtual machines, etc. when sound is insufficient or unavailable
• Figure out what is displayed in a graph
• Demystify screenshots
• Ensure your face is looking clearly at the camera before recording videos or participating in online meetings
I've personally had a great deal of success in each of these contexts, and also with geometry and even higher-level math content, which has always interested me but been somewhat out of reach due to challenges I won't get into here.
I just got these descriptions (IMO they're on the low end in quality relative to what I can usually generate):
The Windows Recycle Bin:
This is a digital illustration of a recycle bin icon, typically used on a computer interface to represent where deleted files are stored temporarily. The icon features a translucent white bin filled with crumpled pieces of paper. On the side of the bin, there is the universally recognized recycling symbol composed of three chasing arrows forming a triangular shape, colored in blue. The background is a muted dark blue or gray, and there is a subtle shadow below the bin indicating light coming from the top. Below the image, the words "Recycle Bin" label the icon's function.
My travel laptop's desktop:
The image shows a computer desktop with a variety of icons against a dark background. In the top row, from left to right, there are icons labeled "Recycle Bin," "Mush-Z," "foobar2000," "REAPER (X64)" and "FSReader 3.0." Beneath that, there are two more icons labeled "alter aeon" and "NVDA," followed by three additional icons without visible labels, but with distinctive logos: the "Steam" gaming platform logo, a shiny, yellowish 'J' for what might be the "Jarte" word processor, and a blue and white 'JAWS 2023' logo, possibly for the JAWS screen reader software.
In the bottom left corner of the screen, there's a taskbar. On it, there's a Windows Start button, an area that appears to say "Type here to search," some pinned or running applications with the icons of Edge, an Office application, an app with a blue logo, another with a green chat bubble and finally the Recycle Bin. In the bottom right corner, the system tray shows a speaker icon, a network icon, the date "11/20/2023," and the time "2:08 PM"
Please consult the links above to learn how to get this running. Happy to help fix any issues that might arise, and I hope you find this little tool as useful as I have!