Arg! I feel like I'm going crazy haha! This reminds me so much of when I was trying to build the object identifier program. I'm writing up pages upon pages of code, realizing it doesn't work and isn't going anywhere, selecting it all, pressing delete, and starting over. I've tried to get myself to stop working on this crazy idea so I can concentrate on the easier ones, but I keep coming back to this with new ideas I want to try.
Sheesh, I might as well explain what I've been working on, on the off chance someone here will have an idea of how to help. I'm sure you'll all think I'm crazy, but oh well, I suppose people like me for the games I make and not because I'm sane, ROFL!
I'm trying to build software that would allow you to see through your web cam, using sound. I know, sounds crazy, but let me explain first. Short of having some kind of major birth defect, all of our brains are hard wired to handle certain types of data. Sound, sight, smell, touch, taste, and things that are not solely tied to one of the senses, like the sense of motion, spatial awareness, balance (probably should be lumped in with motion). There are other examples, but hopefully you get the idea. For each of these concepts our brains are hard wired to understand, the data comes in its own type of format, or organization. It's a crude example, but your computer can read music files. If those 1s and 0s are arranged in the proper type of pattern, it is a music file, and as long as the computer knows how to use the different hardware, it doesn't care if those 1s and 0s were read from a magnetic drive, a laser on a CD surface, wirelessly beamed through the air, or sent through a fiber optic cable using light. In the end, all that really matters is what kind of pattern those 1s and 0s end up in. That is how your computer sees it as a music file. I know, the example has some technical holes, but again, hopefully you see the point I'm trying to make.
Now with the computer example, obviously some hardware forms are better at sending the data than others. For example, reading light signals through an optical cable will always be faster than reading from an old magnetic 3.5 inch diskette. In the same way, ears can bring in more data, and faster, than your nose. That seems like a strange comparison, because ears and a nose are meant for completely different things, but it's actually fair to compare them. When it all boils down, your ears and nose are just reading data from the world around us, and turning it into electrical signals for our brain. This is no different than a CD drive reading 1s and 0s with a laser, or an old floppy drive reading a disk with a magnet. We are talking about hardware equipment, that is all.
Our senses are not completely exclusive. Sometimes 2 things will produce the same type of data, such as how your nose and tongue can both help you taste food. If you plug your nose while eating, some of the flavor will go away. Now I'm not talking about the smell of the food, I'm literally talking about taste! This is also why you can smell something and usually have some idea of what it would taste like. Your nose is less efficient at producing taste data, but still, it is capable of doing it to some degree. Spatial awareness is the same way between the eyes and ears.
Now that all of that background info has been said, what I'm attempting to do is use your ears to relay sight information to your brain. As long as the formatting of the information is a close enough match, your brain should be able to understand it as visual information. That is my theory at least. This process would require a bit of training, but I'm also trying to make that process both automatic and quick. Let me step to the side for a moment and give an example of what I mean. When you were very young, you learned to understand language through sound. You'd hear people speak, and your brain associated those sound patterns to mean different things. The only thing separating the word "Hello" from any other random sound, is the fact that your brain associated that pattern with something. So at that time, it seemed that the only way to communicate that message was to hear it. Of course, if you were sighted at that time, you learned to read with your eyes. Your brain learned that particular patterns of visual data represented letters, then, you were taught how to "sound out" each letter or each letter combination. You probably didn't think of it this way at the time, but you were learning how to convert visual data to letters, and then letters to sounds, so that you could then link it in to the language patterns you already knew! From that point on, you could get sound data or visual data, in the correct formats, and it would follow the chain in your brain that would lead you to language information. Of course, for all of you, you then took it even a step further by linking those same connections to touch data. As you learned to read braille, the same process linked back to the audio patterns of language that you already knew. In theory, you should be able to learn to communicate words using smell or taste!
Now I've never received any words of wisdom while eating Alpha-Bits cereal, but I have had my brain shout "Run away" when people sent smells in my direction! ROFL! Sorry, I got sidetracked there.
With those examples, it sounds like even if I was successful, learning to use the equipment would take a long time. After all, learning to read took a while! To speed up the process, you need to give your brain near instant feedback on your progress, and I plan to do that using the web cam itself. Let me give another example. If you were deaf, imagine how long it would take for you to learn how to speak my name. I write to you "Say Jeremy" and I wait for you to try, only then to shake my head no or yes depending on whether you were correct. Such a process would take forever! Now imagine you, being able to hear, trying to say my name for the first time. If I spoke out loud to you, "Say Jeremy", it is unlikely it would take you more than 1 or 2 tries because you could instantly hear your own speech and your brain would be making thousands of comparisons to the audio data example I had provided you with. In the same way, I can use the web cam to instantly give your brain feedback on what is going on. I think, if I can get this working, it would speed up the learning process. Who knows, it could be a few hours of listening to gibberish sounds and then suddenly your brain would figure it out. There's really no way of knowing yet, and this really is all assuming I can get anything to come of this.
Now we've already said that different hardware is better, or worse, at producing types of data. Unless your taste buds are damaged somehow, they will always run circles around your nose when it comes to producing taste information. No matter what I do, I don't believe it will ever be possible to use your ears to match what the eyes can do. After all, the eyes were built for the job they do! My goal is to give you enough visual resolution to get a general idea of what's going on around you, but anything more is probably unrealistic. I would be thrilled if you could turn your camera out your window and be able to "see" that a blue car and a white van just drove by, and that a person is sitting on the porch next door. Even if you can't tell who the person is, or what body type the vehicles were, this would have to be a useful tool nonetheless!
The first thing I did was try to figure out the smallest resolution I should be trying to recreate. I decided to go with 64 by 48. For those who have never seen, the most basic comparison I can give is to imagine a grid of 64 by 48 where each cell contains a single point of color. I've given up on any idea that would relay all 3072 cells to you as sound, that's just insane, especially when you understand that each cell would change several times each second! My next idea was to focus only on changing values, since technically that's how the eyeball physically works, although as a sighted person I'd never be aware of that under normal circumstances. This was tough too, because it almost forced me to dismiss any color data from the scene. I hate doing that because having color information is especially helpful when you're working with such low resolution. If we have to go grayscale, I'd be very tempted to increase the resolution, and thereby cause a whole new problem. Even when focusing only on the changing cells, there was far too much going on to handle it with sound.
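To make the changing-cells idea concrete, here's a minimal sketch of how it might look in code. This is my own illustration, not anything from the actual program: it assumes each frame has already been downsampled to the 64 by 48 grid as a grayscale NumPy array, and the threshold of 16 is an arbitrary placeholder.

```python
import numpy as np

def changed_cells(prev, curr, threshold=16):
    """Return (row, col) pairs of grid cells whose brightness changed
    by more than `threshold` between two consecutive frames."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    rows, cols = np.nonzero(diff > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

prev = np.zeros((48, 64), dtype=np.uint8)   # blank 64 by 48 frame
curr = prev.copy()
curr[10, 20] = 200                          # one cell brightens sharply
print(changed_cells(prev, curr))            # -> [(10, 20)]
```

Even this simple version shows the problem described above: a real scene would light up hundreds of these cells every frame, which is far too many things to voice at once.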
The idea I finally threw out yesterday was to have the computer break the image down to basic shapes, like circles, triangles, and rectangles, to the best of its ability. Each shape, and its location, would be portrayed using sound so that we only had maybe a dozen or so things to track at a time. This seemed to be the most promising, except it seemed to have strayed too much from being a visual format. As things move around and you see them from different angles, completely different shapes are formed. That's all fine and good, except they don't suddenly change from 1 shape to another as this would be forced to do. Changes are gradual, and even if they weren't, a sighted person can process that change in a way the blind person can't. Visual data of a triangle suddenly becoming a square breaks down into smaller elements that let the brain follow the change. Going from 1 sound representing a triangle, to another sound representing a square, does not work the same way. I liked the general idea, but it has serious flaws.
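Here's a toy sketch of why the discrete-shape version breaks down. The names and thresholds are made up by me purely for illustration: classify a blob by the aspect ratio of its bounding box, and watch the label (and therefore the sound) flip abruptly even though the shape itself is stretching smoothly.

```python
# Toy illustration of the flaw in discrete shape labels: the label
# (and therefore the sound) jumps abruptly even when the underlying
# shape changes smoothly. Thresholds here are arbitrary placeholders.
def shape_label(width, height):
    aspect = width / height
    if aspect > 1.5:
        return "wide oval"
    if aspect < 0.67:
        return "tall oval"
    return "circle"

# A blob smoothly stretching from 10 units wide to 20 units wide:
for w in (10, 14, 15, 16, 20):
    print(w, shape_label(w, 10))
# Width 15 is still a "circle", width 16 is suddenly a "wide oval",
# so the listener hears a sudden category flip with no in-between.
```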
The latest idea is a variation on the shapes. What I want to do is get away from specific shapes, and assign sounds that tell you relative measurements of each shape instead. For example, the old way would have been confusing as you switched between a flat oval, to a circle, to a tall oval. If, instead, the sound of the shape gradually changed as a function of the shape's width and height, you'd hear a smooth changing sound as the shape squished into a circle, and then stretched its height into another oval. This is the kind of changing we need, since it allows the brain to break down what is actually happening instead of just a sudden sound change which is supposed to represent some new shape. There are just so many things to consider, it is driving me crazy. For the moment, I think the most important attributes each tracked shape needs are: shape as a function of width and height, overall size on the scene, color as a function of red, green, and blue content, and position on the field of view. If I can represent all of those things using sound, and manage to track a dozen or so such objects without it overwhelming your ears, this just might work! Any comments, feedback, or ideas are more than welcome. As I've said many times already, this is weighing heavily on me and it is quite overwhelming.
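A continuous mapping along those lines might look something like this rough sketch. Every name, range, and formula here is an assumption on my part, just one possible way the four attributes could each drive a smoothly changing sound parameter, so that squishing a shape glides the sound rather than jumping it.

```python
# Rough sketch: map one tracked shape's attributes to continuous
# sound parameters. Every formula and range here is an assumption,
# one possible mapping rather than a finished design.
def shape_to_sound(width, height, area, rgb, x, y,
                   field_w=64, field_h=48):
    aspect = width / height
    return {
        # Aspect ratio drives pitch smoothly: a circle (aspect 1.0)
        # sits at 440 Hz, and squishing or stretching glides it.
        "pitch_hz": 220.0 + 440.0 * aspect / (1.0 + aspect),
        # Overall size on the scene drives loudness.
        "volume": min(1.0, area / float(field_w * field_h)),
        # Red, green, and blue content drives timbre "brightness".
        "brightness": (0.30 * rgb[0] + 0.59 * rgb[1] + 0.11 * rgb[2]) / 255.0,
        # Horizontal position pans from left (-1.0) to right (+1.0).
        "pan": (x / float(field_w)) * 2.0 - 1.0,
        # Vertical position: 0.0 at the bottom, 1.0 at the top.
        "elevation": 1.0 - y / float(field_h),
    }

# A white circle, dead center of the 64 by 48 field of view:
print(shape_to_sound(10, 10, 100, (255, 255, 255), 32, 24))
```

The key property is that every output changes gradually as the inputs do, so a dozen of these shapes could each be voiced by one continuously evolving sound.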
- Aprone
Please try out my games and programs:
Aprone's software