2014-11-11 05:03:16

Hello, new here!

I came here to ask a question. Right now I'm trying to make my game accessible to at least the legally blind (through residual vision; the fully blind will have to wait since it's not easy to adapt), and that means screen reader support. I'm currently trying to make it as easy as possible to implement, especially since the game is cross-platform. Right now two methods are supported: outputting text to a console and outputting text to the clipboard (the Skullgirls method).

I wonder about the former. If I recall correctly, all screen readers should be able to see text on a console. But the game still has its own window, and that one has to have focus to receive input. What I wonder is: what do most screen readers do in this case? Do they only see the console when the console itself has focus, or do they read any text sent to the console as long as any window of that program has focus? (The game outputs an entire line whenever the text changes, in case you're wondering.)

That's all I want to know, to see if the option of outputting to the console is still useful. Thanks for the information.

PS: output to the console is done by writing to the standard output, so as a side effect anything that works by redirecting that to a file will also work with that method.
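
For reference, the two methods boil down to something like this (a simplified sketch of the idea rather than the game's exact code; it assumes SDL 2 is already initialised by the game):

    #include <stdio.h>
    #include <SDL.h>

    void reader_output_console(const char *text)
    {
        /* Write the full line and flush, so anything reading the redirected
           standard output (a file, a pipe) sees it immediately. */
        printf("%s\n", text);
        fflush(stdout);
    }

    void reader_output_clipboard(const char *text)
    {
        /* The Skullgirls-style method; SDL 2 hides the platform differences. */
        SDL_SetClipboardText(text);
    }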

2014-11-11 06:24:58

We've done it using redirection through a pipe, and encoding information about message priority in the messages themselves, thus allowing the reader end of the pipe to process the messages through TTS and interrupt as appropriate.

Otherwise, I really can't think of a screen reader that would read console output where the console was not the focus, no. Hope I'm wrong …

Just myself, as usual.

2014-11-11 06:25:23

Most screen readers should be able to read those command prompt screens automatically without much problem.

2014-11-11 07:11:25 (edited by Sik 2014-11-11 07:13:04)

Well yeah, the problem here is whether they still detect it when it isn't focused (either updating the console itself or writing to the standard output). Take into account that a program can only have a single console (they are treated differently from most windows), so it doesn't seem that far-fetched. Alas, I have no idea how screen readers work.

Sebby wrote:

We've done it using redirection through a pipe, and encoding information about message priority in the messages themselves, thus allowing the reader end of the pipe to process the messages through TTS and interrupt as appropriate.

Nice, so that alone makes that method useful (since any sort of redirection should work). Are there any common filenames to take into account? Maybe I could just make the game output to those directly if possible. (EDIT: oh, also, how would I encode message priority?)

2014-11-11 14:17:38

Sik wrote:
Sebby wrote:

We've done it using redirection through a pipe, and encoding information about message priority in the messages themselves, thus allowing the reader end of the pipe to process the messages through TTS and interrupt as appropriate.

Nice, so that alone makes that method useful (since any sort of redirection should work). Are there any common filenames to take into account? Maybe I could just make the game output to those directly if possible. (EDIT: oh, also, how would I encode message priority?)

We designed both the TTS processor and the engine output so that we could do portable TTS (actually, now I think of it, it was pretty awesome, though I say so myself smile). The idea is that each platform has some idiom, e.g. Windows had ag_say which used SAPI (you're right, if you can go direct, it's better), OS X had the say command, and Linux had--what was it? Oh yeah, we handled that directly and used serial to an Apollo TTS device.

To encode priority, you have to contrive some scheme that the backend knows, e.g. a prefix before a message string results in a TTS reset prior to speaking the new string, etc.
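
Something along these lines, sketched in C since that's what you're working in (the "!" prefix and the tts_* stubs are invented for the example, not the scheme we actually used):

    #include <stdio.h>
    #include <string.h>

    /* Stub backend: replace these with real TTS calls (SAPI, say, etc.). */
    static void tts_stop(void) { printf("[stop current speech]\n"); }
    static void tts_say(const char *text) { printf("[speak] %s\n", text); }

    int main(void)
    {
        char line[1024];
        /* One message per line on stdin; a leading "!" is the invented
           convention here for "interrupt whatever is being spoken first". */
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\r\n")] = '\0';
            if (line[0] == '!') {
                tts_stop();
                tts_say(line + 1);
            } else if (line[0] != '\0') {
                tts_say(line);
            }
        }
        return 0;
    }

You'd then run it as something like "game | reader" and let the reader end decide what to interrupt.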

Now our game launcher is handling all the TTS on both OS X and Windows in Python using PyTTS, so there must be a market for a universal TTS API and hopefully there will be one in your language.

The other measure you suggested, that of using the clipboard, is also known to work with a tool people have been using, though if you're going that far you might as well just build in SAPI support directly.

Just myself, as usual.

2014-11-11 15:54:11

You can easily call the screen reader APIs directly, and to be honest that's what I'd do.  Clipboard requires extra setup, as does the console thing.  No screen reader is going to see both windows at the same time.  By the time you're done bundling all of this, well...
That said, a lot of blind people already have the clipboard stuff set up because of some of the Japanese games.  If you did anything except call directly, that's what I'd suggest just because the community is already familiar with it.  SAPI can be used and will work, but most of us find the SAPI voices to be sad and way, way, way too slow.  The average screen reader user is going 3 times as fast as what Microsoft seems to think the maxes ought to be.
See http://hg.q-continuum.net/accessible_output2/ which provides Python source to talk to all of the ones on Windows, and I believe also Mac.  Not sure if it handles Linux.  It's fairly simple, though I can't rattle off the API function names and whatnot--I just use accessible_output2 when I need them.  Porting it to your language of choice should be pretty simple, though.

My Blog
Twitter: @ajhicks1992

2014-11-11 19:36:27 (edited by Sik 2014-11-11 20:01:44)

Just to make it clear: both console and clipboard output are already implemented (they're very trivial to program actually; each took me a few minutes at most). I'm just trying to make sure that I'm not providing something that is useless in the end (which would create false hopes).

Also, I managed to turn on the screen reader on Linux and it decided to start reading anything it wanted except what I was pointing at, including an unfocused window at some point *sigh* This is going to take a while.

Sebby wrote:

We designed both the TTS processor and the engine output so that we could do portable TTS (actually, now I think of it, it was pretty awesome, though I say so myself smile). The idea is that each platform has some idiom, e.g. Windows had ag_say which used SAPI (you're right, if you can go direct, it's better), OS X had the say command, and Linux had--what was it? Oh yeah, we handled that directly and used serial to an Apollo TTS device.

Is ag_say something from AudioQuake? Because that's what I seem to have found around. I guess that won't work for a default setup, but then again I don't know how common it is for a blind user who plays games to have that program installed (maybe it is common and I can just rely on it). Also, I couldn't find info on how to use it (e.g. is it "game | ag_say" or "ag_say game" or "ag_say text-to-say"?)

And yeah, Linux is a horrible mess. There isn't an equivalent to SAPI but rather several different engines (at the very least two major ones, it seems: one for GNOME and one for KDE) and any of them could be installed on a given system. That's annoying, and I'm not sure if there's some de facto standard API to communicate with them. (Mind you, knowing the Unix philosophy it's likely you can just pipe the output to the engines directly, although I'd still need to know their filenames.)

It seems that on Linux there's Festival, but I'm not sure how it works. I should check.

Sebby wrote:

To encode priority, you have to contrive some scheme that the backend knows, e.g. a prefix before a message string results in a TTS reset prior to speaking the new string, etc.

Oh, so engine specific.

Sebby wrote:

Now our game launcher is handling all the TTS on both OS X and Windows in Python using PyTTS, so there must be a market for a universal TTS API and hopefully there will be one in your language.

I'm using C (not C++) and the game is for Windows and Linux, so if there's something cross-platform that works on those, it'd be nice (oh, and there's the issue of the license being compatible with the GPL 3 as well). Tried doing a quick search but I didn't seem to be able to find anything useful (the only one I found was using an incompatible license).

Sebby wrote:

The other measure you suggested, that of using the clipboard, is also known to work with a tool people have been using, though if you're going that far you might as well just build in SAPI support directly.

Clipboard support is so easy with SDL 2 that it's a no-brainer though, compared to implementing SAPI support (not to mention still having to figure out what to do with Linux, which seems to have two major engines at least).

camlorn wrote:

You can easily call the screen reader APIs directly, and to be honest that's what I'd do.  Clipboard requires extra setup, as does the console thing.

I assume you mean extra setup for the user? Because programming those two is actually much easier: console output is just a call to printf and clipboard output is just a call to SDL_SetClipboardText (I'm using SDL 2), whereas programming the screen reader APIs directly is a much more involved effort (not to mention platform specific, so I'd need multiple codebases), and I don't have any suitable TTS engine available right now.

camlorn wrote:

SAPI can be used and will work, but most of us find the SAPI voices to be sad and way, way, way too slow.  The average screen reader user is going 3 times as fast as what Microsoft seems to think the maxes ought to be.

Actually, looking at the SAPI documentation, the impression I got was that the program sets a speed relative to the user's settings, which would solve that problem for every program. The problem is that Microsoft's own engine doesn't provide any settings, from what I've read.

camlorn wrote:

See http://hg.q-continuum.net/accessible_output2/ which provides Python source to talk to all of the ones on Windows, and I believe also Mac.  Not sure if it handles Linux.  It's fairly simple, though I can't rattle off the API function names and whatnot--I just use accessible_output2 when I need them.  Porting it to your language of choice should be pretty simple, though.

Well, it doesn't seem as simple to me, but then again I don't have enough experience with SAPI (I just looked up some functions to get an idea of how it works), so that probably isn't helping. And that's assuming I ignore the rest of the APIs it supports. I'd need to see. (EDIT: meh, forget that, the SAPI 5 one seems easy actually, but I still need to learn how to use the API and what it wants)

Ideally I'd want to make a proper TTS engine that's more portable (ideally just render sound) and is more suitable for games (since sounding natural is more important for games than with normal applications), but that will take a while so that's why I want to have output to screen readers meanwhile. When I have that one ready I will definitely implement it in my game though (as well as making it available to everybody, of course).

2014-11-11 21:41:52

Well, if you're GPL you can just use Espeak directly.  People will kind of hate you because they want their favorite voices and most dislike Espeak but...
The screen reader calls are, in most cases, printf.  You send it a string, it says the string at the next available opportunity.  What you've already got will work, though--the users just need to start the utility that causes clipboard updates to autoread.  It's floating around somewhere, but I've not had a need for it in a very good while.
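To give you an idea, NVDA's controller client is about this much work from C (writing this from memory, so double-check the prototypes against the nvdaControllerClient.h header that ships with the DLL):

    #include <windows.h>

    /* From-memory prototypes for NVDA's controller client; the real ones are
       in nvdaControllerClient.h. */
    typedef unsigned long (__stdcall *nvdaSpeakText_t)(const wchar_t *text);
    typedef unsigned long (__stdcall *nvdaCancelSpeech_t)(void);

    int main(void)
    {
        /* Ship nvdaControllerClient32.dll (or the 64-bit one) next to the game. */
        HMODULE nvda = LoadLibraryW(L"nvdaControllerClient32.dll");
        if (!nvda) return 1;   /* DLL not present; fall back to something else */

        nvdaSpeakText_t speak =
            (nvdaSpeakText_t)GetProcAddress(nvda, "nvdaController_speakText");
        nvdaCancelSpeech_t cancel =
            (nvdaCancelSpeech_t)GetProcAddress(nvda, "nvdaController_cancelSpeech");
        if (speak && cancel) {
            cancel();                  /* interrupt whatever is pending */
            speak(L"You send it a string, it says the string.");
        }
        FreeLibrary(nvda);
        return 0;
    }
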
The problem with SAPI is that basically it's the Microsoft voices or nothing.  It doesn't matter about absolute and relative rates, and if talking as quickly as possible isn't going to provide a gameplay advantage it may not matter.  Most of us go upwards of 600 words a minute, but the SAPI voices stop well short of that (if I had to guesstimate, maybe 300, maybe).  I'm personally upwards of 800, as are most people I know.
Also, Windows matters a lot more than Linux.  I know that someone on Linux will come along and be mad, but the fraction of blind people on Linux is very, very tiny.  The bugs you are experiencing may not even be your fault, in that case.  I'm not quite sure what you're trying to do, and my knowledge is geared towards Windows, so I can't really comment for sure either way.

My Blog
Twitter: @ajhicks1992

2014-11-11 23:51:17 (edited by Sik 2014-11-11 23:52:27)

Yeah, just found out about espeak (it has a command line tool) and... let's say it's just plain horrible, OK?

On Linux I found flite instead. It isn't installed by default, but I can just make it a dependency (it's in the Ubuntu repo after all) and it doesn't even need a screen reader enabled. Admittedly it's far from great, but by tweaking some parameters the default voices can be made more tolerable. The two biggest issues are that it's English-only (language support seemed to be an issue in general with all the tools I checked) and that I can't stop a voice when I need to start a new one... for now I'm just gonna let users press Shift to make the text get output again (using that key since it's unlikely to conflict and it doesn't cause trouble in the game).

For the record, I think I can fix the latter issue by using the flite library instead, although I'll look into that later (I'm looking for a quick solution right now). Also, it seems flite is available on Windows as well, so I can just install it alongside the game (if I don't just go with the library instead). The only thing I'll need to figure out then is how to carry the voices around.
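
In case I do go the library route, this is roughly what the flite side would look like (a sketch based on flite's own examples; header paths, voice names and link flags may differ between versions):

    /* Sketch based on flite's example code; on Ubuntu the link line is
       typically something like:
         gcc reader_flite.c -lflite_cmu_us_kal -lflite_usenglish -lflite_cmulex -lflite -lm */
    #include <flite/flite.h>

    cst_voice *register_cmu_us_kal(const char *voxdir);  /* from the kal voice library */

    int main(void)
    {
        flite_init();
        cst_voice *voice = register_cmu_us_kal(NULL);

        /* This call blocks until it finishes playing. To be able to interrupt
           speech, the plan would be to use flite_text_to_wave() instead and
           feed the samples through the game's own audio output (SDL), so the
           sound can be cut off at any time. */
        flite_text_to_speech("Text changed in the game.", voice, "play");
        return 0;
    }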

2014-11-12 00:42:42

Sik, when it comes to TTS output on Linux the API you should be using is called Speech-Dispatcher. It is a universal speech wrapper for various TTS engines such as Espeak, Festival, Flite, FreeTTS, Swift, etc., and you can support all of the above with a single API. By all means don't try to support the TTS engines directly, because it A, doesn't give the end user much freedom of choice, and B, some TTS engines are problematic on modern Linux due to the introduction of Pulse. Flite and Festival are both a major pain in the neck on Ubuntu because they are difficult to get working correctly. If you were to just use one of those directly you may end up creating some headaches for you and the end user by inadvertently using an engine with known compatibility issues. Last but not least, most Linux distributions have Speech-Dispatcher installed by default, especially if Orca, the screen reader, comes bundled with the distribution you are using. Just thought you may want to look into Speech-Dispatcher, as it would go a long way toward solving your problems with TTS support on Linux.
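
The C side of it is pretty small too. Roughly like this (written from memory, so double check the header location and the pkg-config name on your distribution; the client name is just an example):

    /* Rough Speech-Dispatcher example in C; build flags typically come from
       `pkg-config --cflags --libs speech-dispatcher`. */
    #include <libspeechd.h>

    int main(void)
    {
        /* "mygame" is just an example client name. */
        SPDConnection *spd = spd_open("mygame", NULL, NULL, SPD_MODE_SINGLE);
        if (!spd) return 1;               /* Speech-Dispatcher isn't running */

        spd_cancel(spd);                  /* interrupt our previous message */
        spd_say(spd, SPD_TEXT, "Menu: New game");

        spd_close(spd);
        return 0;
    }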

Sincerely,
Thomas Ward
USA Games Interactive
http://www.usagamesinteractive.com

2014-11-12 03:06:45

It's worth noting at this point that your idea of horrible can often be our idea of great.  If I had Eloquence samples to post, I'd point you at them--to the sighted user, they're both equally horrid as far as I know, but Eloquence is the most popular synth among this crowd.  One of my biggest annoyances (as well as that of many others) is the current push towards natural voices.  To many (if not a majority, then at least very close to one) of us, it's not about natural.  It's an I/O method and, like other I/O methods like typing, needs to be fast.
I'd say that most of us would agree with me in saying that--to a blind person with a blind person's priorities--flite is an extra special sort of hell.  We train to these synths over time and the top priority is productivity--and, unfortunately, you don't get both fast and natural.
Just be careful when applying your evaluations.  Compared to you, our evaluations are basically blue and orange morality.  If you don't want to call the screen readers directly on Windows, then please leave the clipboard method in--I guarantee you that half of us will end up using it over flite or SAPI.
Since I was the prime example, at least for a little while, I'll mention it.  With Espeak, I used to go 1217 words a minute for casual computer operation, basic programming, etc.  For novels it was 1051.  These numbers dropped because you can only do that when your life is such that you have spare energy, it's almost meditative, and I have never met a sighted person who could even recognize it as words.  I can still do it, if I have to, but am a bit out of practice.  I am now in the ballpark of the sane speed of 800 words a minute or so because my life got busy and I'm not always 100% awake and perky these days, and as far as I know, that speed is where most of us are (including the developer of NVDA).  Yet again, most sighted people are hard pressed to even see it as words.
We have got to prepare some samples and actually collect data on this as a community at some point--sighted people never quite understand what it's truly like, and it's very irritating that I don't have a link to provide when these conversations inevitably come up.  Just think of it like this: for every day of our lives, for every activity that involves a pencil or writing, we're quite probably using a synth.  Now imagine if that synth were "natural" at the cost of topping out around 200 words a minute or so, and take away your ability to glance, read a few words ahead, or skip in any meaningful way, and you will begin to understand.  Some of us go that slow, but it's really a rarity among the kind of person who would be able to set this up in the first place.
P.S: I am in no way saying that clipboard isn't good enough.  Clipboard is good enough, but I wish to forestall you removing it and replacing it with flite.  Also to aid understanding of the synth issue generally, and why we don't consider Espeak to be nearly as bad as you do.

My Blog
Twitter: @ajhicks1992

2014-11-12 05:20:33

OK, so I got around to implementing speech-dispatcher support (which took quite a while because the documentation indicated how to use the functions but not what library was needed or even the header file, so I had to guess those, argh). I also already found a bug in it that I had to work around... (note to those who want to try it: make sure you never pass an empty string to spd_say, otherwise you'll hang the program later). Does anybody know how to make a more decent setup for speech-dispatcher anyway? I doubt that espeak (with stock settings) is even remotely close to what somebody actually uses, and without a good setup I'll probably tune things in a way that turns out to be unusable in practice.
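
For anybody copying this, the fix is just a guard before the call; something like this (the wrapper name and shape are made up, but it's the idea):

    #include <libspeechd.h>

    /* Simplified version of the wrapper the game uses around spd_say();
       the guard is the workaround for the hang mentioned above. */
    static void reader_native_say(SPDConnection *spd, const char *text)
    {
        if (!spd || !text || text[0] == '\0')
            return;   /* never pass an empty string to spd_say */
        spd_say(spd, SPD_TEXT, text);
    }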

I guess now to figure out how to use SAPI then (and more importantly, how to test it in the first place).

camlorn wrote:

One of my biggest annoyances (as well as that of many others) is the current push towards natural voices.  To many (if not a majority, then at least very close to one) of us, it's not about natural.  It's an I/O method and, like other I/O methods like typing, needs to be fast.

Eh, for me "natural" just means "could reasonably pass for a human". And by that I mean the pronunciation is right (although English being a horrible mess in that sense isn't helping matters). Intonation I can understand not being perfect due to lack of context, but come on, flite completely ignores question and exclamation marks (it treats them like periods), and espeak not only does that but if they're repeated it will outright spell them out (when they're usually added for emphasis). Is this normal for most screen readers or what?

I couldn't care less about things like messing with speed, pitch, etc. to push them to extreme values. That doesn't count in my definition of being natural or not; that's just messing with the speech synthesis settings.

camlorn wrote:

P.S: I am in no way saying that clipboard isn't good enough.  Clipboard is good enough, but I wish to forestall you removing it and replacing it with flite.  Also to aid understanding of the synth issue generally, and why we don't consider Espeak to be nearly as bad as you do.

The game is set up to support multiple output methods for the screen reader (the first post should have already hinted at it), so no, clipboard mode is not going away. It sorta irks me in that it needs a very specific setup to work, but it's staying.

2014-11-12 09:21:40 (edited by Sebby 2014-11-12 09:58:27)

I'll just throw my opinions into the mix here in order the better to keep this discussion nice and inflamed. smile

My usual screen reading rate is about 275 WPM; sometimes it's faster, but it's rarely slower. I appreciate natural reading speeds because I enjoy savouring the content. I do realise that other blinks are happy to go faster, but I think I (and, frankly, many others) can afford the luxury of slower speeds.

I do agree with Camlorn, though, that it is a mistake to assume that the sensibilities of the sighted make any sense to blind people, or vice-versa, in the matter of concatenative vs formant synthesis. Your trying to judge the quality of a TTS synthesiser by how "natural" it sounds precisely illustrates the problem; accuracy and speed are higher up on the list than "naturalness", at least in typical screen-reading applications (but ironically, often not in games, which usually only output short, predictable and oft-used sentences). On the other feeler, I dislike eSpeak as compared to Eloquence--both are formant. As a Mac user now using "Alex", which is essentially a concatenative TTS with go-faster signal processing not normally found in such synthesisers, I still miss Eloquence, even though Alex is among the very best such synths I've used. Microsoft's newest speech platform voices come very close too, though given the choice I'd stick to Eloquence again. If I were running your game, I'd just install the older SAPI-enabled versions of the Nuance voices and use those; they'd do fine. And on Linux, I'd use whatever worked--probably eSpeak.

Yes, ag_say is part of (older/obsolete versions of) AudioQuake and not expected to be on anyone's machine that doesn't use it, but really, it sounds to me like you'll be alright coding to the SAPI platform API yourself. A tip for you though, we only used C++ because M$ can't put a proper accessor declaration together for the API header files. COM is a bitch if you're a straight C programmer, but I trust you'll figure it out. Well, I hope you do, anyway …  The ag_say source is available and it's GPL--feel free to use it if it helps you. It takes the stuff to say on stdin. I forget, for the moment, how we identify "Next string interrupt" but we translate that into the appropriate SAPI calls.
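
To get you started, the shape of it in plain C is roughly this (from memory, so treat the details as approximate; you'll want to link ole32, and on MinGW-w64 possibly an extra library such as uuid or sapi for the class IDs):

    /* SAPI 5 from plain C, sketched from memory; define COBJMACROS to get the
       C-style ISpVoice_* call macros from the MIDL-generated header. */
    #define COBJMACROS
    #include <windows.h>
    #include <sapi.h>

    int main(void)
    {
        ISpVoice *voice = NULL;

        CoInitialize(NULL);
        if (SUCCEEDED(CoCreateInstance(&CLSID_SpVoice, NULL, CLSCTX_ALL,
                                       &IID_ISpVoice, (void **)&voice))) {
            /* SPF_ASYNC returns immediately; SPF_PURGEBEFORESPEAK throws away
               whatever was queued first -- the "next string interrupt" idea. */
            ISpVoice_Speak(voice, L"Hello from the game.",
                           SPF_ASYNC | SPF_PURGEBEFORESPEAK, NULL);
            ISpVoice_WaitUntilDone(voice, INFINITE);   /* only for this demo */
            ISpVoice_Release(voice);
        }
        CoUninitialize();
        return 0;
    }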

I've never used non-hardware Linux TTS, so I defer to some other, better authority on the subject. I would keep your eye out for that universal magic bullet that nobody's written yet though, just in case. smile

Good luck in your endeavours.

Edit: listen for yourself to Alex. Listen to the different rates.

Just myself, as usual.

2014-11-13 02:27:32 (edited by Sik 2014-11-13 03:16:05)

OK, before we all keep arguing here: should I just drop messing with voice settings altogether (i.e. go with the defaults always)? I mean, it feels weird having a female character speak with a male voice, but should I just leave it at that or what?

Also, does anybody know how to use SAPI with MinGW-w64? (I don't care about vanilla MinGW.) Because it turns out that apparently I don't have the relevant files, although looking around it does seem like there's an implementation of SAPI for MinGW-w64 (not to mention Qt apparently having it as well, and Qt Creator uses MinGW). I think the problem is that it was only added recently to MinGW-w64 (so it's not in the Ubuntu repo yet) and I'm trying to figure out where the relevant files are so I can install them.

Sebby wrote:

Your trying to judge the quality of a TTS synthesiser by how "natural" it sounds precisely illustrates the problem; accuracy and speed are higher up on the list than "naturalness", at least in typical screen-reading applications (but ironically, often not in games, which usually only output short, predictable and oft-used sentences).

Again, I explicitly said that speed does not factor into my definition of naturalness (that's just a synth setting!), and in fact, for the record, that Alex synth you linked is pretty much the prime example of what I'd consider natural. For the record, I actually cranked up the synth speed here because I found it too slow, and that's despite not being accustomed to it and being horrible at understanding spoken English (it isn't my first language).

And for the record yes, I was talking about the context of games, where having characters sound like people (synthesis tweaking aside) affects how we perceive the game (in a productivity application you just want to know what's on screen). Although beware, when characters talk you may want to speed it up because text is longer but understanding it 100% exactly isn't a priority.

EDIT: is it safe to assume that a Windows system with a screen reader installed already has sapi.dll as well? (at least those screen readers that work with SAPI anyway)

2014-11-13 04:09:14

Espeak defaults are lame.  If you change it a bit it's much better, at least for a blind person.  The periods and all that aren't being ignored, it's just that the inflection/prosody/whatever is too low.  Most screen readers take some additional steps before passing along to the synth, so typing into the espeak command line is also not conveying it--NVDA for one has huge dictionaries in addition to the ones built into the synth that have to do with how to (or not to) say punctuation based on user preference.  NVDA also has tweaked voices, which I'd kill to see on Linux.  Default settings for literally all the screen readers I can think of suck for everyone; it's more a matter of not sucking just enough to let the user tweak them.
As for SAPI and MinGW-w64, well, there are reasons I'm not using MinGW in general and missing hugely big chunks of the Windows API is part of it.  I'm not really sure if this specific one is there, but I would not at all be surprised if it is not.  Most screen readers do not use SAPI directly, and it's not safe to assume that the DLL exists if a screen reader does.  Few of us use SAPI with our screen readers.  I believe that anything post XP just has it, however--Narrator uses it and is built into all Windows systems, so you're probably okay for that reason alone.  I doubt you're targeting 95/98.
Finally, before I tackle this general issue a little more, don't touch speech settings.  Leave them alone.  It's an interesting idea to give each character their own voice, and I'd love to see the ability to also feed synths through 3D audio, but it just doesn't work.  I'm much less fanatical on this issue than most.  I was capable of converting from the religion of Eloquence to the cult of Espeak.  But bringing me down to slowness land would still be excruciating.  I also don't see how you'll ever make that cross platform.
And my remark on the general issue:
Speed and naturalness are inversely proportional.  Start cranking up a "natural" synth and it starts sounding drunk before topping out well below the capabilities of many blind people.  I have yet to have anyone anywhere show me a synth that sounds remotely humanlike and have it continue to do so post-200 words a minute. 
Yeah, rate is a setting, but the maximum value for rate is a feature and you're normally only specifying things in arbitrary rate units that don't link well to reality from your app.  That's why we still use formant synthesis and why, when Eloquence inevitably dies, we'll all probably end up moving to Espeak.  Move the rate maximum higher, and all the other parts of your definition of naturalness go out the window, even at slow speeds.  It's not about having speed be part of your definition, it's that having speed costs on every other part of the definition.
The naturalness vs. productivity debate is an interesting one.  I, for one, am proud of my ability to basically be as fast as my sighted peers, though my insane rates weren't maintainable long-term: I needed complete relaxation and also a surplus of energy to do it.  I could see the sacrifice being made for some blind people, but not professional programmers--time to review the 800 lines of code in 5 minutes.  And yes, many of us vary the speeds based on the task at hand.  Reading LaTeX, for example, means a big, big cut.

My Blog
Twitter: @ajhicks1992

2014-11-13 04:30:37

Just to make it clear, by "default settings" I mean whatever the current system settings are, so e.g. if the user sets the speed to 800 WPM then that'd be the default setting for the program.

Also I managed to find the SAPI header files for MinGW-w64. No libraries, although I think that it just uses the OLE libraries and I already have those, so I believe I'm already set. I'm going to see if by tomorrow I can have something working. As long as the program works on Windows XP (the current system requirement) I should be fine, I can make the DLL a requirement for the SAPI-based screen reader.

camlorn wrote:

Speed and naturalness are inversely proportional.  Start cranking up a "natural" synth and it starts sounding drunk before topping out well below the capabilities of many blind people.  I have yet to have anyone anywhere show me a synth that sounds remotely humanlike and have it continue to do so post-200 words a minute.

OK, I'm seriously convinced I have a completely different idea of "natural" than most people here do. As long as the base synth sounds natural it should be fine in my opinion. And yeah, huge speeds are not really human-like, but I bet that if we could speak that quickly we may sound like that (although I don't know if current methods introduce issues at those speeds, maybe pure synthesis has a chance to fare better than pre-recorded formants at those speeds).

The problem is when it can't sound natural no matter what settings you use, which is the problem with low-quality speech synthesis. The Alex synth linked earlier seemed pretty good to me, for instance (I wish espeak was even remotely close to that).

2014-11-13 07:08:13 (edited by queenslight 2014-11-13 07:13:37)

You may want to try your project with demos of the Supernova screen reader products as well:
http://yourdolphin.com/demos.asp
All demos last 30 days, with no restart timers.
Each version has a separate 30-day period, by the way.

Also, some folks were wanting more Espeak voices for Linux? Check out:
http://espeak-extra-voices.tk/
and thank Kyle B. when ya see him for that site... smile
Update!
Unless I'm missing a page, it looks like the site is undergoing a major facelift, though I can't say for sure.

2014-11-13 17:58:02

Yeah, I tried to look at those espeak voices but it seems they're nowhere to be found.

Anyway, I got SAPI working now (I was lucky to find somebody with a working SAPI install immediately); the only thing I need to figure out is how to set the output language (and hope I don't need to resort to SSML just for that). So I guess that between SAPI on Windows and speech-dispatcher on Linux I should be set. Besides these two (which I'm going to call native mode) and the clipboard mode, should I bother keeping around the standard output method?

2014-11-15 02:02:27

I'd not worry about setting anything from your app.  At all.  The user should and will probably have done this right for their setup.  Keeping the console output might be interesting and worthwhile unless it contributes to making the program more complicated--I don't personally see a use for it, but someone else might.
As for naturalness, our definitions are, believe it or not, the same.  We just don't care about it; many of us are more interested in clarity.   This is a bit long and arguably off topic, but may interest you:
There are two main models of speech synthesis.  The first of these is what you hear as natural and is called concatenative synthesis.  You basically take a bunch of wave files representing phonemes, made using a recording studio and a bunch of manual editing, and stick them together.  On top of this, you throw a bunch of math to make it sound natural, and then you apply a variety of algorithms that can speed up recordings of speech for the rate adjustment.  I'm simplifying, mostly because doing this properly is a black art and trade secret and all of that, and I don't have the background to build one yet (give me a couple more years).  This is the realm of people like Nuance, with their ability to basically make hugely big databases.  Any synth you're aware of that you think of as natural is most likely either using this technique or a technique that's based on it.  Ironically, making a very low quality one is something that any programmer who can concatenate lists can work out, though a database or other rule set for transforming text to phonemes is still needed.  At normal talking speeds, these sound really great.  But you can't crank them up without quality loss no matter how much you wish you could.
The second method, the one most blind people prefer, is definitively not natural.  There are a variety of mathematical models.  The most common of these are based on formant synthesis, which relies on (oversimplifying, again lack of background but give me a couple years) adding sine waves.  You basically pull out the fundamental frequencies of the phonemes and play them back.  The trick, of course, is making it not sound like a flute and getting the transitions right--something that many scientists have spent a long time on.  Espeak uses this as does Eloquence, though the algorithms behind Eloquence are more widely published and known (look up Klatt synthesis, if you're interested).  There are two advantages to this model.  The first, not so much of interest here, is that it requires far fewer system resources and can be anywhere from 100x to 10000x smaller (depends on what you need to synthesize; I've seen an impressively low quality one in 1k of JavaScript once).  The second is that you don't crank up the speed at the end.  Instead, you adjust the words per minute setting, and literally all stages of the pipeline reconfigure themselves--it's an actual mathematical model of the human vocal tract, not just a big data problem.  Consequently you can literally run the whole thing faster without cheating at the end.  Synths like this don't typically require tricks until you want to go post-500 words a minute, and many don't require it even then.  As you turn them up, the loss of quality is very, very minimal.  The cost is naturalness.  They're quite, quite clear, just you'd never, ever, ever mistake them for humans and there's very definitely an adjustment period.  Unfortunately, now that you can do stuff like waste a few hundred megabytes of RAM, and given that 90% of the population is more interested in it being easily understood by anyone with no prior exposure, these are on the decline.  This is unfortunate for us, though it should be said that Espeak and the work NVDA has done on Speechplayer should be forward compatible for at least 20 years.
To draw an analogy, the former is like scaling an image to 200 times its original size.  The latter is like having a function that knows an algorithm to draw the same image at any size, without containing the image itself.  The second might be a bit blurry, but it's going to be pretty much the same amount of blurry for any size no matter how big you make it.
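If you want to hear the principle for yourself, a toy along the lines of the formant idea is just a handful of sine waves at the right frequencies.  The numbers below are ballpark formant frequencies for an "ah"-ish vowel and nothing more; a real synth drives resonant filters with a glottal source and handles transitions, consonants and so on:

    /* Toy illustration only: an "ah"-ish vowel made by summing three sine
       waves at rough formant frequencies. Writes raw 16-bit mono samples to
       stdout; on Linux you could pipe it through something like
       aplay -f S16_LE -r 44100 -c 1. Compile with -lm. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    int main(void)
    {
        const double rate = 44100.0;
        const double formants[3] = { 700.0, 1200.0, 2600.0 };  /* rough F1-F3 */
        const double amps[3]     = { 1.0,   0.5,    0.25 };

        for (int n = 0; n < (int)rate; n++) {        /* one second of audio */
            double t = n / rate, sample = 0.0;
            for (int k = 0; k < 3; k++)
                sample += amps[k] * sin(2.0 * M_PI * formants[k] * t);
            int16_t s = (int16_t)(sample / 1.75 * 0.8 * 32767.0);
            fwrite(&s, sizeof s, 1, stdout);
        }
        return 0;
    }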

My Blog
Twitter: @ajhicks1992

2014-11-15 05:13:23

Another advantage of formant synthesis (although I think formants are just the vowels, not sure) that you didn't mention is that voices and languages are decoupled. Voices are just parameters that tweak the way the waveform is generated, while languages affect the way phones are combined (and which phones are used in the first place). You can't do this easily with concatenative synthesis because you'd need to record the sound for every possible phoneme, so voices and languages end up tied together in those engines.

But yeah, I'd definitely take a formant-based one over a concatenative-based one, both because a well done one would be less prone to issues and because it would allow me to use any voice regardless of what I feed it (and also it's much easier to make a voice for them, since you don't need to find a person and figure out how to get them to make the required sounds). The problem is that the former is much harder to get right, so most of the effort goes into the latter instead. But in theory it should be doable, and it would be flawless if it could be pulled off.

2014-11-15 05:57:20

Actually, not so much.  The parameters for the formant synthesizers are usually derived from the same sorts of recordings.  Klatt's work actually used recordings of Klatt himself.  You need a lot less of them, but you still need them, plus a great deal of fiddling.  I have some idea how this analysis works, but haven't yet done anything along those lines myself.  The tradeoff of needing less recordings is that you're spending a lot of time analyzing those recordings to develop parameters to a very detailed mathematical model.  The real problem is that you lose a ton of the frequency content and can't really get it back; 8 or 10 sine waves and noise generators plus 20 or so filters is really cheap.  200 or 300 sine waves and noise generators plus 4 or 5 hundred filters isn't, and the truth of the matter is that you can't even easily fine-tune the algorithms at that scale.  To give some idea, Klatt's synthesizer (later Eloquence) had 39 parameters if I recall correctly, most of which are really cryptic.  If you dig, you can find some source code for it from the 80s.  I think I still have it, but it's useless in that it's headless: it doesn't even work at the level of phonemes.
And really they're "based" on formant synthesis.  The ideas are similar, but it's actually a very complicated filter setup.  The plosives and various hissing and the like require dedicated algorithms.  This is why I can't do it yet--I don't have the experience in DSP to actually understand the implications of half the literature on the topic.  With a Saturday, I could probably get the vowel sounds, but you still need tables of parameters for your model for every phoneme, as well as dedicated transition rules.  I'm working on learning DSP, namely by successfully writing my own custom mixer that does 3D audio, but it's going to take time yet and I'd probably use Espeak for half of it anyway--the hardest part is, ironically, the text-to-phonemes part.
Also--the iPhone voices for Siri?  I'm virtually certain that they are concatenative, given the size.  If you use VoiceOver, you can download an incredibly close variant for VoiceOver use; it is big enough that you're forced to be on Wi-Fi.  The same voices on the Mac take up multiple gigabytes if you grab them all.  It's common wisdom that concatenative is the way of natural intelligibility, but maybe someone will prove common wisdom wrong and/or we'll get computers powerful enough to brute force the problem somehow.  I am aware of no formant synthesizer anywhere at all whatsoever that even comes close to the results I've seen with concatenative in terms of "sounding nice".  It would perhaps be very interesting to see some sort of machine learning approach to this, however; it seems like a trainable model.
If you're really, really, really interested, Espeak separates the voices from the code.  You can get the source for the voices if you dig, and read them in all their horrible horror.  But you'd have to be really interested to get much out of it.  Because it's horrible.  And horrifying.  And let's say special.

My Blog
Twitter: @ajhicks1992

2014-11-15 06:50:03

OK, so this is what I have in the game currently:

  • Native mode (SAPI on Windows, Speech-dispatcher on Linux)

  • Clipboard mode

  • Standard output mode

Is that good enough?

camlorn wrote:

the hardest part is, ironically, the text-to-phonemes part.

Honestly I think this depends on the language. For instance, with Spanish you can pretty much figure out the pronunciation of a word entirely from the written form (only exception being foreign words), it has its quirks but ultimately it's unambiguous as long as there aren't spelling errors. With English on the other hand the only foolproof method is pretty much a dictionary, since there doesn't seem to be any relation between vowels and their pronunciation at all (and worse, from what I've seen the pronunciation can even change based on the meaning of the word - ugh).
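
To illustrate what I mean for Spanish, here's a grossly simplified fragment of the kind of rules involved (ignoring stress, dialect differences and plenty of special cases; the phoneme labels are made up):

    /* Grossly simplified Spanish letter-to-sound sketch, just to show that
       the mapping is rule-based (no stress, no dialects, and missing cases
       like "gue"/"güe" or diphthongs). */
    #include <stdio.h>

    static int is_front_vowel(char c) { return c == 'e' || c == 'i'; }

    static void spanish_to_phonemes(const char *word)
    {
        for (size_t i = 0; word[i]; i++) {
            char c = word[i], next = word[i + 1];
            if (c == 'q' && next == 'u') { printf("k "); i++; }      /* "que" -> /ke/ */
            else if (c == 'c' && is_front_vowel(next)) printf("T "); /* Castilian /th/ */
            else if (c == 'c') printf("k ");
            else if (c == 'h') { }                                   /* silent */
            else if (c == 'j') printf("x ");                         /* as in "jota" */
            else if (c == 'l' && next == 'l') { printf("y "); i++; } /* "ll" */
            else printf("%c ", c);                                   /* vowels etc. map directly */
        }
        printf("\n");
    }

    int main(void)
    {
        spanish_to_phonemes("queso");   /* k e s o */
        spanish_to_phonemes("cielo");   /* T i e l o */
        spanish_to_phonemes("llamar");  /* y a m a r */
        return 0;
    }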

Proper intonation is a whole different can of worms though; the only way to truly work around it is to add metadata (this is what SSML does), although even without that you can at least try to use heuristics to get something out of it (e.g. if the sentence ends in a question mark you can assume it needs the intonation of a question).

2014-12-08 01:55:11

Returning to this thread since I'm looking for more information (though I should probably change the name of this thread).

Does anybody happen to know anything about UI Automation or Microsoft Active Accessibility? I found the documentation but they seem to be rather complex (I could be wrong), so I was wondering if somebody here could give me a hint as to where to start looking.

Also if somebody happens to know the Linux equivalent (I think Gnome and KDE have some way to talk to screen readers) that would be nice, though if not I'll probably try looking into it later anyway.