The text processing part of the system was trained by way of machine learning by the folks who made Festival/Festvox/Flite, so technically I am using machine learning - just not in the synthesis backend code.
eSpeak is certainly similar to what I want to do, but I am personally not very fond of its output, so I wanted to see if I could achieve something different.
@BoundTo:
1. The formant frequencies and bandwidths do vary between speakers, especially between men, women and children. It is definitely possible to derive a new formant table to add a new voice, though some other tweaks would be needed as well, such as defining the average pitch.
2. I generate two completely separate signals, one with white noise and one with the pulse; a sawtooth in this case. The noise runs through the unvoiced filters and the sawtooth runs through the voiced ones. They are then combined using envelopes to smoothly turn the two sources on and off as appropriate.
3. Rendering the slow version of the "visual roses" sentence, which comes out to 3.91 seconds of audio, took 31 milliseconds on my laptop, or roughly 125 times faster than real time. This is generating the whole thing in one go, however, which you would not do in a streaming application - you could generate as little as 5 or 10 milliseconds of audio at a time in many cases. Note that I have not done any work on optimizing the code; I'm sure I could speed it up significantly down the road.
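To make point 2 a bit more concrete, here is a rough Python sketch of the two-source idea. It is not the actual synthesizer code: the sample rate, the function names, the raised-cosine fade and the complementary gains are all assumptions made for illustration, and the voiced and unvoiced filters are left out entirely, so the raw sources are mixed directly.

```python
import math
import random

SAMPLE_RATE = 16000  # assumed for this sketch; not stated in the post


def sawtooth(num_samples, pitch_hz, sample_rate=SAMPLE_RATE):
    """Naive (not band-limited) sawtooth in the range [-1, 1)."""
    out = []
    phase = 0.0
    step = pitch_hz / sample_rate
    for _ in range(num_samples):
        out.append(2.0 * phase - 1.0)
        phase = (phase + step) % 1.0
    return out


def white_noise(num_samples, seed=0):
    """Uniform white noise in [-1, 1], seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


def voicing_envelope(num_samples, switch_at, fade_len):
    """Gain for the voiced source: 0 before switch_at, then a
    raised-cosine ramp of fade_len samples up to 1 (hypothetical shape)."""
    env = []
    for n in range(num_samples):
        if n < switch_at:
            gain = 0.0
        elif n >= switch_at + fade_len:
            gain = 1.0
        else:
            t = (n - switch_at) / fade_len
            gain = 0.5 - 0.5 * math.cos(math.pi * t)
        env.append(gain)
    return env


def mix_sources(voiced, unvoiced, voiced_gain):
    """Combine the two signals; the unvoiced gain is simply 1 - voiced gain
    here, which is an assumption rather than the method described above."""
    return [v * g + u * (1.0 - g)
            for v, u, g in zip(voiced, unvoiced, voiced_gain)]


# Example: 100 ms of audio that starts unvoiced and fades to voiced
# over 10 ms, beginning at the 50 ms mark.
n = SAMPLE_RATE // 10
saw = sawtooth(n, pitch_hz=110.0)
noise = white_noise(n)
env = voicing_envelope(n, switch_at=n // 2, fade_len=SAMPLE_RATE // 100)
mixed = mix_sources(saw, noise, env)
```

In a real system each source would pass through its own filter bank before the mix, and the envelopes would follow the voicing of each phone rather than a single fixed switch point.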
@Jack Festvox is a suite of voice building tools; you don't actually use the code in Festvox in the final synthesizer. Since I am using Flite for text processing, I'm still very much making use of Festvox. But you're absolutely right in thinking that all the synthesis code is written by hand.
@harrylst Yes, you could definitely make a Swedish voice using the Festvox tools as a starting point. You would train models for duration and fundamental frequency prediction based on natural speech, and you would train a letter-to-sound rule set by analyzing a large pronunciation dictionary. All this is possible in Festvox, and the output can be converted to Flite, which means I could get phones, durations and a pitch contour for a sentence. But it would take some tweaking and quite a bit of trial and error to get the various models right, not to mention getting formant settings for all the sounds that differ from English.
Thanks so much for all the positive feedback, everyone!
Kind regards,
Philip Bennefall