2019-01-13 04:26:01

Hi all,

I just wanted to share a little pet project of mine with you all. It is an attempt to create a formant-based speech synthesizer from scratch. If you don't know what a formant-based speech synthesizer is, think Eloquence or DECtalk. Formant synthesis generates speech from mathematical models rather than from prerecorded segments of speech.
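[For readers new to the idea: the core of a formant synthesizer is a bank of resonant filters, one per formant. This is not the project's code, just a minimal sketch of the common Klatt-style two-pole resonator from the textbook literature, in Python:]

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for a classic two-pole resonator (one formant),
    in the Klatt-style form y[n] = a0*x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    b2 = -r * r
    a0 = 1.0 - b1 - b2  # normalizes the filter gain at DC
    return a0, b1, b2

def apply_resonator(samples, freq_hz, bandwidth_hz, sample_rate):
    """Filter a list of samples through a single formant resonator."""
    a0, b1, b2 = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    out, y1, y2 = [], 0.0, 0.0
    for x in samples:
        y = a0 * x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Running a buzz-like source through several of these, with the frequencies and bandwidths changing over time, is the essence of formant synthesis.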

I have nothing to release yet and probably won't for some time, but I hack away at this on weekends. I am keeping a diary of my efforts with accompanying audio examples. In case anyone is interested, the URL is:
http://blastbay.com/blastvox/

I am still at a very early stage so don't expect anything impressive. Still, it's a lot of fun and I make slow but steady progress.

I will update this from time to time, so feel free to check back every now and then if you're curious.

Kind regards,

Philip Bennefall

2019-01-13 05:47:07

This is a pretty interesting project. Hats off for the hours spent scratching your head at the online articles that never seem to give you the piece of information to make everything click inside your head. I've definitely been there. I'd love to see how far you are able to get with this. I gave the entries a quick read, but I'll probably have more questions once I'm able to sit down and wrap my head around everything.

Trying to free my mind before the end of the world.

2019-01-13 06:07:29 (edited by philip_bennefall 2019-01-13 06:10:19)

Sounds good. I'm definitely open to tips and suggestions from those who are interested. I'm having a lot of fun with this and I welcome feedback from anyone who takes the time to read through the text and listen to the output.

Kind regards,

Philip Bennefall

2019-01-13 06:36:04

Honestly I'm just glad you're not dead.

2019-01-13 06:39:33

This is pretty cool.

Facts with Tom MacDonald, Adam Calhoun, and Dax
End racism
End division
Become united

2019-01-13 06:42:01

Oh, far from dead; I'm just not doing much of anything relating to audio games, so I haven't had a lot of relevant things to post on here. Strictly speaking this is not really relevant to games either, which is why I put it in the off-topic room, but I figured some people might find it fun to follow along since I'm sure I'm not the only one around here who is interested in speech synthesis.

Kind regards,

Philip Bennefall

2019-01-13 08:17:25

Yikes. I always wondered how painful it was to make something like Eloquence. Now I know.

2019-01-13 08:29:16

Can you post a link to the git repo? I would like to look at the source code. I think this is also an interesting project.

2019-01-13 09:50:56

I'm keeping the source closed for now, just so that I can decide what to do with the final product if it ever reaches acceptable quality. If I release it as open source now, I limit my options later. It is not out of the question that I will release it as open source in the future, but I want to work on it a little more before I make my decision.

Kind regards,

Philip Bennefall

2019-01-13 09:58:03

I hope this project works out for you

2019-01-13 12:30:59

Nice enthusiasm in here! Keep on hacking the funky stuff, and maybe one day there will be another old-school formant voice to use in electronic music.

2019-01-13 12:44:15

I just had another hacking session and managed to get semi-voiced sounds working, as well as a short fade at the beginning and at the end to get rid of clicks. There is a new diary entry with audio examples (see post 1 for the link).
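[For anyone curious what such an edge fade might look like in code, here is a minimal sketch using a simple linear ramp; the synth's actual fade may well differ:]

```python
def apply_edge_fades(samples, fade_samples):
    """Apply a linear fade-in and fade-out at the edges of a segment
    so it starts and ends at zero, avoiding audible clicks."""
    out = list(samples)
    n = min(fade_samples, len(out) // 2)
    for i in range(n):
        gain = i / n        # ramps from 0.0 up toward 1.0
        out[i] *= gain      # fade in at the start
        out[-1 - i] *= gain # fade out at the end (mirrored)
    return out
```

A few milliseconds' worth of samples is usually enough, and a raised-cosine ramp clicks slightly less than a linear one.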

Kind regards,

Philip Bennefall

2019-01-13 13:07:14

Hi,
Maybe you can use deep learning for it as well, e.g. convolutional neural networks.
Maybe WaveNet can help you a little bit.
Also, eSpeak is somewhat like what you are trying to achieve.

2019-01-13 17:00:06

Questions:
1. Wouldn't the formant/bandwidth values for voiced sounds slightly vary depending on the voice itself? (my point being, if a different person recorded the stories, would your formant values be different?)
2. So, since you ended up needing two sets of filters for voicing, are you running the same signal through both sets of filters in parallel?
3. How long does it take to generate an audio file for a short sentence like the ones you've been using for testing?

2019-01-13 18:17:04

Neural networks/deep learning aren't feasible for a formant synthesizer like the one being developed here. Google, for example, has stupid amounts of compute power allocated to just WaveNet. Granted, they've trimmed the resource requirements significantly from when they started, but it's still far from something that can run on an embedded device.
As for the synth itself, I like what I'm hearing so far! I assume you've moved on from Festvox and are just doing this by hand now, since Festvox, from the looks of it, requires quite a lot of resources to run?

2019-01-13 20:19:57

@jack
For something that started out being done by hand, the synth sounds incredible to me so far. Coming from information gathered from the web and a few estimates, it's definitely farther than I'd be able to take it. The processing power needed for WaveNet is insane. I don't remember the exact figures, but generating one second of audio, even with the top-notch resources of a company like Google, took an extremely long time (at least at the time the article I read a while back was posted).

2019-01-13 20:39:42

Formant speech synthesis has nothing to do with that whole WaveNet stuff.

2019-01-13 20:49:17

This is super interesting. Will it be easy (once you get the synth up and running) to make it bilingual? If, for example, it could support Swedish as well? (Although that'd be a headache in itself because of the different rhythms of the languages...)

skype name: techluver
Feel free to add me.

2019-01-13 21:28:30

WaveNet is a concatenative respin that involves sample-by-sample synthesis rather than splicing; either way, it is not formant synthesis at all. And Festvox is diphone-concatenative.

2019-01-13 22:14:37

The text processing part of the system was trained by way of machine learning by the folks who made Festival/Festvox/Flite, so technically I am using machine learning - just not in the synthesis backend code.

eSpeak is similar to what I want to do for sure, but I am personally not very fond of its output, so I wanted to see if I could achieve something different.

@BoundTo:

1. The formant frequencies and bandwidths do vary for different speakers, especially between men, women and children. It is definitely possible to derive a new formant table to add a new voice, though some other tweaks would be needed as well such as defining the average pitch etc.

2. I generate two completely separate signals, one with white noise and one with the pulse (a sawtooth in this case). The noise runs through the unvoiced filters and the sawtooth runs through the voiced ones. They are then combined using envelopes that smoothly turn the two sources on and off as appropriate.

3. Rendering the slow version of the "visual roses" sentence, which comes out to 3.91 seconds, took 31 milliseconds on my laptop. This is generating the whole thing in one go, however, which you would not do in a streaming application; you could generate as little as 5 or 10 milliseconds of audio at a time in many cases. Note that I have not done any work on optimizing the code; I'm sure I could speed it up significantly down the road.
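[The parallel source/filter arrangement described in answer 2 might be sketched like this. This is a rough illustration only: the formant frequencies, bandwidths, and gains below are invented, and the project's real code surely differs.]

```python
import math
import random

SAMPLE_RATE = 16000

def sawtooth(freq_hz, num_samples):
    """Naive sawtooth pulse source (aliased, but fine for a sketch)."""
    return [2.0 * ((i * freq_hz / SAMPLE_RATE) % 1.0) - 1.0
            for i in range(num_samples)]

def white_noise(num_samples, seed=0):
    """Seeded white-noise source for the unvoiced branch."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

def resonator(samples, freq_hz, bandwidth_hz):
    """Two-pole formant resonator (Klatt-style difference equation)."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / SAMPLE_RATE)
    b2 = -r * r
    a0 = 1.0 - b1 - b2
    out, y1, y2 = [], 0.0, 0.0
    for x in samples:
        y = a0 * x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def mix_sources(num_samples, voiced_gain, unvoiced_gain):
    """Run each source through its own filter chain, then blend the two.
    A real synthesizer would ramp these gains over time with envelopes."""
    voiced = sawtooth(120.0, num_samples)
    for f, bw in [(660.0, 60.0), (1720.0, 90.0), (2410.0, 150.0)]:
        voiced = resonator(voiced, f, bw)
    unvoiced = white_noise(num_samples)
    for f, bw in [(2500.0, 200.0), (4500.0, 400.0)]:
        unvoiced = resonator(unvoiced, f, bw)
    return [voiced_gain * v + unvoiced_gain * u
            for v, u in zip(voiced, unvoiced)]
```

Replacing the constant gains with per-sample envelopes gives the smooth voiced/unvoiced crossfading described above.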

@Jack Festvox is a suite of voice-building tools; you don't actually use the code in Festvox in the final synthesizer. Since I am using Flite for text processing, I'm still very much making use of Festvox. But you're absolutely right in thinking that all the synthesis code is being done by hand.

@harrylst Yes, you could definitely make a Swedish voice using the Festvox tools as a starting point. You would train models for duration and fundamental frequency prediction based on natural speech, and you would train a letter to sound rule set by analyzing a large pronunciation dictionary. All this is possible in Festvox, and the output can be converted to Flite which means I could get phones, durations and a pitch contour for a sentence. But it would take some tweaking and quite a bit of trial and error to get the various models right, not to mention getting formant settings for all the sounds that differ from English.

Thanks so much for all the positive feedback, everyone!

Kind regards,

Philip Bennefall

2019-01-13 23:30:32

Hey Philip,
How do you plan to release this? A SAPI 5 engine, an NVDA add-on, or both?

be a hero and stop Coppa now!
https://docs.google.com/document/d/1Dkm … DkWZ8/edit
-id software, 1995

2019-01-13 23:31:00

Or an Android TTS voice?

2019-01-14 00:11:44

Probably all of them, if I reach a high enough level of quality. If the output is reasonable, I can package the system in all sorts of ways.

Kind regards,

Philip Bennefall

2019-01-14 01:47:28

This sounds really cool! I'm looking forward to learning more about this!

Personally I absolutely cannot stand Espeak, so I would love to see something better on the open-source market as it were.

2019-01-14 06:09:04 (edited by visualstudio 2019-01-14 06:19:11)

Regarding deep learning and WaveNet:
WaveNet takes audio as its input for training, and it can produce audio output.
I'm not talking about the computational power either, just about the speech model.
It is not only for generating speech; it can even generate music.
Regarding Festival/Festvox, they are not so much needed in this project, since Festvox by itself is used to build voices for Festival/Flite.
Also, Festival has support for diphones, but its recommended approach is Clustergen (statistical parametric synthesis).
Now, coming to the text processing part:
For converting text into phones, g2p is your best option.
It becomes better when it is trained as a sequence-to-sequence model.
P.S. Check out SoLoud; it has a little speech synthesizer.