2019-01-14 06:50:40

Oh I have used SoLoud in the past, used it for my little board game called Jungle. I don't rate its speech synthesizer very highly though, nor does the author as far as I am aware.

As for text processing, I am using the default lang/usenglish and lang/cmulex pretrained models that ship with Flite. I'm not sure exactly how these were trained, but I will dive deeper into that if I ever attempt to make a Swedish voice.

Kind regards,

Philip Bennefall

2019-01-14 07:51:38

I just read up on the grapheme-based language models in Festvox, and they seem awesome when you are doing a language for which you don't have a lot of linguistic data. But in the case of Swedish, if I ever get far enough to begin looking at making a Swedish model, I think I will do it the slower way, with letter-to-sound rules trained from a large lexicon. I found a huge lexicon for Swedish which contains over 700,000 words and is in the public domain, so I figured that might be a good starting point. Also, Swedish is my native language, so mapping the formants for the various sounds shouldn't be too difficult. Actually, I'm rather looking forward to it, so I hope I can get the actual synthesis part up to scratch. If I can, adding new languages and voices will be a blast.

Kind regards,

Philip Bennefall

2019-01-14 20:05:02

Also, if your language is difficult to process like mine (vowels between nouns, etc.), consider training an n-gram model for it as well.
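For what it's worth, the n-gram idea can be illustrated with a toy character-level bigram model (purely my own sketch; the function names and the tiny word list are made up for illustration):

```python
from collections import defaultdict
import math

def train_bigrams(words):
    """Count adjacent-character pairs, with ^ and $ as word-boundary markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "^" + w + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts

def log_prob(counts, word, alpha=1.0):
    """Add-one-smoothed log probability of a word under the bigram model."""
    padded = "^" + word + "$"
    vocab = len(counts) + 1
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts[a]
        total += math.log((row[b] + alpha) / (sum(row.values()) + alpha * vocab))
    return total

model = train_bigrams(["katt", "hatt", "matta"])
# Strings built from seen transitions score higher than random letter runs,
# which is the property that helps disambiguate hard-to-process input.
```

A real system would train on a large lexicon and use higher-order n-grams with better smoothing, but the scoring idea is the same.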

2019-01-14 22:26:56

Well, that would be cool. It's kinda fun following this project's development, because I've always wondered how synthesizers were developed.
If you ever want to branch into other languages, I'm sure you'd have many helpers here. smile

skype name: techluver
Feel free to add me.

2019-01-19 08:48:14

I wanted to throw out a question for anyone who might have some ideas. I'm trying to figure out how best to transition between phones in an utterance. A phone is a single sound that you can say in a given language, not necessarily the same as a letter. The list of phones recognized by the synthesizer at the moment can be found at:
http://blastbay.com/blastvox/documents/ … oneset.txt

As input, I take a list of phones, and for each phone I also specify a duration in seconds. Now I am trying to figure out how to transition between the phones. At the moment I start halfway through the current phone and begin transitioning into the next. This works for some vowels but is very bad for other types of sounds. Instead, I was thinking of having a static duration for phone transitions, and only beginning the transition to the next phone once the entire duration of the current phone has elapsed. So if you have two phones which are both 100 milliseconds long and the transition time between phones is 20 milliseconds, it would play the entire 100 milliseconds of the first phone, transition from the first phone to the second over the next 20 milliseconds, and then play the remaining 80 milliseconds of the second phone. This gets a bit more complex with diphthongs such as in the word I, but I could possibly take the remainder of the phone after the transition from the prior one and do the diphthong transition there. Does anyone have any thoughts on this idea, or perhaps a completely different approach?
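To make the proposal concrete, here is a tiny sketch of that scheduling scheme (my own illustration with hypothetical names, not the synthesizer's actual code): each phone plays for its full duration, and the fixed-length crossfade into the next phone is borrowed from the start of that next phone.

```python
def schedule_segments(phones, transition_s=0.02):
    """Turn (phone, duration_s) pairs into steady and transition segments.

    Each phone's full duration is played; the crossfade of length
    transition_s is taken out of the *following* phone's duration.
    """
    segments = []
    for i, (name, dur) in enumerate(phones):
        if i > 0:
            prev = phones[i - 1][0]
            # Crossfade from the previous phone into this one.
            segments.append(("transition", prev, name, transition_s))
            dur -= transition_s  # the steady part of this phone shrinks
        segments.append(("steady", name, max(dur, 0.0)))
    return segments

# Two 100 ms phones with a 20 ms transition: the full first phone,
# then the crossfade, then the remaining 80 ms of the second phone.
schedule_segments([("aa", 0.1), ("s", 0.1)], transition_s=0.02)
```

A diphthong could then be rendered inside the remaining steady portion of its slot, after the crossfade from the prior phone has finished.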

Kind regards,

Philip Bennefall

2019-01-21 03:52:03 (edited by musicalman 2019-01-21 03:57:28)

I would've liked to respond to this sooner, but was having issues logging into the site. Thankfully it's sorted out.

Your research into this is really wild to me. I'm no expert in these things, but I am really interested in this stuff. I only wish I could code so that I could either try to build something on my own or contribute to yours smile.

One thing I meant to say earlier is that, at least in the examples on your site, some of the transitions are, as you said, a little messy. One that particularly stuck out to me was the transition from voiced to unvoiced. It sounded too long to me, as though the synth were drunk. Lol. Of course the lengths of these transitions could change with different speaking rates, but as a baseline value it sounds too long in proportion to the vowels to me. Not sure if your most recent question ties into that, but I thought I'd mention it anyway.

As to the transition between phones, if I am understanding you correctly, you want to know whether a static transition time between any two phones is suitable? My layman's understanding of synthesis tells me it would be fine, at least as a starting point, to use the same transition time between consonants and vowels. With diphthongs, or two adjacent vowels, I'm not so sure, but this would be a good opportunity for me to give it some thought.

A long long time ago, when I was a young teenager with nothing better to do, I did play around with this stuff in Dectalk, and I also had an old program called Flex Voice (which btw if anyone has Flex Voice, please get in touch). I wish I'd kept my singing synth stuff around but sadly I either lost most of it or deleted it.

In both synths, when singing or doing complex manual phoneme work, you specify phonemes and durations. The phonemes are exactly like those in the phone list in your post; obviously the codes were not the same, but it's the same principle. You even specified duration in milliseconds, which is mainly why I never continued my singing efforts, because syncing it with music would be difficult, not to mention there is at least one bug I know of in Dectalk that messes with the millisecond values of certain phonemes. Anyway, I found that setting consonants like b, d, f, g, h, k etc. to 60 milliseconds generally worked well if I remember right, and I don't remember having to change them. For words like grass, that have multiple consonants at the beginning, I can't remember what I did. I also don't remember, in a musical setting, whether I wanted to put the consonant on the beat or the vowel on the beat; something tells me the latter is more appropriate. For your speech synth efforts that obviously doesn't matter, since the transition will happen in either case, but it's an interesting question when you're trying to do singing.

As for diphthong transitions, they were always controlled by the synthesizer, which for a geek like me does get a little annoying when it transitions in a way you don't particularly want in the context of what you're doing. For example, let's say I want to make the word ice with a long duration. Some synths will stretch the transition from ah to ih. The stretch will be proportional, meaning that if you make it 5 seconds long, you'll get a very slow transition. If I remember right, Dectalk has a maximum length for this transition and will just hang on the ih sound at the end for the rest of the phoneme. I suspect other synths will do the inverse, that is, prolong the ah and do the transition at the end, which sounds somewhat comical. As to which one is best, I can't really say, especially for normal speech.

All of this reflection on speech synths reminds me of my own efforts to explore the concept, but being the techy/musical person I am, I like putting things in a musical context, as you can probably tell from this post.

I don't know if you've ever heard of a machine called the Voder. The Voder is an old machine from the late 1930s iirc. It was never commercially used but was more an educational contraption built to show that with the cutting-edge technology of the time, it was possible to synthesize speech. I think the research that went into synthesizing speech with the Voder also helped in building the vocoder concept. Anyway the Voder supposedly had a complex control board to adjust analog circuitry (filters, oscillators etc) to make speech sounds. Of course it wasn't an automatic tts. To make any sort of speech you had to learn precisely how to move the controllers to produce different phonemes, and this required many months of training to master.

One day I was exceptionally bored and started a script for a musical instrument player called Sforzando with the goal of making a singing synth, and its usage is roughly the same idea as the Voder. To make things complicated, each formant filter's frequency is assigned to a different control knob, mainly because in Sforzando there is almost no way I could do any complex transitions automatically, so it has to be done manually. As for unvoiced sounds, I haven't at all decided how to tackle those. So this thing would never be useful to anyone unless they wanted to experiment, but it's cool that we are approaching speech synthesis from totally different angles. I am already learning a lot from reading about your efforts. The difference between parallel and cascade filter setups is really interesting to me, as well as your take on synthesizing unvoiced consonants. I fully relate to how tricky it can be to EQ noise in such a way as to create consonants, and I'm wondering how you did it. I will be especially interested to see how you make f and th distinguishable; the better formant synths make valiant attempts (at least Eloquence does, and that's the one whose sound I know best).

Well that's enough rambling from me. Hope it was at least a fun read anyway.

Make more of less, that way you won't make less of more!
If you like what you're reading, please give a thumbs-up.

2019-01-21 14:58:13

Sorry, I didn't read this entire thread. What are you using for a text-to-phoneme engine? Did you write your own, did you use ESpeak or something else as a frontend translator, or do you just not have one yet? Also, if I may be of a little assistance: ESpeak is the synth that advertises itself as a formant synth; its resonances seem a bit more hollow sounding and the f0 is a weird waveform. Klatt synthesis is what DECtalk and Eloquence use, which I'm sure is similar, but I haven't dug into all the particularities. That may come down to the filter bank selection or whatever. I know there are a few implementations of Klatt synths you can get, one of them being the now dead NVSpeech Player.

----------
An anomaly in the matrix. An error in existence. A being who cannot get inside the goddamn box! A.K.A. Me.

2019-01-21 15:13:14

Ah, Flite. OK, well I guess that's a bit more compatible with your phoneme tables smile. It mentions a repository; where can I find it, and what do I need to make it run?


2019-01-22 02:14:16

@musicalman Thanks for the thorough response! I have definitely been playing around with the idea of singing. In fact, singing is much easier to do than natural speech as it is much more confined. I added pitch the other day so you can supply a list of target points across the utterance. Unlike DECTalk, you specify the list of F0 targets independently from the list of phones, so you could make a vowel that is stretched out but which changes notes all over the place. I will prepare something to show this over the weekend.
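The decoupled F0 idea might look something like this in miniature (my own sketch, not the actual implementation): pitch targets are (time, Hz) pairs for the whole utterance, and the value at any instant is interpolated independently of phone boundaries, so one stretched vowel can glide through several notes.

```python
def f0_at(targets, t):
    """Linearly interpolate pitch (Hz) at time t from (time_s, hz) targets.

    Targets must be sorted by time; outside the target range the nearest
    endpoint value is held.
    """
    if t <= targets[0][0]:
        return targets[0][1]
    if t >= targets[-1][0]:
        return targets[-1][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return f0 + frac * (f1 - f0)

# Rise from A3 to E4 and back over one second, regardless of which
# phones happen to be sounding at the time.
targets = [(0.0, 220.0), (0.5, 330.0), (1.0, 220.0)]
f0_at(targets, 0.25)  # halfway up the first glide -> 275.0
```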

The Voder was amazing. In fact, it is the first machine that is demonstrated in this rather interesting vinyl recording from 1986 by Dennis Klatt, the author of DECTalk:
https://www.dropbox.com/s/ky5vxfnpc79a9 … ng.au?dl=1

Speaking of vocoders, I did write one of those as well a few weeks ago. Haven't decided what to do with it yet but it sounds pretty cool. You can do the traditional singing synthesizer effect, and if you use a generated carrier you can also do time stretching where the speech gets slower or faster. I might release it as an open source library or a VST plugin at some point, not sure yet.
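As a rough illustration of the channel vocoder effect being described, a per-frame FFT band split works like this (a generic numpy sketch of the technique, not Philip's implementation):

```python
import numpy as np

def vocode_frame(modulator, carrier, n_bands=16, eps=1e-12):
    """Impose the modulator's per-band energy envelope onto the carrier.

    Both inputs are equal-length frames of samples. The spectrum is split
    into n_bands groups of FFT bins, and each carrier band is rescaled so
    its energy matches the modulator's energy in that band.
    """
    window = np.hanning(len(modulator))
    M = np.fft.rfft(modulator * window)
    C = np.fft.rfft(carrier * window)
    out = np.zeros_like(C)
    edges = np.linspace(0, len(C), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi == lo:
            continue  # skip degenerate bands on very short frames
        m_rms = np.sqrt(np.mean(np.abs(M[lo:hi]) ** 2))
        c_rms = np.sqrt(np.mean(np.abs(C[lo:hi]) ** 2)) + eps
        out[lo:hi] = C[lo:hi] * (m_rms / c_rms)
    return np.fft.irfft(out, n=len(carrier))
```

Time stretching with a generated carrier then amounts to stepping through the modulator's frames at a different rate than the output is rendered, since the carrier isn't tied to the original timeline.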

Regarding unvoiced sounds and transitions, both of these things are definitely work in progress. I did make some improvements to both last weekend but didn't have time to write another diary entry; look out for one this weekend.

@x02 Both DECTalk and ESpeak are formant synthesizers. Klatt synthesis is a style of formant synthesis where you have a bunch of defined parameters that you can play with to make speech sounds. It was used in the MITalk system from MIT, in DECTalk, and possibly in a few others - I'm not entirely sure. In other words, ESpeak and DECTalk are both formant synthesizers, just based on different techniques for generating the excitation signal and a few other details, as far as I know.

I do actually have a public domain implementation of a Klatt synthesizer, but I didn't like the sound of it so I wanted to make one from scratch using the techniques I knew. Whether I will get all the way remains to be seen, of course, but it's a lot of fun to have it as a side project regardless of how it turns out.

There is indeed a repository but I haven't made it public. I just didn't edit the mentions of it out of the diary, as I wanted to keep it intact. I haven't decided what to do with the implementation yet; it really depends on what kind of quality level I am able to reach.

Kind regards,

Philip Bennefall

2019-01-22 18:13:30

philip_bennefall wrote:

The Voder was amazing. In fact, it is the first machine that is demonstrated in this rather interesting vinyl recording from 1986 by Dennis Klatt, the author of DECTalk:
https://www.dropbox.com/s/ky5vxfnpc79a9 … ng.au?dl=1

That is indeed a cool set of old speech synth examples. If you want a cool overview of the Voder, you can check out this video.

philip_bennefall wrote:

Speaking of vocoders, I did write one of those as well a few weeks ago. Haven't decided what to do with it yet but it sounds pretty cool. You can do the traditional singing synthesizer effect, and if you use a generated carrier you can also do time stretching where the speech gets slower or faster. I might release it as an open source library or a VST plugin at some point, not sure yet.

That's pretty cool. I've always been looking for good vocoders that are intelligible and clear-sounding (I never was a fan of a lot of traditional analog vocoders, because the speech is somewhat disguised).

If we want to start the cringing, I could bring up PB Vocoder, which was actually pretty cool. The only odd thing about it for me was the sort of springy sound it had, which I'm guessing comes from the types of filters it used for band splitting. From what I've heard, most things use what are known as minimum-phase filters. I don't know how they work, but I do know that such filters tend to produce that sort of sound, because I've played with EQs that offer a minimum-phase or linear-phase option. The springy sound gets more pronounced with minimum phase when band splitting, especially if you're using a lot of bands. Linear phase sounds cleaner but is more resource intensive and is probably a lot harder to implement, so most things don't use it unless there is a particular reason why it is more appropriate.
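To make the distinction concrete: a linear-phase band filter is simply a symmetric FIR, so every frequency is delayed by the same amount and the bands recombine without phase smearing. A little windowed-sinc sketch (my own example, assuming numpy; the function name is made up):

```python
import numpy as np

def linear_phase_bandpass(lo_hz, hi_hz, fs, num_taps=255):
    """Windowed-sinc FIR bandpass (difference of two ideal lowpasses).

    The impulse response is symmetric, so the filter has exactly linear
    phase: a constant delay of (num_taps - 1) / 2 samples at every
    frequency. Minimum-phase designs are cheaper and lower-latency, but
    they delay each frequency differently, which is one plausible source
    of the 'springy' sound when splitting into many bands.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2

    def ideal_lowpass(fc):
        # Ideal lowpass impulse response at cutoff fc (Hz).
        return 2 * fc / fs * np.sinc(2 * fc / fs * n)

    return (ideal_lowpass(hi_hz) - ideal_lowpass(lo_hz)) * np.hamming(num_taps)

h = linear_phase_bandpass(300.0, 3000.0, 16000.0)
# h is symmetric: h[k] == h[-1 - k], the defining property of linear phase.
```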


2019-01-22 19:59:10

@musicalman Here are a few examples of the vocoder I made.

Original speech:
https://www.dropbox.com/s/wdskvp98zpnuu … h.wav?dl=1

Vocoded with a sawtooth chord:
https://www.dropbox.com/s/1pxxlhq14kaq6 … l.wav?dl=1

Same output but slowed down:
https://www.dropbox.com/s/19vrtok5ftqgg … w.wav?dl=1

Same output with shifted formants:
https://www.dropbox.com/s/mokw4va4vwoz7 … r.wav?dl=1

Let me know what you think!

Kind regards,

Philip Bennefall

2019-01-22 21:47:04

PB Vocoder sounded cool. It's sad it was killed off.

2019-01-23 01:47:06

Speaking of vocoders, I found the first ever recording of a vocoder.
https://ptolemy.berkeley.edu/eecs20/speech/vocoder.html

2019-01-23 08:55:30

The audio examples I posted above are not of PB Vocoder, but of a new implementation that I wrote a few weeks ago. I'd be curious to hear what you all think of the sound. I haven't yet decided whether to release it or not, but I might clean it up and publish it at some point if people like the examples.

Kind regards,

Philip Bennefall
P.S. I wasn't able to hear the examples on that page because apparently I don't have the Java runtime. Ah well.

2019-01-23 15:19:02

To me personally, PB sounded waaay better and more oldschool, while this new one is just another vocoder without any outstanding characteristics.

2019-01-23 19:42:01

This implementation sounds really good. I remember using PB Vocoder (despite the demo beeps), and even that didn't sound half bad. It would be interesting to try this thing out with Goldwave if it were released as a VST. That's probably the way to go these days, since this stuff is common plugin material, what with the amount of reverb plugins you see floating around, for example.

2019-01-23 20:11:05

For me a good old standalone PB Vocoder would be the perfect match, but it's probably not possible to get, so I won't get stuck in illusions and dreaming.

2019-01-24 02:02:16

Such a program isn't necessary if it means extra code. A VST is like a program anyway, what with the interface and so on, just without having to worry about drawing a window. After all, if Abbey Road Reverb can simply be a VST/VST3, then a vocoder has that same potential, including an accessible UI.

2019-01-24 02:12:46

Oh well, I was just specifically referring to the PB Vocoder and not just any other one. In my case it's just a somewhat stuck-in-the-old-ways situation with VSTs. I'm using some simpler but nevertheless good VST effects in my trusty dusty Adobe Audition 1.5, but the main work/production/whatever we may call it is done on hardware, so I don't feel a big enough need for a true DAW with all that VST stuff. Still, I enjoy all kinds of oldschool standalone proggies whenever I can't get something done on my various hardware synths; that's why this whole thing keeps on poppin' up about me not using VST technology to the fullest. Now I think we should stop discussing the matter in this topic, or make another one for it if a great enough need arises.

2019-01-24 04:36:42

Hi Philip.
This is quite a nice project. Keep up the good work.

2019-01-24 09:16:27

Oh my vocoder is nothing special by any means, it's about as standard as they come. I just wanted to see if I could write one from scratch over a weekend. Still, I'm pretty happy with the output myself.

Kind regards,

Philip Bennefall

2019-01-28 01:50:31 (edited by musicalman 2019-01-28 01:54:45)

I do like how intelligible the speech is with your vocoder. It's crisp and direct and to me at least it has some unique flavor, probably because it kinda has that springy sound I find odd, like the voice is going through a rubber tunnel or something. But many vocoders have that and I think a lot of people like that.

One of my favorite vocoders is the MDA Talk Box. It's a super old VST with limitations, but its clarity can almost not be beat. Its sound is admittedly digital and perhaps robotic/tinny, which some people might not like. It sounds like it's using some LPC techniques, which produce a totally different sound than a conventional band splitter. But I've always been looking for that sort of Daft Punk sound, and MDA Talk Box is the closest to it I have found.
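Since LPC came up: the heart of an LPC vocoder is fitting an all-pole spectral envelope to each frame, rather than splitting into fixed bands. Here is a minimal autocorrelation-method sketch using Levinson-Durbin (my own illustration of the general technique; I don't know MDA Talk Box's actual internals):

```python
import numpy as np

def lpc(frame, order=12):
    """Fit all-pole predictor coefficients a (with a[0] == 1) to a frame.

    The filter 1 / A(z) then carries the frame's spectral envelope; an LPC
    vocoder excites it with a carrier signal instead of the original
    prediction residual, giving that distinctive robotic clarity.
    """
    x = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] += k * a[i::-1]  # RHS is a copy, so in-place update is safe
        err *= 1.0 - k * k
    return a, err
```

For real speech you would do this per 10-30 ms frame; the resonances (formants) correspond to the roots of A(z), which is part of why the sound differs so much from a conventional band splitter.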


2019-01-28 06:55:15

yeah, MDA TalkBox is excellent. It does indeed use LPC.

I actually just released my vocoder as open source. Check out the topic in the Developers room if you're interested. There's a command line application which allows offline rendering of Wave files, with a bunch of settings that you can tweak.

Kind regards,

Philip Bennefall

2019-01-29 10:39:18

Philip, your vocoder sounds very good, on par with MDA TalkBox, but it has those extra settings.

Best regards
T-m

2019-01-30 03:21:11

FWIW, I downloaded the Vocshell program but can't make it work. Are there any requirements for the wav files you use with it? When I try to process a file, it just crashes with no error message from the program, and I'm not sure how to view crash details on Windows 7.