2021-05-27 18:24:47

Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks.
Enterprise-grade STT made refreshingly simple (seriously, see benchmarks). We provide quality comparable to Google's STT (and sometimes even better) and we are not Google.
As a bonus:
• No Kaldi;
• No compilation;
• No 20-step instructions;
Also we have published TTS models that satisfy the following criteria:
• One-line usage;
• A large library of voices;
• A fully end-to-end pipeline;
• Natural-sounding speech;
• No GPU or training required;
• Minimalism and lack of dependencies;
• Faster than real-time on one CPU thread (!!!);
• Support for 16kHz and 8kHz out of the box;
Speech-To-Text
All of the provided models are listed in the models.yml file. Any metadata and newer versions will be added there.
You can find the repo here:
https://github.com/snakers4/silero-models
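For reference, here is roughly what their one-line usage looks like (this follows the v1 README of snakers4/silero-models; the speaker names and the returned tuple may differ in later versions):

import torch

language = 'ru'
speaker = 'kseniya_16khz'  # speaker id from the v1 README; may have changed since
device = torch.device('cpu')
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(
    repo_or_dir='snakers4/silero-models',
    model='silero_tts',
    language=language,
    speaker=speaker)
model = model.to(device)  # gpu or cpu
audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)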

2021-05-27 18:25:54 (edited by jonikster 2021-05-27 18:26:07)

Why am I interested in this? These guys agreed to make several new synthesizers for Uzbek, Tajik, Persian and some other languages.
I tried to run it on Windows, and it was very, very difficult!
The question is: is it possible to create an NVDA or SAPI synthesizer from this?
Dear developers, please look at it and comment. I like the quality of the voices, but it can only be run from code!

2021-05-27 18:54:13

Maybe. It depends on whether they support finding word boundaries. If not, it might be doable but lower quality. However, "runs in realtime on a CPU" may not mean what you think, and it's probably incredibly terrible on a laptop in terms of battery life and system responsiveness. Expect very high latency. It's likely that it will be difficult to stream the output of the model, and it may take up to seconds to generate speech. If it takes 1 second to generate 2 seconds of speech, that's still "faster than realtime", but not good enough for your screen reader.

My Blog
Twitter: @ajhicks1992

2021-05-27 20:21:27

Here is a simple example. The only problem: audio is an array of audio data to play or write to a file, and I don't know how to do that yet.
Attention! This uses Russian text, but it is possible to use an English model with English text.

import os
import torch
from IPython.display import Audio, display  # for the display() call at the end (Jupyter only)
from utils import *  # utils.py from the silero-models repo; provides apply_tts

device = torch.device('cpu')
torch.set_grad_enabled(False)  # inference only, no gradients needed
symbols = '_~абвгдеёжзийклмнопрстуфхцчшщъыьэюя +.,!?…:;–'  # symbol set this model was trained with
local_file = 'model.jit'

if not os.path.isfile(local_file):
  torch.hub.download_url_to_file('https://models.silero.ai/models/tts/ru/v1_kseniya_16000.jit',
                                 local_file)

model = torch.jit.load('model.jit',
                       map_location=device)
model.eval()
example_text = 'Это просто пример!'
sample_rate = 16000
model = model.to(device)  # gpu or cpu

audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)

print(example_text)
display(Audio(audio[0], rate=sample_rate))

2021-05-27 20:38:09

Yeah, but the important thing is: how long does that script take to run?

My Blog
Twitter: @ajhicks1992

2021-05-27 20:52:53

@camlorn
Not fast, although the creator assures me that it is.
But most importantly, I would like to be able to create audio books using these synthesizers.
There are some problems; for example, it only reads one sentence of up to 140 characters at a time.
I would like to know about creating a SAPI 5 synthesizer from it, to create audio books.

2021-05-27 20:59:35

We are very interested in this, because for many years my friends from Uzbekistan have wanted to create a synthesizer for the Uzbek language, but no one could help. And here they asked only for a text corpus (or a book), which they split into a corpus themselves, plus audio recordings.
But it will be sad if my Uzbek friends cannot use it, as it's available only from code.

2021-05-27 21:01:45

You're probably not going to get what you want out of this.

Time your script. If it takes longer than 200 milliseconds, it won't work in a screen reader.

Most of these speech models have a character limit. You'd need to split the text into sentences first, but many sentences are longer than 140 characters.
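Something like this would be a starting point (a minimal sketch; the 140-character limit is taken from the post above, and the splitting is deliberately naive):

import re

MAX_CHARS = 140  # the per-call limit reported earlier in this thread

def chunk_text(text, max_chars=MAX_CHARS):
    """Split text into pieces no longer than max_chars, preferring
    sentence boundaries and falling back to word boundaries."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Sentence is too long: fall back to splitting on words.
        # (A single word longer than max_chars will still come out oversized.)
        piece = ''
        for word in sentence.split():
            candidate = word if not piece else piece + ' ' + word
            if len(candidate) > max_chars and piece:
                chunks.append(piece)
                piece = word
            else:
                piece = candidate
        if piece:
            chunks.append(piece)
    return chunks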

Machine learning text-to-speech systems are still kind of far away when it comes to running on your computer as part of SAPI or something.

My Blog
Twitter: @ajhicks1992

2021-05-27 21:08:12

@camlorn
But what if it's not for use in a screen reader, but for books? There are programs that use SAPI voices to create audio files from books.

2021-05-27 21:49:19

@9
If you want audiobooks you don't need SAPI and can just run the algorithm directly on the text files.

If the script takes 1 second per sentence, then it takes 1 second before you can start listening at all. If you time how long your script takes to run and tell us, we can tell you what you might be able to do with it. I'm not sure why you're not just timing it, given that you've already got a working script. Are you unable to run it?

My Blog
Twitter: @ajhicks1992

2021-05-27 21:57:16 (edited by jonikster 2021-05-27 22:08:04)

@camlorn
Right now I can't say how long it takes, because I still haven't found a way to write the audio to a file.
Why do we need SAPI?
Because there are convenient programs that create audio books with speech synthesis.

2021-05-27 22:33:29

You can write a convenient program to make audiobooks with speech synthesis in like 50 lines of Python, tops. SAPI will take 500 or more. In any case, the character limit means that you will have to do some clever stuff to even make it work, assuming that you can in the first place. A sketch of what I mean is below.
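Something like this (rough and untested; chunk_text is the kind of splitter I sketched earlier, and model, symbols, sample_rate, and device come from the script posted in this thread):

import wave

def make_audiobook(text, out_path, model, symbols, sample_rate, device):
    """Synthesize a long text chunk by chunk and write one WAV file."""
    chunks = chunk_text(text)  # split into pieces under the character limit
    with wave.open(out_path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:
            audio = apply_tts(texts=[chunk],
                              model=model,
                              sample_rate=sample_rate,
                              symbols=symbols,
                              device=device)
            # Scale float samples in [-1, 1] to 16-bit PCM and append.
            wf.writeframes((audio[0] * 32767).numpy().astype('int16').tobytes())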

We don't care how long it takes to write to a file. Time this line:

audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)
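For example, a minimal timing sketch with time.perf_counter (this assumes the rest of your script has already run, so all of these variables exist):

import time

start = time.perf_counter()
audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)
elapsed = time.perf_counter() - start
print(f'apply_tts took {elapsed:.2f} seconds')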
My Blog
Twitter: @ajhicks1992

2021-05-28 00:30:30 (edited by jonikster 2021-05-28 00:31:17)

So!
I did it for English!
It doesn't work quite correctly, since I didn't specify the string of symbols exactly right. Tomorrow the developer will answer me and it will be fixed.
Developers, can you help us turn this into a tool for audio books?
For your convenience, I've created an executable file. You can download it here:
https://dropmefiles.com/8OSCf
And venv for that you can download here:
https://dropmefiles.com/Q0ACM
And here is the code!
import os, time
import wave
import torch
import contextlib
from utils import *  # utils.py from the silero-models repo; provides apply_tts
from pygame import mixer
mixer.init(frequency=16000)  # match the model's 16 kHz output so playback isn't pitch-shifted
device = torch.device('cpu')
torch.set_grad_enabled(False)  # inference only, no gradients needed
symbols = '_~abcdefghijklmnopqrstuvwxyz +.,!?…:;–'  # symbol set for this English model
local_file = 'model.jit'
local_file = 'model.jit'

if not os.path.isfile(local_file):
  torch.hub.download_url_to_file('https://models.silero.ai/models/tts/en/v1_lj_16000.jit',
                                 local_file)

model = torch.jit.load('model.jit',
                       map_location=device)
model.eval()
sample_rate = 16000
example_text = input('Enter some text: ')
model = model.to(device)  # gpu or cpu

audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)

def write_wave(path, audio, sample_rate):
    """Writes a .wav file.
    Takes path, PCM audio data, and sample rate.
    """
    with contextlib.closing(wave.open(path, 'wb')) as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(audio)

# apply_tts returns one tensor of float samples per input text.
# Scale to 16-bit PCM before writing.
for i, _audio in enumerate(audio):
  write_wave(path=f'test_{i}.wav',
             audio=(_audio * 32767).numpy().astype('int16'),
             sample_rate=sample_rate)
s = mixer.Sound('test_0.wav')
s.play()
time.sleep(s.get_length() + 0.1)  # block until playback has finished

2021-05-28 07:23:42

@jonikster,
many of these deep learning models still require a lot of computation during inference (the computation during training is always heavier).
Now, about making audio versions of books: you need to present the text to the model in pieces no longer than the model's maximum sequence length, and a typical book sentence probably exceeds it, if it is a Tacotron or other RNN-based model.
If it were a purely convolutional model (like DCTTS), things would be a little different, both in terms of speed and in how sequences (which text and audio both are) are generated; I assume these models are not.

2021-05-29 16:50:09

I ran this script in both forms, both the executable and the source and, well, let's just say the experience won't be satisfactory at all, especially if it's going to be used with a screen reader.
In short, the whole thing took around 2 minutes or so. First, it just sat there, idle, complaining about a file that was found but isn't supposed to be there; that took around 30 seconds to go away, and I thought it had frozen.
Then it had to download a file, the model, I assume. That took around 20 seconds as well, though I think my connection could be the issue there.
The longest time, however, was speaking the thing itself: the opening phrase of the Fate of the Jedi series, "the darkness is eternal, all powerful, unchangeable."
That alone took over 40 seconds to run. Besides, the voice had intonation issues, like those from 15.ai, misplaced pauses and all. There are natural-sounding voices that don't require so much time to run, like Vocalizer and Acapela. If I want free, then I just take RHVoice.
Have I done something wrong to make it cause this?

2021-05-29 18:26:01

@15
No, probably not.  That was the point I was trying to make.  Jonikster has seen their "in realtime on a CPU" claim and has drawn the wrong conclusion from it.  This stuff is actually that slow.

You might be able to get a really fast version of it running if you have a compatible GPU and jump through a bunch of hoops to get whatever ml framework they're using to recognize it, and maybe that'd work for audiobooks in a reasonable amount of time.  Honestly 1 word every 5 seconds works for audiobooks if you don't mind running it for days, so it could still maybe work for that.
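For what it's worth, checking whether PyTorch (the framework these models run on) can see a GPU is just:

import torch

if torch.cuda.is_available():
    device = torch.device('cuda')  # run the model on the GPU
else:
    device = torch.device('cpu')   # fall back to the CPU
print(device)

Then the model.to(device) line in the scripts above would move the model accordingly.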

I see the point of this: OP speaks a language without good synths.  This is part of why I want to write one, because I think that properly applied and sacrificing characteristics that sighted people care about, it could be pretty automatic.  But you really can't just shove these models into a laptop.  The unspoken caveat about that speed thing is that they're measuring it on really beefy servers with more cores than my tower and a bunch of SSDs and stuff.

My Blog
Twitter: @ajhicks1992

2021-05-29 22:30:03

@bgt lover
RHVoice is the best option. But there is no documentation for it! I would gladly create a synthesizer for RHVoice...

2021-05-29 22:54:36

@17, there already is one

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-05-29 23:35:08

Wait, where the frickle do I find the documentation, please?

I am a hunter named Grunt. I didn't realize that until now.

2021-05-31 23:28:48

Is a Polish pretrained model available? I want to test it, but not in English.