2020-02-06 18:24:43

Philip recommended getting the raw audio samples when SAPI speaks and trimming the silence from the start, to help it speak faster.
However, I encountered a problem doing that.
What I have tried is this:
# a script that tries to output SAPI and trim silence.
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
voice=EnsureDispatch("SAPI.SPVoice")
stream=EnsureDispatch("SAPI.SPMemoryStream")
stream.Format.Type=34 #    SAFT44kHz16BitMono = 34

voice.AudioOutputStream=stream


while True:
    text=input("Enter text to speak")
    voice.Speak(text)
    bytereader=BytesIO()

    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
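    # Trim leading zero samples from the raw 16-bit frames; the finished WAV bytes start with the "RIFF" header, never with 0.
    frames=stream.GetData().tobytes()
    start=0
    while start+1<len(frames) and frames[start]==0 and frames[start+1]==0: start+=2
    wavefile.writeframes(frames[start:])
    data=bytereader.getvalue()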
    winsound.PlaySound(data, winsound.SND_MEMORY)


There is just one obvious problem. There is no way to empty the stream! Every time you press enter, the previous text is also repeated!
Does anyone know how I can empty it?
If I call stream.SetData(0), it helps, but if I speak a long string and then a short one, parts of the long string can still be heard after the short string has finished.
Thanks for any help!

2020-02-06 20:52:43

I had the same issue, and the only solution I found was to create a new stream each time. It seems to be a relatively light operation, though; at least it has not caused me any trouble thus far.

BTW, beware of SAPI's horrible resampler. I ended up having to force the stream to the sample rate of the voice and doing my own upsampling to 44.1.
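
As a rough illustration of doing the upsampling yourself (not Philip's actual resampler, which isn't shown in this thread), here is a minimal sketch that assumes 16-bit mono PCM captured at the voice's native rate and uses plain linear interpolation. The helper name `upsample_linear` is made up for this example, and a production resampler would normally use a proper low-pass filter.

```python
from array import array


def upsample_linear(pcm: bytes, src_rate: int, dst_rate: int = 44100) -> bytes:
    """Resample 16-bit mono PCM by linear interpolation (hypothetical helper)."""
    # array("h") uses native byte order; this matches WAV data on
    # little-endian machines, which covers the Windows systems SAPI runs on.
    src = array("h")
    src.frombytes(pcm[: len(pcm) // 2 * 2])  # ignore a trailing odd byte
    if len(src) < 2 or src_rate == dst_rate:
        return src.tobytes()
    out_len = len(src) * dst_rate // src_rate
    out = array("h", [0]) * out_len
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(src) - 1
    for i in range(out_len):
        pos = i * step
        j = int(pos)
        if j >= last:
            out[i] = src[last]  # hold the final sample past the end
        else:
            frac = pos - j
            out[i] = int(round(src[j] + (src[j + 1] - src[j]) * frac))
    return out.tobytes()
```

For example, `upsample_linear(frames, 22050)` doubles the number of samples, filling each gap with the midpoint of its two neighbours.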

Also, keep in mind that trimming 0's will only get rid of absolute silence. You'll want to go for a slightly higher threshold in order to make this work for most voices, as they tend to output a noise floor.
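
A minimal sketch of the threshold idea, assuming 16-bit mono PCM as in the scripts in this thread. The function name and the default threshold of 200 are arbitrary choices for illustration; the right value depends on the voice's noise floor.

```python
from array import array


def trim_leading_silence(pcm: bytes, threshold: int = 200) -> bytes:
    """Drop leading 16-bit samples whose magnitude is at or below threshold."""
    samples = array("h")  # signed 16-bit, native byte order
    samples.frombytes(pcm[: len(pcm) // 2 * 2])  # ignore a trailing odd byte
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            return samples[i:].tobytes()
    return b""  # everything was below the threshold
```

This should be applied to the raw frames from the stream, before they are wrapped in a WAV header. With `threshold=0` it behaves like trimming exact zeros.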

Kind regards,

Philip Bennefall

2020-02-06 22:33:34

Hi and thanks.
It seems to work rather nicely.
BTW, how are you outputting the audio? DirectX or something?
Your screen reader wrapper manages all this stuff, right?

2020-02-06 22:44:10

Hi keithwipf1, can you post the script you've modified? Were you able to exclude even sounds slightly above 0?
I have modified your script in this way and it works.

# a script that tries to output SAPI and trim silence.
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
while True:
    voice=EnsureDispatch("SAPI.SPVoice")
    stream=EnsureDispatch("SAPI.SPMemoryStream")
    stream.Format.Type=34 #    SAFT44kHz16BitMono = 34
    voice.AudioOutputStream=stream
    text=input("Enter text to speak")
    voice.Speak(text)
    bytereader=BytesIO()
    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
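    # Trim leading zero samples from the raw 16-bit frames; the finished WAV bytes start with the "RIFF" header, never with 0.
    frames=stream.GetData().tobytes()
    start=0
    while start+1<len(frames) and frames[start]==0 and frames[start+1]==0: start+=2
    wavefile.writeframes(frames[start:])
    data=bytereader.getvalue()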
    winsound.PlaySound(data, winsound.SND_MEMORY)

2020-02-06 22:56:04

BGT uses DirectSound. In my new engine, I am using WASAPI on Windows, CoreAudio on Mac, and so on.

Congratulations on getting this to work!

Kind regards,

Philip Bennefall

2020-02-07 17:20:36

I could be wrong on this, but I think there is a SAPI event, EndStream, which reports the length of the raw audio data that was written to the stream.
If so, and if I reset the stream to 0 every time I speak, I would be able to read the correct amount of data from the stream and avoid unnecessary memory allocation.
I don't know how to handle events, though. In my genpy cache there is a file, _ISpeechVoiceEvents.py, which seems to allow me to set handlers for voice events, but I don't know which object to create to do that.
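
One possible way to hook the event, as an untested sketch: pywin32's `win32com.client.WithEvents` attaches a handler class to an existing COM object and calls methods named `On` plus the event name. The class and function names here are made up, and the sketch assumes that `StreamPosition` is the byte offset at which the audio ended and that the event is delivered by the time messages are pumped after a synchronous `Speak`. Neither has been confirmed in this thread.

```python
class EndStreamRecorder:
    """Event sink for ISpeechVoiceEvents (hypothetical name)."""

    last_position = None  # filled in when EndStream fires

    def OnEndStream(self, StreamNumber, StreamPosition):
        # Assumption: StreamPosition is the amount of audio written, in bytes.
        self.last_position = StreamPosition


def speak_and_get_length(text):
    # Imported here so the class above can be used without pywin32 installed.
    import pythoncom
    from win32com.client import WithEvents
    from win32com.client.gencache import EnsureDispatch

    voice = EnsureDispatch("SAPI.SpVoice")
    stream = EnsureDispatch("SAPI.SpMemoryStream")
    stream.Format.Type = 34  # SAFT44kHz16BitMono
    voice.AudioOutputStream = stream
    events = WithEvents(voice, EndStreamRecorder)  # wire the sink to the voice
    voice.Speak(text)
    pythoncom.PumpWaitingMessages()  # let any queued events be delivered
    return events.last_position, stream.GetData()
```

If `last_position` comes back as `None`, the event was not delivered before the function returned, and a longer message loop would be needed.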

2020-02-07 18:45:22 (edited by ambro86 2020-02-07 18:46:11)

Hi keithwipf1, I tried your script with my modifications. The SAPI voice plays, but it has the same latency as SAPI through accessible_output2. I have now tried using SAPI in a program written with pygame, and there is no noticeable difference between accessible_output2 and the modified SAPI voice. Do you know the reason?
I also tried a slightly higher threshold, i.e. I set it to exclude samples below 1, but the latency remains the same.
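
One thing worth ruling out when trimming seems to have no effect: the bytes you get back from the `BytesIO` after the `wave` module has written to it begin with a WAV header, not with audio. A loop such as `while data[0]==0` run on those bytes stops immediately, whatever threshold is used. The small check below (with a made-up helper name, `wav_bytes`) shows this using only the standard library.

```python
import wave
from io import BytesIO


def wav_bytes(frames: bytes, rate: int = 44100) -> bytes:
    """Wrap raw 16-bit mono frames in an in-memory WAV file."""
    buf = BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(frames)
    return buf.getvalue()


silent = wav_bytes(b"\x00\x00" * 100)  # 100 samples of pure silence
print(silent[:4])  # b'RIFF'
```

Even for pure silence the first byte is `R`, so the silence has to be trimmed from the raw frames before they are passed to `writeframes`.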

2020-02-07 20:46:05

Try this:
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
voice=EnsureDispatch("SAPI.SPVoice")

while True:
    text=input("Enter text to speak")
    stream=EnsureDispatch("SAPI.SPMemoryStream")
    stream.Format.Type=34 #    SAFT44kHz16BitMono = 34
    voice.AudioOutputStream=stream
    voice.Speak(text)
    bytereader=BytesIO()

    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
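    # Trim leading zero samples from the raw 16-bit frames; the finished WAV bytes start with the "RIFF" header, never with 0.
    frames=stream.GetData().tobytes()
    start=0
    while start+1<len(frames) and frames[start]==0 and frames[start+1]==0: start+=2
    wavefile.writeframes(frames[start:])
    data=bytereader.getvalue()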
    winsound.PlaySound(data, winsound.SND_MEMORY)
    stream.SetData(0)



Just note, it will block until it's finished speaking, since it's a demo. Also, I'm just excluding 0s for now; I still need to look at a noise gate.
Also note that Microsoft Anna may not be that laggy. You'd have to try a different voice in order to get a speedup I guess.

2020-02-07 22:32:52

Most of the Microsoft voices have some lag, actually. This includes the Windows 10 voices.

As for the event, I haven't investigated that. I might look into it when I revisit my SAPI backend at some point. However, while I don't like the idea of allocating a new stream instance each time, it has not caused any noticeable lag in my implementation.

Kind regards,

Philip Bennefall