2024 Note on Registrations

keithwipf1 · 2020-02-06 18:24:43

keithwipf1
Human antivirus
Offline

From: Canada
Registered: 2016-04-03
Posts: 852
User Karma: 41

Philip recommended getting the raw audio samples when SAPI speaks and trimming the silence from the start, to help it speak faster.
However, I encountered a problem doing that.
What I have tried is this:
# a script that tries to output SAPI and trim silence.
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
voice=EnsureDispatch("SAPI.SPVoice")
stream=EnsureDispatch("SAPI.SPMemoryStream")
stream.Format.Type=34 # SAFT44kHz16BitMono = 34

voice.AudioOutputStream=stream

while True:
    text=input("Enter text to speak")
    voice.Speak(text)
    bytereader=BytesIO()

    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
    wavefile.writeframes(stream.GetData().tobytes())
    data=bytereader.getvalue()
    while data[0]==0: data=data[2:]
    winsound.PlaySound(data, winsound.SND_MEMORY)

There is just one obvious problem. There is no way to empty the stream! Every time you press enter, the previous text is also repeated!
Does anyone know how I can empty it?
If I call stream.SetData(0), it helps, but if I speak a long string and then a short string, parts of the long string can still be heard after the short string has finished.
Thanks for any help!

philip_bennefall · 2020-02-06 20:52:43

philip_bennefall
red potter
Offline

From: Sweden
Registered: 2007-06-07
Posts: 748
User Karma: 121

I had the same issue, and the only solution I found was to create a new stream each time. It seems to be a relatively light operation though, at least it has not caused me any trouble thus far.

BTW, beware of SAPI's horrible resampler. I ended up having to force the stream to the sample rate of the voice and doing my own upsampling to 44.1.
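The "own upsampling" step mentioned above could look roughly like the following. This is only a minimal sketch, not Philip's actual code: it assumes 16-bit mono PCM on a little-endian machine (as on Windows), and the function name `upsample_linear` and its parameters are hypothetical. Linear interpolation is the simplest approach; a production resampler would normally also apply a low-pass filter.

```python
from array import array


def upsample_linear(pcm: bytes, src_rate: int, dst_rate: int = 44100) -> bytes:
    """Resample 16-bit mono PCM by interpolating linearly between samples.

    Assumes native byte order matches the PCM data (little-endian on Windows).
    """
    src = array("h")
    src.frombytes(pcm)
    if not src or src_rate == dst_rate:
        return pcm

    out_len = len(src) * dst_rate // src_rate
    out = array("h", bytes(2 * out_len))  # pre-allocate zeroed output samples
    last = len(src) - 1

    for i in range(out_len):
        # Position of output sample i, measured in source samples.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, last)  # clamp at the final source sample
        frac = pos - left
        out[i] = round(src[left] + (src[right] - src[left]) * frac)

    return out.tobytes()
```

For example, audio captured at a voice's native 22050 Hz (an arbitrary illustrative rate) would be converted with `upsample_linear(raw, 22050)` before being written out as 44.1 kHz.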

Also, keep in mind that trimming 0's will only get rid of absolute silence. You'll want to go for a slightly higher threshold in order to make this work for most voices, as they tend to output a noise floor.
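A threshold-based trim along those lines might be sketched as follows. It assumes raw 16-bit mono PCM (the bytes from the stream, before any WAV header is added) on a little-endian machine; the name `trim_leading_silence` and the default threshold of 200 are hypothetical and would need tuning per voice.

```python
from array import array


def trim_leading_silence(pcm: bytes, threshold: int = 200) -> bytes:
    """Drop leading 16-bit samples whose absolute value is at or below threshold.

    Works on whole samples rather than single bytes, so a quiet noise floor
    is removed as well as exact zeros.
    """
    samples = array("h")
    samples.frombytes(pcm)
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            return samples[index:].tobytes()
    return b""  # everything was below the threshold
```

Comparing whole signed samples avoids the pitfall of checking one byte at a time, where a small negative sample such as -1 (bytes `FF FF`) would look loud.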

Kind regards,

Philip Bennefall

keithwipf1 · 2020-02-06 22:33:34

Hi and thanks.
It seems to work rather nicely.
BTW, how are you outputting the audio? DirectX or something?
Your screen reader wrapper manages all this stuff, right?

ambro86 · 2020-02-06 22:44:10

ambro86
Swamp machine
Offline

From: Italy
Registered: 2011-01-02
Posts: 1,117
User Karma: 98

Hi keithwipf1, can you post the script you've modified? Were you able to exclude even sounds slightly above 0?
I have modified your script in this way and it works.

# a script that tries to output SAPI and trim silence.
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
while True:
    voice=EnsureDispatch("SAPI.SPVoice")
    stream=EnsureDispatch("SAPI.SPMemoryStream")
    stream.Format.Type=34 # SAFT44kHz16BitMono = 34
    voice.AudioOutputStream=stream
    text=input("Enter text to speak")
    voice.Speak(text)
    bytereader=BytesIO()
    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
    wavefile.writeframes(stream.GetData().tobytes())
    data=bytereader.getvalue()
    while data[0]==0: data=data[2:]
    winsound.PlaySound(data, winsound.SND_MEMORY)

philip_bennefall · 2020-02-06 22:56:04

BGT uses DirectSound. In my new engine, I am using WASAPI on Windows, CoreAudio on Mac and so on.

Congratulations on getting this to work!

Kind regards,

Philip Bennefall

keithwipf1 · 2020-02-07 17:20:36

I could be wrong on this, but I think there is a SAPI event, EndStream, which reports the length of the raw audio data that was written to the stream.
If so, and I reset the stream to 0 every time I speak, I would be able to read the correct amount of data from the stream and avoid unnecessary memory allocation.
I don't know how to handle events, though. In my genpy cache there is a file, _ISpeechVoiceEvents.py, which seems to allow me to set handlers for voice events, but I don't know which object to create to do that.
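One way to hook voice events with pywin32 is `win32com.client.WithEvents`, which attaches a handler class to an existing COM object and calls methods named `On<EventName>`. The following is a sketch under assumptions, not a confirmed solution from this thread: the names `VoiceEvents` and `speak_and_wait` are hypothetical, the loop assumes events are delivered while messages are pumped during an asynchronous `Speak`, and whether `StreamPosition` really equals the number of bytes written is the guess made above, not something verified here.

```python
import time


class VoiceEvents:
    """Handler class; pywin32 dispatches the EndStream event to OnEndStream."""

    def __init__(self):
        self.stream_ends = []

    def OnEndStream(self, StreamNumber, StreamPosition):
        # Record what SAPI reports so the caller can inspect it afterwards.
        self.stream_ends.append((StreamNumber, StreamPosition))


def speak_and_wait(text, timeout=10.0):
    """Speak into a memory stream and wait for EndStream (Windows only)."""
    # Imported here so the handler class above can be used without pywin32.
    import pythoncom
    import win32com.client
    from win32com.client.gencache import EnsureDispatch

    voice = EnsureDispatch("SAPI.SPVoice")
    stream = EnsureDispatch("SAPI.SPMemoryStream")
    stream.Format.Type = 34  # SAFT44kHz16BitMono = 34
    voice.AudioOutputStream = stream

    # Attach the handler to the existing voice object.
    handler = win32com.client.WithEvents(voice, VoiceEvents)

    voice.Speak(text, 1)  # 1 = SVSFlagsAsync, so this call returns immediately

    # Pump COM messages until the event arrives or the timeout expires.
    deadline = time.monotonic() + timeout
    while not handler.stream_ends and time.monotonic() < deadline:
        pythoncom.PumpWaitingMessages()
        time.sleep(0.01)

    return handler.stream_ends, bytes(stream.GetData())
```

Calling `speak_and_wait("hello")` on Windows would return the recorded `(StreamNumber, StreamPosition)` pairs together with the captured audio, which makes it possible to compare the reported position against the actual data length.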

ambro86 · 2020-02-07 18:45:22

Hi keithwipf1, I tried your script as I modified it. The SAPI voice plays, but it has the same latency as SAPI through accessible_output2. I have now tried using SAPI in a program written with pygame, and I notice no difference between accessible_output2 and the modified SAPI voice. Do you know the reason?
I also tried a slightly higher exclusion threshold, i.e. I set it to exclude sounds less than 1; however, the latency remains the same.
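One possible contributor to the unchanged latency, offered as an observation rather than something established in this thread: in the scripts above, the trimming loop runs on the bytes returned by `bytereader.getvalue()`, which begin with the WAV header (`RIFF`), not with audio samples. Since the first byte is never 0, the loop exits immediately. The sketch below demonstrates this with synthetic data and trims the raw samples before wrapping them; `pcm_to_wav` is a hypothetical helper and the test signal is made up.

```python
import wave
from array import array
from io import BytesIO


def pcm_to_wav(pcm: bytes, rate: int = 44100) -> bytes:
    """Wrap raw 16-bit mono PCM in an in-memory WAV container."""
    buffer = BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Half a second of digital silence followed by a short non-zero burst.
pcm = bytes(2 * 22050) + array("h", [1000, -1000] * 100).tobytes()
wav_bytes = pcm_to_wav(pcm)

# The loop from the thread, applied to the finished WAV bytes.
data = wav_bytes
while data[0] == 0:
    data = data[2:]
untouched = data == wav_bytes  # the header's first byte is "R", so nothing was trimmed

# Trimming the raw samples first, then adding the header, removes the silence.
samples = array("h")
samples.frombytes(pcm)
start = next((i for i, s in enumerate(samples) if s != 0), len(samples))
trimmed_wav = pcm_to_wav(samples[start:].tobytes())

print(wav_bytes[:4], untouched, len(wav_bytes), len(trimmed_wav))
```

In the SAPI scripts this would mean trimming `stream.GetData().tobytes()` before passing it to `writeframes`, rather than trimming the result of `getvalue()`.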

keithwipf1 · 2020-02-07 20:46:05

Try this:
from win32com.client.gencache import EnsureDispatch
import winsound
from wave import Wave_write
from io import BytesIO
voice=EnsureDispatch("SAPI.SPVoice")

while True:
    text=input("Enter text to speak")
    stream=EnsureDispatch("SAPI.SPMemoryStream")
    stream.Format.Type=34 # SAFT44kHz16BitMono = 34
    voice.AudioOutputStream=stream
    voice.Speak(text)
    bytereader=BytesIO()

    wavefile=Wave_write(bytereader)
    wavefile.setnchannels(1)
    wavefile.setsampwidth(2)
    wavefile.setframerate(44100)
    wavefile.writeframes(stream.GetData().tobytes())
    data=bytereader.getvalue()
    while data[0]==0: data=data[2:]
    winsound.PlaySound(data, winsound.SND_MEMORY)
    stream.SetData(0)

Just note, it will block until it's finished speaking since it's a demo. Also, I'm just excluding 0s for now, still need to look at a noise gate.
Also note that Microsoft Anna may not be that laggy. You'd have to try a different voice in order to get a speedup I guess.

philip_bennefall · 2020-02-07 22:32:52

Most of the Microsoft voices have some lag, actually. This includes the Windows 10 voices.

As for the event, I haven't investigated that. I might look into it when I revisit my SAPI backend at some point. However, while I don't like the idea of allocating a new stream instance each time, it has not caused any noticeable lag in my implementation.

Kind regards,

Philip Bennefall

Python how to get the raw audio data from SAPI?

Posts: 9

#1 Topic by keithwipf1 2020-02-06 18:24:43

#2 Reply by philip_bennefall 2020-02-06 20:52:43

#3 Reply by keithwipf1 2020-02-06 22:33:34

#4 Reply by ambro86 2020-02-06 22:44:10

#5 Reply by philip_bennefall 2020-02-06 22:56:04

#6 Reply by keithwipf1 2020-02-07 17:20:36

#7 Reply by ambro86 2020-02-07 18:45:22 (edited by ambro86 2020-02-07 18:46:11)

#8 Reply by keithwipf1 2020-02-07 20:46:05

#9 Reply by philip_bennefall 2020-02-07 22:32:52