2024-02-23 15:10:53 (edited by maragaz 2024-02-23 15:40:33)

Hello. I am trying to add a realtime voice chat feature to my game. My code works when only two people voice chat at the same time, but once a third joins, the audio gets choppy (not laggy, so there is no latency problem, the audio is just choppy). It's not due to network latency or anything like that, since I tried it on localhost: I opened two clients at the same time, turned on voice chat in both, and the same thing happened. Also, it's an error in the client, not the server, because if, say, 3 people are voice chatting but they are in different maps so they can't hear each other, the choppiness doesn't occur. So it happens when one client receives audio data from multiple players at the same time, and only the audio in that client gets choppy. How can I fix this issue?
Here's a bit more info about my setup.
I use Opus as the encoder.
The frame size is 960 samples, so 20 ms, at a 48000 Hz sample rate.
None of my operations are blocking; e.g. the issue is not that the playback of one audio stream blocks the other, they play at the same time. My network operations are also non-blocking: I use enet, which has non-blocking operation built in. The only blocking operation is the Opus encoding and decoding phase, but that is required, because the audio has to be decoded before it can be played. So Opus blocks the thread while decoding, but once playback starts, nothing is blocked.
I also buffer audio for each player, so I don't play the data right when it comes in. Each player has a list which stores audio temporarily; when its length reaches a certain amount, I play the buffered audio and clear the list.
As for how I make them play at the same time: each player in the game has an OpenAL buffer and source. When the server broadcasts the data to clients, it includes the player's name along with the audio data in the packet, so clients know which player the audio is coming from. Clients then put the data into that player's audio buffer (not the OpenAL buffer). When the audio buffer's size reaches a certain limit, I clear the audio buffer, copy it to the OpenAL buffer, then rewind the player's OpenAL source and play it, so it plays the new data without blocking. It can play more than one player's data because each logged-in player has a separate OpenAL buffer and source in the client.
I was also using PyAudio for playback before, but I switched to OpenAL because PyAudio's stream read and write functions were blocking, and callbacks were the only way to make them non-blocking; since those were overly complex, I switched to OpenAL. Since OpenAL is a low-level library, it gives me more control over how I play the audio. I still use PyAudio for recording, though; OpenAL is only for playback.
I was using opuslib as the Opus wrapper before, and since it's old I thought maybe that was the issue, so I switched to the pyogg library, which is another Opus wrapper for Python, but that didn't fix it either.
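Here is a simplified sketch of the per-player buffering I described, in case it makes the setup clearer (this is not my exact code; the playback callback stands in for my OpenAL buffer/source handling):

    from collections import defaultdict

    FRAMES_PER_CHUNK = 10  # play once 10 decoded 20 ms frames (200 ms) have piled up

    class PlayerAudioBuffer:
        """Collects decoded PCM for one player and hands it off once there is enough."""

        def __init__(self, play_chunk):
            self.frames = []              # temporary list of decoded PCM frames
            self.play_chunk = play_chunk  # called with ~200 ms of PCM; in my real code this
                                          # fills the player's OpenAL buffer and plays the source

        def push(self, pcm_frame):
            self.frames.append(pcm_frame)
            if len(self.frames) >= FRAMES_PER_CHUNK:
                chunk = b"".join(self.frames)
                self.frames.clear()
                self.play_chunk(chunk)

    # one buffer per logged-in player, created on demand
    buffers = defaultdict(lambda: PlayerAudioBuffer(play_chunk=lambda pcm: None))

    def on_voice_packet(player_name, decoded_pcm):
        buffers[player_name].push(decoded_pcm)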
Any help would be appreciated. Thanks!

Everything about people can change, including their appearance, but the character inside a person never changes.
Regards...

Bilal

2024-02-23 18:35:49

I would have trouble writing a voice chat library from scratch.  If you can at all get away with it, find a library that handles the problem.  I've looked a few times and haven't found much other than WebRTC implementations but it's been years so maybe there's something.  Getting as far as realizing you need a jitter buffer is at least a start.

You are creating an opus decoder for every incoming stream and not trying to share it, right?  That's the most immediate thing that comes to mind. If it's not that I'd have to try to debug your code and I don't particularly have time for it.

Note that you can handle voice chat in Python if you do it in a separate process on the server, but sync or async, the GIL may get in the way if you try to share a process with your game logic, so you may need to consider having a separate connection from the client.  How far you can get depends on whether the libraries you're using are smart about releasing the GIL.

And also make sure everything C++ is compiled in release mode.

In OpenAL there are functions to play buffers and functions to enqueue buffers on a source to play after the current one is finished.  It's been too long for me to remember names without checking references.  Make sure you're using the enqueuing ones.  Also, at least part of your delay needs to "live" in OpenAL, for example you need to always have 100ms or so at minimum queued there.  OpenAL is actually a horrible API for this, you should use the callback ones.  Why?  Because you  have to make sure that you have enough queued in OpenAL that there's still some left by the time the rest of your app gets around to pushing more in.  If not, then even though you have the buffer in Python, the OpenAL part can underflow and go silent for a brief second before you get the next buffer to it and, if that happens, it'll sound like a click.
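Something in this shape, roughly; I'm not going to guess the exact binding calls from memory, so the three callables here are thin wrappers you'd write around alSourceQueueBuffers, alSourceUnqueueBuffers and AL_BUFFERS_QUEUED for one source:

    MIN_QUEUED_MS = 100  # never let the source drop below this much queued audio
    FRAME_MS = 20

    def pump_source(pending_frames, queue_frame, unqueue_processed, queued_count):
        """Call this regularly for each source.

        pending_frames: list of decoded PCM frames waiting to be queued
        queue_frame(pcm): wraps alBufferData + alSourceQueueBuffers for this source
        unqueue_processed(): reclaims buffers the source has finished with (alSourceUnqueueBuffers)
        queued_count(): how many buffers are currently on the source's queue (AL_BUFFERS_QUEUED)
        """
        unqueue_processed()
        while pending_frames and queued_count() * FRAME_MS < MIN_QUEUED_MS:
            queue_frame(pending_frames.pop(0))

If that loop ever finds pending_frames empty while the queue is under the minimum, that's your underrun, and the click you'll hear.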

The typical architecture--which doesn't work in Python--is to have a thread managing a bunch of ringbuffers that it fills as fast as it can, and then to drain those ringbuffers in a non-blocking fashion on the audio thread. Pretty much anything that doesn't look at least a bit like that is going to require you to crank the latencies up, and you can't do that architecture in Python, so get cranking. Even if it looks like you have it working, the GIL gets in the way of it. Synthesizing low-latency audio in Python, even if using C libraries to do most of it, just isn't really doable. Before someone raises the topic of NVDA, NVDA only gets away with it because they're doing a highly specific thing that can't work for almost anything else. But this post is long enough so I'm not going to try to explain it right now.
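For illustration only (and again, the drain side is supposed to live on the audio callback thread, which is exactly the part Python won't really give you), the shape of one of those ringbuffers is:

    import collections
    import threading

    SILENCE = b"\x00" * 1920  # one 20 ms mono 16-bit frame at 48 kHz

    class RingBuffer:
        def __init__(self, max_frames=50):
            self.frames = collections.deque(maxlen=max_frames)
            self.lock = threading.Lock()

        def write(self, frame):
            # filler thread: decode and push as fast as it can
            with self.lock:
                self.frames.append(frame)

        def read(self):
            # audio thread: never blocks, hands back silence on underrun
            with self.lock:
                return self.frames.popleft() if self.frames else SILENCE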

My Blog
Twitter: @ajhicks1992

2024-02-24 15:59:34

@camlorn no, I am not making a new Opus decoder for each piece of data that comes in; a single decoder is created when the program starts.
As for callbacks, I find OpenAL's method easier, which is why I use it. I also tried callbacks before, but it was the same: it would work fine for 2 players, but once a client received more than one audio stream at the same time, it would get choppy.
And I don't think it's an error on the server side; as I said, it's client side. So even if I were to run it on a separate server and make the client connect to that, it wouldn't be fixed.
I didn't understand what you mean by "make sure everything C++ is compiled in release mode". What do you mean by C++ specifically? My code is 100% Python; it's not using C++.
And yes, I use the enqueuing ones, alSourceQueueBuffers and alSourceUnqueueBuffers. In the stream-receiving part, if the buffer reaches its limit and the source is not playing, I unqueue the finished buffer from the source, update the OpenAL buffer with the new data, requeue the buffer on the source and play it. I realized that I can't update the OpenAL buffer's data without unqueuing it from the source, which is why I unqueue it before updating it. My buffer size is 200 ms.
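Roughly what that looks like (simplified from my code, and written against PyOpenAL's ctypes-style al module; adjust it for whatever binding you use):

    import ctypes
    from openal import al  # ctypes-style OpenAL bindings; adjust for your wrapper

    def refill_and_play(source, buffer, pcm_bytes, sample_rate=48000):
        # only touch the buffer when the source is not playing it
        state = ctypes.c_int(0)
        al.alGetSourcei(source, al.AL_SOURCE_STATE, ctypes.byref(state))
        if state.value == al.AL_PLAYING:
            return False
        # the buffer can't be refilled while it's attached, so unqueue it first
        processed = ctypes.c_int(0)
        al.alGetSourcei(source, al.AL_BUFFERS_PROCESSED, ctypes.byref(processed))
        if processed.value:
            stale = ctypes.c_uint(0)
            al.alSourceUnqueueBuffers(source, 1, ctypes.byref(stale))
        al.alBufferData(buffer, al.AL_FORMAT_MONO16, pcm_bytes,
                        len(pcm_bytes), sample_rate)
        al.alSourceQueueBuffers(source, 1, ctypes.byref(ctypes.c_uint(buffer)))
        al.alSourcePlay(source)
        return True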

Everything about people can change, including their appearance, but the character inside a person never changes.
Regards...

Bilal

2024-02-24 18:04:38

You can't make one opus decoder for everything; you need one Opus decoder per remote player on each client (and one encoder per client for your own stream).  Unless the opus library says it handles this, the problem you're facing is that the decoder is trying to be used for more than one stream, and that's not how Opus works.  If you want to avoid having to decode every player's audio on every client, your server will have to decode, do the mixing, and re-encode.  But without that, if you have 3 players you'll need 2 decoders per client so that each player's client can decode the other two.  You only need one encoder per client since that's one stream.
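In code terms it's just a dict of decoders, something like this (a sketch; I'm writing it against opuslib since that's one of the wrappers you mentioned, but the point is the same whichever one you use):

    import opuslib

    SAMPLE_RATE = 48000
    CHANNELS = 1
    FRAME_SIZE = 960  # 20 ms at 48 kHz, matching your encoder

    decoders = {}  # one Opus decoder per remote player, keyed by player name

    def decode_voice_packet(player_name, opus_packet):
        # Opus decoders carry per-stream state, so pushing two players' packets
        # through one decoder produces exactly the kind of garbage you're hearing.
        dec = decoders.get(player_name)
        if dec is None:
            dec = decoders[player_name] = opuslib.Decoder(SAMPLE_RATE, CHANNELS)
        return dec.decode(opus_packet, FRAME_SIZE)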

You are not writing C++ code, but your Python extensions are using C++ code on your behalf.  If you've somehow ended up with any of that in debug mode, that won't help things.  But based on what you just said, I think the issue is misunderstanding how Opus works.

My Blog
Twitter: @ajhicks1992

2024-02-24 20:27:30

But even if I have different decoders for each player, since a decoder blocks the main thread while decoding, they won't be able to decode data concurrently. I think you're saying I have to decode multiple players' data at the same time, but since the Opus library blocks the thread no matter how many decoders I have, that's not possible.

Everything about people can change, including their appearance, but the character inside a person never changes.
Regards...

Bilal

2024-02-24 20:47:37

The decoder itself doesn't do I/O.  You must collect the network packets and then do all of the decoding at once, once you have enough, so that you can just loop over decoders or whatever; Enet makes this kinda tricky and I can't comment further because I don't use or even particularly trust Enet.  Yes, it blocks a thread.  Whether it's the main thread, or whether it decodes while holding the GIL, depends.  You can do it on any thread you want, and if the C extension is smart enough to release the GIL then you don't have to block the main thread.  I don't know if it is or not, you'd have to go read it; it may be written in one of 3 different things, they all manage the GIL differently, have fun.
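The loop itself is nothing special: collect what has arrived, then decode it all in one pass (whether that pass lives on the main thread or a worker thread is between you and the GIL). Sketch, again assuming an opuslib-style decoder per player:

    import queue
    import threading

    packets = queue.Queue()  # (player_name, opus_packet) tuples pushed from your enet receive code

    def decode_pass(decoders, on_pcm):
        # Drain everything that has arrived so far and decode each packet with
        # that player's own decoder. on_pcm(name, pcm) feeds your playback path.
        while True:
            try:
                name, pkt = packets.get_nowait()
            except queue.Empty:
                return
            on_pcm(name, decoders[name].decode(pkt, 960))

    def decode_loop(decoders, on_pcm, stop):
        # Optional worker thread; it only overlaps with your game loop if the
        # opus wrapper releases the GIL while it decodes.
        while not stop.is_set():
            decode_pass(decoders, on_pcm)
            stop.wait(0.005)

    # threading.Thread(target=decode_loop, args=(decoders, feed_playback, threading.Event()),
    #                  daemon=True).start()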

But it doesn't matter what you like here because the Opus format is one audio stream per, well, stream.  And if you have 4 players that you want to hear, you have to decode 4 streams.  Which is 4 decoders.  Even if the library somehow handles this it's the same amount of work in terms of the CPU.  You can't library your way around it, because Opus etc. are standardized formats and that's just how they work.  You can't send packet 3 to a decoder and then send another player's packet 3 because the decoder is looking for the first player's packet 4.

As I said this is hard in the first place and even harder to do in Python.  If you're dead set on making your own voice chat thing then you need to learn proper networking and threading and probably use a different language.  But, in terms of the immediate question, the answer is almost certainly more decoders.

I don't know what Sam Tupy did with STW, maybe he found a cheap way that works or doesn't have enough players for it to be a problem.  But the standard solution to voice chat is actually NAT hole-punching and stun/turn servers deployed near the users.  Team Talk doesn't do stun/turn but it definitely does NAT hole-punching.  If you don't know what that is, you want to look it up.  The client-server architecture you want to use is going to be really latent even if you used Rust or C++, and after you get over 500ms or so people start talking over each other all the time.

If you want to learn more on this topic, look up the WebRTC RFCs.  Some of it, such as format negotiation and data channels, doesn't apply to you, but the rest of it is "here is how you write a very robust voice/video chat system", and it's used by Chrome and Firefox for things like Google Meet.  Unfortunately it's written for people who already know what they're doing; it's not at all like a tutorial.  But the RFCs do contain justifications for the decisions as well as telling you what decisions they made, and there are various C++ WebRTC implementations around.  Maybe you can bind one of those.

But I mean...I was trying to answer the original question.  But the real answer to the larger question is hello I am a staff software engineer who does C/Rust as the dayjob and Synthizer as the side project and getting something robust working would take me weeks if I worked on it every day.  So eh.  Good luck I suppose.  You can get as far as a few people who all happen to be close to the server you're running it on, but scaling it up to say a couple hundred users all over the world? That's tough.  Better uses for my time.  IMO better uses for yours.  "We suggest Discord or Team Talk" is one line in a help file.  That's how I'd handle it, at least until I had something popular and stable without much else to do on the project.

My Blog
Twitter: @ajhicks1992