2017-07-11 22:48:54 (edited by JLove 2017-07-11 23:12:21)

Hello all.  A friend and I have been working on a game for a while now.  I had to take some time to deal with some life issues, but honestly, a large part of the reason for the lengthy development time (especially over the last year to year and a half) is that we have run into a problem which we are at a loss to solve, and one which the few BGT programmers I know are unable to shed any light on.
My friend and I initially suspected that the bug resides in the networking code, although based on all testing done to date, packets are sent and received fine, and all parameters are passed and received correctly, so we are not certain of that.  I'll try to explain the issue as best I can.
There is only one random aspect to this game, and therefore, the random seed is passed initially to the client machine, like so:
Server.send_reliable(0, "r:" + random_get_state(), Player);
And then, when events are checked:
if(string_contains(Event.message, "r:", 1) > -1)
{
string[] position = string_split(Event.message, ":", true);
random_set_state(position[1]);
The only time that the random seed is used is when the ball bounces off of the net in the middle of the grid.  Tests show that the seed is passed correctly, so this is not the issue.  I have merely included this here for completeness.
Now to try to explain the issue.  For this example, we will use a 10 x 10 grid.  Player A hits the ball diagonally to player B.  The ball lands at coordinates 8,8 for player A, which would be reflected as 1,1 for player B.  (Obviously, the movement is mirrored for the opponent: if player A hits the ball to the right, then it moves to the left as it comes into the opponent's side.  So 9,9 for player A is equivalent to 0,0 for player B, and 8,8 is equivalent to 1,1.)  Eight to nine times out of 10, the ball will, indeed, land on 1,1 for player B.  However, the remaining times, the ball lands exactly one diagonal square off: player A shows the ball landing at 8,8, but instead of landing at 1,1 for player B, it lands at 0,0.  If the ball is struck straight ahead rather than diagonally, then the ball is exactly one square off in the direction it was struck.  I.e., if the ball is struck straight ahead from 9,0 by player A and lands at 9,8, which translates to 0,1 for player B, the ball lands at 0,0 instead of 0,1.
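For readers following along, the mirroring described above can be written down in a few lines. This is only a sketch in Python rather than BGT, and the names `GRID_SIZE` and `mirror` are made up for illustration; the game's real code is not shown in this thread.

```python
GRID_SIZE = 10  # assumption: the 10 x 10 example grid used in this post


def mirror(x, y, size=GRID_SIZE):
    """Translate a square as seen by one player into the opponent's view."""
    return (size - 1 - x, size - 1 - y)


# The three examples from the post.
print(mirror(8, 8))
print(mirror(9, 9))
print(mirror(9, 8))
```

Mirroring twice gives back the original square, which is a cheap sanity check for this kind of translation.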
Two things bother me about this problem.  First, it cannot be reliably reproduced.  It occurs seemingly at random, with no discernible pattern.  In one test, 25 balls were struck and 22 landed correctly; the three that did not were spaced out.  Second, and this is the main thing that bothers me, the way this is set up, coordinates are not passed back and forth.  In other words, I do not send a packet with coordinates on each iteration of the ball's movement.  Those would need to be sent reliably to ensure delivery, and initially this was tried, and it caused major, major problems.  The way it is done now is that instead of passing the coordinates, a packet is sent that indicates the shot that was made, with the appropriate parameters for strength, direction, height, speed, etc.  The receiving computer then calls the appropriate method independently, with the parameters that were received.  Extensive testing has shown that the parameters are passed and received correctly, that the appropriate methods are called correctly, and that all of the parameters used by both machines match.  What this means is that both computers are executing the exact same ball movement code, independent of any influence from the other machine, with the exact same parameters for shot strength, height, speed, etc.  If two computers are using the exact same code with the exact same values, then both should arrive at the exact same result.  If I am player A, and the ball on my machine lands at square 9,9, then the ball should land at 0,0 for player B, because the movement should match exactly.
Everything matches exactly, except for the final landing outcome, and as stated before, many times it does match correctly, but even on the times that it doesn't, all other parameters and code execution do still match.  I have no idea what might be causing this issue, since the problem is so random, and since it has been determined that everything else is functioning correctly, there should never be a discrepancy in movement, or where the ball lands.  It can't be a timer issue, since both machines are using the same code, with the same timer parameters.  It isn't a network latency issue, because while this might mean that my machine executes your shot and calls the code a second or two after you actually strike it, as long as the parameters are received by my machine, the ball should still land correctly, based on the code and the parameters for strength of shot, etc. that were received.  Does anyone have any thoughts at all as to what this issue might be?  I would really, really like to finish this project, but it seems pointless to continue until this has been ironed out, because where the ball lands is critical to the game's scoring system, and the difference of just one coordinate can be huge.  If my machine lands the ball within the grid, and considers my shot good, it might score points for me, but if the other machine lands one square off, and the shot lands outside of the grid and is considered a bad one, then the scoring could change for the opponent, and will not at all match.  Therefore, until this problem is resolved, further coding seems unwise at best, and a waste of time at worst.  I really hope that someone can help to point in the general direction of a solution.  I appreciate any feedback.  Thanks.

JLove

2017-07-11 23:36:39

Hi,

so, what you didn't explain here is how you're using randomness itself. You told us that you pass the RNG seed around, but the calculation depending on strength, speed and such stuff should be clear and without any randomness, shouldn't it?
If you're using randomness instead, I expect you call random() somewhere else in the code as well, e.g. in the sending part. The RNGs stay in sync only as long as they are called the same number of times with the same parameters. As soon as any of the parameters change, or one side of the game calls the RNG one more time than the other, they go out of sync and deliver totally different results.
I also don't understand the problems you're facing here. You said that you got problems delivering the coordinates to drop the ball at over the net. I can understand this, the internet can be some tricky medium and since you only got UDP at your disposal using BGT and not even TCP, this can get a bit tricky, since you won't be able to detect packet loss. But you now send strength and all that kind of stuff, which is even more, so the possibility to lose packets is even higher, and calculate the coordinates out of those values. So where exactly is the difference?
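To make the point about call counts concrete, here is a small sketch. It uses Python's `random` module instead of BGT's `random_get_state` and `random_set_state`, and the seed value is an arbitrary example, so treat it as an analogy rather than a description of BGT's generator.

```python
import random

host = random.Random(12345)      # 12345 is an arbitrary example seed
guest = random.Random()
guest.setstate(host.getstate())  # analogous to sending the state in the "r:" packet

# Same state and same number of calls on both sides: identical results.
host_rolls = [host.randint(0, 9) for _ in range(5)]
guest_rolls = [guest.randint(0, 9) for _ in range(5)]
in_sync = host_rolls == guest_rolls

# One extra call on one side only: the two generators no longer share a state.
host.random()
out_of_sync = host.getstate() != guest.getstate()

print(in_sync, out_of_sync)
```

Once the states differ, every later result can differ too, so a single stray call on either machine is enough to desynchronize everything that follows.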
I can just recommend you to let one side calculate all the coordinate stuff and transmit those over the net. This will remove your problem completely. If you had problems with that way, those problems were actually caused by the code and not by BGT or the internet.
If you can make it work the way it is now, you can even make it work with the coordinates themselves.
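A rough sketch of that suggestion, in Python rather than BGT: one side decides where the ball lands and sends only the result, and the other side mirrors it. The `"l:"` prefix, the function names, and the 10 x 10 grid size are assumptions made for this example (the prefix is modeled on the `"r:"` message shown earlier), not part of the actual game.

```python
GRID_SIZE = 10  # assumption: the 10 x 10 example grid


def encode_landing(x, y):
    """Build the message the deciding side would send once the ball lands."""
    return "l:{}:{}".format(x, y)


def decode_landing(message, size=GRID_SIZE):
    """Parse a landing message and mirror it into the receiver's view."""
    tag, x, y = message.split(":")
    if tag != "l":
        raise ValueError("not a landing message: " + message)
    return (size - 1 - int(x), size - 1 - int(y))


message = encode_landing(8, 8)
print(message, decode_landing(message))
```

With this approach only one packet per shot is needed for the landing itself, and the receiver cannot disagree about the final square.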
Best Regards.
Hijacker

2017-07-12 01:04:24

I was going to say it sounds like a time-related issue, but you did mention timers.
How do the timers work in your code? How are they used? If this game uses frames with a consistent framerate, the problem could be that one of the computers has a moment where it slows down for unrelated reasons, but this still wouldn't break a deterministic system. It sounds like there might be an extra frame somewhere, though. If you're not using a system that can be described with frames, this probably doesn't help.
I don't remember if you mentioned this already, but, when you have these bugs, and continue afterward, does it repeat, or do the peers just remain out of sync by that one glitchy step? If they get further apart from the first misstep onward, then something somewhere has gotten them out of sync, and you might need to reset the rng and send timestamps.
I remember reading a helpful article on dealing with multiplayer over a network, with attention to common synchronicity issues, but I haven't been able to find it recently. Not sure if any information there would help in this case, since it focuses more on issues more relevant to COD-style games. If I find anything useful, I'll post it.

Look over here!
"If you want utopia but reality gives you Lovecraft, you don't give up, you carve your utopia out of the corpses of dead gods."
MaxAngor wrote:
    George... Don't do that.

2017-07-12 05:43:07

Thanks to both of you for your feedback.  I will address them in order:

@Hijacker:  The randomness only comes into play when the ball actually bounces off of the net that is placed in the center of the grid.  All testing was done with shots that did not hit the net, and thus did not use the random seed at all.  In fact, just for thoroughness, because that was my first thought as well, that somehow the RNG was being triggered incorrectly, I disabled the random factor completely in the code.  I retested, and the problem still persisted.  The RNG is not the issue here.
As to the packet question, when this was first done, each time the ball moved, for each change of the X, Y, or Z axis, a packet was sent to the other machine with the updated coordinates.  This caused major problems.  The way it is done now, only one packet has to be sent for each ball strike, containing the type of shot, direction of shot, strength of shot, etc.  Then the machine receiving that packet calls the shot method with those parameters, which in turn calls the ball move method.  This means that firstly, fewer packets have to be sent this way, and secondly, each machine is definitely now using the exact same code to move the ball, and both machines are using the exact same strength, speed, direction, etc., parameters to do so.  Again, common sense says that any two computers executing the same exact code with the same exact values should arrive at the same exact result.
@CAE_JONES:  I have values of x, y, Z, and MH for ball movement.  X and Y are the values for the respective X and Y axes, Z is the height value, and MH is a value for maximum height, at which point the ball begins to fall.   I have timers that determine how long it takes the ball to rise and fall on the Z axis (height), and how fast it moves along X and Y.  I.e.:
class ball
{
timer X, Y, Z;
...
...
}
Then I initialize all of the timers each time before the ball actually moves, just to make sure that everything is resynchronized, like so:
void StartMove()
{
    // Restart all three movement timers right before the ball starts moving.
    TRestart(B.X);
    TRestart(B.Y);
    TRestart(B.Z);
    moving=true;
}
Then the ball actually moves.  The timer code that controls height looks like this:
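void move()
{
    // Rising: climb one unit each time the Z timer reaches speedZ.
    if(rising and z<MH and B.Z.elapsed>=speedZ)
    {
        z++;
        TRestart(B.Z);
    }
    // At maximum height: start falling and drop one unit right away.
    if(z>=MH)
    {
        rising=false;
        z--;
        TRestart(B.Z);
    }
    // Falling: drop one unit each time the Z timer reaches speedZ.
    if(!rising and z>0 and z<MH)
    {
        if(z<MH and B.Z.elapsed>=speedZ)
        {
            z--;
            TRestart(B.Z);
        }
    }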
The X and Y timers are handled slightly differently because of the incorporation of switch case, because it allows me to better be able to control variety of shots, etc.  Here's one example:
if(B.Y.elapsed>=speedY)
{
    switch(direction)
    {
        case northwest:
            // Y advances whenever its timer is due; X advances only if its own timer is due as well.
            y++;
            if(B.X.elapsed>=speedX)
            {
                x--;
                TRestart(B.X);
            }
            TRestart(B.Y);
            break;
...
Given that both machines are running the exact same code, same timer parameters, and same parameters for shot strength, direction, etc., I cannot see where the issue is, especially when it is impossible to replicate with consistency.  I might hit 20 shots that match exactly, and then have one that doesn't, then ten more that do, then three that don't, and so on.  For purposes of testing, I made sure that the height of each shot was exactly the same, that speed, strength, and direction were also exactly the same for each one, and some landed exactly correctly, and others did not.  They were always exactly one square diagonally off if the ball was struck diagonally, or one square off if the ball was struck straight along the Y axis.  Never were they any further apart.  The offset never changes.  Does that help clear up things?  Does it help you think of a solution, or at least a general idea of what might be causing the issue?  Any and all feedback is welcome.

JLove

2017-07-12 06:18:46

Does this behavior only occur for the peer receiving the packets, and never for the sender? If it's only the receiver, I think that rules out timer problems.


2017-07-12 10:25:21

@CAE_Jones: Yep, seems like this
Imagine the following, JLove: The sender reinitializes the timer and smashes the ball. All is fine here, because the time is just all synchronized.
The receiver first resynchronizes the timer and then has to wait for the packet, before smashing the ball. That means that the timer already has some time on it before the packet even arrives, which means that the timer already skipped some milliseconds before the ball starts to move, which causes it to stop earlier or do whatever you do here. Could this be the problem? Or do you really resynchronize the timer as soon as the packet arrives in the receiving code?
Best Regards.
Hijacker

2017-07-12 10:50:22 (edited by JLove 2017-07-12 10:55:14)

@CAE_JONES:  Yes, this only occurs for the receiver, not the sender.
@Hijacker:  The timers are re-initialized after the packet is received, because the packet contains the parameters for type of shot, speed, strength, direction, etc.  The receiving machine takes these parameters from the packet, then calls the NewShot method with the values that it just received in the packet.  Once this is done, execution of code is exactly the same on both machines.  The NewShot method calls the StartMove method, which re-initializes timers, and then the move method is called.  Therefore, the timers are re-initialized on the receiving machine just as they are on the sender's, right before the ball moves.  This means that timer synchronization should not be an issue.

2017-07-12 12:32:56

Hello.

There is a bug with the random in BGT. Maybe it's this bug in your case, if you use the random_set_state and random_get_state functions many times with many seeds. You can see more details on this topic: http://www.blastbay.com/forum/viewtopic.php?id=1762

2017-07-12 14:32:05

Have you already tried getting step-by-step debug information from the game while the ball is moving? For it to end up going too far, the problem most likely shows up earlier, so if you can examine all the variables while the sender and receiver are in motion, a discrepancy should show up somewhere.
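One way to compare two such logs mechanically is sketched below, in Python rather than BGT. It assumes each machine records one (x, y) entry per ball move and that the receiver's view is the mirror of the sender's on a 10 x 10 grid. The function names and the sample data are invented for illustration; they are not taken from the game.

```python
def mirror(square, size=10):
    """Translate a square into the opponent's view (10 x 10 grid assumed)."""
    x, y = square
    return (size - 1 - x, size - 1 - y)


def first_divergence(sender_moves, receiver_moves):
    """Return the index of the first move where the two logs disagree, or None."""
    for index, (sent, received) in enumerate(zip(sender_moves, receiver_moves)):
        if mirror(sent) != received:
            return index
    return None


# Hypothetical logs: the receiver ends up one diagonal square too far.
sender_moves = [(5, 5), (6, 6), (7, 7), (8, 8)]
receiver_moves = [(4, 4), (3, 3), (2, 2), (0, 0)]
print(first_divergence(sender_moves, receiver_moves))
```

Knowing the index of the first mismatch tells you which part of each log to read side by side.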


2017-07-12 20:25:21

@Pragma:  Nice to see you posting here again.  Please see my posts in the crazy party thread.  In re this issue, please refer to my above posts where I point out that I disabled the RNG aspect of my code for purposes of testing this.  That was my first test.  Issue still persists even when RNG is not applicable.
@CAE_JONES:  Yes, I did do this.  However, because the tests were some time ago, I am going to run tests again, then post a copy of the current log files that will show all values sent, received, and the changes that occur with each iteration, as soon as I can get access to a second PC for testing.

2017-07-15 00:48:32 (edited by JLove 2017-07-15 00:54:46)

Ok, after writing some additional debugging code and testing, I think I have found the issue, but I am not sure why it is occurring, or how to fix it.
I will post the relevant portions of log here.  I am the host in this test, my friend the receiver.  Notice that everything matches up beautifully at first:
My machine, first line:
StartMove method executed.  Movement Timers Reset.  Timer X is now 0, timer Y is now 0, and timer Z is now 0.  Ball is currently at coordinates -1, 8.  Ball height is currently 14.
Receiver's Machine:
StartMove method executed.  Movement Timers Reset.  Timer X is now 0, timer Y is now 0, and timer Z is now 0.  Ball is currently at coordinates 19, 12.  Ball height is currently 14.
The coordinates of 19,12 are correctly matched to mine of -1, 8, since they are mirrored.  Note here that both machines have reset all timers to 0.  Now, look at the next line:
Host, My Machine:
Ball movement execution has begun.  Timer X is currently 10, timer Y is currently 10, and timer Z is currently 10.  Ball is at coordinates -1, 8.  Ball height is currently 14.
Receiver, my friend's machine:
Ball movement execution has begun.  Timer X is currently 7, timer Y is currently 7, and timer Z is currently 7.  Ball is at coordinates 19, 12.  Ball height is currently 14.
At this moment, the coordinates are still exactly correct.  However, the timers are off by 3 milliseconds.  And if they stayed there, it probably wouldn't be a big deal.  But they don't seem to stay synchronous at that disparity.  Instead, the divide fluctuates, sometimes larger, sometimes smaller.  Take a little ways down for example, line 30.  My machine:
Ball movement execution has begun.  Timer X is currently 45, timer Y is currently 45, and timer Z is currently 22.  Ball is at coordinates 0, 9.  Ball height is currently 12.
My friend's machine:
Ball movement execution has begun.  Timer X is currently 55, timer Y is currently 55, and timer Z is currently 22.  Ball is at coordinates 18, 11.  Ball height is currently 12.
Again, coordinates are correct, and the height timer, timer z, on both machines, match, but the x and y timers are 10 milliseconds apart here.  I also spot where there is sometimes a larger jump in time from one iteration to the next on one machine, but not the other.  For example, take a look at these two back-to-back iterations for each of us.  First, my machine:
Ball movement execution has begun.  Timer X is currently 144, timer Y is currently 144, and timer Z is currently 121.  Ball is at coordinates 0, 9.  Ball height is currently 12.
My friend's machine, at that same time:
Ball movement execution has begun.  Timer X is currently 140, timer Y is currently 140, and timer Z is currently 107.  Ball is at coordinates 18, 11.  Ball height is currently 12.
Next iteration, my friend's machine:
Ball movement execution has begun.  Timer X is currently 151, timer Y is currently 151, and timer Z is currently 118.  Ball is at coordinates 18, 11.  Ball height is currently 12.
At this point, the ball will move for him, because the speed to do that is set at 150, and his X and Y timers have reached 151.  But look at the next iteration from my machine.  Remember, I haven't hit the 150 mark yet; I am still at 144.  Next iteration for me:
Ball movement execution has begun.  Timer X is currently 157, timer Y is currently 157, and timer Z is currently 134.  Ball is at coordinates 0, 9.  Ball height is currently 12.
So this is what happens.  My friend's machine:
Ball has moved along the Y axis.  Timer X is currently 151, timer Y is currently 151, and timer Z is currently 118.  Ball is now at coordinates 18,10,12.
Ball has moved along the x axis.  Timer X is currently 151, timer Y is currently 151, and timer Z is currently 118.  Ball is now at coordinates 17,10,12.
Timer X has been reset, and is now 0.
Timer Y has been reset, and is now 0.
My machine looks like this:
Ball has moved along the Y axis.  Timer X is currently 157, timer Y is currently 157, and timer Z is currently 134.  Ball is now at coordinates 0,10,12.
Ball has moved along the x axis.  Timer X is currently 157, timer Y is currently 157, and timer Z is currently 134.  Ball is now at coordinates 1,10,12.
Timer X has been reset, and is now 0.
Timer Y has been reset, and is now 0.
So, even though I passed 150, I didn't actually move until 6 milliseconds after he did.  There are also places where the same sort of issue affects the height.  His timer will reach the point to decrement before mine will, or vice versa.  For example, mine decremented once when the timer hit 180, his when the timer hit 183.  Not a huge margin.  However, I think that what is happening is that there are times when there is just enough asynchronicity to cause one of us to move the ball that one extra square in whatever given direction it's moving.  Just to illustrate, this was the final outcome of that shot.  My machine:
The ball has landed at coordinates 12,18.
His machine:
The ball has landed at coordinates 5,1.
That is absolutely incorrect.  The ball should have landed at 6,2 for his machine, since that is equivalent to 12,18 on mine.  Instead, it landed on 5,1, which is exactly one diagonal square off.
I would think that since all timers are getting reset by both machines at the outset, as you see above, and since at the time that each movement actually occurs everything is reset as well, this should not be a problem.  Any thoughts on how to fix this?  Thanks.  All feedback welcome.
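The effect suspected in this post can be reproduced with a tiny simulation. The sketch below is Python rather than BGT and makes strong simplifying assumptions: each machine checks its timer at a perfectly regular interval (10 ms on one, 16 ms on the other), the step speed is 150 ms as in the log, and the shot lasts a fixed 900 ms. The function name and all numbers other than 150 are invented for illustration.

```python
def count_steps(poll_times_ms, step_ms):
    """Count ball steps when the timer is restarted on every step."""
    steps = 0
    last_restart = 0
    for now in poll_times_ms:
        if now - last_restart >= step_ms:
            steps += 1
            last_restart = now  # like TRestart: any overshoot past step_ms is thrown away
    return steps


FLIGHT_MS = 900  # assumed duration of the shot
STEP_MS = 150    # the movement speed mentioned in the log

host_steps = count_steps(range(10, FLIGHT_MS + 1, 10), STEP_MS)
guest_steps = count_steps(range(16, FLIGHT_MS + 1, 16), STEP_MS)
print(host_steps, guest_steps)
```

Under these assumptions the first machine takes 6 steps and the second only 5, even though both run identical code with identical parameters. Every step starts a little late by a machine-dependent amount, and restarting the timer discards that lateness instead of making up for it.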

JLove

2017-07-15 18:04:16

Hey @JLove.
I remember chatting with you about this a year or so ago.
Here's the deal: you will never get timers to synchronize between multiple machines. There are just way too many factors: what else the machine is doing at the time, network latency, etc.
If your physics are reliant on the value of a timer, then whomever is the host is going to need to send their timer values to the guest so that the guest can force() their timer to be correct.
Nevertheless you'll have to be able to accommodate some margin of error here.

Official server host for vgstorm.com and developer of the Manamon 2 netplay server.
PSA: sending unsolicited PMs or emails to people you don't know asking them to buy you stuff is disrespectful. You'll just be ignored, so don't waste your time.

2017-07-15 19:13:22 (edited by JLove 2017-07-15 19:20:47)

@trajectory,
Interesting.  So could someone explain that to me?  How can network latency affect the timers in this case?  I don't send a packet with coordinates or timer values.  I merely send a packet which contains values for strength, speed, etc., and the machine takes those and then calls the NewShot function with those parameters, which then executes the code with reference to timers being reset, just as they are on my machine.  So wouldn't network latency simply make it so that his machine might make the shot after mine by a couple of seconds, but the ball still moves the same way?  In other words, the receiving of the packet has nothing to do with the timers actually being reset, and the shot never even registers on his machine until the NewShot function is called by his machine, and that doesn't occur until the packet is received.  So his computer might execute the code later than mine by a few seconds, but the code still gets executed.  So the timers may not reset at the same time, but they will still reset before his ball starts to move, and since all parameters match related to strength and such, wouldn't those numbers make the ball land where mine did, since my timers were reset prior to movement, just as his were, and since I am using the same data that he is related to strength, speed and the like?  The ball may not land at the same exact time as mine, but it should still land at the same square, even if it is after mine, since it uses the same code with the same data as mine does.  Please explain this, if for no other reason than for my own edification and knowledge.
Second question for anyone out there:  What would be the best way to alter the code in this case?  Is there any possible way to alter the movement so that timers are irrelevant and can be excluded completely?  Perhaps that would be the best way, since then I wouldn't have to worry about those values or the extra packets to force his timers to reflect the values of mine at all.  Thanks.

JLove

2017-07-15 22:50:22

The asynchronicity doesn't have to be due to the network; the computers could have different specs, different background software running, or even the same background processes at different points in their execution. If something causes the system to lag, the game is affected.
I'm not sure if something tick-based would make a difference, in this case. I'd say it'd be easier to use velocity vectors, but that might not work so well with the int-based style here.
I think that using a global timer might help...
Something you might try is adjusting the time when the movement takes place based on how long the previous move took. For example, if you want to force each movement to take as close to 150ms as possible, then you'd want the movement after the 157ms frame to last 143ms, and the 151ms frame is followed by 149ms. I don't see this solving the problem entirely, but if applied to all 3 dimensions, it might help significantly.
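Here is a small comparison of the two approaches, written in Python rather than BGT. It assumes a machine that checks its timer every 16 ms, a 150 ms step, and a 3000 ms observation window; the function names and those numbers are illustrative only. Carrying the overshoot forward is done here by advancing a deadline by exactly one step each time, which is the same thing as shortening the next interval by however late the previous step was.

```python
def steps_with_restart(poll_times_ms, step_ms):
    """Restart the timer on every step, discarding any overshoot."""
    steps = 0
    last_restart = 0
    for now in poll_times_ms:
        if now - last_restart >= step_ms:
            steps += 1
            last_restart = now
    return steps


def steps_with_carry(poll_times_ms, step_ms):
    """Keep a running deadline so a late step shortens the next interval."""
    steps = 0
    deadline = step_ms
    for now in poll_times_ms:
        if now >= deadline:
            steps += 1
            deadline += step_ms  # the next step is due one full step after the previous deadline
    return steps


WINDOW_MS = 3000
STEP_MS = 150
polls = range(16, WINDOW_MS + 1, 16)  # assumed: this machine polls every 16 ms

ideal = WINDOW_MS // STEP_MS
restart_count = steps_with_restart(polls, STEP_MS)
carry_count = steps_with_carry(polls, STEP_MS)
print(ideal, restart_count, carry_count)
```

With these numbers the ideal count is 20, the restart approach reaches 18, and the carry approach reaches 19. That matches the expectation above: the correction shrinks the error but does not remove it, because the last step can still fall just outside the window.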


2017-07-15 23:03:40

Would using velocity vectors remove the need for timers for movement?  I'd be willing to restructure the code if that's what it takes, and it means that the move timers can be removed and the problem gets solved.

2017-07-16 02:16:45

Vectors would work if you're using floats or doubles, but for an int-based board, it'd be a little tougher. It'd also still be vulnerable to lag unless you made each step count as the same amount of time regardless of how long it really takes, but that should work with what you have as well.
Basically, you want both clients to behave as though the exact same amount of time passes for both of them. This is where mainstream games use frames, and design for a specific framerate. So if an individual frame is supposed to be 10ms (100fps), even if the system lags for unexpected or uncontrollable reasons, the game would still behave as though 10ms pass for every frame. You would need to replace the timers in the ball class with numeric variables, and update those every frame. You'd use a global timer to keep the frames from passing too quickly.
I made a clock class for this sort of thing, but I'm not sure if I can link it since I'm on my phone.

class clock {
    timer time;      // real time since the last frame
    uint frame=0;    // frames counted so far (incremented by tick)
    double delay=5;  // target frame length in milliseconds

    clock() {}
    clock(double fps) {
        delay=1000.0/fps;
        time.restart();
    }

    // Waits (at least 1 ms), then counts one frame.
    void tick() {
        double elapsed=time.elapsed-delay+1;
        time.restart();
        time.resume(); // Probably redundant, but I forget.
        wait((elapsed>=delay) ? 1 : delay-elapsed);
        frame++;
    }

    // Optionally, if you want to use shorter waits:
    // returns true if a frame has passed, false otherwise.
    bool update() {
        if (time.elapsed>=delay) {
            time.restart();
            time.resume();
            return true;
        }
        return false;
    }
}
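To show how the frame idea removes the dependence on real time, here is a sketch in Python rather than BGT. The ball counts fixed 10 ms ticks instead of reading timers, and real elapsed time only decides how many ticks to run. The class, the function, and every number below are assumptions made for illustration. Giving each axis its own ticks-per-step value is one possible way to vary horizontal and forward speed per shot.

```python
TICK_MS = 10  # every tick counts as exactly 10 ms, however long it really took


class Ball:
    def __init__(self, ticks_per_step_x, ticks_per_step_y, flight_ticks):
        self.x = 0
        self.y = 0
        self.ticks = 0
        self.ticks_per_step_x = ticks_per_step_x  # higher value = slower along X
        self.ticks_per_step_y = ticks_per_step_y  # higher value = slower along Y
        self.flight_ticks = flight_ticks          # the shot ends after this many ticks
        self.landed = False

    def tick(self):
        """Advance the ball by exactly one logical tick."""
        self.ticks += 1
        if self.ticks % self.ticks_per_step_x == 0:
            self.x += 1
        if self.ticks % self.ticks_per_step_y == 0:
            self.y += 1
        if self.ticks >= self.flight_ticks:
            self.landed = True


def play_shot(frame_durations_ms):
    """Run one shot, given how long each real frame took on this machine."""
    ball = Ball(ticks_per_step_x=30, ticks_per_step_y=15, flight_ticks=90)
    backlog_ms = 0
    for real_ms in frame_durations_ms:
        backlog_ms += real_ms
        # Run as many whole ticks as the elapsed real time allows.
        while backlog_ms >= TICK_MS and not ball.landed:
            ball.tick()
            backlog_ms -= TICK_MS
        if ball.landed:
            break
    return (ball.x, ball.y)


smooth_machine = play_shot([10] * 200)
laggy_machine = play_shot([16, 7, 45, 3] * 100)
print(smooth_machine, laggy_machine)
```

Both runs land on the same square, (3, 6), because the landing square depends only on the number of ticks, never on how long a frame actually took. A lagging machine finishes the shot later in real time but cannot end up on a different square.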

2017-07-16 02:45:17 (edited by JLove 2017-07-16 03:14:45)

Okay, a few questions:
1.  Wait, won't I still have the same timer issue?  I mean, isn't the guest machine's timer going to be out of sync with the host machine, just like the issue that I have with the move timers in place?
2.  I am assuming that the doubles I would need are for the FPS and the vectors, correct?
3.  You say replace the timers in the ball class with numeric values.  Do you mean something like:
double X = 175?  If that is correct, then how exactly do I use those numbers in relation to the FPS?
4.  Is the speed of the ball then stuck at 1000 milliseconds, no matter what, if I make the delay 1000.0/FPS?  I'd like to be able to vary the speed of movement, both on the x and y axes, based on different shots that can be chosen.  In other words, sometimes the ball might move more horizontally than forward, or vice versa, depending on the type of shot chosen.  Does that make sense?
5.  I've never done the vector thing before.  How exactly does it work?
Thanks.

JLove