RTS devlog #8: How to beat lag

Official Construct Team Post
Ashley
  • 23 Dec, 2022

It's time to deal with the bane of online multiplayer games: lag. Previously Command & Construct just displayed raw network updates - as soon as a message arrived it would update units accordingly. However with my low-bandwidth design on a (simulated) really terrible network, that ends up looking like this:

In other words, unplayably awful. The challenge? Make the client show the game as smoothly and accurately as possible, even under network conditions that bad.

I've been coding professionally for over a decade now, and writing code to keep the game running smoothly with an unpredictable network was still very tough to get right. But I'm here to tell you how it all works and show you the code I ended up with!

Simulating latency

At the transport layer, the Internet is chaos. Messages may not arrive at all. They can arrive in the wrong order. Every message can have a different latency. Things like reliable ordered delivery are built on top of this in software using queues and retransmission to create order from the chaos.

There are three main measurements that affect the network quality:

  1. Latency: how long a message takes to travel one way (from client to server, or vice versa). Multiplayer gamers are probably familiar with ping time, but note that may refer to the round-trip time, i.e. for a message to travel both to the destination and for a response to be received back; here I am using latency to refer to the one way time.
  2. Packet delay variation (aka PDV or jitter): how much variation there is on the latency for different messages. For example if one message takes 150ms and the next takes 250ms, that implies a PDV of 100ms.
  3. Packet loss: the percentage of messages that simply go missing and never arrive. For example if this is 5% then 5 out of every 100 messages will never be delivered.

I've noted previously how the single-player mode uses the same server architecture as a multiplayer game, as it is simpler and helps with testing. Therefore I can test how things work on a poor network by running single player mode and simulating latency, PDV and packet loss, by artificially delaying (or dropping) messages. This is far easier to develop with than actually trying to find a poor quality network and repeatedly run real multiplayer games!

Construct's Multiplayer plugin already has a latency simulation feature, but single player games don't use the multiplayer feature, so I wrote some custom latency simulation code for testing. It's all in latencySimulation.js where you can set ENABLE_LATENCY_SIMULATION to true to turn on simulated latency for testing, and there are constants for the latency, PDV and packet loss to simulate.

In short, the packet loss is a random chance that a message is dropped, and the latency and PDV are used to create an artificial delay both upon sending and receiving messages. However as with much of multiplayer coding, there are subtleties to this that must be taken care of:

  • Reliable messages subject to packet loss must still arrive - I more or less guessed a 3x multiplier to the latency to simulate retransmission, representing random extra delays to some messages.
  • Ordered messages must not become unordered due to PDV - e.g. if one message takes 400ms to send and the next takes 200ms, the second will overtake the first, but must not be treated as received until the prior message arrives to preserve the ordering guarantee.
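To make those two rules concrete, here is a minimal sketch of a simulated one-way channel. It is not the code in latencySimulation.js: the SimulatedChannel class, its option names and the injectable random function are all assumptions made for illustration, and it uses explicit timestamps rather than real timers so the behaviour is easy to follow.

```javascript
// Hypothetical sketch of a simulated one-way channel (not the project's real code).
// All times are in milliseconds and are passed in explicitly.
class SimulatedChannel {
	constructor({ latency = 200, pdv = 200, packetLoss = 0.2, random = Math.random } = {}) {
		this.latency = latency;       // base one-way latency
		this.pdv = pdv;               // maximum extra random delay
		this.packetLoss = packetLoss; // chance (0-1) that a message is lost
		this.random = random;
		this.pending = [];            // messages in flight, in send order
		this.lastOrderedArrival = 0;  // latest arrival time given to an ordered message
	}

	send(message, now, { reliable = false, ordered = false } = {}) {
		let delay = this.latency + this.random() * this.pdv;
		if (this.random() < this.packetLoss) {
			if (!reliable) return; // an unreliable message is simply dropped
			delay *= 3;            // guessed multiplier standing in for retransmission
		}
		let arrival = now + delay;
		if (ordered) {
			// An ordered message must not be delivered before an earlier one.
			arrival = Math.max(arrival, this.lastOrderedArrival);
			this.lastOrderedArrival = arrival;
		}
		this.pending.push({ message, arrival });
	}

	// Remove and return the messages that have arrived by time `now`.
	receive(now) {
		const arrived = this.pending.filter(p => p.arrival <= now);
		this.pending = this.pending.filter(p => p.arrival > now);
		arrived.sort((a, b) => a.arrival - b.arrival); // stable, so ties keep send order
		return arrived.map(p => p.message);
	}
}
```

With the defaults above, an ordered message that draws a 400ms delay holds back a following message that only drew 200ms, so both are delivered together at the later time.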

That sorted, I then simulated a really awful network: a random 200-400ms latency (base latency of 200ms with added 200ms PDV), and 20% packet loss. This produced the appalling result shown in the previous video. My thinking is almost all real-world networks should be better than this, so if I can make the game work at least OK-ish under these conditions, it should be fine on most real networks.

Network timelines

How do we solve this? Let's talk about the theory. The overall principle is the following:

  1. Synchronise the time on both the client and server.
  2. The server timestamps every message it sends.
  3. The client can then store a timeline of updates, ordered by the timestamp.
  4. The client then follows the timeline, but with an extra added delay. This means:
    • The client has a short amount of "buffer time" for network events to arrive before it uses them.
    • The client can interpolate between the previous and the next upcoming value, so it can smoothly apply changes.

Here's a visualisation of the timeline for a single value, such as a turret's offset angle, to demonstrate the principle. If you want to properly understand how this system works, it is well worth spending a while examining this, as it's the key to the entire system of handling lag.

Note the following about how this system works.

  • The client has a good estimate of what the real server time is. It then goes back in time by the latency plus a bit of extra delay, and displays the game at that time.
  • The server timestamps all messages it sends with the current server time. Values received over the network (shown in blue) are placed on the timeline according to the server time in the message. This ensures they are correctly ordered and are applied on schedule regardless of variation in network timings (such as PDV).
  • Note that PDV, or retransmission for lost messages, can still cause messages to arrive late (behind the current time the client is showing). However the fact the client adds an extra delay on top of the latency means that moderate PDV just means messages fall in to the extra buffer time, rather than arriving late.
  • In the above diagram perhaps one of the updates is missing, as there is a bit of a gap. This could be due to packet loss or retransmission causing it to be too delayed to be used. However the client can see the previous value and the next upcoming value, so it can interpolate between them. This will likely cover up the fact a message went missing.

It's also interesting to note that it's not really latency that messes up the client representation of the game. Even if you have a fairly high latency, as long as there is low packet loss and low PDV, then basically everything is perfectly reliable. The only downside is the delay you see the game on. That's important for things like first-person shooters where reaction time is important, but not so much for games like this one. What really messes up the client representation is high packet loss and high PDV. This means messages keep arriving late, behind the current time the client is showing. Late messages mean the client either stops updating things or guesses where they ended up, and then later has to correct it. If it's really bad then things will start jumping all over the place. So aside from the delay caused by latency, packet loss and PDV are more important aspects of the connection quality for a smooth representation on the client.

With my chosen simulated latency of 200-400ms and 20% packet loss, there will definitely be late messages. So that makes sure I have to write code to deal with it!

Implementing timelines

The implementation is fairly complex, so I'll just point out the relevant sections of code. Also much of the code may look relatively straightforward, but that's after quite a lot of testing, rearrangement and rewriting - as I mentioned this was tough to get right, and it's the type of thing that is full of corner cases and subtleties that are easy to miss.

Clock synchronisation

The first task is to get the client to measure its estimated latency. This and other network timing related tasks are handled by PingManager.

Latency is measured by sending a "ping" message every 2 seconds. The server immediately sends a "pong" message back with the current server time. Then the client can work out:

  • The estimated latency, based on the time for the "pong" to come back, but divided by 2 to make it the one-way time. (There's not actually any guarantee that the latency is the same in both directions, but we have to make a best guess.)
  • The current server time, based on the time in the "pong" message, plus the latency. This is used to calculate the difference between the server time and the client time. In theory that is a constant value, and it also means the client can calculate the server time at any instant based on its own time.
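The arithmetic for those two estimates is short enough to sketch directly. The function and parameter names below are made up for illustration and are not taken from PingManager; all times are in milliseconds.

```javascript
// Hypothetical helper, not the real PingManager code.
// pingSentTime and pongReceivedTime are readings of the client clock;
// serverTimeInPong is the server clock value carried in the "pong" message.
function measurePing(pingSentTime, pongReceivedTime, serverTimeInPong) {
	const roundTrip = pongReceivedTime - pingSentTime;
	const latency = roundTrip / 2;                      // best guess at the one-way time
	const serverTimeNow = serverTimeInPong + latency;   // the pong took about `latency` to arrive
	const timeDiff = serverTimeNow - pongReceivedTime;  // server clock minus client clock
	return { latency, timeDiff };
}

// Once the difference is known, the server time can be estimated at any instant.
function estimateServerTime(clientTime, timeDiff) {
	return clientTime + timeDiff;
}
```

For example, a ping sent at client time 1000 with the pong arriving at 1300 gives a round trip of 300ms and an estimated one-way latency of 150ms.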

These measurements will all have some variation, so the client keeps the last few latency values and averages them. It also smooths out changes in the calculated time difference, adjusting it at a rate of 1% (10ms per second), so even if the measured time difference varies, the smoothed time difference should end up hovering around the real time difference with no sudden jumps. Similarly the client delay (the latency plus the extra delay time) is smoothed out, as the latency measurements can change over time and we don't want to see any jumps. This actually means it runs the game ever so slightly in fast forward or slow motion to compensate as the calculated delay changes!

This means the client now has a good idea of the server time, latency, and the client delay, and none of them change suddenly. The most important time value is called the simulation time, which is the green arrow in the diagram above. This is the time the client should be representing the game. It's calculated as the estimated current server time, minus latency, minus an extra delay (somewhat arbitrarily set at 80ms at the moment).
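Here is a rough sketch of how the smoothing and the simulation time calculation could fit together. The SmoothedClock class is hypothetical; the 80ms extra delay and the 1% rate come from the description above, but keeping 5 latency samples and smoothing the client delay at the same 1% rate are my assumptions for the example.

```javascript
const EXTRA_DELAY = 80;      // ms, as described in the text
const MAX_DRIFT_RATE = 0.01; // 1%: at most 10ms of adjustment per second
const MAX_SAMPLES = 5;       // "the last few" latency values; 5 is a guess

// Move `current` towards `target`, changing by at most `maxStep`.
function stepTowards(current, target, maxStep) {
	const diff = target - current;
	if (Math.abs(diff) <= maxStep) return target;
	return current + Math.sign(diff) * maxStep;
}

// Hypothetical class; all times are in milliseconds.
class SmoothedClock {
	constructor(timeDiff, latency) {
		this.latencySamples = [latency];
		this.timeDiff = timeDiff;                 // smoothed server-minus-client offset
		this.targetTimeDiff = timeDiff;           // most recently measured offset
		this.clientDelay = latency + EXTRA_DELAY; // smoothed latency plus extra delay
	}

	addMeasurement(latency, timeDiff) {
		this.latencySamples.push(latency);
		if (this.latencySamples.length > MAX_SAMPLES) this.latencySamples.shift();
		this.targetTimeDiff = timeDiff;
	}

	get averageLatency() {
		const sum = this.latencySamples.reduce((a, b) => a + b, 0);
		return sum / this.latencySamples.length;
	}

	// dtMs is the client time elapsed since the last tick.
	tick(dtMs) {
		const maxStep = dtMs * MAX_DRIFT_RATE;
		this.timeDiff = stepTowards(this.timeDiff, this.targetTimeDiff, maxStep);
		this.clientDelay = stepTowards(this.clientDelay, this.averageLatency + EXTRA_DELAY, maxStep);
	}

	// Estimated server time, minus latency, minus the extra delay.
	simulationTime(clientTime) {
		return clientTime + this.timeDiff - this.clientDelay;
	}
}
```

Because both smoothed values only move by a small step each tick, the simulation time never jumps; it just advances slightly faster or slower than real time until it has caught up.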

Value timelines

There's a ValueTimeline base class that represents a sequence of timestamped updates much like in the diagram above. There are two derived classes representing variants of the basic timeline:

  • InterpolatedValueTimeline is able to interpolate between values, such as smoothly rotating between turret offset angle updates from the network.
  • SteppedValueTimeline is used for one-off events, such as network events like "projectile fired", and also the position updates (since if you remember, in the bandwidth design those are only sent every 2 seconds, so are more like events than continuous values). This type of timeline doesn't interpolate, it just returns the values when the client time reaches them.

A good example of the usage of timelines is the ClientTurret, which is a simple example as at the moment it only has one interpolated value for its offset angle. When a value is received from the network, it calls OnNetworkUpdateOffsetAngle() and inserts the value to the timeline. The client also "ticks" units every frame with the current simulation time, and in Tick() the turret looks up its current offset angle in its timeline. It also deletes old entries so they don't accumulate and waste memory.
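To illustrate the idea, here is a stripped-down interpolated timeline. It is a sketch rather than the real InterpolatedValueTimeline: the class and method names are invented, and it interpolates plain numbers linearly (a real angle timeline would also need to handle wrapping around a full turn).

```javascript
// Hypothetical minimal timeline of { time, value } entries, kept sorted by time.
class SimpleInterpolatedTimeline {
	constructor() {
		this.entries = [];
	}

	// Insert in timestamp order, so the order of arrival doesn't matter.
	add(time, value) {
		let i = this.entries.length;
		while (i > 0 && this.entries[i - 1].time > time) i--;
		this.entries.splice(i, 0, { time, value });
	}

	// Look up the value at the simulation time, interpolating between
	// the entry before and the entry after it.
	get(simulationTime) {
		const e = this.entries;
		if (e.length === 0) return undefined;
		if (simulationTime <= e[0].time) return e[0].value;
		for (let i = 1; i < e.length; i++) {
			if (simulationTime <= e[i].time) {
				const prev = e[i - 1];
				const next = e[i];
				const t = (simulationTime - prev.time) / (next.time - prev.time);
				return prev.value + (next.value - prev.value) * t;
			}
		}
		return e[e.length - 1].value; // nothing newer yet: hold the last value
	}

	// Delete entries that are no longer needed, keeping one to interpolate from.
	deleteBefore(simulationTime) {
		while (this.entries.length > 1 && this.entries[1].time <= simulationTime) {
			this.entries.shift();
		}
	}
}
```

A turret-like object would call add() when a network update arrives, then call get() and deleteBefore() with the current simulation time on every tick.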

Network events

One-off events like "projectile fired" are handled with a stepped timeline in GameClientMessageHandler. This basically just queues up received events until the simulation time catches up, at which point it then applies them.
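As a sketch of that queueing behaviour, assuming a hypothetical SimpleEventQueue class rather than the real SteppedValueTimeline:

```javascript
// Hypothetical queue of one-off events, kept sorted by server time.
class SimpleEventQueue {
	constructor() {
		this.events = [];
	}

	// Insert by server timestamp, so out-of-order arrival is harmless.
	add(serverTime, event) {
		let i = this.events.length;
		while (i > 0 && this.events[i - 1].serverTime > serverTime) i--;
		this.events.splice(i, 0, { serverTime, event });
	}

	// Remove and return every event the simulation time has now reached.
	takeDue(simulationTime) {
		let count = 0;
		while (count < this.events.length && this.events[count].serverTime <= simulationTime) {
			count++;
		}
		return this.events.splice(0, count).map(e => e.event);
	}
}
```

Each tick the client would call takeDue() with the current simulation time and apply whatever comes back; unlike an interpolated timeline, nothing is returned in between entries.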

Handling lateness

Handling late arrival of network events works out quite nicely. A late event can be detected by checking whether its server time is behind the simulation time on arrival, and the difference also tells the client how late it is. Then the client can simulate the event having happened at the right time in the past!

This works well with the "projectile fired" event. Projectiles have entirely predictable motion, moving in a straight line at a fixed speed. So if a "projectile fired" event arrives 400ms late, the client advances the projectile by the distance it would have travelled in 400ms. The end result is a slightly late "projectile fired" event means the projectile just suddenly materialises in the correct place. Of course if the message is really late then the projectile will have already disappeared on the server. That's too bad - the client will never see it, but as the client's display of projectiles is purely cosmetic, it won't have any bearing on the actual gameplay.
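The catch-up calculation looks something like the sketch below. The event fields (x, y, angle, speed, lifetimeMs) and the function name are assumptions for the example, not the game's actual message format.

```javascript
// Hypothetical handler for a "projectile fired" event.
// event: { serverTime (ms), x, y, angle (radians), speed (pixels/sec), lifetimeMs }
function spawnProjectile(event, simulationTime) {
	// How far behind the simulation time the event is (0 if it arrived on schedule).
	const lateness = Math.max(0, simulationTime - event.serverTime);

	// Really late: the projectile has already disappeared on the server.
	if (lateness >= event.lifetimeMs) return null;

	// Advance along the straight-line path by the distance already travelled.
	const distance = event.speed * (lateness / 1000);
	return {
		x: event.x + Math.cos(event.angle) * distance,
		y: event.y + Math.sin(event.angle) * distance,
		remainingMs: event.lifetimeMs - lateness
	};
}
```

So a projectile travelling at 500 pixels per second whose event is 400ms late starts 200 pixels along its path.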

Handling late events isn't as easy for "projectile hit" and "unit destroyed" events: in both cases it's tricky to hide the fact the event arrived late. The best the client can do is avoid creating explosions if the event is really late, so as to avoid drawing the player's attention to laggy events; instead the client just tries to quietly catch up.

Late position updates

Remember that according to the bandwidth design, unit positions are only sent in "full" updates, rotating through all units every 2 seconds. At first I made clients ignore late position updates. However with poor network conditions, many position updates arrive late. Ignoring them means units drift substantially off their correct position, sometimes for relatively long periods of time, and then eventually jump back to the right place when a position update arrives on time.

I decided late position updates need to be used, otherwise the state of the game will get too far off with a poor network. But how do you use a message that says something like "this unit was at this position 400ms ago"?

The solution was to keep a history of the unit position on the client. Fortunately we can use timelines for this as well, only keeping a list of values in the past. ClientPlatform keeps its position over the past 2 seconds. Then when a late position update occurs, it can look in the history, see where it was at the given server time, and then work out how far it's off that position. For example it can determine "400ms ago I was at (100, 100), but the server said I should have been at (80, 110), so I'm currently off by (20, -10)". The logic for this is in OnNetworkUpdatePosition().
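A minimal version of that history lookup might look like this. The PositionHistory class and positionError() function are hypothetical stand-ins for the timeline-based logic in ClientPlatform, and they assume samples are recorded in increasing time order.

```javascript
// Hypothetical history of recent positions. Times are in milliseconds.
class PositionHistory {
	constructor(maxAgeMs = 2000) {
		this.maxAgeMs = maxAgeMs;
		this.samples = []; // { time, x, y } in increasing time order
	}

	record(time, x, y) {
		this.samples.push({ time, x, y });
		// Forget anything older than the history window.
		while (this.samples.length > 0 && this.samples[0].time < time - this.maxAgeMs) {
			this.samples.shift();
		}
	}

	// Where the unit was at a past time, interpolating between samples.
	getAt(time) {
		const s = this.samples;
		if (s.length === 0) return null;
		if (time <= s[0].time) return { x: s[0].x, y: s[0].y };
		for (let i = 1; i < s.length; i++) {
			if (time <= s[i].time) {
				const a = s[i - 1];
				const b = s[i];
				const t = (time - a.time) / (b.time - a.time);
				return { x: a.x + (b.x - a.x) * t, y: a.y + (b.y - a.y) * t };
			}
		}
		const last = s[s.length - 1];
		return { x: last.x, y: last.y };
	}
}

// How far off the client was at the server's timestamp (client minus server).
function positionError(history, serverTime, serverX, serverY) {
	const past = history.getAt(serverTime);
	if (!past) return { x: 0, y: 0 };
	return { x: past.x - serverX, y: past.y - serverY };
}
```

Using the numbers from the example above: if the history says the unit was at (100, 100) at the update's server time and the update says (80, 110), positionError() returns (20, -10).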

Finally, the correction is also smoothed. This also applies to updates that arrive on schedule - if the client finds out by any means that the unit is in the wrong position, we don't want it to jump. Instead it saves its offset from the true position in the #xCorrection and #yCorrection properties, and applies the correction over time in the Tick() method. For small updates this just moves the unit at 20 pixels per second linearly. Often the updates are small and slow enough to not be noticeable. If it's way off though, it will move it exponentially at a 95% correction per second. This can help give the impression of the unit "catching up" instead of just teleporting somewhere else.
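The two correction modes can be sketched as a single per-frame step. The 20 pixels per second and 95% per second figures are from the description above; the 50 pixel threshold for switching between them, and the function itself, are assumptions for illustration rather than the real Tick() logic.

```javascript
const LINEAR_SPEED = 20; // pixels per second, for small errors (from the text)
const EXP_RATE = 0.95;   // fraction of the error corrected per second (from the text)
const LARGE_ERROR = 50;  // pixels; hypothetical threshold for "way off"

// Given the remaining correction and the frame time dt (in seconds),
// return how much of the correction to apply this frame.
function correctionStep(xCorrection, yCorrection, dt) {
	const dist = Math.hypot(xCorrection, yCorrection);
	if (dist === 0) return { dx: 0, dy: 0 };

	let move;
	if (dist > LARGE_ERROR) {
		// Exponential: after 1 second only 5% of the error remains,
		// regardless of how many frames that second was split in to.
		move = dist * (1 - Math.pow(1 - EXP_RATE, dt));
	} else {
		// Linear: a fixed speed, never overshooting the remaining error.
		move = Math.min(dist, LINEAR_SPEED * dt);
	}

	const scale = move / dist;
	return { dx: xCorrection * scale, dy: yCorrection * scale };
}
```

The caller would subtract the returned step from both the unit's displayed position and the stored correction each tick, until the correction reaches zero.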

Lazy creation

One last network-related update: rather than having the server send all the initial units on startup, the client now just creates units as it receives full updates for units it doesn't yet know about. To make it look better, the units fade in when they are first created, so you see a kind of cool effect as the whole level fades in - a neat trick to cover up the network syncing.

The client also times out units if they don't receive any update from the server for several seconds. The server should update every unit every 2 seconds; if a unit goes 7 seconds without an update, the client assumes the "unit destroyed" event got lost, and removes the unit. However with the lazy creation, if the server does send a full update again, the unit will just pop back in to existence and carry on. This means even a brief complete outage in network transmission - e.g. going through a tunnel while on a train - will eventually allow the game to re-sync and carry on. (I think this is also impossible with the traditional "lockstep" approach, so that's a nice advantage!)
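Lazy creation and the timeout fit together roughly as in this sketch. The ClientUnitRegistry class, its fields and the half-second fade are invented for the example; only the 7 second timeout comes from the text.

```javascript
const UNIT_TIMEOUT_MS = 7000; // from the text
const FADE_IN_PER_SECOND = 2; // hypothetical: fully visible after about 0.5s

// Hypothetical registry of the units the client currently knows about.
class ClientUnitRegistry {
	constructor() {
		this.units = new Map(); // id -> { id, x, y, opacity, lastUpdateTime }
	}

	// Called for every full update; creates the unit if it isn't known yet.
	onFullUpdate(id, x, y, now) {
		let unit = this.units.get(id);
		if (!unit) {
			unit = { id, x, y, opacity: 0, lastUpdateTime: now }; // starts invisible, then fades in
			this.units.set(id, unit);
		}
		unit.x = x; // a real client would apply a smoothed correction here
		unit.y = y;
		unit.lastUpdateTime = now;
		return unit;
	}

	// now is in milliseconds, dt is the frame time in seconds.
	tick(now, dt) {
		for (const [id, unit] of this.units) {
			if (now - unit.lastUpdateTime > UNIT_TIMEOUT_MS) {
				this.units.delete(id); // assume the "unit destroyed" event was lost
				continue;
			}
			unit.opacity = Math.min(1, unit.opacity + FADE_IN_PER_SECOND * dt);
		}
	}
}
```

If a timed-out unit turns out to still exist, the next full update simply recreates it through the same onFullUpdate() path, which is what lets the game recover after an outage.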

Results

There's now a decent framework for dealing with lag, interpolating between values, and handling late updates. Here's how the previous example now looks with exactly the same simulated latency - 200-400ms random delays with 20% packet loss.

Compare to the previous video - it's far better. Notice the top-right unit lags behind - it looks like the message with its speed update arrived late, but the client quickly finds the mistake, and then exponential correction makes it visually catch up with its correct position. After that it works smoothly.

The game should still be playable under these conditions! Even with an appalling simulated connection that loses 1 in 5 messages and has large unpredictable random delays, it's relatively smooth. The units react on a short delay and occasionally lag behind a bit, but overall it works.

I tried this on a couple of real-world Internet connections and it works very well. I think I'm right that almost all real-world connections are better than what I simulated. Even my phone running over cell data gets about 100ms latency with only 30-40ms PDV, which is low enough that messages are rarely late, and so everything looks pretty good with hardly any noticeable lag. Playing from the office with a home PC over broadband had just 30ms latency.

Conclusion

I'm very pleased with this result! It was tough to get right but now there's a good system for making the best of even really bad internet connections, and with a decent quality connection it works great. I think this will be OK for most people, and probably only struggle if you have a really poor wireless signal, or if you try to play with someone on the opposite side of the globe. (Although I'd be interested to hear how it works out if you do!)

It's looking like a solid foundation: it can handle 1000 units with low bandwidth, low CPU usage, and smooth things over on poor connections. It's probably time to finally stop units being able to drive over each other! Then there's things like scenery and pathfinding, and trying to stop traffic jams. The crowd control aspect will be tough as well. But hey, I signed up for a challenge!

As ever you can try it out now at CommandAndConstruct.com, although I've noticed a problem with random disconnects sometimes - oops! I'll try to figure that out soon. All the code is on GitHub so you can dig in to the code and see all the details - as ever I've tried to make sure there are lots of clear and detailed comments.


  • 2 Comments

  • I guess that I'm spoiled in the USA. Do people still use 56K modem dial-up? However, I'm glad you are compensating for V.90, MNP, v.35, and v.42bis protocols.

    • This has less to do with a dial-up modem and more to do with the infrastructure and engineering side of the internet (not just the Web). Using an outdated modem mainly determines the speed of data transmission per second, which isn't very significant for an online game. Most games only use around 50 KB/s. Even a high quality voice call is possible with that much bandwidth. But since a voice call is also a realtime interaction, the issues mentioned by Ashley still apply to it.

      Btw until 2020 (and beyond), millions of Americans were still paying for a dial-up subscription! So statistically yes, people do use it and your country's infrastructure (and tech literacy) isn't as advanced as you think. Maybe you're just one of the fortunate ones with access to 5G and Fiber connections.