Software development is hard: a collision bug post-mortem

Official Construct Team Post
Ashley
  • 23 Apr, 2018

Recently we received a bug report of a mysterious collision issue that only happened at certain positions. It turned out to be a surprisingly complex problem, and ended up causing our first .3 release in about 6 years. The bug is a great example of how software development can involve unexpected complications, even when much of the code is working as intended.

The bug report

The original report involved On collision events failing only at certain Y co-ordinates when using the Platform behavior. At first this sounds pretty odd, since collisions ought to work the same everywhere. Odd-sounding bug reports often turn out to be mistakes in events, but we investigate anyway to see what's going on. This time it did seem to be a real problem. Despite thousands of games using the existing collision code for years, someone had found a genuine case where On collision did not detect a collision when it should have. This is pretty rare!

Curious about this unusual case, we did some further debugging, which turned up the cause of the problem. Let's cover a bit about how the Platform behavior and collision detection code work to better understand it.

Landing on a platform

Like many game engines, the Construct runtime advances the game in steps - usually called ticks - typically at a framerate of 60 FPS. Moving objects therefore really advance in small jumps, but these happen rapidly enough to look like smooth motion.

Consequently, when a player using the Platform behavior lands, it normally steps inside the ground rather than stopping exactly on top of it.

Obviously this looks wrong, so the platform behavior detects this and reverses the movement, pushing the player back until it is just above the obstacle. This happens all within the same step before rendering, so the player never sees it inside the obstacle, only at the finishing position on top of the ground.

It's important to note that the "push back" algorithm continues until the player is not overlapping the ground. So in the finishing position on top of the ground, a "Player is overlapping ground" event would not run.
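To make the step-then-push-back idea concrete, here is a minimal sketch in Python. It is not the engine's code: the Box type, the tick_fall function, the one-pixel push-back step and the per-tick displacement dy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels (Y grows downwards)
    w: int
    h: int


def overlaps(a: Box, b: Box) -> bool:
    # Strict comparisons: boxes that only touch along an edge do not overlap.
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def tick_fall(player: Box, ground: Box, dy: int) -> bool:
    """Advance the player by one tick; return True if it landed."""
    player.y += dy  # one discrete jump, which may end inside the ground
    if not overlaps(player, ground):
        return False
    while overlaps(player, ground):
        player.y -= 1  # push back out, one pixel at a time (assumed step size)
    return True


player = Box(x=0, y=70, w=20, h=20)
ground = Box(x=0, y=100, w=200, h=20)
landed = tick_fall(player, ground, dy=13)
```

After the tick the player rests exactly on top of the ground, touching it but not overlapping it, which is why an overlap test made afterwards comes back false.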

Handling 'On collision'

There are three important aspects to note about how the On collision event works in this case.

#1: Triggers that aren't really triggers

One quirk of the event system is some events marked as triggers are still run in the normal top-to-bottom flow of the event sheet. In other words they act like normal events, but tend to run only for one-off ticks. On collision is one of these events.

One reason for this is performance: if the engine had to check whether collisions had happened every time an object moves, it would add a performance overhead to moving objects, which is a pretty basic feature of the engine that ought to be as fast as possible.

Another reason is it could trigger unwanted collisions. If you moved an object first on the X axis, then on the Y axis, the object could overlap something after moving only on the X axis. So the object does not visibly ever overlap anything, but it could still end up triggering a collision because it overlapped something before the overall movement was complete.

So the fact On collision runs like this is actually a useful way to make sure all movements during the tick have completed, and helps keep the engine performant. It really means something more like "is overlapping for the first time".
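One way to picture "is overlapping for the first time" is as an edge detector over the per-tick overlap test. The sketch below assumes the state is remembered per pair of objects; the OnCollision class and the pair key are hypothetical names, not the engine's.

```python
class OnCollision:
    """Fires only on the first tick of an overlap (illustrative only)."""

    def __init__(self) -> None:
        self._overlapping_last_tick: set = set()

    def check(self, pair: tuple, overlapping_now: bool) -> bool:
        # Run once per tick, in event-sheet order, after movement is done.
        fired = overlapping_now and pair not in self._overlapping_last_tick
        if overlapping_now:
            self._overlapping_last_tick.add(pair)
        else:
            self._overlapping_last_tick.discard(pair)
        return fired


event = OnCollision()
overlap_per_tick = [False, True, True, False, True]
fired_per_tick = [event.check(("player", "coin"), o) for o in overlap_per_tick]
```

The event fires on the second and fifth ticks only: the ticks where an overlap starts, not the ones where it merely continues.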

#2: Registered collisions

Events are single-threaded, meaning things don't normally happen simultaneously. Instead everything that happens during a tick runs in a specific order, one thing after another.

Suppose the player lands on the ground and the platform behavior pushes it back out until it is no longer overlapping. This happens while the platform behavior is advancing the movement. Then the engine moves on to running events. When it comes to check the On collision event, the player is not overlapping the ground. Users expect an On collision event to run in this case, because the player presumably collided with the ground in order to land on it. However, by the time events run, the platform behavior has already moved the player out of the ground, meaning On collision won't run unless we come up with a special way to handle this case.

So when the platform behavior detects the player hitting the ground, it pushes it back, and then tells the runtime: "remember that the player collided with the ground". The On collision event checks whether the objects are overlapping (which they aren't), then also asks the runtime "did these objects collide?" The answer is yes, so the event runs.

We call this mechanism a registered collision. The platform behavior registers that the player collided with the ground, so the On collision event can run even though it never saw the objects overlapping. The end result is everything works like the user expects - an important design goal for Construct!
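The registered collision idea can be sketched in a few lines. The Runtime class and its method names are assumptions made for this example, and clearing the registrations at the end of the tick is a guess at the lifetime rather than something stated above.

```python
class Runtime:
    """Remembers collisions that behaviors resolved before events ran."""

    def __init__(self) -> None:
        self._registered: set = set()

    def register_collision(self, a: str, b: str) -> None:
        self._registered.add(frozenset((a, b)))

    def has_registered_collision(self, a: str, b: str) -> bool:
        return frozenset((a, b)) in self._registered

    def end_tick(self) -> None:
        # Assumption: registrations only last for the tick they happened in.
        self._registered.clear()


def on_collision(runtime: Runtime, a: str, b: str, overlapping: bool) -> bool:
    # Runs if the objects overlap now OR a behavior registered a collision.
    return overlapping or runtime.has_registered_collision(a, b)


runtime = Runtime()
runtime.register_collision("player", "ground")  # done by the behavior on landing
ran_for_ground = on_collision(runtime, "player", "ground", overlapping=False)
ran_for_door = on_collision(runtime, "player", "door", overlapping=False)
runtime.end_tick()
ran_next_tick = on_collision(runtime, "player", "ground", overlapping=False)
```

The event runs for the ground even though the overlap test fails, and does not run for objects with no registration.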

#3: Collision cells

Collisions are an essential part of most games, and can have a high performance impact, especially when many instances are involved. For example if there are 1000 instances of both SpriteA and SpriteB, then the event SpriteA overlaps SpriteB must check about a million combinations!

To make sure games run fast even with many instances checking for collisions, Construct uses an important optimisation called collision cells. This essentially splits the layout into viewport-sized cells and keeps track of which objects are in which cells. Then when it checks for a collision, it only needs to check objects in the same cell. This takes advantage of the fact objects are usually spread out across the layout, and means far-away instances don't need to be considered at all, taking up zero processing time. It allows games to scale up to a much larger area and with more objects without degrading performance too much. It's also a super simple approach so (in theory) easy to get right and maintain, and can easily be used to cover an area of unlimited size.
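A rough sketch of the cell idea, using a dictionary keyed by cell co-ordinates so the covered area has no fixed size. The 640x480 cell size, the CollisionGrid class and the treatment of a box's far edge as exclusive are assumptions for illustration; boxes are (x, y, width, height) in whole pixels.

```python
from collections import defaultdict

CELL_W, CELL_H = 640, 480  # assumed viewport size in pixels


def cells_for(x, y, w, h):
    """Yield the (column, row) of every cell a box occupies (w, h >= 1)."""
    for cx in range(x // CELL_W, (x + w - 1) // CELL_W + 1):
        for cy in range(y // CELL_H, (y + h - 1) // CELL_H + 1):
            yield (cx, cy)


class CollisionGrid:
    def __init__(self):
        self._cells = defaultdict(set)  # only occupied cells are stored

    def insert(self, name, box):
        for cell in cells_for(*box):
            self._cells[cell].add(name)

    def candidates(self, box):
        """Names of objects sharing at least one cell with the box."""
        found = set()
        for cell in cells_for(*box):
            found |= self._cells.get(cell, set())
        return found


grid = CollisionGrid()
grid.insert("near", (100, 100, 32, 32))
grid.insert("far", (5000, 5000, 32, 32))
nearby = grid.candidates((120, 120, 32, 32))
straddling = list(cells_for(600, 0, 100, 10))
```

Only "near" comes back as a candidate; "far" is never looked at. A box that straddles a cell border is stored in every cell it occupies.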

You might notice in the original report, the certain positions where collisions didn't work correlated to multiples of the viewport size. Hmm - that's the size of our collision cells!

The bug

If we tie all of these together, we can understand how things go wrong. This is what happens: the ground that the player is landing on is exactly at the edge of a collision cell. When the platform behavior detects the landing, it registers a collision and pushes the player back out of the ground - into a different collision cell.

The fundamental problem then is the On collision event first collects all objects from the same collision cell, then checks to see if any of them are overlapping or have a registered collision. Unfortunately the ground is no longer in the same cell, so the event never even checks the ground object, even though the platform behavior did correctly register a collision. So the problem is: when an object registers a collision then gets moved into a different collision cell, the registered collision isn't checked.
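The failure can be reproduced in a small model that only looks at the Y axis. Everything here is an assumption made for the sake of the example: the 480-pixel cell height, the exclusive bottom edge, and the on_collision_buggy function, which mirrors the "collect from the same cell first, then check" order described above.

```python
CELL_H = 480  # assumed cell height in pixels


def rows(y, h):
    """Cell rows occupied by a box with top y and height h (h >= 1)."""
    return set(range(y // CELL_H, (y + h - 1) // CELL_H + 1))


def on_collision_buggy(objects, registered, a, b):
    ya, ha = objects[a]
    # Step 1: collect candidates that share a cell with `a`.
    candidates = {name for name, (y, h) in objects.items()
                  if name != a and rows(y, h) & rows(ya, ha)}
    # Step 2: only candidates are tested, so a registered collision with
    # an object in another cell is never consulted.
    if b not in candidates:
        return False
    yb, hb = objects[b]
    overlapping = ya < yb + hb and yb < ya + ha
    return overlapping or frozenset((a, b)) in registered


registered = {frozenset(("player", "ground"))}

# Ground top exactly on a cell border: the pushed-back player ends in row 0.
on_border = {"player": (460, 20), "ground": (480, 20)}
ran_on_border = on_collision_buggy(on_border, registered, "player", "ground")

# Same landing 20 pixels lower: both objects end up in row 1.
off_border = {"player": (480, 20), "ground": (500, 20)}
ran_off_border = on_collision_buggy(off_border, registered, "player", "ground")
```

The registration is identical in both cases, but the event only runs when the landing happens away from the border.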

The fix, then the other fix

The first fix was: when collecting objects in the same collision cell, also collect all registered collisions before proceeding.

Unfortunately this wasn't quite right. Registered collisions include collisions for any kind of object. So if you have "Player: On collision with ground", and the player registers a collision with a different object like a door, then the door is in the list of registered collisions and gets thrown in too! The end result is "On collision with ground" runs when you collide with an unrelated object.
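A sketch of why adding every registered collision goes wrong. The Obj type and on_collision_first_fix are hypothetical, and the lists stand in for whatever the engine really tracks.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    name: str
    kind: str  # object type, e.g. "ground" or "door"


def on_collision_first_fix(target_kind, same_cell, registered, overlapping):
    """Return the objects that make 'On collision with <target_kind>' run."""
    candidates = [o for o in same_cell if o.kind == target_kind]
    # The flaw: registered collisions are appended without checking their kind.
    candidates += [o for o in registered if o not in candidates]
    return [o for o in candidates if o in overlapping or o in registered]


door = Obj("door1", "door")
hits = on_collision_first_fix("ground", same_cell=[], registered=[door],
                              overlapping=[])
```

Here the player has only touched a door, yet the door ends up in the result for "On collision with ground".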

Oops. This is the kind of thing that breaks a lot of games, and it did.

Despite the fact beta users have opted in to possibly broken releases, we don't like to leave a major bug in for long. It's still an inconvenience and it undermines the ability to test anything else in the beta, essentially preventing anyone sensibly testing that release.

So under some degree of time pressure we added in an extra check to make sure only the right kinds of registered collisions are considered. So only registered collisions involving the ground - or a family the ground object is in - are considered for "On collision with ground". This came out in the .2 release. This is still a little tricky to get right, and unfortunately coding in a hurry doesn't always end well. A small mistake, essentially a typo, on a less-frequent codepath within the updated check caused some other kinds of games to throw a JavaScript error. Oops. It took one more tweak to get it right. That was the .3 release. As far as we can tell, it seems to be finally fixed now.
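The extra check can be sketched as a filter applied to registered collisions before they are added. The Obj type, the FAMILIES table and the "Solids" family are made up for the example; the real lookup in the engine will differ.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    name: str
    kind: str  # object type, e.g. "ground" or "door"


FAMILIES = {"ground": {"Solids"}}  # hypothetical: object type -> its families


def matches(obj, target):
    # The event's target may be an object type or a family containing it.
    return obj.kind == target or target in FAMILIES.get(obj.kind, set())


def on_collision_fixed(target, same_cell, registered, overlapping):
    candidates = [o for o in same_cell if matches(o, target)]
    # Only registered collisions of the right kind are added.
    candidates += [o for o in registered
                   if matches(o, target) and o not in candidates]
    return [o for o in candidates if o in overlapping or o in registered]


ground = Obj("ground1", "ground")
door = Obj("door1", "door")
hits = on_collision_fixed("ground", same_cell=[],
                          registered=[ground, door], overlapping=[])
family_hits = on_collision_fixed("Solids", same_cell=[],
                                 registered=[ground, door], overlapping=[])
```

The ground is picked up even though it sits in a different cell, whether the event names the object type or a family, and the door is left out.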

Lessons

We deal with thousands of bugs, but this one was particularly interesting for a few reasons.

Firstly, it touches on a whole range of things in the engine: behaviors, ticking, handling collisions, optimisations, sequence of operations, and special mechanisms to meet user expectations, all interacting in a very specific way to produce an unexpected result.

Secondly, most of those aspects of the engine were working as designed. The behaviors, registered collisions mechanism, and collision cells optimisation were all working correctly. The bug is interesting because it emerges in the interaction between these otherwise correct parts of the engine. It reveals how in a complex piece of software, you can't always simply build a feature in isolation and ignore everything else: you have to take into account how it interoperates with other parts of the code, and there is a particular class of bugs that emerges in between separately designed and tested features.

Thirdly, adding optimisations like collision cells increases the complexity of the engine, resulting in more bugs. Despite the fact the concept is really simple - just divide everything up into cells - it ends up interacting with other parts of the engine in subtle and unexpected ways. This is why, despite the fact it's possible to pile on a whole range of fancy optimisations to make as many cases as fast as possible, it is extremely difficult to also make it reliable. It's also why it's best to choose the simplest possible solution to a problem, no matter how cool it may seem to use some clever (usually meaning complicated) solution.

Finally, it's interesting how specific the circumstances are that caused the bug: it had to be using the registered collision mechanism exactly on the border of a collision cell. This is probably why it took such a long time for the bug to emerge. It was probably first introduced when collision cells were originally implemented in r155, over 4 years ago. The fact it took so long before anyone noticed suggests it's a relatively minor bug, but it's still slightly worrying to think that bugs can be found in core features even years down the line!

Conclusion

As more features, optimisations and other things are added to software, there are more combinations between them that can show up bugs. Every time some new code is added, you have to think about how it interacts with all the other existing code, which generally gets progressively harder. It is difficult to find or predict "edge cases", where several situations all line up just right, resulting in a rare and unexpected result. (This is also why it might be wise not to buy a first-generation self-driving car!)

The simpler software is, the easier it is to get right. Reliability is one of the most important qualities of software. To ensure this it's important to avoid an unchecked march towards a spaghetti of code complexity. This is a significant reason to be wary of implementing major features or optimisations that have far-reaching impact throughout the engine - it will be very difficult to get right. While there are many factors that contribute to what we decide to develop next for Construct, these kinds of complexity considerations are significant.

Dealing with bug reports takes up a significant amount of our time, since we get a lot of them, and it can be very time consuming working out the details behind problems like this. Also while many weird-sounding bugs may simply be mistakes, sometimes they're real problems, so we generally have to investigate all reports anyway. It can be hard work tracking down exactly what's happening, especially when there are several factors involved. Even once the problem is understood, it can still be difficult to make sure the fix works correctly!

Hopefully this gives you some insight into the kind of work that goes into routinely dealing with bug reports and issues that come up during development. In short - software development is hard!

  • 5 Comments

  • Very interesting and insightful blog post.

  • Im glad i read this! In a game im developing ive always wondered how performance was maintained when i have many dozens and dozens of off-viewport physics objects. Now i know based on the cell explanation. Very insightful about the engine design. I guess webassembly will operate the same? This also may help explain why i can run previews of physics intense games through the browser, on my phone (although phones are getting so powerful these days).

  • Wow! I had this problem with Pinkman (https://store.steampowered.com/app/516480/Pinkman/) a year ago while i was still using Construct 2, i had to redesign part of a level because of it, i'm glad that you guys fixed it. Thank you, Scirra! :)

    BTW, nowadays i report every single bug that i find. :P

  • I had issues with objects collision, Thanks for the blog post!

  • I really like the game not even like but love it