2020 December 16 - 18:52 Angry rant about a bug

Angry and tired rant incoming!

At the end of a project using a popular game engine, a really nasty bug appeared, only reproducible on release builds for one of the main game consoles. To make it worse, it required several hours of active gameplay to reproduce, and it would go away when trying to reproduce it on a debug build.

Day 1.
The symptom of the bug is that you suddenly, either on level load or mid-level, can no longer grab items or slap items and other players. Reloading the scene/level does not help; the only way to solve it is to completely restart the game.
At first I assumed this was due to some exception or other error happening but being hidden, since it only happens on release builds without debugging. But after adding in-engine on-screen debug prints it became apparent that capsule/sphere-casts from the physics engine always return 0 hits once the bug is activated.

Normal raycasts seem to work though, since you can still jump, and raycasting is used to judge whether the mover is grounded, will soon be grounded or was very recently grounded.
Since capsule/sphere-casts are only used in a few places, the first action was to replace them with a bunch of normal raycasts, using both the methods that return multiple hits and, as a fallback, the method that just returns a single hit.
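
To give an idea of what "a bunch of normal raycasts" can look like, here is a minimal, engine-independent sketch in Python. It only computes the ray origins: one ray through the centre plus a ring of parallel rays at the radius the sphere-cast would have had. The function name `ray_bundle_origins` and the `ring_count` parameter are made up for illustration and are not the engine's API or the project's actual code; the rays themselves would still be fired with whatever raycast call the engine offers.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v: Vec3) -> Vec3:
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if length == 0.0:
        raise ValueError("direction must not be the zero vector")
    return (v[0] / length, v[1] / length, v[2] / length)


def ray_bundle_origins(center: Vec3, direction: Vec3, radius: float,
                       ring_count: int = 8) -> List[Vec3]:
    """Origins for one central ray plus `ring_count` parallel rays.

    The ring rays start on a circle of `radius` around `center`, in the
    plane perpendicular to `direction` (hypothetical helper, see lead-in).
    """
    d = _normalize(direction)
    # Pick a helper axis that is not (nearly) parallel to the direction.
    helper = (0.0, 1.0, 0.0) if abs(d[1]) < 0.9 else (1.0, 0.0, 0.0)
    u = _normalize(_cross(d, helper))  # first axis of the perpendicular plane
    v = _cross(d, u)                   # second axis, already unit length
    origins: List[Vec3] = [center]
    for i in range(ring_count):
        angle = 2.0 * math.pi * i / ring_count
        c, s = math.cos(angle), math.sin(angle)
        origins.append((center[0] + radius * (c * u[0] + s * v[0]),
                        center[1] + radius * (c * u[1] + s * v[1]),
                        center[2] + radius * (c * u[2] + s * v[2])))
    return origins
```

A bundle like this misses anything thinner than the gap between two rays, so it is a rough stand-in for a real sphere-cast rather than an equivalent.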

Day 2.
Another day of testing and the results are that the new raycasts also stop working after a couple of hours.
But an interesting discovery is that slapping the mailboxes still works, and both grabbing and slapping things work normally in the two tutorial levels. And in one level slapping and grabbing actually works on one of the objects as long as it’s underwater.
The tutorial levels didn’t give me much of a clue, but the mailboxes are on a different physics layer than the other items/players. So the next step is to remove the layer masks and just filter the results when going through the hits, if there are any.
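
Filtering after the fact instead of passing a layer mask to the cast could look roughly like the sketch below. It is plain Python with invented names (`Hit`, `layer_mask`, `filter_hits`) and assumes layers are small integer indices combined into a bitmask, which is a common convention but not something confirmed here for the engine in question.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for whatever hit record the engine returns."""
    name: str
    layer: int       # layer index, assumed to be a small non-negative integer
    distance: float


def layer_mask(*layers: int) -> int:
    """Build a bitmask with one bit set per layer index."""
    mask = 0
    for layer in layers:
        mask |= 1 << layer
    return mask


def filter_hits(hits: Iterable[Hit], mask: int) -> List[Hit]:
    """Keep only hits whose layer bit is set in `mask`, nearest first.

    The cast itself is done against all layers; the mask is applied here
    instead of inside the physics query.
    """
    kept = [h for h in hits if mask & (1 << h.layer)]
    kept.sort(key=lambda h: h.distance)
    return kept
```

The cost is that the query returns more hits than needed, but it takes the engine's own mask handling out of the equation while hunting the bug.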
Water is using a trigger collider to detect objects and add buoyancy, so I also tried adding an empty trigger collider covering the player, as maybe raycasts only work if they start inside a trigger collider?

Day 3.
After one more day of testing the results are that while removing the layer masks didn’t solve the issue, the raycasts now start generating hits, but only on static colliders. It seems raycasting only works against colliders that do not have a rigidbody attached to them, which explains why jumping still works, as the majority of the ground is just static colliders. Same with the mailboxes: they are static objects.
Another new clue is that on levels with multiple floors the raycasting mostly works like it should on the top floors, but as soon as the mover comes down to the ground floor it stops working again.
The fact that it seems limited to the vertical axis and rigidbodies made me start thinking about the broadphase of the physics engine. While I don’t have the source for the engine, it seems likely that rigidbodies would have a different broadphase test, and that for some reason, after a few hours of playing on release builds, the broadphase AABB testing breaks down.
With the standard broadphase type there’s no way to manually rebuild the broadphase data structure, but after changing to multibox pruning there is. Calling it to rebuild the broadphase regions on each level load seems like a good thing to try.
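
For anyone unfamiliar with the term: the broadphase is the cheap first pass that uses bounding boxes (AABBs) to decide which colliders are even worth testing exactly. The toy below is a uniform grid written in Python purely to illustrate that idea and what "rebuilding" such a structure means. It is not multibox pruning and not the engine's implementation, and the names `AABB` and `GridBroadphase` are invented for this sketch.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterator, List, Set, Tuple

Vec3 = Tuple[float, float, float]
Cell = Tuple[int, int, int]


@dataclass(frozen=True)
class AABB:
    lo: Vec3  # minimum corner
    hi: Vec3  # maximum corner

    def overlaps(self, other: "AABB") -> bool:
        # Two boxes overlap only if their intervals overlap on every axis.
        return all(self.lo[k] <= other.hi[k] and other.lo[k] <= self.hi[k]
                   for k in range(3))


class GridBroadphase:
    """Toy uniform-grid broadphase (illustration only, see lead-in)."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Cell, List[str]] = defaultdict(list)
        self.boxes: Dict[str, AABB] = {}

    def _cells_for(self, box: AABB) -> Iterator[Cell]:
        lo = [math.floor(box.lo[k] / self.cell_size) for k in range(3)]
        hi = [math.floor(box.hi[k] / self.cell_size) for k in range(3)]
        for x in range(lo[0], hi[0] + 1):
            for y in range(lo[1], hi[1] + 1):
                for z in range(lo[2], hi[2] + 1):
                    yield (x, y, z)

    def rebuild(self, boxes: Dict[str, AABB]) -> None:
        """Throw away all cells and re-insert every box from scratch."""
        self.cells.clear()
        self.boxes = dict(boxes)
        for name, box in self.boxes.items():
            for cell in self._cells_for(box):
                self.cells[cell].append(name)

    def query(self, box: AABB) -> Set[str]:
        """Names of stored boxes whose AABB overlaps `box`."""
        found: Set[str] = set()
        for cell in self._cells_for(box):
            for name in self.cells.get(cell, ()):
                if self.boxes[name].overlaps(box):
                    found.add(name)
        return found
```

If a structure like this ends up stale or corrupted, queries quietly return nothing for the affected objects even though the colliders themselves are fine, which is at least consistent with the symptoms above. Whether that is what actually happens inside the engine is a guess.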

Day 4-5.
After extensive testing over 2 days the bug can no longer be reproduced; it seems changing the broadphase type and/or forcing a rebuild of the broadphase regions fixed it.
Everyone is happy, birds are singing, sun is shining (well, no, not in Stockholm). But damn, these kinds of bugs suck, especially on consoles.
Having source access to the engine might have helped, but finding and fixing a bug like this in a big engine would probably not have gone any faster than finding this workaround.

TL;DR: on console release builds, the broadphase of the popular game engine’s physics engine will sometimes stop working properly. Changing the broadphase method and forcing a rebuild of the broadphase regions on each level load solved it.