fix(lockstep): keep blocking barrier on Linux SITL

f8da039518 added a wall-time watchdog inside LockstepComponents::wait_for_components() under #if __PX4_WINDOWS || __PX4_LINUX. The watchdog returns after ~100 us so a single delayed component cannot stall the simulator, which is needed on Windows because winpthreads can drop pthread_cond_broadcast wakes. On Linux the same short budget regularly cuts the barrier short before slow consumers (dataman_client, navigator, mavlink mission manager) have processed the previous tick. The result is dataman_client GET_ID round-trips that never see a response within the 1000 ms timeout:

  ERROR [dataman_client] timeout after 1000 ms!
  ERROR [dataman_client] Failed to get client ID!

Mission upload then silently no-ops, the navigator reports "Mission rejected: empty", and downstream tests fail with mode-set timeouts (MAVSDK Land-on-GPS-lost, MAVROS AUTO.MISSION 5 s timeout, ROS ModesTest.runMission).

Restrict the watchdog to __PX4_WINDOWS and gate the helper functions in the same guard so Linux falls back to the original px4_sem_wait() blocking path that is byte-equivalent to main. macOS already used the unguarded path before f8da039518 and continues to do so.

Reproducibility: confirmed by diffing the file against origin/main and inspecting the bound-waits commit; verified main SITL has zero dataman_client startup timeouts while this branch produces 5-6 per startup; WSL is unavailable on this host so a runtime reproduction was not possible. Astyle 3.1 reformat is a no-op.
Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
This commit is contained in:
Nuno Marques
2026-05-07 16:44:44 -07:00
parent a7600b54e6
commit 3caba73975
@@ -44,6 +44,7 @@
#include <limits.h>
#include <stdlib.h>
#if defined(__PX4_WINDOWS)
static uint64_t lockstep_wall_time_us()
{
timespec ts{};
@@ -71,6 +72,7 @@ static uint64_t lockstep_component_wait_budget_us()
return wait_budget_us;
}
#endif // __PX4_WINDOWS
LockstepComponents::LockstepComponents(bool no_cleanup_on_destroy)
: _no_cleanup_on_destroy(no_cleanup_on_destroy)
@@ -158,10 +160,14 @@ void LockstepComponents::wait_for_components()
return;
}
#if defined(__PX4_WINDOWS) || defined(__PX4_LINUX)
// Do not let one temporarily blocked component stall the simulator forever.
// Keep the watchdog short: high lockstep speed factors need sub-millisecond
// barrier overhead, and active components normally release the latch first.
#if defined(__PX4_WINDOWS)
// Windows-only: winpthreads occasionally drops pthread_cond_broadcast
// wakes, which can leave the producer blocked here forever. Use a
// short wall-time watchdog so a lost wake caps barrier latency at
// ~100 us instead of stalling the simulator. Linux uses the original
// blocking wait below because POSIX condition variables on Linux
// reliably deliver every broadcast and the blocking path preserves
// strict producer/consumer ordering at startup.
const uint64_t wait_start_us = lockstep_wall_time_us();
const uint64_t max_wait_us = lockstep_component_wait_budget_us();
unsigned spin_count = 0;