mirror of
https://github.com/PX4/PX4-Autopilot.git
synced 2026-05-23 06:36:45 +08:00
fix(lockstep): keep blocking barrier on Linux SITL
f8da039518added a wall-time watchdog inside LockstepComponents::wait_for_components() under #if __PX4_WINDOWS || __PX4_LINUX. The watchdog returns after ~100 us so a single delayed component cannot stall the simulator, which is needed on Windows because winpthreads can drop pthread_cond_broadcast wakes. On Linux the same short budget regularly cuts the barrier short before slow consumers (dataman_client, navigator, mavlink mission manager) have processed the previous tick. The result is dataman_client GET_ID round-trips that never see a response within the 1000 ms timeout: ERROR [dataman_client] timeout after 1000 ms! ERROR [dataman_client] Failed to get client ID! Mission upload then silently no-ops, the navigator reports "Mission rejected: empty", and downstream tests fail with mode-set timeouts (MAVSDK Land-on-GPS-lost, MAVROS AUTO.MISSION 5 s timeout, ROS ModesTest.runMission). Restrict the watchdog to __PX4_WINDOWS and gate the helper functions in the same guard so Linux falls back to the original px4_sem_wait() blocking path that is byte-equivalent to main. macOS already used the unguarded path beforef8da039518and continues to do so. Reproducibility: confirmed by diffing the file against origin/main and inspecting the bound-waits commit; verified main SITL has zero dataman_client startup timeouts while this branch produces 5-6 per startup; WSL is unavailable on this host so a runtime reproduction was not possible. Astyle 3.1 reformat is a no-op. Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
This commit is contained in:
@@ -44,6 +44,7 @@
|
||||
#include <limits.h>
|
||||
#include <stdlib.h>
|
||||
|
||||
#if defined(__PX4_WINDOWS)
|
||||
static uint64_t lockstep_wall_time_us()
|
||||
{
|
||||
timespec ts{};
|
||||
@@ -71,6 +72,7 @@ static uint64_t lockstep_component_wait_budget_us()
|
||||
|
||||
return wait_budget_us;
|
||||
}
|
||||
#endif // __PX4_WINDOWS
|
||||
|
||||
LockstepComponents::LockstepComponents(bool no_cleanup_on_destroy)
|
||||
: _no_cleanup_on_destroy(no_cleanup_on_destroy)
|
||||
@@ -158,10 +160,14 @@ void LockstepComponents::wait_for_components()
|
||||
return;
|
||||
}
|
||||
|
||||
#if defined(__PX4_WINDOWS) || defined(__PX4_LINUX)
|
||||
// Do not let one temporarily blocked component stall the simulator forever.
|
||||
// Keep the watchdog short: high lockstep speed factors need sub-millisecond
|
||||
// barrier overhead, and active components normally release the latch first.
|
||||
#if defined(__PX4_WINDOWS)
|
||||
// Windows-only: winpthreads occasionally drops pthread_cond_broadcast
|
||||
// wakes, which can leave the producer blocked here forever. Use a
|
||||
// short wall-time watchdog so a lost wake caps barrier latency at
|
||||
// ~100 us instead of stalling the simulator. Linux uses the original
|
||||
// blocking wait below because POSIX condition variables on Linux
|
||||
// reliably deliver every broadcast and the blocking path preserves
|
||||
// strict producer/consumer ordering at startup.
|
||||
const uint64_t wait_start_us = lockstep_wall_time_us();
|
||||
const uint64_t max_wait_us = lockstep_component_wait_budget_us();
|
||||
unsigned spin_count = 0;
|
||||
|
||||
Reference in New Issue
Block a user