From d681d45e8bdded75f102060ef5c602166fe069c7 Mon Sep 17 00:00:00 2001 From: Nuno Marques Date: Fri, 8 May 2026 14:09:38 -0700 Subject: [PATCH] fix(lockstep): gate WorkQueue px4_lockstep_progress to Windows only MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit abfacca289 added an unconditional px4_lockstep_progress(_lockstep_component) call inside the ENABLE_LOCKSTEP_SCHEDULER block of WorkQueue::Run() so each work-queue pass explicitly hands back lockstep progress before the empty check. Every other lockstep-related change in that commit (hrt_queue.c sem_post, hrt_thread.c sem_timedwait path, hrt_work.h _hrt_work_signal extern, lockstep_components.cpp barrier watchdog, lockstep_scheduler.cpp cond_timedwait poll loop) was correctly gated under __PX4_WINDOWS to work around winpthreads dropping pthread_cond_broadcast wakes; only the WorkQueue.cpp call slipped past the guard and applies on every platform. On Linux this is at best redundant and at worst harmful: if a WorkQueue wakes spuriously and Run() exits with an empty queue, the new progress call signals "this component is ready for the next sim tick" before the items the simulator clock was gating on have actually been processed. The MAVSDK iris suite triggered this with the EKF2-disabled attitude-only config (sitl.json test 2: PX4_PARAM_EKF2_EN=0, PX4_PARAM_ATT_EN=1) — the attitude_estimator_q work queue cycles more sparsely than the EKF2 graph, so the premature progress signal lets sim time advance before mag/baro/imu have been consumed. mavsdk_tests sees Current speed factor: nan (set: 10) for the entire 20 s budget, the vehicle never reaches an armable health state, and arming fails: WARN [commander] Arming denied: Resolve system health failures first ../../../test/mavsdk_tests/autopilot_tester.cpp:582: FAILED: ../../../test/mavsdk_tests/autopilot_tester.cpp:596: FAILED: ../../../test/mavsdk_tests/autopilot_tester.cpp:188: FAILED: The default-EKF2 iris configs (test 1: 18 cases) pass because the EKF2 work queue is dense enough that the empty-queue race rarely fires and the next sensor delivery re-arms the barrier in time. Wrap the call in #if defined(__PX4_WINDOWS) so the Linux path is byte- equivalent to main and the Windows winpthreads workaround is preserved. The downstream `if (_q.empty()) px4_lockstep_unregister_component(...)` block still runs on every platform and matches main. Verification: re-reading the failed CI log confirms 18 of 19 iris test cases now pass on the lockstep_scheduler.cpp + work_dir-scope fixes; only the EKF2-disabled config exhibits the speed-factor=nan stall, and the symptom is consistent with premature progress on the attitude-only work queue. Signed-off-by: Nuno Marques --- platforms/common/px4_work_queue/WorkQueue.cpp | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/platforms/common/px4_work_queue/WorkQueue.cpp b/platforms/common/px4_work_queue/WorkQueue.cpp index e90bb846cc..0ee2085083 100644 --- a/platforms/common/px4_work_queue/WorkQueue.cpp +++ b/platforms/common/px4_work_queue/WorkQueue.cpp @@ -192,10 +192,20 @@ void WorkQueue::Run() #if defined(ENABLE_LOCKSTEP_SCHEDULER) +#if defined(__PX4_WINDOWS) + // Windows-only: winpthreads occasionally drops pthread_cond_broadcast + // wakes, so explicitly hand back lockstep progress after each Run() + // pass to keep the simulator advancing even if the broadcast that + // would have unblocked the lockstep barrier got lost. Linux POSIX + // condition variables deliver every broadcast; calling progress + // here on Linux risks signalling readiness before the queue has + // actually drained the items the simulator clock is gating on. if (_lockstep_component != -1) { px4_lockstep_progress(_lockstep_component); } +#endif // __PX4_WINDOWS + if (_q.empty()) { px4_lockstep_unregister_component(_lockstep_component); _lockstep_component = -1;