fix(lockstep): drop racy poll-loop in Linux cond_timedwait

abfacca289 wrapped the Linux #else branch of LockstepScheduler::cond_timedwait() in a 500 ms pthread_cond_timedwait poll-loop with a manual _time_us re-check, intended as a wall-time watchdog symmetric to the Windows winpthreads workaround. The loop has a wakeup classification bug: when an external pthread_cond_broadcast() unblocks the waiter (e.g. test_locked_semaphore_getting_unlocked broadcasting from another thread), pthread_cond_timedwait returns 0 and timed_wait.timeout is still false. Every guard in the loop falls through, the comment misclassifies it as a spurious wakeup, and the body re-enters pthread_cond_timedwait for another 500 ms. The broadcaster is gone and set_absolute_time() is no longer being called, so timed_wait.timeout never flips and the _time_us<=time_us re-check inside ETIMEDOUT never resolves. functional-lockstep_scheduler_test deadlocks until ctest's 1500 s timeout:

  Start  71: functional-lockstep_scheduler_test
  71/157 Test #71: functional-lockstep_scheduler_test ...***Timeout 1500.01 sec
  [ RUN      ] LockstepScheduler.All
  INFO  [lockstep_scheduler] setting initial absolute time to 12345678 us
  INFO  [lockstep_scheduler] setting initial absolute time to 12345678 us
  INFO  [lockstep_scheduler] setting initial absolute time to 12345678 us

Restore the original single pthread_cond_wait(cond, lock) on Linux, matching origin/main, and drop the now-dead "ETIMEDOUT && !timeout" salvage block downstream that only existed to paper over the loop. The Windows #if defined(__PX4_WINDOWS) || defined(_WIN32) branch (per-waiter SRWLOCK + SleepConditionVariableSRW) is untouched.

Verification: build/px4_sitl_test ctest passes 157/157 in 54.5 s with functional-lockstep_scheduler_test completing in 2.18 s. MinGW cross-build of px4.exe links cleanly and Wine smoke-test boots through param/dataman/simulator_mavlink without crash.
Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
This commit is contained in:
Nuno Marques
2026-05-08 13:52:40 -07:00
parent 3ced7609b2
commit 3bec843dfc
@@ -306,42 +306,7 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc
pthread_mutex_lock(lock);
#else
while (!timed_wait.timeout) {
struct timespec poll_deadline;
clock_gettime(CLOCK_REALTIME, &poll_deadline);
long total_ns = poll_deadline.tv_nsec + 500L * 1000L * 1000L;
poll_deadline.tv_sec += total_ns / 1000000000L;
poll_deadline.tv_nsec = total_ns % 1000000000L;
result = pthread_cond_timedwait(cond, lock, &poll_deadline);
if (timed_wait.timeout) {
break;
}
if (result == ETIMEDOUT) {
// Watchdog fired; re-check _time_us in case we lost the
// broadcast. _timed_waits_mutex is taken here only to
// synchronise with set_absolute_time.
_timed_waits_mutex.lock();
if (time_us <= _time_us) {
timed_wait.timeout = true;
}
_timed_waits_mutex.unlock();
continue;
}
if (result != 0) {
break;
}
// Spurious wakeup or real signal not reflected in `timeout`
// (shouldn't happen normally). Loop to re-check.
}
result = pthread_cond_wait(cond, lock);
#endif
const bool timeout = timed_wait.timeout;
@@ -350,13 +315,6 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc
result = ETIMEDOUT;
}
if (result == ETIMEDOUT && !timeout) {
// pthread_cond_timedwait returned ETIMEDOUT but broadcast also
// flipped `timeout` true between our check and this moment — we
// re-enter the loop above so this branch is effectively dead.
result = 0;
}
timed_wait.done = true;
if (!timeout && _setting_time) {