mirror of
https://github.com/PX4/PX4-Autopilot.git
synced 2026-05-09 22:08:56 +08:00
fix(lockstep): drop racy poll-loop in Linux cond_timedwait
abfacca289 wrapped the Linux #else branch of LockstepScheduler::cond_timedwait() in a 500 ms pthread_cond_timedwait poll-loop with a manual _time_us re-check, intended as a wall-time watchdog symmetric to the Windows winpthreads workaround. The loop has a wakeup classification bug: when an external pthread_cond_broadcast() unblocks the waiter (e.g. test_locked_semaphore_getting_unlocked broadcasting from another thread), pthread_cond_timedwait returns 0 and timed_wait.timeout is still false. Every guard in the loop falls through, the comment misclassifies it as a spurious wakeup, and the body re-enters pthread_cond_timedwait for another 500 ms. The broadcaster is gone and set_absolute_time() is no longer being called, so timed_wait.timeout never flips and the _time_us<=time_us re-check inside ETIMEDOUT never resolves. functional-lockstep_scheduler_test deadlocks until ctest's 1500 s timeout:
Start 71: functional-lockstep_scheduler_test
71/157 Test #71: functional-lockstep_scheduler_test ...***Timeout 1500.01 sec
[ RUN ] LockstepScheduler.All
INFO [lockstep_scheduler] setting initial absolute time to 12345678 us
INFO [lockstep_scheduler] setting initial absolute time to 12345678 us
INFO [lockstep_scheduler] setting initial absolute time to 12345678 us
Restore the original single pthread_cond_wait(cond, lock) on Linux, matching origin/main, and drop the now-dead "ETIMEDOUT && !timeout" salvage block downstream that only existed to paper over the loop. The Windows #if defined(__PX4_WINDOWS) || defined(_WIN32) branch (per-waiter SRWLOCK + SleepConditionVariableSRW) is untouched.
Verification: build/px4_sitl_test ctest passes 157/157 in 54.5 s with functional-lockstep_scheduler_test completing in 2.18 s. MinGW cross-build of px4.exe links cleanly and Wine smoke-test boots through param/dataman/simulator_mavlink without crash.
Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
This commit is contained in:
@@ -306,42 +306,7 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc
|
||||
|
||||
pthread_mutex_lock(lock);
|
||||
#else
|
||||
|
||||
while (!timed_wait.timeout) {
|
||||
struct timespec poll_deadline;
|
||||
clock_gettime(CLOCK_REALTIME, &poll_deadline);
|
||||
long total_ns = poll_deadline.tv_nsec + 500L * 1000L * 1000L;
|
||||
poll_deadline.tv_sec += total_ns / 1000000000L;
|
||||
poll_deadline.tv_nsec = total_ns % 1000000000L;
|
||||
|
||||
result = pthread_cond_timedwait(cond, lock, &poll_deadline);
|
||||
|
||||
if (timed_wait.timeout) {
|
||||
break;
|
||||
}
|
||||
|
||||
if (result == ETIMEDOUT) {
|
||||
// Watchdog fired; re-check _time_us in case we lost the
|
||||
// broadcast. _timed_waits_mutex is taken here only to
|
||||
// synchronise with set_absolute_time.
|
||||
_timed_waits_mutex.lock();
|
||||
|
||||
if (time_us <= _time_us) {
|
||||
timed_wait.timeout = true;
|
||||
}
|
||||
|
||||
_timed_waits_mutex.unlock();
|
||||
continue;
|
||||
}
|
||||
|
||||
if (result != 0) {
|
||||
break;
|
||||
}
|
||||
|
||||
// Spurious wakeup or real signal not reflected in `timeout`
|
||||
// (shouldn't happen normally). Loop to re-check.
|
||||
}
|
||||
|
||||
result = pthread_cond_wait(cond, lock);
|
||||
#endif
|
||||
|
||||
const bool timeout = timed_wait.timeout;
|
||||
@@ -350,13 +315,6 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc
|
||||
result = ETIMEDOUT;
|
||||
}
|
||||
|
||||
if (result == ETIMEDOUT && !timeout) {
|
||||
// pthread_cond_timedwait returned ETIMEDOUT but broadcast also
|
||||
// flipped `timeout` true between our check and this moment — we
|
||||
// re-enter the loop above so this branch is effectively dead.
|
||||
result = 0;
|
||||
}
|
||||
|
||||
timed_wait.done = true;
|
||||
|
||||
if (!timeout && _setting_time) {
|
||||
|
||||
Reference in New Issue
Block a user