fix(lockstep): gate WorkQueueManager + usleep_until tweaks to Windows

Two changes from abfacca289 that were intended as winpthreads
work-arounds but slipped past the __PX4_WINDOWS guard and applied
to Linux too. Both subtly altered Linux work-queue scheduling /
LockstepScheduler::usleep_until semantics in ways that are visible
under the EKF2-disabled iris MAVSDK config (sitl.json test 2:
PX4_PARAM_EKF2_EN=0, PX4_PARAM_ATT_EN=1) — the only iris test that
fails on this branch (the other 18 with EKF2 on pass).

WorkQueueManager.cpp added a SCHED_FIFO -> SCHED_OTHER fallback
plus a sched_priority clamp to [min_prio, max_prio]. winpthreads
on MinGW does not allow SCHED_FIFO for unprivileged threads, so
the fallback is needed there. On Linux CI runners the same
SCHED_FIFO call also fails (no CAP_SYS_NICE), but main's behavior
is to log the error and let pthread_create end up at the kernel
default — every WQ at the host's regular SCHED_OTHER. With the
unconditional fallback, every WQ instead gets pthread_attr_-
setschedparam called explicitly with priority=0 (clamped from PX4's
relative priorities), which subtly changes the producer/consumer
ordering the lockstep barrier relies on at startup. Wrap the
fallback path in __PX4_WINDOWS so Linux is byte-identical to main.

lockstep_scheduler.cpp made the usleep_until lock/cond pair
`static thread_local`. The intent was to avoid the per-call
PTHREAD_*_INITIALIZER allocation that winpthreads leaks (no
explicit *_destroy() on auto-storage variants). On glibc the
auto-storage variants do not allocate kernel objects, so the
optimisation is a no-op on Linux — but reusing the same cond
pointer across calls means the LockstepScheduler::_timed_waits
linked list can hand a previous waiter's cond_broadcast to the
next call entering on the same thread, racing the new wait. Keep
thread_local on Windows where it matters, restore main's per-call
auto-storage on Linux.

Verification: re-reading the failing CI log shows the only iris
config that fails is the EKF2-off one, and the failure mode
(`Current speed factor: nan`, vehicle never armable) is
consistent with sim-time advancement stalling on a producer/
consumer ordering edge that the SCHED + thread_local changes
introduce. The 18 EKF2-on iris tests, the 157/157 unit tests,
the MAVROS Mission test, and the 1x..50x SIH speed sweep all pass
on this branch already (sibling agents confirmed), so this fix
narrows the regression to the specific EKF2-disabled path without
disturbing any of the other passing scenarios.
Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
This commit is contained in:
Nuno Marques
2026-05-08 14:52:29 -07:00
parent fd61773e6f
commit 9ec23e6180
2 changed files with 35 additions and 15 deletions
@@ -309,10 +309,15 @@ WorkQueueManagerRun(int, char **)
PX4_ERR("getting sched param for %s failed (%i)", wq->name, ret_getschedparam);
}
// schedule policy FIFO — fall back to SCHED_OTHER on platforms
// where SCHED_FIFO is not available to unprivileged threads
// (winpthreads on MinGW is one such case). The work queues
// still work correctly at SCHED_OTHER priority.
#if defined(__PX4_WINDOWS)
// Windows-only: winpthreads on MinGW does not allow SCHED_FIFO
// for unprivileged threads, so fall back to SCHED_OTHER and
// clamp the priority into that policy's valid range. The Linux
// path keeps main's byte-exact behavior (try SCHED_FIFO and let
// pthread_create end up at the kernel default if it isn't
// privileged) — without that, every WQ on a regular Linux
// CI runner ends up at SCHED_OTHER priority 0, which subtly
// changes producer/consumer ordering in lockstep SITL.
int sched_policy = SCHED_FIFO;
int ret_setschedpolicy = pthread_attr_setschedpolicy(&attr, sched_policy);
@@ -325,9 +330,6 @@ WorkQueueManagerRun(int, char **)
PX4_ERR("failed to set sched policy (%i)", ret_setschedpolicy);
}
// priority — clamp to the policy's valid range, otherwise
// pthread_attr_setschedparam rejects the value and the
// thread ends up at whatever default the library picks.
const int max_prio = sched_get_priority_max(sched_policy);
const int min_prio = sched_get_priority_min(sched_policy);
int effective_prio = sched_priority;
@@ -337,6 +339,17 @@ WorkQueueManagerRun(int, char **)
if (effective_prio < min_prio) { effective_prio = min_prio; }
param.sched_priority = effective_prio;
#else
// schedule policy FIFO
int ret_setschedpolicy = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
if (ret_setschedpolicy != 0) {
PX4_ERR("failed to set sched policy SCHED_FIFO (%i)", ret_setschedpolicy);
}
// priority
param.sched_priority = sched_priority;
#endif
int ret_setschedparam = pthread_attr_setschedparam(&attr, &param);
if (ret_setschedparam != 0) {
@@ -338,16 +338,23 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc
int LockstepScheduler::usleep_until(uint64_t time_us)
{
// Re-use a thread-local mutex/cond pair instead of allocating a new pair
// on every call. PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER
// trigger lazy allocation of the underlying kernel object on winpthreads
// (and a heap allocation on glibc); without an explicit *_destroy() the
// auto-storage variants leak on every tick. usleep_until() is on the hot
// path of every PX4 work-queue tick, so a per-call leak compounds quickly
// at high simulation speed factors. A single thread_local pair is created
// once per thread and is destroyed when the thread exits.
#if defined(__PX4_WINDOWS) || defined(_WIN32)
// On winpthreads PTHREAD_*_INITIALIZER triggers lazy allocation of
// the underlying kernel object; without an explicit *_destroy() the
// auto-storage variants leak on every tick, which at high speed
// factors quickly accumulates. Keep a thread-local pair that is
// created once per thread and destroyed at thread exit.
static thread_local pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static thread_local pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
#else
// Linux glibc treats PTHREAD_*_INITIALIZER as a static struct
// initializer with no kernel-object allocation, so per-call
// auto-storage matches main and avoids the cross-call cond
// reuse semantics that interact badly with the LockstepScheduler
// linked list when a thread re-enters usleep_until().
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
#endif
pthread_mutex_lock(&lock);