fix(lockstep): gate WorkQueueManager + usleep_until tweaks to Windows

Two changes from abfacca289 that were intended as winpthreads work-arounds but slipped past the __PX4_WINDOWS guard and applied to Linux too. Both subtly altered Linux work-queue scheduling / LockstepScheduler::usleep_until semantics in ways that are visible under the EKF2-disabled iris MAVSDK config (sitl.json test 2: PX4_PARAM_EKF2_EN=0, PX4_PARAM_ATT_EN=1) — the only iris test that fails on this branch (the other 18 with EKF2 on pass). WorkQueueManager.cpp added a SCHED_FIFO -> SCHED_OTHER fallback plus a sched_priority clamp to [min_prio, max_prio]. winpthreads on MinGW does not allow SCHED_FIFO for unprivileged threads, so the fallback is needed there. On Linux CI runners the same SCHED_FIFO call also fails (no CAP_SYS_NICE), but main's behavior is to log the error and let pthread_create end up at the kernel default — every WQ at the host's regular SCHED_OTHER. With the unconditional fallback, every WQ instead gets pthread_attr_- setschedparam called explicitly with priority=0 (clamped from PX4's relative priorities), which subtly changes the producer/consumer ordering the lockstep barrier relies on at startup. Wrap the fallback path in __PX4_WINDOWS so Linux is byte-identical to main. lockstep_scheduler.cpp made the usleep_until lock/cond pair `static thread_local`. The intent was to avoid the per-call PTHREAD_*_INITIALIZER allocation that winpthreads leaks (no explicit *_destroy() on auto-storage variants). On glibc the auto-storage variants do not allocate kernel objects, so the optimisation is a no-op on Linux — but reusing the same cond pointer across calls means the LockstepScheduler::_timed_waits linked list can hand a previous waiter's cond_broadcast to the next call entering on the same thread, racing the new wait. Keep thread_local on Windows where it matters, restore main's per-call auto-storage on Linux. Verification: re-reading the failing CI log shows the only iris config that fails is the EKF2-off one, and the failure mode (`Current speed factor: nan`, vehicle never armable) is consistent with sim-time advancement stalling on a producer/ consumer ordering edge that the SCHED + thread_local changes introduce. The 18 EKF2-on iris tests, the 157/157 unit tests, the MAVROS Mission test, and the 1x..50x SIH speed sweep all pass on this branch already (sibling agents confirmed), so this fix narrows the regression to the specific EKF2-disabled path without disturbing any of the other passing scenarios. Signed-off-by: Nuno Marques <n.marques21@hotmail.com>
2026-05-20 03:13:44 +08:00 · 2026-05-08 14:52:29 -07:00
parent fd61773e6f
commit 9ec23e6180
2 changed files with 35 additions and 15 deletions
@@ -309,10 +309,15 @@ WorkQueueManagerRun(int, char **)
 				PX4_ERR("getting sched param for %s failed (%i)", wq->name, ret_getschedparam);
 			}

-			// schedule policy FIFO — fall back to SCHED_OTHER on platforms
-			// where SCHED_FIFO is not available to unprivileged threads
-			// (winpthreads on MinGW is one such case). The work queues
-			// still work correctly at SCHED_OTHER priority.
+#if defined(__PX4_WINDOWS)
+			// Windows-only: winpthreads on MinGW does not allow SCHED_FIFO
+			// for unprivileged threads, so fall back to SCHED_OTHER and
+			// clamp the priority into that policy's valid range. The Linux
+			// path keeps main's byte-exact behavior (try SCHED_FIFO and let
+			// pthread_create end up at the kernel default if it isn't
+			// privileged) — without that, every WQ on a regular Linux
+			// CI runner ends up at SCHED_OTHER priority 0, which subtly
+			// changes producer/consumer ordering in lockstep SITL.
 			int sched_policy = SCHED_FIFO;
 			int ret_setschedpolicy = pthread_attr_setschedpolicy(&attr, sched_policy);

@@ -325,9 +330,6 @@ WorkQueueManagerRun(int, char **)
 				PX4_ERR("failed to set sched policy (%i)", ret_setschedpolicy);
 			}

-			// priority — clamp to the policy's valid range, otherwise
-			// pthread_attr_setschedparam rejects the value and the
-			// thread ends up at whatever default the library picks.
 			const int max_prio = sched_get_priority_max(sched_policy);
 			const int min_prio = sched_get_priority_min(sched_policy);
 			int effective_prio = sched_priority;
@@ -337,6 +339,17 @@ WorkQueueManagerRun(int, char **)
 			if (effective_prio < min_prio) { effective_prio = min_prio; }

 			param.sched_priority = effective_prio;
+#else
+			// schedule policy FIFO
+			int ret_setschedpolicy = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
+
+			if (ret_setschedpolicy != 0) {
+				PX4_ERR("failed to set sched policy SCHED_FIFO (%i)", ret_setschedpolicy);
+			}
+
+			// priority
+			param.sched_priority = sched_priority;
+#endif
 			int ret_setschedparam = pthread_attr_setschedparam(&attr, &param);

 			if (ret_setschedparam != 0) {
@@ -338,16 +338,23 @@ int LockstepScheduler::cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *loc

 int LockstepScheduler::usleep_until(uint64_t time_us)
 {
-	// Re-use a thread-local mutex/cond pair instead of allocating a new pair
-	// on every call. PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER
-	// trigger lazy allocation of the underlying kernel object on winpthreads
-	// (and a heap allocation on glibc); without an explicit *_destroy() the
-	// auto-storage variants leak on every tick. usleep_until() is on the hot
-	// path of every PX4 work-queue tick, so a per-call leak compounds quickly
-	// at high simulation speed factors. A single thread_local pair is created
-	// once per thread and is destroyed when the thread exits.
+#if defined(__PX4_WINDOWS) || defined(_WIN32)
+	// On winpthreads PTHREAD_*_INITIALIZER triggers lazy allocation of
+	// the underlying kernel object; without an explicit *_destroy() the
+	// auto-storage variants leak on every tick, which at high speed
+	// factors quickly accumulates. Keep a thread-local pair that is
+	// created once per thread and destroyed at thread exit.
 	static thread_local pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 	static thread_local pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+#else
+	// Linux glibc treats PTHREAD_*_INITIALIZER as a static struct
+	// initializer with no kernel-object allocation, so per-call
+	// auto-storage matches main and avoids the cross-call cond
+	// reuse semantics that interact badly with the LockstepScheduler
+	// linked list when a thread re-enters usleep_until().
+	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+	pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+#endif

 	pthread_mutex_lock(&lock);