[core] Collapse duplicate to_remove_ read in Scheduler::call fast path

Read to_remove_ once at the top of the cleanup block instead of loading it twice (once via cleanup_() -> to_remove_empty_(), and again for the MAX_LOGICALLY_DELETED_ITEMS check).

GCC cannot CSE across the memw barriers that std::atomic<uint32_t>::load emits on Xtensa. As a result, the fast-path zero check and the max check each generated an independent memw + l32i sequence:

    memw
    l32i  a8, [to_remove_]
    beqz  ...                 ; to_remove_empty_()
    memw
    l32i  a8, [to_remove_]
    bltui 5                   ; to_remove_count_()

Reading the counter once into a register and branching on the result collapses the common zero case to a single memw + l32i + beqz. The non-zero path still pays a second read after cleanup_slow_path_ (which may have decremented the counter), but that path already takes lock_, so the extra load is negligible.

Scheduler::call stays at 344 B (unchanged). With no flash layout shift, adjacent code (feed_wdt_slow_ etc.) stays in the same cache lines, so the measurement can't be confounded by MMIO timing drift.

An earlier version of this change also marked Scheduler::millis_64() ESPHOME_ALWAYS_INLINE to drop its out-of-line $isra$0 clone. That grew Scheduler::call by 56 B, shifted feed_wdt_slow_ in flash, and measurably regressed the wdt bucket on a busier winefridge.yaml config, so the inlining was dropped. This change keeps only the memw collapse.
2026-05-24 01:37:15 +08:00 · 2026-04-24 16:56:02 -05:00
parent f62972c2c6
commit 614d018e4b
1 changed file with 17 additions and 7 deletions
@@ -601,14 +601,24 @@ uint32_t HOT Scheduler::call(uint32_t now) {
  }
 #endif /* ESPHOME_DEBUG_SCHEDULER */

-  // Cleanup removed items before processing
-  // First try to clean items from the top of the heap (fast path)
-  this->cleanup_();
+  // Cleanup removed items before processing. Read the counter once so the
+  // common zero-case is a single atomic load + branch; the old code called
+  // cleanup_() (which loads to_remove_) and then to_remove_count_() again for
+  // the MAX check, producing a redundant memw+load pair on the fast path.
+  // GCC cannot CSE across the memw barriers that std::atomic<uint32_t>::load
+  // emits on Xtensa, so the duplicate load was unavoidable without manual
+  // restructuring.
+  if (this->to_remove_count_() != 0) {
+    // First try to clean items from the top of the heap (fast path).
+    this->cleanup_slow_path_();

-  // If we still have too many cancelled items, do a full cleanup
-  // This only happens if cancelled items are stuck in the middle/bottom of the heap
-  if (this->to_remove_count_() >= MAX_LOGICALLY_DELETED_ITEMS) {
-    this->full_cleanup_removed_items_();
+    // If we still have too many cancelled items, do a full cleanup.
+    // This only happens if cancelled items are stuck in the middle/bottom
+    // of the heap. Re-read to_remove_ because cleanup_slow_path_ may have
+    // decremented it.
+    if (this->to_remove_count_() >= MAX_LOGICALLY_DELETED_ITEMS) {
+      this->full_cleanup_removed_items_();
+    }
  }
  // IMPORTANT: This loop uses index-based access (items_[0]), NOT iterators.
  // This is intentional — fired intervals are pushed back into items_ via