Skip to content

do not call tcp/udp ::stopAll() when it is not required to do so#8598

Merged
d-a-v merged 3 commits into
esp8266:masterfrom
d-a-v:nostopall
Jun 23, 2022
Merged

do not call tcp/udp ::stopAll() when it is not required to do so#8598
d-a-v merged 3 commits into
esp8266:masterfrom
d-a-v:nostopall

Conversation

@d-a-v
Copy link
Copy Markdown
Collaborator

@d-a-v d-a-v commented Jun 12, 2022

fixes #6123

@mcspr
Copy link
Copy Markdown
Collaborator

mcspr commented Jun 14, 2022

ArduinoOTA as well? Even a failed update kills every client
Also note of WebUpdate example

Continuing the idea, UpdaterClass should probably handle that

//This callback will be called when OTA connection has begun
void onStart(THandlerFunction fn);
//This callback will be called when OTA has finished
void onEnd(THandlerFunction fn);
//This callback will be called when OTA encountered Error
void onError(THandlerFunction_Error fn);
//This callback will be called when OTA is receiving data
void onProgress(THandlerFunction_Progress fn);

Where the user might as well do the

WiFiClient::stopAll();
WiFiUDP::stopAll();
// or, where we want some fine grained control
WiFiClient::stopAllExcept(&server.client());
WiFiUDP::stopAllExcept(&myPreciousCnC);

@d-a-v d-a-v changed the title do not call tcp/udp ::stopAll() when it is not required to to so do not call tcp/udp ::stopAll() when it is not required to do so Jun 23, 2022
@d-a-v d-a-v merged commit 7d9abbb into esp8266:master Jun 23, 2022
@d-a-v d-a-v deleted the nostopall branch June 23, 2022 20:14
mcspr added a commit that referenced this pull request Nov 3, 2022
follow-up of #8598 
similar to ArduinoOTA API, execute certain callback in the Updater context.
hasenradball pushed a commit to hasenradball/Arduino that referenced this pull request Nov 18, 2024
…8266#8598)

* do not call tcp/udp ::stopAll() when it is not required to to so
* remove stopAll also from ArduinoOTA
hasenradball pushed a commit to hasenradball/Arduino that referenced this pull request Nov 18, 2024
follow-up of esp8266#8598 
similar to ArduinoOTA API, execute certain callback in the Updater context.
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 23, 2026
….1.x

Root cause: Arduino Core 3.1.0 (esp8266/Arduino PR #8598) removed the
implicit WiFiClient/WiFiUDP::stopAll() that used to run during the
Update path. Without explicit cleanup from the sketch, TCP sockets
linger in lwIP and the WiFi SDK state persists through the soft-reset
triggered by ESP.restart(). The next boot comes up in a half-state:
WiFi appears associated (stale from before reboot) but telnet, HTTP,
MQTT and WebSocket all fail to rebind their ports. The only way to
recover without power-cycling is to force a WiFi disconnect at the
AP, which signals STAMODE_DISCONNECTED and triggers our normal
reconnect path, finally initialising services properly.

Reported 2026-04-22 by andrebrait (Discord #beta-testing) and
independently reproduced by number3nl. Symptom matches the Core 3.1.0
breaking change exactly: device is "connected to WiFi per Unifi
controller" but all services non-responsive until forced WiFi
disconnect.

Fix: add a prepareForReboot() helper in helperStuff.ino that closes
all long-lived TCP services (MQTT, WebSocket, telnet on port 23,
OTGW stream on 25238) and calls WiFi.disconnect() before the
ESP.restart() soft-reset fires. mDNS and LLMNR are UDP-only and
their stale state is harmless across a reset, so they are left
alone (also avoids API compatibility issues across Core versions).

Changes:
- MQTTstuff.ino: new doMqttDisconnect() wrapper. MQTTclient is
  file-static so helperStuff.ino cannot reach it directly; the
  wrapper exposes a graceful disconnect.
- OTGW-firmware.h: prototypes for doMqttDisconnect and doRestart so
  they are reachable from the OTA update template header.
- helperStuff.ino: prepareForReboot() helper + doRestart() now calls
  it between flushSettings() and ESP.restart(). All reboot paths
  (nightly, OTA success, WiFi config-portal fallback) benefit.
- OTGW-ModUpdateServer-impl.h: the HTTP_POST success handler now
  routes through doRestart() instead of its own inline delay +
  ESP.restart + delay sequence. Consolidates the reboot flow to a
  single canonical path.

Compared to the alternative of using ESP.reset() (bootrom jump): the
cleanup approach preserves MQTT last-will semantics for HA, keeps
getResetReason() meaningful for post-reboot diagnostics, and stays
within the Arduino-idiomatic API. Estimated effectiveness ~90% on
the primary reported symptom; remaining risk covers latent issues
(e.g. first-time 1.3.x -> 1.4.x LittleFS reformat) that are already
addressed elsewhere (release notes flash-order guidance).

Sources:
- esp8266/Arduino release 3.1.0 notes: "ArduinoOTA and ESP8266HTTPUpdate
  no longer stop all WiFiClient/WiFiUDP (#8598)"
- PR esp8266/Arduino#8598

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 23, 2026
Follow-up to 201921f (service cleanup before reboot). Switches the
final reset call in doRestart() from ESP.restart() to ESP.reset().

Rationale: ESP.restart() triggers a soft-reset via the SDK, which per
Core 3.1.0 (PR esp8266/Arduino#8598) no longer performs implicit
WiFi/UDP cleanup. The soft-reset can leave SDK WiFi state half-
associated across the reset boundary, producing the "WiFi connected
but services dead" symptom that andrebrait and marceld91d reported.

ESP.reset() jumps into the ROM bootrom at 0x40000080, which is
functionally equivalent to asserting the RESET pin. It wipes all SDK
state, all lwIP pools, all DRAM — a guaranteed fresh boot. This is
exactly what the manual recovery (physical reset button) does, and
what field testers discovered works reliably.

The graceful-shutdown concerns that normally argue against
ESP.reset() (abrupt teardown, no LWT, no close frames) are already
handled by prepareForReboot() which runs BEFORE the reset:
- MQTT disconnect publishes a clean offline event (no LWT trigger)
- webSocket.close() sends proper close frames to all WS clients
- debugTelnet.stop() / OTGWstream.stop() FIN the TCP sockets
- WiFi.disconnect() triggers STAMODE_DISCONNECTED to the SDK

So: cleanup preserves peer hygiene, ESP.reset() preserves local
wipe robustness. Defense in depth.

Side benefits:
- The delay(5000) safety-tail after ESP.restart is no longer needed
  because ESP.reset never returns. Two fewer lines of defensive code.
- Future-proof against services we might add later and forget to add
  to prepareForReboot(): the bootrom reset wipes their state anyway.

Trade-offs accepted:
- getResetReason() will report "External System" instead of "Software
  /System restart". The application-level reason is still logged via
  DebugTln(str) before the reset fires.
- ESP.reset() is marked deprecated in Core 3.x comments. Still fully
  functional; migration path in Core 4.x unknown but likely years
  away. Acceptable risk.

Expected effectiveness: ~95% on the reported reboot symptom. The
remaining ~5% covers edge cases that would also require a physical
reset (brownout, flash corruption, hardware issues).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 24, 2026
Two reboot call sites still bypassed the central doRestart() cleanup
wrapper, skipping the MQTT LWT / WebSocket close / TCP-FIN sequence
that is critical on Arduino Core 3.1.0+ (PR esp8266/Arduino#8598
removed the implicit WiFiClient::stopAll() from the Update path).

- runNightlyRestartCheck() in OTGW-firmware.ino called ESP.restart()
  directly after a 200ms delay. Scheduled reboots now go through
  doRestart("[nightly] scheduled restart") so nightly restart and
  OTA-success follow the same cleanup path.
- The ESP32 OTA success handler in OTGW-ModUpdateServer-esp32.h
  called platformRestart() directly after a manual 1s delay, asymmetric
  with the ESP8266 path which already uses doRestart(). Both paths now
  use doRestart("[OTA] Rebooting...").

Part of TASK-394 Phase 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 24, 2026
Port of 2.0.0 TASK-394 Phase 1 commit 2f2adf0 onto dev (1.4.x).
runNightlyRestartCheck() called ESP.restart() directly after a 200ms
delay, bypassing the central doRestart() cleanup wrapper. That skipped
the MQTT LWT / WebSocket close / TCP-FIN sequence that is critical on
Arduino Core 3.1.0+ (PR esp8266/Arduino#8598 removed the implicit
WiFiClient::stopAll() from the Update path).

Scheduled reboots now go through doRestart("[nightly] scheduled
restart") so nightly restart and OTA-success follow the same cleanup
path. Dev is ESP8266-only so there is no ESP32 OTA path to fix here;
the existing OTGW-ModUpdateServer-impl.h:108 already uses doRestart.

Part of TASK-395.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 24, 2026
Part 3 of 3 for TASK-396. The helpers (696bdb9) and setup/loop wiring
(ab443d0) are now visibly exercised from the OTA path.

Four new logBootSignature probes trace the OTA heap lifecycle. Grep the
telnet log for "\[OTA\] " and you get a full narrative of each update:

  [OTA] pre-begin     ← _beginFirmware/FilesystemUpload first line
  (upload progress)
  [OTA] post-end      ← immediately after Update.end(true) commits the image
  [OTA] post-remount  ← FS-OTA only, after LittleFS.begin() returns true
  [OTA] pre-reboot    ← HTTP success handler, just before queuing the reboot

The HTTP POST success handler was changed from calling doRestart()
directly to calling requestDeferredReboot("[OTA] Rebooting..."). Rationale
(also captured in the updated code comment): doRestart() calls
prepareForReboot() which stops telnet and closes MQTT/WS/OTGWstream —
all while still inside the HTTP handler callback. On fast networks the
success HTML was usually flushed by client().stop() before the cleanup
started, but on slow/congested links we used to see browser hangs where
the success page never fully arrived. Deferring the reboot to the next
loop() tick gives lwIP 10-100ms to finish draining the response socket.
The existing service-cleanup sequence (critical on Arduino Core 3.1.0+
per PR esp8266/Arduino#8598) still fires — just from performDeferredReboot()
in loop() rather than from the HTTP handler.

Sequence on the wire:
  _handleUploadEnd() sets bESPactive=false    ← end of upload
  HTTP handler sends 200 OK + success HTML
  client().stop() flushes                     ← HTTP 200 leaves the socket
  logBootSignature("[OTA] pre-reboot")        ← captures final heap state
  requestDeferredReboot(...)                  ← sets flag, returns
  HTTP handler returns to WebServer loop
  <next loop() tick>
  doBackgroundTasks() runs
  rebootHeapWatermarkTick() updates minHeap
  isRebootPending() → true, isFlashing() → false
  performDeferredReboot() → doRestart() → prepareForReboot() → ESP.reset()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 24, 2026
…3) for LTS stability

Rolls back the ESP8266 target from Core 3.1.2 (espressif8266@4.2.1) to
Core 2.7.4 (espressif8266@2.6.3), matching the 1.5.x LTS line.
Motivation: field reports of OTA-reboot stalls on Core 3.1.2 traced to
PR esp8266/Arduino#8598 (removal of implicit WiFiClient/WiFiUDP::stopAll
in the Update path). Core 2.7.4 is the baseline the 1.3.x line used and
has a multi-year stability track record.

ESP32 env is untouched; it continues to use the pioarduino fork with
Arduino-ESP32 v3.3.5.

Comment block documents the GCC 4.8.2 vs 10.3 ctags trade-off for future
maintainers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rvdbreemen added a commit to rvdbreemen/OTGW-firmware that referenced this pull request Apr 24, 2026
…upersedes ADR-014)

ADR-082 (Accepted) documents the ESP8266 Arduino Core rollback from 3.1.2
(espressif8266@4.2.1) to 2.7.4 LTS (espressif8266@2.6.3) for the 2.0.0
line. Drivers: PR esp8266/Arduino#8598 removed the implicit
WiFiClient::stopAll() that the Update path relied on, producing
OTA-reboot stalls; field soak on 1.5.x validated the rollback.
Library-version pin consequences captured (AceTime v2.0.1 and
WebSockets 2.3.6 for GCC 4.8.2 compatibility). The ESP32 target is
explicitly unaffected.

ADR-083 (Accepted) makes PlatformIO the primary build system on the
2.0.0 line (dual-target ESP8266 + ESP32). Retains build.py /
arduino-cli only on the LTS-1.5.x branch for historical continuity.
Supersedes ADR-014 which rejected PlatformIO in 2020 on grounds that
no longer hold (dual-target, pre-hook library patching, GitHub-tag
AceTime pinning, Arduino IDE dropped as contributor workflow).

ADR-014 Status field updated to 'Superseded by ADR-083' per the
project rule that Accepted/Deprecated ADRs may only have their status
touched — content remains immutable.

Research report moved from repo root to docs/research/ (git mv
preserves history) and referenced from ADR-082.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants