ESP8266 intermittent hard lockup on network access
Posted: Tue Sep 11, 2018 4:47 pm
For weeks, I have been struggling with a very serious bug: intermittently, network calls (usually an HTTPClient.GET(), but sometimes also UDP queries to an NTP server) lock up and never return. The system locks up hard, and neither the hardware watchdog nor Ticker-based software watchdogs trigger a reset, which implies (I think) that the hardware watchdog is being stroked and interrupts are blocked.
I should start by echoing what others have said: there should never be a case where OS (or Arduino core) code strokes the hardware watchdog inside a loop. A failure of the watchdog to reset the system when the main loop stops running is a cardinal sin in embedded systems.
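For readers comparing notes, a Ticker-based software watchdog of the kind mentioned above usually reduces to a deadline check like the one below. This is a minimal sketch and not necessarily the code used in this application: `LoopWatchdog`, `feed`, and `expired` are hypothetical names, and the timestamps are passed in as arguments so the logic can be exercised off-device. On the ESP8266 the timestamps would presumably come from `millis()`, with `feed()` called from `loop()` and `expired()` checked from a Ticker callback that restarts the chip.

```cpp
#include <cstdint>

// Hypothetical helper: remembers when the main loop last reported progress.
class LoopWatchdog {
public:
    explicit LoopWatchdog(uint32_t timeoutMs)
        : timeoutMs_(timeoutMs), lastFeedMs_(0) {}

    // Call from the main loop with the current time in milliseconds.
    void feed(uint32_t nowMs) { lastFeedMs_ = nowMs; }

    // Call from a periodic timer callback; true means the loop has stalled.
    // Unsigned subtraction keeps the result correct across counter rollover.
    bool expired(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastFeedMs_) > timeoutMs_;
    }

private:
    uint32_t timeoutMs_;   // allowed gap between feeds
    uint32_t lastFeedMs_;  // timestamp of the most recent feed
};
```

A check like this can only fire if timer callbacks still run, which is consistent with the symptom described here: if the callback itself is starved, the software watchdog never gets a chance to act.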
Unfortunately, I have not been able to reproduce this reliably, but it happens often (I often can't go a full day without this happening). Sometimes the system will recover after tens of minutes or hours, sometimes it must be hard reset. This has happened on many devices so it's not a random bad unit.
I have been experiencing this with Arduino core 2.4.0-2.4.2 with LWIP 1.4. I tried LWIP2 (small memory footprint), but was having more serious stability issues with it so I'm sticking with 1.4 for now.
A code segment that this happens with is:
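Code:
HTTPClient httpClient;
// URL of the small text file to fetch
static const char *myUrl = "http://myserver.com/myfile.txt";
...
httpClient.begin(myUrl);
httpClient.setTimeout(1000);            // timeout in milliseconds
httpClient.setUserAgent(getHostname()); // getHostname() is my own helper
int httpResult = httpClient.GET();      // this is the call that sometimes never returns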
It works most of the time, but sometimes the GET() never returns.
The GET is retrieving a very small file (~20 bytes).
It is possible that several of these GETs are hitting the server at the same time.
Another unusual aspect of my application is that I frequently turn off the WiFi modem and then turn it back on before attempting any network communications:
Code:
// Stop auto-reconnect, then drop the current association.
wifi_station_set_auto_connect(0);
wifi_station_disconnect();
Serial.printf_P(PSTR("WiFi: Stopping...\r\n"));
bool stopped = false;
int timeout = 20; // max polls allowed to disconnect (about 1 s at 50 ms each)
while (timeout-- > 0 && !stopped) {
  uint32_t status = wifi_station_get_connect_status();
  stopped = (status == STATION_IDLE);
  if (!stopped) {
    Serial.printf("%u", (unsigned)status);
    delay(50); // also yields so the SDK can make progress
  }
}
if (!stopped) {
  // Disconnect never completed: restore normal operation and give up.
  wifi_station_set_auto_connect(1);
  wifi_station_connect();
  Serial.printf_P(PSTR("WiFi: disconnect failed.\r\n"));
  return;
}
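The wait-for-idle loop above follows a general bounded-polling pattern, which can be factored out as sketched below. This is only an illustration under assumptions: `pollUntil` and its parameters are hypothetical names, not part of the ESP8266 SDK or the Arduino core, and the condition and wait step are passed in as callables so the helper compiles and can be tested on a PC.

```cpp
#include <cstdint>

// Hypothetical helper: polls `isDone` up to `maxPolls` times, calling `wait`
// after each unsuccessful poll. Returns true as soon as `isDone` succeeds,
// or false if the poll budget runs out first.
template <typename Done, typename Wait>
bool pollUntil(Done isDone, Wait wait, uint32_t maxPolls) {
    for (uint32_t i = 0; i < maxPolls; ++i) {
        if (isDone()) {
            return true;
        }
        wait();  // on the ESP8266 this would be something like delay(50)
    }
    return false;
}
```

On the device, the disconnect wait would then look roughly like `pollUntil([] { return wifi_station_get_connect_status() == STATION_IDLE; }, [] { delay(50); }, 20)`. The point of the pattern is that every wait has a hard upper bound, so a failure surfaces as a `false` return rather than a hang.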
I turn the modem back on with:
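Code:
wifi_fpm_do_wakeup();                  // wake the modem from forced sleep
wifi_fpm_close();                      // leave forced-sleep mode
wifi_set_opmode_current(STATION_MODE); // station mode, not saved to flash
wifi_station_connect();                // reconnect to the configured AP
wifi_station_set_auto_connect(1);      // re-enable auto-connect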
Any ideas would be much appreciated. If you are experiencing the same thing, please post a reply so we can get a feel for how many people are experiencing this and please share information about how this is happening in your application so we can maybe get some clues as to what the underlying problem is.