Using the new Arduino IDE for ESP8266 and found bugs, report them here

Moderator: igrr

By dalbert
#78182 For weeks, I have been struggling with a very serious bug: intermittently, network calls (usually an HTTPClient.GET(), but sometimes also UDP queries to an NTP server) lock up and never return. The system locks up hard, and neither the hardware watchdog nor Ticker-based software watchdogs trigger a reset, which implies (I think) that the hardware watchdog is still being stroked and interrupts are blocked.

I should start by echoing what others have said: there should never be a case where OS (or Arduino core) code strokes the hardware watchdog inside a loop. A failure of the watchdog to reset the system when the main loop stops running is a cardinal sin in embedded systems.
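
To make the principle concrete, here is a minimal sketch of a loop-liveness check in which only the main loop is allowed to feed the watchdog. The class name LoopWatchdog and its methods are hypothetical and not part of the ESP8266 Arduino core; the time source is passed in as a plain millisecond counter so the logic does not depend on any platform call.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the ESP8266 Arduino core): records when the
// main loop last reported in, so a periodic checker can detect a stalled loop.
class LoopWatchdog {
public:
    explicit LoopWatchdog(uint32_t timeoutMs)
        : timeoutMs_(timeoutMs), lastFedMs_(0) {}

    // Call from exactly one place: the top of the main loop.
    void feed(uint32_t nowMs) { lastFedMs_ = nowMs; }

    // Call from a periodic checker. Unsigned subtraction keeps the comparison
    // correct when the millisecond counter wraps around at 2^32.
    bool expired(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastFedMs_) > timeoutMs_;
    }

private:
    uint32_t timeoutMs_;   // maximum allowed gap between feeds
    uint32_t lastFedMs_;   // time of the most recent feed
};
```

On the ESP8266 this would presumably be wired up by calling feed(millis()) in loop() and checking expired(millis()) from a Ticker callback that calls ESP.restart(). Note that a check like this only helps while timer callbacks still run; in the lockups described here the Ticker-based watchdog never fired either.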

Unfortunately, I have not been able to reproduce this reliably, but it happens often (I can rarely go a full day without it happening). Sometimes the system recovers after tens of minutes or hours; sometimes it must be hard reset. This has happened on many devices, so it's not a random bad unit.

I have been experiencing this with Arduino core 2.4.0-2.4.2 with LWIP 1.4. I tried LWIP2 (small memory footprint), but was having more serious stability issues with it so I'm sticking with 1.4 for now.

A code segment that this happens with is:
Code: Select all   
        HTTPClient httpClient;
        static const char myUrl[] = "http://myserver.com/myfile.txt";
...
        httpClient.begin(myUrl);
        httpClient.setTimeout(1000);            // timeout in milliseconds
        httpClient.setUserAgent(getHostname());
        int httpResult = httpClient.GET();      // sometimes never returns

It works most of the time, but sometimes the GET() never returns.

The GET is retrieving a very small file (~20 bytes).
It is possible that several of these GETs are hitting the server at the same time.

Another unusual aspect of my application is that I frequently turn off the WiFi modem and then turn it back on before attempting any network communications:
Code: Select all   
  wifi_station_set_auto_connect(0);
  wifi_station_disconnect();
  Serial.printf_P(PSTR("WiFi: Stopping...\r\n"));
  bool stopped = false;
  int timeout = 20; // max polls allowed to disconnect (20 x 50 ms = 1 s)
  while (timeout-- && !stopped) {
    uint32_t status = wifi_station_get_connect_status();
    stopped = (status == STATION_IDLE);
    if (!stopped) {
       Serial.printf("%d", status);
       delay(50); // wait before polling again
    }
  }
  if (!stopped) {
    wifi_station_set_auto_connect(1);
    wifi_station_connect();
    Serial.printf_P(PSTR("Wifi: disconnect failed.\r\n"));
    return;
  }
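
The disconnect loop above is an instance of bounded polling: check a condition a limited number of times, pausing between checks, and give up if it never becomes true. A minimal sketch of that pattern as a reusable helper follows. The name pollUntil and its parameters are hypothetical; the condition and the wait are passed in as callables so the helper itself makes no SDK calls.

```cpp
#include <functional>

// Hypothetical helper: polls `condition` up to `maxPolls` times, calling
// `wait` after each unsuccessful poll. Returns true as soon as the condition
// holds, or false once the poll budget is exhausted.
bool pollUntil(const std::function<bool()>& condition,
               int maxPolls,
               const std::function<void()>& wait) {
    for (int i = 0; i < maxPolls; ++i) {
        if (condition()) {
            return true;
        }
        wait();
    }
    return false;
}
```

In the code above, the condition would be wifi_station_get_connect_status() == STATION_IDLE and the wait would be delay(50), with maxPolls set to 20 for a limit of roughly one second.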


I turn the modem back on with:
Code: Select all   
  wifi_fpm_do_wakeup();
  wifi_fpm_close();
  wifi_set_opmode_current(STATION_MODE);
  wifi_station_connect();
  wifi_station_set_auto_connect(1);


Any ideas would be much appreciated. If you are experiencing the same thing, please post a reply so we can get a feel for how many people are experiencing this and please share information about how this is happening in your application so we can maybe get some clues as to what the underlying problem is.
Last edited by dalbert on Wed Sep 12, 2018 11:38 am, edited 3 times in total.
By dalbert
#78198 To add to this, the problem also happens intermittently with UDP queries. The system is not out of memory (heap is around 18-20K) and the same code will run successfully many times before locking up.

For example, the code snippet below printed nothing after the first printf, indicating that it had locked up while sending or receiving the packet.

Running Arduino core 2.4.2 (SDK 2.2.1(cfd48f3)) with LWIP 1.4.

Code: Select all
static void ICACHE_FLASH_ATTR sendNTPpacket(IPAddress& address) {
   // set all bytes in the buffer to 0
   memset(packetBuffer, 0, NTP_PACKET_SIZE);
   // Initialize values needed to form NTP request
   // (see URL above for details on the packets)
   packetBuffer[0] = 0b11100011;   // LI, Version, Mode
   packetBuffer[1] = 0;     // Stratum, or type of clock
   packetBuffer[2] = 6;     // Polling Interval
   packetBuffer[3] = 0xEC;  // Peer Clock Precision
   // 8 bytes of zero for Root Delay & Root Dispersion
   packetBuffer[12]  = 49;
   packetBuffer[13]  = 0x4E;
   packetBuffer[14]  = 49;
   packetBuffer[15]  = 52;

   // all NTP fields have been given values, now
   // you can send a packet requesting a timestamp:
   udp.beginPacket(address, 123); //NTP requests are to port 123
   udp.write(packetBuffer, NTP_PACKET_SIZE);
   udp.endPacket();
}

....
   // called periodically in main loop:

         Serial.printf_P(PSTR("NTP: request time..."));
         sendNTPpacket(ntpServerIp);
         delay(1000); // wait for reply
         int cb = udp.parsePacket();
         if (!cb) {
             Serial.printf_P(PSTR("NTP: no response.\r\n"));
         } else if (cb < (int)sizeof(NtpPacket)) {
             Serial.printf_P(PSTR("NTP: incomplete pkt.\r\n"));
         } else {
             Serial.printf_P(PSTR("NTP: received %d bytes\r\n"), cb);
         }
   ...
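
The snippet above only checks the length of the reply. For completeness, here is a minimal sketch of decoding it, assuming the standard 48-byte NTP packet layout, in which the integer part of the Transmit Timestamp sits in bytes 40 to 43 in big-endian order. The helper name ntpTransmitToUnix is hypothetical and is not taken from the original application.

```cpp
#include <cstddef>
#include <cstdint>

// Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
static const uint32_t kNtpToUnixOffset = 2208988800UL;

// Hypothetical helper: reads the integer part of the Transmit Timestamp
// (bytes 40..43, big-endian) from an NTP reply and converts it to Unix
// seconds. Returns false if the buffer is too short or the timestamp is
// earlier than 1970, which also rejects an all-zero (unset) timestamp.
// Assumes NTP era 0, i.e. timestamps before the 2036 rollover.
bool ntpTransmitToUnix(const uint8_t* packet, size_t len, uint32_t* unixSeconds) {
    if (packet == nullptr || unixSeconds == nullptr || len < 48) {
        return false;
    }
    uint32_t ntpSeconds = (static_cast<uint32_t>(packet[40]) << 24) |
                          (static_cast<uint32_t>(packet[41]) << 16) |
                          (static_cast<uint32_t>(packet[42]) << 8) |
                          static_cast<uint32_t>(packet[43]);
    if (ntpSeconds < kNtpToUnixOffset) {
        return false;
    }
    *unixSeconds = ntpSeconds - kNtpToUnixOffset;
    return true;
}
```

On the device, the bytes would come from udp.read(packetBuffer, NTP_PACKET_SIZE) after parsePacket() reports a full packet.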
By TD-er
#78669 I don't have a solution, but I will follow this with a lot of interest.
For ESPeasy, I have been trying for a long time to find the cause of these lockups.
I do know that the ESP8266 will never reboot itself if it was not reset after being flashed over the serial port (OTA updates are different). So if a node crashes before it has had a proper reset/reboot after flashing, it will halt.
But for those nodes that did get a proper reset/restart after flashing, I see a lot of watchdog restarts.
I have already added an explicit timeout setting to most WiFiClient instances, but it doesn't seem to make a difference.
It looks like there is some bug in the 2.4.x code.
I also tried 2.4.1 and 2.4.2, with LWIP 2.0 (low memory) and now LWIP 1.4 again, but the issue remains.