Left here for archival purposes.

By MKoehler
#15234
Eyal wrote: There is Good news and there is Bad news...

The Good News first.

I did a few builds from the period 20150118-24, going back in time.

I am running the dsleep loop I presented yesterday as a test. It has proved to be a good stress test so far, crashing within a few minutes.

The image from the end of the 24th crashes. We knew this.
The image from the end of the 23rd crashes.
The image from the end of the 22nd works. (A few hours later it was shown as dated the 23rd...)

I then tested the other patches from the 23rd. The last patch that works reliably is
VIP6(xingyuewang,http://517513.cn) commit si7021 module.
https://github.com/nodemcu/nodemcu-firm ... e81d8c0b8b


The first one that fails is
merge mqtt branch to master and build pre_build bin
https://github.com/nodemcu/nodemcu-firm ... dd54cadedd


The Bad News:

I cannot see how this patch causes the failure. However, maybe the addition of a large module was enough to breach some memory map rule?

I am now testing the failing patch but with mqtt deselected in user_config.h. So far so good... 15 minutes... 20m...30m...[later]...1h... [even later] stopped after 2 hours (about 3000 loops).
Recall: this test does 25 dsleep/wakeup cycles per minute.

Closing Remarks

My app does not use mqtt (and still crashes), so the issue is probably not related specifically to this module. Furthermore, we do not know if the failure in January is the same as today's; we only know that the same exception occurs at the same location intermittently.

Next step: is there anyone reading this who has the skills to take this issue further? Now is the time to chime in...

Thanks


Hi,

If your little Lua test code is a reliable tool for diagnosing all cases of zombie mode, it gives the following table:

DeepSleep not OK                        OK
nodemcu_float_0.9.6-dev_20150406.bin    nodemcu_float_dev20150311.bin
nodemcu_float_0.9.6-dev_20150331.bin
nodemcu_float_0.9.5_20150318.bin
nodemcu_integer_0.9.5_20150318.bin

So at least we have advanced about 2 months. Also I have discovered that the four failing versions are all bigger than 403KB. Could it be just :evil: a memory issue?

Heinz-Georg
By Eyal
#15492 So far I see the same. I have built from source and see the same (always good to reproduce) so I am now slowly adjusting it to my preferred setup (removing many modules). If I remove all unneeded modules then it starts the wakeup failure again so this is a binary search here, and each test that does not fail in a minute needs to run for a few hours to be taken as "probably good".
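
The "probably good" judgement can be made a little more concrete. This is a rough sketch in plain Lua (meant for a desktop interpreter, not the ESP), assuming each dsleep/wakeup cycle of a bad image fails independently with a fixed probability p_fail. The independence model and the "about 3 minutes" figure are assumptions for illustration, not measurements from this thread; only the 25 cycles per minute rate comes from the quoted post.

```lua
-- Chance that a bad image survives n clean cycles, if each cycle
-- fails independently with probability p_fail (assumed model).
local function survival(p_fail, n)
  return (1 - p_fail) ^ n
end

-- Clean cycles needed before a bad image would have been caught
-- with the given confidence (e.g. 0.999).
local function cycles_needed(p_fail, confidence)
  return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))
end

local cycles_per_minute = 25                -- rate quoted earlier in the thread
local p_fail = 1 / (3 * cycles_per_minute)  -- assumed: a bad image lasts ~3 minutes on average
local n = cycles_needed(p_fail, 0.999)
print(string.format("%d clean cycles, about %.0f minutes", n, n / cycles_per_minute))
```

The rarer the failure, the longer the soak has to be: the number of clean cycles needed grows roughly in proportion to 1/p_fail.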

So it does not seem to be a size issue. My (failing) module is smaller in fw size and memory footprint than the working one. Maybe if too many modules are removed (as I do) the boot goes a bit faster and does not allow enough time for some startup settling of some signals?

It will be a while before I have a clearer picture, but right now I can run my wireless sensors with the known good fw. I am leaving two versions running overnight now.

BTW, I enable/disable modules in app/include/user_modules.h.

cheers

[26 Apr update] Maybe someone is still interested... I have been running many tests for a few days now. Tracking the dev branch, I found that things kept working after the 20150311 release until the 18th, and stopped with this commit:
https://github.com/nodemcu/nodemcu-firm ... a78ff5f72e
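
Narrowing down a first failing commit like this is a bisection over the ordered build history. A minimal sketch in plain Lua, assuming there is a single good-to-bad transition; `is_good` is a hypothetical callback standing in for "flash this build and run the dsleep loop long enough to trust the result", and the build names are placeholders, not real commits.

```lua
-- builds: array ordered oldest to newest.
-- Precondition: builds[1] is known good, builds[#builds] is known bad.
-- Returns the first bad build and the last good one before it.
local function first_bad(builds, is_good)
  local lo, hi = 1, #builds        -- invariant: builds[lo] good, builds[hi] bad
  while hi - lo > 1 do
    local mid = math.floor((lo + hi) / 2)
    if is_good(builds[mid]) then
      lo = mid                     -- transition is later than mid
    else
      hi = mid                     -- transition is at mid or earlier
    end
  end
  return builds[hi], builds[lo]
end

-- Placeholder history: everything from "d" onwards fails.
local history = { "a", "b", "c", "d", "e" }
local bad, good = first_bad(history, function(b) return b < "d" end)
print("first bad:", bad, "last good:", good)
```

Because the failure is intermittent, a "good" verdict from a short run is unreliable, so each bisection step is only as trustworthy as the soak test behind it.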

It does not look like a size problem: I have had both larger and smaller images working and failing.

It does look like a timing issue, though. At some point, to see what is going on, I enabled DEVELOP_VERSION in /app/include/user_config.h. The result was that
- it is very chatty [bad]
- the uart stays at 74880 (not 9600) [good]
- it does not crash [very good]

I then built HEAD of dev with similar results. Fails as is and works with DEVELOP_VERSION.

My latest test adds a one-line display at the start, and one can see how much longer it takes to reach the app.

My current test program:

Code:
now = tmr.now()/1000000
heap = node.heap()
count = collectgarbage("count")*1024
print (string.format("%.3f: used %d, heap %d", now, count, heap))

magic_pin = 1   -- gpio5, LOW on this pin will stop the program

gpio.mode (magic_pin, gpio.INPUT, gpio.PULLUP);
if 0 == gpio.read (magic_pin) then
   print ("aborting by magic")
else
   print ("will wake up in 2s\n")
   node.dsleep(2*1000000, 4)
end

The display using HEAD of dev as is

Code:
0.082: used 8060, heap 15856

and with debug enabled

Code:
0.102: used 8015, heap 10936


More than 10 times slower. And a lot of heap lost, so I may not be able to use this hack to run my app (but I will try).
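
To compare such status lines without reading them by eye, they can be parsed back into numbers. A small sketch in plain Lua (desktop interpreter), assuming exactly the line format printed by the test program; `parse_status` is a name made up here, not part of NodeMCU.

```lua
-- Parse a line such as "0.082: used 8060, heap 15856" into a table.
-- Returns nil if the line does not match the expected format.
local function parse_status(line)
  local t, used, heap = line:match("^([%d%.]+): used (%d+), heap (%d+)")
  if not t then return nil end
  return { time = tonumber(t), used = tonumber(used), heap = tonumber(heap) }
end

local normal     = parse_status("0.082: used 8060, heap 15856")
local with_debug = parse_status("0.102: used 8015, heap 10936")

-- Free heap given up by enabling the debug build, in bytes.
print("heap lost with debug enabled:", normal.heap - with_debug.heap)
```

Logging one such line per wakeup and parsing the captured serial output this way makes it easy to spot drift in heap or start-up time across thousands of cycles.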

I suspect that the debug prints allow some timing to go "right". Maybe the flash speed measurement? Waiting for a gpio to settle? I am guessing.

Any help from the developers will be appreciated.

cheers
By delinend
#15700 Wow! I have tried many hardware things: pull-up/pull-down resistors on GPIO0, GPIO2, GPIO15 and RST, and also a 470 pF capacitor from RST to GND. All without luck :cry:
My ESP-07 goes into zombie mode after 5-8 minutes with a 1 minute deep sleep.

But then I tried flashing "INTERNAL://DEFAULT" and "INTERNAL://BLANK" via NodeMCUflasher, and now my ESP-07 has run for 5 days in a 1 minute deep sleep loop without any problems :D I have also removed all pull-ups/pull-downs from my board, leaving only a 12 kOhm pull-down on GPIO15.
(Select the 0x7C000 default and 0x7E000 blank sections. Uncheck the box for the 0x10000 IROM section.)
I followed this: http://benlo.com/esp8266/esp8266QuickStart.html

Please try it!