Double-Decker Stacked ESPs for Fault Tolerance?

Re: Double-Decker Stacked ESPs for Fault Tolerance? #57866

By Oldbod - Wed Nov 09, 2016 2:37 am

Posts: 151
Joined: Sun Feb 14, 2016 5:21 am

Some great ideas here. Just a thought, but beware of unintended effects. As a system becomes more complex, a perceived loss of heartbeat, say, can trigger a restart even when everything except the heartbeat-generating or monitoring device is working fine. As these are purpose-built systems the builders can easily avoid the issue, but I have painful memories of the effects of too-clever monitoring. My general thought would be to start by generating alerts, and to make sure any restart first saves enough information about its cause. "It's restarting but I don't know why" is, most of the time, exactly where you don't want to be...
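The "alert first, restart later, and record why" advice can be sketched as a small escalation function. This is a minimal sketch, assuming the caller already counts consecutive missed heartbeats; the function name, the thresholds and the three callbacks (`save_cause`, `alert`, `restart`) are hypothetical. On a real device `save_cause` would write to flash or RTC memory before the reset.

```python
from typing import Callable


def on_missed_heartbeats(missed: int, alert_after: int, restart_after: int,
                         save_cause: Callable[[str], None],
                         alert: Callable[[str], None],
                         restart: Callable[[], None]) -> str:
    """Escalate gradually: report first, and only restart after recording why.

    All names and thresholds here are illustrative assumptions.
    """
    if missed >= restart_after:
        # Persist the cause *before* resetting, so the reboot can be explained later.
        save_cause(f"restart: {missed} consecutive heartbeats missed")
        restart()
        return "restart"
    if missed >= alert_after:
        # A late heartbeat alone only raises an alert; nothing is reset yet.
        alert(f"warning: {missed} heartbeats missed")
        return "alert"
    return "ok"
```

Keeping a gap between `alert_after` and `restart_after` gives a human (or a log) a chance to notice that it is the monitoring link, not the monitored device, that has failed.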

Re: Double-Decker Stacked ESPs for Fault Tolerance? #57873

By Electroguard - Wed Nov 09, 2016 7:36 am

Posts: 590
Joined: Fri Jul 17, 2015 4:26 am

Yeah, I started it with eyes open to avoid those sorts of problems, Oldbod, and managed to get the proof-of-concept working just fine... I got a bit carried away with things, though.

Basically, I used 2 identical scripts running on conjoined twins that share parallel-connected GPIOs and a couple of cross-connects (GPIO to reset, and serial2 TX to RX and vice versa). Random boot delays ensure that one always boots before the other. The first to time out on unanswered pings will issue an 'intention' to become 'Main' and then, if not overruled during a 'decency' timeout that allows for any late Main responses, will PROMOTE itself as 'Main'. The other twin will then boot up as the Main's 'Shadow', to periodically check that Main is still OK.
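The boot-time election can be simulated heartbeat by heartbeat. This is a minimal sketch under stated assumptions: time advances in whole heartbeats, each twin "hears" the other's intention or Main status at the start of the next heartbeat, and the twin names `"A"`/`"B"`, the state names and the default timeouts are hypothetical. The post relies on random boot delays to avoid simultaneous claims; the name-based tie-break here is an added safeguard, not part of the original scheme.

```python
def run_election(boot_delay, ping_timeout=3, decency=2, max_ticks=50):
    """Simulate twins booting after `boot_delay[name]` heartbeats and electing a Main."""
    state = {n: "off" for n in boot_delay}
    count = dict(boot_delay)  # per-twin down counter
    for _ in range(max_ticks):
        # What each twin can "hear" at the start of this heartbeat.
        claims = {n: s for n, s in state.items() if s in ("intent", "main")}
        for n in sorted(state):
            rivals = [m for m in claims if m != n]
            if state[n] == "off":
                count[n] -= 1
                if count[n] <= 0:
                    state[n], count[n] = "pinging", ping_timeout
            elif state[n] == "pinging":
                if rivals:
                    # A Main (or would-be Main) is already present: become its Shadow.
                    state[n] = "shadow"
                else:
                    count[n] -= 1
                    if count[n] <= 0:
                        # Pings went unanswered: announce the intention to become Main.
                        state[n], count[n] = "intent", decency
            elif state[n] == "intent":
                # Overruled by an existing Main, or by a rival intention (tie broken by name).
                if any(claims[m] == "main" or m < n for m in rivals):
                    state[n] = "shadow"
                else:
                    count[n] -= 1
                    if count[n] <= 0:
                        state[n] = "main"  # decency timeout passed unopposed: PROMOTE
        if all(s in ("main", "shadow") for s in state.values()):
            break
    return state
```

In use, each twin's delay would come from something like `random.randint(1, 10)`; whichever twin finishes its ping and decency timeouts first ends up as Main, and the late booter sees that claim and settles as Shadow.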

So it is not up to the 'Main' to send keep-alives; it's up to the 'Shadow' to periodically PING for the live presence of 'Main'.

It was sufficient for my evaluation purposes to just send serial2 PINGs and check for a response, but to protect against Wi-Fi disconnections it would probably be safer to test for the presence of a 'Main' webpage control variable.

The 'heartbeat' is just a local 1-second timer on each twin, but each tick jumps to a Pulse subroutine that checks and decrements a load of different down counters, then branches appropriately when any of them hits zero. So it effectively gives multiple different Timer branches, depending on the number of elapsed heartbeats.
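The one-timer, many-counters idea can be sketched as a small class. This is a minimal sketch, not the actual Esp_Basic script: the class name `Pulse`, the counter names and the callback style are assumptions. One tick decrements every armed counter and runs the handler of each counter that reaches zero.

```python
from typing import Callable, Dict


class Pulse:
    """Drive many named down counters from a single periodic tick (a sketch)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.handlers: Dict[str, Callable[[], None]] = {}

    def start(self, name: str, beats: int, handler: Callable[[], None]) -> None:
        """Arm (or re-arm) a counter that fires `handler` after `beats` ticks."""
        self.counters[name] = beats
        self.handlers[name] = handler

    def cancel(self, name: str) -> None:
        self.counters.pop(name, None)
        self.handlers.pop(name, None)

    def tick(self) -> None:
        """Call once per heartbeat (e.g. from a 1-second timer)."""
        expired = []
        for name in list(self.counters):
            self.counters[name] -= 1
            if self.counters[name] <= 0:
                expired.append(name)
        for name in expired:
            # An earlier handler may have cancelled or re-armed this counter.
            if name in self.counters and self.counters[name] <= 0:
                del self.counters[name]
                self.handlers.pop(name)()
```

Decrementing everything first and firing handlers afterwards means a handler can safely re-arm itself or cancel another counter within the same tick.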

That allows the 'Shadow' to wait a 'between pings' delay, make a specified number of retry pings to the 'Main', and wait a specified number of heartbeats for a response before each retry. Then, when all benefit of the doubt has gone, the 'Shadow' can blip the 'Main' twin's reset pin, wait a couple of heartbeats to ensure it is in reset and its GPIOs are floating, then promote itself as the new 'Main'.
If the old 'Main' boots back up, it will see the existing 'Main' and therefore become the new 'Shadow', periodically pinging and allowing the assigned number of unanswered, time-delayed 'Main' ping retries before attempting to promote itself again.
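The Shadow's side of that sequence can be sketched as a per-heartbeat state machine. This is a minimal sketch, assuming one call to `tick()` per heartbeat; the parameter names (`between_pings`, `reply_wait`, `retries`, `reset_settle`) and the returned action strings (`"PING"`, `"RESET_MAIN"`, `"PROMOTE"`) are hypothetical. Driving the actual reset GPIO and sending the ping are left to the caller.

```python
class ShadowMonitor:
    """Shadow side of the Main/Shadow scheme, advanced once per heartbeat (a sketch)."""

    def __init__(self, between_pings=10, reply_wait=2, retries=3, reset_settle=2):
        self.between_pings = between_pings  # heartbeats between routine pings
        self.reply_wait = reply_wait        # heartbeats to wait for each answer
        self.retries = retries              # extra pings before giving up on Main
        self.reset_settle = reset_settle    # heartbeats to hold Main in reset
        self.phase, self.count, self.left = "idle", between_pings, retries

    def tick(self, reply_received=False):
        """Return the action to take on this heartbeat, or None."""
        if self.phase == "idle":
            self.count -= 1
            if self.count <= 0:
                self.phase, self.count, self.left = "awaiting", self.reply_wait, self.retries
                return "PING"
        elif self.phase == "awaiting":
            if reply_received:
                # Main is alive: go back to the between-pings delay.
                self.phase, self.count = "idle", self.between_pings
                return None
            self.count -= 1
            if self.count <= 0:
                if self.left > 0:
                    self.left -= 1
                    self.count = self.reply_wait
                    return "PING"
                # All benefit of the doubt gone: blip Main's reset pin.
                self.phase, self.count = "resetting", self.reset_settle
                return "RESET_MAIN"
        elif self.phase == "resetting":
            self.count -= 1
            if self.count <= 0:
                self.phase = "main"
                return "PROMOTE"
        return None
```

With `between_pings=2, reply_wait=1, retries=1, reset_settle=2` and no answers, six ticks yield `None, PING, PING, RESET_MAIN, None, PROMOTE`. A rebooted ex-Main would then find an existing Main and start its own instance of this monitor as the new Shadow.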

The PINGs and other commands such as DEMOTE, REBOOT, END etc. are all issued via a 'Shout' subroutine, which sends outgoing commands using any enabled methods: serial, serial2, UDP, and the hardware cross-connected handshake.
I didn't cross-connect the serial pins, to prevent boot-up problems and to allow individual serial monitoring and control of each device.
Serial2 at 9600 baud allowed reliable communication of every command, and even allowed for acknowledgements to make doubly sure.
UDP allows any and all such nodes to be remotely monitored and controlled using any commands that have been included in the common vocabulary. It also makes room for a remote node, e.g. a 'Watchdog', that could similarly monitor the 'Main' and hard-reboot it whenever necessary, or perhaps a 'Secondary' node which might act as a stand-in for the 'Main'.
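The 'Shout' fan-out and the shared vocabulary can be sketched as follows. This is a minimal sketch, assuming each channel is just a callable that transmits one text line; the channel names, the `"COMMAND arg"` line format, the `Shout` class and the `parse` helper are hypothetical, and the vocabulary lists only commands named in the thread.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Commands named in the thread; the wire format below is an assumption.
VOCABULARY = {"PING", "PROMOTE", "DEMOTE", "REBOOT", "END"}


class Shout:
    """Send one command over every enabled channel (serial, serial2, UDP, ...)."""

    def __init__(self) -> None:
        self.senders: Dict[str, Callable[[str], None]] = {}
        self.enabled: Dict[str, bool] = {}

    def add_channel(self, name: str, send: Callable[[str], None],
                    enabled: bool = True) -> None:
        self.senders[name] = send
        self.enabled[name] = enabled

    def __call__(self, command: str, arg: str = "") -> List[str]:
        """Transmit the command on all enabled channels; return their names."""
        if command not in VOCABULARY:
            raise ValueError(f"not in vocabulary: {command}")
        line = f"{command} {arg}".strip()
        sent = []
        for name, send in self.senders.items():
            if self.enabled[name]:
                send(line)
                sent.append(name)
        return sent


def parse(line: str) -> Optional[Tuple[str, str]]:
    """Accept an incoming line (e.g. from UDP) only if it starts with a known command."""
    parts = line.strip().split(maxsplit=1)
    if not parts or parts[0] not in VOCABULARY:
        return None
    return parts[0], (parts[1] if len(parts) > 1 else "")
```

Because every channel shares one vocabulary and one parser, a remote 'Watchdog' on UDP would be able to issue exactly the same commands as the twin on serial2.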

It was all breadboarded proof-of-concept, though; I haven't actually soldered up any double-decker ESPs yet.

It gives me another functional tool for my increasing 'bag of bits' collection, but I'm beginning to doubt I'll be able to assemble the required modules into specific useful units, simply because of the 100 vars + branches limitation... that gives fewer than 50 global commands in the vocabulary with corresponding subroutine branches.

It sounds like a lot until you want to speak individual words to a voice chip as acknowledgements for infra-red remote control button presses, etc.

So I've got an 'uneasy feeling' that I might end up having to use Esp_Basic just as a fault-tolerant 'Main/Shadow' comms interface to serially attached Arduinos, which do the actual needed functionality, which is why I ended up exploring the situation in greater depth.

ESP8266 Community Forum