By sfranzyshen
#57195 I just finished pushing code to the no-timing branch for more testing ... here is what has been done ...
- made the code loop friendly by breaking up scancb & connectToBestAP() using the _scanStatus flag
- eliminated the os_timer for scans ... scanning is not critical enough to need to be interrupt driven
- added an even simpler application (testHere.ino) ... trying to eliminate application bugs as a problem
- handle the case where the AP's tcp connection times out (drops) the STA -> if wifi is still good, try to reconnect with tcpConnect()

I am only testing this on two nodes (for now) ... changes will be added to the devel and master branches if all tests good here :mrgreen:
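To illustrate the "loop friendly" idea, here is a minimal sketch of a scan state machine that is polled from loop() instead of being armed with an os_timer. It is not the code in the no-timing branch: the `ScanState` struct, the `SCANNING` state, the `updateScan()` function and the microsecond unit for the interval are all assumptions made for the example. Only the `IDLE` and `RESCAN` states, the `_scanStatus` / `_lastScanned` idea and the 5 second rescan interval come from the posts in this thread.

```cpp
#include <cstdint>

// Hypothetical stand-ins; the real easyMesh names and values may differ.
enum ScanStatus { IDLE, SCANNING, RESCAN };

// 5 s rescan interval, assumed here to be expressed in microseconds.
const uint32_t SCAN_INTERVAL_US = 5000000;

struct ScanState {
    ScanStatus status = RESCAN;
    uint32_t lastScanned = 0;  // node time when the last scan ended
    int scansStarted = 0;      // counter kept only for demonstration
};

// Called on every pass through loop(); it never blocks and needs no timer.
void updateScan(ScanState &s, uint32_t now, bool scanFinished) {
    switch (s.status) {
    case RESCAN:
        // Unsigned subtraction stays correct when the time counter wraps.
        if (now - s.lastScanned >= SCAN_INTERVAL_US) {
            s.status = SCANNING;  // real code would start the wifi scan here
            s.scansStarted++;
        }
        break;
    case SCANNING:
        // In real code the scan-done callback would set this flag.
        if (scanFinished) {
            s.lastScanned = now;
            s.status = RESCAN;    // or IDLE once a connection is made
        }
        break;
    case IDLE:
        break;                    // scanning disabled, nothing to do
    }
}
```

Calling `updateScan()` once per loop() pass with the current node time means a scan only starts when the interval has elapsed, and the rest of loop() keeps running in between.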

UPDATE: I already found a problem ... I'll pick it up again tomorrow :(
Code:
handleNodeSync(): Already connected to node 0.  Dropping
By sfranzyshen
#57232
sfranzyshen wrote:
I just finished pushing code to the no-timing branch for more testing ... here is what has been done ...
- made the code loop friendly by breaking up scancb & connectToBestAP() using the _scanStatus flag
- eliminated the os_timer for scans ... scanning is not critical enough to need to be interrupt driven
- added an even simpler application (testHere.ino) ... trying to eliminate application bugs as a problem
- handle the case where the AP's tcp connection times out (drops) the STA -> if wifi is still good, try to reconnect with tcpConnect()

I am only testing this on two nodes (for now) ... changes will be added to the devel and master branches if all tests good here :mrgreen:

UPDATE: I already found a problem ... I'll pick it up again tomorrow :(
Code:
handleNodeSync(): Already connected to node 0.  Dropping

... this is the core problem with easyMesh ...

OK, so after stripping this code down to the bare bones (minimum application layer, just connection and nodesync code, and two nodes) I still hit timeouts on the AP during wifi scanning. The STA then disconnects and reconnects ... shaking everything up ... generating a lot of messages ... eventually overloading the sendQueue ... leading to wdt resets. Some nodes recover from wdt resets ... but not always.

... if I disable wifi scanning on the AP node ... I do not reach timeouts ...

I have pushed these changes (a hack) up to the no-timing branch if anyone else wants to experiment with it ... it's here ... https://github.com/sfranzyshen/easyMesh/tree/no-timing ... To stop one node (the intended AP node) from scanning the network, see the code in easyMeshSTA.cpp ... and just uncomment the IDLE line ...
Code:
    if ( staticThis->_meshAPs.empty() ) {  // no meshNodes left in most recent scan
        //      debugMsg( GENERAL, "connectToBestAP(): no nodes left in list\n");
        // wait 5 seconds and rescan;
        debugMsg( CONNECTION, "connectToBestAP(): no nodes left in list, rescanning\n");
//        os_timer_setfn( &_scanTimer, scanTimerCallback, NULL );
//        os_timer_arm( &_scanTimer, SCAN_INTERVAL, 0 );
        _lastScanned = staticThis->getNodeTime();
        _scanStatus = RESCAN;
//        _scanStatus = IDLE; //un-remark this to disable rescanning over and over on AP ... for test
        return false;
    }


So NOW the problem is how to handle messaging during scans.
By sfranzyshen
#57296 Since we are tied to the esp8266 (for now) we only have one radio to work with. The STA and AP are always set to the same channel: if the STA changes channel to connect to another AP, its own AP channel changes too. So when we scan all 14 wifi channels we are switching the radio away from the current channel (the one we are messaging over) to the other channels to scan for APs ... when done, we return to our current (mesh) channel and try to catch up with the incoming messages we missed and the outgoing messages that were queued up. It's during this period in the protocol that things don't always catch up ... connected STAs time out ... and it breaks.

As of right now the protocol dictates that any node connected as a STA to another node's AP doesn't re-scan the wifi network. Yet if a node doesn't have a STA connection to another node's AP, it will re-scan the wifi network every 5 sec (SCAN_INTERVAL), looking for a node that is not already part of the mesh to connect to. This means that at least one node of any easyMesh mesh will perpetually be in this scanning loop ... doomed to fail.

So we need to add some controls into the protocol to handle messaging during wifi scanning. I feel that once this problem is addressed we will finally be able to create something stable. Here is a list of mechanisms I'm playing with to try and address this issue ... all feedback is welcome ...

- limit scans to mesh wifi channel & ssid (already in devel branch)
- scan one channel at a time ... returning back to the mesh channel in between each scan
- stop outgoing message sending before scans
- handle timeouts of incoming messages during scans
- notify other nodes about scans
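Two of the ideas in the list (scan one channel at a time with a return to the mesh channel in between, and stop outgoing sends before a scan) could be sketched roughly like this. It is a host-side illustration only, with no radio calls: `ScanStep`, `buildScanPlan()` and the `sendAllowed` flag are hypothetical names, not part of easyMesh or the ESP8266 SDK.

```cpp
#include <cstdint>
#include <vector>

// One step of a scan plan: which channel the radio sits on, and whether
// outgoing mesh messages may be sent during that step.
struct ScanStep {
    uint8_t channel;
    bool sendAllowed;
};

// Build a plan that scans a single channel at a time and returns to the
// mesh channel after each one. Hypothetical helper, for illustration only.
std::vector<ScanStep> buildScanPlan(uint8_t meshChannel, uint8_t maxChannel) {
    std::vector<ScanStep> plan;
    for (int ch = 1; ch <= maxChannel; ++ch) {
        // Scanning: hold outgoing messages in the queue.
        plan.push_back({static_cast<uint8_t>(ch), false});
        // Back on the mesh channel: drain the queue, catch up on incoming.
        plan.push_back({meshChannel, true});
    }
    return plan;
}
```

With 14 channels this gives 28 steps, and the radio is never away from the mesh channel for longer than one single-channel scan. Whether that window is short enough to avoid the AP timeouts is exactly what would need testing on real nodes.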
By sfranzyshen
#57317 I just pushed the latest changes to the no-timing branch. They limit scans to the mesh wifi channel only (set to 1) and kind of handle timeouts during scans ... I also added feedback for when messages get queued ... I'm running the testHere example (stripped-down app layer) with two nodes ... so far ... no drops, timeouts, queued messages, or wdt resets ... of course this is only two nodes ... running the simplest amount of code ... so do any of you with multi-node setups out there want to run it for a bit? I'm going to move up to three whole nodes next :) and maybe even ... run the startHere example !
https://github.com/sfranzyshen/easyMesh/tree/no-timing
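For anyone wondering what "handling timeouts during scans" might look like, here is one possible approach, sketched under heavy assumptions: time spent scanning is simply not counted against a connection's timeout. This is not necessarily what the no-timing branch does. `ConnTimer`, `NODE_TIMEOUT_US`, `onMessage()`, `onScanFinished()` and `timedOut()` are all hypothetical names and values invented for the example.

```cpp
#include <cstdint>

// Hypothetical per-connection bookkeeping; not the easyMesh data structure.
struct ConnTimer {
    uint32_t lastReceived;  // node time of the last message from the peer
    uint32_t scanCredit;    // time spent scanning since lastReceived
};

// Assumed timeout of 3 s in microseconds, chosen only for illustration.
const uint32_t NODE_TIMEOUT_US = 3000000;

// A message arrived: restart the silence window and clear the credit.
void onMessage(ConnTimer &c, uint32_t now) {
    c.lastReceived = now;
    c.scanCredit = 0;
}

// A scan ended: remember how long the radio was busy scanning.
void onScanFinished(ConnTimer &c, uint32_t scanDuration) {
    c.scanCredit += scanDuration;
}

// Only silence that was not spent scanning counts toward the timeout.
bool timedOut(const ConnTimer &c, uint32_t now) {
    uint32_t silent = now - c.lastReceived;  // wrap-safe unsigned math
    uint32_t counted = silent > c.scanCredit ? silent - c.scanCredit : 0;
    return counted > NODE_TIMEOUT_US;
}
```

The trade-off is that a peer that really has gone away is detected later, by roughly the amount of time spent scanning.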

UPDATE: I have now been running it for over 6 hours without a single dropped connection, timeout, queued message, or wdt reset. It seems that as long as we stay on the same channel we don't miss incoming data ... and outgoing data doesn't seem to be bothered by a same-channel scan either ... and as long as this holds up with many nodes (4+) ... this might mean we're ready to move on to the timesync stuff ... "time" to step it up :D