You might argue that this is 'expensive' compared to throwing together something based around an ESP8266, but I think the ESP32 is the right tool and the total solution is already available off the shelf. I don't know how much work the CPU/hardware needs to do, but my gut feeling is that the ESP8266 has neither the compute power nor the number (and resolution) of analogue inputs to do the job.
Does your question stem from reading that support is coming for the ESP8266?
torntrousers wrote: I was looking in this area a few months ago. It seems not quite there yet for the ESP8266. I think it should theoretically be possible: it's got the I2S input, so you can easily connect an I2S mic (for example) directly to it, and the CPU power should be enough for a neural network doing voice keyword spotting. The obvious code to do that would be TensorFlow, and there is an experimental version of that for microcontrollers, but it's very new and I've not even been able to get it to compile cleanly yet. Espressif has the ESP-ADF for voice stuff on the ESP32, but that is also pretty recent and doesn't support custom keyword spotting yet; that is coming in March this year, maybe. I gave up and ended up using a Raspberry Pi Zero to make a non-internet-connected voice-controlled light switch.
Thank you for this information!