I am seeing behavior from my AM2315 similar to the ones described here (http://www.switchdoc.com/2016/04/am2315-highly-reliable-esp8266-arduino-driver/).
My OurWeather is reporting version 032. I am hitting the "/fullDataString" endpoint every sixty seconds and recording the information into a CSV file. As described in the article, only a power-cycle of the OurWeather board fixes the issue. I've seen this about every 36-48 hours over the last week. The sensor is mounted inside the 3D-printed radiation shield and we are in bone-dry Northern California.
I'm recompiling my server to add +/- 10% jitter to the sample interval, but would like to know if there are other options to reduce the frequency of this issue. My OurWeather is located at the top of a fence, so power cycles are problematic.
Thanks for the quick response. I had the station on for a while and this didn't happen; it began happening once I started hitting the web service every 60 seconds as opposed to when I was hitting it intermittently during my dev cycle.
1/The first time it happened reset had no effect and so I power-cycled. Since then, and after reading the whitepaper, I've always power-cycled. Will try reset next time it happens. 2/Will do - adjusting my service to log nan responses and will post upon the next recurrence. 3/I'll bring it in from the sun and run it with debug enabled and grab the logs.
Thanks! A quick update- Overnight I got a new, previously unseen failure mode. Both the OurWeather and my server process locked up during a RPC response. The LCD was frozen and using a different device timed out as well. Reset button cleared it up, and, interestingly enough, unfroze the server. I have the logs for that but nothing particularly interesting - it merely ends with 4x "bcn_timout,ap_probe_send_start".
I have updated from GitHub and am running the 033 version you published yesterday with debug logging being captured. Will let you know how things go after a day or so.
One nit: Whenever I update from GitHub I have to do some manual edits to fix header file names - I'm using Ubuntu and it's a case-preserving, case-sensitive file system, so Crc16.h, elapsedMillis.h, and Time.h are all technically mis-spelled in some of the source files (those forms are the mis-spelled ones - correct versions are all lower-case). If I have time I'll send you a PR with a fix for those.
OK- It appears that it went into NaN mode again last night; a reset did not fix it. Unfortunately the logs rolled over in the Arduino serial monitor before I checked it this morning. I've disabled a bunch of the verbose memory leak log statements and will be more diligent in checking it.
OK - Managed to get the first part of a repro. I've captured the part of the debug logs where the badReads go non-zero. This one was interesting in that only the humidity and dew point went 'nan'. A reset did not fix it but a power cycle did. There was a REST call around the time the sensor went bad; I wonder if the read made by the REST call interfered with a read made by the cycling of the LCD? The RTC time returned by OurWeather on the last good call to 'fulldatastring' was 08/19/2018 20:32:12. The next REST call (calls are made every 60 seconds +/- 5 seconds of jitter) returned RTC time of 8/19/2018 20:33:10 with the first set of 'nan' values. Between those two calls it looks like an LCD cycle began at 20:32:18 and the AM2315 was still good.
re: verbose memory leak log statements - I will need to uncomment them, but every time I hit the "fulldatastring" REST endpoint about 30-40 lines of log statements are issued - one for each value in the JSON response and the available heap afterwards. Here is a quick excerpt from an older log file:
Sending command Command: u State: u State of buffer at the start: OUTPUTBUFFERSIZE=2000 Added to buffer: EnglishOrMetric Added to buffer: 0 Added to buffer: AirQualitySensor Added to buffer: 2816 Added to buffer: ESP8266HeapSize Added to buffer: 27728 Added to buffer: OurWeatherTime Added to buffer: 08/13/2018 21:46:06 Added to buffer: FullDataString Added to buffer: 22.10,50.40,25.26,101353.00,107.14,0.00,0.00,270.00,0.70,0.00,0.00,0.00,0.00,270.00,270.00,0,08/13/2018 21:46:06,StageCoach North,2816,4,4.16,6.80,0.00,0.00,4.98,3.20,0.00,0.00,0.00,0.00,0.00,0.00,V:1,WXLMB ,0,,,0,,,0 Added to buffer: FirmwareVersion Added to buffer: 032 Added to buffer: ThunderBoardLast
They eat up the buffer in the Arduino IDE so I commented them out. Not a big deal unless you're trying to capture a repro that only happens every day or so.
>>I want you to run an experiment. Change your web call to every hour and let's see what happens! I think we have the web interface interrupting an I2C call and hosing things. Maybe.
What do we think we're going to learn with this experiment? Right now, we have a repro case every 12-18 hours. If I drop the web interface to 1/60th of the frequency, we're not going to know anything for 30-60 days, assuming we have a race condition on the bus. More interesting, IMO, would be to up the web call frequency to constant (i.e. as soon as one web service call finishes, start the next one). If it happens proportionally more often, then we have a better understanding.
I tend to agree with you that there is some kind of interference on the bus. How hard would it be to put a mutex on the bus (most elegant), or, barring that, make an RPC path that exclusively returns the most recently returned values?
For the second fix, since the code that cycles through the LCD is gathering all of the data from the sensors every ~5-10 seconds, why not return those values and not hit the bus at all?
Drop it to 1/2 the frequency then. This is great verification that this is the problem.
Now, here is the problem. The REST interface is not running in a thread that we are controlling, so were to put a semaphore or mutex is a bit of a quandary. This may very well be an interrupt problem from the WiFi interface. If it is, then I have some issues to resolve. Disabling interrupts in this kind of a system is a real potential problem. And I don't even know (as of this moment) wether the I2C bus will function with the interrupts disabled. Many questions will be answered soon!
I'm looking at it this week, but the place that is interfering may be buried deep in the WiFi operating system.