|
Post by edharris on Aug 13, 2018 8:02:19 GMT -8
Hello-
I am seeing behavior from my AM2315 similar to the ones described here (http://www.switchdoc.com/2016/04/am2315-highly-reliable-esp8266-arduino-driver/).
My OurWeather is reporting version 032. I am hitting the "/fullDataString" endpoint every sixty seconds and recording the information into a CSV file. As described in the article, only a power-cycle of the OurWeather board fixes the issue. I've seen this about every 36-48 hours over the last week. The sensor is mounted inside the 3D-printed radiation shield and we are in bone-dry Northern California.
I'm recompiling my server to add +/- 10% jitter to the sample interval, but would like to know if there are other options to reduce the frequency of this issue. My OurWeather is located at the top of a fence, so power cycles are problematic.
|
|
|
Post by SDL on Aug 13, 2018 10:37:35 GMT -8
Hi Ed,
Does it take a power cycle or can you hit the reset button and the software and have it start reading again? This is a key piece of information.
You are getting NaN from the AM2315 for both the temperature and humidity? Post a fullDataString with the result.
Is this a new phenomena? How long have you had the station up?
Any chance of hooking up a serial port, recompiling the software and turning debug on? Then we could capture the result (plus the AM2315 driver gives a lot of statistics).
It will be interesting to see if the jitter makes any real difference.
BP
|
|
|
Post by edharris on Aug 13, 2018 12:42:44 GMT -8
Thanks for the quick response. I had the station on for a while and this didn't happen; it began happening once I started hitting the web service every 60 seconds as opposed to when I was hitting it intermittently during my dev cycle.
1/The first time it happened reset had no effect and so I power-cycled. Since then, and after reading the whitepaper, I've always power-cycled. Will try reset next time it happens. 2/Will do - adjusting my service to log nan responses and will post upon the next recurrence. 3/I'll bring it in from the sun and run it with debug enabled and grab the logs.
-Ed
|
|
|
Post by SDL on Aug 13, 2018 13:39:47 GMT -8
Ed,
Excellent. This will help us figure out what is going on. Under debug, the AM2315 driver reports a bunch of stuff.
The REST Interface being hit is the critical piece of information here I think. I'm going to update the REST interface and roll out a version V033. Might fix a couple of problems.
John
|
|
|
Post by SDL on Aug 13, 2018 14:44:15 GMT -8
|
|
|
Post by edharris on Aug 14, 2018 7:31:49 GMT -8
Thanks! A quick update- Overnight I got a new, previously unseen failure mode. Both the OurWeather and my server process locked up during a RPC response. The LCD was frozen and using a different device timed out as well. Reset button cleared it up, and, interestingly enough, unfroze the server. I have the logs for that but nothing particularly interesting - it merely ends with 4x "bcn_timout,ap_probe_send_start".
I have updated from GitHub and am running the 033 version you published yesterday with debug logging being captured. Will let you know how things go after a day or so.
One nit: Whenever I update from GitHub I have to do some manual edits to fix header file names - I'm using Ubuntu and it's a case-preserving, case-sensitive file system, so Crc16.h, elapsedMillis.h, and Time.h are all technically mis-spelled in some of the source files (those forms are the mis-spelled ones - correct versions are all lower-case). If I have time I'll send you a PR with a fix for those.
|
|
|
Post by SDL on Aug 14, 2018 16:08:30 GMT -8
Ed,
Looking forward to the results of our test here. Interesting that the GItHub works fine on the Mac. Send a PM over and I will fix the spelling.
I'm waiting for your results before releasing V033 to the auto update system.
BP
|
|
|
Post by edharris on Aug 15, 2018 8:50:55 GMT -8
OK- It appears that it went into NaN mode again last night; a reset did not fix it. Unfortunately the logs rolled over in the Arduino serial monitor before I checked it this morning. I've disabled a bunch of the verbose memory leak log statements and will be more diligent in checking it.
-Ed
|
|
|
Post by SDL on Aug 16, 2018 9:15:36 GMT -8
Ed,
"I've disabled a bunch of the verbose memory leak log statements".
Give me an example of one of these. This is an interesting statement. String should no longer have those problems, if that is what to which you are referring.
Bummer on V033. It is the REST interface somehow. Our test system has been running for months. Time to start hammering the REST interface.
BP
|
|
|
Post by edharris on Aug 19, 2018 20:09:54 GMT -8
OK - Managed to get the first part of a repro. I've captured the part of the debug logs where the badReads go non-zero. This one was interesting in that only the humidity and dew point went 'nan'. A reset did not fix it but a power cycle did. There was a REST call around the time the sensor went bad; I wonder if the read made by the REST call interfered with a read made by the cycling of the LCD? The RTC time returned by OurWeather on the last good call to 'fulldatastring' was 08/19/2018 20:32:12. The next REST call (calls are made every 60 seconds +/- 5 seconds of jitter) returned RTC time of 8/19/2018 20:33:10 with the first set of 'nan' values. Between those two calls it looks like an LCD cycle began at 20:32:18 and the AM2315 was still good. re: verbose memory leak log statements - I will need to uncomment them, but every time I hit the "fulldatastring" REST endpoint about 30-40 lines of log statements are issued - one for each value in the JSON response and the available heap afterwards. Here is a quick excerpt from an older log file: Sending command Command: u State: u State of buffer at the start: OUTPUTBUFFERSIZE=2000 Added to buffer: EnglishOrMetric Added to buffer: 0 Added to buffer: AirQualitySensor Added to buffer: 2816 Added to buffer: ESP8266HeapSize Added to buffer: 27728 Added to buffer: OurWeatherTime Added to buffer: 08/13/2018 21:46:06 Added to buffer: FullDataString Added to buffer: 22.10,50.40,25.26,101353.00,107.14,0.00,0.00,270.00,0.70,0.00,0.00,0.00,0.00,270.00,270.00,0,08/13/2018 21:46:06,StageCoach North,2816,4,4.16,6.80,0.00,0.00,4.98,3.20,0.00,0.00,0.00,0.00,0.00,0.00,V:1,WXLMB ,0,,,0,,,0 Added to buffer: FirmwareVersion Added to buffer: 032 Added to buffer: ThunderBoardLast They eat up the buffer in the Arduino IDE so I commented them out. Not a big deal unless you're trying to capture a repro that only happens every day or so.
|
|
|
Post by edharris on Aug 29, 2018 5:24:51 GMT -8
BP-
Any thoughts? This continues to happen about every 24 hours with a one-minute call to the web service interface. Happy to add any additional logging you feel is necessary.
Thanks, -Ed
|
|
|
Post by SDL on Aug 30, 2018 10:51:20 GMT -8
Ed,
Every minute you call the web interface? Right?
I want you to run an experiment. Change your web call to every hour and let's see what happens! I think we have the web interface interrupting an I2C call and hosing things. Maybe.
BP
|
|
|
Post by edharris on Sept 3, 2018 15:52:16 GMT -8
>>I want you to run an experiment. Change your web call to every hour and let's see what happens! I think we have the web interface interrupting an I2C call and hosing things. Maybe. What do we think we're going to learn with this experiment? Right now, we have a repro case every 12-18 hours. If I drop the web interface to 1/60th of the frequency, we're not going to know anything for 30-60 days, assuming we have a race condition on the bus. More interesting, IMO, would be to up the web call frequency to constant (i.e. as soon as one web service call finishes, start the next one). If it happens proportionally more often, then we have a better understanding. I tend to agree with you that there is some kind of interference on the bus. How hard would it be to put a mutex on the bus (most elegant), or, barring that, make an RPC path that exclusively returns the most recently returned values? For the second fix, since the code that cycles through the LCD is gathering all of the data from the sensors every ~5-10 seconds, why not return those values and not hit the bus at all?
|
|
|
Post by SDL on Sept 4, 2018 19:30:07 GMT -8
Ed, Drop it to 1/2 the frequency then. This is great verification that this is the problem. Now, here is the problem. The REST interface is not running in a thread that we are controlling, so were to put a semaphore or mutex is a bit of a quandary. This may very well be an interrupt problem from the WiFi interface. If it is, then I have some issues to resolve. Disabling interrupts in this kind of a system is a real potential problem. And I don't even know (as of this moment) wether the I2C bus will function with the interrupts disabled. Many questions will be answered soon! I'm looking at it this week, but the place that is interfering may be buried deep in the WiFi operating system.  BP
|
|
|
Post by edharris on Sept 8, 2018 7:42:35 GMT -8
OK- On a two-minute polling interval the NaN issue repro'ed in 62 hours. It required a hard power-cycle to clear.
|
|