Friday, February 11, 2011

Wi-Fi board still crashing!

Well, looks like the Wi-Fi board crashed again during my last run, after running for less than an hour (about 3,000 seconds).

The strange thing is, when this happens (this last time, at least) the board rebooted itself and opened new server connections... But by then, the DE3 board firmware was totally hung up (probably in the serial library).

Now, I could try to deal with this by adding capabilities to detect the serial peer going down, and try to buffer up data so that we can stream it out when the connection is re-established.... But (a) that's a lot of extra complexity, and (b) the Wi-Fi board shouldn't be rebooting itself in the first place.

What else can I try? I can't put the Wi-Fi board inside a shielded enclosure (in case RF noise is causing the problem), because that would block the Wi-Fi signal, since it uses an embedded antenna.

Perhaps it is cosmic rays causing the crashes? But it will be hard to block those, too...

The mystifying thing, though, is that this spontaneous rebooting never happened until just recently, which makes me think that some change to the Wi-Fi script is triggering it... I added a little bit of stuff to the script (to handle the new pass-thru commands to the CTU), but I doubled the main stack size in case it was overflowing, and that didn't help with the problem... And the auxiliary stacks should already be plenty big enough (1,000 entries).

Here are the stack sizes I am using now:
  • AT+SET 42="256" - Program counter stack.
  • AT+SET 40="1000" - Space for simple variable stack frames.
  • AT+SET 41="1000" - Space for complex variable stack frames.
Personally, I don't understand why the program would need this much stack space anyway - my stacks just don't get that deep! So, I am doubting that this is really the problem... But, not sure what else to try at this point...

Aha! Some insights gained from watching the RX (data in) and RTS (flow control out) signals carefully on the scope:
  1. RTS is deasserted (raised) briefly after a fixed time delay after the end of a transmission; this is consistent with the EZURiO docs (this delay is set by the _UARTRCVTMO() function, and defaults to 255*4 = 1020 bit periods).

  2. It stays high for an amount of time that varies somewhat (possibly because of other threads) but is (normally) at least a certain minimum amount of time. This is also consistent with the EZURiO docs; this delay is set by the _UARTSLEEPCOUNT() function, and defaults to 400 bit periods.

  3. While RTS remains high (deasserted), usually the Nios UART core does not send any data - indicating that it is indeed paying attention to this signal. It waits until RTS goes low (is asserted) before sending data.

  4. However, occasionally (I saw this at least once) the Nios UART will have already started a transmission when RTS goes high, but does at least manage to turn it off shortly before the end of the sleep count.

  5. Finally, one time I happened to see the RTS glitch high for an extremely brief interval (possibly as small as one bit period) while the Nios UART was sending, and by the next second, the module had crashed.
These observations suggest to me that what is happening is that the EZURiO perhaps is not turning off RTS soon enough before its buffer fills, and so is getting swamped by data from the Nios. Or, perhaps there is a race condition when RTS is first raised that means that if data is coming in at that moment, sometimes, something about the buffer state gets screwed up.
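
To put the two delays from items 1 and 2 into more familiar units, here is a quick Python sketch that converts bit periods to milliseconds. The 1020 and 400 bit-period defaults are the ones quoted above from the EZURiO docs; the function names are my own, made up for illustration, and the baud rate you pass in is whatever the link happens to be running at.

```python
# Defaults quoted above from the EZURiO docs, expressed in bit periods.
RCV_TIMEOUT_BITS = 255 * 4   # idle time after a transmission before RTS is raised
SLEEP_COUNT_BITS = 400       # minimum time RTS then stays raised


def bit_periods_to_ms(bit_periods, baud):
    """Convert a count of bit periods to milliseconds at the given baud rate."""
    return bit_periods * 1000.0 / baud


def rts_timing_ms(baud):
    """Return (delay before RTS is raised, minimum RTS-high time), both in ms.

    Hypothetical helper: it only restates the documented defaults in
    milliseconds and says nothing about how the module's firmware works.
    """
    return (
        bit_periods_to_ms(RCV_TIMEOUT_BITS, baud),
        bit_periods_to_ms(SLEEP_COUNT_BITS, baud),
    )


if __name__ == "__main__":
    timeout_ms, sleep_ms = rts_timing_ms(57_600)
    print(f"RTS raised after {timeout_ms:.2f} ms idle, held for >= {sleep_ms:.2f} ms")
```

At 57,600 baud this works out to roughly 17.7 ms of idle time before RTS is raised, and at least 6.9 ms with RTS held high.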

This all makes sense, because this crashing problem started after I turned down the baud rate of the GPS->Nios connection, which resulted in more (& larger) gaps in the echoed data stream from the Nios->WiFi. The potential problem always existed, but it only began manifesting after these big gaps became present, since they allowed the possibility that the RTS might deassert at about the same time that the next data burst was starting.
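
A rough way to picture this, assuming an RTS pulse gets raised whenever the line sits idle for at least the receive timeout (1020 bit periods by default): the number of pulses depends on how many gaps in the stream are that long. The Python sketch below just filters a list of gap durations against that threshold. The function name and the gap values are made up for illustration; they are not scope measurements.

```python
RCV_TIMEOUT_BITS = 255 * 4  # default receive timeout from the EZURiO docs, in bit periods


def gaps_triggering_rts(gap_ms, baud, timeout_bits=RCV_TIMEOUT_BITS):
    """Return the idle gaps (in ms) long enough for the receive timeout to expire.

    Assumption: any idle gap at least as long as the timeout produces one
    RTS pulse. The real firmware behavior may well be more complicated.
    """
    timeout_ms = timeout_bits * 1000.0 / baud
    return [gap for gap in gap_ms if gap >= timeout_ms]


if __name__ == "__main__":
    hypothetical_gaps_ms = [2.0, 10.0, 25.0, 40.0]  # illustrative values only
    risky = gaps_triggering_rts(hypothetical_gaps_ms, 57_600)
    print(f"{len(risky)} of {len(hypothetical_gaps_ms)} gaps would raise RTS: {risky}")
```

Each gap that makes it through the filter is another chance for RTS to go high right as the next burst begins, which is why more (and larger) gaps mean more opportunities for a crash.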

This raises several alternative possibilities regarding how to proceed:
  1. Try turning down the baud rate of the Nios --> WiFi connection to match the rate (57,600) of the other connection - this should reduce the gaps in the data stream during which the RTS may possibly be raised.

  2. Try turning up the _UartSleepCount(), giving the Nios more time to respond to the raised RTS by halting the data flow. (However, not knowing exactly how the Wi-Fi board's receive buffer is working, I am uncertain whether this would really solve the problem.)

  3. Try turning down the _UartRcvTmo(), so that the RTS pulses will happen sooner, and hopefully not be as likely to overlap with the start of the next transmission burst. However, this seems like an unreliable method, and it may in fact lead to more RTS pulses (since more of the transmission gaps will be large enough), and more problems.
OK, I did #1; changed the baud rate on the DE3->WiFi link to 57,600. So far so good; this has cut down on the transmission gaps, thus RTS pulses, and appears to basically eliminate cases where the RTS pulse arrives at the same time that a new transmission burst is starting. This change really shouldn't have been necessary, but as a workaround it's much easier than trying to somehow debug EZURiO's serial receiver firmware (which I don't even have the source code to).

Emailed EZURiO to report this apparent firmware bug, so hopefully they can fix it in a future version of the firmware.
