Wednesday, February 9, 2011

Crashing Woes

Arrived at work hoping to see a nice fresh two-day-long data file, only to find that the run I started shortly before leaving on Monday crashed only 10 minutes in.

I'm not sure what happened... Everything went smoothly for about 10 minutes (Mon. 18:11 to 18:22), and then the Wi-Fi board restarted and the data flow from the FPGA board got hung up. The Wi-Fi board produced regular heartbeats after its restart, but no more data was received from the FPGA board, which was stuck in its 'hung' state. The Wi-Fi board started to respond to a typed command ('help') but then locked up, although it kept producing heartbeats!

I clearly need to improve the firmware to detect serial data stream lockups and recover from them more gracefully. I also need to try to figure out why the Wi-Fi board spontaneously rebooted, and why it got hung up responding to a command afterwards. Turning up the network debug level would help with that.
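To make the lockup-detection idea concrete, here is a minimal sketch of a stream watchdog: if no data has arrived within a timeout, it triggers a recovery action. It is written in Python purely for illustration, not in whatever language the firmware actually uses. The class name `StreamWatchdog`, the timeout value, and the `recover` callback are all hypothetical; nothing like this exists in the firmware yet.

```python
class StreamWatchdog:
    """Flags a stalled data stream when nothing arrives within a timeout.

    Hypothetical illustration only: names and structure are assumptions,
    not taken from the actual firmware.
    """

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s   # how long silence is tolerated, in seconds
        self.last_rx = now           # time the last data was seen
        self.recoveries = 0          # how many times recovery has been triggered

    def on_data(self, now):
        """Call whenever any data arrives on the stream."""
        self.last_rx = now

    def poll(self, now, recover):
        """Call periodically; runs recover() if the stream has gone quiet.

        Returns True if a recovery was triggered on this call.
        """
        if now - self.last_rx < self.timeout_s:
            return False
        recover()                    # e.g. reset the link (assumed action)
        self.recoveries += 1
        self.last_rx = now           # restart the timeout so we don't re-trigger every poll
        return True
```

In a main loop, `on_data` would be called each time bytes come in, and `poll` once per iteration with a monotonic clock reading (for example `time.monotonic()`). Restarting the timeout after each recovery gives the stream a full timeout period to resume before another recovery attempt is made.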

  • It's also possible that the firmware hung in part because of diagnostic messages the Wi-Fi board sent to STDOUT on startup, which the firmware didn't understand. Turning off the PRINT_NOW flag in debugmodes.uwi to suppress that output. Oops, I mean, turning on NO_PRINT. Hm, but deferred output would be nicer... Working on getting that working...

  • Another spontaneous reboot! And no clues in the log file... maybe I need to turn on full network debugging...

  • Interesting thought: The Nios firmware could be programmed to initialize the Wi-Fi board appropriately... Except that it might have trouble executing cold-boots (power cycles) that are occasionally needed.

In other matters, Gordon finished producing the new data file yesterday based on Copy(11), and I need to start analysis of that in OpenOffice Calc and give him instructions on what analysis to do in SciLab. Also I need to require MS Office to be installed.

  • Ended up spending most of the day fiddling around with the Wi-Fi script, with the goal of getting network debugging turned on, and startup diagnostics to STDOUT turned off. Finally got there at the end of the day, about 7 pm. But for some reason, all 4 GPS satellites currently in range are generating TRAIM alarms. I had to reset the GPS to get the PPS going again. I wonder if it will just take the TRAIM algorithm a while to lock in. Anyway, I guess I will leave it running & check out its progress when I get in on Friday.
