Saturday, January 28, 2012

Woke up in middle of night, decided to write down some more notes.

We wonder if maybe the FEDM is overheating more readily now for some reason, and started looking into the possibility of making temperature measurements.  After a little poking around, Darryl & I realized that both the DE3 board and the FEDM board actually have on-board temperature sensors.  However, these are not directly measuring the temperatures of the FPGAs, but only the ambient temperatures at the surface of the PCBs.  Even so, it might be useful to add gelware & firmware to actually take those measurements periodically and report them to the server.  That way we could detect, for example, overheating within the overall electronics box, once installed.  The DE3 kit, I'm pretty sure, came with sample FPGA source code for the Control Panel app which reads temperatures there; by starting from that code we could probably develop this capability pretty quickly.  And on the FEDM, we can figure out the interface to the temperature sensor chip by looking at its datasheet, if we decide that's worth the trouble.  The chip (looking at the BOM) is the MAX1668MEE+.  For easy reference I downloaded the datasheet from Digi-Key to the FEDM_Design folder in Dropbox.
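A note for when we get to the server-side decode: assuming the 8-bit two's-complement degrees-Celsius format common to this sensor family (to be verified against the datasheet once we dig in), the conversion would be something like this:

```python
def max1668_temp_c(raw: int) -> int:
    """Decode one raw 8-bit temperature byte from the sensor.

    Assumes the 8-bit two's-complement degrees-Celsius format
    typical of this sensor family -- verify against the MAX1668
    datasheet before trusting the numbers.
    """
    raw &= 0xFF
    return raw - 256 if raw >= 128 else raw

# e.g. 0x19 -> 25 C, 0xFF -> -1 C, 0x80 -> -128 C
```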

Another factor is the ADCs, which still may not be turned off, and which may therefore be contributing to the overall FEDM temperature; we should maybe still write code to interface with those sometime soon.  The ADC datasheet is already in our Dropbox folder.

I considered that power supply voltage sag could also be contributing, but I measured the voltage on the WiFi board powered from the FEDM and it only seemed slightly low.  Also, boosting its voltage up to 5V directly from the Agilent power supply didn't affect the behavior.

Earlier in the day, I tried applying this same boost to the CTU and this seemed to help it avoid the resets.

When speaking with Samad earlier in the day, I mentioned that there is somewhat of a tradeoff at work in the power supply design.  If he does a power distribution board that just routes power from the DE3 supply to the DE3 board, there is a risk that voltages could sag a bit just by going through multiple connectors, although this concern should be reduced somewhat by taking advantage of the multiple +5V wires from the DE3 supply (thus maximizing conductance from that supply to the power distribution board) as he & I discussed.
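For my own reference, a rough sketch of why paralleling the +5V wires helps (the numbers here are made up for illustration, not measured):

```python
def ir_drop(current_a: float, wire_resistance_ohm: float, n_wires: int) -> float:
    """Voltage drop across n identical wire+connector paths in parallel.

    Paralleling n paths divides the effective resistance by n,
    so the IR drop shrinks proportionally.
    """
    return current_a * (wire_resistance_ohm / n_wires)

# Illustrative numbers only: a 3 A load through paths of 20 mOhm each.
# One wire:   ~0.06 V of sag.
# Four wires: ~0.015 V of sag.
single = ir_drop(3.0, 0.020, 1)
quad = ir_drop(3.0, 0.020, 4)
```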

An alternative would be to use a higher-voltage, single-voltage wall-plug supply with adequate power, and feed this to a board with our own voltage regulators; the voltages from these can then feed out directly to the other boards in the system, thereby reducing the number of power connectors in series that those regulated levels will need to pass through, and thus reducing the IR drops.  But then, we have to worry about cooling the voltage regulators.
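Quick back-of-the-envelope on the regulator heat (illustrative numbers only, assuming linear regulators; a switching regulator would dissipate much less):

```python
def linear_reg_dissipation_w(v_in: float, v_out: float, i_load_a: float) -> float:
    """Power burned as heat in a linear regulator: P = (Vin - Vout) * I."""
    return (v_in - v_out) * i_load_a

# Illustrative: a 9 V wall supply regulated down to 5 V at 2 A
# would dump 8 W into the regulator -- hence the cooling worry.
heat = linear_reg_dissipation_w(9.0, 5.0, 2.0)
```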

One thing that I think we definitely want to do is look at powering the CTU WiFi and GPS directly from the power distribution board, instead of routing their +5V supply through the DE3 - since we still don't have documentation specifying the current supply capacity of the paths from the DE3's +5V input to the corresponding pins of its GPIO headers.

Anyway, back to the current problem where the new timing-sync datapath works fine for a few seconds and then dies.  Before it was dying so quickly, we had a temporary issue where the input pulse had collapsed.  However, I think this was caused by a probe connection that was accidentally touching another node; after moving the probe cable this went away.  But the problem with the datapath dying got worse and worse as the evening progressed.  The first time it was working it stayed on for a while.  On each subsequent test it seemed to die sooner and sooner.  This is one of the reasons I thought the problem might be heating-related - maybe something on the board was getting gradually warmer and warmer?  Anyway, try again Monday after it "rests" over the weekend.

A couple of thoughts on our diagnostic strategy looking forward:

  • It may be that Darryl's test module (now significantly rewritten by me) is getting stuck in a specific state, and identifying which state would give us a clue about why.  But on second thought, I don't think this is the problem, because after the "death" occurs the front-end module is in state 0, which means it received the return handshake.  So it's not that the datapath is getting jammed - it's that it's no longer responding to pulses.  It could be the very front-end edge-capture module that is dying.  If the problem is heating-related, that could well be the case, since that is a high-speed (200MHz) module and could experience timing-related problems as the board warms up.
  • If the test module were causing the problems, then replacing it with the actual firmware (now complete and ready to test) might have a hope of fixing the problem.  However, based on the above observations I don't think this is the case.  Something else is going wrong early in the datapath.  So probably the thing to do is go back again and apply diagnostic probes to the very front-end module.  The problem is likely happening somewhere in there, and if I can figure out where, it can possibly be tweaked to make it less timing-constrained (if that is what's happening).
Anyway, that's enough notes for now - I'm going back to sleep.

Another idea:  Switch the scope from "Auto" to "Normal" mode so I can see what the last timing-edge data stream looks like.  This might give me a clue about what exactly is starting to fail.

To get ready for the students to help work on the Python server, I cleaned up the "Server Code" folder on Dropbox (organizing various things into appropriate subfolders), and wrote a README.txt file explaining the new file hierarchy.

In Server Code/docs/, I started writing a "Programmers' Reference Guide" with the intent of documenting in detail the present code to aid the students (or other future developers) in modifying it.  However, after several hours of work, I only finished documenting one class in one module (namely, model.SensorNet).  So, finishing this guide may not be practical.  However, what's there may still be helpful as an example of good documentation, and to help the students get started working with that module.

Actually, my original intent today was also to begin the code changes to support the new object model.  I did write some comments towards that end at the top of model.py, and added new symbols to __all__, but haven't gotten any farther yet.

OK, late at night now - I've been adding more comments to model.py and cleaning up & rearranging things a little, but still haven't made substantive changes.  Before I do I may need to go into the lab and make sure the server is still running.  Then after I make the first set of changes (move some functionality from SensorNode into the WiFi_Module class), I can regression-test to make sure I didn't break anything.
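A sketch of the shape of that first change, so I remember the plan (the method names here are hypothetical placeholders, not the real interface): SensorNode keeps its old entry points but delegates to its WiFi_Module, so the regression tests against the existing interface should still pass.

```python
# Sketch of the planned refactor; method names are hypothetical.
class WiFi_Module:
    """Takes over the radio-facing duties formerly living on SensorNode."""
    def __init__(self):
        self.connected = False

    def connect(self) -> bool:
        self.connected = True
        return self.connected

class SensorNode:
    """Delegates to its WiFi_Module so existing callers keep working."""
    def __init__(self):
        self.wifi = WiFi_Module()

    def connect(self) -> bool:
        # Old entry point, preserved so regression tests still pass.
        return self.wifi.connect()
```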
