Wednesday, February 29, 2012

Wed., Feb. 29th

Today my main task is:  Get everything set up so the students can do the midterm HW/SW review.  We expect Dr. Arora, at least, to attend.  Dr. Kwan will not.  We don't know about Dr. Ordonez.

Juan and Michael Dean are here and are going to tweak the FEDM high-speed subsystem from yesterday to see if they can improve the performance any before they start Logic-Locking.

I resoldered the jury-rigged SMA connector on the OCXO board (old CTU controller) which had broken off earlier in the week.

I re-supplied power cables to everything to get ready for the demo.  Let's try doing a dry run.
  1. Starting server.
  2. Switching on:  (1) CTU WiFi, (2) DE3, (3) GPS in rapid sequence (~1 sec between).
We get the following output:

HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ACK,WIFI_READY*60
$GPRMC,225634.008,V,3025.660,N,08417.096,W,0.0,0.0,280212,4.0,W*7D
$GPGGA,225634.008,3025.66011,N,08417.09553,W,0,00,99.0,104.16,M,-29.7,M,,*64
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

and in the Console window:

Node #0's host (type CTU_GPS, firmware version 1.9) is starting up...
Node #0's host is ready to accept commands.
   ERROR: WiFi_Module.sendHost():  I don't know any way to communicate with the sensor host in the present briding mode.  Giving up.
Node 0 reports that its bridging mode has changed to TREFOIL.
Node 0's bridge mode is now TREFOIL.
...

This is because, at the moment the GPS manager tries to send the warm-start command, the server hasn't yet received the bridge-mode command telling it that the Wi-Fi board has switched to Trefoil mode.  This needs to be fixed sometime.  For now, we can try warm-starting manually:

3.  Manually type "HOST GPS $PDME,1" in the UART window.  Get:
HOST GPS $PDME,1
HOST GPS $PDME,1
$ACK,GPS $PDME,1*24
$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1
$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1
$GPTXT,COSMICi Custom_Config_0.0.3
$GPGGA,182901.764,3025.66011,N,08417.09553,W,0,00,99.0,104.16,M,-29.7,M,,*6E
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
$GPRMC,182902.014,V,3025.660,N,08417.096,W,0.0,0.0,290212,4.0,W*75
$GPGGA,182902.014,3025.66011,N,08417.09553,W,0,00,99.0,104.16,M,-29.7,M,,*6D
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

After a short delay, we acquire a satellite.  Now, let's switch on POSHOLD and TRAIM modes.  The commands sent by the old CTU firmware were:

$PDME,21,1,3025.694,N,08417.1,W,0040
$PDME,22,1,0.000000100

So we prefix these with "HOST GPS" and send:

4. Type "HOST GPS $PDME,21,1,3025.694,N,08417.1,W,0040" in the UART window.
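The "HOST GPS" prefixing can be sketched in a few lines of Python.  This is only an illustration: the helper name host_gps_command() is made up here, not taken from gps.py, and the two sentences are copied verbatim from the old CTU firmware commands above.

```python
# Sketch only: wrap raw GPS sentences in the "HOST GPS" pass-through prefix.
# host_gps_command() is a hypothetical helper, not part of the real server code.
HOST_GPS_PREFIX = "HOST GPS "


def host_gps_command(sentence: str) -> str:
    """Return the UART line that forwards `sentence` to the GPS via the host."""
    sentence = sentence.strip()
    if not sentence.startswith("$"):
        raise ValueError("expected a sentence starting with '$': %r" % sentence)
    return HOST_GPS_PREFIX + sentence


# The two commands the old CTU firmware sent (copied from the notes above).
OLD_CTU_COMMANDS = [
    "$PDME,21,1,3025.694,N,08417.1,W,0040",
    "$PDME,22,1,0.000000100",
]

uart_lines = [host_gps_command(s) for s in OLD_CTU_COMMANDS]
```

Each entry of uart_lines is then what would be typed into the UART window.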

Oops, got an error in gps.py... Just forgot the "this." in front of the _PDME_Record() constructor call.  Fixed that, re-running.

Now we're not getting any satellites...  The clock is about a minute behind (due to the GPS being powered off briefly as I was restarting everything).  Maybe I should put the batteries back in once we get a signal, so that doesn't happen again...

We seem to be slow re-acquiring, so I'm trying a cold-start now.  It erases the position/time settings.  (Not just the almanac!)  We seem to be taking a long time to re-acquire.  Maybe I shouldn't have done that...  If it doesn't eventually sync up, I might need to reinitialize using the demo app now.

Anyway, whatever...  We can still run our demo without the real GPS time sync.  We still need to manually start the timer with the START command.  That's fine.

5.  Power up the FEDM & its Wi-Fi board.
6.  Type "HOST START" at the UART.

I had some issues with loose cables; eventually we need to go through and create properly soldered connectors everywhere (or at least high-quality removable connectors).

Anyway, the demo is running OK at the moment except for the GPS still not acquiring satellites.

UPDATE: It finally did acquire satellites after I manually set the time & location and then just let it sit for a long time.  It took maybe 30 minutes though.

Juan and Michael made some more minor simplifications and got the slow-corner speed for the high-speed stuff up to 537.35 MHz.  They are archiving that and putting the archive up on the group blog.

Next I will show them how to logic-lock just the components we want.

One thing that might be moderately useful, until we can power up the whole CTU all at once through the power distribution board:  Program the system to start up appropriately no matter what order the individual components of the CTU are powered up in.  Possible orders are:
  1. (1) WiFi, (2) DE3, (3) GPS (nominal order).  - Currently, even this doesn't work ideally, because the server sees that the host is up and then tries sending the warm-start command to the GPS before the GPS is even turned on.  The server needs to recognize that the GPS isn't responding, display a warning (and a reminder to the operator to turn on the GPS), and keep retrying until it is turned on.  Also, if this sequence is executed too slowly, we can miss catching the WIFI_READY message in the CTU firmware and fail to automatically unmute the GPS pass-thru.
  2. (1) WiFi, (2) GPS, (3) DE3. - This order works pretty well, apart from the fact that the DE3 (& thus the server) never get to see the turn-on messages from the GPS - so we can't rely on those messages to detect that the GPS has turned on.  Also, with this sequence, if it is executed slowly, we can miss the WIFI_READY message.
  3. (1) DE3, (2) WiFi, (3) GPS - This actually works OK; we get "--" followed by "oh" on the 7-segment display temporarily when we turn on the Wi-Fi, and output like the following:

    $ACK,*65
    $ERR,UNK_CMD,*00
    $ACK,WIFI_STARTING,v0.19*67
    $ACK,WIFI_READY*60
    $PDMEHEADER1: DeLORME GPS2058_HW_1.0.1
    $PDMEHEADER2: DeLORME GPS2058_FW_2.0.1
    $GPTXT,COSMICi Custom_Config_0.0.3
    $GPRMC,223052.165,V,3025.656,N,08417.093,W,0.0,0.0,290212,4.0,W*76
    $GPGGA,223052.165,3025.65566,N,08417.09318,W,0,00,99.0,111.34,M,-29.7,M,,*65
    $PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
    $PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
    ...

    Note the HOST_STARTING and HOST_READY messages are missing, because when those are sent, the Wi-Fi board isn't yet turned on, so they go in the bit bucket.  So the server doesn't receive the information it requires to realize that this is the CTU.  Beyond that, other than getting rid of the "ACK" and "UNK_CMD" on the initial empty line (00 byte) from the Wi-Fi, I'm not sure how this could be improved.  (Other than adding the automatic warm-start etc.)
  4. (1) DE3, (2) GPS, (3) WiFi - When this works, we get output like this:

    $ACK,*65
    $ERR,UNK_CMD,*00
    $ACK,WIFI_STARTING,v0.19*67
    $ACK,WIFI_READY*60
    $GPRMC,224111.999,A,3025.655,N,08417.095,W,0.5,165.9,290212,4.0,W*60
    $GPGGA,224111.999,3025.65503,N,08417.09511,W,1,03,5.4,111.33,M,-29.7,M,,*57
    $PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
    $PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A

    Of course, the HOST_STARTING and HOST_READY messages are missing.  The GPS startup messages are also missing because we intentionally mute them until we receive the WIFI_READY.
  5. (1) GPS, (2) WiFi, (3) DE3 - This may work, and produce output like the following (although whether we catch the WIFI_READY message depends on how quickly we turn the DE3 on):

    HOST_STARTING,CTU_GPS,1.9
    HOST_READY
    $ACK,WIFI_READY*60
    $GPRMC,224401.999,A,3025.651,N,08417.100,W,0.2,0.0,290212,4.0,W*61
    $GPGGA,224401.999,3025.65087,N,08417.09954,W,1,03,5.2,111.30,M,-29.7,M,,*52
    $PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
    $PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
    ...

    Note the HOST_STARTING and HOST_READY messages are present b/c the Wi-Fi board has already powered up by the time the DE3 sends them.
  6. (1) GPS, (2) DE3, (3) WiFi - Comes out similar to case 4.  Works fine apart from the fact that we miss the GPS startup messages.  However, they are not particularly critical.  Output similar to the following:

    $ACK,*65
    $ERR,UNK_CMD,*00
    $ACK,WIFI_STARTING,v0.19*67
    $ACK,WIFI_READY*60
    $GPRMC,224919.999,A,3025.684,N,08417.092,W,0.3,63.4,290212,4.0,W*57
    $GPGGA,224919.999,3025.68390,N,08417.09239,W,1,03,4.9,109.57,M,-29.7,M,,*5C
    $PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
    $PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
    ...
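The "warn and keep retrying until the GPS is turned on" behavior wanted for order 1 could look roughly like the sketch below.  Everything here is an assumption for illustration: warm_start_with_retry() and its send / gps_responded callables are hypothetical stand-ins for whatever the server actually uses, and the 5-second interval is arbitrary.

```python
import time


def warm_start_with_retry(send, gps_responded, warn=print,
                          retry_interval_s=5.0, max_attempts=None,
                          sleep=time.sleep):
    """Send the warm-start command until the GPS answers.

    send          -- callable taking the UART line to transmit (hypothetical)
    gps_responded -- callable returning True once any GPS output has been seen
    warn          -- callable used to show the operator a warning
    Returns the number of attempts used, or None if max_attempts ran out.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        send("HOST GPS $PDME,1")      # warm-start command from the notes above
        sleep(retry_interval_s)       # give the GPS time to answer
        if gps_responded():
            return attempt
        warn("WARNING: GPS not responding (attempt %d); is it powered on?"
             % attempt)
    return None
```

Passing sleep and warn in as parameters keeps the loop testable without real delays or console output.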
George brought in the brackets for holding up the sample case, and the plexiglass enclosure.

We had the review, with Dr. Arora & Matthieu present.  It seemed to go pretty well - Dr. Arora asked a lot of questions.

After the review I noticed a bug in the CTU firmware: I was consistently getting an E5 (discrepancy too large) error, possibly triggered when we re-acquired time lock.  The cause was just that I didn't re-align last_counter_val after seeing a phase mismatch; fixed that.  Then I tested the startup sequences 3-6 above.

The GPS reliably re-acquired pretty quickly (and without manual commands) later in the afternoon, although that may have been because I had the batteries in, so it never powered down totally, just went into sleep mode.  We need to do some more experiments w.r.t. re-acquiring time lock after it has been off for a while, to really pin down what's needed for that.


Tuesday, February 28, 2012

Tue., Feb. 28th

Both Darryl & David are supposed to be here today.  M. Dean asked via email if he could study for a midterm tomorrow instead of coming in, but I replied that I really think he should come in, and do his studying in the evening.  This is our last full work day before the Midterm HW/SW Review tomorrow (@ 3:30 pm) and before our self-imposed end-of-month deadline (end of day tomorrow) to demonstrate full 500 MHz functionality of the system.  The work needs to get done.

Yesterday David was copying the Quartus SP2 install file to the VirtualBox on the center iMac, to try to fix some compile errors he was getting.  When he gets here he can do the install and then try compiling again.

Yesterday Aarmondas was working on stripping the slow-speed stuff from the timing-sync datapath; he can continue that today.

If all goes well, we will hopefully get to the point today of compiling all the high-speed components together by themselves.  If that gets over 500 MHz, then assign them to the root logic-lock region and compile again (should give the same speed).  Finally, the slow-speed components then need to be reconnected, and a final recompile done with them in place.  In theory, the 500 MHz speed for the high-speed components should be retained, with the full functionality of the system in place.  Then the whole thing will still need to be tested, to demonstrate proper functionality and that the desired speed has been attained.

Meanwhile, I can continue finishing up the implementation of the automatic server-driven warm-start of the GPS, and test it.  If that is insufficient to get the GPS to acquire satellites, then I can also try a cold-start, and also try initializing the date/time properly, and see if that helps.  More coding needed for that though.

When I left off yesterday, I was about to implement wifi.WiFi_Module._uartSrvConnected(), etc. in wifi.py.  These methods allow us to figure out which connections from the node are currently up & running, so we'll know which of them we can use to send a message to the node.  Normally, all of them should be active once the node has started up, but in case one of the connections goes down for some reason, it would be nice to be able to detect this, and gracefully fall back on an alternate connection.  However, for now we'll just assume that if the corresponding attribute of the .wifi_module object model is non-null, the connection is still active.
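A minimal sketch of that "non-null means connected" assumption follows.  The method names are the ones planned for wifi.py; the class name WiFiModuleSketch, the attribute names uart_conn / auxio_conn / main_conn, and the usable_connections() helper are placeholders invented here, since the real attribute names aren't recorded in these notes.

```python
class WiFiModuleSketch:
    """Stand-in for wifi.WiFi_Module; attribute names are placeholders."""

    def __init__(self):
        # Each attribute holds the live connection object, or None if absent.
        self.uart_conn = None
        self.auxio_conn = None
        self.main_conn = None

    def _uartSrvConnected(self):
        return self.uart_conn is not None

    def _auxioSrvConnected(self):
        return self.auxio_conn is not None

    def _mainSrvConnected(self):
        return self.main_conn is not None

    def usable_connections(self):
        """List the connections we could currently use to reach the node."""
        checks = [("uart", self._uartSrvConnected),
                  ("auxio", self._auxioSrvConnected),
                  ("main", self._mainSrvConnected)]
        return [name for name, is_up in checks if is_up()]
```

A later version could replace the None checks with a real liveness test without changing the callers.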

Archived log files.  Made those changes.  Starting server.  Fixing some more minor bugs.

OK, now finally we are not getting Python errors, and we are getting the following transactions on the UART connection:

----------------------------------------------------------------------
At Tue Feb 28 15:07:34 2012 + 203 ms opened node0.uart.trnscr transcript...


Tue Feb 28 15:07:45 2012 + 239 ms: < 
Tue Feb 28 15:07:45 2012 + 240 ms: < HOST_STARTING,CTU_GPS,1.9
Tue Feb 28 15:07:45 2012 + 247 ms: < HOST_READY
Tue Feb 28 15:07:45 2012 + 256 ms: > HOST GPS $PDME,1
Tue Feb 28 15:07:45 2012 + 836 ms: < $ACK,GPS $PDME,1*24
Tue Feb 28 15:07:45 2012 + 837 ms: < $ERR,UNK_CMD,GPS*44

And the server console dutifully reports:

   ERROR: SensorHost._handleHostMsg(): Sensor host reports a UNK_CMD error with data [GPS].

So, the command is getting sent, but generates an UNK_CMD error response from the embedded host.  Guess I didn't yet update the firmware to support the new "GPS" command?  Or didn't yet load the new firmware onto the DE3 board?  Anyway, let's check.  I might also have to turn on firmware debugging - fortunately this can be done via a simple slider switch.
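For checking timing in transcripts like the one above (e.g. how long the host takes to ACK a command), a small parser is handy.  This is a sketch under assumptions: the regular expression is inferred from the lines pasted here, ">" is taken to mean "sent to the node" and "<" "received from it", and the month/day names are assumed to be in English.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Tue Feb 28 15:07:45 2012 + 256 ms: > HOST GPS $PDME,1"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})"
    r" \+ (?P<ms>\d+) ms: (?P<dir>[<>]) ?(?P<text>.*)$")


def parse_transcript_line(line):
    """Return (timestamp, direction, text), or None if the line doesn't match."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    when = datetime.strptime(m.group("stamp"), "%a %b %d %H:%M:%S %Y")
    when += timedelta(milliseconds=int(m.group("ms")))
    return when, m.group("dir"), m.group("text")


sent = parse_transcript_line(
    "Tue Feb 28 15:07:45 2012 + 256 ms: > HOST GPS $PDME,1")
ack = parse_transcript_line(
    "Tue Feb 28 15:07:45 2012 + 836 ms: < $ACK,GPS $PDME,1*24")
latency = ack[0] - sent[0]   # time from command to ACK
```

For the two lines used here, the command-to-ACK latency works out to 580 ms.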

Oops, Eclipse is complaining that I upgraded Quartus to SP2 but didn't upgrade the Nios2 tools yet.  Downloading ftp://ftp.altera.com/outgoing/release/91sp2_nios2eds_windows.exe...  Installing...

I think I want to get rid of the bad-checksum errors when the GPS module boots - have it no longer complain if the checksum is missing.  OK, did that.
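For reference, here is a sketch of checksum checking that tolerates a missing checksum.  It assumes the standard NMEA rule (XOR of all characters between "$" and "*", written as two hex digits), which is consistent with lines such as "$ACK,*65" in these notes.  The function names are made up, and this is not the actual firmware or server code.

```python
from functools import reduce


def nmea_checksum(body: str) -> int:
    """XOR of the character codes between '$' and '*'."""
    return reduce(lambda acc, ch: acc ^ ord(ch), body, 0)


def check_line(line: str, require_checksum: bool = False):
    """Return (body, ok).

    Lines with no '*XX' suffix are accepted unless require_checksum is set,
    which is the relaxed behavior wanted for the GPS boot messages.
    """
    line = line.strip()
    if not line.startswith("$"):
        return line, False
    body, star, given = line[1:].rpartition("*")
    if not star:                          # no checksum present at all
        return line[1:], not require_checksum
    try:
        expected = int(given, 16)
    except ValueError:                    # suffix after '*' is not hex
        return body, False
    return body, nmea_checksum(body) == expected
```

With this, a header line like "$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1" passes, while a line whose checksum is present but wrong is still rejected.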

System startup sequence is still very finicky about the order in which the boards are powered up.  For example, here I powered up the DE3 as the Wi-Fi board was opening its windows:

Console messages:


|------------------------------------------------------------|
|  Node 0 log started.                                      |
|VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV|
Node 0 reports its MAC address is 00:1E:3D:33:ED:CF.
Node 0 turned on at Tue Feb 28 17:09:59 2012 + 875 ms.
Starting AUXIO server for node 0 on port 52737...
Starting UART server for node 0 on port 63766...
Node 0 reports that its bridging mode has changed to NONE.
Node 0's bridge mode is now NONE.
Node #0's host (type CTU_GPS, firmware version 1.9) is starting up...
Node #0's host is ready to accept commands.
   ERROR: WiFi_Module.sendHost():  I don't know any way to communicate with the sensor host in the present briding mode.  Giving up.
Node 0 reports that its bridging mode has changed to TREFOIL.
Node 0's bridge mode is now TREFOIL.
   ERROR: SensorHost._parseMsg(): Checksum failed on line [$ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7]; ignoring line...
 WARNING: GPS_Module.sentMessage(): The GPS module sent a message of a type [PDMEHEADER2: DeLORME GPS2058_FW_2.0.1] which I don't know how to handle.  Ignoring...
Heartbeat #1 received from node 0 at Tue Feb 28 17:11:04 2012 + 827 ms.
Heartbeat #2 received from node 0 at Tue Feb 28 17:12:04 2012 + 210 ms.

UART transcript:

Tue Feb 28 17:05:38 2012 + 774 ms: < HOST_STARTING,CTU_GPS,1.9
Tue Feb 28 17:05:38 2012 + 782 ms: < HOST_READY
Tue Feb 28 17:05:40 2012 + 222 ms: < $ACK,WIFI_READY*60
Tue Feb 28 17:05:46 2012 + 127 ms: < $ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7
Tue Feb 28 17:05:46 2012 + 131 ms: < $PDMEHEADER2: DeLORME GPS2058_FW_2.0.1
Tue Feb 28 17:05:46 2012 + 137 ms: < $GPTXT,COSMICi Custom_Config_0.0.3

And we never got the GPS Manager sending the warm-start message.
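One possible fix for the lost warm-start: instead of giving up when the bridging mode doesn't allow host traffic yet, queue the line and flush it when the mode changes.  This is a sketch, not the real wifi.py; the class DeferredHostSender, its transmit callable, and the assumption that TREFOIL is the (only) mode in which host messages can be sent are all illustrative.

```python
class DeferredHostSender:
    """Hold host-bound lines until the bridge mode allows sending them."""

    def __init__(self, transmit, usable_modes=("TREFOIL",)):
        self._transmit = transmit            # callable(line); hypothetical
        self._usable = set(usable_modes)     # modes assumed to reach the host
        self._mode = "NONE"
        self._pending = []

    def send_host(self, line):
        if self._mode in self._usable:
            self._transmit(line)
        else:
            self._pending.append(line)       # defer instead of giving up

    def bridge_mode_changed(self, mode):
        self._mode = mode
        if mode in self._usable:
            # Flush everything that was queued while the mode was unusable.
            pending, self._pending = self._pending, []
            for line in pending:
                self._transmit(line)
```

With this arrangement the order in which HOST_READY and the bridge-mode report arrive no longer matters for the warm-start.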

Let's do a test where we power things up with substantial delays in between.  Here we get:

UART transcript:
----------------------------------------------------------------------
At Tue Feb 28 17:18:34 2012 + 439 ms opened node0.uart.trnscr transcript...

Tue Feb 28 17:18:47 2012 + 736 ms: < 
Tue Feb 28 17:18:47 2012 + 737 ms: < HOST_STARTING,CTU_GPS,1.9
Tue Feb 28 17:18:47 2012 + 744 ms: < HOST_READY
Tue Feb 28 17:18:47 2012 + 753 ms: > HOST GPS $PDME,1
Tue Feb 28 17:18:48 2012 + 330 ms: < $ACK,GPS $PDME,1*24
Tue Feb 28 17:18:54 2012 + 969 ms: < $ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7

Some console output:

Node #0's host (type CTU_GPS, firmware version 1.9) is starting up...
Node #0's host is ready to accept commands.
   ERROR: SensorHost._parseMsg(): Checksum failed on line [$ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7]; ignoring line...

I don't know why the first line from the GPS always has this garbage character "Œ" at the start.  Maybe we should strip it off somehow?  Wrote some code for that in the CTU firmware; compiled; still need to burn & test.
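The stripping idea, sketched in Python for illustration (the real change went into the CTU firmware, which isn't shown here).  The assumption is that the garbage is one or more leading characters outside printable ASCII; the function name is made up.

```python
def strip_leading_garbage(line: str) -> str:
    """Drop leading characters that are not printable ASCII (space..tilde)."""
    i = 0
    while i < len(line) and not (" " <= line[i] <= "~"):
        i += 1
    return line[i:]


cleaned = strip_leading_garbage("Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1")
```

Lines that already start with a printable character pass through unchanged, so this is safe to apply to every line from the GPS.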

Monday, February 27, 2012

Mon., Feb. 27th

David is here, and I've uploaded my changes from over the weekend to Dropbox (C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\FEDM_design\FEDM_code\q91)

The important files that are new or changed are:
  • se_reg_en_56_pip.vhd - Version of this register with pipelined enable signal.
  • se_dff_en.vhd - Updated this DFF to use Altera DFFE primitive to try to ensure that enable is really an enable and not a feedback path.
  • se_pulse_cap_tsedge_56.vhd - Updated this edge-capture module to use the new register.
  • se_pulse_cap_56.vhd - Updated this pulse-capture module to use the new register.
  • cscnt_pipeline_register_56.bdf - New 2x56-bit register w/ no reset/enable, to use in pipeline to fan-out 
  • tsedge_datapath_v2_56.bdf - Timing edge-capture datapath modified to use new pipeline register at counter input.
  • pmt_ic_datapath2_56.bdf - Pulse capture datapath modified to use new pipeline register at counter input.
  • pmt_ic_datapath_v3_56.bdf - Pulse capture datapath modified to use new pipeline register at counter input.
Plan for today (for the students):
  • Do changes like I did for the high-speed mockup, but in the real design.
    • For reference, the high-speed mockup project is at FEDM_code\High_Speed_Mockup.qar on Dropbox.  (Don't expand it within the main folder though.)
I'm checking to see whether using the Altera DFF megafunction registers in the counter pipeline in my working project (Q:\) caused worse performance than in the mockup where I used our own VHDL.  I don't expect it would have made much difference, but you never know.  The technology map for the Altera registers looked kind of strange.

Got 214.96 MHz; same as yesterday.  So it doesn't matter what kind of register we use there.

Some notes on what I put in my LogicLock region yesterday (so I can delete these from my test project):
  • *|pulseform_cap_56:*|pulse_prep_56:*|se_pulse_cap_56:*
    • This gets all the front-end pulse-capture instances within the 3 main pulseform-capture datapaths.
  • hspeed_counter_56:inst22
    • This includes both the PLL and the high-speed counter.
  • pmt_ic_datapath2_56:inst20|cscnt_pipeline_register_56:inst4
    • Pipeline register for counter input in the 1st pulseform-capture datapath.  
  • pmt_ic_datapath_v3_56:inst*|cscnt_pipeline_register_56:inst4
    • Pipeline register for counter input in the 2nd & 3rd pulseform-capture datapaths.
  • tsedge_datapath_v2_56:inst11|cscnt_pipeline_register_56:inst4
    • Pipeline register for counter input in the timing-sync edge-capture datapath.
  • tsedge_datapath_v2_56:inst11|pulse_prep_tsedge_56:inst2|se_pulse_cap_tsedge_56:inst
    • Front-end timing-sync edge-capture module.
Need to tell the students this trick for putting just the desired components into LogicLock, when they're ready for it.
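To show the students how the wild-carded entries pick up whole families of instances, here is a rough Python illustration using shell-style matching.  Quartus's own wildcard rules may differ in detail from Python's fnmatch, and the hierarchy paths in `names` are made up for the example; only the two patterns come from the list above.

```python
from fnmatch import fnmatchcase

# Two of the wild-carded entries from the list above.
PATTERNS = [
    "*|pulseform_cap_56:*|pulse_prep_56:*|se_pulse_cap_56:*",
    "pmt_ic_datapath_v3_56:inst*|cscnt_pipeline_register_56:inst4",
]


def selected(instance_names, patterns=PATTERNS):
    """Return the names matched by at least one pattern."""
    return [n for n in instance_names
            if any(fnmatchcase(n, p) for p in patterns)]


# Made-up hierarchy paths, for illustration only.
names = [
    "pmt_ic_datapath_v3_56:inst18|cscnt_pipeline_register_56:inst4",
    "pmt_ic_datapath_v3_56:inst23|cscnt_pipeline_register_56:inst4",
    "pmt_ic_datapath_v3_56:inst18|pulseform_cap_56:inst10"
    "|pulse_prep_56:inst2|se_pulse_cap_56:inst",
    "pmt_ic_datapath_v3_56:inst18|slow_readout:inst7",
]

in_region = selected(names)
```

Here the first three names are selected and the made-up slow-speed instance is left out, which is the effect we want from the region assignment.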

Samad showed me his new design for the power-distribution board - it's much improved, and just about ready to give to Donte for fabrication.

David and Juan cut/stubbed the slow-speed stuff out of one of the pulseform-capture datapaths, and David is going to do the same to the other one (which has two instances).  Aarmondas just got here, & David's going over what we're doing with him, so hopefully Aarmondas can do the same thing with the timing-sync datapath.

Just for a lark, I'm going to try compiling with the high-speed components in a child region of the root region, with Reserved turned on so the low-speed stuff (hopefully) can't interfere with it at all.  I don't think it'll fit, but what the heck, it's an easy enough thing to try.  Huh, it fits!  242.72 MHz.  Better than without it, but still not the best possible.

What else to work on today?  Finally testing & debugging the new server code to warm-start the GPS?  Sounds good...  Some of the wires on the Wi-Fi board got detached, had to reconnect them...  OK...

Worked through various bugs in the new code (all before the point where we send the warm-start command).  When I stopped, I was just about to implement the as-yet-unimplemented methods wifi.WiFi_Module._uartSrvConnected(), _auxioSrvConnected(), _mainSrvConnected().  (Should be easy; just out of time now.)

Sunday, February 26, 2012

Sun., Feb. 26th

Came in to do a couple more improvements to the FEDM input-capture architecture.  Unfortunately the main building doors were locked, but fortunately the loading dock door wasn't.  I think now that the only reason the main doors were unlocked yesterday is that apparently there was some kind of symposium going on here - that also explains why the parking lot was full.  There is a sign in the parking lot that says "reserved for symposium attendees."  I ignored it though and parked there anyway, since the parking lot is not very full today, so I conclude that the symposium was yesterday.  (It would be very unlikely for it to be held on a Sunday.)

NOTE:  If I put a new pipeline stage for the counter value in the pulse-capture datapaths, but don't put one in the timing-sync capture datapaths, then the timing-sync capture datapath will return time values that are 1 cycle ahead, relative to those obtained from the pulse-capture datapaths.  This can be corrected for in software, but it might be cleaner to do it in hardware if we have room to fit all 4 of the new 112-bit pipeline registers (instead of just 3).  I think I'll try it for now, but remove it if we have fitting problems later.
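If we did end up correcting this in software, it would amount to something like the sketch below.  Assumptions: the captured time has already been reduced to a plain 56-bit count (sum and carry combined), and "1 cycle ahead" means the timing-sync value reads one count higher than a pulse-capture value taken at the same instant.  The function name is made up.

```python
COUNTER_BITS = 56
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def align_timing_sync(ts_value: int, extra_cycles: int = 1) -> int:
    """Shift a timing-sync capture back by `extra_cycles`, wrapping at 56 bits."""
    return (ts_value - extra_cycles) & COUNTER_MASK
```

The mask handles the wrap-around case where the capture happens right as the counter rolls over.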

OK, so I've now created the following new module in Q:\:
  • cscnt_pipeline_register_56.bdf - This just uses two 56-bit-wide Altera DFF megafunction instances to buffer the sum and carry values.  I didn't bother including any reset/enable functionality, since the output of this module should quickly reflect the counter's behavior (with only a 1-cycle delay) in any case.  If there are speed problems, I could try replacing it with a custom VHDL module, since this seems to work better in some cases (perhaps only in the case where reset/enable bits are included).  But, I doubt this pipeline stage will end up being a performance bottleneck on either its input or output sides.  Its input is a register, and it fans out to only 5 places now (as opposed to 16 for the original counter), with only a little bit of logic delay (probably just 1 LAB's worth) in each place.
And, I inserted an instance of this module in the counter input of the following three modules:
  • pmt_ic_datapath2_56.bdf
  • pmt_ic_datapath_v3_56.bdf
  • tsedge_datapath_v2_56.bdf
I guess I could have done it at the top level, but that schematic is getting kind of crowded.  Or, I could have done it in pulseform_cap_56.bdf, and avoided having to put it in both versions of the pulse-capture datapath - but I wanted it not to be buried so deeply.  Anyway, these schematics may all end up needing to be reorganized later anyway, if it turns out that we have to split up the design in order to put the parts we want into a LogicLock region.  I'm hoping, though, that there might be a way to add instances into a LogicLock region without having to do that.  Maybe by adding them one-at-a-time through the region's Properties dialog?  That might allow clicking down into substructures...  Check for this capability later.

Anyway, for now, let's go ahead and try the compile...  This will take a while...  19 minutes.

Yikes, we're back down to 214.96 MHz, as opposed to 271 MHz yesterday!  Possibly the extra resource usage from the pipeline registers is making the fitter have to stretch more in general.  Let's look at the bottlenecks...  First one is:

inst23|inst3|inst3|inst3|fall_c_reg|\byte_arr:4:bit_arr:3:sedff_inst|prim_dffe_inst|datain

OK, so this node is apparently the input to the falling-edge time-capture register for the carry value.  Perhaps I was wrong to think that the output of the new pipeline register wouldn't still be a bottleneck?  What to do?  Add 15 more pipeline registers, one for each of the individual pulse-cap instances?  That will increase our resource usage quite a bit, and may lead to fitting problems.

Maybe first I'll try LogicLocking just the high-speed components.  OK, it looks like all you have to do is drag a representative instance into the region (its parents will get chosen arbitrarily), and then edit the entries in the Properties window to appropriately wild-card portions of the instance name so as to capture all of the desired instances.  So, now I've got the high-speed counter and the 4 new pipeline registers and the 15 pulse-capture modules and the timing edge-capture module all assigned to the root region.  That should be it for the high-speed components.

Let's retry the compile now...  What I'm hoping at the moment (at least, it would be nice) is that Quartus will first compile and optimize all the stuff that goes in the root LogicLock region, lock it down in its place, and then fill in the remaining parts of the design around it.  I don't know if it's really smart enough to do this.  If that doesn't work, then we'll have to do something more complicated: divide up the entire design into high-speed and low-speed portions, stub out all the low-speed parts, THEN freshly put the high-speed parts into the root logic-lock region, add the low-speed parts back in, and do an incremental recompilation.  That is straightforward but time-consuming and so I should probably leave it for the students to tackle.  Anyway, I am freezing cold in here, and need to leave soon.

Another thing to try is a different implementation of the pipeline registers, although you'd expect Altera's own DFF megafunction not to do TOO badly...

The new compile changed nothing, speed-wise:  Still 214.96 MHz.  Bottleneck still the same.  Darn, it looks like we'll still have to divide up the design.

One more thing to try: Allow Quartus to automatically insert pipeline stages as needed to meet timing constraints.  I've been avoiding this out of worry that it might mess up the timing of the high-speed logic.  But it might be worth a try.  Hm, looking...  So far, I've only found an option that adds pipeline stages for asynchronous reset signals.  That isn't the problem we're having.

Finally, one more thought:  We could remove the reset/enable inputs from the pulse-cap modules to hopefully improve their performance.  If we resort to this though, we'd need to think carefully about how this will affect the behavior of subsequent logic on startup/reset.

Anyway, that's all for today, I'm going to get some lunch and work on my visa application...

Saturday, February 25, 2012

Sat., Feb. 25th

Thinking about trying to get into the lab today to get some work done.  Something to try:
  • See if the max temperature for the slow-corner timing analysis can be adjusted downwards.  Right now the default max temp is 85 degrees C and the fmax for the slow corner for my mockup of the high-speed components comes out a bit under 500 MHz, specifically 487.  But it's possible that if we can set the max temp lower (say 25 C), then the fmax reported by TimeQuest would come out above 500, telling us that if we want to come up to that speed, we need to come down to that temp.  This seems likely since the fast corner, at 0 degrees, is actually meeting the 500 MHz timing constraints (not highlighted in red).

Arrived at lab about 2:45 pm to find the building unlocked and the parking lot full of cars.  Apparently, they must have stopped locking this building on Saturdays at some point since the last time I tried to get in on a weekend.  That's good news, because it gives us more days to get work done before time runs out.

Looking at Quartus settings to see if there's any way to adjust the temperature of the hot corner.  Didn't find it yet, but I did find the "Fitter Settings -> Optimize multi-corner timing" option, which optimizes the design not just for the slow corner.  However, this doesn't yet turn on multi-corner timing analysis.  That requires some setting for the timing analyzer.  Haven't figured that out yet.

Aha, under "Operating Settings and Conditions -> Temperature -> Junction temperature range", we can change the High temperature from 85 C to some other value.  Let's try 25 C.  Oops, it doesn't actually let me change it!

Interestingly, there is an option under PowerPlay power analysis to auto-compute the junction temperature using a cooling solution.  For this, you specify an ambient temperature, the length of the heat sink, and the air flow (in liters per minute, I think).  We could try doing this using the parameters of our actual cooling system design.

Doing a test compile - what changed was just that I turned on "Optimize multi-corner timing" and I also turned on some output from the PowerPlay power analyzer.   Don't think it actually ran the power analyzer yet though.  Ran it manually just to see what the computed temperature was with no heat sink.

Still getting 487.09 MHz.
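To put that number in perspective, here is the arithmetic for how far short of the 500 MHz target it is, treating fmax simply as 1/period (a simplification of what TimeQuest actually reports).  The helper names are made up.

```python
def period_ns(fmax_mhz: float) -> float:
    """Clock period in nanoseconds for a given frequency in MHz."""
    return 1000.0 / fmax_mhz


def slack_ns(fmax_mhz: float, target_mhz: float = 500.0) -> float:
    """Positive if the design is faster than the target, negative if slower."""
    return period_ns(target_mhz) - period_ns(fmax_mhz)


shortfall = slack_ns(487.09)   # negative: slower than the 500 MHz target
```

At 487.09 MHz the period is about 2.053 ns against a 2.000 ns target, so the worst path is only about 53 ps too slow.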

In the TimeQuest timing analyzer, found a "Report Bottlenecks" option that identifies the worst-case nodes.  One of them is the "wfall_del" node in se_reg_en_56.vhd.  Well, what's going on there?  This is being used as the "enable" signal for two different 56-bit registers, that is, it is fanning out to 112 different flip-flops.  So it's perhaps unsurprising that there is a setup-time bottleneck at this node.

One way to fix this would be to add a pipeline register for the fanout of the enable signal.  Trying that.  (Created & used new module se_reg_en_56_pip.vhd, which creates a duplicate enable signal every 8 bits.)

Yep, now that node is no longer a bottleneck.  But fmax is even worse now!  (464.04 MHz)  Perhaps because the extra flip-flops spread out the design some more?

Anyway, we could go through the design playing whack-a-mole, adding pipeline registers each time we find a bottleneck.  But this is pretty time-consuming, and there's no guarantee that, even if we do this, we will get back over 500 MHz at the slow corner.

Alternatively, if the cooling system does its job, then the present design mockup (even without the enable-pipeline) already meets the 500 MHz timing constraint at the fast corner, or at sufficiently low temperatures (we don't know how low yet).

To clarify, the present design I'm working with is located at:

  • Machine:           COSMICi
  • Folder:              C:\LOCAL\Quartus_projects\q9v1sp2\COSMICi_FEDM
  • Project:             COSMICi_FEDM.qpf
  • Revision:           COSMICi_FEDM_RevA
  • Top-level file:    HSMockup_LogicLock_test.bdf
Next bottleneck:  In se_pulse_cap_56 again, in rise_s_reg output; not sure why this is a bottleneck though since it wasn't supposed to feed back into the high-speed logic.  Hm; looking at this node in the chip planner and in the technology map, it seems to have feedback.  Maybe I should be using the Altera DFF primitives, instead of trying to roll my own dffs in VHDL - the synthesized logic looks like it might not be using enables properly, and may be using feedback paths instead.

Modifying se_dff_en to directly use the Altera DFFE primitive.

Now the speed is even lower; 446.03 MHz!!!  However, at least that node is no longer a bottleneck.

Next bottleneck:  In the enable-pipeline register (en_preg_inst) in se_reg_en_56_pip.  OK, that should be fanning out to the real enable inputs of 8 real hardware DFFs now.  Hm...  No, maybe that's not true.  Looking at the bottleneck node in the chip planner seems to indicate that's not the case.  EN isn't used, and there is a feedback path from the output of this register back to the logic that computes its input; this logic wouldn't need to be present if the real low-level enable signal was used.

I thought the DFF at lower-right would have used its enable input, but apparently it doesn't.
Let's also look in the Technology Map.  The node name we're looking for is inst18|inst10|inst2|fall_c_reg|\byte_arr:6:en_preg_inst|q|adatasdata.  Ah, that one uses a se_dff which I didn't change yet to use the primitive.  Fix that.

OK, now the node name of the first bottleneck is:

inst18|inst10|inst2|fall_c_reg|\byte_arr:6:en_preg_inst|prim_dff_inst|adatasdata

Here's a question:  Is the bottleneck in *computing* this value (input to the register) or *using* this value (output from the register)?  I've been assuming the latter but maybe it's actually the former.  In that case, perhaps the problem is from complicated logic computing wfall, or something, which then fans out to 7 pipeline registers (actually 14, for the bytes of both sum and carry).  OK, then the solution is to use the delayed version of wfall again, which effectively adds another pipeline stage.  This will perturb the timing of the pulse-cap module a little and needs to be repaired later (the handshake might get out of sync a little with the actual change in the registers).  OK I fixed that by adding a delay in the hs_prod output.

447.03 MHz.  1st bottleneck now:  inst18|inst12|inst2|wrise_buf|prim_dff_inst~_Duplicate_8|adatasdata.  Let's add the enable fanout pipeline there too.

539.67 MHz!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Changing se_dff_re (used in shift register) to also use the Altera primitive DFF made things worse again; still above 500 but just barely.

Let's therefore try changing se_dff, se_dff_en, and se_dff_re all back to use behavioral code instead of the Altera DFF primitive, and recompile, see if that does better.  Hm, no, still just 509.68 MHz.  What if only se_dff_en uses the Altera primitive?  Now back up to 539.67 MHz.  Weird, but oh well.

OK, let me list which source files are new/changed as a result of today's work:
  • se_pulse_cap_56.vhd     (Changed to use new _pip version of se_reg_en_56.)
  • se_reg_en_56_pip.vhd   (New file; pipelines fanout of enable signal.)
  • se_dff_en.vhd                (Changed to use Altera DFFE primitive.)
Plus, of course, there's the new pipeline stage I added (two se_reg_re instances for sum and carry parts of counter) in the counter fan-out at the front of each datapath.  This may or may not still be necessary.  Each counter bit fans out to 19 (3x6 + 1) places right now.  Not sure if that's really a bottleneck, since the pulse_cap tests Darryl & I did the other day turned out to be bogus anyway.

Anyway, I think now, all we have to do is:
  1. Move the changes in these files back into the master project (Q:\) and the Dropbox.
  2. Try compiling the whole thing with those changes.  If it still fits and is now fast enough, then we're done (modulo testing) and we don't even need to use LogicLock after all.
  3. If it fits but isn't fast enough again, then we have to continue with the LogicLock work.  Make a version of the project that separates out the high-speed components from the others.  Drag the high-speed components into the root LogicLock region, and recompile.  That may be enough.  (If that doesn't work, do it again but without the low-speed components present, then add the low-speed components back in; if Incremental Compilation is turned on, then this should work.)
  4. If it doesn't even fit, due to the new pipeline registers, then we will have to figure out something to shrink to make room.  Fewer stages in synchronizer chain for pulse inputs?  Currently we have 8 stages.  This may be overkill.
First, I'll try recompiling Q:\ just with the changes in se_pulse_cap_56.vhd and its submodules, without adding the pipeline stages for fanning out the counter values.

Oops, it didn't fit, but just barely; we needed only 3 LABs more than we had on the chip!  Well, let's first try shrinking those synchronizer chains from 8 stages down to 6 stages.  That should help:  since there are 18 of them in the design, it could save us maybe as many as 36 LABs (plus a couple in the timing-sync datapath, which, by the way, needs to match, or it will throw off our time calculations by 4 ns).  I suppose whether this will fix the fitting problem depends, however, on whether the LAB usage is limited by logic or by registers, since the synchronizer chains are register-only.
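A quick sanity check of those numbers, as a throwaway Python sketch.  The chain count, stage counts and 500 MHz clock are from the notes above; the variable names are made up, and the "one LAB per removed stage" figure is only the optimistic upper bound implied by "as many as 36", not a measured result.

```python
# Back-of-envelope numbers for shrinking the synchronizer chains.
CLOCK_MHZ = 500                  # target high-speed clock
NUM_CHAINS = 18                  # synchronizer chains on the pulse inputs
OLD_STAGES, NEW_STAGES = 8, 6

stages_removed = OLD_STAGES - NEW_STAGES
period_ns = 1000 / CLOCK_MHZ     # one clock period in nanoseconds

# Optimistic upper bound: assumes every removed stage frees a whole LAB.
max_labs_saved = NUM_CHAINS * stages_removed

# Latency change that the timing-sync datapath has to match.
latency_shift_ns = stages_removed * period_ns

print(max_labs_saved, "LABs at most;", latency_shift_ns, "ns latency shift")
```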

Another idea to save space is to get rid of the 6th input path we are using in each pulse-capture datapath, since these are for the 6th DAC, but only 5 of the DACs are working on our board and we're skipping over the broken one.  We are always feeding an always-"OFF" signal in place of the last comparator output for each input channel.

OK, stubbed out the output of the 6th instance of pulse_prep_56 in pulseform_cap_56.  That should cause that instance itself, together with the logic that uses its output, to get compiled away.  This should result in a significant reduction in the resource usage of the pulseform-capture datapaths.

Compiling Q:\ again...  Taking forever...   Done.  No dice, only 271 MHz at the slow corner, and timing constraint not met at fast corner.  Faster than before, though.  Next: I need to add the pipeline registers for the counter (so far only tried them in mockup, and indeed, the bottlenecks report shows the counter output as a bottleneck), and also the design probably needs to be split into fast and slow components so we can try LogicLock again, the right way this time.  Enough for today though.

Friday, February 24, 2012

Fri., Feb. 24th

Darryl & David are supposed to be here from 2-6 pm today.  Aarmondas is supposed to be here from 4-5. I told George that he really should come in today, since he wasn't here yesterday.  I haven't seen Brian in a while.  Not sure if Juan & M. Dean are coming today or not.

Ray is supposed to be here soon with the replacement hard drive for the Acer.  David can try setting that up.

Some things that can be worked on today:
  • Performance improvement:
    • Try adding a pipeline stage in the counter fanout.  (Also, disable the path for the currently-unused 6th threshold.)
    • Try logic-locking the counter + several pulse_cap's together.
  • GPS initialization:
    • Test & debug my recent server code changes.  See if the warm-start is enough to trigger acquisition.
George came by and traced out board placements on the plywood.
Antony is here and is going to back up COSMICi's main filesystem (C:\ drive).  We ended up using the Windows utility.  He repartitioned the drive so 500 GB is for use by Windows machines.

One thing I want to check out at some point:  What speed will the old dual-edge triggered versions of our high-speed components run at in a LogicLock region?  If we can get them to run at 500 MHz, then we might be able to actually attain the true 1 ns time resolution that we were originally hoping for.

Didn't get very good results for the dual-edge-triggered carry-save counter, unfortunately (about 300 MHz).  Maybe try again later after we understand what we're doing better...

Meanwhile, we realized some things about LogicLock:

  • In some of our earlier tests yielding crazy results like 800 or 900 MHz, actually the time-critical logic was getting compiled away.  So, not all of those earlier results are actually meaningful.  The KEEP attribute does not prevent this; it only prevents internal nodes from getting compiled away by optimizations (not by getting erased entirely due to being unused).
  • If you simply take a component out of LogicLock and then put it back in, you can get a different speed result, because it can place the "Auto" region in a different place.
Mike got a compilation with 487 MHz for a mockup design including all the high-speed components in the root region, with a pipeline stage for fanning out the counter.  This is nearly fast enough, and might be fast enough at low temperatures (the slow corner didn't meet the timing constraints, but the fast corner did).  However, some caveats:  The pin assignments are not made yet, and also this is not our exact design structure for the high-speed parts, just a quick rough mockup of it.

Trying again with everything in a new Auto region.  This time got 465.12 MHz, with everything shoved over into the left-hand third of the chip.  Try again...

Trying again after deleting the region and recreating it.  Didn't include the clock input and the main output this time.  Got 465.12 again.

Let's try again with NO LogicLock at all.  This time got 487.09 again - the same as when using the root region.  I notice that the layout in this case is non-rectangular.

Finally, let's try once more with putting all the high-speed components into the root region, except this time w/o the main input/output pins included.  Same thing.

Probably a good way to proceed would be to put the real high-speed components into the root region, compile those, then add the rest of the design and recompile, and see if the speed is preserved; and then (if we are still in the ballpark of 490 MHz) test the design with cooling installed to see if it seems to be working.

Thursday, February 23, 2012

Thu., Feb. 23rd

COSMICi advising mtg @ 9:30 - Had the meeting.  Everyone was there but Brian; but he showed up right after the meeting, and says he will be there for future meetings.  It sounds like the team will be in relatively good shape for their review next Wed. @ 3:30, although the ME guys still need to finish some fabrication work before then.  Brian can't make it to the review, so he is going to take his part (the enclosure) around to the evaluators ahead of time on Tuesday so they can see it, and then bring it to lab so it will be there during the main review on Wednesday.  No word yet from EIT on opening ports for off-campus access to license server.  [ ] Ping them on this again later.


Stopped by lab around 2 pm to let Mike Dean & Darryl in to do some work on the LogicLock, started license server for them, then had to go back over to Engineering for a meeting there.  Returned a little after 4:30 about the time MD was leaving; he said the entire input-capture system didn't fit in the LogicLock region.  No big surprise, I suppose.  Oh well, we didn't need the whole thing to be in there anyway.


Met with Samad to go over his power-distribution board design.  There were a couple of issues with his capacitors, and I also suggested he might want to group the outputs by destination board rather than by voltage; this will help the wire bundles look neater.  He is going to redesign it and we'll go over it again next week; hopefully it will soon be about ready to fabricate.  Donte can make this one in-house.


Worked with Darryl for a bit as he tried LogicLock on the pulse_cap module by itself.  (That is the one that captures the rising and falling edge times for a single digital pulse.)  At first he had a problem with too many output pins but we fixed that by routing the outputs to a new VHDL module that just applied a KEEP attribute to all its inputs.  Then he found it was limiting the input clock frequency for 500 MHz but that was just an I/O pin limitation which was fixed by generating the clock using a PLL.  Then he got something like 900 MHz.  Since both the high-speed counter and pulse-cap can run at well above 500 MHz by themselves, this suggests to me that our current speed issue might just be due to the fan-out of the counter value.  Currently it fans out to 18 instances of pulse_cap.  This can be reduced to 15 since we are only using 5 of the 6 DACs at present.  However, it is still a significant fan-out.  A combinational buffer or a pipeline register at the counter input to each datapath would reduce the maximum fan-out from 15 to 5.  Of these two, the pipeline register would perform better since the combinational delay of 3+5=8 fanout-delay units for the buffer is greater than the max 5 fanout-delay units we'd get in the pipelined approach.  Anyway, we should experiment with this approach tomorrow.  It's possible that, if this is really the underlying cause of the speed problem, adding the buffering might fix it without our even having to use LogicLock regions.  However, even in this case, having experimented with LogicLock will still have been helpful in terms of letting us track down the cause of the problem.  I also think that it still might be a good idea to move the high-speed components into a LogicLock region anyway, simply because Quartus seems to do a better job of optimization within those regions.  
If we get the design just right, and eliminate the present speed bottlenecks, recent experiments suggest that in the final design, we might be able to achieve speeds well over 500 MHz, perhaps around 800 MHz.  Or, we could try again with the dual-edge-triggered version of the high-speed stuff and perhaps even hit 1 GHz.  Anyway, the possibilities look promising.
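The fan-out comparison above (direct vs. combinational buffer vs. pipeline register) can be written down as a crude model.  This is only a sketch:  it assumes delay is simply proportional to fan-out (the "fanout-delay units" used above), ignores routing and register overhead, and the function names are made up for illustration.

```python
def combinational_delay(fanouts):
    """All levels sit in one clock cycle, so their delays add up."""
    return sum(fanouts)


def pipelined_delay(fanouts):
    """Each level gets its own clock cycle, so only the worst level counts."""
    return max(fanouts)


# Counter driving 15 pulse_cap instances directly.
direct = combinational_delay([15])

# Counter -> 3 datapath buffers -> 5 pulse_cap instances each.
buffered = combinational_delay([3, 5])

# Same tree, but with a pipeline register at the front of each datapath.
pipelined = pipelined_delay([3, 5])

print(direct, buffered, pipelined)
```

In this model the pipelined version wins per cycle, at the cost of one extra clock of latency on the counter value.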


Antony is going to stop by on Friday (tomorrow) afternoon to get some soldering advice and back up COSMICi's hard drive.


Another thing I want to do tomorrow (besides help as needed with LogicLock stuff) is test my current server code for initializing the GPS module.  So far we just do a warm-start.  See if that works.



Wednesday, February 22, 2012

Wed., Feb. 22nd

Juan is here, spoke with him about the Quartus version/device issue.

He recompiled the high-speed counter in a LogicLock region with the device set properly and got close to 700 MHz!  He is still using the Classic timing analyzer though; these tests should probably be redone in TimeQuest for improved accuracy / more detailed reporting.

Samad & Aarmondas arrived, and the students are going to go have their internal team meeting in the library.

Spoke to Samad briefly about the power board design.  We are going to go over it in depth tomorrow afternoon (no class tomorrow).

Later this afternoon I will go over the startup procedure with everyone (ECE students).  They should also plan to practice it themselves soon.  Also, they should put the server (& Python 3.1) on their laptops so that they can monitor the run from there.

David didn't yet have the full version of Quartus on either his laptop (under Parallels) or in VirtualBox on Ray's iMac.  He's started those downloads.  On his laptop he needs the FAT32 version.  Meanwhile, while those are downloading he's just reading documents.

My main technical task for today is continuing to work on the remote GPS initialization code.  Let's look at where we stand at the moment:

When we start up the CTU's WiFi+DE3+GPS, we get:


HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ERR,BAD_CHK,[$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1]*25
$ERR,BAD_CHK,[$GPTXT,COSMICi Custom_Config_0.0.3]*27
$ACK,WIFI_READY*60
$GPRMC,222517.001,V,3025.676,N,08417.112,W,0.0,0.0,131211,4.1,W*70
$GPGGA,222517.001,3025.67587,N,08417.11218,W,0,00,99.0,051.73,M,-29.7,M,,*60
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

Now, for some reason the main server console is reporting this error message:

 WARNING: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...

More details from COSMICi.server.log:

2012-02-22 15:05:02,058 | COSMICi.server.model |  Thread-19:    node #0  uart0.con0.rcvr   |              model.py:2436:_handleHostMsg      |  WARNING: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...
2012-02-22 15:05:02,059 | COSMICi.server.model |  Thread-19:    node #0  uart0.con0.rcvr   |              model.py:2436:_handleHostMsg      |  WARNING: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...

Let's look at model.py line 2436...  OK, yes, we don't have a case for message type ERR yet.  Let's add one and try again.

Hm, if you start up the boards with more of a delay between them, the NMEA stream starts up muted (as I intended).  Wonder why it didn't work when I started them more quickly.  Hm.

Took Juan & Aarmondas through the startup sequence.

OK, here again is the initial output from the CTU.  This time, the Wi-Fi board got started up so long before the DE3 that we missed the WIFI_READY message.  The "ACK,UNMUTE" is in response to an UNMUTE command typed by the user to unmute the NMEA pass-through.

HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ERR,BAD_CHK,[$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1]*25
$ERR,BAD_CHK,[$GPTXT,COSMICi Custom_Config_0.0.3]*27
$ACK,UNMUTE*77
$GPRMC,222829.029,V,3025.676,N,08417.112,W,0.0,0.0,131211,4.1,W*7A
$GPGGA,222829.029,3025.67587,N,08417.11218,W,0,00,99.0,051.73,M,-29.7,M,,*6A
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

Initial server console errors are now:

   ERROR: SensorHost._handleHostMsg(): Sensor host reports a BAD_CHK error with data [[$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1]].
   ERROR: SensorHost._handleHostMsg(): Sensor host reports a BAD_CHK error with data [[$GPTXT].

At least this looks a little bit less silly.  We should probably, however, at some point modify the CTU firmware to not bother reporting a BAD_CHK error if the NMEA checksum is entirely missing, but only if it's actually present but the value mismatches.
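Here is the policy I have in mind, sketched in Python rather than in the actual CTU firmware (which isn't shown here).  The checksum rule is the standard NMEA one (XOR of every character between the "$" and the "*"); the function names and the "NO_CHK" label are made up for illustration.

```python
from functools import reduce


def nmea_checksum(body):
    """XOR of all characters between '$' and '*' (standard NMEA rule)."""
    return reduce(lambda acc, ch: acc ^ ord(ch), body, 0)


def classify(sentence):
    """Return 'OK', 'NO_CHK' or 'BAD_CHK' for one line of GPS output."""
    line = sentence.strip()
    if line.startswith("$"):
        line = line[1:]
    body, star, given = line.rpartition("*")
    if not star:
        return "NO_CHK"          # no checksum field at all: not an error
    try:
        expected = int(given, 16)
    except ValueError:
        return "BAD_CHK"         # '*' present but not followed by hex digits
    return "OK" if nmea_checksum(body) == expected else "BAD_CHK"


print(classify("$ACK,WIFI_READY*60"))
print(classify("$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1"))
```

With this rule the two header lines that lack a checksum would pass through quietly, and only a present-but-wrong value would be reported.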

Some more server console output:

|------------------------------------------------------------|
|  Node 1 log started.                                      |
|VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV|
Node 1 reports its MAC address is 00:1E:3D:33:E9:42.
Node 1 turned on at Wed Feb 22 15:47:08 2012 + 380 ms.
Starting AUXIO server for node 1 on port 52738...
Starting UART server for node 1 on port 63767...
Node 1 reports that its bridging mode has changed to NONE.
Node 1's bridge mode is now NONE.
Node #1's host (type FEDM, firmware version v0.10) is starting up...
Node 1 reports that its bridging mode has changed to TREFOIL.
Node 1's bridge mode is now TREFOIL.
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [DAC_LEVELS]. Ignoring...
Node #1's host is ready to accept commands.
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [FIFO_FULL]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [NC_PULSES]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [NC_PULSES]. Ignoring...
   ERROR: SensorHost._handleHostMsg(): Sensor host reports a UNK_CMD error with data [WIFI_STARTING].
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [NC_PULSES]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [NC_PULSES]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [NC_PULSES]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [PULSE]. Ignoring...
 WARNING: SensorHost._handleHostMsg(): Unknown host message type [PULSE]. Ignoring...
...

The data that generated this was (with blank lines removed):

HOST_STARTING,FEDM,v0.10
DAC_LEVELS,-0.200,-2.500,-0.299,-0.447,-0.669,-1.000
HOST_READY
FIFO_FULL,3,1
NC_PULSES,121360254,153,51,201
NC_PULSES,239902156,120,70,205
ACK,WIFI_STARTING,v0.19
ERR,UNK_CMD,WIFI_STARTING,v0.19
ACK,WIFI_READY
NC_PULSES,355829245,142,80,200
NC_PULSES,457349235,92,69,200
NC_PULSES,587443282,157,70,204
PULSE,0,0,3,1,632470702,3,(0,(1,(2,2),4),4)
PULSE,0,0,2,1,632470705,1,(0,7)

The "Unknown host message type" warning messages are all happening because I haven't yet implemented the ShowerDetectorHost subclass of SensorHost, or mutated the .sensor_host instance into it, and so the _handleHostMsg() method call is still getting handled by the SensorHost base class, which of course doesn't know about all these message types, since of course they are specific to the ShowerDetectorHost.  But, we'll worry about all that later; for now we need to finish working on the GPS initialization.
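For reference when I get back to this, a stripped-down sketch of the subclassing idea.  Only the names SensorHost, ShowerDetectorHost and _handleHostMsg come from the real code; the handler table, the attribute names and the bodies are hypothetical stand-ins, not what is in model.py.

```python
class SensorHost:
    """Base class: only knows the generic host messages."""

    def __init__(self):
        self.warnings = []
        self.ready = False

    def _handlers(self):
        # Message type -> bound method.  Subclasses extend this table.
        return {"HOST_READY": self._on_ready}

    def _on_ready(self, fields):
        self.ready = True

    def _handleHostMsg(self, line):
        msg_type, _, rest = line.partition(",")
        fields = rest.split(",") if rest else []
        handler = self._handlers().get(msg_type)
        if handler is None:
            self.warnings.append(
                "Unknown host message type [%s]. Ignoring..." % msg_type)
            return False
        handler(fields)
        return True


class ShowerDetectorHost(SensorHost):
    """FEDM-specific host: adds the detector message types."""

    def _handlers(self):
        table = super()._handlers()
        table["DAC_LEVELS"] = self._on_dac_levels
        return table

    def _on_dac_levels(self, fields):
        self.dac_levels = [float(f) for f in fields]


msg = "DAC_LEVELS,-0.200,-2.500,-0.299,-0.447,-0.669,-1.000"
host = SensorHost()
host._handleHostMsg(msg)               # base class: logs a warning
host.__class__ = ShowerDetectorHost    # "mutate" the existing instance
host._handleHostMsg(msg)               # subclass: handled
```

Reassigning __class__ is one way to do the "mutation" once HOST_STARTING has told us the host type; building a fresh ShowerDetectorHost and swapping it into the node would be the more conventional alternative.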

I think first I want to break model.py up into several files as it's getting a little unwieldy (2,664 lines).  Here are some thoughts about how to break it up into modules:
  • model - Defines SensorNet, SensorNode, SensorHost?
  • wifi - Defines WiFi_Module
  • ctu - Defines CTU_Node, CTU_Host
  • gps - Defines GPS_Module, GPS_Manager
  • fedm - Defines Detector_Node, ShowerDetector_Node, Detector_Host, ShowerDetector_Host
Working on the split.  Of course, as usual, having multiple interdependent cross-referencing modules creates lots of weird problems having to do with the module loading order.  I finally got it to work by importing wifi, ctu, & fedm modules at the end of model.py (instead of at the beginning).  That's necessary for ctu & fedm modules, at least, since they inherit from the SensorHost class defined in model.py, so that class definition has to be loaded before those derived classes get defined.

At home working on GPS initialization code.  I'm wondering now if communicating between threads via an update flag is the right approach.  Because wouldn't we need a separate flag for each type of message?  It might be cleaner to have a subscription system, in which the GPS manager "subscribes" to specific types of messages, via registering a callback, and then the GPS proxy "publishes" announcements of the messages, calling the callback.  But then, if we still have a separate thread for the GPS manager, the callback routines will still have to signal that thread using flags or something similar.  So it doesn't really save us any trouble, I think...  OK, I ended up creating an "Inbox" abstraction that tracks the most recent value of a given type of data record and alerts whoever's interested in that record about updates by waving a flag.
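Roughly, the Inbox idea looks like the sketch below.  This is a minimal illustration built on threading.Event, not the actual class in gps.py; the method names post() and wait() are assumptions made for the example.

```python
import threading


class Inbox:
    """Keeps the latest record of one type and raises a flag on update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flag = threading.Event()
        self._latest = None

    def post(self, record):
        """Called by the receiving thread when a new record arrives."""
        with self._lock:
            self._latest = record
            self._flag.set()

    def wait(self, timeout=None):
        """Block until an update arrives; return it, or None on timeout."""
        if not self._flag.wait(timeout):
            return None
        with self._lock:
            self._flag.clear()
            return self._latest


# One inbox per message type the GPS manager cares about.
gga_inbox = Inbox()
sender = threading.Thread(target=gga_inbox.post, args=("GPGGA record",))
sender.start()
sender.join()
print(gga_inbox.wait(timeout=1.0))
```

Because only the most recent value is kept, a slow consumer skips stale records instead of queueing them up, which seems right for things like position fixes.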

Made some progress on gps.py.  I'm wondering now if it might not work simply to do a warm start so that the ephemeris gets invalidated.  (The current problem might be that it thinks the ephemeris from December is still valid even though it isn't.)  The almanac should still be good, since it was last working less than 6 months ago (specifically, last December).  I could even try doing this manually, with a "HOST GPS $PDME,1" command line typed in node 0's UART bridge terminal window.  If that works reliably to trigger the initial acquisition, then it makes initialization extremely simple.  I have already written the code to send the warm-start command (although not yet tested).  Test this in lab tomorrow (after the faculty meeting).

Tuesday, February 21, 2012

Tue., Feb. 21st

Darryl is supposed to be here from 2-6 pm today, and Michael Dean from 2:30-5:30.  Their plan is to continue working on LogicLock stuff.  Also, Michael Dean wants a demonstration on how to run the system so he can take notes to prepare to run a demo during the Midterm HW/SW Review.

Earlier I emailed Samad some suggestions for improvements in his design for the power distribution board. He showed me his layout yesterday; it is looking pretty good.  I suggested he add some bypass capacitors for low-pass filtering, and make sure that all same-voltage nodes are tied together.

For myself, I have two goals for today:
  • [ ] Get set up to run the demo for Michael D.  Do a practice run first.
  • [ ] Continue working on the server-side code to remotely initialize the GPS.  Since ongoing development work might interfere with the functionality of the demo, M.D. should probably make a snapshot of the server code before I start making changes.  He can then run the server on his laptop.
First, on the demo setup:
  • I am plugging the GPS kit and the Wi-Fi board into COSMICi (Dell PC) via USB to supply power.  I will leave their power switches off until we are ready to run the demo.  This is because powering them through the DE3 board currently doesn't work reliably.  Once Samad finishes his new power distribution board, we can power all components from that.
  • Archiving current log files to new folder "Dropbox/COSMICi_devel/Server Code/data/logs 2012-02-21."
  • Starting server.  Minimizing the Python process's Windows console window.
  • Going to quickly power-on the DE3 board, GPS module, and Wi-Fi board at close to the same time (to simulate them all being powered on at the same time through the power distribution board).  Something went wrong; we never got the critical HOST_STARTING message that identifies this as the CTU.  I think I powered up the boards in the wrong order.  Should perhaps power on the Wi-Fi board first, then the DE3, then the GPS.  Let's try again...  Restarting server...
Meanwhile, MD and Darryl are upgrading their Quartuses to (9.1) SP2 so that they can more easily share files with me, David & Juan who are all on SP2 (I'm not sure about Aarmondas).  Darryl is going to redo his counter test with inputs connected to see if it still hits 612 MHz.  Meanwhile, MD is going to try logic-locking everything but the CPU and see if he can get over 500 MHz.  
  • Now I am getting a BadChecksum exception on the $PDMEHEADER1 line which the GPS generates on powerup...  It starts with a garbage character (Œ) so that is probably why.  The exception is then causing the whole UART bridge connection request handler instance to crash.  Need to catch this exception somewhere.
Here's the output (node0.uart.trnscr) before the crash:

Tue Feb 21 14:53:49 2012 + 769 ms: < HOST_STARTING,CTU_GPS,1.9
Tue Feb 21 14:53:49 2012 + 774 ms: < HOST_READY
Tue Feb 21 14:53:49 2012 + 779 ms: < $ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7

The Python stack trace is:

Traceback (most recent call last):
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\communicator.py", line 1795, in handle
    msg = Message(data, self.conn, thetime)
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\communicator.py", line 397, in __init__
    if (conn != None): conn._announce(self)
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\communicator.py", line 823, in _announce
    h.handle(msg)                       # Tell it to handle the message.
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\model.py", line 1238, in handle
    this.wifi_module.node.sensor_host.sentMessage(msg)
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\model.py", line 2268, in sentMessage
    this._handleMsg(msg)        # Dispatch to message handler.
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\model.py", line 2574, in _handleMsg
    msgWords = this._parseMsg(msg)  # Parse the message into fields.
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\model.py", line 2471, in _parseMsg
    msgStr = nmea.stripNMEA(msgStr)     # Warning: This may throw an nmea.BadChecksum exception.
  File "C:\Users\Mike\Documents\My Dropbox\COSMICi_devel\Server Code\src\nmea.py", line 100, in stripNMEA
    raise BadChecksum       # Raise a "bad checksum" exception.
nmea.BadChecksum
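The kind of guard needed here, as a minimal sketch.  BadChecksum and strip_nmea() below are hypothetical stand-ins for nmea.BadChecksum and nmea.stripNMEA() (whose real bodies aren't shown in these notes), and parse_msg() is a simplified stand-in for _parseMsg(); the point is only that the exception gets caught at the parsing step, so one bad line is logged and dropped instead of killing the connection handler.

```python
class BadChecksum(Exception):
    """Stand-in for nmea.BadChecksum."""


def strip_nmea(line):
    """Hypothetical stand-in for nmea.stripNMEA(): check and strip '$'/'*hh'."""
    stripped = line.lstrip("$")
    body, star, given = stripped.rpartition("*")
    if not star:
        return stripped              # no checksum field; nothing to verify
    calc = 0
    for ch in body:
        calc ^= ord(ch)              # standard NMEA XOR checksum
    if given.upper() != "%02X" % calc:
        raise BadChecksum(line)
    return body


def parse_msg(line, errors):
    """Split one host line into fields; log and skip it on a bad checksum."""
    try:
        text = strip_nmea(line)
    except BadChecksum:
        errors.append("Checksum failed on line [%s]; ignoring line..." % line)
        return None
    return text.split(",")


errors = []
print(parse_msg("$ACK,WIFI_READY*60", errors))
print(parse_msg("$ACK,WIFI_READY*00", errors), errors)
```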

OK, that's fixed; now we get:

HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7
$ERR,BAD_CHK,[$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1]*25
$ERR,BAD_CHK,[$GPTXT,COSMICi Custom_Config_0.0.3]*27
$ACK,WIFI_READY*60
$GPRMC,212148.005,V,3025.676,N,08417.112,W,0.0,0.0,161211,4.1,W*7C
$GPGGA,212148.005,3025.67587,N,08417.11218,W,0,00,99.0,051.73,M,-29.7,M,,*69
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

But we also get the following errors:

   ERROR: SensorHost._parseMsg(): Checksum failed on line [$ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7]; ignoring line...
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [$ERR]. Ignoring...
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [ERR]. Ignoring...

Damn, I just realized that everyone has been using a screwed-up version of the overall Quartus FEDM_code project with "AUTO" device selected instead of the actual device part number.  This might have got screwed up on Dropbox after someone migrated to SP2, since it is screwed up in my machine now too (now that I upgraded and rewrote the database).  I reset the device selection to the correct one (EP2S30F484C3N) & re-imported the pin assignments from my backup copy of the project (pre-SP2-migration) of Feb. 15th.  Now I am recompiling the design to make sure everything still fits.  Then I am going to re-upload the project to Dropbox.  Everyone needs to make sure they are doing their LogicLock experiments with the actual device selected!  AUTO results are not necessarily applicable to our real chip!  (Although they may be close, since we have the fastest speed grade.)  But fitting-related results definitely are not applicable to the actual chip, since the Stratix II that we have isn't the largest one.

Also:  You can't use Web Edition because it does not have support for the specific Stratix II part numbers.

This time we got:

HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ERR,BAD_CHK,[Œ$PDMEHEADER1: DeLORME GPS2058_HW_1.0.1]*A7
$ERR,BAD_CHK,[$PDMEHEADER2: DeLORME GPS2058_FW_2.0.1]*25
$ERR,BAD_CHK,[$GPTXT,COSMICi Custom_Confi]*2C
$ACK,WIFI_STARTING,v0.19*67
$ACK,WIFI_READY*60
$GPRMC,212153.010,V,3025.676,N,08417.112,W,0.0,0.0,161211,4.1,W*72
$GPGGA,212153.010,3025.67587,N,08417.11218,W,0,00,99.0,051.73,M,-29.7,M,,*67
$PDMETRAIM,2,0,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0*43
$PDMEPOSHOLD,0,0000.000,N,00000.000,E,000.00*4A
...

Not sure why the $GPTXT message got cut off.  Oh well...

OK, got both boards working together, & went through the startup sequence with Michael Dean & Darryl.  MD took notes.  Basically, in summary, the sequence is:

1.  Plug in detectors.
2.  Start server app.
3.  Power up CTU boards:  (a) Wi-Fi, (b) DE3, (c) GPS.
4.  Power up FEDM (make sure fan is present).
5.  Type "HOST START" in the CTU's UART bridge window.  

Monday, February 20, 2012

Mon., Feb. 20th

Over the weekend, I rescheduled the weekly advising meetings with the Senior Design students to Thursday @ 9:30 am (from same time Tuesday) to help Juan avoid a time conflict with his RA job.

David got email assistance from Altera for his issues installing Quartus under Parallels on his Mac laptop; he is downloading the correct version now, and he says he will be here starting at 2:00 pm.  Hopefully, some of the Senior Design students will also be here so that he can help them work on the LogicLock task.  Juan is here now.  He is currently planning to work 12:00 - 4:00 pm Mondays and Wednesdays and to work in the science library if the lab is not open when he gets here.  We noticed that the "COSMICi Calendar" was not visible to him for some reason.  I explicitly gave each of the project members access to this calendar, and emailed instructions for accessing it to the group.  David has now added his hours there as well.  Aarmondas arrived and I added his info as well.  We still need Michael Dean's up-to-date contact info and lab schedule - got it from Aarmondas, adding it to the blog & calendar.

My own main technical goal for today:  Continue working on the server-side code to remotely initialize the GPS module to the correct current time & location (& any other initialization that is needed to help it acquire satellites & establish a time lock).

Also, need to install the Quartus service packs.  Trying the SP2 install now...  It completed with no errors, except for a USB driver where it said that the already-installed version was newer so I told it not to update that driver.  Now, opening the FEDM_code project (Q:\COSMICi_FEDM.qpf), and letting it update the database files.  Same with the GPS app project (C:\f\DE3\S3\SB+SOPC\GPS_FPGA_app\Quartus_II_Project/DE3_GPSapp/DE3_GPSapp.qpf).  Next time I do a compile, hopefully everything will work.

David, Juan & Aarmondas together made some good progress on the LogicLock task - they now have the high-speed counter in a LogicLock region with analyzed fmax above 500 MHz (actually almost 600).  Aarmondas is working on doing the same for the edge-capture module in the timing-sync datapath, and Darryl will work on the pulse-capture module in the pulseform-capture datapath.

Emailed the students some suggestions for the midterm HW/SW review.

Now:  I want to make sure the system is ready for the students to get trained on how to start it up whenever they need to.

Burned the latest version of the autorun script (with nodeid.txt #1) to the Wi-Fi board for the FEDM (#1).  The board for the CTU is the one labeled #3 (its internal node ID is #0).

Tomorrow I will do a test with both subsystems (CTU+FEDM) together (haven't done that in a little while). (Most of today got eaten up helping manage/guide the students, so I didn't get much done myself...  May work from home tonight though.)

Saturday, February 18, 2012

Fri., Feb. 17th

Met with Senior Design students at 2:30 pm at COE.  Group secretary should post minutes of that meeting.

Was in lab from about 3:30 - 4:30.  Darryl & David were there.  David is downloading Quartus to his Windows XP installation under Parallels on his Mac laptop - the first version he tried didn't work, so he is trying another one.  Meanwhile Darryl is experimenting with LogicLock on his laptop.  He says that the timing analyzer gets an fmax of 611 MHz for the high-speed counter by itself, which is in line with the experiments we did last year (we had successfully run the counter by itself at 600 MHz).  It might also be worth trying the dual-edge triggered one again.  Alternatively, we could effectively double the frequency to 1.2 GHz by just using the clock signal itself as the low bit of the counter (but in order for that bit to actually be measurable, we'd have to have a double-edge-triggered version of the pulse-capture circuit that runs at that speed).
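
As a rough check on the "clock as low bit" idea, here is a small Python sketch of the arithmetic. The function names are made up for illustration, and it assumes the counter increments on the rising clock edge, so the clock is high during the first half of each count and low during the second half.

```python
# Hypothetical illustration only; names are not from the FEDM code.

def resolution_ns(f_clk_hz, use_clock_as_lsb):
    """Time resolution (ns) of a captured timestamp."""
    ticks_per_second = f_clk_hz * (2 if use_clock_as_lsb else 1)
    return 1e9 / ticks_per_second


def half_period_stamp(count, clk_is_high):
    """Combine the counter value with the sampled clock level.

    Assumes a rising-edge counter: clock high = first half of the
    count period, clock low = second half.
    """
    return (count << 1) | (0 if clk_is_high else 1)
```

At 600 MHz this gives about 1.67 ns per tick for the counter alone and about 0.83 ns with the extra bit, which is the "effective 1.2 GHz" figure. The catch noted above still applies: the capture circuit has to sample the clock level reliably at that speed.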

I added Darryl & David to the group on Blackboard to help them coordinate with the Senior Design students using the group email, group blog, etc.  I also wrote a message to Aarmondas asking him to please coordinate a schedule for the ECE students Juan & Michael Dean to be in the lab working with Darryl & David on the LogicLock stuff.

Some things for me to do on Monday:

[ ] Try installing the Service Packs (SP1 & SP2) for 9.1 to facilitate letting us all use the shared network drive Q:\ (\\COSMICi\shared\FEDM_code\q91).  The downloads should have finished before I left Wednesday.

Thursday, February 16, 2012

Thu., Feb. 16th

Spent today at my other job.  Will meet with COSMICi Senior Design team at the E-school tomorrow at 2:30 pm (need to reschedule future meetings to an earlier time if possible).

Working from home a little tonight.  Plan: See how far I can get through the new SensorHost model.

In Dropbox/COSMICi_devel/Server Code/src.  In model.py.  In WiFi_Module._UART_MsgHandler.handle().  Let's comment out the early return and see what happens.

First, archiving the current log files in a new folder with today's date in Server Code/data.

OK, now running the server.  It's started.

Started UwTerminal on COM1:; now powering up the Wi-Fi module...  OK, the TikiTerm windows for the three (MAIN/AUXIO/UART) connections to the server from the module are up.

Now, in the UwTerminal, let's try manually typing just the first line of CTU output from yesterday's test run:

HOST_STARTING,CTU_GPS,1.9

OK, straightforward error in handle():  The node has no .sensor_host attribute set.  We can fix that.  A placeholder SensorHost object ought to be created when the node is created.  Its initializer just needs the node.  OK, SensorNode's initializer now creates the .sensor_host sub-object.  Let's try again...

OK, now it's complaining that SensorHost has no .sentMessage() method.  I didn't write that yet?  Ah, I had done so in ShowerDetectorHost, moving that definition up to SensorHost.  (It just dispatches to the _handleMsg() method.)  Let's try again...
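
For reference, the shape of the two fixes above is roughly as follows. This is a stripped-down sketch, not the actual model.py code: the class and method names come from the notes, but the constructor arguments and the returned strings are placeholders invented for illustration.

```python
class SensorHost:
    """Placeholder proxy for whatever host is attached to a node."""

    def __init__(self, node):
        self.node = node          # the initializer only needs the node

    def sentMessage(self, msg):
        """Called when the host has sent a message; just dispatches."""
        return self._handleMsg(msg)

    def _handleMsg(self, msg):
        # Base-class fallback; subclasses override this.
        return "unhandled: " + msg


class ShowerDetectorHost(SensorHost):
    def _handleMsg(self, msg):
        return "shower detector got: " + msg


class SensorNode:
    def __init__(self, node_id):
        self.node_id = node_id
        # Create the placeholder up front so .sensor_host always exists.
        self.sensor_host = SensorHost(self)
```

With sentMessage() defined on the base class, any host subclass only has to override _handleMsg().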

Hm, for some reason we are getting a SYN (0x16) character:


2012-02-16 21:14:21,733 | COSMICi.server.comm  |  Thread-19:    node #0  uart0.con0.rcvr   |       communicator.py: 816:_announce           |    DEBUG: Connection._announce():  Announcing incoming message [ ] to our message handlers...
2012-02-16 21:14:21,733 | COSMICi.server.comm  |  Thread-19:    node #0  uart0.con0.rcvr   |       communicator.py: 821:_announce           |    DEBUG: Connection._announce():  Announcing incoming message [ ] to a [std.brdg] message handler...
2012-02-16 21:14:21,733 | COSMICi.server.comm  |  Thread-19:    node #0  uart0.con0.rcvr   |       communicator.py: 821:_announce           |    DEBUG: Connection._announce():  Announcing incoming message [ ] to a [Wi-Fi.UART] message handler...
2012-02-16 21:14:21,733 | COSMICi.server.model |  Thread-19:    node #0  uart0.con0.rcvr   |              model.py:1221:handle              |    DEBUG: WiFi_Module._UART_MsgHandler.handle(): The Wi-Fi module relayed the message [ ] from the sensor host to the server.
2012-02-16 21:14:21,733 | COSMICi.server.model |  Thread-19:    node #0  uart0.con0.rcvr   |              model.py:2412:_handleHostMsg      |    DEBUG: SensorHost._handleHostMsg(): Handling host message [['\x16']]...


2012-02-16 21:14:21,733 | COSMICi.server.model |  Thread-19:    node #0  uart0.con0.rcvr   |              model.py:2427:_handleHostMsg      |    ERROR: SensorHost._handleHostMsg(): Unknown host message type [ ]. Ignoring...
2012-02-16 21:14:21,733 | COSMICi.server.comm  |  Thread-19:    node #0  uart0.con0.rcvr   |       communicator.py: 826:_announce           |    DEBUG: Connection._announce():  Finished announcing incoming message [ ] to message handlers...

Also, for some reason I have to hit Enter twice, still haven't figured that out... ^M^J (CR/LF) at end of line seems to work better...  But then I get _handleHostStarting() not defined...  Ah, I forgot the "this."  Try again...  OK, that's better, now on server console I get the following:

Node #0's host (type CTU_GPS, firmware version 1.9) is starting up...

Let's now try the next line...

HOST_READY

This does nothing observable (all it does is toggle a couple of flags).  Add diagnostic statement...  Ok, now it shows:


Node #0's host is ready to accept commands.


Meanwhile, on the line-end issue...  It's starting to come back to me...  After the end-of-line character, we are waiting to see what the next character will be before proceeding, so that if it's another end-of-line character we can ignore it.  This could be programmed better.  However, I am currently stuck using io.TextIOWrapper, which has this limitation.  Bleh.
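
One way this "could be programmed better" is sketched below. It is an idea only, not what the server does today: a hypothetical LineSplitter class that emits a line as soon as the first terminator arrives, and remembers that a CR was just seen so that an LF arriving later is silently dropped instead of being waited for.

```python
class LineSplitter:
    """Splits a character stream into lines on CR, LF, or CRLF.

    A line is emitted as soon as its first terminator arrives; an LF
    that immediately follows a CR is swallowed when (and if) it shows up.
    """

    def __init__(self):
        self._buf = []
        self._last_was_cr = False

    def feed(self, text):
        """Consume text and return the list of lines it completed."""
        lines = []
        for ch in text:
            if ch == "\n" and self._last_was_cr:
                self._last_was_cr = False   # second half of a CRLF pair
                continue
            self._last_was_cr = (ch == "\r")
            if ch in "\r\n":
                lines.append("".join(self._buf))
                self._buf.clear()
            else:
                self._buf.append(ch)
        return lines
```

For example, feed("HOST_READY\r") returns the line immediately, and a following feed("\n") returns nothing. Wiring this in would mean reading the connection without io.TextIOWrapper's newline handling, which is a bigger change than I want to make right now.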

Anyway, for now I am working around by streaming a file "mock_CTU_output.txt" instead of typing the lines by hand.  The file contents are:


HOST_STARTING,CTU_GPS,1.9
HOST_READY
$ACK,WIFI_STARTING,v0.19*67
$ACK,WIFI_READY*60
$GPRMC,212143.013,V,3025.676,N,08417.112,W,0.0,0.0,161211,4.1,W*70


and now from the server I get in response:


Node #0's host (type CTU_GPS, firmware version 1.9) is starting up...
Node #0's host is ready to accept commands.
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [ACK]. Ignoring...
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [ACK]. Ignoring...
   ERROR: SensorHost._handleHostMsg(): Unknown host message type [GPRMC]. Ignoring...


So, that is pretty much what's expected. I haven't implemented these other messages yet so of course they are "unknown message".  Now slogging through code to change class to CTU_Host and then dispatch GPRMC to the GPS module proxy...
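
The dispatch step can be sketched roughly as below. The helper names (nmea_checksum_ok, split_host_message, HostDispatcher) are hypothetical and not the real model.py structure. The checksum rule used is the usual NMEA one (XOR of the characters between "$" and "*"), which the two $ACK lines in the mock file also satisfy.

```python
from functools import reduce


def nmea_checksum_ok(line):
    """True if line is '$<body>*<hh>' and hh is the XOR of body's chars."""
    if not line.startswith("$") or "*" not in line:
        return False
    body, _, given = line[1:].rpartition("*")
    computed = reduce(lambda acc, ch: acc ^ ord(ch), body, 0)
    try:
        return computed == int(given, 16)
    except ValueError:
        return False


def split_host_message(line):
    """Return (msg_type, fields) for a bare or '$'-prefixed line."""
    body = line.strip()
    if body.startswith("$"):
        # Drop the leading '$' and the trailing '*hh' checksum, if any.
        body = body[1:].rpartition("*")[0] or body[1:]
    msg_type, *fields = body.split(",")
    return msg_type, fields


class HostDispatcher:
    """Routes host messages to handlers registered by message type."""

    def __init__(self):
        self._handlers = {}
        self.log = []

    def register(self, msg_type, handler):
        self._handlers[msg_type] = handler

    def handle(self, line):
        msg_type, fields = split_host_message(line)
        handler = self._handlers.get(msg_type)
        if handler is None:
            self.log.append(
                "Unknown host message type [%s]. Ignoring..." % msg_type)
            return None
        return handler(fields)
```

Registering a handler under "GPRMC" then sends those lines to the GPS module proxy, and anything unregistered falls through to the "unknown" log line as before.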

OK, after debugging for a while (during which I learned not to begin a method name with double-underscore!) we are now making it all the way into the GPS_Module._handleGPRMC() method.  That's where I left off the other day.  It pulls data out of the fields but does nothing with it.  It should probably store the data as module state information, and raise some flag that the GPS_Manager thread will be watching (am I even starting that thread yet?).  Of course, yon GPS manager still needs to be written.
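
For reference, pulling the interesting fields out of a GPRMC sentence looks roughly like this. It is a standalone sketch with hypothetical function names, not the actual _handleGPRMC() body. It assumes the standard RMC field order (time, status, latitude, N/S, longitude, E/W, speed, course, date) and that the time field carries a fractional part, as it does in the mock file.

```python
from datetime import datetime, timezone


def _to_degrees(ddmm, hemisphere):
    """Convert NMEA (d)ddmm.mmm plus hemisphere to signed degrees."""
    value = float(ddmm)
    degrees = int(value // 100)
    minutes = value - degrees * 100
    signed = degrees + minutes / 60.0
    return -signed if hemisphere in ("S", "W") else signed


def parse_gprmc_fields(fields):
    """Parse the comma-separated fields that follow 'GPRMC'."""
    hhmmss, status, lat, ns, lon, ew, _speed, _course, ddmmyy = fields[:9]
    # Date is ddmmyy and time is hhmmss.sss, both in UTC.
    stamp = datetime.strptime(ddmmyy + hhmmss, "%d%m%y%H%M%S.%f")
    return {
        "utc": stamp.replace(tzinfo=timezone.utc),
        "valid": status == "A",     # 'V' means the fix is not valid
        "lat_deg": _to_degrees(lat, ns),
        "lon_deg": _to_degrees(lon, ew),
    }
```

Applied to the GPRMC line in the mock file, this yields a date of 2011-12-16, a not-valid fix, and a position near 30.428 N, 84.285 W.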

In a bit more detail, probably we should stuff all the GPRMC data into attributes of a new GPRMC_Record object, and install that as the value of the gps_module.last_GPRMC_record attribute, also tagged with the system time at which yon data was received.  Then raise a flag gps_module.got_GPRMC.  The GPS manager will be waiting for that flag to be raised, and will go fetch the last_GPRMC_record.  It will parse out the fields that it's interested in, and notice that the date/time is way off from the system time (NIST-slaved via NTP), and then this will trigger it to inform the model/proxy of the correct date & time; the gps_module object will translate this info to an appropriate $PDME command which it will pass to the CTU_host proxy, which will pass it (prefixed with the new "HOST" command name) to the WiFi_Module proxy, which will pass it to the UART bridge connection, which will send it to the real Wi-Fi module, which will relay it to the real host, which will strip off "HOST" and send it to the GPS.  Ta-dah!  That's not so hard, is it?  Not really.  Maybe another day of coding and a day of testing.  And once that infrastructure is in place, adding more code to send other kinds of commands to the GPS (and process other kinds of incoming GPS messages) will be easy.
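
A minimal sketch of the record-plus-flag handoff follows, using threading.Event for the flag. The attribute names last_GPRMC_record and got_GPRMC are the ones proposed above. Everything else (post_GPRMC, wait_for_record, clock_needs_correction, the 2-second tolerance) is a placeholder assumption, and the $PDME command formatting is deliberately left out.

```python
import threading
from datetime import datetime, timedelta, timezone


class GPRMC_Record:
    def __init__(self, utc, received_at):
        self.utc = utc                  # date/time reported by the GPS
        self.received_at = received_at  # system time when it arrived


class GPS_Module:
    def __init__(self):
        self.last_GPRMC_record = None
        self.got_GPRMC = threading.Event()

    def post_GPRMC(self, gps_utc):
        """Store the latest record, tag it with system time, raise the flag."""
        self.last_GPRMC_record = GPRMC_Record(
            gps_utc, datetime.now(timezone.utc))
        self.got_GPRMC.set()


def wait_for_record(gps_module, timeout=None):
    """What the manager thread would do each time around its loop."""
    if not gps_module.got_GPRMC.wait(timeout):
        return None                     # timed out, nothing new
    gps_module.got_GPRMC.clear()
    return gps_module.last_GPRMC_record


def clock_needs_correction(record, tolerance=timedelta(seconds=2)):
    """True if the GPS time is further than tolerance from system time."""
    return abs(record.utc - record.received_at) > tolerance
```

The manager only ever looks at the most recent record, so a record that arrives between wait() and clear() simply replaces the previous one. That seems fine here, since only the latest time matters.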

That's enough for tonight...