Tuesday, January 31, 2012

Tue., Jan. 31st

Some things to do today:
  • [/] Give feedback on paper
  • [ ] Modify CTU firmware to allow server to control GPS remotely.
  • [ ] Write server-side code to massage GPS into behaving properly.
Today is the primary election day, but I forgot to vote on our county ballot measures before I left home this morning!  Oops!

Just copied the current files in Q:\ and Q:\software_v4\FEDM_ctrl_fw to the corresponding locations on Dropbox.   (in FEDM_code\q91)

Also copied the latest version of the DE3 GPS application to Dropbox, from:

  • C:\f\DE3\S3\SB+SOPC\GPS_FPGA_app\Quartus_II_Project\DE3_GPSapp
to:

  • C:\Users\Mike\Documents\My Dropbox\FAMU\COSMICi\GPS_FPGA_app\Quartus_II_Project\DE3_GPSapp
I also made a local backup of it in:

  • C:\LOCAL\GPS_FPGA_app\Quartus_II_Project\DE3_GPSapp
This is because I'm going to begin making changes to it, and the students might be able to help.

Current revision is RevA.  I'm going to make a new revision called COSMICi_DE3_GPSapp_RevC.

Let's look at the UART peripheral interface to see if we can set its baud rates.  The two UART peripheral interfaces are:

  • uart_1 (WiFi) - Baud rate 115,200; baud rate can be changed by software; CTS/RTS are included.
  • uart_2 (GPS) - Baud rate 57,600; baud rate can be changed by software; CTS/RTS are not included.
IIRC, uart_1 was for talking to the Wi-Fi, and uart_2 for talking to the GPS.  Let me check.  Yes, the comments in DE3_GPSapp.v (top-level file) and GPSapp.v (main application file) make this clear.

OK, so now I need to figure out how to actually change the baud rate in software.  Let's look at the docs for the UART peripheral interface.  Was that in the Nios II Software Developer's Handbook?  Or the Embedded Design Handbook?  Hm, or the Embedded Peripherals IP User Guide?  Aha, that's it.

Here's the material on changing the baud rate:


Baud Rate Options
The UART core can implement any of the standard baud rates for RS-232 connections.  The baud rate can be configured in one of two ways:
■ Fixed rate—The baud rate is fixed at system generation time and cannot be changed via the Avalon-MM slave port.
■ Variable rate—The baud rate can vary, based on a clock divisor value held in the divisor register.  A master peripheral changes the baud rate by writing new values to the divisor register.
The baud rate is calculated based on the clock frequency provided by the Avalon-MM interface.  Changing the system clock frequency in hardware without regenerating the UART core hardware results in incorrect signaling.

Baud Rate (bps) Setting
The Baud Rate setting determines the default baud rate after reset.  The Baud Rate option offers standard preset values.
The baud rate value is used to calculate an appropriate clock divisor value to implement the desired baud rate.  Baud rate and divisor values are related as shown in Equation 7–1 and Equation 7–2.

Baud Rate Can Be Changed By Software Setting
When this setting is on, the hardware includes a 16-bit divisor register at address offset 4.  The divisor register is writable, so the baud rate can be changed by writing a new value to this register.
When this setting is off, the UART hardware does not include a divisor register.  The UART hardware implements a constant baud divisor, and the value cannot be changed after system generation.  In this case, writing to address offset 4 has no effect, and reading from address offset 4 produces an undefined result.


This text is from pages 7-3 to 7-4.  The equations are:

  • divisor = int((clock frequency) / (baud rate) + 0.5)
  • baud rate = (clock frequency) / (divisor + 1)
Looking at table 7-4, the divisor register (16 bits) is register 4.
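
As a quick sanity check on these equations (assuming, for illustration only, a 50 MHz Avalon clock; substitute the real system clock frequency):

```python
def uart_divisor(clk_hz, baud):
    """Equation 7-1: divisor = int(clock / baud + 0.5)."""
    return int(clk_hz / baud + 0.5)

def actual_baud(clk_hz, divisor):
    """Equation 7-2: baud rate = clock / (divisor + 1)."""
    return clk_hz / (divisor + 1)

# Example with an assumed 50 MHz clock:
div = uart_divisor(50000000, 115200)    # 434
eff = actual_baud(50000000, div)        # ~114,942 bps, ~0.2% off nominal
```

The small residual error from the integer divisor is normal and well within RS-232 tolerances.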

Let's look at the firmware, in the Nios II 9.1 IDE (legacy).  It looks like altera_avalon_uart_fd.h has support for ioctl() operations.  Looking at termios.h, it seems that there is support for setting the baud rate via fields in the termios structure.

Looking at the doc on page 7-10: we need the preprocessor option -DALTERA_AVALON_UART_USE_IOCTL, and then we use the ioctl() requests TIOCMGET and TIOCMSET.

Damn, the stupid legacy Nios II IDE keeps crashing whenever I try to modify the C/C++ build properties (wanted to do this to add the -D option).  Looks like I'll have to migrate this project to Eclipse.

Considering using the Micrium MicroC/OS-II embedded OS in this application.  (We probably have enough memory for it on this board.)  At least then, maybe the re-entrant versions of the STDIO routines would work properly!  And we could have proper threads (or tasks, or whatever they call 'em).

Note to self:  Pick up uC/OS-II manual from my ECE office on way home.

Ugh, getting errors from the hello-world build for uC/OS-II.  Thinking now this is maybe not worth the hassle.  Let's try again tomorrow, this time just creating a regular Eclipse project without uC/OS-II.

Tomorrow:  Need to make it a priority to try out the PADS license.

This evening I spent a little while trying to see if the central server would run on Mac OS X (Snow Leopard) on my Mac Mini at home, under Python 3.1.4.  No such luck.  Apparently, the OS X version of Tcl/Tk (TkAqua) has a restriction that the toolkit must run in the main thread, whereas presently we create a separate "guibot" thread to run GUI operations.  It will take some doing to fix this problem.  Not sure yet exactly what the best way to do it is, or if it is worth doing.  We can wait till later to fix it, or perhaps just run the server under a VM if we need to run it on a Mac.

I made a valiant effort this evening to fix the problem by enlisting the main thread into the role of the guibot thread (instead of creating a new thread to be the guibot), and improving the Worker methods so that a worker can give blocking tasks to itself without creating a deadlock situation.  However, I am still having problems; I couldn't make the main thread object callable even after adding a .__call__() attribute to it.  Not sure what's going on there.  Perhaps the main thread is not really a class object?  (More likely: Python looks up special methods like __call__ on the class, not the instance, so adding the attribute to the thread object itself has no effect.)
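
The pattern I'm aiming for (the main thread permanently enlisted as the task-runner while other threads submit work to it) can be sketched independently of Tk; the names here are mine, not the actual Worker/guibot code:

```python
import queue
import threading

def run_as_guibot(task_q):
    # The main thread blocks here and executes every submitted task itself,
    # so all GUI calls end up on the main thread (what TkAqua requires).
    while True:
        task = task_q.get()
        if task is None:          # sentinel: shut the loop down
            break
        task()

# Usage sketch: a worker thread submits callables, then the sentinel.
tasks = queue.Queue()
results = []

def worker():
    tasks.put(lambda: results.append("drawn on main thread"))
    tasks.put(None)

threading.Thread(target=worker).start()
run_as_guibot(tasks)              # main thread acts as the guibot
```

In the real server, the Tk mainloop (or a periodic root.after() callback) would play the role of this dispatch loop.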

I should make all my threads (except the main thread) daemon threads - these don't prevent the Python process from exiting.  This would allow the server to be killed by interrupting the main thread, I think, which can be done by sending it a keyboard interrupt.  I thought I tried this before though, and it didn't help for some reason?
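
A minimal sketch of the daemon-thread idea (note the attribute assignment; the daemon= constructor argument doesn't exist yet in Python 3.1):

```python
import threading
import time

def background_work():
    # A loop like this would otherwise keep the process alive forever.
    while True:
        time.sleep(1)

t = threading.Thread(target=background_work)
t.daemon = True   # daemon threads don't block interpreter exit
t.start()
# With every non-main thread marked daemon, a KeyboardInterrupt in the
# main thread is enough to bring the whole server process down.
```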

Monday, January 30, 2012

Mon., Jan. 30th

Probably the biggest priority for today is to track down the cause of this problem where the timing-sync datapath stops responding shortly after the system starts up.  (If we can't fix that, everything else is pointless!)  Some things to try:
  • Put the scope in "Normal" (as opposed to "Automatic") trigger mode; see if there's anything unusual about the last pulse.
  • Hook up the JTAG cable and turn on diagnostic output in the firmware to see if it gives us any clues - e.g., if the firmware were resetting or suspending the datapath, that would cause problems.
  • Look at the 200 MHz (5 ns period) clock output from the PLL to make sure it is still running.
  • Look at some raw bits of the clock counter (a block of bits from somewhere in the middle) to make sure it is actually still counting.
  • Look at more of the internal state bits of the front-end edge-capture module.
  • Actually try out the new firmware (in place of the present output stub) and see if that works any better.  (Although it doesn't seem possible that the output stub could be causing the problem, since we already know the front-end edge-capture state machine isn't getting stuck waiting for a handshake return.)
Waiting for Dropbox to finish syncing before I start server.  In the meantime, should I update the Wi-Fi modules?
    George came by and showed me a couple of sample copper heat pipes (5/8" dia.) with rubberized coatings.  I suggested an overnight freezer test re: condensation.  I also suggested they buy a new thermoelectric plate so they don't have to tear apart this one (so we can keep using it in the meantime, and so we have proper specs on the new one).  He said he has done thermal calculations and it should get the chip down to 1 degree C.  I told him to include the calculations in the next report & presentation.

    OK, doing some testing now.  There is nothing weird about the last pulse received.  And no output currently from the Nios terminal (debug output is still disabled).  However, I noticed that the thing always seems to die at the same time that we acknowledge the WIFI_READY message.

    One possibility:  Maybe the timing-sync datapath is getting reset or un-enabled somehow as a side-effect of what is happening in that part of the code.  The real code to control it hasn't been integrated yet.  If I constantly leave it not-reset and enabled, will that fix the problem?  Worth a try.  Doing Quartus compile now.  In the meantime, I can work on integrating Juan's changes to the firmware.

    Here are Juan's notes on his changes, from Dropbox\FEDM_design\FEDM_code\q91\eclipse_workspaces\jpc_workspace\1_25_12.txt:

    tsdp_driver.h
    - Changed the names of members of tsdp_status struct to "pulse_index" and "last_pulse_time"
    - Created global tsdp_status struct object "last_tsdp_status"
    - Commented out "timing_pulse_count" since we already have "pulse_index", which also keeps track of the number of timing pulses received.
    tsdp_driver.c
    - Got rid of "reentr.h" since it appears that we were only using re-entrant printf in tspd_diagnostics()
    - Got rid of globals pulses_seen, initialized, and last_pulse_time, since we are now using a global tsdp_status struct.
    tsdp_pull_pulse:
    - Removed threshold level index variable "i"
    - Getting rid of PulseForm data structure and now using tsdp_status data structure
    - Getting rid of number of levels crossed (1) and falling time
    - Storing the long_word into last_pulse_time of the global tspd_status struct
    tsdp_run and tsdp_pause:
    - Changed INFO message to "Timing-Sync" instead of "Input-Capture"
    - Changed base address to TSDP_CTRL_BASE and enable mask to TSDP_ENABLE_MASK
    tsdp_reset:
    - Changed the comments and INFO messages to reflect the new datapath
    tsdp_notify:
    - Got rid of PulseForm data structure since we need to use the timing-sync data structure
    - Created timing_pulse object to interface with the timing-sync data structure (tsdp_status)
    Still need to handle sync errors!!!  (We will do this later since we first need to have basic functionality.)
    icdp_driver.c
    - Filling in timing-sync data into the global struct using the PulseForm data structure

    icdp_driver.h
    - Added global struct to the PulseForm data structure.
    - Include header file to tsdp_driver.h

    server.c
    - Showing "last_pulse_time"  (1/27/12)

    Update on datapath changes:  Removing the reset/enable control seems to have fixed the "dying" problem.  Let me burn the new design to make sure.  Yep, seems OK.

    Now working on merging Juan's changes.  Found a conflict between the HAVE_DATA() macros in tsdp_driver.h and icdp_driver.h.  Renamed tsdp's to TS_HAVE_DATA() to fix.

    Aha, I have a suspicion as to why the ts datapath was hanging.  When WIFI_READY is received, we display a NORMAL-mode message to stdout.  Since the JTAG port was disconnected - or no, because the re-entrant output routines don't work properly - this would hang while interrupts were disabled, thereby preventing... Oh, I don't know.  Anyway, I'm going to try out the new firmware.

    Finding a few syntax errors while compiling, fixing...  OK, the code compiles now.  A couple of things to note: interrupt_timing.c is now in its proper place as interrupt.c (the old interrupt.c is preserved as interrupt.c.bak).  Also, the stdout_buf.c code, which I am still working on, is excluded from the present build.

    Now doing the Quartus compile. The Nios is now hooked up to drive the new datapath.  Removed the output stub in new top-level file _top15.

    While the compile is brewing, I'm working on my stdout_buf.c code.  One neat thing I can do with it (once it's complete) is to pass its output to the server via log messages, say as a new HOST_DIAG message type, an alternative to LOGMSG for log messages coming directly from the sensor host via the UART bridge.  This can be used for diagnostic output when the real STDOUT is muted.

    OK, doing a run now with all 4 detectors & the timing-sync datapath.  Some notes:

    • Trial #1 config:
      • Case #1 -> Cable #1 -> SMA#1 -> "PMT #1" pad -> ICDP channel #0 - 33
      • Case #2 -> Cable #2 -> SMA#2 -> "PMT #2" pad -> ICDP channel #1 - 10
      • Case #3 -> Cable #3 -> SMA#3 -> "PMT #4" pad -> ICDP channel #2 - 214 (FIFO full, lost pulses)
    That trial produced so much junk in the output (FIFO_FULL warnings, etc.) that it was difficult to see any actual pulse data.   Let's switch cables 2&3 to opposite SMAs, so that their pulse rates are more balanced and hopefully the coincidence detector won't be overwhelmed by the imbalance.
    • Trial #2 config: 
      • Case #1 -> Cable #1 -> SMA#1 -> "PMT #1" pad -> ICDP channel #0 - 173
      • Case #2 -> Cable #2 ->  SMA#3 -> "PMT #4" pad -> ICDP channel #2 - 59
      • Case #3 -> Cable #3 ->  SMA#2 -> "PMT #2" pad -> ICDP channel #1 - 202
    Trial #2 is behaving much better, no FIFO_FULL or lost-pulse events now.  The pulse rates are fairly balanced, although we could still maybe improve it a bit.  We are getting some 3-way coincidences.  The pulses are labeled with the time references; this seems to be working.  Here is a sample of some output:

    HOST_STARTING,FEDM,v0.10

    DAC_LEVELS,-0.200,-2.500,-0.299,-0.447,-0.669,-1.000

    HOST_READY
    NC_PULSES,1645056982,169,64,201
    NC_PULSES,3190429776,161,55,202

    ACK,WIFI_READY
    NC_PULSES,4688923073,173,59,202
    PULSE,73011,5980986267,2,1,5981023576,1,(0,2)
    PULSE,73011,5980986267,3,1,5981023577,1,(0,6)
    NC_PULSES,6332065602,190,63,201
    NC_PULSES,7771016797,141,55,200
    NC_PULSES,9305709023,156,47,202
    NC_PULSES,10760125101,145,67,201
    PULSE,138181,11319708242,3,2,11319785150,1,(0,22)
    PULSE,138181,11319708242,1,1,11319785156,1,(0,5)
    NC_PULSES,12149040460,145,56,200
    NC_PULSES,13780610760,184,70,204
    NC_PULSES,15433514672,173,76,203
    PULSE,204374,16742234860,3,3,16742285552,1,(0,3)
    PULSE,204374,16742234860,1,2,16742285552,1,(0,7)
    NC_PULSES,16826526099,132,59,202
    NC_PULSES,18455981655,153,68,201
    NC_PULSES,19816021711,159,57,201
    PULSE,251390,20593783022,1,3,20593836141,1,(0,3)
    PULSE,251390,20593783022,3,4,20593836145,1,(0,1)
    NC_PULSES,21493783386,195,69,200
    NC_PULSES,22915565251,135,53,200
    PULSE,292683,23976503452,3,5,23976529821,1,(0,7)
    PULSE,292683,23976503452,2,2,23976529821,1,(0,6)
    NC_PULSES,24454253751,165,66,200
    NC_PULSES,25999090947,167,68,202
    NC_PULSES,27472235284,152,52,200
    NC_PULSES,28912083859,167,55,201
    NC_PULSES,30590600583,182,72,200
    NC_PULSES,32097939517,162,62,206
    NC_PULSES,33786325217,203,59,205
    NC_PULSES,35446408708,162,68,200
    PULSE,440067,36050193749,3,6,36050253295,1,(0,12)
    PULSE,440067,36050193749,1,4,36050253297,1,(0,5)
    PULSE,440067,36050193749,2,3,36050253298,1,(0,4)
    PULSE,442189,36224027895,3,7,36224058045,1,(0,64)
    PULSE,442189,36224027895,2,4,36224058046,1,(0,20)
    PULSE,442189,36224027895,1,5,36224058046,1,(0,16)
    NC_PULSES,36914981364,162,53,200
    NC_PULSES,38386147406,153,57,203
    NC_PULSES,39795890861,137,67,206
    NC_PULSES,41231182045,161,61,203
    PULSE,503847,41275048491,2,5,41275065988,1,(0,6)
    PULSE,503847,41275048491,3,8,41275065990,1,(0,6)
    NC_PULSES,42748100077,139,54,203
    NC_PULSES,44347019332,164,70,202
    NC_PULSES,46060983716,190,74,211
    PULSE,563018,46122334272,3,9,46122381769,1,(0,5)
    PULSE,563018,46122334272,2,6,46122381769,1,(0,3)
    NC_PULSES,47734791310,181,64,201
    NC_PULSES,49299944169,161,63,200
    NC_PULSES,50767054455,152,53,200
    NC_PULSES,52367216984,180,65,203
    NC_PULSES,53873868242,154,57,201
    NC_PULSES,55627477276,194,72,204
    NC_PULSES,57098080579,143,70,205
    PULSE,709732,58141139048,1,6,58141181168,1,(0,10)
    PULSE,709732,58141139048,3,10,58141181170,1,(0,4)
    NC_PULSES,58550858533,139,51,200
    NC_PULSES,60207382783,151,55,209
    NC_PULSES,61761822335,163,70,203
    NC_PULSES,63099029287,140,65,202
    NC_PULSES,64660494559,152,65,200
    NC_PULSES,66217918160,174,58,206
    NC_PULSES,67680886852,152,54,201
    PULSE,837403,68599942245,3,11,68600016324,1,(0,12)
    PULSE,837403,68599942245,2,7,68600016324,1,(0,8)
    PULSE,838030,68651306060,1,7,68651370992,1,(0,4)
    PULSE,838030,68651306060,3,12,68651370994,1,(0,5)
    NC_PULSES,69352417124,148,73,213
    NC_PULSES,70960480577,167,78,205
    PULSE,868239,71126026126,1,8,71126041421,1,(0,12)
    PULSE,868239,71126026126,3,13,71126041425,1,(0,2)
    NC_PULSES,72536209134,144,60,210
    PULSE,892863,73143223220,3,14,73143298091,1,(0,9)
    PULSE,892863,73143223220,2,8,73143298092,1,(0,8)
    NC_PULSES,73768295892,121,39,201
    NC_PULSES,75230961517,152,62,208
    NC_PULSES,76740108234,154,62,211
    NC_PULSES,78451831524,172,61,202
    NC_PULSES,79861097624,146,62,203
    NC_PULSES,81437231122,177,55,200
    NC_PULSES,83114305623,172,59,201
    NC_PULSES,84581433149,141,63,201
    NC_PULSES,86094560412,155,79,209
    NC_PULSES,87682077780,151,63,203
    PULSE,1073206,87916914548,2,9,87916946298,1,(0,8)
    PULSE,1073206,87916914548,1,9,87916946301,1,(0,4)
    NC_PULSES,89258876064,165,57,202
    STOP
    STOP

    ACK,STOP

    We're now in a position to check the rate of timing-sync pulses, to see if we're getting them all - if not, then the accuracy of our "absolute time" calculations could become badly screwed up, drifting away from where they should be.  Let's start with the first two time-references seen in the above data:
    • Time Reference #1:  Timing-sync edge #73,011 on PLL clock cycle #5,980,986,267.
    • Time Reference #2:  Timing-sync edge #138,181 on PLL clock cycle #11,319,708,242.
    Let's compute the time deltas during this interval:
    • We had 138,181 - 73,011 = 65,170 timing-sync cycles (nominally 409.6 us each) = 26.693 632 secs.
    • We had 11,319,708,242 - 5,980,986,267 = 5,338,721,975 PLL cycles (nominally 5 ns each) = 26.693 609 875 secs.
    The difference is -0.000 022 125 secs = -22.125 us (PLL behind OCXO).  Since this discrepancy is less than one timing-sync cycle, we probably did not miss any edges.  In relative terms, the PLL is slow by -0.828 849 ppm.  

    That is within the frequency calibration tolerances involved, which are 2 ppm and 1 ppm for the CTU's OCXO and the FEDM's TCXO respectively.

    Now let's see how stable the relative frequencies are.  We'll do this by looking at another interval, based on the next time reference in the above dataset:
    • Time Reference #3:  Timing-sync edge #204,374 on PLL clock cycle #16,742,285,552.
    So during the interval from time reference #2 to #3, we had:
    • Timing-sync cycles:   204,374 - 138,181 = 66,193 @ 409.6 us = 27.112 652 8 secs.
    • PLL cycles:   16,742,285,552 -  11,319,708,242 = 5,422,577,310 @ 5 ns = 27.112 886 55 secs.
    This time, the TCXO is ahead of the OCXO by +0.000 233 75 secs = 233.75 us.  It's conceivable here that we might have missed a single sync pulse in the 2nd interval, which would have set the OCXO-based sync count behind by 409.6 us.

    The relative frequency deviation is +8.621 436 ppm (TCXO ahead of OCXO).  This is outside the bounds of the specifications, which is another reason to think that we might have missed a sync pulse.
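
    Redoing that arithmetic for both intervals in one place (a direct transcription of the numbers above):

```python
SYNC_PERIOD = 409.6e-6   # nominal timing-sync period, seconds
PLL_PERIOD  = 5e-9       # nominal 200 MHz PLL clock period, seconds

def drift_ppm(sync_edges, pll_cycles):
    """Relative deviation of the PLL (TCXO) time vs. the sync (OCXO) time."""
    t_sync = sync_edges * SYNC_PERIOD
    t_pll  = pll_cycles * PLL_PERIOD
    return (t_pll - t_sync) / t_sync * 1e6

# Interval 1: time refs #1 -> #2 (about -0.83 ppm, PLL behind OCXO)
ppm1 = drift_ppm(138181 - 73011, 11319708242 - 5980986267)
# Interval 2: time refs #2 -> #3 (about +8.62 ppm, TCXO ahead of OCXO)
ppm2 = drift_ppm(204374 - 138181, 16742285552 - 11319708242)
```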

    It would probably be a good idea for the ISR in the timing-sync edge-capture driver to do a sanity check on the elapsed time since the previous timing-sync pulse.  Ideally there should be 409.6 us / 5 ns = 81,920 cycles of the 200 MHz PLL clock for every timing-sync pulse.  With calibration errors in the ppms, the number of cycles actually seen should not deviate from this number by more than +/- 1 cycle at most.  Thus, we should do some massive error-checking on the difference seen here:
    • If it differs from the expected value by 1 cycle, report this at INFO level.
    • If it differs by >1 cycle, report this at WARNING level.
    • If it differs by >10 cycles, report this at ERROR level.
    • If it's fairly close to 2 x 81,920 = 163,840 (within +/- 10 of this, say), treat it as a dropped pulse, and report it at WARNING level.
    • If it's much larger than this, so that more than 1 pulse in a row may have been dropped, treat it as a CRITICAL error, since basically our absolute time measurements will be totally screwed at that point.  (Well, we could try to patch over longer outages, but the chances of getting permanently misaligned increase the longer the outage is...)
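
    Those thresholds, sketched as the check the ISR could perform (the function name and return tags are mine, not real firmware code):

```python
NOMINAL = 81920   # expected 200 MHz cycles per sync pulse (409.6 us / 5 ns)

def check_sync_interval(cycles):
    """Classify the PLL cycle count between consecutive timing-sync pulses."""
    if cycles > 2 * NOMINAL + 10:
        return "CRITICAL"         # possibly >1 dropped pulse in a row
    if abs(cycles - 2 * NOMINAL) <= 10:
        return "WARNING_DROPPED"  # looks like exactly one dropped pulse
    diff = abs(cycles - NOMINAL)
    if diff == 0:
        return "OK"
    if diff == 1:
        return "INFO"             # within calibration tolerance
    if diff <= 10:
        return "WARNING"
    return "ERROR"
```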

    Left the current run going, to collect some shower data overnight...

    [From home, later this evening...]

    Some more thoughts on the startup sequence...  Really the time-sync counts in my run this evening are not meaningful, because the CTU had already been running for a while before the FEDM started up, so there is no way to align the time-sync counts with the absolute time stamps registered by the CTU.  Also, the GPS module wasn't acquiring satellites, as usual.

    It would be a good idea for us to build some more smarts into the system, so that these issues can be resolved automatically (or mostly so) on startup.  Right now, it is super-tricky even to get the GPS acquiring satellites (until we figure out how to do it without manual fiddling), and so the chances that the CTU will work properly when first powered up (after the FEDM is already running) are close to zilch.

    Here would be what a sensible startup sequence might look like:

    • CTU and FEDM subsystems are powered on, roughly simultaneously (within a few seconds of each other, at least).  This will be easier to do once Samad designs/builds us a new, higher-quality power supply solution that can power both boards through our ATX supply.  For now, we can just manually power up both boards at about the same time.
    • The FEDM board starts up with its high-speed time counter and all datapaths reset, in a suspended state, waiting...  Likewise, the CTU starts up with its PPS edge-capture datapath and its OCXO-based counter reset, in a suspended state, waiting...
    • The Wi-Fi scripts start up, at more or less the same time, opening up the terminal windows on the server, as presently, and going into Trefoil mode.  Just as they are about to enter their main loops, they send the "WIFI_READY" message to their hosts.  (This is already implemented.)
    • When the FEDM gets the WIFI_READY message, it sends the FEDM_READY message to the server.  (I think we forgot to initialize the TSDP_status; remember to do this...)
    • Likewise, when the CTU gets the WIFI_READY message, it sends a CTU_READY message to the server.
    • Once the server has received the CTU_READY, it's time for it to start setting up the CTU properly.  It sends a GPS_UNMUTE message to tell the CTU to start passing the NMEA datastream through to the server.  Then it needs to go through a series of steps where it asks the CTU to relay commands to the GPS for it, with a command sequence something like the below (or, maybe it would be easier to just do this stuff in the CTU firmware?  not sure...).  Actually, I'm not really even sure what a reliable GPS startup sequence would be; I'll probably have to fiddle around a bit till we get it right.  In the worst case, during this process, we might have to revert to the default configuration, which means changing baud rates.  We could do that, but it means we need support for this in the SOPC system & the firmware.
      • Mute NMEA output temporarily.
      • Turn off POSHOLD and/or TRAIM modes (may make acquisition easier).
      • Hot-restart the GPS.  Tell it our estimate of the current time (from NTP) & location.
      • Watch the number of satellites; wait to get enough for a fix (at least one, yo?).
      • Go back into POSHOLD and/or TRAIM modes.
    • When the server has received the FEDM_READY message from the FEDM, and the GPS has acquired satellites and TRAIM is reporting an accuracy value, finally at that point we are able to start capturing absolute-time-tagged pulse data; to start that we do the following:
      • Server sends a message to the FEDM telling it to start running its time-sync capture datapath (TS_GO).  At this point, the pulse-capture datapath should still be disabled, since we haven't actually received any time pulses yet.
      • Server sends a message to the CTU telling it to start its time counter; after the first PPS edge is received, the counter gets reset, and the CTU starts sending time-sync pulses to the FEDM every 409.6 us.  The FEDM is ready, and starts registering them.  Perhaps it would be a good idea, once every second (2,441.40625 time-sync rising-edges) or so, to have the FEDM report the current TSDP_status (# of sync pulses, index of pulse) to the server; this would facilitate calibration and measurement of frequency stabilities.  For this, I need my output buffer working...
      • Anyway, once the FEDM is receiving time-sync pulses, it reports this to the server.  The server then says, "OK, you may now begin collecting data."  ('GO' command).  Or maybe the FEDM just does this by itself automatically after starting to receive time-sync pulses.
      • At this point, both CTU & FEDM are sending their normal data streams to the server.  From there, it's just server-side coding...
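
    The ready/go handshaking above could be sketched server-side something like this (CTU_READY, FEDM_READY, GPS_UNMUTE, and TS_GO are message names from these notes; "GPS_FIX", "START_COUNTER", and the class itself are hypothetical placeholders, not real code):

```python
class StartupSequencer:
    """Tracks readiness of both boards and issues go-commands in order."""

    def __init__(self):
        self.fedm_ready = False
        self.gps_fixed = False
        self.started = False

    def on_message(self, source, msg):
        """Returns a list of (destination, command) pairs to send."""
        out = []
        if source == "CTU" and msg == "CTU_READY":
            out.append(("CTU", "GPS_UNMUTE"))        # begin GPS setup dialogue
        elif source == "FEDM" and msg == "FEDM_READY":
            self.fedm_ready = True
        elif source == "CTU" and msg == "GPS_FIX":   # hypothetical "fix acquired"
            self.gps_fixed = True
        if self.fedm_ready and self.gps_fixed and not self.started:
            self.started = True
            out.append(("FEDM", "TS_GO"))            # start time-sync capture
            out.append(("CTU", "START_COUNTER"))     # hypothetical counter-start
        return out
```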

    Sunday, January 29, 2012

    Sun., Jan. 29th

    I'm getting ready to test the server application at home, just to make sure that I didn't break anything yet.

    Got an error on startup when displaying the very first log message on the GUI console window.  This could be a Python version compatibility issue - the server was developed for 3.1.1 and on my home office PC I have 3.2.2 (latest stable release).  However, I'm not sure why this is causing problems, because I ran the server successfully at home previously?  Was I using a different version back then?  Anyway, I should debug and fix this problem.  It would be nice to be able to run the server in the latest version of Python.

    Looking at the log file, it seems that the logger didn't get configured properly or something - the file name, function name, and line number aren't getting set automatically in the log messages like they are supposed to.  Perhaps they changed the logging package in some incompatible way?

    I'm going to try to install the latest release of 3.1 (which is 3.1.4) at home and see if the server runs better with that.

    OK, that seemed to fix that error (although I'm not sure why this happened, because there was nothing in the "what's new in 3.2" doc that obviously would have broken this).  Now I'm having a problem opening the mainserver socket.  It looks like sitedefs.py is assigning the wrong IP address - I'm now assigning Theo a static IP from my router since I'm bridging the wireless from my laptop.  OK, fixed that.

    The server starts up now, but I can't fully test it without a WiFi module.  I actually have a WiFi module here.  However, I will have to recompile the autorun script to adjust for the new server IP address.

    Really, we need to rearchitect the discovery protocol for initiating communication between the WiFi board and the server so that they can automatically find each other as long as they are on the same subnet as each other, without already knowing each others' IP addresses.  This could be accomplished using broadcast packets, perhaps.  Right now there is no discovery protocol per se; the WiFi boards have the server's IP address hard-coded into them.  This is not very elegant.

    At minimum, the server IP address should be in a config file (called "config.txt" or something) that can get downloaded onto the WiFi board without having to recompile the autorun script.  This file could also include the Node ID information, so that we would no longer need the existing nodeid.txt file.
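
    Something like this, perhaps (the key names and values are invented placeholders, not a format we've settled on):

```text
# config.txt -- downloaded onto the WiFi board alongside the autorun script
server_ip = 192.168.1.100    # example value only
node_id   = 1                # would replace the existing nodeid.txt
```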

    In my notes in the sites.uwi module of the autorun script, I talked about having "site.txt" and "debug.txt" config files to contain site-specific settings and debug levels.  However, these two functions could be easily absorbed into a single config.txt file.  But, a module to read and interpret that file would still need to be written.  And when I was last working on the autorun script I was bumping up against memory limitations.  That could happen again if I try to add new features.  Best for now, probably, if I just modify the hard-coded IP address in sites.uwi and recompile.  Doing that now.

    OK, the Wi-Fi board started up and opened its 3 connection windows on Theo with no errors.

    It occurs to me that the easiest kind of "input stub" might be to simply stream a sample text stream that might be produced by the DE3 or FEDM board directly to the Wi-Fi board using UwTerminal's "Stream File Out" option.  If the students have a Wi-Fi board to work with, this is a heck of a lot easier than creating a whole new app to simulate it.

    While I had the WiFi autorun script open, I made a couple of minor changes to it to reduce the startup time (mostly, turning off info logging to the network).  These changes won't take effect, of course, until the next time we re-burn the modules.  I've done that now.

    I added a message BRIDGE_MODE to the Wi-Fi board to support informing the server of the Wi-Fi board's current bridging mode, since this influences whether and how best the server can communicate with the Wi-Fi module and to the host behind it.

    Fixed a minor bug in the Wi-Fi script having to do with call stack maintenance.

    Changed the server to now put the node name in the window title after the first command is received.

    Fixed a bug that was causing the server to become crippled if commands were typed in the console window.

    Saturday, January 28, 2012

    Sat., Jan. 28th

    Woke up in middle of night, decided to write down some more notes.

    We wonder if maybe the FEDM is overheating more readily now for some reason, and started looking into the possibility of making temperature measurements.  After a little poking around, Darryl & I realized that both the DE3 board and the FEDM board actually have on-board temperature sensors.  However, these are not directly measuring the temperatures of the FPGAs, but only the ambient temperatures at the surface of the PCBs.  Even so, it might be useful to add gelware & firmware to actually take those measurements periodically and report them to the server.  That way we could detect, for example, overheating within the overall electronics box, once installed.  The DE3 kit, I'm pretty sure, came with sample FPGA source code for the Control Panel app which reads temperatures there; by starting from that code we could probably develop this capability pretty quickly.  And on the FEDM, we can figure out the interface to the temperature sensor chip by looking at its datasheet, if we decide that's worth the trouble.  The chip (looking at the BOM) is the MAX1668MEE+.  For easy reference I downloaded the datasheet from Digi-Key to the FEDM_Design folder in Dropbox.

    Another factor is the ADCs, which still may not be turned off, and which may therefore be contributing to the overall FEDM temperature; we should maybe still write code to interface with those sometime soon.  The ADC datasheet is already in our Dropbox folder.

    I considered that power supply voltage sag could also be contributing, but I measured the voltage on the WiFi board powered from the FEDM and it only seemed slightly low.  Also, boosting its voltage up to 5V directly from the Agilent power supply didn't affect the behavior.

    Earlier in the day, I tried applying this same boost to the CTU and this seemed to help it avoid the resets.

    When speaking with Samad earlier in the day, I mentioned that there is somewhat of a tradeoff at work in the power supply design.  If he does a power distribution board that just routes power from the DE3 supply to the DE3 board, there is a risk that voltages could sag a bit just by going through multiple connectors, although this concern should be reduced somewhat by taking advantage of the multiple +5V wires from the DE3 supply (thus maximizing conductance from that supply to the power distribution board) as he & I discussed.

    An alternative would be to use a higher-voltage, single-voltage wall-plug supply with adequate power, and feed this to a board with our own voltage regulators; the voltages from these can then feed out directly to the other boards in the system, thereby reducing the number of power connectors in series that those regulated levels will need to pass through, and thus reducing the IR drops.  But then, we have to worry about cooling the voltage regulators.

    One thing that I think we definitely want to do is look at powering the CTU WiFi and GPS directly from the power distribution board, instead of routing their +5V supply through the DE3 - since we still don't have documentation specifying the current supply capacity of the paths from the DE3's +5V input to the corresponding pins of its GPIO headers.

    Anyway, back to the current problem where the new timing-sync datapath works fine for a few seconds and then dies.  Before it was dying so quickly, we had a temporary issue where the input pulse had collapsed.  However, I think this was caused by a probe connection that was accidentally touching another node; after moving the probe cable this went away.  But the problem with the datapath dying got worse and worse as the evening progressed.  The first time it was working it stayed on for a while.  On each subsequent test it seemed to die sooner and sooner.  This is one of the reasons I thought the problem might be heating-related - maybe something on the board was getting gradually warmer and warmer?  Anyway, try again Monday after it "rests" over the weekend.

    A couple of thoughts on our diagnostic strategy looking forward:

    • It may be that Darryl's test module (now significantly rewritten by myself) is getting stuck in a specific state, and identifying which state it is in may give us a clue about why.  But on second thought, I don't think this is the problem, because after the "death" occurs the front-end module is in state 0, which means it received the return handshake.  So it's not that the datapath is getting jammed - it's that it's no longer responding to pulses.  It could be the very front-end edge-capture module that is dying.  If this is heating-related that could well be the case, since that is a high-speed (200 MHz) module and could experience timing-related problems if there is a heating issue.
    • If the test module were causing the problems, then replacing it with the actual firmware (now complete and ready to test) might have a hope of fixing the problem.  However, based on the above observations I don't think this is the case.  Something else is going wrong early in the datapath.  So probably the thing to do is go back again and apply diagnostic probes to the very front-end module.  The problem is likely happening somewhere in there, and if I can figure out where, it can possibly be tweaked to make it less timing-constrained (if that is what's happening).
    Anyway, that's enough notes for now - I'm going back to sleep.

    Another idea:  Switch the scope from "Auto" to "Normal" mode so I can see what the last timing-edge data stream looks like.  This might give me a clue about what exactly is starting to fail.

    To get ready for the students to help work on the Python server, I cleaned up the "Server Code" folder on Dropbox (organizing various things into appropriate subfolders), and wrote a README.txt file explaining the new file hierarchy.

    In Server Code/docs/, I started writing a "Programmers' Reference Guide" with the intent of documenting in detail the present code to aid the students (or other future developers) in modifying it.  However, after several hours of work, I only finished documenting one class in one module (namely, model.SensorNet).  So, finishing this guide may not be practical.  However, what's there may still be helpful as an example of good documentation, and to help the students get started working with that module.

    Actually, my original intent today was also to begin the code changes to support the new object model.  I did write some comments towards that end at the top of model.py, and added new symbols to __all__, but haven't gotten any farther yet.

    OK, late at night now - I've been adding more comments to model.py and cleaning up & rearranging things a little, but still haven't made substantive changes.  Before I do I may need to go into the lab and make sure the server is still running.  Then after I make the first set of changes (move some functionality from SensorNode into the WiFi_Module class), I can regression-test to make sure I didn't break anything.

    Friday, January 27, 2012

    Fri., Jan. 27th

    Went over latest firmware changes from Juan.  Made a couple of corrections.  The support for the new timing-sync datapath now appears to be complete and is ready to merge into the main trunk & test.

    Samad came in and I made some suggestions to him about designing a new power distribution board.

    I think I realized why stream_time_pulse_out_test in the new datapath wasn't working - it needed to wait an extra cycle after raising pump_data to give stream_pulse_out time to react.  That's why the first byte was always zero, because the data word hadn't been provided yet.

    Now it is working - the first byte is there and counts up from 0 to 255 appropriately over time, but now, after running fine for a short time, the whole datapath suddenly stops running.  Go home and sleep on it.

    Thursday, January 26, 2012

    Thu., Jan. 26th

    Ray asked me to focus on installing PADS on Friday.  Meanwhile, hopefully Darryl can help continue debugging the timing-sync edge-capture datapath, and Juan can finish up the new firmware code needed to support it.  Hopefully we will be ready to test everything either late Friday or Monday.

    George will also come by Friday and show me his copper heat pipe.

    On or before next Tue., Jan. 31st, we are supposed to attain a major milestone, which is to demonstrate transmission of absolute time-referenced shower data (for coincidences involving all 3 paddles) to the server, using the present (200 MHz) version of the datapath.

    After this, the next milestone deadline for the ECE students is Wed., Feb. 29th, by which we want to demonstrate the above again, but this time using the optimized (500 MHz) version of the datapath (accomplished with help from LogicLock and possibly other hand-optimization).

    To help in the planning for this activity, here are some suggested intermediate milestones, and deadlines for them.  (If we do not meet these intermediate deadlines I think we would be hard pressed to achieve the overall milestone by its due date.)

    • On or before Wed., Feb. 8th:  Demonstrate the 56-bit high-speed counter running at 500 MHz, and with all its components logic-locked into placement locations.  Optional: By then, demonstrate 600 MHz speed or higher, possibly using pseudo-dual-edge-triggered registers.
    • On or before Wed., Feb. 15th:  Demonstrate the 56-bit high-speed counter together with the front-end module of the timing-sync edge-capture datapath, both running correctly at 500 MHz with all components logic-locked into placement locations.  Optional:  Demonstrate even higher speed, maybe by using PDE registers.
    • On or before Wed., Feb. 22nd:  Demonstrate the 56-bit high-speed counter together with the front-end modules of all four datapaths (1 timing-sync edge-capture + 3 pulseform-capture channels), all running correctly at >=500 MHz with all components logic-locked into placement locations.
    • On or before Wed., Feb. 29th:  Add all the slower-speed components back in, show everything still fits, demonstrate full system operation at the new higher speed.  (Cooling solution may be necessary by this point.)
    Considering the power supply issue.  We might be able to use our existing supply if we utilize more of the pins on it.  Did a little research on this, then emailed the following results to Samad:

    Samad, here's a webpage explaining power supply connectors:


    The big connector upstream of the one we've been using is the ATX 20+4 pin connector.  I compared wire colors and this is correct.  If you look at the table of pinouts above, you will see there are several additional +5V outputs besides the one we are currently using (pin 22).  So, it looks like if you build a power supply board that interfaced to this connector, you would have enough current capacity to power everything (although you should compare the power specs on this webpage against our needs to make sure).  If I were you, I would shop on Digi-Key for the appropriate header to mate with this connector and mount on a printed circuit board.  Then we can design a new power distribution board in PADS and solve our power problems.

    Tested timing sync datapath with last night's mods to Darryl's output stub.  Still screwy.  I sure don't see anything that could be wrong in the earlier modules.  Playing around with Darryl's code some more.

    Wednesday, January 25, 2012

    Wed., Jan. 25th

    Samad gave me the new 5V supply he purchased this morning.  It is BARELY big enough (2.6A), but we can try it.  Told him next time to oversize the supply by 50% or so to make sure there is enough headroom.

    Today:  I'm planning to finish debugging the new timing-sync edge-capture datapath using Darryl's output stub.  Meanwhile, hopefully Juan will also come in and tie up the loose ends in the firmware.

    I took the commented code out of David & Dean's new stream_pulse_data_tsedge_56.vhd module and inspected it for errors.  It looked OK apart from a possible occasional wrong behavior on reset (which I fixed).

    I went through Darryl's code and simplified it significantly before I try debugging it; getting ready to test again.

    Juan came in and I gave him more detailed instructions on what needs to be done to finish up the firmware; he is working on that now.

    Tested Samad's supply brick with the multimeter.  It outputs nearly 9V at 0A!  I tried putting a 100 ohm load on it, and the voltage fell to about 7.5V.  It's possible that at 2.5A it would put out close to 5V, but I didn't have a 3 ohm resistor with a high enough power rating handy.  I really don't want to risk frying the real FEDM board by applying too much voltage to it; the output of this supply is apparently not regulated.  We are therefore going to need a voltage regulator board to do this properly if we're going to use this supply.

    I have a theory about why the Wi-Fi board in the CTU often has problems.  The OCXO draws more current when it is warming up, so maybe during this time, the voltage on the 5V net is sagging.  On second thought, that's not it because the OCXO only uses 3.3V.  Still, the 5V net could be sagging anyway, maybe while the GPS is trying to acquire satellites.

    OK, I tested the voltage on that net using a barrel-plug connector on the Wi-Fi board.  At the moment it is sagging quite a bit, down at 4.5V.  Let's try a cold boot and see what happens.  It starts at 4.75V and then dips quite a bit as the Wi-Fi module is starting up, possibly due to power consumption from the antenna, or maybe to the GPS module's startup.

    Clearly then, the DE3's existing power supply is inadequate to provide the specified 5V to all 3 of the CTU components that use it (DE3, Wi-Fi, and GPS).  This badly needs to be addressed.

    Tuesday, January 24, 2012

    Tue., Jan. 24th

    Today, Darryl, David, & Michael Dean are here.

    I asked Darryl to do a new output stub to stream all 7 bytes of the timing-sync edge individually to the scope.  That will give us an independent way to verify the results from the new, simplified datapath.  (I couldn't really look at the results with the existing stub because it always just produced 1,0.)

    Meanwhile, David & Michael Dean are working together on the additional simplifications to the timing-sync datapath (removing the code dealing with the "number of levels crossed" stuff).

    Juan will hopefully be here tomorrow to work on the C code (firmware changes).

    Meanwhile, I am contemplating writing a new module in the C code to route all output to STDOUT through the main loop so that we no longer have to worry about the re-entrant versions of the newlib routines (which don't seem to work anyway without uC/OS-II).

    The ELF linker says that we have 62K available working memory.  If we set aside 20K for the stack+heap, that leaves 42K.  If we reserve half of that for the pulse buffer and half for the STDOUT buffer, that is 21K each, or 21,504 bytes.  In terms of lines, suppose we limit the line length to 128 characters.  Then that is 168 lines of text, or in other words a couple of screenfuls.  Hopefully that will be sufficient.  Working on the new module now.  Created new files stdout_buf.h, stdout_buf.c, and memory.h.
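    The arithmetic above can be spelled out as compile-time constants; all figures come from the linker output and the notes, nothing here is newly measured:

```c
/* Sanity check of the STDOUT-buffer sizing arithmetic (figures from notes). */
enum {
    AVAIL_KB      = 62,                           /* working memory per ELF linker */
    STACK_HEAP_KB = 20,                           /* reserved for stack + heap     */
    REMAINING_KB  = AVAIL_KB - STACK_HEAP_KB,     /* 42 KB left over               */
    PER_BUF_BYTES = (REMAINING_KB / 2) * 1024,    /* 21 KB = 21,504 bytes each     */
    MAX_LINE_LEN  = 128,                          /* assumed line-length cap       */
    BUF_LINES     = PER_BUF_BYTES / MAX_LINE_LEN  /* 168 lines of text             */
};
```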

    Meanwhile, the students finished their changes to the datapath and I integrated them into Q:\.  Then we tested with the scope, but the datapath got stalled - some change in one of the later modules caused them to never return the handshake back up the pipe.  We found a couple of bugs in Darryl's state machine and fixed them.  Now it no longer hangs, but the output sequence doesn't make sense in some ways.  Bytes seem out of order.  There is more work to do.

    Monday, January 23, 2012

    Mon., Jan. 23rd

    David is here and I am going over the needed datapath changes with him.  I also uploaded the latest gelware files to the dropbox (FEDM_design\FEDM_code\q91).  Students, please note that the top-level filename and the revision name changed.

    Also, copied several files in the firmware folder (software_v4) from Q:\ to dropbox, files for which the latest version was on Q:\:
    • main.c
    • dac_driver.c
    • pulsebuf.c
    • server.c
    Just want to make sure the students are working with the latest version of all the files.  I also copied the new version of several files that I worked on at home over the weekend from Dropbox back to Q:\:
    • interrupt_timing.c  (modified)
    • tsdp_driver.c  (renamed & modified)
    • tsdp_driver.h  (renamed & modified)
    • timeval.h  (new file)
    Now trying to find Juan & Michael's top-level file with the top-level wiring changes to support the new PIO (not sure they did one, but checking)...   It's not COSMICi_FEDM_top14, because Dropbox says there's only one version of that (the one I just copied in).  Aha, but with COSMICi_FEDM_top13, the previous version is dated 12/6, which is when Juan and Mike Dean were working on the firmware.  I'll restore that one.

    Nope, the FEDM_NiosSys icon still doesn't show the new PIOs.  Let's go back to the newer version.

    Let's now just try opening the SOPC System in their folder.  Hm, still don't see the new PIOs.  Maybe I overwrote their FEDM_NiosSys.sopc?  Nope, there's only one version of it in that folder.

    Perhaps they didn't actually create the new PIOs yet?  Texted Juan & Darryl to ask them.  If they didn't do it, then I guess I will just go ahead and do it myself, to save time.  Should only take a minute.

    Darryl says he doesn't think they were done yet.  Guess I'll do them myself.

    Looking back at the pulseform-capture datapath for reference, in SOPC Builder, it looks like icdp_ctrl was 16 bits, with both input and output ports, and individual bit setting/clearing, and has output initialized to 0x5, which means:

    • ICDP_RESET   = 1  (datapath initially held in reset state)
    • RUN_PAUSEn = 0 (datapath operation initially not enabled)
    • NEG_INPUT    = 1 (datapath will take negative input pulses)
    • ICDP_SEL[1:0] = 0 (input channel #0 (SMA#1) selected initially)
    • PUMP_DATA   = 0 (not reading out data from datapath)
    Also, wrt its input, it synchronously captures rising edges, and allows bit-clearing for the edge-capture register, and generates an edge-sensitive IRQ.
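    Spelled out as C masks, the 0x5 initialization value checks out; note the bit positions here are my assumption, simply following the order of the list above:

```c
/* icdp_ctrl output bits -- positions assumed from the list order above. */
#define ICDP_RESET     (1u << 0)   /* 1: datapath initially held in reset   */
#define ICDP_RUN       (1u << 1)   /* RUN_PAUSEn; 0: operation not enabled  */
#define ICDP_NEG_IN    (1u << 2)   /* NEG_INPUT; 1: negative input pulses   */
#define ICDP_SEL_SHIFT 3           /* ICDP_SEL[1:0] at bits 4:3; 0 = SMA#1  */
#define ICDP_PUMP      (1u << 5)   /* PUMP_DATA; 0: not reading out data    */

/* Initial output value: reset asserted + negative input selected. */
#define ICDP_INIT_VAL  (ICDP_RESET | ICDP_NEG_IN)   /* == 0x5 */
```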

    Let's look back at the firmware to see how the control bits for the new datapath were defined.  In tsdp_driver.h, they are:

    • TSDP_RESET    = bit 0 (output)
    • TSDP_ENABLE   = bit 1 (output)
    • TSDP_PUMPDATA = bit 2 (output)
    • TSDP_HAVEDATA = bit 3 (input)
    • TSDP_SYNCERR  = bit 4 (input)
    The name of the control PIO is 'tsdp_ctrl'.  Clearly, an 8-bit PIO size will be sufficient for the time being.  The reset value should be 0x1 (TSDP_RESET=1, all others 0).
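    For reference, the bit masks implied by the list above, with the 0x1 reset value made explicit (the real definitions live in tsdp_driver.h; this just spells out the arithmetic):

```c
/* tsdp_ctrl PIO bit masks, per the list above (sketch of tsdp_driver.h). */
#define TSDP_RESET     (1u << 0)   /* output: hold datapath in reset       */
#define TSDP_ENABLE    (1u << 1)   /* output: enable datapath operation    */
#define TSDP_PUMPDATA  (1u << 2)   /* output: request next data word       */
#define TSDP_HAVEDATA  (1u << 3)   /* input:  datapath has a word ready    */
#define TSDP_SYNCERR   (1u << 4)   /* input:  a timing-sync pulse was lost */

/* PIO reset value: TSDP_RESET = 1, all other outputs 0. */
#define TSDP_CTRL_INIT 0x1
```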

    As for the data PIO:  icdp_data is input-only, 32 bits.  tsdp_data can be the same.

    OK, both of those are created.  Auto-assigned IRQs and base addresses.  Generating the SOPC system files now.  System generation successful.  Let's now try a Quartus compile, to make sure the design fits.  First I went ahead & wired up all the new I/Os (except for the pump-data input to the new time-sync datapath, which is still being driven by the test stub) to make sure that important stuff doesn't get compiled away.  OK, that's done.  Doing Quartus compile now.

    Juan is having car problems and might not be in today.  Taking a break down in the break room to finish my lunch while the compile is brewing.

    The Quartus compile succeeded; great.  Now we just need to finish up the gelware/firmware revisions.  Darryl, Michael Dean, & Aarmondas are all going to be here tomorrow, I believe.  David said he'd pass on the information I gave him earlier to Dean.

    I'm puttering around a little in the firmware, cleaning up main.c a little.  And setting it up to initialize & start running the new datapath.

    Tweaking server.c a little... server_tell_starting() now sends the message "HOST_STARTING,FEDM,v0.9". This is a generic message that could be sent by all nodes in the sensor network right after their Wi-Fi output is opened.  The second field identifies the node type (currently either "FEDM" or "CTU").  The third field identifies the firmware version ID.  Similarly, server_tell_ready() now sends "HOST_READY".  These changes are towards more uniformity with the CTU firmware (which can have similar behavior) to make it easier to write server code to properly track the identity and status of the remote hosts.
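    The message format described could be assembled along these lines (a sketch only; the real server_tell_starting() is in server.c, and the NODE_TYPE/FW_VERSION macros here are hypothetical stand-ins for wherever the node type and version string actually come from):

```c
#include <stdio.h>

/* Hypothetical stand-ins; real values would come from build configuration. */
#define NODE_TYPE  "FEDM"
#define FW_VERSION "v0.9"

/* Format the generic startup announcement described in the notes:
 * "HOST_STARTING,<node-type>,<firmware-version>". */
static int format_host_starting(char *buf, size_t len) {
    return snprintf(buf, len, "HOST_STARTING,%s,%s", NODE_TYPE, FW_VERSION);
}
```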


    Saturday, January 21, 2012

    Sat., Jan. 21st

    Since time is short, today I am starting the code review of Darryl & Juan's new code from home.  Currently working in the Dropbox, since I don't have remote access to the office PC with its "Q:\" network share.

    First, I renamed tcdp_driver.{c,h} to tsdp_driver.{c,h} for consistency with how this module is named in the files themselves.  Either name would have been OK (tc="time capture" or ts="time sync") but at minimum, we just need to be self-consistent, please!

    Going through tsdp_driver.h, I noticed that the students duplicated the DoubleWord and time_val typedefs in the new tsdp_driver.h file.  That is poor coding practice; since those definitions are needed in more than one module, they need to be pulled out into their own module, to eliminate the redundancy.  This is easily done!  If you don't do this, then you are just setting yourself up for extra work later if you ever need to change the definition, and you'll have major, hard-to-debug problems if one version of the definition gets changed and you forget to make the same change in the other one!  So, I created a new module "timeval.h," included by both tcdp_driver.h and tsdp_driver.h, and moved the DoubleWord and time_val typedefs into there.

    The TSDP_status data type looks useful.  I added a comment pointing out that the pulse count will roll over every 20 days.  I also renamed the struct to tsdp_status (all lowercase) to distinguish it from the typedef.

    There seems to be a redundancy between the extern int timing_pulse_count and the pulse_accum field in the TSDP_status structure.  I'm not sure yet why both are needed, or where TSDP_status is used.  We'll see. - NOTE ADDED LATER:  It looks like neither of these is even used anywhere yet!

    tsdp_init() declaration looks useful.
    tsdp_reset() looks useful.
    removed obsolete comments about icdp_have_data() and icdp_notify().
    tsdp_run(), _pause(), _handle_have_data(), and _handle_sync_error() declarations all look fine.

    Now going through tsdp_driver.c.

    Again there is inconsistency between the "TS" and "TC" names; changing all to "TS".

    report_buf_full is commented out, but it should have been replaced with something like report_sync_err to ensure that sync errors get reported to the server.

    get_next_word() is mostly repeated code from icdp_driver.c.  If we are really going to use the same interface to both datapaths, it might make sense to try to abstract out the common functionality to reduce code size.  However, doing this gets a bit tricky since each datapath has its own control bits.  You'd have to do a parameterized macro, or pass additional arguments, or do a class hierarchy.  Probably best to leave it as-is for now.

    In get_next_word(), the parens were missing after the invocation of the HAVE_DATA macro.  These are needed to convey that it is a function-like macro and not a constant.  Fixed that.
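    To illustrate the point (HAVE_DATA is the driver's macro; this definition and the register-read stub are mine):

```c
static unsigned ctrl_pio_val;   /* stand-in for the control PIO register */
static unsigned read_ctrl_pio(void) { return ctrl_pio_val; }

/* Function-like macro -- the parentheses at each use site are required.
 * (Illustrative definition; the real one lives in the driver headers.) */
#define HAVE_DATA()  (read_ctrl_pio() & (1u << 3))   /* TSDP_HAVEDATA bit */

/* A use site must write HAVE_DATA(); the bare name HAVE_DATA is not
 * macro-expanded at all, and fails to compile as an unknown identifier. */
```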

    get_longword() is completely identical, textually, to the one in icdp_driver.c, and this common functionality should probably be abstracted away.  E.g., you could pass a function pointer argument pointing to the appropriate get_next_word() method to use.  But really, the "right" approach to express the shared functionality here would be to use a class hierarchy, with different classes representing the different types of datapaths, in which shared methods like get_longword() could be inherited from a common base class, while things that are different like bit numbers, base addresses, etc. could live in a more specific derived class.  However, this would represent a major refactoring of the code, and is probably overkill for now.
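    The function-pointer variant mentioned above might look roughly like this; I'm assuming the longword is assembled from four byte-wide reads, most-significant byte first, which is an assumption about the real drivers:

```c
#include <stdint.h>

/* Shared longword assembly, parameterized over each datapath's own
 * get_next_word() routine (hypothetical sketch, not the actual driver). */
typedef uint8_t (*next_word_fn)(void);

static uint32_t get_longword_generic(next_word_fn next_word) {
    uint32_t w = 0;
    for (int i = 0; i < 4; i++)
        w = (w << 8) | next_word();   /* 4 byte reads, MSB first (assumed) */
    return w;
}

/* Tiny stub for illustration: yields 0x12, 0x34, 0x56, 0x78 in turn. */
static int stub_i;
static uint8_t stub_next_word(void) {
    static const uint8_t seq[4] = {0x12, 0x34, 0x56, 0x78};
    return seq[stub_i++ & 3];
}
```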

    tcdp_pull_pulse() is completely wrong, because it is written as if it were pulling an entire pulseform with a variable number of levels crossed, whereas really, all we need is a single leading-edge time.  This kind of simplification was the whole point of creating the new datapath; otherwise, we could have just added another channel to the existing datapath!  This goes along with my earlier critique of a few days ago, which is that I had asked (& expected) the students working on the gelware to simplify the new timing-sync datapath to remove all the unnecessary information from it, including the number of levels crossed.  Also, this C code doesn't even match what the datapath actually does, because the datapath only provides the leading-edge time, and this code is looking for both leading- and trailing-edge times.

    I don't know why it's so hard to get students to just THINK, and do what makes sense!  This goes for both the gelware and firmware groups!  Designing the new datapath was supposed to have been an EASY task, certainly it was easier than designing the old, more complicated one was to begin with!  All they had to do was to figure out what kinds of simplifications were appropriate to the new timing-sync application, in contrast to the pulse-form capture application needed for the PMT input channels, and then DO those simplifications!  ALL of them!

    Going forwards, we have a couple of choices here.  We can (a) revise the datapath gelware to get rid of the extra unnecessary information (number of levels crossed, which is always 1), or (b) leave that garbage there, and just skip over and ignore it (or else verify that it's 1, as a sort of pointless error-check) in the C code.  In either case, the C code has to be changed to no longer pull the falling edges, since they are not present, and not relevant.  Also, the result should not even go into a PulseForm data structure, since that is inappropriate for the timing-sync pulses.  PulseForm wasn't even defined in tsdp_driver.h (you had commented it out, and I deleted it), nor does it belong in there!  Ugh.
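    After either cleanup, pulling a timing-sync event reduces to reading a single 56-bit leading-edge time.  A minimal sketch, assuming the timestamp arrives over the 32-bit data PIO as a 24-bit high word followed by a 32-bit low word (the word split is my assumption):

```c
#include <stdint.h>

/* Assemble one 56-bit leading-edge time from two PIO reads -- no level
 * counts, no trailing edge.  The 24/32-bit split is assumed. */
static uint64_t tsdp_pull_edge_time(uint32_t hi_word, uint32_t lo_word) {
    return (((uint64_t)hi_word & 0xFFFFFFu) << 32) | lo_word;
}
```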

    In tsdp_notify()... WTF?  Why the heck are you still doing pb_add()?  That was only relevant for the actual cosmic-ray pulse data, which needs to be buffered in RAM so that it can be streamed out to the server!  There are too many timing-sync pulses (2,000/sec.) to possibly do that with them!  If you put them in the RAM buffer, you'll just quickly fill up the buffer!  And they aren't even the same kind of information as the cosmic ray pulses, so it wouldn't make sense to mix them into the pulse buffer anyway!  You need to THINK about what this timing-sync data is for, and use it for that!!!
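    To make the "fills up the buffer" point quantitative: at 2,000 pulses/sec, even a ~21 KB buffer is exhausted in well under two seconds (7-byte timestamps assumed; the exact record size doesn't change the conclusion much):

```c
/* How long until a buffer of buf_bytes fills, at the given record size
 * and arrival rate.  E.g. 21,504 bytes / 7-byte records / 2000 per sec. */
static double buffer_lifetime_sec(int buf_bytes, int bytes_per_pulse,
                                  int pulses_per_sec) {
    return (double)(buf_bytes / bytes_per_pulse) / pulses_per_sec;
}
/* buffer_lifetime_sec(21504, 7, 2000) works out to about 1.5 seconds. */
```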

    Finally, you commented out tsdp_handle_buf_full()... It's true that you don't have BUF_FULL interrupts per se, since there is no FIFO buffer any more, but in its place, you have the SYNC_ERROR interrupt, for if the datapath gets stalled and loses a pulse!  You need to do something about that!

    In interrupt_timing.c - you didn't define a separate INTERRUPT_MASK for the tsdp!  You are still using the one for the original icdp, which is not applicable to the new PIO!

    In tsdp_isr(), cur_mask is not needed at all, since you only have one HAVE_DATA bit.

    In conclusion: The job of updating the firmware to integrate support for the new timing-sync datapath is not finished by far, and substantial changes are still needed before the code will be complete, apparently correct, and ready for testing.  I made a few minor corrections as I went along in the code review, but the major changes still need to be done.  Here is a list of the most important changes needed:
    • [ ] Ideally, the gelware should be changed to remove all of the "number of thresholds crossed" stuff, which is completely unnecessary and inappropriate to even include in the timing-sync edge-capture datapath.
    • [ ] tsdp_pull_pulse() needs to be changed to do what's appropriate for the new datapath.
    • [ ] tsdp_notify() needs to USE the received timing-sync edge information in a sensible way.  For example, remember the data for the most recent pulse, together with the number of timing-sync pulses received, and (elsewhere in the code) tag each individual cosmic-ray pulse with this information as it comes in.
    • [ ] Sync errors need to be handled appropriately.  Presently they are not handled at all.
    If we are going to meet our goal of having the timing-sync datapath fully integrated & tested by the end of the month, these code changes need to be completed within the next 3 working days (by Wednesday) so that I can have a few more days to review, test & debug if necessary.  Students: If you can't complete all these changes by end of Wednesday, I will have to go ahead and do them myself.  Let me know ASAP which it will be.  We have no more time to wait.  If necessary, ask questions if you don't understand what you need to do.  
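    The tsdp_notify() item in the list above can be sketched as follows; all names here are hypothetical, but the shape of the fix is: remember only the latest sync edge plus a running count, and tag each cosmic-ray pulse with both:

```c
#include <stdint.h>

/* Hypothetical sketch: no pb_add() of sync pulses into the RAM pulse
 * buffer -- just remember the most recent edge and how many we've seen. */
static uint64_t last_sync_edge;     /* latest 56-bit leading-edge time */
static uint32_t sync_pulse_count;   /* running count of sync pulses    */

static void tsdp_notify_sketch(uint64_t edge_time) {
    last_sync_edge = edge_time;
    sync_pulse_count++;
}

/* Called from the cosmic-ray pulse path to absolute-time-tag each pulse. */
static void tag_pulse_sketch(uint64_t *sync_edge, uint32_t *sync_count) {
    *sync_edge  = last_sync_edge;
    *sync_count = sync_pulse_count;
}
```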

      Fri., Jan. 20th

      Darryl & David are here working on the paper with Ray.

      Michael Sprouse (new volunteer) is here and I took him through the system, both hardware and software.  I showed him where I was at when I left Wednesday, testing the new timing-sync edge-capture datapath using the scope.  When I left Wednesday I was thinking that there was still some problem with the front-end edge-capture module because the state machine seemed to be skipping some states.  But while showing the traces to Michael Sprouse I realized that I just hadn't zoomed in far enough - the state machine is indeed going through the expected state sequence 0, 1, 2, 3, it's just that states 1 and 2 only last 5 ns each (one cycle of the 200 MHz clock).  State 1 is wait_fall, which we'd normally expect to stick around for 100 ns, but what's happening here, apparently, is that there is some intense ringing on the input that causes it to go low immediately after first going high.  It's not really a serious problem because we really only care about the leading rising edge, but anyway, next time I'm in the lab, I should test to see if this problem goes away if we remove the probe on the PMT_3 (TimingSig) board-level node; it's possible that at least some of the ringing may be being caused by the probe itself, even though (I thought) we were using a 1-Mohm probe (check this again to make sure).

      Then I turned to integrating the new firmware - Darryl confirmed that the latest version of it is in Dropbox (\FEDM_design\FEDM_code\q91\software_v4\FEDM_ctrl_fw).  The new files are tcdp_driver.{c,h} (this module should be renamed to tsdp_driver) and interrupt_timing.c, which was copied from interrupt.c and then modified to add the new functionality.  I went into Cygwin and did a context diff between interrupt.c and interrupt_timing.c to see what changed between them.

      One surprise is that the code to actually integrate the timing information into the data sent to the server apparently has not been written yet - so, even if everything that Darryl & Juan already did works perfectly (and this still needs to be tested), we still have more coding to do before we achieve our milestone of getting a complete stream of pulses plus absolute time-tagged data to the server.

      Wednesday, January 18, 2012

      Wed., Jan. 18th

      OK, let's try the gelware version that I compiled before leaving yesterday, which tapped out the current-state bits of se_pulse_cap_tsedge_56.vhd.

      OK, we seem to be stuck in state 1, "wait_fall" - that's the state where we're waiting for the falling edge.  This makes sense, because the students might have commented out the code that looks for the falling edge.  (Although I thought I told them we still needed a state for that.)

      Looking at the code... The state register is enabled by the "write state" signal (wstate).  wstate should go high when wfall ("write fall time") is high.  wfall is supposed to go high in the "wait_fall" state when "just_fell" is high.  Aha, they commented out the computation of "just_fell."  No wonder.
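      The chain described above (just_fell → wfall → wstate) can be mimicked in C to show why losing just_fell freezes the machine in wait_fall; the signal names come from the VHDL, but this little model is mine:

```c
#include <stdbool.h>

/* Toy model of the wait_fall hang.  With just_fell computed, the state
 * advances on a falling edge; with it commented out (forced false, as in
 * the students' VHDL), wstate never asserts and the state never changes. */
enum { ST_WAIT_FALL = 1, ST_NEXT = 2 };

static int wait_fall_step(int state, bool in_now, bool in_prev,
                          bool just_fell_enabled) {
    bool just_fell = just_fell_enabled && !in_now && in_prev;
    bool wfall  = (state == ST_WAIT_FALL) && just_fell;  /* write fall time */
    bool wstate = wfall;                                 /* state write-en  */
    return wstate ? ST_NEXT : state;
}
```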

      Fixed that; now we seem to be getting real results.  Let me probe HD_3 (on J55 pin 2) with the pink probe.  Yep, that looks good too.  OK, I think we're ready to integrate the firmware now.

      One thing though:  It looks like the current design is 'counting' the number of thresholds crossed (which is always 1) and passing this information along through the datapath and to the CPU.  This is unnecessary, and ideally it would have been simplified away when they first made the new datapath.  However, we probably don't have time now to go back and make this change.  We should probably just proceed with the design as-is.

      Emailed the ECE senior design students about the result of today's testing; asked them to confirm where the latest version of the firmware (Nios C code) is sitting, so I can integrate it into the "master copy" on Q:\.

      Remembered I still needed to test with the real timing-sync signal, from the DE3 board.  Hm, not getting reliable results now.  Signal too weak?  Need lower threshold?  Aha, looks like the DE3 is still producing a negative pulse, from when I was trying that.  Need to recompile for positive pulse.

      OK, that's fixed.  The timing signal is being passed properly between the boards.  And its amplitude is larger now, for some reason (perhaps because that old 10-ohm short is gone now?) - it's ~1.6V now - plenty big enough to detect with a substantial threshold (say 0.8V) that would give us a healthy noise margin.

      An alternative - we could set that pin to use a digital I/O standard like 1.5V, and then do away with that DAC level entirely - then we could allow that level to be part of the normal voltage ladder.  Trying that now.  (Made code changes; doing Quartus compile now.)  Due to substantial changes to pin assignments, I've set the current design under a new revision ID, COSMICi_FEDM_RevA.  The old pin assignments are still in the previous revision, "New_with_Nios_trim".

      To summarize the present versioning information:
      • Project revision is:  COSMICi_FEDM_RevA
      • Firmware version number is presently:  0.10.
      • Top-level Quartus file is presently named:  COSMICi_FEDM_top14.bdf  (internal version 0.15).
      Oops, we got a fitter error - looks like there is a conflict between the possible VCCIO values for I/O bank 2 and the choice of 1.5V as the I/O standard for the TimingSig pin (which I chose to be K20).  What if I used a different pin, like H1?  (TimingSig fans out to 12 different input pins.)  Or better yet, W1?  That's in VREFGROUP_B6_N0, which isn't being used for anything else currently.  In fact, all of IOBANK_6 isn't being used presently.  (The previous two observations are from the Pin Planner.)  OK, let's try again with W1 as the choice for the TimingSig input, with a 1.5V digital I/O standard...  Yep, that fixed the fitting problem.  Burning now...

OK, that works, but the diagnostic outputs look a bit noisy.  Sometimes the state machine skips a state; this is not good.  Perhaps because the signal from W1 has a longer distance to travel, across other circuits?  Let's try cleaning it up by disabling diagnostic outputs we aren't actively using.  How about we get rid of the bus outputs, and let the green probe just look at the have-data signal?  We'll have it be our only output.  It shouldn't interfere with the state machine, since it doesn't change until later.

      Tuesday, January 17, 2012

      Tue., Jan. 17th

The Quartus compile I started Friday finished after I left.  First thing I did today:  Burned it into the FEDM.  Next major thing on agenda:  Debug the timing input capture datapath (which is still producing no output with stream_pulse_out_test).

      David texted that he is working from home on the paper today (and that this is OK with Dr. O'Neal).

      OK, let's debug.  First, let's look at the HAVE_DATA signal from the timing-edge input capture datapath; this is HD_3.  If it's not rising, then that would explain why nothing else is working.

      The HD_3 node is already tapped out to J55 pin 2.  Let's take a look at it...  Let's use channel 3 (previously used to monitor the threshold level from DAC#6).  Current scope configuration is:
      • Channel 4 (green) - J60 pin 1 - PMT_3 node (TimingSig)
      • Channel 2 (blue) - J79 pin 1 - pf3[5] - Output of comparator between TimingSig and DAC#6
      • Channel 3 (pink) - J55 pin 2 - HD_3 - HAVE_DATA output from tsedge_datapath_v1_56
      Pink stays low.  So now, we're going to have to dig into the internals of the datapath to figure out why it's hosed.

      First, let's see if the DP is reporting that it's stalled.  Hooking up the sync_error status output flag to pin T7 --> J46 pin 2 (behind J49, the DE9 header).  Since the input signal TimingSig is a known factor now (coming from the Waveform generator), we'll trigger off of the blue comparator output instead, and cannibalize the green probe (channel 4) to monitor sync_error.

      Current scope configuration is:
      • Channel 2 (blue) - J79 pin 1 - node pf3[5] - Output of comparator between TimingSig and DAC#6
      • Channel 3 (pink) - J55 pin 2 - node HD_3 - HAVE_DATA output from tsedge_datapath_v1_56
      • Channel 4 (green) - J46 pin 2 - node sync_error - status flag output from  tsedge_datapath_v1_56
      At this point we have to recompile in Quartus, since sync_error wasn't already tapped out.

OK, that compile finished and I burned it to the board.  sync_error is staying low, so tsedge_datapath_v1_56 isn't getting stalled (or at least, isn't detecting that it's stalled).  Let's crack open tsedge_datapath_v1_56.  The first part of it (which produces sync_error) is pulseform_cap_tsedge_56.  Let's look at its handshake output, hs_prod.  Let's tap it out.

      Calling it int_hsprod_tap (output port of tsedge_datapath_v1_56) --> int_hsp_debug (top-level node) --> PIN_E12 --> J48 pin 2 (behind DE9 connector).

      Let's cannibalize scope channel 3 (pink) to inspect it.  Current scope config is therefore:
      • Channel 2 (blue) - J79 pin 1 - node pf3[5] - Output of comparator between TimingSig and DAC#6  (Triggering on this one.)
      • Channel 4 (green) - J46 pin 2 - node sync_error - status flag output from tsedge_datapath_v1_56
      • Channel 3 (pink) - J48 pin 2 - node int_hsp_debug - Tap out of internal producer handshake from inside tsedge_datapath_v1_56.
OK, nothing from that either.  Therefore, we're going to have to crack open pulseform_cap_tsedge_56, to figure out why it isn't generating its producer handshake output (port hs_prod).  The internal node is named "prod".  It comes from pulse_combine_tsedge_56.  However, upstream of this point there is another internal handshake: the hs_datarec output from pulse_prep_tsedge_56, which feeds into the hs_data1 input of pulse_combine_tsedge_56.  Let's tap this out and inspect it.  I'll just use the same output path as before.  Recompiling now...

      Nope, nothing there either.  Now we have to crack open pulse_prep_tsedge_56.  Inside there is another internal handshake, between the hs_prod output of se_pulse_cap_tsedge_56 and the hs_prod input of cs_combine_tsedge_56.  Let's call that node int_prod_hs, and tap it out as port int_phs_tap, and again we'll reuse the same output pathway as before.  Again, it should come out on the pink trace.

      Still nada!  Let's look inside se_pulse_cap_tsedge_56.vhd.  Let's tap out the current-state bits.  We'll put them on digital input bus 2 (B2) on the scope, taking them out on pin 1 of J76 & J77 (replacing pf3[1..0] which we aren't using).

      Doing the Quartus recompile... I have to leave now for the EEP workshop; we'll have to continue this tomorrow.

      Friday, January 13, 2012

      Fri., Jan. 13th

Today both the CTU and the FEDM were acting pretty flaky.  I couldn't get the CTU to connect properly to the server so we set up the Tektronix as an input stub to the FEDM.  Then the FEDM wasn't reliably doing the threshold comparison on the input, and the input node was sitting at the wrong level.  Also the FEDM kept rebooting itself.  David and I spent a long time fiddling around and doing various tests.

Finally at one point David noticed a little fleck of metal sitting across two pins of one of the DACs.  After getting rid of that (and actually even before then!), we no longer had the 10 ohm short of PMT_3 to +2.5VCC.  However, we still had an unexplained 500 ohm short from there to ground.  So, we went ahead and biased PMT_3 to GND and reintroduced the DAC#6 setting at +300 mV.  Also, David noticed that the cooling plate wasn't seating properly on the metal block.  Now we finally have a nice reliable input pulse, and no more resets, but still no output from stream_pulse_out_test.  We'll have to finish debugging that next week.

      Some goals for the coming months, from group meeting:
      • By end of January, have timing sync datapath tested, debugged, and validated, and a complete set of all data (timing data plus pulse data from all 3 detectors) streaming to server.  (Still at 200 MHz.)
      • By end of February, have the LogicLock work finished, and 500 MHz data streaming to server.  Also have the mechanical hardware substantially completed (for the midterm hardware demo).
      • By April have all hardware ready to install in CLC (if not installed), and have the server software in pretty good shape at least - improvements to it can always be made later.

      Wednesday, January 11, 2012

      Wed., Jan. 11th

      We finally have the authorization codes from Mentor.  Late this afternoon, at lab, I will start downloading & installing the software, so we can begin designing new boards and/or making changes to Sachin's board.

      We have the workshop scheduled from 2:00-4:00 pm today.  Need to find out from Aarmondas if a room was reserved, if so where.  I've loaded the presentation onto my iPad.  During my lunch hour I'm planning to go to Best Buy and buy the little VGA converter dongle, so I can hopefully give the presentation directly from the iPad.

      Once I'm back at the lab, I also need to install and test the Quartus compile I started last night before I left.

      Aarmondas and I reserved B202B, the small conference room next to the Dean's conference room, for the workshop.  I asked Aarmondas to call & email everyone to let them know the location.

Gave the workshop.  The ECE senior design students, as well as Brian Kirkland and Michael Sprouse, were in attendance.  Going to post the slides on the group blog.

      Burned the latest gelware (expecting a negated timing sync input which is compared with the first threshold) onto the FEDM board.  Now need to test with the scope.

      No data yet from stream_pulse_out_test.  Let's examine the comparator output pf3[0].  Hooked it up to J76 pin 1.  Now doing a Quartus compile.

OK, pf3[0] is going to 0 shortly after the PMT_3 (TimingSig) pulse crosses below the VTH1 threshold, as expected.  I measured the period of pf3[0], and it is 409.6 us, as expected, so any noise on the input node is not enough to cause glitching of the comparator output.  So why am I getting no data from stream_pulse_out_test?  Need to test some more signals tomorrow.

      Tuesday, January 10, 2012

      Tue., Jan. 10th

      To do today: Modify firmware to set the last DAC level to +300 mV for detecting the timing-sync edge crossing.

      David is out sick, but Darryl is here.  He is looking at some Altera online courses.

      Have to leave a little early this evening to go to the entrepreneur workshop.  I guess.  (Not very excited about going.)

      License server is not running, or not serving our floating licenses.  Starting LMTOOLS and re-reading license file.  Now the license server is up.

Examined the layout and the board carefully, looking for things that might account for DAC #2 failing as well as the 10 ohm short between +2.5VCC and PMT_3.  I noticed on the layout that PMT_3 passes underneath the chip for DAC #2.  Also, it crosses right underneath a pad of C139, which is part of the +2.5VCC node.  A hole between layers in either location might account for the 10 ohm short (since DAC #2 is powered by the +2.5VCC supply).  However, peering closely at both parts through multiple magnifying lenses, I didn't see anything that was clearly suspicious.  You can't really see underneath the parts, anyway.  However, just in case, I blew on both parts with the dust remover spray (1,1-difluoroethane, from Radio Shack) - who knows, this might get rid of a bit of grit wedged underneath the chip.  Obviously this is just a desperation maneuver, and I don't expect it will necessarily help.

      Tidied up a couple of slides for the workshop, which is tomorrow at 2.  Earlier today I emailed Aarmondas asking him to reserve a room.

      I'm now modifying the init_dacs() function in dac_driver.c, to set the last threshold to +300 mV.  The others are still arranged in a logarithmic ramp from -200 mV to -1V, although now with one fewer step.

      Compiled new code in Eclipse.  Compiling it into Quartus design.

      CTU is connecting/running fine today, aside from no satellites acquired (unsurprising since GPS is cold-booting).  We probably really need to get a new GPS module that can connect more quickly.

      Ah, I just remembered, due to the 10 ohm short between PMT_3 and +2.5VCC, when the power is on, that node floats at +2.5V instead of at GND.  Therefore, the timing sync pulse has to be a negative pulse.  That is accomplished easily enough by a NOT on CLK_OUT in the Quartus design for the CTU.

      This means (on the good side) that we can go back to the 5 thresholds we had previously, and just re-use the first (-200 mV) threshold for the timing sync input.

      Made those changes, now doing the Quartus recompile.  Have to leave now though.

      Monday, January 9, 2012

      Mon., Jan. 9th, '12

      Plan for today:  Remove the surface-mount chip capacitor C148 (next to chip resistor R91) using a relatively sharp-tipped soldering iron, by going back and forth between the two pads heating them.  OK, we did that.

David and I measured the strength of the pulse on the PMT_3 (TimingSig) internal node at 360-400 mV (equilibrium "high" pulse level, although there is ringing up to about +900 mV with these probe cables).  This is, I think, a bit better than before the capacitor was removed.  But it's still not high enough to allow us to leave the current threshold levels unchanged.  We'll have to modify the firmware to set the last threshold level to about +300 mV (as opposed to the +1.5V where it is currently).  I am a little bit concerned that this 300 mV level is so low that we may see occasional noise pulses if we don't actively filter them out.  We can filter out noise pulses in software, except for the occasional one that is close in time to the expected timing pulses.

We also re-checked the resistance between node PMT_3 and the +2.5VCC node.  It is still only 10 ohms, which is probably still contributing to our problems.  I went through the layout & schematic and checked the nominal resistance of all the resistors connected to the +2.5VCC node - they are all high, and many of them only connect to disconnected paths, so the problem is still unexplained.  I also checked the resistance between +2.5VCC and GND; it is 365 ohms, so it doesn't account for the 10 ohms.  I'm about ready to give up on trying to track down this problem.  The only way to make progress on it at this point might be to get some kind of IR imaging camera and try to find an unexpected hot-spot on the board.  2.5V across 10 ohms is 250 mA, times 2.5V makes 625 mW; although this is less than a watt, it is still possibly enough to produce some observable local heating.
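A quick numeric check of that dissipation estimate:

```c
/* Sanity check of the heating estimate for the unexplained 10-ohm
 * path between +2.5VCC and PMT_3: I = V/R, then P = V*I. */
double short_power_mw(double volts, double ohms)
{
    double amps = volts / ohms;       /* 2.5 V / 10 ohm = 0.25 A */
    return volts * amps * 1000.0;     /* 2.5 V * 0.25 A = 625 mW */
}
```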

      I took Samad's power distribution board out of the loop for now, because it is too difficult to maintain a reliable power connection to the CTU with it in place.  This problem needs to be addressed sometime.  For now, we will just power the CTU directly from the power supply.

      The CTU's Wi-Fi module is not reliably connecting to the server today for some reason.  It did manage to connect once, but it didn't feed through any data.  Need to run some more tests sometime to try to track that problem down.

      I had a minor syntax bug in the new appdefs.py file, which was quickly fixed.