Monday, January 30, 2012

Probably the biggest priority for today is to track down the cause of this problem where the timing-sync datapath stops responding shortly after the system starts up.  (If we can't fix that, everything else is pointless!)  Some things to try:
  • Put the scope in "Normal" as opposed to "Automatic" trigger mode, and see if there's anything unusual about the last pulse.
  • Hook up the JTAG cable and turn on diagnostic output in the firmware to see if it gives us any clues - e.g., if the firmware were resetting or suspending the datapath, that would cause problems.
  • Look at the 200 MHz (5 ns period) clock output from the PLL to make sure it is still running.
  • Look at some raw bits of the clock counter (a block of sum bits from somewhere in the middle) to make sure it is actually still counting.
  • Look at more of the internal state bits of the front-end edge-capture module.
  • Actually try out the new firmware (in place of the present output stub) and see if that works any better.  (Although it doesn't seem possible that the output stub could be causing the problem, since we already know the front-end edge-capture state machine isn't getting stuck waiting for a handshake return.)
Waiting for Dropbox to finish syncing before I start the server.  In the meantime, should I update the Wi-Fi modules?
    George came by and showed me a couple of sample copper heat pipes (5/8" dia.) with rubberized coatings.  I suggested an overnight freezer test re: condensation.  I also suggested they buy a new thermoelectric plate so they don't have to tear apart this one (so we can keep using it in the meantime, and so we have proper specs on the new one).  He said he has done thermal calculations and it should get the chip down to 1 degree C.  I told him to include the calculations in the next report & presentation.

    OK, doing some testing now.  There is nothing weird about the last pulse received.  And there is currently no output from the Nios terminal (debug output is still disabled).  However, I noticed that the thing always seems to die at the same time that we acknowledge the WIFI_READY message.

    One possibility:  Maybe the timing-sync datapath is getting reset or un-enabled somehow as a side-effect of what is happening in that part of the code.  The real code to control it hasn't been integrated yet.  If I constantly leave it not-reset and enabled, will that fix the problem?  Worth a try.  Doing Quartus compile now.  In the meantime, I can work on integrating Juan's changes to the firmware.

    Here are Juan's notes on his changes, from Dropbox\FEDM_design\FEDM_code\q91\eclipse_workspaces\jpc_workspace\1_25_12.txt:

    tsdp_driver.h
    - Changed the names of members of tsdp_status struct to "pulse_index" and "last_pulse_time"
    - Created global tsdp_status struct object "last_tsdp_status"
    - Commented out "timing_pulse_count" since we already have "pulse_index", which also keeps track of the number of timing pulses received.
    tsdp_driver.c
    - Got rid of "reentr.h" since it appears that we were only using re-entrant printf in tsdp_diagnostics()
    - Got rid of globals: pulses_seen, initialized, and last_pulse_time since we are now using a global tsdp_status struct.
    tsdp_pull_pulse:
    - Removed threshold level index variable "i"
    - Getting rid of PulseForm data structure and now using tsdp_status data structure
    - Getting rid of number of levels crossed (1) and falling time
    - Storing the long_word into last_pulse_time of the global tsdp_status struct
    tsdp_run and tsdp_pause:
    - Changed INFO message to "Timing-Sync" instead of "Input-Capture"
    - Changed base address to TSDP_CTRL_BASE and enable mask to TSDP_ENABLE_MASK
    tsdp_reset:
    - Changed the comments and INFO messages to reflect the new datapath
    tsdp_notify:
    - Got rid of PulseForm data structure since we need to use timing-sync data structure
    - Created timing_pulse object to interface with the timing sync data structure (tsdp_status)
    Still need to handle sync errors!!! (we will do this later since we first need to have basic functionality)
    icdp_driver.c
    - Filling in timing-sync data into the global struct using the PulseForm data structure

    icdp_driver.h
    - Added global struct to the PulseForm data structure.
    - Include header file to tsdp_driver.h

    server.c
    - Showing "last_pulse_time" (1/27/12)

    Update on datapath changes:  Removing the reset/enable control seems to have fixed the "dying" problem.  Let me burn the new design to make sure.  Yep, seems OK.

    Now working on merging Juan's changes.  Found a conflict between the HAVE_DATA() macros in tsdp_driver.h and icdp_driver.h.  Renamed tsdp's to TS_HAVE_DATA() to fix.

    Aha, I have a suspicion as to why the ts datapath was hanging.  When WIFI_READY is received, we display a NORMAL-mode message to stdout.  Since the JTAG port was disconnected (or rather, since the reentrant output routines don't work properly), this would hang while interrupts were disabled, thereby preventing... Oh, I don't know.  Anyway, I'm going to try out the new firmware.

    Finding a few syntax errors while compiling, fixing...  OK, the code compiles now.  A couple of things to note: interrupt_timing.c is now in its proper place as interrupt.c (the old interrupt.c is preserved as interrupt.c.bak).  Also, the stdout_buf.c code, which I am still working on, is excluded from the present build.

    Now doing the Quartus compile. The Nios is now hooked up to drive the new datapath.  Removed the output stub in new top-level file _top15.

    While the compile is brewing, I'm working on my stdout_buf.c code.  One neat thing I can do with it (once it's complete) is to pass its output to the server via log messages.  Like HOST_DIAG, as an alternative to LOGMSG for log messages coming directly from the sensor host via the UART bridge.  This can be used for diagnostic output when the real STDOUT is muted.

    OK, doing a run now with all 4 detectors & the timing-sync datapath.  Some notes:

    • Trial #1 config:
      • Case #1 -> Cable #1 -> SMA#1 -> "PMT #1" pad -> ICDP channel #0 - 33
      • Case #2 -> Cable #2 -> SMA#2 -> "PMT #2" pad -> ICDP channel #1 - 10
      • Case #3 -> Cable #3 -> SMA#3 -> "PMT #4" pad -> ICDP channel #2 - 214 (FIFO full, lost pulses)
    That trial produced so much junk in the output (FIFO_FULL warnings, etc.) that it was difficult to see any actual pulse data.   Let's switch cables 2&3 to opposite SMAs, so that their pulse rates are more balanced and hopefully the coincidence detector won't be overwhelmed by the imbalance.
    • Trial #2 config: 
      • Case #1 -> Cable #1 -> SMA#1 -> "PMT #1" pad -> ICDP channel #0 - 173
      • Case #2 -> Cable #2 ->  SMA#3 -> "PMT #4" pad -> ICDP channel #2 - 59
      • Case #3 -> Cable #3 ->  SMA#2 -> "PMT #2" pad -> ICDP channel #1 - 202
    Trial #2 is behaving much better, no FIFO_FULL or lost-pulse events now.  The pulse rates are fairly balanced, although we could still maybe improve it a bit.  We are getting some 3-way coincidences.  The pulses are labeled with the time references; this seems to be working.  Here is a sample of some output:

    HOST_STARTING,FEDM,v0.10

    DAC_LEVELS,-0.200,-2.500,-0.299,-0.447,-0.669,-1.000

    HOST_READY
    NC_PULSES,1645056982,169,64,201
    NC_PULSES,3190429776,161,55,202

    ACK,WIFI_READY
    NC_PULSES,4688923073,173,59,202
    PULSE,73011,5980986267,2,1,5981023576,1,(0,2)
    PULSE,73011,5980986267,3,1,5981023577,1,(0,6)
    NC_PULSES,6332065602,190,63,201
    NC_PULSES,7771016797,141,55,200
    NC_PULSES,9305709023,156,47,202
    NC_PULSES,10760125101,145,67,201
    PULSE,138181,11319708242,3,2,11319785150,1,(0,22)
    PULSE,138181,11319708242,1,1,11319785156,1,(0,5)
    NC_PULSES,12149040460,145,56,200
    NC_PULSES,13780610760,184,70,204
    NC_PULSES,15433514672,173,76,203
    PULSE,204374,16742234860,3,3,16742285552,1,(0,3)
    PULSE,204374,16742234860,1,2,16742285552,1,(0,7)
    NC_PULSES,16826526099,132,59,202
    NC_PULSES,18455981655,153,68,201
    NC_PULSES,19816021711,159,57,201
    PULSE,251390,20593783022,1,3,20593836141,1,(0,3)
    PULSE,251390,20593783022,3,4,20593836145,1,(0,1)
    NC_PULSES,21493783386,195,69,200
    NC_PULSES,22915565251,135,53,200
    PULSE,292683,23976503452,3,5,23976529821,1,(0,7)
    PULSE,292683,23976503452,2,2,23976529821,1,(0,6)
    NC_PULSES,24454253751,165,66,200
    NC_PULSES,25999090947,167,68,202
    NC_PULSES,27472235284,152,52,200
    NC_PULSES,28912083859,167,55,201
    NC_PULSES,30590600583,182,72,200
    NC_PULSES,32097939517,162,62,206
    NC_PULSES,33786325217,203,59,205
    NC_PULSES,35446408708,162,68,200
    PULSE,440067,36050193749,3,6,36050253295,1,(0,12)
    PULSE,440067,36050193749,1,4,36050253297,1,(0,5)
    PULSE,440067,36050193749,2,3,36050253298,1,(0,4)
    PULSE,442189,36224027895,3,7,36224058045,1,(0,64)
    PULSE,442189,36224027895,2,4,36224058046,1,(0,20)
    PULSE,442189,36224027895,1,5,36224058046,1,(0,16)
    NC_PULSES,36914981364,162,53,200
    NC_PULSES,38386147406,153,57,203
    NC_PULSES,39795890861,137,67,206
    NC_PULSES,41231182045,161,61,203
    PULSE,503847,41275048491,2,5,41275065988,1,(0,6)
    PULSE,503847,41275048491,3,8,41275065990,1,(0,6)
    NC_PULSES,42748100077,139,54,203
    NC_PULSES,44347019332,164,70,202
    NC_PULSES,46060983716,190,74,211
    PULSE,563018,46122334272,3,9,46122381769,1,(0,5)
    PULSE,563018,46122334272,2,6,46122381769,1,(0,3)
    NC_PULSES,47734791310,181,64,201
    NC_PULSES,49299944169,161,63,200
    NC_PULSES,50767054455,152,53,200
    NC_PULSES,52367216984,180,65,203
    NC_PULSES,53873868242,154,57,201
    NC_PULSES,55627477276,194,72,204
    NC_PULSES,57098080579,143,70,205
    PULSE,709732,58141139048,1,6,58141181168,1,(0,10)
    PULSE,709732,58141139048,3,10,58141181170,1,(0,4)
    NC_PULSES,58550858533,139,51,200
    NC_PULSES,60207382783,151,55,209
    NC_PULSES,61761822335,163,70,203
    NC_PULSES,63099029287,140,65,202
    NC_PULSES,64660494559,152,65,200
    NC_PULSES,66217918160,174,58,206
    NC_PULSES,67680886852,152,54,201
    PULSE,837403,68599942245,3,11,68600016324,1,(0,12)
    PULSE,837403,68599942245,2,7,68600016324,1,(0,8)
    PULSE,838030,68651306060,1,7,68651370992,1,(0,4)
    PULSE,838030,68651306060,3,12,68651370994,1,(0,5)
    NC_PULSES,69352417124,148,73,213
    NC_PULSES,70960480577,167,78,205
    PULSE,868239,71126026126,1,8,71126041421,1,(0,12)
    PULSE,868239,71126026126,3,13,71126041425,1,(0,2)
    NC_PULSES,72536209134,144,60,210
    PULSE,892863,73143223220,3,14,73143298091,1,(0,9)
    PULSE,892863,73143223220,2,8,73143298092,1,(0,8)
    NC_PULSES,73768295892,121,39,201
    NC_PULSES,75230961517,152,62,208
    NC_PULSES,76740108234,154,62,211
    NC_PULSES,78451831524,172,61,202
    NC_PULSES,79861097624,146,62,203
    NC_PULSES,81437231122,177,55,200
    NC_PULSES,83114305623,172,59,201
    NC_PULSES,84581433149,141,63,201
    NC_PULSES,86094560412,155,79,209
    NC_PULSES,87682077780,151,63,203
    PULSE,1073206,87916914548,2,9,87916946298,1,(0,8)
    PULSE,1073206,87916914548,1,9,87916946301,1,(0,4)
    NC_PULSES,89258876064,165,57,202
    STOP
    STOP

    ACK,STOP

    We're now in a position to check the rate of timing-sync pulses, to see if we're getting them all - if not, then the accuracy of our "absolute time" calculations could become badly screwed up, drifting away from where they should be.  Let's start with the first two time-references seen in the above data:
    • Time Reference #1:  Timing-sync edge #73,011 on PLL clock cycle #5,980,986,267.
    • Time Reference #2:  Timing-sync edge #138,181 on PLL clock cycle #11,319,708,242.
    Let's compute the time deltas during this interval:
    • We had 138,181 - 73,011 = 65,170 timing-sync cycles (nominally 409.6 us each) = 26.693 632 secs.
    • We had 11,319,708,242 - 5,980,986,267 = 5,338,721,975 PLL cycles (nominally 5 ns each) = 26.693 609 875 secs.
    The difference is -0.000 022 125 secs = -22.125 us (PLL behind OCXO).  Since this discrepancy is less than one timing-sync cycle, we probably did not miss any edges.  In relative terms, the PLL is slow by -0.828 849 ppm.  

    That is within the frequency calibration tolerances involved, which are 2 ppm and 1 ppm for the CTU's OCXO and the FEDM's TCXO respectively.

    Now let's see how stable the relative frequencies are.  We'll do this by looking at another interval, based on the next time reference in the above dataset:
    • Time Reference #3:  Timing-sync edge #204,374 on PLL clock cycle #16,742,285,552.
    So during the interval from time reference #2 to #3, we had:
    • Timing-sync cycles:   204,374 - 138,181 = 66,193 @ 409.6 us = 27.112 652 8 secs.
    • PLL cycles:   16,742,285,552 -  11,319,708,242 = 5,422,577,310 @ 5 ns = 27.112 886 55 secs.
    This time, the TCXO is ahead of the OCXO by +0.000 233 75 secs = 233.75 us.  It's conceivable here that we might have missed a single sync pulse in the 2nd interval, which would have set the OCXO-based sync count behind by 409.6 us.

    The relative frequency deviation is +8.621 436 ppm (TCXO ahead of OCXO).  This is outside the bounds of the specifications, which is another reason to think that we might have missed a sync pulse.

    It would probably be a good idea for the ISR in the timing-sync edge-capture driver to do a sanity check on the elapsed time since the previous timing-sync pulse.  Ideally there should be 409.6 us / 5 ns = 81,920 cycles of the 200 MHz PLL clock for every timing-sync pulse.  With calibration errors in the ppms, the number of cycles actually seen should not deviate from this number by more than +/- 1 cycle at most.  Thus, we should do some serious error-checking on the difference seen here:
    • If it differs from the expected value by 1 cycle, report this at INFO level.
    • If it differs by >1 cycle, report this at WARNING level.
    • If it differs by >10 cycles, report this at ERROR level.
    • If it's fairly close to 2 x 81,920 = 163,840 (within +/- 10 of this, say), treat it as a dropped pulse, and report it at WARNING level.
    • If it's much larger than that, so that more than 1 pulse in a row may have been dropped, treat this as a CRITICAL error, since basically our absolute time measurements will be totally screwed at that point.  (Well, we could try to patch over longer outages, but the chances of getting permanently misaligned increase the longer the outage is...)

    Left the current run going, to collect some shower data overnight...

    [From home, later this evening...]

    Some more thoughts on the startup sequence...  Really the time-sync counts in my run this evening are not meaningful, because the CTU had already been running for a while before the FEDM started up, so there is no way to align the time-sync counts with the absolute time stamps registered by the CTU.  Also, the GPS module wasn't acquiring satellites, as usual.

    It would be a good idea for us to build some more smarts into the system, so that these issues can be resolved automatically (or mostly so) on startup.  Right now, it is super-tricky even to get the GPS acquiring satellites (until we figure out how to do it without manual fiddling), and so the chances that the CTU will work properly when first powered up (after the FEDM is already running) are close to zilch.

    Here would be what a sensible startup sequence might look like:

    • CTU and FEDM subsystems are powered on, roughly simultaneously (within a few seconds of each other, at least).  This will be easier to do once Samad designs/builds us a new, higher-quality power supply solution that can power both boards through our ATX supply.  For now, we can just manually power up both boards at about the same time.
    • The FEDM board starts up with its high-speed time counter and all datapaths reset, in a suspended state, waiting...  Likewise, the CTU starts up with its PPS edge-capture datapath and its OCXO-based counter reset, in a suspended state, waiting...
    • The Wi-Fi scripts start up, at more or less the same time, opening up the terminal windows on the server, as presently, and going into Trefoil mode.  Just as they are about to enter their main loops, they send the "WIFI_READY" message to their hosts.  (This is already implemented.)
    • When the FEDM gets the WIFI_READY message, it sends the FEDM_READY message to the server.  (I think we forgot to initialize the TSDP_status; remember to do this...)
    • Likewise, when the CTU gets the WIFI_READY message, it sends a CTU_READY message to the server.
    • Once the server has received the CTU_READY, it's time for it to start setting up the CTU properly.  It sends a GPS_UNMUTE message to tell the CTU to start passing the NMEA datastream through to the server.  Then it needs to go through a series of steps where it asks the CTU to relay commands to the GPS for it, with a command sequence something like the below (or, maybe it would be easier to just do this stuff in the CTU firmware?  not sure...).  Actually, I'm not really even sure what a reliable GPS startup sequence would be; I'll probably have to fiddle around a bit till we get it right.  In the worst case, during this process, we might have to revert to the default configuration, which means changing baud rates.  We could do that, but it means we need support for this in the SOPC system & the firmware.
      • Mute NMEA output temporarily.
      • Turn off POSHOLD and/or TRAIM modes (may make acquisition easier).
      • Hot-restart the GPS.  Tell it our estimate of the current time (from NTP) & location.
      • Watch the number of satellites; wait to get enough for a fix (at least one, yo?).
      • Go back into POSHOLD and/or TRAIM modes.
    • When the server has received the FEDM_READY message from the FEDM, and the GPS has acquired satellites and TRAIM is reporting an accuracy value, we are finally in a position to start capturing absolute-time-tagged pulse data.  To start that, we do the following:
      • Server sends a message to the FEDM telling it to start running its time-sync capture datapath (TS_GO).  At this point, the pulse-capture datapath should still be disabled, since we haven't actually received any time pulses yet.
      • Server sends a message to the CTU telling it to start its time counter; after the first PPS edge is received, the counter gets reset, and the CTU starts sending time-sync pulses to the FEDM every 409.6 us.  The FEDM is ready, and starts registering them.  Perhaps it would be a good idea, once every second (2,441.40625 time-sync rising-edges) or so, to have the FEDM report the current TSDP_status (# of sync pulses, index of pulse) to the server; this would facilitate calibration and measurement of frequency stabilities.  For this, I need my output buffer working...
      • Anyway, once the FEDM is receiving time-sync pulses, it reports this to the server.  The server then says, "OK, you may now begin collecting data."  ('GO' command).  Or maybe the FEDM just does this by itself automatically after starting to receive time-sync pulses.
      • At this point, both CTU & FEDM are sending their normal data streams to the server.  From there, it's just server-side coding...
