The Cosmic Inquirer

Tuesday, May 17, 2011

Rip out its Guts and Start Over!

My plan for today is to start developing the new skeleton Quartus project for the FEDM board. Actually, instead of entirely ripping out the guts of the current design, I am thinking of initially just segregating it into a sub-module, so that we can quickly refer back to parts of it if/when that is needed.

First, though, I wrote a recommendation letter for David Grosby, which he needs to give to payroll for them to file along with his other employment paperwork for his summer appointment. (I'll wait to do the letter for the other intern, Darryl, until he confirms that he is joining us for sure.)

Earlier today I also found an email from one more prospective intern, Michael Sprouse, but unfortunately I had to tell him that we already had made the offers. However, I invited him to still volunteer to help out if he wished.

Darryl still needs to get with Dr. O'Neal to talk about his appointment - he missed him yesterday.

I stuffed Sachin's design into a submodule so we can still access it as needed in our design, while removing all the clutter from the top-level schematic.

I then verified that we can re-load the original design onto the FEDM board and it still works correctly. (Ditto for my new version where I put it in a sub-module, although I only tested the threshold-setting VI, not the high-speed data communication one.)

We then configured Quartus on the Acer XP partition to use the license server on COSMICi (Mike's Dell), so the students could start working.

Then I showed the students how to put together a basic Nios system design in SOPC Builder. I started with just a NiosII/f, 64K on-chip memory (128K wouldn't fit on the FPGA with Sachin's stuff still on there, although to be fair I don't know if he's actually using the on-chip memory), and regular+JTAG UARTs. (The goal here is to do a quick test of our serial communication capability (which Ray wanted to see) before we start doing more complicated stuff.)

We created the skeleton firmware development project for the Nios II IDE, based on a "hello world" template. Still need to insert code to open UART_0 and print to it.

I identified the pins needed for the serial port and created the pin assignments for them and wired them to the SOPC system symbol in the top-level schematic. Next, need to test this system within the Nios II IDE, and see if it prints "Hello World" to the console as expected.

After that, add code to print some text to the extra serial port, and view it in UwTerminal or something.

Monday, May 16, 2011

A New Week, New Students

This week, my goal is to get the new summer students (Tyler, David, and maybe Darryl) up and running with the summer development plan.

Tyler and David met me at 1 pm and I talked with them about the project. David is coming back at 3:30 pm to meet with Dr. O'Neal about his summer appointment paperwork.

Tyler is installing Quartus 9.1 on the Acer XP partition. Mike is going to email Tyler and David the key Quartus files from his present design, so they can begin studying them. OK, did that.

Next, Mike is going to create a Dropbox folder for working on the new project, and share it with Tyler and David. I have started on that. Then Mike needs to create the Quartus project framework for everyone to work in.

Darryl came by late in the afternoon and Mike spoke with him too. He still needs to meet with Dr. O'Neal.

Mike also needs to write David a letter of recommendation to feed to the payroll bureaucracy.

Blogger Blues

Blogger.com was down late last week, so I was unable to post these notes and sent them to Ray and Tyler instead. Blogger is back up, so I am posting them now!

Following is a list of major steps that need to be taken, including some new gelware components that need to be developed, and other related action items, based on the approach that we're basically going to throw away all of the programming work (Quartus gelware, LabView code) that Sachin has done and start over (instead of reverse-engineering everything):

1. Port my dual-edge-triggered carry-save counter over to the FEDM board, and experimentally determine its max frequency on the Stratix II using PLL clock drivers.
2. Instantiate a Nios system for the Stratix II, for use in firmware development.
3. Create a simple serial interface module (probably just a PIO device) to allow programming the DAC voltages in firmware, write that firmware, and test it.
4. Create input-capture circuits for measuring start/stop times of threshold crossings, with CPU interface (probably just a PIO again), and test them.
5. Develop firmware that puts together all of this data for each PMT pulse received (including the absolute time information obtained from the sync pulses), and sends it over the serial port to the server (this can use our existing EZURiO Wi-Fi modules).
6. Develop server-side code (in Python) for data analysis and visualization.

Items #1 and #2+3 are pretty independent of each other, so Tyler and/or the other student(s) can potentially help with (or be primarily responsible for the development of) one or both of these. Other than this, later items depend on earlier ones before they can be fully tested, but potentially the student(s) can help with these steps as well.

Overall, I think our new design will be much simpler than Sachin's, apart from its not being directly usable from LabView. But, we can always still develop a LabView interface for it if we want to (since we'll understand how it works, this should be easy to do).

Wednesday, May 11, 2011

Savior of the Carries

Today I tried clocking my recursive pseudo-dual-edged carry-save counter with the 600 MHz clock from the PLL that I configured yesterday. No dice! (Of course, the single-edged version of the counter works just fine.)

I tried some tweaks: Inserting a CLKCTRL unit after the PLL, and inserting a CARRY_SAVE primitive on the carry/save outputs of each half-adder cell, to tell Quartus to use dedicated carry-chain resources. These seemed to help a little, but still no dice. Of course, after turning the PLL output frequency down to 300 MHz, it worked fine - that is, up to bit 4 of the counter; but then I still had trouble with bit 8!

After some more fiddling, I got up to 400 MHz (dual-edge). Weird, I'm finding that things work better if I take out my manual clock buffering. That means I don't even really need the recursive register design (with the clock buffer tree) any more. But, then I tried an array design and now it doesn't work again! Argh. Everything is so sensitive to seemingly irrelevant changes. Who knows, maybe the recursive design got fitted in a way that reduced local clock skew... Now I only seem able to get up to 350 MHz, even in the design that I thought got to 400 before...

OK, I got back up to 400 MHz now, after taking out KEEP attributes from the PDE_DFF. Let's try 500 MHz... OK, that works. Now 600 MHz (where we started out): OK, there it breaks down again. Let's try 550: That works.

Monday, May 9, 2011

Phasers locked, captain...

What to do this week?

At some point, I should probably go back to working on the journal article. I was thinking for a while that if I did some analysis of the ring oscillator as a short-time-slice TDC element, that could be usefully integrated into the article. That may still be true, but it is taking a while.

Let's consider, for a moment, what the next steps would be along that path.

1. Create an input-capture circuit (with appropriate synchronizer stages) to register the OCXO edges against the ring-oscillator half-cycles. At 1.7 ns half-cycle for the RO, we would expect to see about 29.4 RO half-cycles per OCXO half-cycle. 6 bits (unsigned values 0-63) would be adequate to encode these deltas. Or, if there isn't too much variance in the RO period, even just 4 bits (signed values -8 to +7) would be more than adequate to encode the discrepancies in the deltas relative to some "expected" value (say 30). So that's 8 bits per 100 ns period. So it would take about 26.8 seconds to entirely fill up a 256 MB block of DDR SDRAM with 2^29 = 536,870,912 of those 4-bit samples.

2. Add a command to the firmware to (at a desired time) initiate one of these half-minute data-collection runs, and then stream the data to the server for processing. The data collection routine itself will probably have to execute in a custom state machine: since the DE3 board has only a 50 MHz built-in clock, we will generate a new data point once every 2.5 CPU clock cycles (50 ns), and this is almost certainly not enough cycles for a software loop (whether polling or interrupt-based) to pull the data from a PIO register and then write it to SDRAM using the type of HLL call used in the demo code. Therefore, this gets tricky because we have to replicate what such a call is doing in our own custom state machine. In other words, we need to create our own host device for the Avalon bus fabric.
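
As a quick sanity check on the numbers in steps 1 and 2, here is a small Python sketch of the capture budget. The constant names are just illustrative; the values are the ones quoted above (a 100 ns OCXO period, 1.7 ns RO half-cycle, 4-bit samples, 256 MB of SDRAM, 50 MHz CPU clock), and 256 MB is taken to mean 2^28 bytes.

```python
# Capture-budget arithmetic for the ring-oscillator measurement run.
# All figures come from the notes above; the names are illustrative only.
OCXO_HZ = 10e6                 # 100 ns OCXO period
RO_HALF_CYCLE_S = 1.7e-9       # measured ring-oscillator half-cycle
SDRAM_BYTES = 256 * 2**20      # 256 MB block, assumed to mean 2^28 bytes
BITS_PER_SAMPLE = 4            # signed delta relative to an expected count
CPU_HZ = 50e6                  # DE3 built-in clock

# One sample is taken per OCXO half-cycle (every 50 ns).
sample_interval_s = 0.5 / OCXO_HZ
ro_half_cycles_per_sample = sample_interval_s / RO_HALF_CYCLE_S

samples = SDRAM_BYTES * 8 // BITS_PER_SAMPLE
fill_time_s = samples * sample_interval_s
cpu_cycles_per_sample = sample_interval_s * CPU_HZ

print(f"RO half-cycles per sample: {ro_half_cycles_per_sample:.1f}")
print(f"samples in SDRAM:          {samples}")
print(f"time to fill SDRAM:        {fill_time_s:.1f} s")
print(f"CPU cycles per sample:     {cpu_cycles_per_sample:.1f}")
```

With only 2.5 CPU cycles available per sample, the software-loop approach is ruled out by arithmetic alone, whatever the details of the HLL call.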

Also, at the 57,600 baud rate we're using for the serial comm. link to the EZURiO board, and with a minimum of 10 bit-periods per byte (8 data, start, stop), the data rate for the data upload to the server is at best 5,760 bytes per second, so a 256 MB data transfer will take ~46,603 secs. = 776.7 min. = 12.95 hr. = basically one overnight. To avoid this bottleneck, we should perhaps consider interfacing to an Ethernet card and thereby sending the data directly to the server in real time. Unfortunately, there isn't an Ethernet port already built into the DE3, so we would have to add a daughter card, like this one: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=71&No=355. If we added a little Wi-Fi client-mode router (like this: http://www.dlink.com/products/?pid=346), that would re-establish wireless connectivity to the server. But still, we have to deal with all the complexity of interfacing to the network (using a whole TCP/IP stack and the like).
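
The serial-link estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: it assumes the best case of exactly 10 bit-periods per byte with no protocol overhead or flow-control stalls, and takes 256 MB to mean 2^28 bytes.

```python
# Best-case upload time over the 57,600 baud serial link.
BAUD = 57_600
BIT_PERIODS_PER_BYTE = 10          # 8 data bits + start + stop
PAYLOAD_BYTES = 256 * 2**20        # 256 MB, assumed to mean 2^28 bytes

bytes_per_second = BAUD / BIT_PERIODS_PER_BYTE
upload_s = PAYLOAD_BYTES / bytes_per_second

print(f"throughput: {bytes_per_second:.0f} bytes/s")
print(f"upload:     {upload_s:.0f} s = {upload_s / 60:.1f} min = {upload_s / 3600:.2f} hr")
```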

Or, the other option is to forget about offloading the data to the server, and instead just do the desired data analysis directly in the embedded firmware. This should be pretty straightforward, and makes a lot more sense. It shouldn't take long. Then all we have to transmit to the server is, say, the Allan deviation results for (say) about a thousand points on a logarithmic time scale, ranging from 1 to 512M half-cycles (basically 9 orders of magnitude).
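
To make the analysis step concrete, here is a minimal Python sketch of an Allan deviation calculation over the captured per-half-cycle counts. It uses the simple non-overlapping estimator and octave-spaced averaging factors; both are assumptions made for brevity (the thousand-point logarithmic scale mentioned above would need a denser spacing), and the function names are hypothetical. The embedded firmware version would of course be written in C rather than Python.

```python
import math

def allan_deviation(y, m):
    """Non-overlapping Allan deviation of samples y at averaging factor m.

    y is a sequence of equally spaced readings (e.g. RO half-cycles counted
    per OCXO half-cycle); m is the number of readings averaged per block.
    """
    n_blocks = len(y) // m
    if n_blocks < 2:
        raise ValueError("need at least two blocks of m samples")
    means = [sum(y[k * m:(k + 1) * m]) / m for k in range(n_blocks)]
    sum_sq = sum((b - a) ** 2 for a, b in zip(means, means[1:]))
    return math.sqrt(sum_sq / (2 * (n_blocks - 1)))

def octave_factors(n_samples):
    """Averaging factors 1, 2, 4, ... that still leave at least two blocks."""
    factors = []
    m = 1
    while n_samples // m >= 2:
        factors.append(m)
        m *= 2
    return factors

# Toy data: counts that alternate between 29 and 31 around an expected 30.
counts = [29, 31] * 8
for m in octave_factors(len(counts)):
    print(m, allan_deviation(counts, m))
```

Only the (averaging factor, deviation) pairs would then need to go over the serial link, rather than the raw samples.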

Finally, if we decide to actually use the ring oscillators for timing of individual sensor events on the FEDM board, then we may want to think about doing some calibration & measurement of their frequency variations on the fly in the sensor application.

One other thing to think about: Using the phase-locked-loop (PLL) module included in the Stratix II to create faster clocks that are synced-up with the board clock. We have an EP2S30 class FPGA which has "fast PLLs" 1-4 and "enhanced PLLs" 5-6. The "enhanced" PLLs support clock frequency multiplication up to 512x, and the "fast" PLLs support up to 32x.

This raises the possibility that we could use the PLLs to sync up the FEDM's clocks with the 409.6-us sync pulses from the CTU, in a simpler manner than by constantly registering all these multiple clocks against each other. A simple circuit using the built-in 10 MHz clock could convert the 409.6us-period, 100ns-pulse-width pulse from the CTU into an approximately 50% duty-cycle clock (with a precisely-timed rising edge) suitable for feeding into a PLL. After going through one "enhanced" stage with a 512x multiplier, this gives us a 0.8-microsecond (1.25 MHz) clock slaved to the CTU. After a 2nd 512x "enhanced" stage, we have a 1.5625ns (640 MHz) clock, so a third "fast" stage isn't even needed. Let's see if that's too fast for the FPGA.

OK, the EP2S30 is at a "-3" speed grade, and the minimum clock high and low times are 612 ps each (table 5-37 from Stratix III datasheet). This implies a minimum clock period of 1.224 ns, or a maximum frequency of 816 MHz, so 640 MHz is within that limit. The period would be 1.5625 ns and the half-cycle about 0.78 ns. If we can get away with using that to drive a PDEDFF-based carry-save counter, then that would give us less than 1 ns time resolution on the input capture circuit that finds the level-crossing times in terms of the half-cycles of this fast clock. As long as the PLLs are doing their job, these times should be precisely defined relative to the master clock that comes from the CTU sync pulses.

Oh, actually, it's not going to be quite that good... The maximum PLL output frequency for the Stratix II is only 550 MHz. Still, twice that is still over 1 GHz. The Stratix III goes up to 600 MHz.

There may be an issue with the minimum frequency of the PLL. The minimum input clock frequency is 2 MHz. So, we cannot go directly from the CTU's 409.6-us sync pulses. However, we could base it off the 50 MHz TCXO board clock instead... If we multiply this by 11x, we get the max PLL output frequency of 550 MHz. The period is 1.82 ns and the half-period is 0.91 ns. Still a little better than 1 ns. And we can measure the sync pulse arrival time in units of that, and the cosmic ray shower pulse arrival time in units of that, and thereby get in the neighborhood of the desired accuracy.
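
The PLL constraints above boil down to a short calculation, sketched here in Python. The two limits are the figures quoted in these notes (2 MHz minimum input, 550 MHz maximum output), and the helper name is hypothetical. This only checks frequency limits; it says nothing about which multiplier/divider settings the ALTPLL megafunction will actually accept.

```python
# PLL limits as quoted in the notes for the Stratix II.
PLL_MIN_INPUT_HZ = 2e6
PLL_MAX_OUTPUT_HZ = 550e6

def largest_multiplier(f_in_hz):
    """Largest integer multiplier keeping the output at or below the limit.

    Returns None when the input clock is too slow to feed the PLL at all.
    """
    if f_in_hz < PLL_MIN_INPUT_HZ:
        return None
    return int(PLL_MAX_OUTPUT_HZ // f_in_hz)

sync_pulse_hz = 1 / 409.6e-6       # CTU sync pulses, about 2.44 kHz
board_clock_hz = 50e6              # TCXO board clock

print(largest_multiplier(sync_pulse_hz))    # None: below the 2 MHz minimum
mult = largest_multiplier(board_clock_hz)
f_out_hz = mult * board_clock_hz
print(mult, f_out_hz / 1e6, "MHz")
print(f"half-period: {0.5 / f_out_hz * 1e9:.3f} ns")
```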

OK, I instantiated an ALTPLL Megafunction variation for an 11x clock multiplier, and used it to generate a 550 MHz (later 600 MHz) clock from the 50 MHz board clock. That worked just fine, although as before, the waveform at that speed looked pretty rounded (sine-wave) - although again that may be just due to the board/probe cable. On-chip the signal may look better. The acid test will be to use this signal to drive the carry-save counter.

Phase-locked loop test on DE3 board. Top: 50 MHz board clock (digital trace).
Bottom: 12x (600 MHz) output from PLL (analog & digital traces superimposed).

Talked to Ray for a while about the strategic issue of whether to proceed with trying to reverse-engineer Sachin's stuff well enough that we figure out how to add more TDCs as needed for our absolute time measurements, or instead just redo the design by just counting cycles (or half-cycles) of a single fast oscillator (like the 500 MHz one I just made with the PLL). Really, it comes down to the question of whether we really need any better than 1 ns resolution on the pulse width. Ray is going to look at the science (e.g., difference between shower front development in neutrino vs. hadron initiated shower) and give me an answer on that. However, his feeling is that pulse width differences below 1 ns probably aren't going to matter. In which case, we should probably proceed by just redoing the gelware with our own design. We can rip out much of what Sachin has done and redo it. We still need the ability to program the DACs, but everything else can be re-done from scratch in our own way. We can design a little input-capture circuit to get the rise/fall times of each pulse, and just replicate it for each of the threshold comparators (LVDS inputs). Then we can have firmware transfer the data to the PC however we want.

On tap for tomorrow: (1) Make sure that I can actually drive my pseudo-dual-edge triggered carry-save counter using this 600 MHz clock (i.e., @1.2G counts per second). (2) Design input-capture circuit around that counter. (3) Use it to capture rise/fall times of an input pulse.

Friday, May 6, 2011

Ashes to Ashes

There's nothing in the Stratix III datasheet about ring oscillators, but the clock tree (of the C2) is supposed to handle speeds up to 730 MHz.

Spent a little time reading a thread on the Altera forum about ring oscillators. The posters recommend using an LCELL primitive and the assignment editor to control placement. Routing variation is still an issue, but they say that if all cells are in the same logic array block, and the RO is driving a register in that block, then the routing should stay consistent.

So, what I'm thinking now is that a pseudo dual-edged FF could be implemented in the same block as the ring oscillator, and configured in a T flip-flop configuration, so that its output would have the same frequency as the ring oscillator. Then the output of this PDEFF register could be sent to the destination logic (such as the carry-save counter) to hopefully well isolate the placement/routing within the RO block from that of the destination logic.

Another idea: Merge the ring oscillator with the PDEDFF, by using a slightly delayed version of the TFF output as the TFF clock. The advantage here is that we make sure that the ring oscillator does not run any faster than the PDEDFF can handle.

OK, I drew that circuit and simplified it. Basically it is just two T flip-flops (rising-edge triggered and falling-edge triggered) with their outputs XOR'ed together, and the output of that XOR (delayed slightly by a buffer, say) is used as the clock of both flip-flops.

I'm going to try building that now, as a schematic. I found the Technology Map Viewer is helpful to see exactly how the design compiles into cells. Here's what I came up with, after realizing I needed LCELL on both the rising and falling edge triggered T flip-flops (originally I didn't have an LCELL after the NOT):

Unfortunately, the Technology Map viewer reveals that Quartus is reorganizing the logic in some unexpected way. To get more control, I am redoing the design in VHDL as follows:

library ieee;
use ieee.std_logic_1164.all;
use work.rtl_attributes.all; -- Borrowed from the IEEE 1076.6 (2004) spec. Needed for KEEP attribute.

entity pde_tff_ro2 is
    port ( clk_out : out std_logic );
end entity pde_tff_ro2;

architecture impl of pde_tff_ro2 is

    signal int_clk    : std_logic; -- Internal clock signal.
    signal int_clk_d1 : std_logic; -- Internal clock signal, delayed by 1 LUT propagation delay.

    signal rq,fq : std_logic; -- Rising- and falling-edge TFF outputs.

    -- Prevent certain key signals from being optimized away.

    attribute KEEP of int_clk    : signal is True;
    attribute KEEP of int_clk_d1 : signal is True;
    attribute KEEP of clk_out    : signal is True;

begin

    int_clk    <= rq xor fq; -- Exclusive OR of rising & falling edge TFF outputs.
    int_clk_d1 <= int_clk;   -- Hoping this inserts an extra LCELL due to KEEP attribute.
    clk_out    <= int_clk;   -- Output gets another copy of the internal clock.

    -- Rising-edge-triggered toggle flip-flop.

    re_tff: process is begin
        wait until rising_edge(int_clk_d1);
        rq <= not rq;
    end process;

    -- Falling-edge-triggered toggle flip-flop.

    fe_tff: process is begin
        wait until falling_edge(int_clk_d1);
        fq <= not fq;
    end process;

end architecture impl;

OK, that seems to give me the design I want:

This image is from the Technology Map Viewer. The double-boxed elements are LUTs and the registers are individual flip-flops in the LABs.

I need to look at the design in the Chip Planner as well, to make sure all the LUTs are being placed in the same LAB. They seem to be, except that post-fitting it looks like the internal clock is being routed through a CLKCTRL module...

This could be good or bad. It is good in that it reduces skew (uses low-skew interconnect resources), but bad in that it can increase delay to get down there and back. So it may reduce the RO frequency.

I should perhaps use an ALTCLKCTRL megafunction to do regional clock generation from the main output of this module... Anyway, let's worry about that later.

DERP, I just realized that without an external kick, this clock will never get started because it will never generate its own edges. So, I need to design a start-up circuit for it.
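
The start-up problem is easy to see in a toy model. The Python sketch below is an idealized, zero-glitch abstraction of the VHDL above, not a timing simulation: it treats the delayed clock as simply following rq XOR fq once a gate is opened, and counts how many clock edges result from a given flip-flop state. The function name, the gate, and the assumption that both flip-flops power up at 0 are mine.

```python
def count_edges(rq, fq, max_edges=20):
    """Open the gate and count clock edges until the loop settles.

    rq, fq: initial outputs (0 or 1) of the rising- and falling-edge TFFs.
    Returns the number of edges seen, capped at max_edges.
    """
    d1 = 0                      # delayed clock is held low while the gate is closed
    edges = 0
    while edges < max_edges:
        target = rq ^ fq        # int_clk with the gate open
        if target == d1:
            break               # no transition, so neither flip-flop is clocked
        if target == 1:
            rq ^= 1             # rising edge clocks the rising-edge TFF
        else:
            fq ^= 1             # falling edge clocks the falling-edge TFF
        d1 = target
        edges += 1
    return edges

print(count_edges(0, 0))   # 0: from the all-zero state nothing ever happens
print(count_edges(1, 0))   # 20: once an edge exists, the loop keeps itself going
```

In this model, the loop only runs if the flip-flops start out in different states, which is why something has to force that first edge.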

OK, I tried to fix that problem by just gating the int_clk with an AND gate controlled by a slider switch, but still no dice.

It occurs to me that, even with the kicker to get it started, this design is perhaps fatally delicate, in that if it ever settles down, it will never spontaneously start going again.

I tried various ways to fix the problem, but no dice. I think I'm going to back off from this whole register-based oscillator idea, and just revert to doing an ordinary ring oscillator.

Before I left, I did that, and it worked (with the PDEDFF-based carry-save counter); the half-cycle period is about 1.7 ns.

Wednesday, May 4, 2011

Ring Around the Rosie

Today, set up to do a direct test of the ring oscillator frequency/period. That seems to work fine. Here's the scope trace showing a ~1V logic swing "very rounded square wave" output measured via a 75ohm cable (I couldn't find any 50ohm ones lying around). The period is about 1.7ns, frequency about 580 MHz. Possibly it's smeared out in part because the board trace to the CLK_OUT connector isn't rated for signals at that high a frequency - there might not be enough impedance control. Or maybe the problem is the cable.

Let's try a 5-stage ring oscillator, and see if the longer period means the wave shape and amplitude will fare better. Ideally, the period should be 67% longer (5/3x). It turns out to be 2.5 ns, which is only about 50% longer, interesting, though possibly the discrepancy is due to measurement error. Oh and the amplitude is larger, as expected: 1.35V. The wave shape is different, too, and in an interesting way:

The lows are flatter than the highs, which makes sense since nFETs are faster than pFETs (since electron mobility is higher than hole mobility). Frequency is now 395 MHz, and meanwhile the period of bit 20 of the counter is 5.3 ms, which is consistent with the 2.5 ns for the ring oscillator period. I don't think the counter was working at all with the 3-stage ring oscillator on the last test, but I should probably check again to make sure. Nope, it wasn't. OK, now let's try 7-stage:

At 7 stages, the period is 3.42 ns (still only about 2x that of the 3-stage RO, instead of the ideal 7/3 = 2.33x) and frequency is 293MHz. Interesting. Amplitude is now about 1.6V (peak-to-peak). Period of bit 20 is 7.16 ms. So, if I did dual-edge-triggered registers on that signal, it would be about the same period as single-edge triggered with the original 1.7ns 3-stage period, and it would probably work more reliably due to the larger signal swing and more flat-topped waveform. So, that is probably in fact the best approach. However, I'm still baffled by why I was able to get *faster* ring oscillators on the DE2 board than on the DE3. I probably need to do some more experiments on the DE2 at home. And study the datasheets for both devices some more.
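
The cross-check between the scope readings and the counter can be written out in Python. This sketch assumes bit 0 of the counter toggles once per oscillator period, so that bit n has a period of 2^(n+1) oscillator periods; that convention is an assumption, but it agrees with the bit-20 measurements above. The function name is illustrative.

```python
def counter_bit_period_s(clock_period_s, bit):
    """Period of counter bit `bit`, assuming bit 0 toggles every clock period."""
    return clock_period_s * 2 ** (bit + 1)

# Measured ring-oscillator periods from the scope, by number of stages.
measured_period_s = {3: 1.7e-9, 5: 2.5e-9, 7: 3.42e-9}

for stages in (5, 7):
    t_bit20 = counter_bit_period_s(measured_period_s[stages], 20)
    print(f"{stages}-stage: bit 20 period = {t_bit20 * 1e3:.2f} ms")

# Period scaling relative to the 3-stage oscillator: ideal vs. measured.
for stages in (5, 7):
    ideal = stages / 3
    actual = measured_period_s[stages] / measured_period_s[3]
    print(f"{stages}-stage: ideal {ideal:.2f}x, measured {actual:.2f}x")
```

The computed bit-20 periods (about 5.24 ms and 7.17 ms) sit within about 1% of the 5.3 ms and 7.16 ms read off the scope.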