Saturday, February 25, 2012

Thinking about trying to get into the lab today to get some work done. Something to try:

* See if the max temperature for the slow-corner timing analysis can be adjusted downwards. Right now the default max temperature is 85 C, and the fmax of the slow corner for my mockup of the high-speed components comes out a bit under 500 MHz (487, specifically). But if we can set the max temperature lower (say, 25 C), the fmax reported by TimeQuest might come out above 500 MHz, which would tell us that if we want to come up to that speed, we need to come down to that temperature. This seems likely, since the fast corner, at 0 C, actually meets the 500 MHz timing constraints (it's not highlighted in red).

Arrived at the lab about 2:45 pm to find the building unlocked and the parking lot full of cars.  Apparently they stopped locking this building on Saturdays at some point since the last time I tried to get in on a weekend.  That's good news, because it gives us more days to get work done before time runs out.

Looking at the Quartus settings to see if there's any way to adjust the temperature of the hot corner.  Didn't find it yet, but I did find the "Fitter Settings -> Optimize multi-corner timing" option, which optimizes the design for more than just the slow corner.  However, this doesn't by itself turn on multi-corner timing analysis; that requires some setting for the timing analyzer, which I haven't figured out yet.

Aha, under "Operating Settings and Conditions -> Temperature -> Junction temperature range", we can change the High temperature from 85 C to some other value.  Let's try 25 C.  Oops, it doesn't actually let me change it!
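For reference, the same junction-temperature limits normally appear as global assignments in the project's .qsf file. A sketch of what the hand-edited assignments would look like (assignment names are from memory, and whether the fitter honors a hand-edited value here any better than the greyed-out GUI field is untested):

```tcl
# In COSMICi_FEDM.qsf -- junction-temperature range used for the
# timing corners.  (Assignment names recalled from memory; verify
# against the Quartus handbook before relying on this.)
set_global_assignment -name MIN_CORE_JUNCTION_TEMP 0
set_global_assignment -name MAX_CORE_JUNCTION_TEMP 25
```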

Interestingly, there is an option under PowerPlay power analysis to auto-compute the junction temperature using a cooling solution.  For this, you specify an ambient temperature, the length of the heat sink, and the airflow, in (I believe) linear feet per minute (LFPM).  We could try doing this using the parameters of our actual cooling-system design.
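If we go that route, the cooling-solution choice also lives in the .qsf. A hypothetical example (the preset strings are recalled from the PowerPlay settings page, not taken from our actual cooling design):

```tcl
# Hypothetical PowerPlay cooling-solution settings -- preset strings
# recalled from the settings page, not from our actual design.
set_global_assignment -name POWER_PRESET_COOLING_SOLUTION "23 MM HEAT SINK WITH 200 LFPM AIRFLOW"
set_global_assignment -name POWER_BOARD_THERMAL_MODEL "NONE (CONSERVATIVE)"
```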

Doing a test compile.  The only changes were that I turned on "Optimize multi-corner timing" and enabled some output from the PowerPlay power analyzer.  I don't think the compile actually ran the power analyzer, though, so I ran it manually just to see what the computed temperature was with no heat sink.

Still getting 487.09 MHz.

In the TimeQuest timing analyzer, found a "Report Bottlenecks" option that identifies the worst-case nodes.  One of them is the "wfall_del" node in se_reg_en_56.vhd.  Well, what's going on there?  This is being used as the "enable" signal for two different 56-bit registers; that is, it fans out to 112 different flip-flops.  So it's perhaps unsurprising that there is a setup-time bottleneck at this node.

One way to fix this would be to add a pipeline register for the fanout of the enable signal.  Trying that.  (Created & used new module se_reg_en_56_pip.vhd, which creates a duplicate enable signal every 8 bits.)
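The idea behind the new module is roughly the following (an illustrative sketch, not the actual contents of se_reg_en_56_pip.vhd; the entity and signal names here are made up):

```vhdl
-- Sketch of the enable-fanout pipelining idea: register the enable
-- once into 7 duplicate copies (one per byte of the 56-bit register),
-- so each copy drives only 8 flip-flops instead of all 56.
-- Illustrative only; not the actual se_reg_en_56_pip.vhd.
library ieee; use ieee.std_logic_1164.all;

entity en_fanout_pip_sketch is
  port ( clk : in  std_logic;
         en  : in  std_logic;
         d   : in  std_logic_vector(55 downto 0);
         q   : out std_logic_vector(55 downto 0) );
end entity;

architecture rtl of en_fanout_pip_sketch is
  signal en_pip : std_logic_vector(6 downto 0);  -- one enable copy per byte
begin
  -- Pipeline stage: 7 duplicate registered copies of the enable.
  process (clk) begin
    if rising_edge(clk) then
      en_pip <= (others => en);
    end if;
  end process;

  -- Each byte of the register is gated by its own enable copy.
  gen_bytes : for i in 0 to 6 generate
    process (clk) begin
      if rising_edge(clk) then
        if en_pip(i) = '1' then
          q(8*i+7 downto 8*i) <= d(8*i+7 downto 8*i);
        end if;
      end if;
    end process;
  end generate;
end architecture;
```

One caveat: Quartus may merge the identical duplicate enable registers back together during synthesis unless merging is disabled for them (e.g. with a preserve/dont_merge attribute), which would defeat the purpose.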

Yep, now that node is no longer a bottleneck.  But fmax is even worse now!  (464.04 MHz)  Perhaps because the extra flip-flops spread out the design some more?

Anyway, we could go through the design playing whack-a-mole, adding pipeline registers each time we find a bottleneck.  But this is pretty time-consuming, and there's no guarantee that, even if we do this, we will get back over 500 MHz at the slow corner.

Alternatively, if the cooling system does its job, then the present design mockup (even without the enable-pipeline) already meets the 500 MHz timing constraint at the fast corner, or at sufficiently low temperatures (we don't know how low yet).

To clarify, the present design I'm working with is located at:

  • Machine:        COSMICi
  • Folder:         C:\LOCAL\Quartus_projects\q9v1sp2\COSMICi_FEDM
  • Project:        COSMICi_FEDM.qpf
  • Revision:       COSMICi_FEDM_RevA
  • Top-level file: HSMockup_LogicLock_test.bdf

Next bottleneck:  In se_pulse_cap_56 again, at the rise_s_reg output.  Not sure why this is a bottleneck, though, since it wasn't supposed to feed back into the high-speed logic.  Hm; looking at this node in the chip planner and in the technology map, it does seem to have feedback.  Maybe I should be using the Altera DFF primitives instead of trying to roll my own DFFs in VHDL - the synthesized logic looks like it might not be using the hardware enables properly, and may be using feedback paths instead.

Modifying se_dff_en to directly use the Altera DFFE primitive.
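The change amounts to something like this (a sketch with a made-up entity name; the DFFE port names are as I recall them from the altera_primitives_components package and should be double-checked):

```vhdl
-- Sketch: a DFF-with-enable wrapper instantiating the Altera DFFE
-- primitive directly, instead of inferring the flip-flop from
-- behavioral code.  (Port names per altera_primitives_components,
-- from memory; verify before use.)
library ieee; use ieee.std_logic_1164.all;
library altera; use altera.altera_primitives_components.all;

entity se_dff_en_sketch is
  port ( clk, en, d : in  std_logic;
         q          : out std_logic );
end entity;

architecture struct of se_dff_en_sketch is
begin
  prim_dff_inst : dffe
    port map ( d    => d,
               clk  => clk,
               ena  => en,
               clrn => '1',   -- async clear tied off (inactive)
               prn  => '1',   -- async preset tied off (inactive)
               q    => q );
end architecture;
```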

Now the speed is even lower: 446.03 MHz!!!  However, at least that node is no longer a bottleneck.

Next bottleneck:  In the enable-pipeline register (en_preg_inst) in se_reg_en_56_pip.  OK, that should now be fanning out to the real enable inputs of 8 real hardware DFFs.  Hm...  No, maybe that's not true.  Looking at the bottleneck node in the chip planner indicates it isn't: EN isn't used, and there is a feedback path from the output of this register back to the logic that computes its input - logic that wouldn't need to be present if the real low-level enable signal were used.

I thought the DFF at lower-right would have used its enable input, but apparently it doesn't.
Let's also look in the Technology Map.  The node name we're looking for is inst18|inst10|inst2|fall_c_reg|\byte_arr:6:en_preg_inst|q|adatasdata.  Ah, that one uses an se_dff, which I hadn't yet changed to use the primitive.  Fixed that.

OK, now the node name of the first bottleneck is:

inst18|inst10|inst2|fall_c_reg|\byte_arr:6:en_preg_inst|prim_dff_inst|adatasdata

Here's a question:  Is the bottleneck in *computing* this value (the input to the register) or in *using* this value (the output from the register)?  I've been assuming the latter, but maybe it's actually the former.  In that case, perhaps the problem is the complicated logic computing wfall, or something like it, which then fans out to 7 pipeline registers (actually 14, counting the bytes of both the sum and carry parts).  OK, then the solution is to use the delayed version of wfall again, which effectively adds another pipeline stage.  This will perturb the timing of the pulse-cap module a little and needs to be repaired later (the handshake might get slightly out of sync with the actual change in the registers).  I fixed that by adding a delay in the hs_prod output.

447.03 MHz.  1st bottleneck now:  inst18|inst12|inst2|wrise_buf|prim_dff_inst~_Duplicate_8|adatasdata.  Let's add the enable fanout pipeline there too.

539.67 MHz!!!

Changing se_dff_re (used in the shift register) to also use the Altera DFF primitive made things worse again; still above 500 MHz, but just barely.

Let's therefore try changing se_dff, se_dff_en, and se_dff_re all back to behavioral code instead of the Altera DFF primitive, recompile, and see if that does better.  Hm, no, still just 509.68 MHz.  What if only se_dff_en uses the Altera primitive?  Now back up to 539.67 MHz.  Weird, but oh well.
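For comparison, the behavioral style that se_dff and se_dff_re went back to is essentially the standard inferred-enable pattern (a sketch with a made-up entity name; the real files may differ in detail):

```vhdl
-- Sketch of the behavioral DFF-with-enable style, for comparison with
-- the DFFE-primitive version.  The synthesizer should infer a flip-flop
-- whose hardware EN pin is driven by 'en'; if it instead builds a
-- q-to-d feedback mux, that shows up as the extra logic seen earlier
-- in the chip planner.
library ieee; use ieee.std_logic_1164.all;

entity se_dff_en_behav_sketch is
  port ( clk, en, d : in  std_logic;
         q          : out std_logic );
end entity;

architecture behav of se_dff_en_behav_sketch is
begin
  process (clk) begin
    if rising_edge(clk) then
      if en = '1' then
        q <= d;
      end if;
    end if;
  end process;
end architecture;
```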

OK, let me list which source files are new/changed as a result of today's work:
  • se_pulse_cap_56.vhd   (Changed to use the new _pip version of se_reg_en_56.)
  • se_reg_en_56_pip.vhd  (New file; pipelines the fanout of the enable signal.)
  • se_dff_en.vhd         (Changed to use the Altera DFFE primitive.)

Plus, of course, there's the new pipeline stage I added (two se_reg_re instances, for the sum and carry parts of the counter) in the counter fan-out at the front of each datapath.  This may or may not still be necessary.  Each counter bit currently fans out to 19 (3x6 + 1) places.  Not sure that's really a bottleneck, since the pulse_cap tests Darryl and I did the other day turned out to be bogus anyway.

Anyway, I think now, all we have to do is:
  1. Move the changes in these files back into the master project (Q:\) and the Dropbox.
  2. Try compiling the whole thing with those changes.  If it still fits and is now fast enough, then we're done (modulo testing) and we don't even need to use LogicLock after all.
  3. If it fits but isn't fast enough again, then we have to continue with the LogicLock work.  Make a version of the project that separates out the high-speed components from the others.  Drag the high-speed components into the root LogicLock region, and recompile.  That may be enough.  (If that doesn't work, do it again but without the low-speed components present, then add the low-speed components back in; if Incremental Compilation is turned on, then this should work.)
  4. If it doesn't even fit, due to the new pipeline registers, then we will have to figure out something to shrink to make room.  Fewer stages in the synchronizer chains for the pulse inputs?  Currently we have 8 stages, which may be overkill.

First, I'll try recompiling Q:\ just with the changes in se_pulse_cap_56.vhd and its submodules, without adding the pipeline stages for fanning out the counter values.

Oops, it didn't fit, but just barely: we needed only 3 more LABs than we had on the chip!  Well, let's first try shrinking those synchronizer chains from 8 stages down to 6.  Since there are 18 of them in the design, that could save us as many as 36 LABs (plus a couple in the timing-sync datapath, which, by the way, needs to match, or it will throw off our time calculations by 4 ns).  I suppose whether this will fix the fitting problem depends, however, on whether the LAB usage is limited by logic or by registers, since the synchronizer chains are register-only.
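Parameterizing the synchronizer depth would make the 8-to-6 experiment (and keeping the timing-sync path matched) a one-constant change. A sketch, with a hypothetical module name not from the actual project:

```vhdl
-- Sketch: a synchronizer chain with a generic stage count, so the
-- depth (8 today, maybe 6) is a single number changed in one place.
-- Hypothetical module, not from the actual project; STAGES must be >= 2.
library ieee; use ieee.std_logic_1164.all;

entity sync_chain_sketch is
  generic ( STAGES : positive := 6 );
  port ( clk : in  std_logic;
         a   : in  std_logic;    -- asynchronous input
         s   : out std_logic );  -- synchronized output
end entity;

architecture rtl of sync_chain_sketch is
  signal taps : std_logic_vector(STAGES-1 downto 0);
begin
  -- Shift the async input down a register-only chain of STAGES stages.
  process (clk) begin
    if rising_edge(clk) then
      taps <= taps(STAGES-2 downto 0) & a;
    end if;
  end process;
  s <= taps(STAGES-1);
end architecture;
```

The matching timing-sync datapath would use the same STAGES constant, so the two can't drift apart and throw off the time calculations.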

Another idea to save space is to get rid of the 6th input path in each pulse-capture datapath.  These paths are for the 6th DAC, but only 5 of the DACs on our board are working, and we're skipping over the broken one; we're always feeding a constant "OFF" signal in place of the last comparator output for each input channel.

OK, stubbed out the output of the 6th instance of pulse_prep_56 in pulseform_cap_56.  That should cause that instance itself, together with the logic that uses its output, to get compiled away.  This should result in a significant reduction in the resource usage of the pulseform-capture datapaths.

Compiling Q:\ again...  Taking forever...   Done.  No dice: only 271 MHz at the slow corner, and the timing constraint isn't met even at the fast corner.  Faster than before, though.  Next: I need to add the pipeline registers for the counter (so far I've only tried them in the mockup, and indeed, the bottlenecks report shows the counter output as a bottleneck), and the design probably also needs to be split into fast and slow components so we can try LogicLock again, the right way this time.  Enough for today, though.
