Reproduction micros

Peter Corlett abuse at cabal.org.uk
Mon Jul 25 16:36:57 CDT 2016


On Mon, Jul 25, 2016 at 01:46:43PM -0700, Guy Sotomayor Jr wrote:
>> On Jul 25, 2016, at 1:34 PM, Sean Conner <spc at conman.org> wrote:
>> It was thus said that the Great Peter Corlett once stated:
>>> Unsurprisingly, the x86 ISA is brain-damaged here, in that some
>>> instructions (e.g. "inc") only affect some bits in EFLAGS, which causes a
>>> partial register stall. The recommended "fix" is to avoid such
>>> instructions.

>>  I'm not following this.  On the x86, the INC instruction modifies the
>> following flags: O, S, Z, A and P.  So okay, I need to avoid INC to prevent
>> a partial register stall, therefore, I need to use ADD.  Let me check ...
>> hmm ... ADD modifies the following: O, S, Z, A, P and C.  So now I need to
>> avoid ADD as well?  I suppose I could use LEA but then there goes my bignum
>> addition routine ... 
>>  -spc (Or am I missing something?)

Yes, in that I was taking a potshot at x86's expense, and skipped the technical
details because contemporary x86 architecture is seriously off-topic. But since
I've now been asked...

> No, Peter is wrong. All of the modern x86 (at least the Intel CPUs) are OOO
> machines with large register files (192 comes to mind) that do register
> renaming to map the register(s) used by a particular instruction back into an
> “architectural” register (no copy is actually done). The flags register is
> also part of the register re-naming. The only stalls occur when one
> instruction needs the results from an instruction that hasn’t committed its
> results yet (i.e. the instruction is still in “flight”).

It is the *partial* update that's key. If you do an INC and then read EFLAGS or
execute an instruction such as JBE that needs C and some other flag(s), the
information has to be derived from *two* renamed registers. This typically
involves an extra micro-op in the instruction stream to do this fixup, although
the details will obviously vary by CPU model.
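To make the "two renamed registers" point concrete, here is a toy model in Python. It is a sketch, not a description of any real CPU: it only records which earlier instruction last wrote each arithmetic flag. The flag sets are the architectural ones (INC/DEC leave CF alone, ADD/SUB/CMP write all six arithmetic flags, JBE reads CF and ZF), but the function name flag_producers and the idea of counting "producers" are hypothetical illustrations. Real hardware may track flags in groups and handle the merge differently.

```python
# Toy model: which earlier instruction supplies each flag a jump reads?
ARITH_FLAGS = ("CF", "PF", "AF", "ZF", "SF", "OF")

# Flags written by each instruction (architectural behaviour).
# INC and DEC deliberately leave CF unchanged.
FLAGS_WRITTEN = {
    "add": set(ARITH_FLAGS),
    "sub": set(ARITH_FLAGS),
    "cmp": set(ARITH_FLAGS),
    "inc": set(ARITH_FLAGS) - {"CF"},
    "dec": set(ARITH_FLAGS) - {"CF"},
}

# Flags read by a few conditional jumps.
FLAGS_READ = {
    "jbe": {"CF", "ZF"},  # jump if CF=1 or ZF=1
    "jz": {"ZF"},
    "jc": {"CF"},
}


def flag_producers(program):
    """For each flag-reading instruction, return (index, op, producers),
    where producers is the set of instruction indices whose flag results
    it needs. None means 'written before this snippet'."""
    last_writer = {f: None for f in ARITH_FLAGS}
    result = []
    for i, op in enumerate(program):
        if op in FLAGS_READ:
            producers = {last_writer[f] for f in FLAGS_READ[op]}
            result.append((i, op, producers))
        # Record this instruction as the latest writer of its flags.
        for f in FLAGS_WRITTEN.get(op, ()):
            last_writer[f] = i
    return result


if __name__ == "__main__":
    for prog in (["add", "inc", "jbe"], ["add", "add", "jbe"]):
        for i, op, producers in flag_producers(prog):
            # More than one producer: the flag inputs come from different
            # instructions, so they have to be combined before the jump.
            verdict = "merge needed" if len(producers) > 1 else "single source"
            print(prog, op, verdict)
```

With ["add", "inc", "jbe"], the JBE takes CF from the ADD and ZF from the INC, which is the merge case. Replacing the INC with an ADD of 1 leaves a single producer. That is why ADD is the usual substitute even though it writes more flags: writing all of them means no stale flag has to be pulled in from an older instruction.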

But I'm only repeating this information from the experts, so if you still think
I'm wrong, read their reference material:

http://www.agner.org/optimize/microarchitecture.pdf is Agner Fog's optimisation
guide with more detail than mere mortals really need. Page 154 covers this for
the latest Skylake CPUs and uses INC in its example.

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
is Intel's own optimisation guide. Section 3.5.2.6 discusses partial flag
register stalls, although it doesn't specifically mention INC.

This stuff is way more complex than any normal person can keep in their head.
It's possible to learn all the edge cases and avoid the performance hit in
hand-written assembly, but it's a lot easier to just give it to the compiler to
puzzle out. That's its job.

Can we now go back to talking about interesting CPUs? :)
