>>>> "David" == David V Corbin <dvcorbin(a)optonline.net> writes:
 David> Paul, thanks for your comments.
 David> First you left out my intentional use of the word
 David> "consistently" when discussing performance. I was not
 David> discussing raw performance. If you ask a given programmer of
 David> any experience level (except novice) to code the EXACT same
 David> algorithm once a week for a year, I strongly doubt that you
 David> will get 52 identical binaries. If nothing else a different
 David> register will be used somewhere, etc.
True, but so what?  If the programmer is adequately competent then all
instances of the solution will meet spec.
 David> When I have been in a marathon coding session [fortunately not
 David> as common, the body will not really take it any more], I am
 David> much more likely to code a register over-write or forget that
 David> a given instruction affects some conditional flag [or code an
 David> unnecessary "update the conditionals" type instruction] than
 David> the compiler is when compiling a piece of
 David> C/C++/C#/VB/Cobol/Fortran code.
That's why startups that believe their engineers should work 7-day
weeks and 60+ hours per week for months on end are disasters waiting
to happen.
 David> Turning the discussion to performance [raw]. On simpler
 David> microprocessors [PIC, 6809, 6502, etc] I will concede that an
 David> excellent assembler level program can beat the compiler in
 David> many cases. From a business standpoint, it may very well not
 David> be worth the additional cost however. On a multi-processor
 David> pentium class machine with L1 and L2 cache, working in a
 David> virtual memory environment, etc, my experience has shown that
 David> most senior programmers with years of experience cannot even
 David> come close to matching the compiler!
That's directly counter to my experience.  The example I mentioned
(admittedly a rather special case -- a block XOR operation for a RAID
storage system) was implemented on a high end MIPS processor with two
level cache, 4 functional units, etc.
There are plenty of reasons why we beat the compiler by a large
margin.  One reason is that we know constraints that you can't express
in C (alignment, buffer size always a multiple of 512, etc.).  A big
reason is that the humans involved -- especially the apps engineer at
the CPU vendor end -- know a lot more about the machine's pipeline,
the queue depth of its memory interface, etc., than the compiler
does.  That second part is admittedly fixable; the first point is
not.  But even now, several years later, the compiler is still nowhere
close to knowing as much as we did.
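
To make the first point concrete, here is a rough C sketch of a block
XOR with those constraints written down as assertions.  The name
block_xor, the 8-byte alignment, and the unroll factor are assumptions
for illustration only, not details of the actual RAID code; only the
multiple-of-512 size comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_BYTES 512u   /* from the post: size is always a multiple of 512 */

/* Hypothetical helper: dst ^= src over nbytes. */
static void block_xor(void *dst, const void *src, size_t nbytes)
{
    /* Promises made by the caller.  The 8-byte alignment is an assumed value. */
    assert((uintptr_t)dst % sizeof(uint64_t) == 0);
    assert((uintptr_t)src % sizeof(uint64_t) == 0);
    assert(nbytes % SECTOR_BYTES == 0);

    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t nwords = nbytes / sizeof(uint64_t);

    /* 512 bytes is 64 words, so unrolling by 8 never leaves a remainder. */
    for (size_t i = 0; i < nwords; i += 8) {
        d[i + 0] ^= s[i + 0];
        d[i + 1] ^= s[i + 1];
        d[i + 2] ^= s[i + 2];
        d[i + 3] ^= s[i + 3];
        d[i + 4] ^= s[i + 4];
        d[i + 5] ^= s[i + 5];
        d[i + 6] ^= s[i + 6];
        d[i + 7] ^= s[i + 7];
    }
}
```

The assertions are run-time checks on what the caller promises, not a
portable way of stating those facts to the optimizer, and nothing here
captures pipeline or memory-queue behaviour at all.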
To pick another example -- if you want performance, would you code
memcpy() in C?  No way.  You can beat the compiler, and it's worth the
effort to do so.
As for the business issue, indeed it has long been true that assembler
is for special cases.  If a week or two of work can cut 50% out of a
routine that's 30% of the system, you've just improved the system
performance by 15% or more, and that's very definitely worth doing
from the business point of view.
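
The arithmetic behind that 15% figure can be sketched in a few lines
of C.  The function names are hypothetical, and the model assumes the
rest of the system's run time is unchanged by the tuning.

```c
/* fraction: share of total run time spent in the routine (0..1)
 * cut:      share of the routine's own time that the tuning removes (0..1) */

/* Fraction of total run time saved. */
static double time_saved(double fraction, double cut)
{
    return fraction * cut;
}

/* Relative gain in throughput (work done per unit time). */
static double throughput_gain(double fraction, double cut)
{
    return 1.0 / (1.0 - fraction * cut) - 1.0;
}
```

With fraction = 0.30 and cut = 0.50, time_saved() gives 0.15 and
throughput_gain() gives roughly 0.176, which is where the "or more"
comes from.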
       paul