Fröhling

Niels Fröhling about [X]HTML, JS and technical tidbits

  • Home
  • About
01. February 2009

Uh, more on compilers

Ethatron in Optimizations

Uh, I don’t know how I may introduce this one. I don’t know if I should call it a horrific failure or just hilarious. This is what comes out of VS2008 for a signed short division:

  symSend = mergedNGram % numCases;
  mergedNGram /= numCases;

The produced code:


  00024	0f bf 86 2c 04
        00 00		 movsx	 eax, WORD PTR [esi+1068]
  0002b	99		 cdq
  0002c	b9 03 00 00 00	 mov	 ecx, 3
  00031	f7 f9		 idiv	 ecx
  00033	0f b7 ca	 movzx	 ecx, dx
  00036	66 89 86 2c 04
        00 00		 mov	 WORD PTR [esi+1068], ax

This little bugger knows exactly we have shorts (he sign extends them to longs), and he also knows, that the results must be shorts (he does not consider results over 65535)! I got courious and tried everything to make it produce non-long division, especially a char division. Nada, njiet.
I would forgive it it’s behaviour if it would be just a plain simple non-optimizing compiler, but this one is complex and it does have the entire boundary-check done right! And anyway he damn doesn’t produce a 6-7 clocks faster 8bit division!?

Kein Kommentar
20. December 2008

Peephole optimization, what’s that?

Ethatron in Optimizations

You never stop learning, I just explored how I can force a compiler into producing the sbb op on x86. It’s quite difficult actually, most of them produce for example cmp + seta + neg for the term -(a > b). In fact the only deterministic behaviour is with a very special ternery term (a > b ? -1 : 0).
So the function I tried to optimize for the optimzer to be optimized was:

  // max(l, t)
  unsigned char L = (l - t); t += (L & (L < l ? -1 : 0));

The produced code, even though almost cool, was terrifying at the same time:

  0002c	8a 54 24 16	 	mov	 dl, ...
  00030	8a 44 24 04	 	mov	 al, ...
  00034	8a ca		 	mov	 cl, dl
  00036	2a c8		 	sub	 cl, al
  00038	3a ca		 	cmp	 cl, dl
  0003a	1a d2		 	sbb	 dl, dl
  00047	81 e2 ff 00 00 00	and	 edx, 255
  0004d	22 d1		 	and	 dl, cl
  0004f	02 c2		 	add	 al, dl

I marked the minor redundancy in rose, and the real offending ones red. And the worst is, I couldn’t force him to quit then. In this case it was the VC7, maybe you don’t want to expect too much from that one, I though and threw the ball to some others.

The result was VC8, VC9 and the gcc all at least dropped the and, but none, not one removed the cmp + mov (the cmp does not affect the borrow-bit, which is successively used by the sbb, thus is a nop, because it’s the one op using the backupped variable, the mov is also a nop).

The correct minimal code would be:

  0002c	8a 54 24 16	 	mov	 cl, ...
  00030	8a 44 24 04	 	mov	 al, ...
  00036	2a c8		 	sub	 cl, al
  0003a	1a d2		 	sbb	 dl, dl
  0004d	22 d1		 	and	 dl, cl
  0004f	02 c2		 	add	 al, dl

The morale: compilers still don’t do flag-aware peephole-optimization. That’s too bad after the computers became fast enough to profit from almost 40 years of compiler-development …

Kein Kommentar
15. July 2008

MMX unsigned 16×16 multiply

Ethatron in Optimizations

Havn’t found this anywhere, so I had to do it myself. The title says it all, so here it is without drama:

// mm7 = 0x8000 0x8000 0x8000 0x8000
// mm3 = 0×7FFF 0xFFFF 0×7FFF 0xFFFF
// mm5 = 0×7FFF 0×7FFF 0xFFFF 0xFFFF

movq		mm6, mm7
pand		mm7, mm5	/* b ? 0 | 0×8000 */
pand		mm6, mm3	/* a ? 0 | 0×8000 */
pxor		mm5, mm7	/* b ? 0×7FFF max -> signed 15bit */
psraw		mm6, 15		/* a ? 0 | 0xFFFF */
psraw		mm7, 15		/* b ? 0 | 0xFFFF */
pand		mm6, mm5	/* a ? 0 | b */
pand		mm7, mm3	/* b ? 0 | a */

// mm3 = 0×7FFF 0xFFFF 0×7FFF 0xFFFF
// mm5 = 0×7FFF 0×7FFF 0×7FFF 0×7FFF

movq		mm2, mm5
pmullw		mm5, mm3	/* a * b lo 16×16 unsigned */
pmulhw		mm2, mm3	/* a * b hi 15×16 signed */
paddw		mm2, mm6	/* pos * neg ? +b fix-up */
movq		mm6, mm5
punpcklwd	mm5, mm2
punpckhwd	mm6, mm2

// mm5 = 0×7FFF * 0×7FFF = 0×3FFF0001, 0×7FFF * 0xFFFF = 0×7FFE8001
// mm6 = 0×7FFF * 0×7FFF = 0×3FFF0001, 0×7FFF * 0xFFFF = 0×7FFE8001

pxor		mm2, mm2
movq		mm3, mm7
punpcklwd	mm7, mm2
punpckhwd	mm3, mm2
pslld		mm7, 16 - 1
pslld		mm3, 16 - 1
paddd		mm5, mm7	/* neg * pos ? +b << 15 fix-up */
paddd		mm6, mm3	/* neg * pos ? +b << 15 fix-up */

// mm5 = 0×7FFF * 0xFFFF = 0×7FFE8001, 0xFFFF * 0xFFFF = 0xFFFE0001
// mm6 = 0×7FFF * 0×7FFF = 0×3FFF0001, 0xFFFF * 0×7FFF = 0×7FFE8001

The MMX Extensions and SSE variant is this:

// mm3 = 0x7FFF 0xFFFF 0x7FFF 0xFFFF
// mm5 = 0x7FFF 0x7FFF 0xFFFF 0xFFFF

movq		mm2, mm5
pmullw		mm5, mm3	/* a * b lo 16×16 unsigned */
pmulhuw		mm2, mm3	/* a * b hi 16×16 unsigned */
movq		mm6, mm5
punpcklwd	mm5, mm2
punpckhwd	mm6, mm2

// mm5 = 0×7FFF * 0xFFFF = 0×7FFE8001, 0xFFFF * 0xFFFF = 0xFFFE0001
// mm6 = 0×7FFF * 0×7FFF = 0×3FFF0001, 0xFFFF * 0×7FFF = 0×7FFE8001

You may enjoy, that I almost cleanly pre- and postfix the nessesary fix-up for the signed pmulhw, without cluttering the original 16×16=32 block too much.

Kein Kommentar

Tags

  • Algorithms Assembler bash C++ Homemade Javascript MMX Optimization Rants TYPO3

Categories

    • Applications
      • TYPO3
    • TidBits
      • Algorithms
      • Approximations
      • Equivalence
      • Fixes
      • Javascript
      • Optimizations
      • Scripts
    • Uncategorized

Meta

    • Log in
    • Entries RSS
    • Comments RSS
    • WordPress.org

Recent Posts

    • jQuery’s animate is short-thought
    • movntq alignment
    • OP-equivalent series (pavgd)
    • OP-equivalent series (psubq)
    • OP-equivalent series (paddq)

 

  • September 2010
    M T W T F S S
    « Aug    
     12345
    6789101112
    13141516171819
    20212223242526
    27282930  

Archives

    • August 2010
    • April 2010
    • February 2010
    • January 2010
    • July 2009
    • February 2009
    • December 2008
    • November 2008
    • August 2008
    • July 2008
    • May 2008
© 2010 Wired by Fröhling
Design von Dezzain Studio
Übersetzt von Htwo
Nature Pictures | Bamboo Blinds