Haven’t found this anywhere, so I had to do it myself. The title says it all, so here it is without drama:
// mm7 = 0x8000 0x8000 0x8000 0x8000
// mm3 = 0x7FFF 0xFFFF 0x7FFF 0xFFFF
// mm5 = 0x7FFF 0x7FFF 0xFFFF 0xFFFF
movq mm6, mm7
pand mm7, mm5 /* b ? 0 | 0x8000 */
pand mm6, mm3 /* a ? 0 | 0x8000 */
pxor mm5, mm7 /* b: top bit cleared, 0x7FFF max -> non-negative signed 15bit */
psraw mm6, 15 /* a ? 0 | 0xFFFF */
psraw mm7, 15 /* b ? 0 | 0xFFFF */
pand mm6, mm5 /* a ? 0 | b */
pand mm7, mm3 /* b ? 0 | a */
// mm3 = 0x7FFF 0xFFFF 0x7FFF 0xFFFF
// mm5 = 0x7FFF 0x7FFF 0x7FFF 0x7FFF
movq mm2, mm5
pmullw mm5, mm3 /* a * b lo 16×16 unsigned */
pmulhw mm2, mm3 /* a * b hi 15×16 signed */
paddw mm2, mm6 /* pos * neg ? +b fix-up */
movq mm6, mm5
punpcklwd mm5, mm2
punpckhwd mm6, mm2
// mm5 = 0x7FFF * 0x7FFF = 0x3FFF0001, 0x7FFF * 0xFFFF = 0x7FFE8001
// mm6 = 0x7FFF * 0x7FFF = 0x3FFF0001, 0x7FFF * 0xFFFF = 0x7FFE8001
pxor mm2, mm2
movq mm3, mm7
punpcklwd mm7, mm2
punpckhwd mm3, mm2
pslld mm7, 16 - 1
pslld mm3, 16 - 1
paddd mm5, mm7 /* neg * pos ? +a << 15 fix-up */
paddd mm6, mm3 /* neg * pos ? +a << 15 fix-up */
// mm5 = 0x7FFF * 0xFFFF = 0x7FFE8001, 0xFFFF * 0xFFFF = 0xFFFE0001
// mm6 = 0x7FFF * 0x7FFF = 0x3FFF0001, 0xFFFF * 0x7FFF = 0x7FFE8001
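To convince yourself that the two fix-ups are right, here is a rough scalar model of a single 16-bit lane in plain C. It is only a sketch and not part of the original routine: the names `mul16x16_signed_hi` and `self_check` are made up for illustration, `a` stands in for one word of mm3 and `b` for one word of mm5, and the `int16_t` casts assume the usual two's-complement wrap-around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one lane: a ~ word of mm3, b ~ word of mm5. */
uint32_t mul16x16_signed_hi(uint16_t a, uint16_t b)
{
    uint16_t b_top  = b & 0x8000u;               /* pand mm7, mm5 */
    uint16_t b15    = b ^ b_top;                 /* pxor mm5, mm7: b without its top bit */
    uint16_t fix_hi = (a & 0x8000u) ? b15 : 0;   /* mm6 after psraw/pand */
    uint16_t fix_a  = b_top ? a : 0;             /* mm7 after psraw/pand */

    /* pmullw: low 16 bits of the product, identical for signed and unsigned */
    uint16_t lo = (uint16_t)((uint32_t)a * b15);

    /* pmulhw: high 16 bits of the signed 16x16 product */
    int32_t  sp = (int32_t)(int16_t)a * (int32_t)(int16_t)b15;
    uint16_t hi = (uint16_t)((uint32_t)sp >> 16);

    /* paddw: a read as signed is a - 65536, so the high word is short by b15 */
    hi = (uint16_t)(hi + fix_hi);

    /* punpcklwd/punpckhwd: glue the low and high words into one dword */
    uint32_t prod = ((uint32_t)hi << 16) | lo;

    /* pslld + paddd: add back the top bit of b, i.e. a * 0x8000 */
    return prod + ((uint32_t)fix_a << 15);
}

/* The example values from the register comments above. */
void self_check(void)
{
    assert(mul16x16_signed_hi(0x7FFFu, 0x7FFFu) == 0x3FFF0001u);
    assert(mul16x16_signed_hi(0x7FFFu, 0xFFFFu) == 0x7FFE8001u);
    assert(mul16x16_signed_hi(0xFFFFu, 0x7FFFu) == 0x7FFE8001u);
    assert(mul16x16_signed_hi(0xFFFFu, 0xFFFFu) == 0xFFFE0001u);
}
```

The model works on one lane at a time, so it says nothing about the order in which the unpack instructions lay out the four results.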
With the MMX Extensions or SSE, which add the unsigned pmulhuw, the variant is simply this:
// mm3 = 0x7FFF 0xFFFF 0x7FFF 0xFFFF
// mm5 = 0x7FFF 0x7FFF 0xFFFF 0xFFFF
movq mm2, mm5
pmullw mm5, mm3 /* a * b lo 16×16 unsigned */
pmulhuw mm2, mm3 /* a * b hi 16×16 unsigned */
movq mm6, mm5
punpcklwd mm5, mm2
punpckhwd mm6, mm2
// mm5 = 0x7FFF * 0xFFFF = 0x7FFE8001, 0xFFFF * 0xFFFF = 0xFFFE0001
// mm6 = 0x7FFF * 0x7FFF = 0x3FFF0001, 0xFFFF * 0x7FFF = 0x7FFE8001
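If the lane order after the unpacks is unclear, the following C sketch models all four lanes of the pmulhuw variant. It is an illustration only: the function name `mul16x16_4lanes` and the array layout are my own assumptions, with index 0 standing for the least-significant word of the register, and the register comments above read as listing lanes from most to least significant.

```c
#include <stdint.h>

/* Hypothetical model: a[] ~ mm3, b[] ~ mm5, index 0 = least-significant word.
   out_lo[] models mm5 and out_hi[] models mm6 after the unpacks. */
void mul16x16_4lanes(const uint16_t a[4], const uint16_t b[4],
                     uint32_t out_lo[2], uint32_t out_hi[2])
{
    uint16_t lo[4], hi[4];

    for (int i = 0; i < 4; i++) {
        uint32_t full = (uint32_t)a[i] * b[i];
        lo[i] = (uint16_t)full;           /* pmullw:  low word of each product */
        hi[i] = (uint16_t)(full >> 16);   /* pmulhuw: high word of each product */
    }

    for (int i = 0; i < 2; i++) {
        /* punpcklwd interleaves words 0..1, punpckhwd interleaves words 2..3 */
        out_lo[i] = ((uint32_t)hi[i] << 16) | lo[i];
        out_hi[i] = ((uint32_t)hi[i + 2] << 16) | lo[i + 2];
    }
}
```

Fed with the example registers (written least-significant word first, so `a = {0xFFFF, 0x7FFF, 0xFFFF, 0x7FFF}` and `b = {0xFFFF, 0xFFFF, 0x7FFF, 0x7FFF}`), this gives the same four dwords as the mm5 and mm6 comments.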
You may enjoy that the necessary fix-up for the signed pmulhw is almost cleanly split into a prefix and a postfix, without cluttering the original 16×16=32 block too much.