psubq (64bit - 64bit = 64bit):
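movq mm7, mmc // keep the original minuend
psubd mmc, mms // subtract both 32-bit halves independently
movq mm6, mmc // copy of the difference
psubd mm7, 0x80000000 // bias so the signed compare acts as unsigned
psubd mm6, 0x80000000
pcmpgtd mm6, mm7 // difference > original means the low half borrowed
psllq mm6, 32 // move the borrow mask (-1) into the high half
paddd mmc, mm6 // adding -1 takes the borrow out of the high half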
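To make the borrow logic easier to check, here is a minimal scalar sketch in C of a 64-bit subtract assembled from 32-bit lane operations. The names `cmpgt_s32` and `psubq_model` are made up for this illustration, and the model only mirrors the idea (sign-bias, signed compare, mask moved to the high half); it is not a drop-in replacement for the MMX sequence.

```c
#include <stdint.h>

/* Scalar model of one pcmpgtd lane: all-ones if a > b as signed values. */
static uint32_t cmpgt_s32(uint32_t a, uint32_t b) {
    return ((int32_t)a > (int32_t)b) ? 0xFFFFFFFFu : 0u;
}

/* 64-bit subtract built only from 32-bit operations (hypothetical helper). */
static uint64_t psubq_model(uint64_t c, uint64_t s) {
    uint32_t c_lo = (uint32_t)c, c_hi = (uint32_t)(c >> 32);
    uint32_t s_lo = (uint32_t)s, s_hi = (uint32_t)(s >> 32);

    uint32_t d_lo = c_lo - s_lo; /* psubd, low lane  */
    uint32_t d_hi = c_hi - s_hi; /* psubd, high lane */

    /* Subtracting 0x80000000 from both sides turns the signed compare into
       an unsigned one. The low lane borrowed exactly when the wrapped
       difference is larger than the original low dword. */
    uint32_t borrow = cmpgt_s32(d_lo - 0x80000000u, c_lo - 0x80000000u);

    /* The mask is -1 on borrow; adding it decrements the high lane. */
    d_hi += borrow;

    return ((uint64_t)d_hi << 32) | d_lo;
}
```

For instance, `psubq_model(0x0000000100000000, 1)` borrows from the high half and gives `0x00000000FFFFFFFF`.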
paddq (64bit + 64bit = 64bit):
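movq mm7, mmc // keep the original addend
paddd mmc, mma // add both 32-bit halves independently
psubd mm7, 0x80000000 // bias so the signed compare acts as unsigned
psubd mmc, 0x80000000
pcmpgtd mm7, mmc // original > sum means the low half carried
paddd mmc, 0x80000000 // undo the bias on the sum
psllq mm7, 32 // move the carry mask (-1) into the high half
psubd mmc, mm7 // subtracting -1 adds the carry to the high half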
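The carry case can be checked the same way. Below is a small scalar sketch in C, with the illustrative names `cmpgt_s32` and `paddq_model` (neither comes from the listing). It assumes the carry is detected by comparing the original low dword against the wrapped sum after sign-biasing both.

```c
#include <stdint.h>

/* Scalar model of one pcmpgtd lane: all-ones if a > b as signed values. */
static uint32_t cmpgt_s32(uint32_t a, uint32_t b) {
    return ((int32_t)a > (int32_t)b) ? 0xFFFFFFFFu : 0u;
}

/* 64-bit add built only from 32-bit operations (hypothetical helper). */
static uint64_t paddq_model(uint64_t c, uint64_t a) {
    uint32_t c_lo = (uint32_t)c, c_hi = (uint32_t)(c >> 32);
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);

    uint32_t s_lo = c_lo + a_lo; /* paddd, low lane  */
    uint32_t s_hi = c_hi + a_hi; /* paddd, high lane */

    /* The low lane carried exactly when the wrapped sum is smaller than
       the original low dword (unsigned compare via the 0x80000000 bias). */
    uint32_t carry = cmpgt_s32(c_lo - 0x80000000u, s_lo - 0x80000000u);

    /* The mask is -1 on carry; subtracting it increments the high lane. */
    s_hi -= carry;

    return ((uint64_t)s_hi << 32) | s_lo;
}
```

As a quick check, `paddq_model(0x00000000FFFFFFFF, 1)` carries into the high half and gives `0x0000000100000000`.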
packusqd (saturated clamp from unsigned long long to unsigned long),
packssqud (saturated clamp from signed long long to unsigned long):
There is a condition for unsigned long long inputs: the range is “only” [0x0000000000000000, 0x7FFFFFFFFFFFFFFF].
There is no condition for signed long long inputs.
// from-scratch, no helper available
/* -1 */
pcmpeqd mm6, [...]
packusdw (saturated clamp from unsigned long to unsigned short),
packssduw (saturated clamp from signed long to unsigned short):
There is a condition for unsigned long inputs, range is “only” [0x00000000, 0x7FFFFFFF].
There is a condition for signed long inputs: the range is “only” [0x80008000, 0x7FFFFFFF]. You could cover the full signed long range if a “psubsd” existed, which does [...]
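One common way to get an unsigned 16-bit clamp out of the signed `packssdw` is to bias by 0x8000 before packing and un-bias afterwards. Whether the truncated text above uses exactly this is an assumption, but the sketch below shows why such an approach has the [0x80008000, 0x7FFFFFFF] limit: the wrapping 32-bit subtract overflows for anything smaller. The helper names are invented for the example.

```c
#include <stdint.h>

/* Scalar model of one packssdw lane: signed 32-bit to signed 16-bit,
   saturating at both ends. */
static int16_t packssdw_lane(int32_t v) {
    if (v < INT16_MIN) return INT16_MIN;
    if (v > INT16_MAX) return INT16_MAX;
    return (int16_t)v;
}

/* Signed 32-bit to unsigned 16-bit clamp via the bias trick
   (hypothetical helper; assumes psubd / packssdw / paddw). */
static uint16_t pack_s32_to_u16(int32_t v) {
    /* psubd wraps around silently, so model it with unsigned arithmetic. */
    int32_t biased = (int32_t)((uint32_t)v - 0x8000u);

    /* [0, 65535] now maps onto [-32768, 32767], which packssdw can clamp. */
    int16_t packed = packssdw_lane(biased);

    /* paddw 0x8000 (wrapping) shifts the result back to [0, 65535]. */
    return (uint16_t)((uint16_t)packed + 0x8000u);
}
```

Inside the stated range the clamp behaves as expected: negative inputs give 0 and inputs above 65535 give 65535. Below 0x80008000 the subtract wraps to a large positive value and the result saturates to 65535 instead of 0, which is the gap a saturating “psubsd” would close.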
The SSE ops lack a huge number of very useful functions. Basically every mathematically relevant function is absent; worse, trying to re-create them at the same precision as the FPU routines results in code that is slower by an order of magnitude. Sometimes one can find a useful shortcut, but not very often.
Here I tried to make a cbrt(x), the cubic [...]
padddq (128bit + 64bit = 128bit):
// carry emulation via pcmpgtq equivalent
movdqa xmmx, xmma
movdqa xmmy, xmmb
movdqa xmmz, xmma
paddq xmma, xmmb
pcmpgtd xmmx, xmmb
pcmpgtd xmmy, xmma
pcmpeqd xmmz, xmmb
pshufd xmmx, xmmx, 1|1|1|1
pshufd xmmz, xmmz, 1|1|1|1
pand xmmz, xmmy
por xmmz, xmmx
punpcklqdq xmmz, xmmz
pslldq xmmz, 8 // or pshufd
psubq xmma, xmmz
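As a reference for what the sequence is meant to compute, here is a scalar sketch in C of a 128-bit plus 64-bit add. The type `u128_model` and both function names are hypothetical. The carry test deliberately builds the unsigned 64-bit comparison from 32-bit pieces, since that is the part a missing pcmpgtq forces you to emulate; it is a model of the idea, not a transcription of the listing.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit register: low and high qword. */
typedef struct { uint64_t lo, hi; } u128_model;

/* Unsigned 64-bit "greater than" assembled from 32-bit comparisons:
   the high dwords decide, and the low dwords break a tie. */
static int gt_u64_from_dwords(uint64_t a, uint64_t b) {
    uint32_t a_hi = (uint32_t)(a >> 32), a_lo = (uint32_t)a;
    uint32_t b_hi = (uint32_t)(b >> 32), b_lo = (uint32_t)b;
    return (a_hi > b_hi) || (a_hi == b_hi && a_lo > b_lo);
}

/* 128-bit + 64-bit = 128-bit. */
static u128_model add_128_64(u128_model a, uint64_t b) {
    u128_model r;
    r.lo = a.lo + b;                         /* paddq on the low qword */
    /* The low qword wrapped exactly when the sum is below the addend. */
    int carry = gt_u64_from_dwords(b, r.lo);
    r.hi = a.hi + (uint64_t)carry;           /* propagate into the high qword */
    return r;
}
```

Note that the dword comparisons here are unsigned; with the signed pcmpgtd they would need the same 0x80000000 bias used in the 64-bit routines.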
When you don’t have SSE4 available you have to compensate for the lack of some essential ops. In this series I’m going to write down equivalents of either missing or (currently) non-AMD ops.
pcmpeqq (64bit integer compare):
// equality == (hi == hi) && (lo == lo)
movdqa xmm?3, xmma
pcmpeqd xmm?3, xmmb
pshufd xmm?1, xmm?3, 3|3|1|1 (b == a)
pshufd xmm?2, xmm?3, 2|2|0|0 (B == A)
pand xmm?1, [...]
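The idea in the comment, equality == (hi == hi) && (lo == lo), can be sketched for a single 64-bit lane in scalar C. The names `cmpeq_u32` and `pcmpeqq_model` are made up for this example, and the broadcast at the end stands in for the two pshufd copies being combined.

```c
#include <stdint.h>

/* Scalar model of one pcmpeqd lane: all-ones if equal, else zero. */
static uint32_t cmpeq_u32(uint32_t a, uint32_t b) {
    return (a == b) ? 0xFFFFFFFFu : 0u;
}

/* 64-bit equality mask built from two 32-bit equality masks
   (hypothetical helper). */
static uint64_t pcmpeqq_model(uint64_t a, uint64_t b) {
    uint32_t eq_lo = cmpeq_u32((uint32_t)a, (uint32_t)b);
    uint32_t eq_hi = cmpeq_u32((uint32_t)(a >> 32), (uint32_t)(b >> 32));

    /* Both halves must match; AND the masks, then fill the whole qword. */
    uint32_t both = eq_lo & eq_hi;
    return ((uint64_t)both << 32) | both;
}
```

The result is all-ones only when both dwords match, so two values that differ in just one half yield zero.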