Do you mean something like this?
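#include <iostream>
#include <tchar.h>
#include <mmintrin.h>   // _mm_empty()

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    int x[4][4] = {1,2,3,4,2,2,2,2,3,3,3,3,4,4,4,4};
    int y[4][4];
    int b[4] = {5,5,5,5};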
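    __asm
    {
        push esi
        push edi
        push ebx
        lea esi, x          ; esi -> source matrix x
        lea edi, y          ; edi -> destination matrix y
        lea edx, b          ; edx -> per-row addends b
        mov eax, 4          ; 4 = row count = column count = sizeof(int)
        mov ebx, eax        ; ebx = row counter
    _loop_00:
        mov ecx, eax        ; ecx = column counter, consumed by loop
    _loop_01:
        movd mm0, [edx]     ; mm0 = b[row]
        movd mm1, [esi]     ; mm1 = x[row][col]
        paddd mm0, mm1      ; mm0 = x[row][col] + b[row]
        ; psraw mm0, 6      ; optional arithmetic shift (stage two), commented out
        movd [edi], mm0     ; y[row][col] = mm0
        add esi, eax        ; advance to the next element of x
        add edi, eax        ; advance to the next element of y
        loop _loop_01       ; decrement ecx, repeat while ecx != 0
        add edx, eax        ; advance to the next element of b
        dec ebx
        jnz _loop_00        ; repeat for each row
        pop ebx
        pop edi
        pop esi
    }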
    _mm_empty();            // emits EMMS: clears the MMX state before any x87 floating-point code

    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
            cout << y[i][j] << " ";
        cout << endl;
    }
    return 0;
}
I didn't understand stage two, so I commented it out: with values this small it yields all zeros. :-) If you need the arithmetic shift, un-comment the line "; psraw mm0, 6".
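To see why the shift produces all zeros here, the following is a minimal plain C++ sketch, assuming the shift is applied to each sum after the addition. The name add_then_shift is hypothetical and not part of the code above. It matches psraw only for small non-negative sums like these, because psraw shifts each 16-bit word of the register separately.

```cpp
// Hypothetical helper (not in the answer's code): add, then shift right.
// For small non-negative sums this gives the same result as
// "paddd mm0, mm1" followed by "psraw mm0, 6".
int add_then_shift(int x, int b, int shift)
{
    return (x + b) >> shift;
}
```

The largest sum in the example is 4 + 5 = 9, which is below 64, so shifting right by 6 leaves 0 for every element.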
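You can change the array dimensions through ebx (the row count) and ecx (the column count). I loaded every constant from eax only because, in this particular case, they all equal 4: the row count, the column count, and the size of an int in bytes.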
BTW, I used MMX only because your question's title asks for it. There is probably a better-optimized way to do this with plain x86 instructions. The array elements are "int" (that is, 32 bits) and the code moves them one at a time with movd, so I don't think the 64-bit MMX registers gain anything here.
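For comparison, here is a plain C++ sketch of what the assembly loop computes, namely y[i][j] = x[i][j] + b[i]. The function name add_row_constant and the template parameters Rows and Cols are assumptions made for illustration; they play the roles of ebx and ecx. I have not measured it against the MMX version.

```cpp
#include <cstddef>

// Hypothetical reference version of the loop: adds b[i] to every element of row i.
// Rows and Cols correspond to the counters held in ebx and ecx in the assembly.
template <std::size_t Rows, std::size_t Cols>
void add_row_constant(const int (&x)[Rows][Cols],
                      const int (&b)[Rows],
                      int (&y)[Rows][Cols])
{
    for (std::size_t i = 0; i < Rows; ++i)
        for (std::size_t j = 0; j < Cols; ++j)
            y[i][j] = x[i][j] + b[i];
}
```

With the same x and b as above, this fills y with the rows 6 7 8 9, 7 7 7 7, 8 8 8 8 and 9 9 9 9.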