The Lounge is rated Safe For Work. If you're about to post something inappropriate for a shared office environment, then don't post it. No ads, no abuse, and no programming questions. Trolling (political, climate, religious, or whatever) will result in your account being removed.
It's kinda difficult to explain, but it's fascinating to watch someone who really knows what they are doing, doing what they do best. It's the "unconscious competence" that they demonstrate, I suspect. It's what taught me to change motorcycle tyres, fix their engines, rewire the electrics, etc. ...
You never lose anything by watching a master at work, and you might learn something.
Sent from my Amstrad PC 1640
Never throw anything away, Griff
Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...
AntiTwitter: @DalekDave is now a follower!
That's a coincidence - I saw a video recently of a guy picking a lock (can't remember where, though) and it too piqued my inquisitive nerdy head. I've ordered a cheap beginner's set of tools from Amazon. New page on this site maybe? Lockpickers Q&A
"We can't stop here, this is bat country" - Hunter S. Thompson, RIP
But hiding behind that is a set of genuinely useful instructions, not exclusively for "deep learning" (whatever that is).
VPDPBUSD (base intrinsic: _mm_dpbusd_epi32), in the words of the official guide,
Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.
So it's like a slightly more elaborate version of the old PMADDUBSW (aka _mm_maddubs_epi16), summing groups of 4 instead of 2 and with an extra 32-bit addition at the end. There is also a version where the final sum is saturating. Since it sums together groups of 4, it has 32 bits to represent the sum of 4 products, and it is easy to avoid saturation. This is different from working with PMADDUBSW which sums together two adjacent 16-bit products with saturation, which in many cases limits the useful range of your scale factors.
It's also actually a fast instruction (CTRL+F for VNNI), at 2 operations per cycle. It's not inherently faster than VPMADDUBSW, but you get more range on your scale factors, so the amount of useful work goes up a bit. The 512-bit version does not look super useful yet, possibly even harmful: it's twice as wide but can be done half as often, so it does not increase the amount of work done per cycle, and since it is a 512-bit instruction it may have funny side effects such as reducing clock speed and fusing ports 0 and 1 - who knows, be careful.
Unfortunately there is still no VPMULLB or VPMULHUB or VPMULHB. Intel, AMD, please.
Lots of image processing stuff: convolutions, 4-way cross-fade, color space conversion, bilinear interpolation...
It does not seem as great for parsing decimal numbers, because the 4th scale factor won't fit in an sbyte: the digit weights would have to be 1000, 100, 10, 1, and 1000 is well outside the signed 8-bit range. If you shuffle first you can make something happen, but is it worth it? Maybe...
It could be used to turn blocks of 4 8-bit masks into 4-bit groups of packed flags by setting the second operand to _mm_setr_epi8(-1, -2, -4, -8, -1, -2, -4, -8, -1, -2, -4, -8, -1, -2, -4, -8). That puts junk in the upper 24 bits of each dword, but you can just discard it. The result is sort of like a PMOVMSKB but with a vector result, and I predict some synergy with PSHUFB (which can use those 4-bit groups to pick arbitrary 8-bit values out of a 16-entry lookup table).