That's a coincidence: I saw a video recently of a guy picking a lock (can't remember where, though) and it too piqued my inquisitive nerdy head. I've ordered a cheap beginner's set of tools from Amazon. New page on this site, maybe? Lockpickers Q&A
We can’t stop here, this is bat country - Hunter S Thompson RIP
But hiding behind that is a set of genuinely useful instructions, not exclusively for "deep learning" (whatever that is).
VPDPBUSD (base intrinsic: _mm_dpbusd_epi32) does the following, in the words of the official guide:
Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.
So it's like a slightly more elaborate version of the old PMADDUBSW (aka _mm_maddubs_epi16), summing groups of 4 instead of 2 and with an extra 32-bit addition at the end. There is also a version where the final sum is saturating (VPDPBUSDS, _mm_dpbusds_epi32). Since it has 32 bits to represent the sum of 4 products, it is easy to avoid saturation. This is different from working with PMADDUBSW, which sums two adjacent 16-bit products into a 16-bit result with saturation, and that in many cases limits the useful range of your scale factors.
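To make the range difference concrete, here is a minimal scalar sketch of a single lane of each instruction. It is only a model of the documented behaviour, not the intrinsics themselves; the function names (`maddubs_lane`, `dpbusd_lane`, `range_demo`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the int16_t range, as PMADDUBSW does with its pair sum. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model of one 16-bit lane of PMADDUBSW:
   a[] is unsigned, b[] is signed, the pair sum saturates. */
int16_t maddubs_lane(const uint8_t a[2], const int8_t b[2]) {
    int32_t sum = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    return sat16(sum);
}

/* Hypothetical scalar model of one 32-bit lane of VPDPBUSD
   (the non-saturating form): four u8*s8 products added to src. */
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * b[i];   /* |sum| stays below 2^18 */
    /* Add with wraparound, like the non-saturating instruction. */
    return (int32_t)((uint32_t)src + (uint32_t)sum);
}

/* Largest inputs: the pair sum clips, the group-of-4 sum does not. */
void range_demo(void) {
    const uint8_t a[4] = {255, 255, 255, 255};
    const int8_t  b[4] = {127, 127, 127, 127};
    assert(maddubs_lane(a, b) == 32767);      /* 64770 clipped */
    assert(dpbusd_lane(0, a, b) == 129540);   /* exact */
}
```

With maximal inputs the two-product sum (64770) no longer fits in 16 bits, while the four-product sum (129540) is nowhere near the 32-bit limit.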
It's also actually a fast instruction (CTRL+F for VNNI), at 2 operations per cycle. It's not inherently faster than VPMADDUBSW, but you get more range on your scale factors, so the amount of useful work goes up a bit. The 512-bit version does not look super useful yet, and is possibly even harmful: it's twice as wide but can be issued only half as often, so it does not increase the amount of work done per cycle. And since it is a 512-bit instruction, it may have funny side effects such as reducing clock speed and fusing ports 0 and 1. Who knows; be careful.
Unfortunately there is still no VPMULLB or VPMULHUB or VPMULHB. Intel, AMD, please.
Lots of image processing stuff: convolutions, 4-way cross-fade, color space conversion, bilinear interpolation.
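As one illustration of the image processing uses, here is a rough scalar sketch of a 4-way cross-fade done with a single dot-product step per output byte. The fixed-point convention (four non-negative weights summing to 64, rounding bias passed in through the accumulator) is my own assumption, and `dpbusd_lane`, `crossfade4` and `crossfade_demo` are hypothetical names that model one lane rather than call the real intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane of VPDPBUSD. */
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * b[i];
    return (int32_t)((uint32_t)src + (uint32_t)sum);
}

/* Blend one byte from each of 4 images.
   Assumption: weights are non-negative and sum to 64 (6-bit fixed point).
   The accumulator input carries the rounding bias of 32. */
uint8_t crossfade4(const uint8_t px[4], const int8_t w[4]) {
    int32_t acc = dpbusd_lane(32, px, w);
    return (uint8_t)(acc >> 6);
}

void crossfade_demo(void) {
    const uint8_t px[4]    = {0, 64, 128, 255};
    const int8_t  equal[4] = {16, 16, 16, 16};
    assert(crossfade4(px, equal) == 112);   /* mean 111.75, rounded */
}
```

In a real kernel the four source bytes would first have to be interleaved so that they sit next to each other in the unsigned operand, which this sketch leaves out.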
It does not seem as great for parsing decimal numbers, because the 4th scale factor (1000) won't fit in an sbyte, which tops out at 127. If you shuffle first you can make something happen, but is it worth it? Maybe.
It could be used to turn blocks of 4 8-bit masks into a 4-bit combination of packed flags by setting the second operand to _mm_setr_epi8(-1, -2, -4, -8, -1, -2, -4, -8, -1, -2, -4, -8, -1, -2, -4, -8). This puts junk in the upper 24 bits of the dwords, but just discard that; the flags end up in the low byte. It's sort of like a PMOVMSKB but with a vector result. I predict some synergy with PSHUFB (which can use those 4-bit groups to pick arbitrary 8-bit values out of a 16-entry lookup table).
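The reason the negative weights work: a set mask byte is 255 when read as unsigned, and 255 * -k is congruent to k modulo 256, so the low byte of the sum is exactly the OR of the selected flag bits. A small scalar sketch of one dword lane, with hypothetical names (`dpbusd_lane`, `mask4_to_flags`, `flags_demo`) standing in for the real instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane of VPDPBUSD. */
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * b[i];
    return (int32_t)((uint32_t)src + (uint32_t)sum);
}

/* Turn 4 mask bytes (each 0x00 or 0xFF) into a 4-bit flag value.
   The upper 24 bits of the lane are junk and are discarded here. */
uint8_t mask4_to_flags(const uint8_t mask[4]) {
    static const int8_t weights[4] = {-1, -2, -4, -8};
    for (int i = 0; i < 4; i++)
        assert(mask[i] == 0x00 || mask[i] == 0xFF);  /* precondition */
    int32_t lane = dpbusd_lane(0, mask, weights);
    return (uint8_t)((uint32_t)lane & 0xFFu);
}

void flags_demo(void) {
    const uint8_t m[4] = {0xFF, 0x00, 0xFF, 0x00};
    assert(mask4_to_flags(m) == 5);   /* bits 0 and 2 set */
}
```

Positive weights 1, 2, 4, 8 would not work the same way, since 255 * k is congruent to -k modulo 256 and the low byte would hold the negated flags instead.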
I hear that they're making Guantanamo prisoners do regular expression coding, and that's the REAL reason Hillary and Obama are panicking over the DoJ investigations.
".45 ACP - because shooting twice is just silly" - JSOP, 2010 ----- You can never have too much ammo - unless you're swimming, or on fire. - JSOP, 2010 ----- When you pry the gun from my cold dead hands, be careful - the barrel will be very hot. - JSOP, 2013
The problem I've run into with hyperlinking between articles is having to go back and add the links after each one is approved. When there are a lot of articles over time, it gets hectic to remember them all.
Oh well, I still find myself agreeing with you. I think one article is too long.
It's sure too long for me to write in one sitting.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.