It’s always interesting to open the assembly view in Visual Studio and see the generated assembly code. You might be surprised how short or long some things you wrote in C++ turn out to be. I’ve often wondered how to see exactly how expensive a CPU instruction is, and today I’ve found an answer to my question.
Intel publishes reference manuals for people such as compiler writers and low-level coders, detailing the specifics of their CPUs. One of them is the Intel® 64 and IA-32 Architectures Optimization Reference Manual, a 642 page PDF describing in detail how their CPUs work, with tons of optimization tricks, ending with tables detailing every CPU instruction and its latency (the number of clock cycles the CPU takes to run the instruction). You can find these in Appendix C – Instruction Latency and Throughput here.
Let’s check this out in some more detail. I’ve recently learned some bit shifting and, interestingly, when shifting a value’s bits to the left by one position, you multiply that value by 2.
Let’s check out the assembly for this. Notice the “shl” instruction where the bit shifting occurs.
Let’s do this with a normal multiplication and a different number (because otherwise the compiler will turn the multiplication into a bit shift and emit the shl instruction anyway).
This time the imul instruction is called. Even though both examples compile to the same number of assembly instructions, the first one is (logically) still faster, but how much faster exactly? Turns out that according to the above document, the “shl” instruction has a latency of 1, while the “imul” instruction has a latency ranging between 15 and 18(!).
What if we do the same example with multiplication instead of byte shifting?
The compiler is smart enough to know it can bit shift and uses the shl instruction instead of imul. Neat!