Dear PTfS students,
There was some confusion in the tutorial about how to calculate the data traffic for a strided loop, e.g.:
    for (i = 0; i < N; i += 2) {
        z[i] = a[i] * 0.5f;
    }
As discussed on slide 16 of slide set 4, cache lines are always read and written as a whole. Consequently, a[] must be read completely, and z[] must be both read (write-allocate) and written completely, including the elements that are not used in any calculation. Since we want to calculate the in-memory code balance, only the memory transfers matter. With 4-byte float elements, each index i accounts for 12 bytes of traffic (4 bytes for reading a[], 4 bytes for reading z[], and 4 bytes for writing z[]), while only every second index performs a flop, i.e., 0.5 flops per index. Hence, this loop has a memory code balance of 12 bytes / 0.5 flops = 24 byte/flop. The same holds for the L2 and L3 code balance. (Note that if the stride is larger than a cache line (16 elements in this case), the data traffic may be reduced, depending on the details of the architecture.)
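If you want to check the arithmetic for other strides, the following is a minimal sketch of the same reasoning in C. The function name strided_code_balance and its parameters are made up for illustration and are not part of the lecture material. It assumes write-allocate caches, a working set much larger than any cache, cache-line-aligned arrays whose element size divides the line size, a stride of at least 1, and no prefetcher effects. For strides beyond one cache line it models the idealized case in which exactly one line per array is touched per iteration; real hardware may transfer more.

```c
#include <stddef.h>

/*
 * Idealized code balance (byte/flop) of the loop
 *     for (i = 0; i < N; i += stride) z[i] = a[i] * 0.5f;
 *
 * Assumptions (illustrative, not from the lecture slides):
 *  - write-allocate: z[] is read before it is written
 *  - the working set is much larger than the cache
 *  - arrays are cache-line aligned and elem_bytes divides line_bytes
 *  - stride >= 1, and hardware prefetchers are ignored
 */
double strided_code_balance(size_t stride, size_t elem_bytes, size_t line_bytes)
{
    /* Distance in bytes between two consecutive accesses to one array. */
    size_t step_bytes = stride * elem_bytes;

    /*
     * Traffic per iteration and per data stream:
     *  - step <= line: every cache line is touched, so the whole step
     *    (used and unused elements) is transferred
     *  - step >  line: one full cache line is transferred per iteration
     */
    size_t bytes_per_stream = step_bytes < line_bytes ? step_bytes : line_bytes;

    /* Three streams: load a[], write-allocate load of z[], store of z[]. */
    double bytes_per_iteration = 3.0 * (double)bytes_per_stream;

    /* One multiplication per iteration. */
    double flops_per_iteration = 1.0;

    return bytes_per_iteration / flops_per_iteration;
}
```

Calling strided_code_balance(2, 4, 64) reproduces the 24 byte/flop from above, and a stride of 1 gives the familiar 12 byte/flop. Under this model the balance keeps growing with the stride until the stride reaches one cache line, after which it stays at three cache lines per flop.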
Best,
Georg.