Hello,
I'm trying to build an intuitive understanding of how memory traffic works for each element a[j][i] in a tiled outer product when non-temporal stores are disallowed.
In the first instance, the matrix A is created and, say, ends up in L2 (L1 -> L2). (This is not part of the loop, so I suppose it can be ignored?)
Next, we want to write to the matrix after the computation. Because of write-allocate, we first need to read the cache line before we can write to it (L2 -> L1).
Then we perform the write, and the modified line is eventually evicted (L1 -> L2).
We do this for every element of the matrix/tile, and I suppose the write-back to memory only happens after that (L2 -> L3 -> Mem), or does that also happen at an intermediate stage?
With this methodology, I find the following traffic per element:
3 times L1 traffic of 8 B
4 times L2 traffic of 8 B
1 time L3 traffic of 8 B
1 time memory traffic of 8 B
I think this captures the essence of what we are trying to do, namely letting the stores take place as close to the CPU as possible. But I am unsure how I should take the initialisation of the matrix A into account here. I could not find a slide from the lecture that clearly covers this sort of analysis, so I was wondering whether there is a resource available to better develop my intuition for these kinds of analyses.
Thank you in advance.
Thies
When analyzing such codes, think about the steady state. Data streams in and out of the core from/to the memory hierarchy. Startup effects (i.e., how do the first elements of the data structures come into the CPU) are ignored. A large data structure that is updated will be read from memory and written back to memory eventually. A smaller data structure that fits into some cache may be reused from there (if there is actually reuse).
Think of the CPU as a data pump, not as a machine with cogwheels ticking away.
Hope this helps,
Georg.