Hello,
I'm trying to build an intuitive understanding of how memory traffic works for each element a[j][i] in a tiled outer product when non-temporal stores are disallowed.
In the first instance, the matrix A is created and, say, ends up in L2 (L1 -> L2). (This is not part of the loop, so I suppose it can be ignored?)
Next, we want to write to the matrix after the computation. Because of write-allocate, we first need to read the cache line before we can write to it (L2 -> L1).
Then we perform the write, and the modified line is eventually evicted (L1 -> L2).
We do this for every element of the matrix/tile, and I suppose the write-back to memory only happens after that (L2 -> L3 -> Mem), or does that also happen at an intermediate stage?
With this methodology, I find the following traffic per element:
3 times L1 traffic of 8 B
4 times L2 traffic of 8 B
1 time L3 traffic of 8 B
1 time memory traffic of 8 B
I think this captures the essence of what we are trying to do, namely letting the stores take place as close to the CPU as possible. But I am unsure how I should take the initialisation of the matrix A into account here. I could not find a slide from the lecture that clearly covers this sort of analysis, so I was wondering whether there is a resource available to better develop my intuition for these kinds of analyses.
Thank you in advance.
Thies
When analyzing such codes, think about the steady state. Data streams in and out of the core from/to the memory hierarchy. Startup effects (i.e., how do the first elements of the data structures come into the CPU) are ignored. A large data structure that is updated will be read from memory and written back to memory eventually. A smaller data structure that fits into some cache may be reused from there (if there is actually reuse).
Think of the CPU as a data pump, not as a machine with cogwheels ticking away.
Hope this helps,
Georg.