subject

The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:

for (i = 0; i < 3; i++) {
for (j = 0; j < 3; j++) {
output[j][i] = input[i][j];
}
}

Assume both the input and output matrices are stored in the row major order (row major order means row index changes fastest). Assume you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (so you don’t have to worry about cache conflicts) LRU replacement level 1 data cache with 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles, always hit in the level 2 cache, and the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate fetch-on-write policy for write misses. Unrealistically assume writing back dirty cache blocks requires 0 cycles.

For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a. (5 pts) What block size should be used to completely fill the data cache with one input and output block if the level 1 cache is fully associative 64 KB?

b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance independent of both arrays’ position in memory?

c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream, and detect and prefetch nonunit strides. Assume prefetches write directly into the cache and no pollution (overwriting data that needs to be used before the data that is prefetched). For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches need to be outstanding at a given time?

ansver
Answers: 3

Another question on Computers and Technology

question
Computers and Technology, 22.06.2019 03:30
Identify at least three types of characteristics that you were asked about as you the computer identify a fruit.
Answers: 3
question
Computers and Technology, 23.06.2019 23:30
The keyboard usually has six rows of keys. which of the following is not one of the key group categories? letter keys number keys control keys graphic keys
Answers: 1
question
Computers and Technology, 24.06.2019 06:30
Me and category do i put them in because this is science
Answers: 1
question
Computers and Technology, 24.06.2019 09:30
Retype the statements, correcting the syntax errors. system.out.println("num: " + songnum); system.out.println(int songnum); system.out.println(songnum " songs"); note: these activities may test code with different test values. this activity will perform two tests: the first with songnum = 5, the second with songnum = 9. see how to use zybooks.
Answers: 1
You know the right answer?
The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the tra...
Questions
question
Mathematics, 03.08.2019 06:00
Questions on the website: 13722361