The problem is as follows:
For a pipelined processor, there are 3 clock cycles of latency for a
multiplication operation, 2 clock cycles of latency for any other ALU
operation, and 1 clock cycle of latency for any branch or load/store
operation.
Let AR0 be an auxiliary register and R0, R1, and R2 be data registers. For the
following C code,
for(i=1;i<=256;i++){a+=f(i)*g(i);}
Let the associated assembly code be as follows.
Loop: LOAD R0, 0(AR0) ;R0=*(AR0)
LOAD R1, 1024(AR0) ;R1=*(AR0+1024)
MPY R0, R0, R1 ;R0=R0*R1
ADD R2, R2, R0 ;R2=R2+R0
SUB AR0, AR0, #1 ;AR0=AR0-1
JNZ AR0, Loop ;Jump to Loop if AR0!=0
The initial conditions of the registers and the data arrangement are set such
that they are suitable for the execution of the corresponding C code.
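To make the mapping between the C loop and the assembly concrete, here is a minimal Python sketch that emulates the six instructions. Everything about the memory layout is an assumption on my part, not something the problem states: word addressing, AR0 starting at 256, f(i) stored at address i and g(i) at address i+1024. The functions `f` and `g` are hypothetical placeholders, and JNZ is taken in its usual sense (jump while the register is nonzero).

```python
def f(i):
    # Hypothetical stand-in for f(i); the problem does not define it.
    return i


def g(i):
    # Hypothetical stand-in for g(i); the problem does not define it.
    return 2 * i + 1


def run_loop(memory, ar0, r2=0):
    """Emulate the assembly loop and return the final accumulator R2."""
    while True:
        r0 = memory[ar0]           # LOAD R0, 0(AR0)
        r1 = memory[ar0 + 1024]    # LOAD R1, 1024(AR0)
        r0 = r0 * r1               # MPY  R0, R0, R1
        r2 = r2 + r0               # ADD  R2, R2, R0
        ar0 = ar0 - 1              # SUB  AR0, AR0, #1
        if ar0 == 0:               # JNZ falls through once AR0 reaches 0
            return r2


# Assumed data arrangement: f(i) at address i, g(i) at address i + 1024.
memory = {}
for i in range(1, 257):
    memory[i] = f(i)
    memory[i + 1024] = g(i)

result = run_loop(memory, ar0=256)
expected = sum(f(i) * g(i) for i in range(1, 257))
print(result == expected)
```

The loop walks i from 256 down to 1 rather than from 1 up to 256, which gives the same sum.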
(1) How are stalls inserted into the above program if no scheduling is
performed?
(2) Reschedule the above program such that the least number of clock cycles is
required for the job.
(3) Find the number of clock cycles required based on your design in (2).
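One way to check a stall count mechanically is a small issue-order simulation. The sketch below is only that, a sketch under assumptions I am making myself: a single-issue in-order pipeline, only read-after-write register dependences inside one iteration, and no modelling of the branch latency or of dependences carried into the next iteration. The names `LATENCY`, `insert_stalls`, and `extra` are hypothetical. The parameter `extra` selects how "latency" is read: with `extra=1` a dependent instruction needs `latency` intervening cycles after its producer, with `extra=0` it needs `latency - 1`.

```python
# Latencies from the problem statement, in clock cycles.
LATENCY = {"LOAD": 1, "MPY": 3, "ADD": 2, "SUB": 2, "JNZ": 1}


def insert_stalls(program, extra=1):
    """Return the issue sequence for one iteration, with 'stall' entries.

    program: list of (opcode, destination register or None, source registers).
    extra:   assumed convention; a consumer may issue no earlier than
             producer_issue + latency + extra.
    """
    ready = {}   # register -> earliest cycle at which a consumer may issue
    cycle = 0    # next free issue slot
    sequence = []
    for op, dest, sources in program:
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        sequence.extend(["stall"] * (issue - cycle))
        sequence.append(op)
        if dest is not None:
            ready[dest] = issue + LATENCY[op] + extra
        cycle = issue + 1
    return sequence


# The loop body in its original order.
original = [
    ("LOAD", "R0", ["AR0"]),
    ("LOAD", "R1", ["AR0"]),
    ("MPY", "R0", ["R0", "R1"]),
    ("ADD", "R2", ["R2", "R0"]),
    ("SUB", "AR0", ["AR0"]),
    ("JNZ", None, ["AR0"]),
]

for convention in (1, 0):
    seq = insert_stalls(original, extra=convention)
    print(convention, seq.count("stall"), len(seq))
```

Under these assumptions the original order gets 6 stall slots (12 slots per iteration) with `extra=1` and 3 stall slots (9 slots) with `extra=0`. Which convention the book intends is not something this sketch can settle.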
------------------------------------------------------------------------------------
(1) Loop: LOAD R0, 0(AR0)
LOAD R1, 1024(AR0)
stall
stall
stall
MPY R0, R0, R1
stall
stall
stall
ADD R2, R2, R0
SUB AR0, AR0, #1
stall
stall
stall
JNZ AR0, Loop
(2) Loop: LOAD R0, 0(AR0)
LOAD R1, 1024(AR0)
SUB AR0, AR0, #1
stall
stall
MPY R0, R0, R1
stall
stall
stall
ADD R2, R2, R0
JNZ AR0, Loop
(3) 15*256 clock cycles
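As a sanity check on the arithmetic, the slots in listings (1) and (2) above can simply be counted. This is a rough tally that assumes one clock cycle per instruction or stall slot and ignores pipeline fill and drain; the names `listing_1`, `listing_2`, and `total_cycles` are just for illustration.

```python
ITERATIONS = 256  # the C loop runs for i = 1..256

# Slots per iteration, copied from listing (1) (no scheduling).
listing_1 = [
    "LOAD", "LOAD", "stall", "stall", "stall",
    "MPY", "stall", "stall", "stall",
    "ADD", "SUB", "stall", "stall", "stall", "JNZ",
]

# Slots per iteration, copied from listing (2) (rescheduled).
listing_2 = [
    "LOAD", "LOAD", "SUB", "stall", "stall",
    "MPY", "stall", "stall", "stall",
    "ADD", "JNZ",
]


def total_cycles(listing, iterations=ITERATIONS):
    """Slots per iteration times the number of iterations."""
    return len(listing) * iterations


for name, listing in (("(1)", listing_1), ("(2)", listing_2)):
    print(name, len(listing), listing.count("stall"), total_cycles(listing))
```

Counted this way, listing (1) has 15 slots per iteration (15*256 = 3840) and listing (2) has 11 slots per iteration (11*256 = 2816).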
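The above is the answer I wrote myself. Because it differs from the answer
given in the book, I would like to ask the experts here to check whether I got
anything wrong...
Thanks!!!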
--
※ Posted from: 批踢踢實業坊 (ptt.cc)
◆ From: 223.143.226.251