Parallella Community

by **timpart** » Sun Mar 30, 2014 1:17 am

My theory is that there is a fault in the multiply-add instruction on one of the cores, one that doesn't show up if you multiply 1.0 by 1.0 and then add those numbers together. Alternatively, the rounding mode from the config register isn't working properly and truncation is being used instead of rounding.

Perhaps an alternative test that tries a selection of "randomly" chosen numbers then compares the result to a known good outcome? (Perhaps best in assembler to get exact bit patterns, or some tricky C that interprets float as an int bit pattern.) Doesn't need to be a matrix operation, just some arithmetic statements.
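A minimal sketch of the "tricky C" variant, for illustration only. The names `float_to_bits`, `bits_to_float`, `mac_cases` and `check_mac_cases` are made up here and are not part of matmul. The operands are chosen so that round-to-nearest-even and truncation give different bit patterns. Whether the compiler actually emits the multiply-add instruction for `a * b + c` is an assumption that would need checking in the generated assembler.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy a float's storage into a 32-bit integer (avoids pointer-cast aliasing). */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Build a float from an exact 32-bit pattern. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hypothetical test vectors: a * b + c with the expected IEEE 754
 * single-precision result under round-to-nearest-even. */
struct mac_case { uint32_t a, b, c, expected; };

static const struct mac_case mac_cases[] = {
    /* 1.5 * 1.5 + 0.25 = 2.5, exact, no rounding involved */
    { 0x3FC00000u, 0x3FC00000u, 0x3E800000u, 0x40200000u },
    /* (1 + 2^-23) * 1.5: exact tie, rounds to even; truncation gives 0x3FC00001 */
    { 0x3F800001u, 0x3FC00000u, 0x00000000u, 0x3FC00002u },
    /* (1 + 2^-23) * 1.75: above the halfway point; truncation gives 0x3FE00001 */
    { 0x3F800001u, 0x3FE00000u, 0x00000000u, 0x3FE00002u },
};

/* Run every case, print any mismatch, and return the number of failures. */
static int check_mac_cases(void)
{
    int failures = 0;
    size_t n = sizeof mac_cases / sizeof mac_cases[0];
    for (size_t i = 0; i < n; i++) {
        /* volatile keeps the compiler from folding the arithmetic at build time */
        volatile float a = bits_to_float(mac_cases[i].a);
        volatile float b = bits_to_float(mac_cases[i].b);
        volatile float c = bits_to_float(mac_cases[i].c);
        volatile float r = a * b + c;
        uint32_t got = float_to_bits(r);
        if (got != mac_cases[i].expected) {
            printf("case %lu: got %08lx, expected %08lx\n",
                   (unsigned long)i, (unsigned long)got,
                   (unsigned long)mac_cases[i].expected);
            failures++;
        }
    }
    return failures;
}
```

Calling `check_mac_cases()` on each core and on the host, and reporting the return value, would show whether any of them disagrees with the expected patterns.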

Tim

by **timpart** » Sun Mar 30, 2014 9:21 am

Woke up this morning with some extra thoughts on the problem.

It might be a fault in the core's round-to-even logic, which only comes into play when a result is exactly halfway between two representable values.
It might be the host Zynq that has the problem. (I suspect the problem is in a floating point unit. Operating systems don't use float arithmetic for their core functions, so the system could happily boot and run with a faulty float unit.)

For the host problem...
Let's make sure matmul uses reproducible random numbers. In the host's main put srand(1); near the top before any matrices are initialised. The C standard says that calling rand() without a prior srand() behaves as if srand(1) had been called, so this should be the case anyway, but let's be sure...

Get matmul to dump out both host and Epiphany result matrices.

Run matmul on a system that works, save the output.

Run matmul on the problem system and again save the output.

I believe you have already modified matmul to identify discrepancies.

Look at any discrepancy on the faulty system. Note down the answers from host and Epiphany.

Locate that entry in the output from the known good system. Find out which one is really wrong.
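The steps above could be sketched roughly as follows. This is not matmul's actual code: `fill_matrix`, `dump_matrix` and `first_mismatch` are hypothetical helpers, and the way matmul really initialises its matrices may differ. Dumping each element as a hex bit pattern is a choice made here so that the saved outputs from two boards can be compared with diff without decimal printing hiding a one-bit difference.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fill n floats from a fixed seed so every run starts from the same data. */
static void fill_matrix(float *m, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        m[i] = (float)rand() / (float)RAND_MAX;
}

/* Write one element per line as "name row col hexbits". */
static void dump_matrix(FILE *out, const char *name,
                        const float *m, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            uint32_t bits;
            memcpy(&bits, &m[r * cols + c], sizeof bits);
            fprintf(out, "%s %lu %lu %08lx\n", name,
                    (unsigned long)r, (unsigned long)c, (unsigned long)bits);
        }
    }
}

/* Return the index of the first element whose bit pattern differs, or -1. */
static long first_mismatch(const float *x, const float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(&x[i], &y[i], sizeof x[i]) != 0)
            return (long)i;
    return -1;
}
```

With something like `dump_matrix(stdout, "host", host_c, n, n)` and the same call for the Epiphany result on both boards, the entry that differs on the faulty system can be looked up by row and column in the output from the good system.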

Hope this helps,

Tim

by **Calle** » Mon Mar 31, 2014 4:09 pm

by **Calle** » Mon Mar 31, 2014 4:15 pm

by **Calle** » Mon Mar 31, 2014 4:37 pm

I can also add that I have now tried, in this board, another SD card which produced successful results on another Parallella board, and it too produces faulty results sometimes.

by **Calle** » Mon Mar 31, 2014 6:25 pm

by **Calle** » Mon Mar 31, 2014 6:28 pm

The forum does not allow me to post pictures over 256 KiB, so no picture of my setup today.

by **9600** » Mon Mar 31, 2014 6:38 pm

I've just increased the limit to 1M.

Cheers,

Andrew

by **Calle** » Mon Mar 31, 2014 6:52 pm

Thanks Andrew, I resized it anyway. Here is my setup.

by **ysapir** » Mon Mar 31, 2014 11:36 pm

If the error log you posted contains all errors, then it looks like the errors are concentrated in one or two matrix rows. Try reducing the test to small matrices; that may give us better insight into the faulty point.

Another thing to try, to isolate the problem, is to multiply matrices where only specific elements are nonzero (preferably small positive integers), so that the result matrix is all zeros except for the selected element(s). For example, if only row P of matrix A and column Q of matrix B are nonzero, the result matrix C will have only element (P,Q) nonzero.
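As a rough illustration of that probe, here is a host-side sketch. The names `make_probe`, `matmul_ref` and `only_pq_nonzero` are invented for this example, and the row-major layout is an assumption about how the matrices are stored. Row P of A holds the small integers 1..n and column Q of B holds ones, so the only nonzero product element is C(P,Q) = n(n+1)/2, which is exact in single precision for small n.

```c
#include <stddef.h>

/* Zero both n x n matrices, then set row p of A to 1..n and column q of B to 1. */
static void make_probe(float *a, float *b, size_t n, size_t p, size_t q)
{
    for (size_t i = 0; i < n * n; i++) {
        a[i] = 0.0f;
        b[i] = 0.0f;
    }
    for (size_t k = 0; k < n; k++) {
        a[p * n + k] = (float)(k + 1);
        b[k * n + q] = 1.0f;
    }
}

/* Plain triple-loop reference multiply, C = A * B, row-major storage. */
static void matmul_ref(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Return 1 if element (p,q) is nonzero and every other element is zero. */
static int only_pq_nonzero(const float *c, size_t n, size_t p, size_t q)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int expect_nonzero = (i == p && j == q);
            if ((c[i * n + j] != 0.0f) != expect_nonzero)
                return 0;
        }
    }
    return 1;
}
```

Running the same check on the Epiphany result for each (P,Q) in turn would point at the rows, columns or cores where a stray nonzero or a wrong sum appears.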


Matmul-16 example gets stuck
