ctimer test - random failures

Postby cmcconnell » Wed Sep 03, 2014 12:15 am

I should start by saying that I haven't updated the SDK or the Ubuntu distribution since I received my board in May. (So apologies if I'm reporting a bug which has already been seen and fixed.)

I'm getting weird behaviour with the ctimer example, in epiphany-examples. If I run it while logged in locally, it will quite often report no errors, but much of the time one or two cores will report some failures.

Curiously, if I log in via ssh and run it, then the failures are much more common. (More cores failing more of the time.)

Here is the output from a couple of sample runs via ssh -
Code:
linaro-nano:~/Parallella/epiphany-examples/ctimer> ./run.sh
Message from eCore 0x808 ( 0, 0):

test02 CTimer passed!

Message from eCore 0x809 ( 0, 1):

test02 CTimer passed!

Message from eCore 0x80a ( 0, 2):

test02 CTimer passed!

Message from eCore 0x80b ( 0, 3):
test02 CTimer failed! cycles spent on event CLK                is 534560 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480161 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

Message from eCore 0x848 ( 1, 0):

test02 CTimer passed!

Message from eCore 0x849 ( 1, 1):

test02 CTimer passed!

Message from eCore 0x84a ( 1, 2):

test02 CTimer passed!

Message from eCore 0x84b ( 1, 3):
test02 CTimer failed! cycles spent on event CLK                is 534816 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480849 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

Message from eCore 0x888 ( 2, 0):

test02 CTimer passed!

Message from eCore 0x889 ( 2, 1):

test02 CTimer passed!

Message from eCore 0x88a ( 2, 2):

test02 CTimer passed!

Message from eCore 0x88b ( 2, 3):
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 484305 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 1!

Message from eCore 0x8c8 ( 3, 0):

test02 CTimer passed!

Message from eCore 0x8c9 ( 3, 1):

test02 CTimer passed!

Message from eCore 0x8ca ( 3, 2):
test02 CTimer failed! cycles spent on event CLK                is 535280 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 483025 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

Message from eCore 0x8cb ( 3, 3):
test02 CTimer failed! cycles spent on event CLK                is 535376 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480289 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

linaro-nano:~/Parallella/epiphany-examples/ctimer>


Code:
linaro-nano:~/Parallella/epiphany-examples/ctimer> ./run.sh
Message from eCore 0x808 ( 0, 0):

test02 CTimer passed!

Message from eCore 0x809 ( 0, 1):

test02 CTimer passed!

Message from eCore 0x80a ( 0, 2):
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 484305 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 1!

Message from eCore 0x80b ( 0, 3):
test02 CTimer failed! cycles spent on event CLK                is 534624 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480385 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

Message from eCore 0x848 ( 1, 0):

test02 CTimer passed!

Message from eCore 0x849 ( 1, 1):

test02 CTimer passed!

Message from eCore 0x84a ( 1, 2):

test02 CTimer passed!

Message from eCore 0x84b ( 1, 3):
test02 CTimer failed! cycles spent on event CLK                is 534576 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480929 cycles!                        Expecting 510000 cycles!
test02 CTimer failed! cycles spent on event EXT_LOAD_STALLS    is 114156 cycles!                        Expecting 120500 cycles!

test02 CTimer failed! Number of faults is 3!

Message from eCore 0x888 ( 2, 0):

test02 CTimer passed!

Message from eCore 0x889 ( 2, 1):

test02 CTimer passed!

Message from eCore 0x88a ( 2, 2):
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 484481 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 1!

Message from eCore 0x88b ( 2, 3):

test02 CTimer passed!

Message from eCore 0x8c8 ( 3, 0):

test02 CTimer passed!

Message from eCore 0x8c9 ( 3, 1):

test02 CTimer passed!

Message from eCore 0x8ca ( 3, 2):
test02 CTimer failed! cycles spent on event CLK                is 535968 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 483745 cycles!                        Expecting 510000 cycles!

test02 CTimer failed! Number of faults is 2!

Message from eCore 0x8cb ( 3, 3):
test02 CTimer failed! cycles spent on event CLK                is 534656 cycles!                        Expecting 565000 cycles!
test02 CTimer failed! cycles spent on event EXT_FETCH_STALLS   is 480065 cycles!                        Expecting 510000 cycles!
test02 CTimer failed! cycles spent on event EXT_LOAD_STALLS    is 114348 cycles!                        Expecting 120500 cycles!

test02 CTimer failed! Number of faults is 3!

- different cores affected, and reporting different figures.


I'm hoping this is indicative of a problem with the test code, rather than with my board. Does anyone else see similar behaviour?
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: ctimer test - random failures

Postby ralphmcardell » Wed Sep 03, 2014 3:12 pm

Hello Colin,

This is an interesting observation to me because I have a mini-cluster of 4 Parallellas (3 A101040s from the 4 supplied as Kickstarter rewards and 1 P1601-DK02 - the 4th KS reward board has real problems using the Epiphany so has been placed on other duties!).

The main access method for the mini-cluster is remote access via SSH - I have only connected a monitor to the HDMI output to check what is going on while (not) booting. All boards exhibit problems occasionally when running the matmul-16 example in that some executions never finish (the host is constantly in the busy loop waiting for the Epiphany to signal it is done). I do not think I have tried the ctimer example - I should probably give it a go some time.

Although I have not been able to verify or quantify it, I have had the suspicion when running repeated matmul-16 executions that the failure rate seemed to increase the more 'things' a board was doing over the network.

Other occurrences of this problem seem to have been fixed by changing PSU but so far I have tried (with various boards and collections of boards) 3 PSUs - including running the latest arrival P1601-DK02 board from one of the adapters supplied by Adapteva with no joy.

Your post has therefore made me wonder if there is in fact more to my suspicion of network activity / Epiphany problems than mere suspicion!

Regards Ralph
ps: more on my problems can be found in this thread: viewtopic.php?f=50&t=1438
ralphmcardell
 
Posts: 12
Joined: Mon Dec 17, 2012 3:25 am
Location: London UK

Re: ctimer test - random failures

Postby cmcconnell » Wed Sep 03, 2014 5:36 pm

Hi Ralph,

Well, at present I'm hoping there is not a problem with my board, but rather with the ctimer test program. (For what it's worth, I have no issues with matmul, including over ssh.)

Looking at the code, the ctimer test reports a failure if any result is more than 5% different from the expected value. I'm surprised by that, as I would have thought this was an entirely deterministic test (i.e. the results should always be identical)?

That being the case, if there is a logical reason for the results to vary, the only issue may be that the 5% tolerance that was chosen for the test is not sufficient.
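To put numbers on the 5% tolerance, here is a minimal sketch of such a check applied to the figures reported for eCore 0x84b in the second run. It is not the actual ctimer source: the names `TOLERANCE`, `deviation` and `passes` are made up for illustration, and the exact comparison the real test performs is an assumption.

```python
# Hypothetical re-creation of a 5% tolerance check; NOT the real ctimer code.
TOLERANCE = 0.05  # 5%, as described in the post


def deviation(measured, expected):
    """Relative difference between a measured count and its expected value."""
    return abs(measured - expected) / expected


def passes(measured, expected, tolerance=TOLERANCE):
    """True if the measured count is within the tolerance (assumed rule)."""
    return deviation(measured, expected) <= tolerance


# (event, measured, expected) copied from the eCore 0x84b report, second run.
samples = [
    ("CLK", 534576, 565000),
    ("EXT_FETCH_STALLS", 480929, 510000),
    ("EXT_LOAD_STALLS", 114156, 120500),
]

for event, measured, expected in samples:
    verdict = "passed" if passes(measured, expected) else "failed"
    print(f"{event:18s} {deviation(measured, expected):6.2%} off -> {verdict}")
```

All three counts come out between 5% and 6% below the expected value, so they only just miss the tolerance rather than being wildly out.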

So I'd be grateful for a clarification of the intended behaviour of the ctimer test, plus whether or not other people see the same variability in the results that I do.

Thanks,
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: ctimer test - random failures

Postby cmcconnell » Mon Sep 08, 2014 4:01 am

cmcconnell wrote:That being the case, if there is a logical reason for the results to vary, the only issue may be that the 5% tolerance that was chosen for the test is not sufficient.

So I'd be grateful for a clarification of the intended behaviour of the ctimer test, plus whether or not other people see the same variability in the results that I do.


I think I've partially figured out the answer to my own question -

The code under test uses the standard library, linked into external SDRAM. Hence you can get variability of the counters EXT_FETCH_STALLS, EXT_LOAD_STALLS, and CLK, as seen in the results I posted.

You'd expect variability between cores, due to the differing numbers of hops on the eMesh involved, and variability between runs, due to activity on the ARM causing contention for the FPGA/memory resources.
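As a rough illustration of the hop-count idea, the sketch below decodes the core IDs printed in the output into mesh coordinates and computes a column distance to an exit point. The IDs in the output are consistent with a row in the upper bits and a column in the low six bits (0x808 is reported as (0, 0), 0x84b as (1, 3)). The choice of column 3 as the place where off-chip traffic leaves the mesh is purely an assumption for this example, and `hops_to_exit` is a hypothetical helper, not part of the SDK.

```python
# Sketch only: the exit column for off-chip traffic is an ASSUMPTION.
COL_BITS = 6          # low six bits of the core ID hold the column
COL_MASK = 0x3F
ORIGIN_ID = 0x808     # the core reported as ( 0, 0) in the output


def coords(core_id):
    """Split a core ID into absolute (row, col) mesh coordinates."""
    return core_id >> COL_BITS, core_id & COL_MASK


def relative_coords(core_id, origin_id=ORIGIN_ID):
    """Coordinates relative to the first core, as printed by the test."""
    row, col = coords(core_id)
    origin_row, origin_col = coords(origin_id)
    return row - origin_row, col - origin_col


def hops_to_exit(core_id, exit_col=3):
    """Column distance to a hypothetical exit column (assumed, not verified)."""
    _, col = relative_coords(core_id)
    return abs(exit_col - col)


for core_id in (0x808, 0x80b, 0x84b, 0x8ca):
    print(hex(core_id), relative_coords(core_id), hops_to_exit(core_id))
```

Changing `exit_col` shows how the distance for each core would differ if the traffic actually left the mesh somewhere else.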

But I still don't know if there is anything out of the ordinary about my particular results. It's disconcerting to run a test and be greeted with the words 'failed!' and 'faults'.

So, please, could someone (from Adapteva and/or a fellow Parallella owner) give me some feedback on this.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: ctimer test - random failures

Postby notzed » Mon Sep 08, 2014 4:28 am

I haven't looked at it but I think it could only provide meaningful results for an idle system unless it had other metrics from the other memory accessing subsystems. The memory has a fixed bandwidth and it is shared amongst several systems (cpu, framebuffer, i/o dma, epiphany) many of which can saturate its capacity on their own. If it's already busy then someone has to wait.

(in short, either it's only intended for an idle system, or it just isn't a very good test).

Even on-chip timing isn't completely deterministic because of the round-robin scheduling of the mesh - but it shouldn't vary much.

Those numbers don't look very far from expectations so I would only be worried if they failed to execute at all or were some multiple out.
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: ctimer test - random failures

Postby cmcconnell » Mon Sep 08, 2014 6:34 am

Looking more closely, the code under test does not in fact use any libraries, but it's being built with legacy.ldf, so everything will be in SDRAM.

When I get the chance I may experiment to see how it behaves with internal.ldf.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

