Unexpected Mesh Traffic?


Postby GreggChandler » Mon Jun 05, 2017 4:39 am

I am attempting to measure "mesh distance" from each of the 16 Epiphany III cores to external memory. The application makes 300 measurements for each core using the hardware timers. A ten-second sleep based upon e_wait() executes on the core before the test. While any individual core is being tested, all of the other cores are idling (that is, they have executed the "idle" instruction). While the measurements are executing on the core, the single-threaded ARM application is sleeping, although I have mapped core memory, register memory, and external memory into the ARM address space. No DMA is active during the tests, and no interrupts appear to be executing. The test itself executes out of core memory.

When I measure internal memory, the standard deviation is 0, which is expected; the mesh should not be used. When I time a read after a write to external memory (counting cycles until the read value matches the written value), I see a standard deviation that is at times up to 12 percent of the measured time, and at other times less than 1 percent. I imagine that the read responses are contending with the read requests for mesh bandwidth. Is there a source of mesh traffic that I have not considered, or is the mesh traffic just that variable?
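For reference, the timing kernel is essentially the following. This is a simplified sketch rather than my verbatim source, and the shared-memory address and token values are illustrative:

Code: Select all
#include <e_lib.h>

#define NSAMPLES 300

/* Illustrative address in the Parallella's shared-DRAM segment. */
volatile unsigned *ext_cell = (unsigned *)0x8f000000;
unsigned samples[NSAMPLES];   /* read by the host from core SRAM afterwards */

int main(void)
{
    int i;
    for (i = 0; i < NSAMPLES; i++) {
        unsigned token = 0xA5A50000u | i;

        e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);    /* timer counts down   */
        e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);  /* count core clocks   */

        *ext_cell = token;                /* write request onto the mesh  */
        while (*ext_cell != token)        /* read until the write lands   */
            ;

        e_ctimer_stop(E_CTIMER_0);
        samples[i] = E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);
    }

    __asm__ __volatile__("idle");         /* park the core for the host   */
    return 0;
}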

Re: Unexpected Mesh Traffic?

Postby sebraa » Mon Jun 05, 2017 9:48 am

Read requests and read replies use different meshes, so they shouldn't clash.

However, external accesses (especially reads) are much slower, so you might just be seeing the difference between "the Nth read succeeds" and "the (N+1)th read succeeds". Try to check this. Also, you can fire off multiple writes to external memory and only then read back the results; this should be more consistent.
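An untested sketch of what I mean; the shared-buffer address and the batch size are made up:

Code: Select all
#include <e_lib.h>

/* Fire off a batch of posted writes first, then time only the reads. */
static unsigned timed_batch_readback(void)
{
    volatile unsigned *ext = (volatile unsigned *)0x8f000000;
    unsigned sum = 0;
    int i;

    for (i = 0; i < 8; i++)
        ext[i] = 0xC0DE0000u | i;    /* posted writes; these can pipeline   */

    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
    for (i = 0; i < 8; i++)
        sum += ext[i];               /* each read is a full mesh round trip */
    e_ctimer_stop(E_CTIMER_0);
    (void)sum;

    return E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);   /* timer counts down */
}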

Re: Unexpected Mesh Traffic?

Postby GreggChandler » Mon Jun 05, 2017 5:22 pm

The documentation states that "all" read requests go out via the rMesh network (architectural reference, p. 25). Since the request is to external memory, that implies the data should be returned on the xMesh network (next paragraph, same page). The same manual, however, suggests that the xMesh network is for "passing through transactions destined for another chip" (p. 22). It is reasonable to expect that external memory is treated as another chip, in which case one would expect the request and the reply to both be on the xMesh. At some point the transaction will necessarily be destined for the eLink, and the issuing core would appear to determine the outbound network. Does everything that eventually makes it to the eLink go via the xMesh network?

The documentation further suggests that the request and response should propagate one hop per clock, but that each core "has a maximum throughput of 1 read transaction every 8 clock cycles in each routing direction" (p. 22). Does this refer to reads issued, read responses, or both? Do the 8 clocks apply at each hop, or only at the first hop?

If the external memory is at the end of every row, the maximum number of hops to external memory is 4. If the external memory is hung off of only one row, that is 7 hops to external memory. The return path should take a similar number of clocks, so the actual routing of a read request to external memory is significant. I have been unable to find documentation that details the mesh network topology to this level of precision.

In theory, I have tried to isolate all of the mesh traffic; perhaps I have missed something. I doubt that the issue is a variation of the "n" or "n+1" read as you suggest. (See the data provided.) It is a little tough to save much data, but run-to-run variation for a particular core can range from less than 1 percent standard deviation to over 8 percent, with many values in between.

Here is a histogram of the results from Core 0 to external memory, showing the number of clocks before the data matches versus the number of occurrences of that value. Arguably, +/- 8 clocks should be considered equivalent, depending upon one's interpretation of the manual:

Code: Select all
uMin: 463
uMax: 1047
Hist[ 463]:    1
Hist[ 471]:    1
Hist[ 839]:  203
Hist[ 842]:    1
Hist[ 847]:   28
Hist[ 855]:   11
Hist[ 863]:    4
Hist[ 879]:    2
Hist[ 887]:    4
Hist[ 895]:    3
Hist[ 903]:    1
Hist[ 919]:    1
Hist[ 927]:    1
Hist[ 935]:    1
Hist[ 943]:    2
Hist[ 959]:    2
Hist[ 975]:    2
Hist[ 983]:    4
Hist[ 991]:    3
Hist[ 999]:    3
Hist[1007]:    5
Hist[1015]:    3
Hist[1023]:    6
Hist[1031]:    5
Hist[1039]:    2
Hist[1047]:    1
Total:  300

Re: Unexpected Mesh Traffic?

Postby sebraa » Tue Jun 06, 2017 9:41 pm

GreggChandler wrote:The documentation states that "all" read requests go out via the rMesh network (architectural reference, p. 25). Since the request is to external memory, that implies the data should be returned on the xMesh network (next paragraph, same page). The same manual, however, suggests that the xMesh network is for "passing through transactions destined for another chip" (p. 22). It is reasonable to expect that external memory is treated as another chip, in which case one would expect the request and the reply to both be on the xMesh.
Your analysis matches my understanding of the system as well: The read request to shared memory takes the rMesh, the read reply and the write take the xMesh; the cMesh is not involved.

GreggChandler wrote:The documentation further suggests that the request and response should propagate one hop per clock, but that each core "has a maximum throughput of 1 read transaction every 8 clock cycles in each routing direction" (p. 22). Does this refer to reads issued, read responses, or both? Do the 8 clocks apply at each hop, or only at the first hop?
In order to get a double-word transaction to do one hop per cycle, the cMesh must be 64 bits wide. However, looking at https://github.com/parallella/oh/blob/master/src/elink/hdl/elink.v, both rMesh and xMesh seem to be 8 bits wide. If all transactions are treated as maximum size, each hop on these meshes would then take 8 cycles, which would apply to both read requests and read responses to external memory. This also matches the often-quoted "16x slower" figure for external reads.
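As a back-of-envelope check (my arithmetic, resting on the assumptions above and on your 7-hop worst case, not on the manual):

Code: Select all
/* 7 hops to the east edge, at 8 cycles per hop on an 8-bit mesh:        */
/*   request:  7 hops * 8 cycles/hop = 56 clocks                         */
/*   reply:    7 hops * 8 cycles/hop = 56 clocks                         */
/* => ~112 clocks of pure mesh time; the remainder of the ~840 measured  */
/*    clocks would be eLink serialization, FPGA logic, and DRAM latency. */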

GreggChandler wrote:If the external memory is at the end of every row, the maximum number of hops to external memory is 4. If the external memory is hung off of only one row, that is 7 hops to external memory.
I do not know for sure, but this should be possible to measure. Since external memory was "insanely slow" in our context, I never cared about measuring its performance. I do remember that I always arrived at very consistent cycle times eventually (minus some bugs and timer issues).

GreggChandler wrote:In theory I have tried to isolate all of the mesh traffic. Perhaps I have missed something.
Running the exact same deterministic code 300x should result in perfect (or almost-perfect) cycle times. If you get high variation, you may have found a bug. I noticed that sometimes the timers would stop for no obvious reason, producing garbage. To work around this, I used to measure only small sections of code plus the overhead of starting/stopping the timers. (I also wrote a program to watch the timer configuration from the outside, and could see the timers stopping after a few hundred cycles, too. Something is fishy, but it never got explained.)

GreggChandler wrote:I doubt that the issue is a variation of the "n" or "n+1" read as you suggest. (See the data provided.) It is a little tough to save much data, but run-to-run variation for a particular core can range from less than 1 percent standard deviation to over 8 percent, with many values in between.
Doing the same thing multiple times should produce consistent results. If it doesn't, you are likely doing something wrong (Epiphany and eLink are completely deterministic), or you may be seeing interference from the host (e.g. memory accesses being delayed by ARM accesses or DRAM refresh). Your data doesn't really support that theory either, though.

I won't be able to be of much help with the actual work, though, since I consider the Epiphany a dead horse not worth riding any longer.

Re: Unexpected Mesh Traffic?

Postby GreggChandler » Wed Jun 07, 2017 8:14 pm

While I agree that my code could have a bug, I don't believe that it does in this case. Furthermore, although I agree that the measurements should be deterministic, as measured they appear to vary considerably. The code is quite simple: disable external access to the system, disable processor interrupts, initialize the timer, execute 2-4 instructions, stop the timer, re-enable interrupts, re-enable external access to the system, and return the data to the host. (I also remove the timer enable/disable overhead from my calculation.)
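The overhead removal works roughly as follows; again, this is a simplified sketch, not my verbatim source:

Code: Select all
#include <e_lib.h>

/* Back-to-back start/stop captures the timer overhead ("uTare" below). */
static unsigned measure_tare(void)
{
    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
    e_ctimer_stop(E_CTIMER_0);               /* nothing timed in between */
    return E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);
}

/* Each raw sample then has the tare subtracted before the statistics
   are computed:  cycles = raw - measure_tare();                        */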

I increased my sample size to 5,000. What I have observed is that the mesh is quiet enough that the correct value is returned on either the first or the second read after the write--always. The outliers in the 400s correspond to the data being returned on the first read, which is consistent with the observed data; the others all correspond to the data being returned on the second read. This implies that some reads can complete in as few as 463 clocks, while others are closer to 525 clocks in the data above.

Furthermore, one can see the data pile up every eight clocks--usually, but not always. This appears to correlate with the one-read-every-8-clocks specification. The eastern cores have lower access times than the western ones, which is consistent with external memory hanging off the east side of the chip. The data is too variable to determine whether the request goes north/south or just east/west.

Based upon the data I have observed, I am compelled to conclude that the read requests are arbitrated at every hop towards external memory, independent of whether the core attached to the node is idled. This arbitration, although not random, does not appear deterministic. Obviously, the ARM could be contributing to the latency in returning the data; however, it was my understanding (perhaps incorrectly) that the external shared Epiphany memory does not participate in the ARM's caches, etc.

Heisenberg warns that if I start watching the timers from the ARM, I will destroy my test by generating some of the very traffic I am seeking to avoid.

Data based upon 5,000 reads:
Code: Select all
Sleep: 90
Sleep Done
CoreId: 0x00000808
pInt: 0x8f0083f8
    [ 0]842.8( 60.9)
uTare: 28
uMin:  467
uMax:  1086
Hist[ 467]:   16
Hist[ 471]:    1
Hist[ 474]:    1
Hist[ 475]:   31
Hist[ 483]:   29
Hist[ 491]:    8
Hist[ 619]:    1
Hist[ 621]:    1
Hist[ 659]:    1
Hist[ 826]:    7
Hist[ 827]:    1
Hist[ 828]:    4
Hist[ 829]:    2
Hist[ 830]: 1101
Hist[ 831]:    5
Hist[ 832]:    4
Hist[ 833]:   10
Hist[ 834]:    4
Hist[ 835]:    5
Hist[ 836]:    6
Hist[ 837]:    2
Hist[ 838]: 1072
Hist[ 839]:    3
Hist[ 840]:    4
Hist[ 841]:   13
Hist[ 842]:    7
Hist[ 843]:    6
Hist[ 844]:   12
Hist[ 845]:    3
Hist[ 846]: 1638
Hist[ 847]:    5
Hist[ 848]:    5
Hist[ 849]:   12
Hist[ 851]:    3
Hist[ 852]:    4
Hist[ 853]:    1
Hist[ 854]:  477
Hist[ 855]:    6
Hist[ 856]:    3
Hist[ 857]:   13
Hist[ 858]:    3
Hist[ 859]:    1
Hist[ 860]:    4
Hist[ 862]:  241
Hist[ 870]:   14
Hist[ 878]:    6
Hist[ 886]:    3
Hist[ 894]:    1
Hist[ 910]:    1
Hist[ 921]:    1
Hist[ 934]:    4
Hist[ 938]:    1
Hist[ 942]:    5
Hist[ 950]:    1
Hist[ 956]:    1
Hist[ 958]:    4
Hist[ 964]:    1
Hist[ 966]:    7
Hist[ 975]:    1
Hist[ 977]:    1
Hist[ 982]:    4
Hist[ 990]:    7
Hist[ 993]:    1
Hist[ 995]:    1
Hist[ 998]:    3
Hist[1000]:    1
Hist[1002]:    1
Hist[1004]:    1
Hist[1006]:   15
Hist[1011]:    1
Hist[1014]:    5
Hist[1022]:    7
Hist[1024]:    2
Hist[1030]:   33
Hist[1031]:    1
Hist[1037]:    1
Hist[1038]:   39
Hist[1044]:    1
Hist[1046]:   24
Hist[1049]:    1
Hist[1054]:   14
Hist[1057]:    1
Hist[1062]:    1
Hist[1070]:    4
Hist[1078]:    1
Hist[1086]:    2
Total:         5000
Avg:            842.8
StdDev:          60.9
Avg-Stalls:     834.0
StdDev-Stalls:   60.0
Mean:           842.8
Median:         846.0
Mode:           846
Usage:           17.2%
Elapsed: 00:01:39.000

Re: Unexpected Mesh Traffic?

Postby sebraa » Thu Jun 08, 2017 10:36 am

GreggChandler wrote:Furthermore, although I agree that the measurements should be deterministic, as measured they appear to vary considerably.
Does this change if your memory cell is in a different core's local memory? In that case, the cycle counts should be exactly the same (for each core separately, of course).

GreggChandler wrote:What I have observed is that the mesh is quiet enough that the correct value is returned on either the first or second read after the write--always.
That sounds correct, as in "n or n+1".

GreggChandler wrote:Based upon the data I have observed, I am compelled to conclude that the read requests are arbitrated at every hop towards external memory, independent of whether the core attached to the node is idled.
All routers must stay active at all times, because the routing algorithm doesn't support detours. Also, while a core might be idle, its DMA engines can still produce requests. Arbitration is always done locally at each hop, but it should be consistent unless there is cross-traffic (i.e. without collisions, each packet should be forwarded immediately). I have not seen any evidence to the contrary, ...

GreggChandler wrote:This arbitration, although not random, does not appear deterministic. Obviously, the ARM could be contributing to the latency in returning the data; however, it was my understanding (perhaps incorrectly) that the external shared Epiphany memory does not participate in the ARM's caches, etc.
...so your non-determinism must come from the eLink glue logic or the ARM memory system itself. I consider the latter highly likely: while the shared-memory part of the ARM's memory is uncached, the memory chip itself may be busy (the ARM cores are accessing other parts of memory, and DRAM needs to be refreshed periodically). It may also simply be a case of synchronization delays due to asynchronous clocking between the two systems.


If you test on-chip remote memory, you should see perfect cycle counts; I don't know whether you can force accesses through the off-chip mesh without them hitting the ARM memory (not without writing your own eLink glue logic). Connecting the NORTH/SOUTH interfaces would only make requests cycle forever... so I don't think you can take the ARM out of that equation without a lot of work.

When your tests are done, you can check whether the CTIMER source is still the same one that you set. However, your values seem reasonable, so I assume that your timers work correctly.
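An untested host-side sketch of what I mean; the CONFIG bit positions ([7:4] for the CTIMER0 mode) are my reading of the architecture manual, so verify them against your SDK headers:

Code: Select all
#include <stdio.h>
#include <e-hal.h>

int main(void)
{
    e_epiphany_t dev;
    unsigned cfg = 0;

    e_init(NULL);
    e_open(&dev, 0, 0, 4, 4);                   /* whole 4x4 workgroup    */
    e_read(&dev, 0, 0, E_REG_CONFIG, &cfg, sizeof(cfg));
    printf("CTIMER0 mode: 0x%x\n", (cfg >> 4) & 0xF);   /* 0x1 = clocks   */
    e_close(&dev);
    e_finalize();
    return 0;
}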

Re: Unexpected Mesh Traffic?

Postby GreggChandler » Thu Jun 08, 2017 3:27 pm

sebraa wrote:so your non-determinism must come from the eLink glue logic or the ARM memory system itself


I don't think so; my point is that the node arbitration at each hop is being delayed in a non-deterministic way even though the core is "idled". (I understand the router must remain active; otherwise the system couldn't access external memory or other cores beyond an interposing idled core.) There is no ongoing DMA or left-over memory traffic from/to the idled core; I tried to make sure of that before "idling" it. The rMesh arbitration doesn't appear sophisticated enough to pass the request through immediately at the core-to-core hops even when there is nothing else in the queue at that particular node (except possibly in another direction). The clue is the "1 in 8 per direction" comment quoted below: the router appears to forward the request in a 1-of-8 pattern. At each hop, the time slots appear fixed rather than based upon when data is available--possibly with some intentional jitter. I believe that I have created a situation where a single request is working its way to external memory and back. The number of clock cycles for each request and response to traverse the mesh network is highly variable--that is unexpected.

sebraa wrote:Does this change if your memory cell is in a different core's local memory?


As per the initial post in this thread, I am measuring traffic to external memory, not to memory in another core. When I test against internal memory or adjacent-core memory, the timings are much more consistent; however, they are not identical, nor would I expect them to be. From the architectural manual,

rMesh: Used for all read requests. The rMesh network connects a mesh node to all four of its neighbors and has a maximum throughput of 1 read transaction every 8 clock cycles in each routing direction.


Based upon my results from external memory, if your results for memory in another core are identical, I would suspect a problem with your methodology--unless you have a way to synchronize with the router's 1-in-8 clock. If you do have a way to synchronize with that clock, I would be most interested in how to do so. The only way that I know to synchronize with the routing clock would be to issue a read, which would sync the first read with the clock--assuming no jitter. By paying careful attention to the clocks, the next request could be well synchronized. However, in that case, the first read would take a different number of clocks to complete than subsequent reads.
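In code, the trick I have in mind would look something like this (hypothetical and untested):

Code: Select all
#include <e_lib.h>

/* Burn one read to align with the router's 1-in-8 slot, then time the
   next read. Whether this alignment actually holds is the open question. */
static unsigned timed_aligned_read(volatile unsigned *cell)
{
    (void)*cell;                               /* dummy read syncs to slot */
    e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
    e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
    (void)*cell;                               /* timed, slot-aligned read */
    e_ctimer_stop(E_CTIMER_0);
    return E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);
}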

sebraa wrote:That sounds correct, as in "n or n+1".


The "n or n+1" thought does not explain the variation in the read response times. It explains 463/471 vs 839 et al. in the initial data. My curiosity is not why the 2 of 300 aren't in the 800's, but the variation in the 298 of 300--where all the data are returned in two read requests. (However, 463 vs 471 further documents my point.)

I think one of the most interesting observations in my data is the way that the data "clumps" in multiples of 8. My suspicion is that this clumping is related to the documented behavior of the rMesh routing throughput. It is important to remember that I am measuring not "instruction cycles" but "clock cycles".

Here is some data for a request to memory in another core; notice the CoreId and pInt values at the top of the table (ignore the stall-related data, which does not have the overhead removed):

Code: Select all
Sleep Done
CoreId: 0x00000808
pInt: 0x80905af8
    [ 0] 23.0(  0.6)
uTare: 29
uMin:  23
uMax:  37
Hist[  23]: 4976
Hist[  24]:    1
Hist[  25]:    2
Hist[  26]:    3
Hist[  28]:    1
Hist[  29]:    5
Hist[  30]:    4
Hist[  31]:    1
Hist[  33]:    1
Hist[  34]:    1
Hist[  35]:    2
Hist[  36]:    1
Hist[  37]:    2
Total:         5000
Avg:             23.0
StdDev:           0.6
Avg-Stalls:      29.0
StdDev-Stalls:    0.0
Mean:            23.0
Median:          23.0
Mode:            23
Usage:            2.6%
Elapsed: 00:00:39.000

Re: Unexpected Mesh Traffic?

Postby sebraa » Thu Jun 08, 2017 5:21 pm

GreggChandler wrote:
sebraa wrote:Does this change if your memory cell is in a different core's local memory?
As per the initial post in this thread, I am measuring traffic to external memory, not to memory in another core. When I test against internal memory or adjacent-core memory, the timings are much more consistent; however, they are not identical, nor would I expect them to be. [...] Based upon my results from external memory, if your results for memory in another core are identical, I would suspect a problem with your methodology--unless you have a way to synchronize with the router's 1-in-8 clock.
I just checked some of my raw results from a few years back, and you are right. The computational parts of my algorithm were cycle-identical, but communication was not. For off-chip communication, the standard deviation was much larger than for on-chip communication as well. Sorry that I had forgotten about this. :-)

GreggChandler wrote:The number of clock cycles for each request and response to traverse the mesh network is highly variable--that is unexpected.
I would not have expected this. Since on-chip remote reads are affected as well, the ARM's memory cannot be the only culprit--which points at the mesh.

GreggChandler wrote:If you do have a way to synchronize with the 1-in-8 clock, I would be most interested in how to do so. The only way that I know to synchronize with the routing clock would be to issue a read, which would sync the first read with the clock--assuming no jitter. By paying careful attention to the clocks, the next request could be well synchronized. However, in that case, the first read would take a different number of clocks to complete than subsequent reads.
In my first project, I only did next-neighbor reads, and all cores ran exactly the same kernel and--as accurately as the barriers would allow--in lockstep, so any effect would have been very small. My later projects did not use reads at all, and on-core writes do not jitter.


I have to admit that I can't really help you any further, except by stating that your analysis seems sound. While I assume some people (e.g. Andreas, Ola) know--or once knew--these details, I don't think you'll get a conclusive answer.