Parallella Memory

Any technical questions about the Epiphany chip and Parallella HW Platform.

Moderator: aolofsson

Re: Parallella Memory

Postby jrambo316 » Wed Nov 26, 2014 6:38 pm

Table 1 on page 20 of the Epiphany reference manual describes the memory model. It seems to me that the read core X / read core X and read core X / read core Y cases would be applicable here. Both are listed as deterministic. What am I missing?
jrambo316
 
Posts: 4
Joined: Tue Nov 25, 2014 6:16 pm

Re: Parallella Memory

Postby cmcconnell » Wed Nov 26, 2014 7:35 pm

jrambo316 wrote:Table 1 on page 20 of the Epiphany reference manual describes the memory model. It seems to me that the read core X / read core X and read core X / read core Y cases would be applicable here. Both are listed as deterministic. What am I missing?

Plus, the same doc also explicitly says that it is possible to run code from another core's memory -

6.2 Mesh-Node Crossbar Switch

The local memory in a processor node is split into 4 banks that are 8 bytes wide. The banks can be accessed in 1-byte transfers or in 8-byte transfers. All banks can be accessed once per clock cycle and operate at the same frequency as the CPU. The memory system in a single processor node thus supports 32GB/sec memory bandwidth at an operating frequency of 1 GHz.

Four masters can access the processor node local memory simultaneously:

Instruction Fetch: This master fetches one 8-byte instruction from local memory into the instruction decoder of the program sequencer. The CPU’s maximum instruction issue rate is two 32-bit instructions per clock cycle, so in heavily loaded program conditions, the program sequencer can access a memory bank on every clock cycle. The instruction-fetch logic can also fetch instructions directly from external memory or from other cores within the Epiphany fabric.
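For illustration, here's a minimal sketch of what calling code resident in another core's SRAM might look like with e-lib. FN_OFFSET and the function type are hypothetical: they assume a routine has already been placed at that offset in the target core's memory (e.g. via a custom linker section).

#include <e_lib.h>

#define FN_OFFSET 0x4000  /* hypothetical local offset of the shared routine */

typedef int (*remote_fn_t)(int);

int call_remote(unsigned row, unsigned col, int arg)
{
    /* e_get_global_address() maps a core-local address to its global
       alias for core (row, col); instruction fetches then travel over
       the mesh, so this is correct but slow compared to local code. */
    remote_fn_t fn = (remote_fn_t)e_get_global_address(row, col, (void *)FN_OFFSET);
    return fn(arg);
}

It works because every core's local memory also has a globally addressable alias - the fetches just have to cross the mesh.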
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Parallella Memory

Postby greytery » Wed Nov 26, 2014 9:09 pm

Colin,

I am 'delighted' :oops: to be RTFM'd about the instruction fetch. Thanks.

But...
greytery wrote: even if it did, the mesh would be running flat out and you'd have no bandwidth left to shovel the data. Also, the mesh is a weak memory-order model, and there would be no guarantee of the order of instructions

On second thoughts, maybe the instruction fetch order would sort itself out, but what about any associated data fetches/writes?
Exciting times.
Although 32GB/sec bandwidth (at 1GHz - not the Parallella's actual clock) gives great on-core performance, as soon as a core goes onto the mesh for code, that bandwidth will fall away due to contention. If the fetch goes out to the shared memory (via the eLink) then z z z z.
A DMA page-fetch approach is likely to prove more efficient than a single-shot code fetch.
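Something along these lines, perhaps - a rough sketch assuming e-lib's e_dma_copy() and position-independent code in the page (the buffer size and code_src argument are purely illustrative):

#include <e_lib.h>

#define PAGE_SIZE 2048  /* illustrative */

static unsigned char page[PAGE_SIZE] __attribute__((aligned(8)));

typedef void (*page_fn_t)(void);

void run_paged(void *code_src)
{
    /* One burst DMA transfer into local SRAM, then the instruction
       fetch runs at on-core speed instead of over the mesh. */
    e_dma_copy(page, code_src, PAGE_SIZE);
    ((page_fn_t)page)();  /* assumes position-independent code */
}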

Some advice from the guy wot designed this would be informative. :?:

Cheers
tery
greytery
 
Posts: 205
Joined: Sat Dec 07, 2013 12:19 pm
Location: ^Wycombe, UK

Re: Parallella Memory

Postby cmcconnell » Wed Nov 26, 2014 10:05 pm

greytery wrote: but what about any associated data fetches/writes?

The sort of model I've been tentatively thinking about would involve mostly local data, as it would be the different data that distinguishes one core (or workgroup) from another, while they execute the same code.

greytery wrote:Exciting times.
Although 32GB/sec bandwidth (at 1GHz - not the Parallella's actual clock) gives great on-core performance, as soon as a core goes onto the mesh for code, that bandwidth will fall away due to contention. If the fetch goes out to the shared memory (via the eLink) then z z z z.
A DMA page-fetch approach is likely to prove more efficient than a single-shot code fetch.

It might depend on the nature of the application. Ideally, what I would like is to share any large but rarely invoked code between cores, while keeping the more frequently invoked stuff in the speedy, one-per-core arrangement.

If a function were to be shared, then I suppose it ought to be loaded into one of the cores towards the centre of the matrix, so that the fetching of the instructions by other cores would involve as few hops as possible.

And, space permitting, there could be more than one copy of a function, but still fewer copies than cores. E.g., a workgroup might consist of 2 or 4 cores, with some functions shared among the cores in the workgroup. You've then got 8 or 4 instances of this workgroup on an E16 chip, with only the near neighbours communicating with each other.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Parallella Memory

Postby Melkhior » Wed Nov 26, 2014 10:22 pm

greytery wrote:Also, the mesh is a weak memory-order model, and there would be no guarantee of the order of instructions


I think you're being overly pessimistic about weak memory ordering and the memory consistency model.

There are a couple of cases where you normally don't need to worry about memory ordering:

1) any area of memory that is only ever read (which will be the case for non-self-modifying code)
2) any area of memory that is only ever accessed by a single processing unit

Whenever memory is read, completion of the load is stalled until the data arrives. The data dependency on registers then stalls everything that depends on the load. Some other load can complete out of order, but that won't alter the semantics of the instruction flow.

Whenever a single process reads and writes, the guarantee is that a load will observe the value of the most recent store to the same address, which is what you'd expect.

If you only share one address, you might have synchronization issues and cache coherency issues (not on the Parallella, for the obvious reason that there's no cache), but you're still not in memory-consistency territory.

The memory consistency model and the memory ordering specifications matter when you share more than one piece of data. The canonical example is:

0) X is 0, Y is 0
1A) process A does:
    write X = 42;
    write Y = 1;
1B) at the same "time", process B does:
    repeat read Y until it is 1;
    read X;

Most people expect X to be read as 42 in process B. However, there's no such guarantee under a weak memory ordering model. The write Y=1 can complete and propagate from A to B before X=42 does, so process B can see that Y is 1 while X is still seen as 0. This problem is what breaks e.g. Dekker's algorithm on a weakly ordered memory consistency model.
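In Epiphany-flavoured C the same hazard looks like this (remote_x and remote_y stand for addresses in another core's memory or in shared DRAM; the names are illustrative):

volatile int *remote_x;  /* illustrative: points into another core's memory */
volatile int *remote_y;  /* or into shared DRAM                             */

void process_a(void)         /* the producer */
{
    *remote_x = 42;          /* data write                              */
    *remote_y = 1;           /* flag write - under weak ordering it may */
}                            /* propagate before the data write does    */

int process_b(void)          /* the consumer */
{
    while (*remote_y != 1)   /* spin on the flag */
        ;
    return *remote_x;        /* may still observe 0, not 42 */
}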

Rule of thumb - if there's no sharing of _writable_ data, you're in the clear :-)

Update: for those interested, I suggest Parallel Computer Architecture: A Hardware/Software Approach by Culler et al. It's a great textbook.

Update 2: clarify *where* we expect X=42.
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Parallella Memory

Postby aolofsson » Wed Nov 26, 2014 11:19 pm

Just so that we are clear regarding point #2.

The following scenario is a gotcha!
[edit: Bad example below! This could only happen with DMA, not with a regular LDR instruction]

1.) CoreA writes aa to addr X
2.) CoreA reads from addr X
3.) CoreA write bb to addr X

The value read back in step 2 could be aa or bb, depending on network traffic. Very likely that aa is returned, but not guaranteed

This is only an issue when X is outside CoreA's local memory.

[edit: New example]
A more likely scenario is the following:
0.) CoreA writes aa to addr X
....a lot of other stuff happens
1.) CoreA writes bb to addr X
2.) CoreA reads from addr X
Lesson: don't set any kind of program status/sync flag outside of local memory without doing an explicit path flush, as described here:
viewtopic.php?f=49&t=984
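A minimal sketch of one such flush, assuming the read-back approach (spin until the flag's new value is observed, which confirms the write has landed; see the linked topic for the authoritative description):

static inline void set_flag_flushed(volatile int *flag)
{
    *flag = 1;            /* posted write travels across the mesh        */
    while (*flag != 1)    /* read back until the new value is observed - */
        ;                 /* only then is the write known to have landed */
}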
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Parallella Memory

Postby notzed » Thu Nov 27, 2014 2:59 am

aolofsson wrote:Just so that we are clear regarding point #2.

The following scenario is a gotcha!

1.) CoreA writes aa to addr X
2.) CoreA reads from addr X
3.) CoreA write bb to addr X

The value read back in step 2 could be aa or bb, depending on network traffic. Very likely that aa is returned, but not guaranteed

This is only an issue when X is outside CoreA's local memory.


Doesn't 2 stall until it returns? Or does that stall not prevent step 3 from getting far enough through the pipeline to fire off the write transaction?

Is there a 3rd possibility that it gets neither aa nor bb (whatever was there before)?
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Parallella Memory

Postby Melkhior » Thu Nov 27, 2014 7:01 am

aolofsson wrote:The value read back in step 2 could be aa or bb, depending on network traffic. Very likely that aa is returned, but not guaranteed


Whoa. I had read 4.2 too quickly; the bullet points on page 19 seemed to guarantee "the usual stuff". But Table 1 confirms that you don't get sequential consistency even in the single-core case once you go out-of-core. Interesting choice...

Edit: it might explain a "weird" hang I've observed. It's possible I was depending on single-core SC to the shared memory...
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Parallella Memory

Postby greytery » Thu Nov 27, 2014 5:50 pm

!Z,
notzed wrote:Doesn't 2 stall until it returns?

" 5.3 Read Transactions Read transactions are non-blocking..."
A read to another node is asynchronous across the rMesh network. The result is written back via the cMesh network (or xMesh if it's off-chip/eLink).
The order of read requests between two nodes should(?) be maintained because the routing is fixed.
The order of the return write results between those two nodes should (?) be maintained because the return route (cMesh/xMesh) is also fixed.
.... (Assuming the writer does not re-order on that core, there's no message loss - and not sure what happens on eLink which is another mux point).

notzed wrote:Is there a 3rd possibility that it gets neither aa nor bb (whatever was there before)?

Yes - I think so - if the cMesh was 'saturated' at that time.
The weak ordering is mainly due to the separation of the read and write networks, and to the sheer randomness of the traffic load at each mesh node, which affects latency.
Suppose the instantaneous traffic profile across the on-chip network were mostly writes, or alternatively mostly reads - the latencies would differ, and then the race is on!
The same sort of re-ordering happens on the internet - but there, sequence numbers and higher-level protocols are supposed to sort it out.
This behaviour is part of the Epiphany, and it needs to be accommodated.
Looks like we may need some e_safe_read() and e_safe_write() routines in the SDK (and ezesdk). :D
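Something like this, perhaps - hypothetical wrappers (not real SDK calls) built on the read-back flush from aolofsson's link:

static inline void e_safe_write(volatile int *dst, int val)
{
    *dst = val;
    while (*dst != val)   /* spin until the write is observed remotely */
        ;
}

static inline int e_safe_read(volatile const int *src)
{
    return *src;          /* the load stalls the pipeline until data returns */
}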

Cheers,
tery
greytery
 
Posts: 205
Joined: Sat Dec 07, 2013 12:19 pm
Location: ^Wycombe, UK

Re: Parallella Memory

Postby notzed » Thu Nov 27, 2014 11:38 pm

greytery wrote:!Z,
notzed wrote:Doesn't 2 stall until it returns?

" 5.3 Read Transactions Read transactions are non-blocking..."
A read to another node is asynchronous across the rMesh network. The result is written back via the cMesh network (or xMesh if it's off-chip/eLink).
The order of read requests between two nodes should(?) be maintained because the routing is fixed.
The order of the return write results between those two nodes should (?) be maintained because the return route (cMesh/xMesh) is also fixed.
.... (Assuming the writer does not re-order on that core, there's no message loss - and not sure what happens on eLink which is another mux point).


Not talking about the mesh, but the instruction pipeline. Even the 'non-blocking' writes stall the pipeline for about 10 cycles afaict, but reads also have to wait for the reply to return.

notzed wrote:Is there a 3rd possibility that it gets neither aa nor bb (whatever was there before)?

Yes - I think so - if the cMesh was 'saturated' at that time.
The weak ordering is mainly due to the separation of the read and write networks, and to the sheer randomness of the traffic load at each mesh node, which affects latency.
Suppose the instantaneous traffic profile across the on-chip network were mostly writes, or alternatively mostly reads - the latencies would differ, and then the race is on!
The same sort of re-ordering happens on the internet - but there, sequence numbers and higher-level protocols are supposed to sort it out.
This behaviour is part of the Epiphany, and it needs to be accommodated.
Looks like we may need some e_safe_read() and e_safe_write() routines in the SDK (and ezesdk). :D

Cheers,


I've found ways to deal with most of the issues, usually via local-read/remote-write pairs and reusable primitives that hide the details.
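A minimal sketch of that pairing, assuming e-lib (names are illustrative): each core spins only on its own local copy of a mailbox, which is a deterministic local read, and signals by writing into its partner's copy.

#include <e_lib.h>

volatile int mailbox = 0;   /* lives in THIS core's local SRAM */

void signal_partner(unsigned row, unsigned col)
{
    /* Global alias of the partner core's copy of 'mailbox' - a remote write. */
    volatile int *remote =
        (volatile int *)e_get_global_address(row, col, (void *)&mailbox);
    *remote = 1;
}

void wait_for_signal(void)
{
    while (mailbox == 0)    /* local read: deterministic, strongly ordered */
        ;
    mailbox = 0;            /* reset for the next round */
}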
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia
