Parallella Community - Supercomputing for Everyone
Thread feed: https://parallella.org/forums/feed.php?f=13&t=3987

Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18806#p18806
Statistics: Posted by GreggChandler — Sat Mar 25, 2017 9:08 pm


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18802#p18802

I chose to put my data in shared external memory rather than core memory based on my understanding of the system architecture as documented in the eSDK and architecture manuals I have access to. I have not read the eSDK source to determine how the loader works, but from the eSDK documentation it does not appear that the host can directly map core memory the way it can map external memory. Using the e_read() and e_write() functions to access core memory, while possible, is less than elegant. I could copy the block of data to the host, access it, and copy it back--but without a mutex, that would be problematic (grin). I presume that the loader and the other eSDK functions that access the registers do so through the mesh network protocol via the FPGA and the link protocol, but I have not studied this yet. I could possibly write some assembly code to make this happen, but I am not that far along yet.
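Roughly, the two access paths look like this on the host side. This is only a minimal sketch assuming the usual eHAL calls (e_alloc/e_read/e_write and friends); the offsets and workgroup size are made-up example values, not anything from the actual program.

[code]
#include <e-hal.h>

int main(void)
{
    e_epiphany_t dev;
    e_mem_t      emem;
    int          value = 0;

    e_init(NULL);
    e_reset_system();
    e_open(&dev, 0, 0, 4, 4);                      /* 4x4 workgroup (example) */

    /* Shared external memory: host and cores can both reach this buffer. */
    e_alloc(&emem, 0x01000000, sizeof(value));     /* offset is an example value */
    e_write(&emem, 0, 0, 0, &value, sizeof(value));
    e_read(&emem, 0, 0, 0, &value, sizeof(value));

    /* Core-local memory: same calls, but addressed by (row, col) plus a
       core-local offset -- the copy-in/copy-out path called "less than
       elegant" above. */
    e_write(&dev, 0, 0, 0x4000, &value, sizeof(value));
    e_read(&dev, 0, 0, 0x4000, &value, sizeof(value));

    e_close(&dev);
    e_free(&emem);
    e_finalize();
    return 0;
}
[/code]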

To test my code, I implemented a third shared-memory data structure that supports formatted printing from a core, through the host, to the console. I guarded access to the data structures with my mutex from both the host and the core processors. I also created lightweight C++ classes layered over the eSDK definitions to isolate me from eSDK changes and to relieve the tedium of much of the code I found myself writing: one class wraps the eHAL functions, the other wraps the eLib functions. Lastly, I implemented a core application that executed random waits to simulate work and printed core data/numbers interspersed with barriers, to see it all work. It was thorough enough to flush out a bug or two in both the barrier and the mutex code; however, I would not claim either is exhaustively tested. The resulting host and core code does not look like any of the sample Epiphany applications I have read (although I have only read a few), but it does allow me to reuse algorithms and code more easily.
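For illustration, a shared "print mailbox" along these lines can be as simple as the structure below. This is just an assumed layout to show the idea, not the actual structure used here.

[code]
/* Hypothetical layout: a core formats a line into text[], sets full, and
   spins until the host has printed the line and cleared full again. The
   structure lives in shared external memory so both sides can reach it. */
typedef struct {
    volatile unsigned full;      /* 0 = empty, 1 = a message is waiting      */
    volatile unsigned core_id;   /* index of the core that wrote the message */
    char text[120];              /* formatted, NUL-terminated output line    */
} print_slot_t;
[/code]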

As for my code, I pretty much followed Lamport's pseudo-code as detailed in the link to his ACM article previously posted. It is not finished, and I have not analyzed performance yet, but it does appear to work. The downside of the algorithm is that thread number, or in my case core number, determines priority in locking the mutex. One can re-configure the priorities; however, as Lamport noted, thread/core 1/0 will always have priority over thread/core 16/15. (His algorithm indexes from one rather than zero.) The good news is that it is predictable. The bad news is that it is not fair. To implement fairness, one would probably need to add at least a primitive scheduler, as an operating system would. The implication of the prioritization is that, independent of the order of mutex lock requests, thread/core 1/0 will always have priority if it is waiting for the mutex.
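For reference, the lock/unlock pair of the bakery algorithm looks roughly like the following in C. This is a generic sketch of Lamport's published pseudo-code, not the attached bakery.cpp, and it ignores the weak memory ordering question raised elsewhere in this thread (fences would be needed on real hardware).

[code]
/* choosing[] and number[] live in shared memory, one slot per core, all
   initialized to zero; `me` is this core's index. */
void bakery_lock(volatile int *choosing, volatile int *number, int ncores, int me)
{
    int j, max = 0;

    choosing[me] = 1;                       /* announce that we are picking a ticket  */
    for (j = 0; j < ncores; j++)
        if (number[j] > max)
            max = number[j];
    number[me] = max + 1;                   /* take a ticket one past the current max */
    choosing[me] = 0;

    for (j = 0; j < ncores; j++) {
        while (choosing[j])                 /* wait while core j is picking its ticket */
            ;
        while (number[j] != 0 &&            /* defer to a lower ticket, or to a lower  */
               (number[j] < number[me] ||   /* core index when the tickets tie; this   */
                (number[j] == number[me] && j < me)))  /* tie-break is the built-in priority */
            ;
    }
}

void bakery_unlock(volatile int *number, int me)
{
    number[me] = 0;                         /* drop our ticket to release the lock */
}
[/code]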

P.S. My initial implementation dynamically allocated the buffers based upon the number of cores in the system; however, I found that executing a "new" or "malloc" pulled in far too much library code and data. I am exploring alternatives.
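One standard alternative (a sketch of the general technique, not necessarily the route taken here) is to size a static pool for the maximum possible core count so that no heap code gets linked in at all:

[code]
#define MAX_CORES 64                       /* compile-time upper bound (example value) */

typedef struct {
    int choosing;
    int number;
} bakery_slot_t;

/* Statically allocated; no new/malloc, so no heap support is pulled in. */
static bakery_slot_t slot_pool[MAX_CORES];

bakery_slot_t *get_slots(unsigned ncores)
{
    return (ncores <= MAX_CORES) ? slot_pool : 0;  /* reject counts beyond the bound */
}
[/code]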

Attachments:
e-test.cpp
bakery.h
bakery.cpp

Statistics: Posted by GreggChandler — Fri Mar 24, 2017 12:43 pm


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18797#p18797

No worries on the delay. I hope you're feeling better.

I discussed with a colleague applications where locking fairness is problematic, and he said he has experienced it in some applications, but it is rare, and that was with SMP x86. The problem isn't unique to Epiphany, and I'm optimistic that clever solutions in software are possible. I can appreciate the concerns about scaling to thousands or millions of cores; you may be right that some extra transistors will be needed at that scale.

If you're willing to share, I am interested in taking a look at your implementation of Lamport's algorithm with the external shared memory. Your decision to use external shared memory is curious to me, as the ARM host can access the Epiphany core memory.

If you've run code on Army clusters, we probably know some of the same people.

Take care.

Statistics: Posted by jar — Thu Mar 23, 2017 2:11 am


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18795#p18795

I looked up Lamport's bakery algorithm. My reading indicated it was far from a "fair" algorithm. In fact, it appears to be a highly prioritized algorithm: https://en.wikipedia.org/wiki/Lamport%2 ... _algorithm

Nonetheless, I coded it up with the data in external shared memory. Although it appears to work well, and may even let me include the host processor in the exclusion, I have not exhaustively tested it yet. It is very prioritized (as suggested in the above link)--unless I have a bug. As with the eSDK algorithm, it won't scale well either, at least when there is significant contention for the mutex (think thousands or millions of cores). I suspect that the real answer I am looking for will require additional transistors between the cores rather than relying on the inter-core communication already present--and that probably won't happen.

I also need to think through the Parallella's weak memory ordering to see if it preserves the integrity of Lamport's original algorithm and his proof: http://lamport.azurewebsites.net/pubs/bakery.pdf. A number of others appear to have tweaked the original in various ways to address the unbounded arithmetic; perhaps one of them fixed the "fairness"? Perhaps one of them generalizes beyond the memory-ordering assumptions? Perhaps memory ordering is not really significant? All interesting questions.

Additionally, I modified my original program to support moving the location of the eSDK mutex among the various cores. As before, I was surprised by the results. I would have expected the "closest" cores to have the most favored access to the mutex, but that does not always appear to be the case. There appears to be something in the inter-processor routing/access algorithm that I don't understand yet!
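For context, relocating the eSDK mutex just means naming a different core's coordinates (and its local copy of the mutex word) in the eLib calls. A minimal sketch, assuming the usual eLib prototypes; the coordinates here are chosen purely for illustration:

[code]
#include "e_lib.h"

/* The mutex word physically lives in the local memory of core
   (MUTEX_ROW, MUTEX_COL); every core passes those coordinates, so changing
   the two constants moves the mutex around the grid. Example values only. */
#define MUTEX_ROW 2
#define MUTEX_COL 1

e_mutex_t lock;   /* same local address on every core when they share one binary */

void do_critical_work(void)
{
    /* One core should call e_mutex_init(MUTEX_ROW, MUTEX_COL, &lock, NULL)
       once before the mutex is first used. */
    e_mutex_lock(MUTEX_ROW, MUTEX_COL, &lock);
    /* ... critical section ... */
    e_mutex_unlock(MUTEX_ROW, MUTEX_COL, &lock);
}
[/code]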

As for applications where the "fairness" is problematic, I am approaching the machine more as a general-purpose multi-core processor than as a dedicated co-processor or GPU. In the past, I have licensed software to the government (including the U.S. Army) and other organizations that was used to test large systems. Historically, that code was run on, at times, hundreds of processors (full boxed computers) that were used to load a server or test an application. The hundreds of boxes typically emulated thousands of simultaneous users. The coordination among those boxes to drive the tests implemented exclusion, barriers, and shared memory--although we did not use most of those terms, and most of the code was custom written from scratch by one of my customers. In fact, one installation used a separate LAN to isolate the generated network traffic from the application traffic to ensure fairness. If the exclusion algorithms were unfair, it could impact the test results, as bias would be introduced. (Not all processors were executing identical code/scripts or testing identical parts of the system/application.) I am exploring replacing hundreds of boxes with hundreds/16 Parallellas. A significant problem, however, is tying each of the Parallella's cores to the network. Taking all of the data through the ARM would be a problem; possibly putting additional NICs in the FPGA might help.

Statistics: Posted by GreggChandler — Wed Mar 22, 2017 5:38 pm


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18616#p18616

Epiphany, in this context, is not an SMP processor, so we shouldn't expect that kind of fairness. You can implement a bakery algorithm to introduce fairness into a locking scheme. External memory does indeed have worse latency, and it is not symmetric.

I have not run into a fairness issue with the testset locking scheme in practice. I would be interested in any applications where it is problematic.

Regarding barriers, feel free to peruse some of the OpenSHMEM code written for them:
https://github.com/USArmyResearchLab/op ... rier_all.c
https://github.com/USArmyResearchLab/op ... _barrier.c
https://github.com/USArmyResearchLab/op ... rier_all.c

The routines have been optimized for performance. The WAND barrier requires inline assembly. The software dissemination (butterfly) barrier is "loud" in the sense that it creates more network traffic than a tree-based barrier algorithm (not shown). There are a few barrier algorithms to choose from. If you find one that works better on Epiphany than the code above, I am interested in hearing about it.
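For readers who don't want to dig through the repository, the basic shape of a dissemination barrier is sketched below. This is a generic shared-memory version, not the linked OpenSHMEM code: on Epiphany the store into flags[partner][round] would be a remote write into the partner's local memory, and the episode counter would live in each core's own memory.

[code]
#define NCORES  16
#define NROUNDS 4                 /* ceil(log2(NCORES)) */

volatile int flags[NCORES][NROUNDS];   /* flags[i][k]: set by core i's round-k partner */

/* Each core calls this with its own index and its own episode counter.
   Round k: signal core (me + 2^k) mod NCORES, then wait to be signalled. */
void dissemination_barrier(int me, int *episode)
{
    int round = 0;
    int e = ++(*episode);                       /* monotonic counter, so flags never need resetting */

    for (int dist = 1; dist < NCORES; dist <<= 1, round++) {
        int partner = (me + dist) % NCORES;
        flags[partner][round] = e;              /* "loud": one message per core per round   */
        while (flags[me][round] < e)            /* spin until our round-k partner signals us */
            ;
    }
}
[/code]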

Statistics: Posted by jar — Thu Feb 23, 2017 5:22 am


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18613#p18613

I have to think about this more, but I would expect that most, if not all, mutex implementations for SMP processors such as a multicore x86 are built around some type of atomic conditional read/modify/write instruction. In the case of the Epiphany, as the number of cores increases, one can only expect that the "fairness" of the mutex implementation will get worse, i.e. it won't scale well as the range of distances to the mutex core increases. Furthermore, I would imagine that the mutex core's location is another important factor, e.g. the center of the grid versus the grid's origin, as in the example program provided by Adapteva. (In my results, wrapping around the end, core [0,3] appears to be the most favored position.)

As these results appear to be consequences of fundamental architectural features/constraints of the Epiphany, it is not clear how any amount of software scaffolding around the test/set instruction could compensate for the design. From my admittedly limited familiarity with the architecture, I would expect that moving the mutex to external memory would not reduce the bias I measured and would only serve to make it much slower. Does each core access external memory with the same penalties, or do those closer to the edge of the chip see faster response times? (I seem to remember reading somewhere that external memory hangs off one of the edges of the chip.)

As I wrote previously, if this is the expected behavior of the eSDK, then it should be documented. I don't believe that any of the other mutexes I have used (Unix, Linux, Microsoft Windows, Mac) exhibit such behavior. (Admittedly those are operating systems, not bare iron, and mostly running on x86, although the eSDK is not bare iron either.) Perhaps I need to retest them, however.

As you appear to have correctly anticipated, my next experiment was going to be integration with barriers, although perhaps you have saved me the time!

Statistics: Posted by GreggChandler — Wed Feb 22, 2017 3:48 pm


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18611#p18611

Your results aren't surprising. The mutex routines are essentially thin wrappers around the testset instruction. The testset instruction must propagate through the network and then return to the original core, so it's expected that this would lead to a load imbalance. If you need a "fair" or ordered locking method, that can be implemented in software. That is the solution, but you pay the price in performance.
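Conceptually, such a wrapper is just a spin loop around an atomic test-and-set. A generic C sketch, using a GCC builtin as a stand-in for the Epiphany TESTSET transaction (which in hardware round-trips through the mesh to the core holding the lock word):

[code]
/* Every failed attempt is another full network round trip to the lock word's
   home core, which is where the distance-dependent bias discussed in this
   thread comes from. */
static inline void spin_lock(volatile char *lock)
{
    while (__atomic_test_and_set((void *)lock, __ATOMIC_ACQUIRE))
        ;                               /* spin until we are the one to flip 0 -> 1 */
}

static inline void spin_unlock(volatile char *lock)
{
    __atomic_clear((void *)lock, __ATOMIC_RELEASE);
}
[/code]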

Regarding your comment in the other thread, the eSDK barrier isn't well optimized for performance. A dissemination barrier has been shown to be significantly faster. Additionally, there is a hardware WAND barrier if you're using all of the cores.

Statistics: Posted by jar — Wed Feb 22, 2017 4:24 am


Re: mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18608#p18608
Statistics: Posted by GreggChandler — Tue Feb 21, 2017 5:12 am


mutex performance
https://parallella.org/forums/viewtopic.php?t=3987&p=18607#p18607
Statistics: Posted by GreggChandler — Mon Feb 20, 2017 5:21 pm

