I apologize for the tardiness of this reply. I have been recovering from surgery. The Parallella was a distraction during my surgery preparation.
I looked up Lamport's bakery algorithm. My reading indicated it was far from a "fair" algorithm. In fact, it appears to be a highly prioritized algorithm:
https://en.wikipedia.org/wiki/Lamport%2 ... _algorithm

Nonetheless, I coded it up with the data in external shared memory. Although it appears to work well, and may even let me include the host processor in the exclusion, I have not exhaustively tested it yet. It is very prioritized (as suggested in the above link)--unless I have a bug. As with the eSDK algorithm, it won't scale well either--at least when there is significant contention for the mutex (think thousands or millions of cores). I suspect that the real answer I am looking for will require additional transistors between the cores, rather than relying on the inter-core communication already present--and that probably won't happen.
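For concreteness, what I coded has roughly the following shape--a minimal sketch of the bakery lock with the choosing/number arrays kept in external shared memory. The names, the NCORES constant, and the plain volatile accesses are placeholders of mine, not the actual code, and the usual caveat about the unbounded ticket values applies.

#include <stdbool.h>

#define NCORES 16   /* Epiphany-III work-group size; placeholder */

/* Both arrays live in external shared memory so that every core
   (and potentially the ARM host) sees the same copy. */
typedef struct {
    volatile bool     choosing[NCORES];
    volatile unsigned number[NCORES];
} bakery_t;

static unsigned bakery_max_ticket(const bakery_t *b)
{
    unsigned m = 0;
    for (int j = 0; j < NCORES; j++)
        if (b->number[j] > m)
            m = b->number[j];
    return m;
}

void bakery_lock(bakery_t *b, int i)      /* i = this core's index */
{
    /* Doorway: announce that we are choosing, then take a ticket
       one larger than any ticket currently visible. */
    b->choosing[i] = true;
    b->number[i]   = 1 + bakery_max_ticket(b);
    b->choosing[i] = false;

    /* Wait, in turn, for every core that is still choosing or that
       holds a smaller ticket (ties broken by core index). */
    for (int j = 0; j < NCORES; j++) {
        while (b->choosing[j])
            ;
        while (b->number[j] != 0 &&
               (b->number[j] <  b->number[i] ||
                (b->number[j] == b->number[i] && j < i)))
            ;
    }
}

void bakery_unlock(bakery_t *b, int i)
{
    b->number[i] = 0;    /* hand the ticket back */
}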
I also need to think through the Parallella's weak memory ordering to see if it preserves the integrity of Lamport's original algorithm and his proof.
http://lamport.azurewebsites.net/pubs/bakery.pdf

A number of others appear to have tweaked the original in various ways to address the unbounded arithmetic; perhaps one of them fixed the "fairness"? Perhaps one of them generalizes beyond sequentially consistent memory ordering? Perhaps memory ordering is not really significant? All interesting questions.
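To make the memory-ordering worry concrete: Lamport's proof assumes that other cores observe the writes to choosing[i] and number[i] in program order, which a weakly ordered interconnect does not guarantee. A conservative patch to the doorway of the sketch above would be explicit fences--assuming the toolchain supports C11 atomic_thread_fence() and that it lowers to an appropriate barrier on the Epiphany, which I would still need to verify:

#include <stdatomic.h>

/* Doorway of bakery_lock() with explicit fences so that the
   choosing[i]/number[i] writes become visible to other cores in
   program order.  Whether these fences are sufficient (or even
   necessary) on the Epiphany mesh is exactly the open question. */
static void bakery_doorway(bakery_t *b, int i)
{
    b->choosing[i] = true;
    atomic_thread_fence(memory_order_seq_cst);  /* publish "I'm choosing" first */
    b->number[i] = 1 + bakery_max_ticket(b);
    atomic_thread_fence(memory_order_seq_cst);  /* publish the ticket */
    b->choosing[i] = false;
}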
Additionally, I modified my original program to support moving the location of the eSDK mutex between the various cores. As before, I was surprised by the results. I would have expected that the "closest" cores would have the most favored access to the mutex. That does not appear to always be the case. There appears to be something in the inter-processor routing/access algorithm that I don't understand yet!
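For anyone wanting to reproduce the experiment, the device-side code is roughly the following: the mutex is hosted in the local memory of one chosen core, and every core hammers it while bumping a per-core counter in shared DRAM; sampling the counters from the host mid-run shows which cores are being favored. MUTEX_ROW/MUTEX_COL, the iteration count, the shared-memory address, and the indexing are placeholders of mine, and the e_mutex_* signatures should be checked against your eSDK version.

#include <e_lib.h>

#define MUTEX_ROW  0         /* core chosen to host the mutex; placeholder */
#define MUTEX_COL  0
#define ITERATIONS 100000    /* placeholder */

/* Declared on every core so the local address is the same everywhere;
   only the copy on core (MUTEX_ROW, MUTEX_COL) is actually used. */
e_mutex_t m;

/* Per-core acquisition counters in external shared DRAM; the address is
   a placeholder for wherever the host maps the shared buffer. */
volatile unsigned *wins = (volatile unsigned *)0x8f000000;

int main(void)
{
    unsigned row, col;
    e_coords_from_coreid(e_get_coreid(), &row, &col);
    unsigned me = row * 4 + col;   /* linear index within a 4x4 work group */

    /* Only the hosting core initializes the mutex; a real run would
       synchronize startup so this happens before any core locks it. */
    if (row == MUTEX_ROW && col == MUTEX_COL)
        e_mutex_init(MUTEX_ROW, MUTEX_COL, &m, NULL);

    for (unsigned i = 0; i < ITERATIONS; i++) {
        e_mutex_lock(MUTEX_ROW, MUTEX_COL, &m);
        wins[me]++;                /* count successful acquisitions */
        e_mutex_unlock(MUTEX_ROW, MUTEX_COL, &m);
    }
    return 0;
}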
As for applications where the "fairness" is problematic: I am approaching the machine more as a multi-core general-purpose processor than as a dedicated co-processor or GPU. In the past, I have licensed software to the government (including the U.S. Army) and other organizations that was used to test large systems. Historically, that code ran, at times, on hundreds of processors (full boxed computers) that were used to load a server or test an application. The hundreds of boxes typically emulated thousands of simultaneous users.

The coordination among those boxes to drive the tests implemented exclusion, barriers, and shared memory--although we did not use most of those terms, and most of the code was written from scratch by one of my customers. In fact, one installation used a separate LAN to isolate the generated network traffic from the application traffic to ensure fairness. If the exclusion algorithms were unfair, they could bias the test results. (Not all processors were executing identical code/scripts or were testing identical parts of the system/application.)

I am exploring replacing hundreds of boxes with hundreds/16 Parallellas. A significant problem, however, is tying each of the Parallella's cores to the network. Taking all of the data through the ARM would be a bottleneck. Possibly putting additional NICs in the FPGA might help.