I would like to know if there is any software example that shows a linear chain of cores doing some work.
The problem I want to solve this way: I have 16 buffers (one on each core) and want to sum them. A chain of cores would, after a latency of 16 "sum cycles", perform 16 sums in parallel, which is even better than summing by "bi-partition" (as in typical map-reduce algorithms, if I'm not wrong). So far I haven't been able to get the synchronization between the cores to work properly (yes, shame on me). The code ended up full of barriers and mutexes, but it still doesn't work...
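To make the intended schedule concrete, here is a minimal host-side sketch in plain C of the chain described above. It is only a sequential model, not Epiphany code: it uses no e-lib calls and no real synchronization. The names `chain_sum`, `inbox`, `NCHUNKS`, `CHUNK` and `CHAIN_LEN` are hypothetical, and it assumes each buffer is split into 16 chunks so that core c works on chunk t - c at time step t.

```c
#include <assert.h>

#define NCORES 16                     /* cores in the chain */
#define NCHUNKS 16                    /* chunks per buffer (assumption) */
#define CHUNK 4                       /* elements per chunk (assumption) */
#define CHAIN_LEN (NCHUNKS * CHUNK)   /* elements per buffer */

/* Sequential model of a pipelined chain sum.
 * bufs[c] is the local buffer of core c; out receives the element-wise sum
 * of all buffers. Returns the number of time steps the pipeline needed. */
int chain_sum(int bufs[NCORES][CHAIN_LEN], int out[CHAIN_LEN])
{
    /* inbox[c] models the chunk-sized mailbox that core c-1 writes into. */
    static int inbox[NCORES + 1][CHUNK];
    int steps = 0;

    assert(CHAIN_LEN % NCHUNKS == 0);
    for (int t = 0; t < NCHUNKS + NCORES - 1; ++t) {
        /* Visit cores from last to first, so each core reads its inbox
         * before the upstream core overwrites it in the same step. This
         * imitates all cores acting at the same time. */
        for (int c = NCORES - 1; c >= 0; --c) {
            int k = t - c;            /* chunk handled by core c at step t */
            if (k < 0 || k >= NCHUNKS)
                continue;             /* pipeline still filling or draining */
            for (int i = 0; i < CHUNK; ++i) {
                int partial = (c == 0) ? 0 : inbox[c][i];
                inbox[c + 1][i] = partial + bufs[c][k * CHUNK + i];
            }
            if (c == NCORES - 1)      /* last core emits the finished chunk */
                for (int i = 0; i < CHUNK; ++i)
                    out[k * CHUNK + i] = inbox[NCORES][i];
        }
        ++steps;
    }
    return steps;
}
```

With these sizes the model takes NCHUNKS + NCORES - 1 = 31 steps. On real hardware each `inbox` slot would presumably need a ready/consumed handshake between the two neighbouring cores, which this sketch deliberately leaves out.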
Is there any example that uses that approach? I see there is an "e_neighbor_id" function, so I believe I'm not the first to have the idea... (I'm not using the function right now because I don't know how to configure the e_coreid_wrap_t).
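As a way of thinking about the wrap setting, the following plain C model shows three plausible ways a "next/previous core" lookup could wrap on a 4x4 group: within a row, within a column, or across the whole group in row-major order. This is an assumption about what the wrap modes mean, not the library's implementation. `model_neighbor`, `model_wrap_t` and the `MODEL_*` constants are hypothetical names, and the actual behaviour of `e_neighbor_id` should be checked against the SDK reference.

```c
#include <assert.h>

#define ROWS 4   /* rows in the core group (assumed 4x4) */
#define COLS 4   /* columns in the core group */

/* Hypothetical wrap modes, named after what they do in this model. */
typedef enum { MODEL_GROUP_WRAP, MODEL_ROW_WRAP, MODEL_COL_WRAP } model_wrap_t;

/* Replace (*row, *col) with the coordinates of the neighbouring core.
 * dir is +1 for the next core and -1 for the previous one. */
void model_neighbor(int dir, model_wrap_t wrap, unsigned *row, unsigned *col)
{
    assert(dir == 1 || dir == -1);
    switch (wrap) {
    case MODEL_ROW_WRAP:    /* stay in the row, wrap at the row ends */
        *col = (*col + (unsigned)(COLS + dir)) % COLS;
        break;
    case MODEL_COL_WRAP:    /* stay in the column, wrap at the column ends */
        *row = (*row + (unsigned)(ROWS + dir)) % ROWS;
        break;
    case MODEL_GROUP_WRAP: {
        /* Treat the group as one row-major chain of ROWS * COLS cores. */
        unsigned n = ROWS * COLS;
        unsigned idx = (*row * COLS + *col + (unsigned)((int)n + dir)) % n;
        *row = idx / COLS;
        *col = idx % COLS;
        break;
    }
    }
}
```

For a linear chain over all 16 cores, the group-wide mode is the one of interest in this model: starting at (0,3) and asking for the next core gives (1,0), and the last core (3,3) wraps back to (0,0).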
Has anyone implemented a sum by bi-partition? That is: the sum is computed first within groups of 2 cores, then within groups of 4, then within groups of 8, and finally across all 16 cores to get the total. (I've been thinking this would be best implemented by dividing the results exponentially among the cores, so that every core does work at each stage.)
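One way to read "dividing the results exponentially among the cores" is the scheme usually called recursive halving (a reduce-scatter): at each stage a core pairs with a partner, keeps one half of its current segment, and adds the partner's copy of that half. The sketch below is a sequential host-side model in plain C under that interpretation, not Epiphany code. `halving_sum`, `VEC_LEN` and the pairing rule `c ^ d` are choices made for this illustration.

```c
#include <assert.h>

#define NCORES 16    /* cores taking part; must be a power of two */
#define VEC_LEN 64   /* elements per buffer; must be a multiple of NCORES */

/* Sequential model of a recursive-halving sum.
 * On return, core c holds the fully summed values for elements
 * [c * VEC_LEN / NCORES, (c + 1) * VEC_LEN / NCORES) at those positions of
 * bufs[c]; other positions hold stale partial sums.
 * Returns the number of stages. */
int halving_sum(int bufs[NCORES][VEC_LEN])
{
    int lo[NCORES];        /* start of the segment each core still owns */
    int len = VEC_LEN;     /* length of that segment */
    int stages = 0;

    assert(VEC_LEN % NCORES == 0);
    for (int c = 0; c < NCORES; ++c)
        lo[c] = 0;

    for (int d = NCORES / 2; d >= 1; d /= 2) {
        len /= 2;
        /* Cores with bit d set keep the upper half, the others the lower. */
        for (int c = 0; c < NCORES; ++c)
            if (c & d)
                lo[c] += len;
        /* Every core adds its partner's copy of the half it kept. The
         * partner kept the other half, so nothing read here is modified
         * during the same stage. */
        for (int c = 0; c < NCORES; ++c) {
            int p = c ^ d;
            for (int i = lo[c]; i < lo[c] + len; ++i)
                bufs[c][i] += bufs[p][i];
        }
        ++stages;
    }
    return stages;
}
```

With 16 cores this takes 4 stages, and every core does additions in each of them, on 32, 16, 8 and then 4 elements. The results end up spread across the cores, so a final gather step would still be needed if one core must hold the whole sum.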
Thank you very much.