Parallella Community

by **MiguelTasende** » Thu Aug 13, 2015 1:17 pm

I would like to know if there is any software example that shows a linear chain of cores doing some work.

The problem I want to solve that way: I have 16 buffers (one on each core) and want to sum them. A chain of cores would, after a latency of 16 "sum cycles", do 16 sums in parallel, which is even better than summing by "bi-partition" (as in typical map-reduce algorithms, if I'm not wrong). So far, I couldn't get the synchronization of cores to work properly (yes, shame on me). The code got full of barriers and mutexes, but it doesn't work...
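
For illustration, here is a minimal host-side sketch of the chained sum described above. It is plain C that simulates the 16 cores in one process; it is not Epiphany code and uses no SDK calls. The names `chain_sum`, `NUM_CORES`, `BUF_LEN` and `link` are assumptions made up for this sketch. At time step `t`, core `k` adds its own element `t - k` to the partial sum handed over by core `k - 1`, so once the pipe is full all cores work in parallel.

```c
#define NUM_CORES 16  /* cores in the chain */
#define BUF_LEN   8   /* elements per buffer; arbitrary value for this sketch */

/* Simulates a linear chain of cores summing their buffers element-wise.
 * link[k][i] is the partial sum for element i as it leaves core k.
 * Returns the number of time steps the pipeline needed. */
int chain_sum(float bufs[NUM_CORES][BUF_LEN], float out[BUF_LEN])
{
    float link[NUM_CORES][BUF_LEN];
    int steps = BUF_LEN + NUM_CORES - 1;

    for (int t = 0; t < steps; t++) {
        /* On real hardware these NUM_CORES iterations would run concurrently. */
        for (int k = 0; k < NUM_CORES; k++) {
            int i = t - k;              /* element core k handles at step t */
            if (i < 0 || i >= BUF_LEN)
                continue;               /* pipe still filling or already draining */
            /* link[k-1][i] was produced by core k-1 at step t-1 */
            float incoming = (k == 0) ? 0.0f : link[k - 1][i];
            link[k][i] = incoming + bufs[k][i];
        }
    }

    for (int i = 0; i < BUF_LEN; i++)
        out[i] = link[NUM_CORES - 1][i];  /* last core holds the totals */
    return steps;
}
```

The sketch sidesteps the hard part, which is the hand-over between neighbours: on the device, core `k` must not read a partial sum before core `k - 1` has written it, and that is the synchronization the post is asking about.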

Is there any example that uses that approach? I see there is an "e_neighbor_id" function, so I believe I'm not the first to have the idea... (I'm not using the function right now because I don't know how to configure the e_coreid_wrap_t).

Has anyone done a sum by bi-partition? That would be: the sum is made first by 2 core groups, then by 4 core groups, then by 8 core groups, and finally the total sum. (I've been thinking that would be best implemented if the results are divided exponentially between the cores, so that everyone does work at each stage).
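
One way to read the "divide the results exponentially between the cores" idea is recursive halving: at each stage a core pairs with the partner whose index differs in one bit, keeps half of the segment it is responsible for, and adds the partner's copy of that half. The sketch below simulates this on the host in plain C. The names `halving_sum`, `NUM_CORES` and `BUF_LEN` are assumptions for illustration, and nothing here is taken from the Epiphany SDK.

```c
#define NUM_CORES 16  /* must be a power of two */
#define BUF_LEN   32  /* must be a multiple of NUM_CORES */

/* Simulates a bi-partition sum in which every core works at every stage.
 * data[k] is core k's buffer and is modified in place.
 * total receives the element-wise sum; returns the number of stages. */
int halving_sum(float data[NUM_CORES][BUF_LEN], float total[BUF_LEN])
{
    int lo[NUM_CORES];   /* start of the segment core k is responsible for */
    int len = BUF_LEN;   /* current segment length, same for all cores */
    int stages = 0;

    for (int k = 0; k < NUM_CORES; k++)
        lo[k] = 0;

    for (int d = NUM_CORES / 2; d >= 1; d /= 2) {
        len /= 2;
        for (int k = 0; k < NUM_CORES; k++) {
            int p = k ^ d;          /* partner core for this stage */
            if (k & d)
                lo[k] += len;       /* this core keeps the upper half */
            /* The partner only writes the other half, so reading its copy
             * of our half is safe without extra buffering. */
            for (int i = lo[k]; i < lo[k] + len; i++)
                data[k][i] += data[p][i];
        }
        stages++;
    }

    /* Each core now owns BUF_LEN / NUM_CORES finished elements. */
    for (int k = 0; k < NUM_CORES; k++)
        for (int i = lo[k]; i < lo[k] + len; i++)
            total[i] = data[k][i];
    return stages;
}
```

With 16 cores this takes 4 stages, and the amount of data each core adds halves at every stage. On the device a barrier (or equivalent) between stages would still be needed, which the simulation does not show.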

Thank you very much.

by **sebraa** » Thu Aug 13, 2015 2:15 pm

I have written a small library which provides unidirectional, point-to-point communication channels. Currently, I only have simple channels implemented, like this:
- The receiving core contains the buffer (a char*).
- Each channel consists of two ports, one for each core involved.
- A port is a structure containing a pointer to the buffer, a read-index and a write-index (byte indices into the buffer array).
- "Reading from the channel" means: Do a memcpy() from the buffer into a local variable, update the read-index, and write the new read-index value to the sender's port (so that it knows how much additional data it can send).
- "Writing to the channel" means: Do a memcpy() from the local variable into the buffer, update the write-index, and write the new write-index value to the receiver's port (so that it knows how much it is able to read).

Since I can guarantee that the "read-index" and "write-index" accesses are atomic (they are 32-bit word writes), I don't need any additional synchronization. The reader only reads from the buffer, and only writes to the read-index; the writer only writes to the buffer, and only writes to the write-index; reader and writer will never access the same part of the buffer simultaneously. The library overhead isn't really small: I need about 100 bytes (split between the cores) plus the buffer itself (on the receiving core) for each core, plus about 2 KB for the whole code.
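
The library itself is not shown in the thread, so the following is only a sketch of the scheme as described, written as plain host-side C. All names (`port_t`, `channel_t`, `chan_init`, `chan_write`, `chan_read`, `CHAN_SIZE`) are hypothetical. Both ports live in one address space here; on the device each would sit in its own core's memory, and the sketch assumes the data written into the buffer becomes visible before the updated index does.

```c
#include <stdint.h>
#include <string.h>

#define CHAN_SIZE 64u  /* buffer size in bytes; one byte always stays unused */

typedef struct {
    char *buf;             /* buffer, located on the receiving core */
    volatile uint32_t rd;  /* read index (byte offset into buf) */
    volatile uint32_t wr;  /* write index (byte offset into buf) */
} port_t;

typedef struct {
    port_t tx;  /* sender's port */
    port_t rx;  /* receiver's port */
} channel_t;

void chan_init(channel_t *c, char *storage)
{
    c->tx.buf = c->rx.buf = storage;
    c->tx.rd = c->tx.wr = c->rx.rd = c->rx.wr = 0;
}

/* Sender side: returns 1 on success, 0 if there is not enough room. */
int chan_write(channel_t *c, const void *src, uint32_t n)
{
    port_t *me = &c->tx;
    uint32_t wr = me->wr;
    uint32_t used = (wr + CHAN_SIZE - me->rd) % CHAN_SIZE;
    if (n > CHAN_SIZE - 1u - used)
        return 0;
    uint32_t first = CHAN_SIZE - wr;    /* bytes before the wrap-around */
    if (first > n)
        first = n;
    memcpy(me->buf + wr, src, first);
    memcpy(me->buf, (const char *)src + first, n - first);
    me->wr = (wr + n) % CHAN_SIZE;
    c->rx.wr = me->wr;  /* tell the receiver how much it can read */
    return 1;
}

/* Receiver side: returns 1 on success, 0 if fewer than n bytes are waiting. */
int chan_read(channel_t *c, void *dst, uint32_t n)
{
    port_t *me = &c->rx;
    uint32_t rd = me->rd;
    uint32_t avail = (me->wr + CHAN_SIZE - rd) % CHAN_SIZE;
    if (n > avail)
        return 0;
    uint32_t first = CHAN_SIZE - rd;
    if (first > n)
        first = n;
    memcpy(dst, me->buf + rd, first);
    memcpy((char *)dst + first, me->buf, n - first);
    me->rd = (rd + n) % CHAN_SIZE;
    c->tx.rd = me->rd;  /* tell the sender how much room was freed */
    return 1;
}
```

Each index has exactly one writer, which is why no mutex appears: the sender only ever advances the write index and the receiver only ever advances the read index.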

Initialization and channel management are a pain, though.

by **MiguelTasende** » Fri Aug 14, 2015 1:51 pm

Thanks for your answer.

Finally I removed the mutexes, and made the program work with "not so theoretically correct" tools...
- One barrier before beginning
- A different number of `__asm__ __volatile__("nop");` instructions for each core (the first core in the pipe doesn't wait, the second waits 100 cycles, the third waits 200 cycles, etc.)
- Begin as if no sync was necessary...
- It works! (Apache registered)

(after all they already share a clock... just joking: I know it is not safe, as different cores may go through different flows of execution and take different times, but as long as there are enough cycles to compensate...)

Anyway, that way of summing seems not to be too good for machine precision (you potentially end up adding big numbers to small numbers), so I will change to the bi-partition method in "production". It is good to have this implementation for testing, though.
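
The precision point can be checked on a host machine with a small sketch like the one below. It compares a running single-precision sum (what a chain does numerically) with a pairwise sum (what bi-partition does numerically). The function names are assumptions for this example, and the code is ordinary C, not Epiphany code.

```c
#include <stddef.h>

/* Running sum: the accumulator grows, so late small terms lose low bits. */
float sum_sequential(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Pairwise sum: operands of each addition have similar magnitude
 * when the inputs do, so rounding errors grow much more slowly. */
float sum_pairwise(const float *x, size_t n)
{
    if (n == 0)
        return 0.0f;
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return sum_pairwise(x, half) + sum_pairwise(x + half, n - half);
}
```

A simple experiment is to fill an array of 2^20 floats with `0.1f` and compare both results against a double-precision reference. In that particular case every pairwise addition is an exact doubling, so the pairwise result carries no rounding error at all, while the running sum drifts.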

by **piotr5** » Fri Aug 14, 2015 8:47 pm

as I said elsewhere, to sum up numbers you need first sort them by length. then if you sum up 2 neighbours, they need to be about same length. their sum should be again smaller or same as length of the sum of next 2, so you can create sum of these 4. and so on. as for actual algorithm, have you looked at ?

by **MiguelTasende** » Tue Aug 18, 2015 4:54 pm

Thanks for your response.

I think the "sorting by length" is not an option. The amount of calculation needed would be too much (this algorithm goes really fast so far; probably the fastest I've seen on Parallella... just that it doesn't always work... [lack of sync] and it has poor precision (10e-7 relative)... [bad summing algorithm]). But I can't sacrifice much speed.

The "Prefix sum" algorithm looks much better (it seems similar to what I called "bi-partition", I think, though a bit more elaborate...). I will look at that more thoroughly.

by **piotr5** » Thu Aug 20, 2015 11:16 am

with "sorting by length" I didn't mean you must sort exactly according to the value, I mean you just take the exponent of your floating number and assign it to the core specified in some table. i.e. if exponent ranges from -128 to 128, first core takes numbers with exponent between -128 and -112, next core -112 to -96, and so on. you might play around with actually setting up the table for your needs, maybe some numbers are appearing more rarely than others, maybe the range is tighter, or maybe use no table and just divide the exponent by 16 modulo 16 in my example...
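
A minimal sketch of the table-free variant, assuming IEEE-754 single-precision floats: the 8-bit biased exponent (0 to 255) is read straight from the bit pattern and divided by 16, which gives one of 16 buckets. The function name `core_for_value` is made up for this example, and the biased range 0..255 stands in for the "-128 to 128" range mentioned above.

```c
#include <stdint.h>
#include <string.h>

#define NUM_CORES 16

/* Returns the core (0..15) that should accumulate v, chosen from the
 * magnitude of v only; the sign bit is ignored. */
unsigned core_for_value(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);           /* well-defined type punning */
    uint32_t biased = (bits >> 23) & 0xFFu;   /* biased exponent, 0..255 */
    return biased / (256u / NUM_CORES);       /* 16 exponent values per core */
}
```

For example, `1.0f` has biased exponent 127 and lands on core 7, while `2.0f` has biased exponent 128 and lands on core 8. Extracting the exponent costs a shift and a mask per value, which is far cheaper than a real sort.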

have you checked the error happens really in summation? the reason nobody bothers to sort before summation is that usually all numbers tend to be in the same magnitude already. why isn't that the case for you? or do you use negative values in your summation too? cancellation is another big problem, actually bigger than through summation of positive values. i.e. suppose you have 2 equally long values, you subtract them, what remains is the short remnants of their precision. therefore if you did split into small and big values, it's better to subtract pairs of positive and negative before doing the actual summation. for example instead of a+b+c-(e+f+g) you should do (a-e)+(b-f)+(c-g). but basically it might be even better to perform subtraction even before whatever multiplication you did perform before the summation, if the values they're multiplied with had common factors...
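
The difference between a+b+c-(e+f+g) and (a-e)+(b-f)+(c-g) can be reproduced with a small sketch, assuming IEEE-754 single precision with round-to-nearest. The function names are assumptions for this example; the `volatile` qualifiers are only there to force every intermediate result to be rounded to `float`.

```c
/* Sums all positive terms, sums all negative terms, then subtracts once. */
float diff_grouped(const float *pos, const float *neg, int n)
{
    volatile float sp = 0.0f, sn = 0.0f;
    for (int i = 0; i < n; i++) {
        sp = sp + pos[i];
        sn = sn + neg[i];
    }
    return sp - sn;
}

/* Subtracts matching pairs first, then sums the small differences. */
float diff_paired(const float *pos, const float *neg, int n)
{
    volatile float s = 0.0f;
    for (int i = 0; i < n; i++) {
        volatile float d = pos[i] - neg[i];
        s = s + d;
    }
    return s;
}
```

With three positive terms of 16777216 (2^24) and three negative terms of 16777215, the exact answer is 3. The paired form returns 3, because each difference is exactly 1. The grouped form returns 4, because the sum 50331645 of the negative terms is not representable in single precision and rounds to 50331644.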

Chained cores
