Performance issues

Postby nickoppen » Fri Mar 25, 2016 10:05 am

leonfg wrote:Hi, I read your ANN blogs and ran the nnP code. There is a question I want to discuss.
I defined a 48,48,34 network and it worked well: the prediction pass costs ~0.15s (forka execution time), but on the ARM the same topology implemented with OpenCV needs only ~0.02s. I changed the CL code to make "forwardPass" run 2 to 100 times inside "k_forward", and the forka time only increased by ~0.005s × n. Does that mean the forward prediction itself costs only ~0.005s on the Epiphany, and that data movement accounts for most of the remaining time? If so, do you think there is any way to reduce that cost?


Hi,

Thanks for doing the performance testing. So far I've concentrated on getting the algorithm right and have not thought about performance. I'm not surprised that additional forward passes add very little to the execution time. The cost of launching the kernel is huge and doing all that work for a single pass makes no sense at all.
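
For anyone reading along, leonfg's change amounts to something like this inside the kernel (a sketch only: the argument list is invented, forwardPass stands for the real nnP routine, and REPEATS is the n above):

Code:
__kernel void k_forward(__global float *input, __global float *weights,
                        __global float *output)
{
    /* Repeat the identical forward pass REPEATS times so the fixed
     * kernel-launch cost is amortised; only the per-pass work should
     * grow with REPEATS, which is what the ~0.005s x n timing suggests. */
    for (int i = 0; i < REPEATS; i++)
        forwardPass(input, weights, output);
}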

I've started working on the "performance" version, beginning with an upgrade to my message-passing experiment to see how MPI compares with direct memory writes. Using COPRTHR and MPI, I hope to get to a situation where I can read in the data on the ARM and pass it to the Epiphany using a DMA transfer while the previous data set is being processed. Then, at the end of the process, I can copy the whole trained network back using DMA, or copy back the output of the forward pass using whatever method is fastest.

The overall idea is described at the end of my "Getting it done" post. I've expanded it a little since I wrote that, when I realised that I can use DMA to at least start copying the next data set into local memory while the Epiphany core is working on the last one. The only way to use the Epiphany efficiently is to launch the kernel once and then throw lots of data at it for as long as possible.
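
The core-side loop I have in mind is a classic double buffer, roughly like this (a sketch only: dma_start()/dma_wait(), packet(), packet_count() and process() are hypothetical stand-ins, not real COPRTHR or eSDK calls):

Code:
float bufA[PACKET_WORDS], bufB[PACKET_WORDS];   /* two local buffers    */
float *working = bufA, *loading = bufB;

int n = packet_count();
dma_start(loading, packet(0), sizeof(bufA));    /* prefetch packet 0    */
for (int p = 0; p < n; p++) {
    dma_wait();                     /* the in-flight copy has landed    */
    float *tmp = working;           /* the new packet becomes the       */
    working = loading;              /* working buffer; the old working  */
    loading = tmp;                  /* buffer becomes the landing zone  */
    if (p + 1 < n)
        dma_start(loading, packet(p + 1), sizeof(bufA));
    process(working);               /* compute overlaps the next copy   */
}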

Do you have your code on github? I'd like to have a look at your changes.

nick

Re: Performance issues

Postby jar » Sat Mar 26, 2016 2:37 am

Nick,

Under the hood, COPRTHR MPI performs double-word direct writes. That is the most efficient way to move data between cores on-chip in a synchronous manner. Use the DMA engine if you want to overlap the transfer with compute, and use the DMA engine for off-chip transfers.
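
(For reference, a "double-word direct write" is just a copy loop that issues 64-bit stores into the other core's local memory, along these lines; a sketch, assuming dst points into the destination core's address window and both pointers are 8-byte aligned:)

Code:
#include <stdint.h>

/* Copy n double words (8 bytes each) with 64-bit stores. On the
 * Epiphany a remote write is fire-and-forget, so the stores stream
 * across the mesh without waiting for acknowledgements.            */
void copy_dwords(volatile uint64_t *dst, const uint64_t *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = src[i];
}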

Re: Performance issues

Postby nickoppen » Sat Mar 26, 2016 10:36 am

Thanks jar,

That sounds like it will fit into my strategy quite well. I plan to send an MPI message containing the location of the new data set from the ARM to all cores. That message is the signal that new data is ready, and each core will work out which part it needs to copy using DMA.

My understanding is that COPRTHR implements pthread-like processes on the Epiphany cores (?). My idea is to have one thread doing the processing and another sending and receiving MPI messages to and from the ARM and handling the data transfers back and forth. Similarly, on the ARM there would be one input/writer thread and one reader/output thread.

There is one thing I need to check out for the inter-core communication: the relative performance of direct memory writes coordinated with barriers versus MPI messages. My first data-passing experiment showed that barriers add a huge overhead, so maybe MPI messages will be faster.
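
The two patterns I want to race against each other look roughly like this (a sketch: the header name is an assumption, barrier() stands in for whatever barrier call the library provides, and neighbour_mem is a pointer into the next core's local memory):

Code:
#include "coprthr_mpi.h"   /* assumed name of the on-core MPI header */

void exchange(float *value, volatile float *neighbour_mem,
              int next_rank, int prev_rank)
{
    MPI_Status status;

    /* Pattern 1: raw store into the neighbour's local memory,
     * then a barrier so both sides agree the data has landed.  */
    neighbour_mem[0] = *value;
    barrier();

    /* Pattern 2: the same exchange as a matched send/receive.  */
    MPI_Send(value, 1, MPI_FLOAT, next_rank, 0, MPI_COMM_WORLD);
    MPI_Recv(value, 1, MPI_FLOAT, prev_rank, 0, MPI_COMM_WORLD, &status);
}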

nick

Re: Performance issues

Postby leonfg » Sat Mar 26, 2016 4:03 pm

Hi nick! Thanks for answering!
I just added a "for" loop around the forwardPass function, so there is no need to share the code. I am still working on understanding your code; I am new to CL.

Re: Performance issues

Postby jar » Sun Mar 27, 2016 1:56 am

Nick, I think you may be misunderstanding what threaded MPI does, or maybe I am misunderstanding you. The COPRTHR interface can launch threads in a Pthreads-style manner: arguments are passed to the core with a pointer to a struct containing all the arguments, just like Pthreads, and that struct may contain pointers to host-side data.

The MPI interface is just used for inter-core data transfer. Not ARM to Epiphany.

Things are going to change soon, however. I have been using an early release of COPRTHR 2.0, and there are new features coming that will improve programmability in many respects. The Pthreads style will still be supported; you will just have other options.

Re: Performance issues

Postby nickoppen » Mon Mar 28, 2016 1:16 am

jar wrote:The COPRTHR interface can launch threads in a Pthread style manner. Arguments are passed to the core with a pointer to a struct containing all arguments like Pthreads.


I think I did misunderstand the COPRTHR pthread interface. From what I can glean, it allows the host application to treat processes running on the coprocessor like pthreads running on the host processor. I was hoping that I could launch a second reader/writer thread on the core itself; come to think of it, that would require a run-time package with pre-emption running on the core... In that case I'm going to have to serialise the copy-in and then kick off the copy-out using DMA just before processing of the new input starts. That way I can at least get a little I/O happening in parallel.
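
Spelled out, that serialised schedule would look something like this (a sketch with hypothetical helpers throughout; nothing here is a real COPRTHR call):

Code:
while (more_input()) {
    copy_in(input_buf);            /* blocking read of the next input   */
    dma_out_start(result_buf);     /* non-blocking DMA of the previous  */
                                   /* result, overlapping the compute   */
    process(input_buf, scratch_buf);   /* new result lands in scratch   */
    dma_out_wait();                /* previous result has been shipped  */
    swap(&result_buf, &scratch_buf);   /* scratch becomes next DMA-out  */
}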

jar wrote:The MPI interface is just used for inter-core data transfer. Not ARM to Epiphany.


This is a bummer. Reading about MPI, I assumed that ARM-to-Epiphany would be included, given that the literature talks about cross-platform interoperation. Never mind, the current library is still a bit of a toe-in-the-water implementation, so I'm keeping my fingers crossed for ver. 2.0. Until then I'll see what I can figure out.

Thanks again jar. Your input is always great.

Re: Performance issues

Postby leonfg » Mon Mar 28, 2016 6:52 am

Assuming that the big time costs are the COPRTHR launch overhead and data-movement latency, could we, with the current architecture, send a set of samples at a time instead of one sample at a time? For example, with the same network topology I mentioned: if we cache 100 prediction samples, send them to the Epiphany, activate the forwardPass kernel once, and retrieve all 100 results together, the whole execution time might still be less than 0.2s, whereas on the ARM the same task would cost 2s+. Is it possible to do this?
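
On the host side, the batched call might look like this (a sketch in the STDCL style that nnP's forka launches suggest; krn and ndr come from the usual clopen()/clsym() setup, and the buffer layout, kernel argument list and helper are all invented):

Code:
#include <stdcl.h>

float *samples = (float *)clmalloc(stdacc, BATCH * IN_LEN * sizeof(float), 0);
float *results = (float *)clmalloc(stdacc, BATCH * OUT_LEN * sizeof(float), 0);

fill_with_samples(samples, BATCH);              /* hypothetical helper  */

clmsync(stdacc, 0, samples, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT,  /* ONE launch for the   */
        samples, results, BATCH);               /* whole batch of input */
clmsync(stdacc, 0, results, CL_MEM_HOST | CL_EVENT_NOWAIT);
clwait(stdacc, 0, CL_ALL_EVENT);                /* all results are back */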

Re: Performance issues

Postby nickoppen » Mon Mar 28, 2016 9:03 am

Yes, we could collect batches of 100 and process them with one call. That would be fine for a lot of applications.

However, some applications don't have the luxury of waiting. If you want to process video frames in real time at 25 frames per second, you would be waiting 4 seconds just to collect a batch of 100. That would not be acceptable for a real-time application. Actually, if your time per frame was less than 0.04s you would have finished the first frame before the second was even ready.

I want to build a general-purpose framework where processing can start as soon as the first data packet is ready. That way, reading in the second packet can occur in parallel with the processing of the first, which would allow arbitrarily long data streams to be processed in the limited memory available on each core. If you launched the kernel once and processed packets for the next hour, you would get pretty close to eliminating the launch time as a limiting factor.

I've just got to find the combination of techniques to get it all working.

Re: Performance issues

Postby leonfg » Mon Mar 28, 2016 1:58 pm

nickoppen wrote:I want to build a general-purpose framework where processing can start as soon as the first data packet is ready. [...]

Looking forward to your results!

Re: Performance issues

Postby leonfg » Tue Apr 05, 2016 2:18 pm

Hi Nick, I changed your code so that the JIT compile happens once instead of every time when there are multiple samples in the .dat file, and added a results CSV log. The code is at https://github.com/leonfg/epiphanyANN.git.
In addition, I wrote a MATLAB project to train an MLP, export the weights and biases in .nn format, and generate test samples in .dat format. I used the UCI IRIS data set to train a 4/10/3 backprop sigmoid network and generated several sets of .nn and .dat files. The only difference between the file sets is the selection ratio of training samples, which leads to different prediction accuracy in MATLAB. The strange thing is that on the Parallella some file sets produce exactly the same prediction results as MATLAB, while others give totally different, wrong results. I cannot explain this: the file sets are all generated by the same MATLAB code and the only difference is the training-sample selection ratio, so they should all give the same results on the Parallella as they did in MATLAB.
I just want to find a faster way to train; importing the trained network model into the Epiphany would be enough for most applications. But for now I am stuck here.
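
For anyone making the same change: it amounts to hoisting the clopen()/clsym() pair out of the per-sample loop so the CL source is JIT-compiled once (a sketch in STDCL style; the file name, argument list and helpers are assumptions, only "k_forward" comes from nnP):

Code:
#include <stdcl.h>

void *h = clopen(stdacc, "nnP.cl", CLLD_NOW);      /* JIT compile ONCE  */
cl_kernel krn = clsym(stdacc, h, "k_forward", 0);
clndrange_t ndr = clndrange_init_1d(0, NCORES, NCORES);

for (int s = 0; s < nsamples; s++) {               /* per-sample loop   */
    load_sample(input, s);                         /* hypothetical      */
    clmsync(stdacc, 0, input, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
    clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT, input, output);
    clmsync(stdacc, 0, output, CL_MEM_HOST | CL_EVENT_NOWAIT);
    clwait(stdacc, 0, CL_ALL_EVENT);
    log_result_csv(output, s);                     /* hypothetical      */
}
clclose(stdacc, h);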
