Performance issues

Postby nickoppen » Fri Mar 25, 2016 10:05 am

leonfg wrote:Hi, I read your ANN blogs and ran the nnP code. There is a question I want to discuss.
I defined a 48,48,34 network and it worked well: the prediction pass costs ~0.15s (forka execution time), but on the ARM the same topology implemented with OpenCV needs only ~0.02s. I changed the CL code to make "forwardPass" run 2 to 100 times inside "k_forward", and the forka time only increased by ~0.005s × n. Does that mean the forward prediction itself costs only ~0.005s on the Epiphany, and that data movement accounts for most of the remaining time? If so, do you think there is any way to reduce that cost?


Hi,

Thanks for doing the performance testing. So far I've concentrated on getting the algorithm right and have not thought about performance. I'm not surprised that additional forward passes add very little to the execution time. The cost of launching the kernel is huge and doing all that work for a single pass makes no sense at all.
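
For anyone reading along, leonfg's change amounts to something like this inside the kernel (a sketch only: the argument list is invented, forwardPass stands for the real nnP routine, and REPEATS is the n above):

Code:
__kernel void k_forward(__global float *input, __global float *weights,
                        __global float *output)
{
    /* Repeat the identical forward pass REPEATS times so the fixed
     * kernel-launch cost is amortised; only the per-pass work should
     * grow with REPEATS, which is what the ~0.005s x n timing suggests. */
    for (int i = 0; i < REPEATS; i++)
        forwardPass(input, weights, output);
}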

I've started working on the "performance" version, beginning with an upgrade to my message-passing experiment to see how MPI compares with direct memory writes. Using COPRTHR and MPI, I hope to get to a situation where I can read in the data on the ARM and pass it to the Epiphany using a DMA transfer while the previous data set is being processed. Then, at the end of the process, I can copy the whole trained network back using DMA, or copy back the output of the forward pass using whatever method is fastest.

The overall idea is described at the end of my "Getting it done" post. I've expanded it a little since I wrote that, when I realised that I can use DMA to at least start copying the next data set into local memory while the Epiphany core is working on the last one. The only way to use the Epiphany efficiently is to launch the kernel once and then throw lots of data at it for as long as possible.
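
The core-side loop I have in mind is a classic double buffer, roughly like this (a sketch only: dma_start()/dma_wait(), packet(), packet_count() and process() are hypothetical stand-ins, not real COPRTHR or eSDK calls):

Code:
float bufA[PACKET_WORDS], bufB[PACKET_WORDS];   /* two local buffers    */
float *working = bufA, *loading = bufB;

int n = packet_count();
dma_start(loading, packet(0), sizeof(bufA));    /* prefetch packet 0    */
for (int p = 0; p < n; p++) {
    dma_wait();                     /* the in-flight copy has landed    */
    float *tmp = working;           /* the new packet becomes the       */
    working = loading;              /* working buffer; the old working  */
    loading = tmp;                  /* buffer becomes the landing zone  */
    if (p + 1 < n)
        dma_start(loading, packet(p + 1), sizeof(bufA));
    process(working);               /* compute overlaps the next copy   */
}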

Do you have your code on github? I'd like to have a look at your changes.

nick

Re: Performance issues

Postby jar » Sat Mar 26, 2016 2:37 am

Nick,

Under the hood, COPRTHR MPI performs double-word direct writes. That is the most efficient way to move data between cores on-chip in a synchronous manner. Use the DMA engine if you want to overlap the transfer with compute, and use the DMA engine for off-chip transfers.
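
(For reference, a "double-word direct write" is just a copy loop that issues 64-bit stores into the other core's local memory, along these lines; a sketch, assuming dst points into the destination core's address window and both pointers are 8-byte aligned:)

Code:
#include <stdint.h>

/* Copy n double words (8 bytes each) with 64-bit stores. On the
 * Epiphany a remote write is fire-and-forget, so the stores stream
 * across the mesh without waiting for acknowledgements.            */
void copy_dwords(volatile uint64_t *dst, const uint64_t *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = src[i];
}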

Re: Performance issues

Postby nickoppen » Sat Mar 26, 2016 10:36 am

Thanks jar,

That sounds like it will fit into my strategy quite well. I plan to send an MPI message containing the location of the new data set from the ARM to all cores. That message is the signal that new data is ready, and each core will work out which part it needs to copy using DMA.

My understanding is that COPRTHR implements pthread-like processes on the Epiphany cores (?). My idea is to have one thread doing the processing and another sending and receiving MPI messages to and from the ARM and handling the data transfers back and forth. Similarly, on the ARM there would be one input/writer thread and one reader/output thread.

There is one thing I need to check out for the inter-core communication: the relative performance of direct memory writes coordinated with barriers versus MPI messages. My first data-passing experiment showed that barriers add a huge overhead, so maybe MPI messages will be faster.
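
The two patterns I want to race against each other look roughly like this (a sketch: the header name is an assumption, barrier() stands in for whatever barrier call the library provides, and neighbour_mem is a pointer into the next core's local memory):

Code:
#include "coprthr_mpi.h"   /* assumed name of the on-core MPI header */

void exchange(float *value, volatile float *neighbour_mem,
              int next_rank, int prev_rank)
{
    MPI_Status status;

    /* Pattern 1: raw store into the neighbour's local memory,
     * then a barrier so both sides agree the data has landed.  */
    neighbour_mem[0] = *value;
    barrier();

    /* Pattern 2: the same exchange as a matched send/receive.  */
    MPI_Send(value, 1, MPI_FLOAT, next_rank, 0, MPI_COMM_WORLD);
    MPI_Recv(value, 1, MPI_FLOAT, prev_rank, 0, MPI_COMM_WORLD, &status);
}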

nick

Re: Performance issues

Postby leonfg » Sat Mar 26, 2016 4:03 pm

Hi nick! Thanks for answering!
I just added a "for" loop around the forwardPass function, so there is no need to share the code. I am still working on understanding your code; I am new to CL.

Re: Performance issues

Postby jar » Sun Mar 27, 2016 1:56 am

Nick, I think you may be misunderstanding what threaded MPI does, or maybe I am misunderstanding you. The COPRTHR interface can launch threads in a Pthreads-style manner: arguments are passed to the core with a pointer to a struct containing all the arguments, just like Pthreads, and that struct may contain pointers to host-side data.

The MPI interface is just used for inter-core data transfer. Not ARM to Epiphany.

Things are going to change soon, however. I have been using an early release of COPRTHR 2.0, and there are new features coming that will improve programmability in many respects. The Pthreads style will still be supported; you will just have other options.

Re: Performance issues

Postby nickoppen » Mon Mar 28, 2016 1:16 am

jar wrote:The COPRTHR interface can launch threads in a Pthread style manner. Arguments are passed to the core with a pointer to a struct containing all arguments like Pthreads.


I think I did misunderstand the COPRTHR pthread interface. From what I can glean, it allows the host application to treat processes running on the coprocessor like pthreads running on the host processor. I was hoping that I could launch a second reader/writer thread on the core itself; come to think of it, that would require a run-time package with pre-emption running on the core... In that case I'm going to have to serialise the copy-in and then kick off the copy-out using DMA just before processing of the new input starts. That way I can at least get a little I/O happening in parallel.
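
Spelled out, that serialised schedule would look something like this (a sketch with hypothetical helpers throughout; nothing here is a real COPRTHR call):

Code:
while (more_input()) {
    copy_in(input_buf);            /* blocking read of the next input   */
    dma_out_start(result_buf);     /* non-blocking DMA of the previous  */
                                   /* result, overlapping the compute   */
    process(input_buf, scratch_buf);   /* new result lands in scratch   */
    dma_out_wait();                /* previous result has been shipped  */
    swap(&result_buf, &scratch_buf);   /* scratch becomes next DMA-out  */
}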

jar wrote:The MPI interface is just used for inter-core data transfer. Not ARM to Epiphany.


This is a bummer. Reading about MPI, I assumed that ARM-to-Epiphany would be included, given that the literature talks about cross-platform interoperation. Never mind, the current library is still a bit of a toe-in-the-water implementation, so I'm keeping my fingers crossed for ver. 2.0. Until then I'll see what I can figure out.

Thanks again jar. Your input is always great.

Re: Performance issues

Postby leonfg » Mon Mar 28, 2016 6:52 am

Assuming that the big time costs are the COPRTHR launch overhead and data-movement latency, could we, with the current architecture, send a set of samples at a time instead of one sample at a time? For example, with the same network topology I mentioned: if we cache 100 prediction samples, send them to the Epiphany, activate the forwardPass kernel once, and retrieve all 100 results together, the whole execution time might still be less than 0.2s, whereas on the ARM the same task would cost 2s+. Is it possible to do this?
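
On the host side, the batched call might look like this (a sketch in the STDCL style that nnP's forka launches suggest; krn and ndr come from the usual clopen()/clsym() setup, and the buffer layout, kernel argument list and helper are all invented):

Code:
#include <stdcl.h>

float *samples = (float *)clmalloc(stdacc, BATCH * IN_LEN * sizeof(float), 0);
float *results = (float *)clmalloc(stdacc, BATCH * OUT_LEN * sizeof(float), 0);

fill_with_samples(samples, BATCH);              /* hypothetical helper  */

clmsync(stdacc, 0, samples, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT,  /* ONE launch for the   */
        samples, results, BATCH);               /* whole batch of input */
clmsync(stdacc, 0, results, CL_MEM_HOST | CL_EVENT_NOWAIT);
clwait(stdacc, 0, CL_ALL_EVENT);                /* all results are back */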

Re: Performance issues

Postby nickoppen » Mon Mar 28, 2016 9:03 am

Yes, we could collect batches of 100 and process them with one call. That would be fine for a lot of applications.

However, some applications don't have the luxury of waiting. If you want to process video frames in real time at 25 frames per second, you would be waiting 4 seconds just to collect a batch of 100. That would not be acceptable for a real-time application. Actually, if your time per frame was less than 0.04s you would have finished the first frame before the second was even ready.

I want to build a general-purpose framework where processing can start as soon as the first data packet is ready. That way, reading in the second packet can occur in parallel with the processing of the first, which would allow arbitrarily long data streams to be processed in the limited memory available on each core. If you launched the kernel once and processed packets for the next hour, you would get pretty close to eliminating the launch time as a limiting factor.

I've just got to find the combination of techniques to get it all working.

Re: Performance issues

Postby leonfg » Mon Mar 28, 2016 1:58 pm

nickoppen wrote:I want to build a general-purpose framework where processing can start as soon as the first data packet is ready. [...]

Looking forward to your results!

Re: Performance issues

Postby leonfg » Tue Apr 05, 2016 2:18 pm

Hi Nick, I changed your code so that the JIT compile happens once instead of every time when there are multiple samples in the .dat file, and added a results CSV log. The code is at https://github.com/leonfg/epiphanyANN.git.
In addition, I wrote a MATLAB project to train an MLP, export the weights and biases in .nn format, and generate test samples in .dat format. I used the UCI IRIS data set to train a 4/10/3 backprop sigmoid network and generated several sets of .nn and .dat files. The only difference between the file sets is the selection ratio of training samples, which leads to different prediction accuracy in MATLAB. The strange thing is that on the Parallella some file sets produce exactly the same prediction results as MATLAB, while others give totally different, wrong results. I cannot explain this: the file sets are all generated by the same MATLAB code and the only difference is the training-sample selection ratio, so they should all give the same results on the Parallella as they did in MATLAB.
I just want to find a faster way to train; importing the trained network model into the Epiphany would be enough for most applications. But for now I am stuck here.
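
For anyone making the same change: it amounts to hoisting the clopen()/clsym() pair out of the per-sample loop so the CL source is JIT-compiled once (a sketch in STDCL style; the file name, argument list and helpers are assumptions, only "k_forward" comes from nnP):

Code:
#include <stdcl.h>

void *h = clopen(stdacc, "nnP.cl", CLLD_NOW);      /* JIT compile ONCE  */
cl_kernel krn = clsym(stdacc, h, "k_forward", 0);
clndrange_t ndr = clndrange_init_1d(0, NCORES, NCORES);

for (int s = 0; s < nsamples; s++) {               /* per-sample loop   */
    load_sample(input, s);                         /* hypothetical      */
    clmsync(stdacc, 0, input, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
    clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT, input, output);
    clmsync(stdacc, 0, output, CL_MEM_HOST | CL_EVENT_NOWAIT);
    clwait(stdacc, 0, CL_ALL_EVENT);
    log_result_csv(output, s);                     /* hypothetical      */
}
clclose(stdacc, h);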
