Yes, the ACP is an option. I realized I hadn't discussed it just after posting.
On the plus side, you can then use normal memory and don't have to care about the cache. If your driver is well written using the DMA API, then just setting the 'dma-coherent' property in the device tree will turn the cache maintenance ops into no-ops and make dma_alloc_coherent return normal cached memory. So you can plug your HW into the ACP or the HP port and only have to change one property in the device tree. It's also nice that your application then doesn't have to care all that much.
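As a rough illustration (the node name, compatible string, and addresses below are all made up, not from any real binding), the only device-tree change between the two attachments would look something like:

```dts
/* Hypothetical peripheral node: with the device wired to the ACP, adding
 * 'dma-coherent' tells the kernel's DMA API that the device snoops the CPU
 * caches, so no cache maintenance is needed. */
my_cam: camera@43c00000 {
    compatible = "acme,rpi-cam";
    reg = <0x43c00000 0x10000>;
    dma-coherent;   /* drop this line if the device sits on a non-coherent HP port */
};
```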
The downside is that the ACP port has less bandwidth available than the HP port; people have measured it at about half. That bandwidth also impacts the performance of the ARM cores since it's shared (which a benchmark might not show). You also have to be careful about which protection/cache attribute bits you use during your ACP transfers so you don't thrash the cache: you want the DMA writes to not allocate cache lines, just invalidate them (at least for large transfers like the camera's). Exchanging small data with the Epiphany could be fully cacheable, allocating on read/write to speed things up.
Note that the cache thrashing can happen if your app does large accesses to a cacheable zone as well ... which is why a well-written app is always better :p
I started a thread a couple of days ago on the linux-arm mailing list where all the solutions are discussed with their pros and cons, if you want more details.
ATM I'm using vldm and vstm for my optimization; see the fast_write method in rpi_cam.c. Since I'm writing to a file descriptor, I just wanted the write() to be fast, so I do small 8 kB write() calls, and before each call I memcpy those 8 kB from the non-cached shared memory into a small local buffer using vldm/vstm. Because that local buffer is so small and I overwrite it between calls, the writes to it most likely never hit DDR and stay within L1.
A very quick read-speed benchmark I made (basically summing all the int32_t in a large buffer) yields:
- Pure C - cacheable mem : 340 MB/s
- Pure C - non-cacheable : 35 MB/s
- NEON - cacheable mem : 577 MB/s
- NEON - non-cacheable : 482 MB/s
For non-cacheable memory, preload wouldn't help at all (and would probably even be counterproductive): preloading just tells the CPU to pull that memory into the cache so it's already there when you actually read it, which can't happen for non-cacheable memory. For cacheable memory it would probably help, yes.
Replacing the memcpy could definitely help. There is a project called 'fastarm' (https://github.com/hglm/fastarm) that has optimized memcpy/memset replacements, and it also ships a benchmark. On my rev1 (with 533M DDR):
standard memcpy: ( ./benchmark --test 47 --memcpy a )
8M bytes page aligned: 307.47 MB/s
8M bytes page aligned: 307.44 MB/s
new memcpy for cortex using NEON with line size 64, preload offset 192: ( ./benchmark --test 47 --memcpy g )
8M bytes page aligned: 495.45 MB/s
8M bytes page aligned: 495.50 MB/s
But since those are optimized with preload and the like, I'm not sure how they'd perform on non-cached memory. A dedicated memcpy for that zone might do better.
Cheers,
Sylvain