distributed x264 video encoding

Postby JackTakahashi » Mon Dec 02, 2013 7:31 pm

Hey guys,

I'm really interested in understanding the potential of the Parallella board for x264 video encoding.
What kind of limitations would there be in porting an x264-based encoder (something like HandBrake) to the co-processor? What about clustering limitations?
If the Parallella hardware is capable, I was thinking of having one master server split the original video into X parts and send them over the LAN to various Parallella boards, which would encode their parts and return them to the server, which would then merge everything into one complete re-encoded video.
Obviously this is not a fully distributed solution (which would have to be built from scratch for the Parallella board, I assume), but it could be a start.
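To make the idea concrete, the master-side steps might look something like this. It's only an untested sketch: it assumes ffmpeg is installed on the master and leaves out actually shipping the chunks to the boards and back.

    /* split_merge.c: untested sketch of the master's split and merge
     * steps.  Assumes ffmpeg is on the PATH; shipping the parts to
     * the Parallella boards (scp, NFS, ...) is left out. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* 1. Split the source into ~60-second chunks with a stream
         * copy (-c copy), so the split itself is cheap.  The segment
         * muxer only cuts at keyframes, so each part can be decoded
         * and re-encoded independently. */
        if (system("ffmpeg -i source.mkv -c copy -f segment "
                   "-segment_time 60 part%03d.mkv") != 0)
            return 1;

        /* 2. Each board would then re-encode its part, e.g.:
         * ffmpeg -i part000.mkv -c:v libx264 -crf 22 enc000.mkv */

        /* 3. Merge the encoded parts with the concat demuxer;
         * parts.txt lists one line per part: file 'enc000.mkv' */
        if (system("ffmpeg -f concat -i parts.txt -c copy final.mkv") != 0)
            return 1;
        return 0;
    }

One catch is that the splits can only happen at keyframes, so the chunk lengths won't be exact.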
I am a developer, but I have never worked with x264 before and I don't know where to start. I'd be really interested in making this work on the Parallella board.
Any thoughts on this are appreciated, thank you!
JackTakahashi
 
Posts: 2
Joined: Mon Dec 02, 2013 7:20 pm

Re: distributed x264 video encoding

Postby theover » Mon Dec 02, 2013 9:57 pm

I haven't looked into the x264 sources (or the open-source ffmpeg sources, for that matter), so I don't know how hard it is to find stuff that can be parallelized at the proper granularity.

Maybe once the job of getting the boards to customers is done, there will be experts who can say which code will run well on the Epiphany. Surely there'd be quite a few people interested in a low-power, fast x264 encoder, even though wavelet encoding is a newer thing to try (the BBC's Dirac/Schroedinger, and JPEG 2000 for HD film).

Just running a distributed encoder on the dual ARM cores, and connecting more boards together, is another option. Maybe there are already options for that in the open-source world. Certain computations could be singled out for optimization (even on the FPGA), like the DCT, pattern search, etc.
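For instance, the inner loop of the pattern (motion) search is essentially a sum of absolute differences over a small block, evaluated an enormous number of times per frame, which is exactly the sort of self-contained kernel that could be pushed down to a core or into the FPGA. A simplified sketch (my own illustration, not code from x264 or ffmpeg):

    #include <stdint.h>
    #include <stdlib.h>

    /* Sum of absolute differences between a 16x16 block of the
     * current frame and a candidate block of the reference frame.
     * 'stride' is the width of the frame buffer in bytes. */
    static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref,
                              int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++)
                sad += abs(cur[x] - ref[x]);
            cur += stride;
            ref += stride;
        }
        return sad;
    }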

T.V.
theover
 
Posts: 181
Joined: Mon Dec 17, 2012 4:50 pm

Re: distributed x264 video encoding

Postby over9000 » Tue Dec 03, 2013 4:20 am

theover wrote:I haven't looked into the x264 sources (or the open-source ffmpeg sources, for that matter), so I don't know how hard it is to find stuff that can be parallelized at the proper granularity.

I know that ffmpeg is quite hard to modify to work on a platform like this. From what I've read, it uses a big struct that holds lots of unrelated data, and disentangling it isn't trivial, which makes it hard to parallelise. I've read a few papers on the subject; I'll see if I can find the links and post them later.
theover wrote:Maybe once the job of getting the boards to customers is done, there will be experts who can say which code will run well on the Epiphany. Surely there'd be quite a few people interested in a low-power, fast x264 encoder, even though wavelet encoding is a newer thing to try (the BBC's Dirac/Schroedinger, and JPEG 2000 for HD film).

Myself included. I think a big problem is getting access to the specs: the consortium that created H.264 won't give you the details of how everything works without a lot of money. I've not really come across a readable description of how it all works, and there's a pretty big learning curve even in following how something like ffmpeg does everything (just understanding container formats is hard enough). The wavelet stuff looks interesting, but I'm not sure how practical Dirac is. It might be OK for archival or internal broadcast use, but it just isn't well supported by players, so you'd probably need more transcoding. It has been a few years since I looked at it, though, and at the time it was very slow (not capable of real-time encoding).
theover wrote:Just running a distributed encoder on the dual ARM cores, and connecting more boards together, is another option. Maybe there are already options for that in the open-source world. Certain computations could be singled out for optimization (even on the FPGA), like the DCT, pattern search, etc.
T.V.

Another option is to use Raspberry Pis for encoding/transcoding. You can buy MPEG-2 licences for encoding/decoding very cheaply (a couple of Euro per Pi), and H.264 encoding is already enabled on the SoC. Using the omxtx software, you can transcode (SD) MPEG-2 to H.264 at about 100fps. On the Parallella, a combination of the "transcode" package and dvd::rip could give you a good starting point for building a distributed transcode cluster. You'd still have to modify ffmpeg (which transcode uses) to take advantage of the Epiphany for accelerated transcoding, but it's handy that none of the high-level stuff would need to change (so the clustering part would come for free).
over9000
 
Posts: 98
Joined: Tue Aug 06, 2013 1:49 am

Re: distributed x264 video encoding

Postby JackTakahashi » Tue Dec 03, 2013 12:48 pm

over9000 wrote:Another option is to use Raspberry Pis for encoding/transcoding. You can buy MPEG-2 licences for encoding/decoding very cheaply (a couple of Euro per Pi), and H.264 encoding is already enabled on the SoC. Using the omxtx software, you can transcode (SD) MPEG-2 to H.264 at about 100fps. On the Parallella, a combination of the "transcode" package and dvd::rip could give you a good starting point for building a distributed transcode cluster. You'd still have to modify ffmpeg (which transcode uses) to take advantage of the Epiphany for accelerated transcoding, but it's handy that none of the high-level stuff would need to change (so the clustering part would come for free).


I didn't know the Raspberry Pi could transcode that fast! How about mkv to mkv (presumably H.264 to H.264)? Does it maintain those speeds?
You said that if you modify ffmpeg to use the Epiphany cores you would get clustering for free, but I think you'd still need a custom workload manager that divides the original video into n pieces (n = number of nodes in the Parallella cluster), sends them off to the nodes for encoding, and then merges the encoded parts back into one file (basically the split/merge steps I sketched in my first post).
That would be the approach using current software, but ideally you'd port as much as possible of the encoding part of the process (ffmpeg) to the Epiphany cores to improve cluster efficiency.
JackTakahashi
 
Posts: 2
Joined: Mon Dec 02, 2013 7:20 pm

Re: distributed x264 video encoding

Postby over9000 » Tue Dec 03, 2013 3:10 pm

JackTakahashi wrote:I didn't know the Raspberry Pi could transcode that fast! How about mkv to mkv (presumably H.264 to H.264)? Does it maintain those speeds?
You said that if you modify ffmpeg to use the Epiphany cores you would get clustering for free, but I think you'd still need a custom workload manager that divides the original video into n pieces (n = number of nodes in the Parallella cluster), sends them off to the nodes for encoding, and then merges the encoded parts back into one file (basically the split/merge steps I sketched in my first post).
That would be the approach using current software, but ideally you'd port as much as possible of the encoding part of the process (ffmpeg) to the Epiphany cores to improve cluster efficiency.

I'm not sure about Matroska, since I didn't test it. The omxtx program links against libavformat, though, so in theory it should be able to handle whatever container formats that library can handle (ie, pretty much everything). The Broadcom SoC has a hardware codec for H.264, and from what I've read you can have multiple encode/decode blocks running, with no problem doing H.264 to H.264 recoding. Besides omxtx (which is very much geared towards transcoding files locally), people have been working on an OpenMAX-based (ie, hardware-accelerated) gstreamer for the Pi. It's been a while since I looked at it, but I've read about people getting it working. It's not in the main Raspbian repo, though; a forum post by one of the mods says they weren't inclined to allow an untested (potentially insecure) package into it. If it does work, though, you could use the gstreamer primitives to build up almost any kind of media pipeline (not limited to just local transcoding).
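In that case a hardware transcode could be a very small program. Here's an untested sketch; the omxh264dec/omxh264enc element names are what I'd expect the gst-omx port to provide, so treat those as assumptions:

    /* Build with:
     * gcc transcode.c $(pkg-config --cflags --libs gstreamer-1.0) */
    #include <gst/gst.h>

    int main(int argc, char *argv[])
    {
        gst_init(&argc, &argv);

        /* Decode and re-encode H.264 in hardware, remuxing to mkv. */
        GError *err = NULL;
        GstElement *pipe = gst_parse_launch(
            "filesrc location=in.mkv ! matroskademux ! h264parse ! "
            "omxh264dec ! omxh264enc ! h264parse ! matroskamux ! "
            "filesink location=out.mkv", &err);
        if (!pipe) {
            g_printerr("pipeline error: %s\n", err->message);
            return 1;
        }

        gst_element_set_state(pipe, GST_STATE_PLAYING);

        /* Block until end-of-stream or an error. */
        GstBus *bus = gst_element_get_bus(pipe);
        gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                                   GST_MESSAGE_EOS | GST_MESSAGE_ERROR);

        gst_element_set_state(pipe, GST_STATE_NULL);
        gst_object_unref(bus);
        gst_object_unref(pipe);
        return 0;
    }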

What you're describing, splitting files up into separate pieces, is pretty much what dvd::rip does. It's really only a front-end to transcode and the command-line tools that do DVD ripping, with some extra interactive features for adjusting clipping, aspect ratio and quality settings. Once you have an image of the MPEG-2 (ie, VOB) files on disk, you can tell it to do the transcoding on a cluster of machines you set up. You just need the same software on all nodes, ssh access between them, and a shared (eg, NFS) partition where all nodes can read/write the directory holding the input/output files. It then issues commands to the slave nodes, each working on a different range of frames (or GOPs?). Since it's just a front-end, you can actually look in the log file to see which calls it makes to the underlying transcode programs, and use those as a basis for your own video processing script.
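In other words, the cluster logic itself is tiny. A very rough sketch of the pattern (with ffmpeg standing in for the real transcode invocations; a real manager like dvd::rip also tracks job state and failures):

    /* One pre-split chunk per node on a shared NFS directory; the
     * master runs one ssh job per node and waits for all of them. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static const char *nodes[] = { "node0", "node1", "node2", "node3" };
    #define NNODES (sizeof nodes / sizeof nodes[0])

    int main(void)
    {
        for (unsigned i = 0; i < NNODES; i++) {
            if (fork() == 0) {              /* one child per node */
                char cmd[256];
                snprintf(cmd, sizeof cmd,
                         "ssh %s 'cd /mnt/shared && "
                         "ffmpeg -i part%03u.mkv -c:v libx264 enc%03u.mkv'",
                         nodes[i], i, i);
                execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
                _exit(127);
            }
        }
        while (wait(NULL) > 0)              /* wait for every node */
            ;
        return 0;
    }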

To give a flavour of the difficulties of porting ffmpeg to the Parallella, have a look at the paper Optimizing the ffmpeg Library for the Cell BE (link to pdf). They focused on accelerating only a small part of the ffmpeg code (dct_quantize), but they didn't see any improvement. They mention the difficulty of offloading any higher-level functions to the SPU (more or less equivalent to an e-core on the Epiphany) due to the large structures involved (and the PS3's SPUs have 256KB of local store each, compared to the Epiphany's 32KB) and the liberal use of pointers within those structures (which are very inefficient to dereference on either the PS3 or the Epiphany). Their single-function (kernel) offload idea was probably introducing a good bit of overhead for the offload itself, so even though their SPU kernel used native SIMD instructions (with 128-bit registers for up to 16-way SIMD), it still wasn't fast enough to compensate. Thankfully, the Epiphany isn't quite as hard to program for, because it doesn't have or rely on SIMD (which, along with a few other "features", makes the PS3 difficult to code for), but you've still got the function offload overheads (which will be slightly worse given that there are 16 cores as opposed to the PS3's 6 usable SPUs) and a slower clock speed (600-800MHz? vs the Cell's 3.2GHz). The Epiphany is also worse off than the PS3's SPUs in not having hardware division, which might be needed for the DCT? I'd have to look that up; though perhaps the SPUs don't have hardware division either (again, I'd have to check). Add to that the potential for congestion on the Epiphany's interface to main RAM and you might end up with a similar outcome for the single-function offload approach. It would definitely be worth trying, though, and I'm sure it'll be an attractive project for certain types of people once they get their hands on the boards...
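To make the data-layout point concrete: the kind of kernel that suits an e-core works on small, flat, self-contained buffers, nothing like ffmpeg's pointer-heavy context structs. Something like this illustrative stand-in (a naive textbook DCT plus quantise, not ffmpeg's actual dct_quantize):

    #include <math.h>
    #include <stdint.h>

    /* Everything this touches is a flat 64-element buffer, which is
     * what lets it live in 32KB of local memory.  Naive O(n^4)
     * DCT-II with quantisation; a real port would use a fast
     * integer DCT. */
    static void dct8x8_quantize(const uint8_t in[64], int16_t out[64],
                                const uint8_t qtab[64])
    {
        for (int v = 0; v < 8; v++) {
            for (int u = 0; u < 8; u++) {
                float cu = (u == 0) ? 0.70710678f : 1.0f;
                float cv = (v == 0) ? 0.70710678f : 1.0f;
                float sum = 0.0f;
                for (int y = 0; y < 8; y++)
                    for (int x = 0; x < 8; x++)
                        sum += (in[8*y + x] - 128)
                             * cosf((2*x + 1) * u * 3.14159265f / 16.0f)
                             * cosf((2*y + 1) * v * 3.14159265f / 16.0f);
                sum *= 0.25f * cu * cv;
                /* The quantise step is a divide per coefficient, and
                 * the Epiphany has no hardware divider, so a real
                 * kernel would multiply by precomputed reciprocals. */
                out[8*v + u] = (int16_t)(sum / qtab[8*v + u]);
            }
        }
    }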
Last edited by over9000 on Tue Dec 03, 2013 11:37 pm, edited 2 times in total.
over9000
 
Posts: 98
Joined: Tue Aug 06, 2013 1:49 am

Re: distributed x264 video encoding

Postby over9000 » Tue Dec 03, 2013 3:29 pm

JackTakahashi wrote:I didn't know the Raspberry Pi could transcode that fast! How about mkv to mkv (presumably H.264 to H.264)? Does it maintain those speeds?

You might want to check out this link. It looks cool, but I think the project stalled. Anyway, they mention being able to transcode "2 SD channels on input to 3 different bitrate (so 6 streams at all)" in real time, with audio pass-through. The GPU inside the Pi really is quite powerful!
over9000
 
Posts: 98
Joined: Tue Aug 06, 2013 1:49 am

Re: distributed x264 video encoding

Postby theover » Tue Dec 03, 2013 5:40 pm

Well, I'm used to running ffmpeg on i7s (a notebook, and even an Extreme edition), and I wouldn't say H.264 encoding (with x264 or ffmpeg) gives me what I want, which is real-time 1080 HD encoding at 30 fps or more. But it compiles easily (on Linux), and it does produce usable results of high-definition quality.

The x264 tool can be used for Blu-ray type video streams, so 1080i or 1080p at 24/50/60 frames/sec, and can do so properly. You cannot necessarily patch pieces of H.264 together into one stream, because there may be inter-frame dependencies. Of course the search algorithm, the rounding to a given number of bits (with proper progressiveness) and other things (like motion analysis) can be done in different ways, and corners could be cut.

Of course this technology isn't new, so it's hardly a question of "can it be done": I've got a $300 Sony cam that pulls about 3 watts and creates stunning 1080p50 H.264 at 28 megabit/sec, and a standard Android tablet with a decent ARM processor can do reasonable real-time H.264 encoding at 720p. It's surely a good idea to reuse Raspberry Pi work on the ARM side of the Parallella board; maybe there's some nice NEON-accelerated code that can be used, without wandering into the territory of making illegal DVD/Blu-ray transcodes. More like seeing if the Parallella can maybe also get a free Mathematica license!

From both your comments I reckon you're not the kind of savvy guys I was talking about, but maybe some people you know have interesting code you can port and play around with. It would be cool to have things on the Parallella that allow fun stuff like video encoding, even if it isn't optimized to an industrial level. It would make a great undergrad teaching tool (most likely at a university) to play with the available ARM-to-Epiphany bandwidth and to work out which computations could be offloaded where. Including the FPGA would even allow some novel architectural ideas to be implemented, which could be interesting, and a RISC processor like the Epiphany cores may well give quite different results than, say, the PlayStation's stream processors or offloading to something like CUDA and NVIDIA's VDPAU.

T.V.
theover
 
Posts: 181
Joined: Mon Dec 17, 2012 4:50 pm

