Parallella Community

by **Melkhior** » Wed Nov 19, 2014 8:16 pm

Hello,

Just wondering if anyone already has numbers for some basic cryptography on the Epiphany, to compare with.

After a couple of days with my Parallella, I have AES-256 in CTR mode and Chacha20 running (they're both counter-based and fully parallel), but the speed is not tremendous. AES is at around 45 MB/s (including DMA'ing the data back into the ARM memory, which is a bottleneck even after trying to overlap the DMAs with the computations) vs. 16-18 MB/s for openssl on the A9. Chacha20 barely matches the A9 (around 35 MB/s, including DMA). Has anyone done similar work, and what would be the state of the art on the Epiphany?

Incidentally, it seems the fast SRAM works wonders for AES: just use the full four 32-bit forward tables. OTOH, I think the compute-heavy Chacha20 doesn't like the single-issue pipeline. Also, not having a rotation instruction hurts.

by **aolofsson** » Wed Nov 19, 2014 8:45 pm

Sounds interesting!

Have you checked out the John The Ripper project on Parallella? Similar problem domain.
https://github.com/parallella/parallell ... aster/john

-What is the compute/data ratio? What kind of data transfer rates are you seeing?
-The Epiphany core is dual issue and you can make use of the "32 bit integer mode" to increase performance. Doesn't help for shifts, but does help with mult/add/sub etc.
-Pointer to code?

Andreas

by **Melkhior** » Wed Nov 19, 2014 9:07 pm

Hello,

The compute ratio is quite good: for CTR, once a core has the starting point and the number of blocks, data only goes out of the Epiphany into the ARM (unless you want to XOR with the message). My current implementation generates 4 KiB per core and signals the ARM (via shared memory...); the ARM retrieves the 4 KiB and the core moves on to the next 4 KiB. The ARM loops over all cores to get data ASAP. There's probably a better way, hence my wondering what the state of the art is, if any.
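To make the counter bookkeeping concrete, here is a minimal sketch of how each core could work out which counter block its next 4 KiB chunk starts at. It assumes the chunks are interleaved round-robin across the cores, which the description above does not actually state; `chunk_first_block`, `CHUNK_BYTES` and the parameter names are made up for illustration.

```c
#include <stdint.h>

/* Size of the buffer each core fills before signalling the ARM. */
#define CHUNK_BYTES 4096u

/* First counter block of the chunk that `core` produces in pass `round`,
   assuming (hypothetically) that chunks are handed out round-robin:
   core 0 takes chunk 0, core 1 takes chunk 1, and so on, then wrap.
   block_bytes is 16 for AES and 64 for ChaCha20. */
static uint64_t chunk_first_block(unsigned core, unsigned ncores,
                                  uint64_t round, unsigned block_bytes)
{
    uint64_t blocks_per_chunk = CHUNK_BYTES / block_bytes;
    uint64_t chunk_index = round * ncores + core;
    return chunk_index * blocks_per_chunk;
}
```

With 16 cores and AES, core 1 would start its first chunk at block 256 and core 0 would start its second chunk at block 4096. Since every core only needs its index and the pass number, nothing has to flow from the ARM to the cores once they are started.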

Dual-issue - those algos only use XOR, ROT (i.e. a pair of shifts & an (X)OR), plus table lookups for AES and some ADDs for Chacha20. It seems gcc doesn't generate the IADD required for dual-issue (or did I misread something? Anyway, 'ezetime' says the function is sequential, but with no bubbles). Also, it's not taking advantage of the hardware loop (both have a fixed-count loop), but that's only a small part (3 cycles per iteration?).
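For reference, this is the operation mix in question, as a plain C sketch of the ChaCha quarter round following RFC 8439, section 2.1. It is not the code from this thread; the function names are illustrative, and the rotate is spelled as two shifts and an OR because there is no rotate instruction to map it to.

```c
#include <stdint.h>

/* Rotate left built from two shifts and an OR. Valid for 0 < k < 32. */
static inline uint32_t rotl32(uint32_t x, unsigned k)
{
    return (x << k) | (x >> (32u - k));
}

/* ChaCha quarter round on four words of the state:
   only ADD, XOR and fixed-distance rotates. */
static inline void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                        uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

Each of the twelve steps depends on the one before it, so within a single quarter round there is little for a second issue slot to do; any parallelism has to come from interleaving the four independent quarter rounds of a column or diagonal round.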

Code - it's still ugly as hell and about an hour old for chacha20 :-)

And it's not much use without the hash part to build AES-GCM or HS1-SIV (based on chacha20).

by **cmcconnell** » Thu Nov 20, 2014 5:51 am

by **Melkhior** » Thu Nov 20, 2014 8:01 am

I just quickly tried using inline ASM to get IADD for ADDs & LSR/IMADD for ROTs, but it doesn't change the measured speed (which is about 45 MB/s, not 35 MB/s; the 35 MB/s figure was with gcc, but I've switched the host to clang for this algo since it's better).
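The shift-plus-multiply-add form of the rotate can be written in portable C as below. This is a sketch of the arithmetic identity only: whether a given compiler turns the second function into a logical shift right followed by an integer multiply-add is an assumption, and the function names are illustrative.

```c
#include <stdint.h>

/* Usual form: two shifts and an OR. Valid for 0 < k < 32. */
static inline uint32_t rotl32_shift_or(uint32_t x, unsigned k)
{
    return (x << k) | (x >> (32u - k));
}

/* Same result with one shift and one multiply-add.
   x * 2^k (mod 2^32) has its low k bits clear, and x >> (32 - k) is
   smaller than 2^k, so adding them never carries and equals the OR. */
static inline uint32_t rotl32_shift_madd(uint32_t x, unsigned k)
{
    uint32_t low = x >> (32u - k);          /* logical shift right */
    return x * (UINT32_C(1) << k) + low;    /* multiply, then add */
}
```

Since the rotate distances in these ciphers are compile-time constants, the multiplier `1 << k` is a constant as well.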

I think I'm bottlenecked by getting the data back from the Epiphany to the ARM. The e-bandwidth-test example returns:

ARM Host --> eCore(0,0) write spead = 13.68 MB/s
ARM Host --> eCore(0,0) read spead = 6.55 MB/s
ARM Host --> ERAM write speed = 75.77 MB/s
ARM Host <-- ERAM read speed = 105.69 MB/s
ARM Host <-> DRAM: Copy speed = 195.37 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1280.04 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA) = 390.01 MB/s
eCore (0,0) --> ERAM write speed (DMA) = 240.24 MB/s
eCore (0,0) <-- ERAM read speed (DMA) = 87.61 MB/s

The cores are doing DMA writes to ERAM, then the ARM is reading, so I think I should max out at 73 MB/s at best; I get about 60% of that. But I also have the synchronisation in ERAM in between those calls :-(
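One way to arrive at roughly 73 MB/s is to treat the DMA write to ERAM and the ARM read from ERAM as two stages that handle the same bytes one after the other, so that their per-byte times add up. The sketch below assumes exactly that, with no overlap between the stages; `serial_throughput` is an illustrative name.

```c
/* Effective throughput of two transfer stages applied back to back to the
   same data. Time per byte is 1/rate for each stage, the times add, so the
   rates combine harmonically. Rates are in MB/s. */
static double serial_throughput(double rate_a, double rate_b)
{
    return 1.0 / (1.0 / rate_a + 1.0 / rate_b);
}
```

Plugging in 240.24 MB/s (eCore to ERAM, DMA write) and 105.69 MB/s (ARM reading from ERAM) gives a little over 73 MB/s, and 45 MB/s measured is about 60% of that. If the two stages could be fully overlapped, the slower stage alone (about 105 MB/s) would be the ceiling.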

Update: The ARM copy-back with e_read() is 95% of the visible time...

by **xilman** » Thu Nov 20, 2014 11:44 am

Turning to asymmetric crypto, the Epiphany is crippled.

Among the deficiencies are the lack of division and modulus operations, an inability to produce the high-order word of a 32-bit multiply, extremely poor support for carry propagation (add with carry, subtract with borrow), and the lack of unsigned arithmetic operations.
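To show what missing add-with-carry costs, here is a minimal portable C sketch of a multi-word addition where the carry has to be recovered by comparisons. It uses C's unsigned types and says nothing about the instructions a particular compiler emits for the Epiphany; `mp_add` is an illustrative name.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 32-bit limbs, least significant limb first.
   Returns the carry out of the top limb. r may alias a or b. */
static uint32_t mp_add(uint32_t *r, const uint32_t *a,
                       const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c1 = (s < carry);   /* wrapped while adding the carry-in */
        uint32_t t = s + b[i];
        uint32_t c2 = (t < s);       /* wrapped while adding b[i] */
        r[i] = t;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

With a hardware carry flag this is one instruction per limb; here each limb needs two additions, two comparisons and an OR.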

Please can these be addressed in the next release of the architecture? For the moment, only GPUs and the Xeon Phi are of much use for parallel integer arithmetic.

FWIW, my Parallella cluster is doing useful work with elliptic curve arithmetic, but it is running purely on the ARM cores. I plan to try porting it to the Epiphany, but I've already decided that a radically unusual approach will be essential. The residue number system (RNS) representation of multi-precision integers has carry-free arithmetic. Multiprecision modular remainder doesn't require division or modulo operators because Montgomery reduction is possible (that's true of the regular representation too). The sticking point is 32*32=>64 multiplication and 64/32 => 32q/32r division/modulus. For a start, lack of unsigned arithmetic reduces those 64s to 62 and the 32s to 31. Lack of division for the RNS primitives can be worked around by careful shifts and subtractions, but will still be more expensive than would be desirable.
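For the 32*32=>64 part, the usual workaround when only the low 32 bits of a product are available is to split the operands into 16-bit halves. A minimal sketch in portable C follows; it relies on C's unsigned 32-bit type and does not address the signed-only limitation mentioned above, and `mul32x32_64` is an illustrative name.

```c
#include <stdint.h>

/* Full 64-bit product of two 32-bit values, returned as (hi, lo),
   using only 32-bit multiplies and adds on 16-bit halves. */
static void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    /* Four 16x16 partial products; each fits in 32 bits. */
    uint32_t p00 = a0 * b0;
    uint32_t p01 = a0 * b1;
    uint32_t p10 = a1 * b0;
    uint32_t p11 = a1 * b1;

    /* Sum of the column at bit 16; three 16-bit terms cannot overflow. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);

    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}
```

That is four multiplies plus a handful of shifts, masks and adds for what a single widening multiply would do.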

by **cmcconnell** » Thu Nov 20, 2014 5:49 pm

by **Melkhior** » Fri Nov 21, 2014 8:01 pm

by **Melkhior** » Sun Nov 23, 2014 8:48 am

by **Melkhior** » Sun Nov 30, 2014 5:33 pm

Practical crypto on epiphany ?
