Practical crypto on epiphany ?

Generic algorithms that could be implemented in almost any language, e.g. matrix operations and FFT.

Practical crypto on epiphany ?

Postby Melkhior » Wed Nov 19, 2014 8:16 pm

Hello,

Just wondering if anyone already has numbers for some basic cryptography on the Epiphany, to compare with.

After a couple of days with my parallella, I have AES-256 in CTR mode and Chacha20 running (they're both counter-based and fully parallel), but speed is not tremendous. AES is at around 45 MB/s (including DMA'ing the data back into the ARM memory, which is a bottleneck even after trying to overlap the DMAs with the computations) vs. 16-18 MB/s for openssl on the A9. Chacha20 barely matches the A9 (around 35 MB/s, including DMA). As anyone done similar work and what would be the state of the art on the Epiphany?

Incidentally, it seems the fast SRAM works wonder for AES, just use the full 4 32-bits forward tables. OTOH, the compute-heavy Chacha20 doesn't like the single-issue pipeline I think. Also, not having a rotation instruction hurts.
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Practical crypto on epiphany ?

Postby aolofsson » Wed Nov 19, 2014 8:45 pm

Sounds interesting!

Have you checked out the John The Ripper project on Parallella? Similar problem domain.
https://github.com/parallella/parallell ... aster/john

-What is the compute/data ratio? What kind of data transfer rates are you seeing.
-The Epiphany core is dual issue and you can make use of the "32 bit integer mode" to increase performance. Doesn't help for shifts, but does help with mult/add/sub etc.
-Pointer to code?

Andreas
User avatar
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Practical crypto on epiphany ?

Postby Melkhior » Wed Nov 19, 2014 9:07 pm

Hello,

Compute ratio is quite good - basically for CTR, once you have the starting point and the number of blocks data only goes out of the Epiphany into the ARM (unless you want to XOR with the message). My current implementation generates 4 KiB per core and signal the ARM (via shared memory...), ARM recovers the 4 KiB and the core move on to the next 4 KiB. ARM loop on all cores to get data ASAP. There's probably a better way. Hence me wondering what is the state of the art, if any.

Dual-issue - those algos only use XOR, ROT (i.e. a pair of shifts & a (X)OR) and for AES table lookup and for Chacha20 some ADD. Seems gcc doesn't generate the IADD required for dual-issue (or did I misread something ? anyway 'ezetime' says the function is sequential, but with no bubbles). Also, it's not taking advantage of the hardware loop (both have a fixed-count loop), but that's only a small part (3 cycles per iteration ?).

Code - it's still ugly as hell and about an hour old for chacha20 :-) And it's not much use without the hash part to build AES-GCM or HS1-SIV (based on chacha20).
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Practical crypto on epiphany ?

Postby cmcconnell » Thu Nov 20, 2014 5:51 am

I don't have any comparable numbers to share, but I have been doing some related work.

I've been working on improvements to the cgminer code that was ported to Parallella as a GSoC project. I was able to speed up the asm version of the salsa20_8 function a great deal, by getting as much dual issue as possible, and avoiding pipeline stalls. (The original asm version was actually a lot slower than the C version.)

My work on this has been on hold for a few weeks, but I hope to return to it some time soon. The interface between ARM and Epiphany is another thing that is slowing the original code down. I've sketched out a design for how to re-do it, but have yet to start work on this (and the new SDK might offer some useful new features to help in this area.)

For comparison, the original C version delivers about 1.3 Kh/s on an E16, my updated Salsa version (with a few other tweaks) gives about 2.0 Kh/s, but if I comment out the Scrypt function call altogether, so that each hash is notionally being computed instantaneously, cgminer still only reports about 8 Kh/s (if I remember correctly), because of all the ARM/Epiphany communications overhead.

I think around 3Kh/s ought to be achievable, once both the comms and the hashing are fully optimised.

I'm not sure if ChaCha can be optimised to quite the same extent as Salsa on the Epiphany, but I've been intending to look into it. I hope to add support for 'neoscrypt' to cgminer, which uses both Salsa20/20 and ChaCha20/20.


What I found with Salsa is this -

The C version generates 5 IALU instructions (so requires 5 cycles) for the basic computation that gets done four times in a quarterround -
Code: Select all
ADD, LSR, LSL, ORR, EOR


In asm, this can be reduced to 4 instructions - 2 IALU and 2 FPU (IALU2) -
Code: Select all
IADD, LSR, IMADD, EOR


So with full use of both pipelines, that would be 2 cycles, instead of 5. (Interleaving sequences of instructions, so that there are no stalls due to register dependencies).

But in order to achieve this you would need to work on two Salsa instances in parallel. For Scrypt, that would mean two Scrypt instances, and a halving of the scratchpad size. The scratchpad issue means it's faster to have a single Scrypt instance, with a slightly slower Salsa implementation. I came up with a scheme that gives an average of about 2.8 cycles, mixing elements of both the above instruction sequences.


I hope all the above makes sense. I don't have a background in crypto, so I've very much been learning as I go along (both about crypto and about the Parallella/Epiphany).
Last edited by cmcconnell on Thu Nov 20, 2014 9:53 am, edited 1 time in total.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Practical crypto on epiphany ?

Postby Melkhior » Thu Nov 20, 2014 8:01 am

I just quickly tried using inline ASM to get IADD for ADDs & SLR/IMADD for ROTs, but it doesn't change the measured speed (which is about 45MB/s not 35MB/s, this last one is for gcc but I've switched the host to clang for this algo since it's better).

I think I'm bottlenecked by getting the data back from the Epiphany to the ARM. The e-bandwidth-test example returns:

ARM Host --> eCore(0,0) write spead = 13.68 MB/s
ARM Host --> eCore(0,0) read spead = 6.55 MB/s
ARM Host --> ERAM write speed = 75.77 MB/s
ARM Host <-- ERAM read speed = 105.69 MB/s
ARM Host <-> DRAM: Copy speed = 195.37 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1280.04 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA) = 390.01 MB/s
eCore (0,0) --> ERAM write speed (DMA) = 240.24 MB/s
eCore (0,0) <-- ERAM read speed (DMA) = 87.61 MB/s


The core are doing DMA writes to ERAM, then the ARM is reading, so I think I should max out at 73 MB/s at best, I get 60% of that. But I also have the synchronisation in ERAM in between those calls :-(

Update: The ARM copy-back with e_read() is 95% of the visible time...
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Practical crypto on epiphany ?

Postby xilman » Thu Nov 20, 2014 11:44 am

Turning to asymmetric crypto, the Epiphany is crippled.

Among the deficiencies are lack of division and modulus operations, an inability to produce the high-order word of a 32-bit multiply, extremely poor carry propagation (add with carry, subtract with borrow) support, and the lack of unsigned arithmetic operations.

Please can these be addressed in the next release of the architecture? For the moment, only GPUs and the Xeon Phi are of much use for parallel integer arithmetic.

FWIW, my Parallella cluster is doing useful work with elliptic curve arithmetic but it is running purely on the ARM cores. I plan to try porting it to the Epiphany but I've already decided that a radically unusual approach will be essential. The residue number system (RNS) representation of multi-precise integers has carry-free arithmetic. Multiprecision modular remainder doesn't require division or modulo operators because Montgomery reduction is possible (that's true of regular representation too). The sticking point is 32*32=>64 multiplication and 64/32 => 32q/32r division/modulus. For a start, lack of unsigned arithmetic reduces those 64s to 62 and the 32s to 31. Lack of division for the RNS primitives can be worked around, by careful shifts and subtractions, will still be more expensive than would be desirable.
xilman
 
Posts: 80
Joined: Sat May 10, 2014 8:10 pm
Location: UK

Re: Practical crypto on epiphany ?

Postby cmcconnell » Thu Nov 20, 2014 5:49 pm

Melkhior wrote:I just quickly tried using inline ASM to get IADD for ADDs & SLR/IMADD for ROTs, but it doesn't change the measured speed (which is about 45MB/s not 35MB/s, this last one is for gcc but I've switched the host to clang for this algo since it's better).

You need to use interleaving to avoid register dependencies, otherwise the IADD and IMADD instructions will actually slow things down, rather than speed them up. (I couldn't find a way to get the C compiler to generate the desired interleaving of inline instructions, though it may be possible; it had to be a full, hand-crafted assembly routine.)

Melkhior wrote:I think I'm bottlenecked by getting the data back from the Epiphany to the ARM. The e-bandwidth-test example returns: [...]

But it sounds like this bottleneck is the bigger problem.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Practical crypto on epiphany ?

Postby Melkhior » Fri Nov 21, 2014 8:01 pm

Hello,

Getting the data back is still an issue. Even making the transfer size bigger (host-initiated from ERAM) did not help. I upped the shared memory size so the cores can dump more than one page (8 KiB) to it before stopping, but I'm still stuck at around 45 MB/s for both algos. And the code occasionally hang, and I can't figure out why.

On the other hand, thanks to yanidubin great tutorials, I've managed to get some GCM out of the FPGA. I'm quite pleased, since I'd never touched an FPGA or written VHDL before. It's using the AXI4-lite interface which may slow things down. It isn't very fast, as shown below. Also, my control register doesn't really work as intended. But I'm learning :-)

Code: Select all
ref: 1048576 bytes in 2.147383 seconds -> 0.488304 MB/s
ref: 1048576 bytes in 2.148490 seconds -> 0.488053 MB/s
ref: 1048576 bytes in 2.147365 seconds -> 0.488308 MB/s
neon: 1048576 bytes in 0.045786 seconds -> 22.901673 MB/s
neon: 1048576 bytes in 0.045602 seconds -> 22.994079 MB/s
neon: 1048576 bytes in 0.045620 seconds -> 22.985007 MB/s
fpga: 1048576 bytes in 0.202072 seconds -> 5.189121 MB/s
fpga: 1048576 bytes in 0.202090 seconds -> 5.188659 MB/s
fpga: 1048576 bytes in 0.202072 seconds -> 5.189121 MB/s


'ref' is the pure-C reference implementation (lifted from supercop), to validate everything else.
'neon' is my own intrinsics-based implementation (using the algorithm by Camara et al.), written for an A15 not an A9.
'fpga' is just a textbook (literally) implementation of GCM in the memory-mapped registers of the AXI4-lite core.

Now, I'm impatiently awaiting the tutorial on AXI4-stream :-)
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Practical crypto on epiphany ?

Postby Melkhior » Sun Nov 23, 2014 8:48 am

Melkhior wrote:It's using the AXI4-lite interface which may slow things down.


It definitely does, since this is what I get after minimizing the number of loads and stores from the A9 to the memory-mapped registers:

Code: Select all
fpga: 1048576 bytes in 0.060199 seconds -> 17.418495 MB/s
fpga: 1048576 bytes in 0.060218 seconds -> 17.412999 MB/s
fpga: 1048576 bytes in 0.060199 seconds -> 17.418495 MB/s


Still not as fast as the A9 itself, but much better.
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm

Re: Practical crypto on epiphany ?

Postby Melkhior » Sun Nov 30, 2014 5:33 pm

Melkhior wrote:Update: The ARM copy-back with e_read() is 95% of the visible time...

I had tried mmap()ing the shared memory (using the appropriate offset in /dev/mem) and doing memcpy(), but it didn't help, my code was still maxing out at less than 50 MB/s for the SharedRAM-> DRAM copies.
However, replacing the stock memcpy() by a quick'n'dirty homegrown one (with unrolled NEON multi-registers load/store) on the mmap(), I get over 100 MB/s of data recovery for both AES-256 and Chacha20.
Code: Select all
Checking 'aes256' for 8388608 Bytes, without xor after Epiphany
Host impl. : 0.62791 s /  13.4 MB/s
Epip impl. : 0.07592 s / 110.5 MB/s
Ref  impl. : 0.47350 s /  17.7 MB/s
Checking 'aes256' for 8388608 Bytes,
Host impl. : 0.64833 s /  12.9 MB/s
Epip impl. : 0.13146 s /  63.8 MB/s
Ref  impl. : 0.47354 s /  17.7 MB/s
Checking 'chacha20' for 8388608 Bytes, without xor after Epiphany
Host impl. : 0.18299 s /  45.8 MB/s
Epip impl. : 0.07489 s / 112.0 MB/s
Checking 'chacha20' for 8388608 Bytes,
Host impl. : 0.18298 s /  45.8 MB/s
Epip impl. : 0.13098 s /  64.0 MB/s

'Host impl' is my own ARM implementation (AES is not optimized), 'Ref impl' is OpenSSL, 'Epip impl.' is on the Epiphany (including getting the data back).
The 'without xor' results are the raw generation/recovery speed for 'Epip impl'. Comparable implementations are the ones which include XOR.
Melkhior
 
Posts: 39
Joined: Sat Nov 08, 2014 12:19 pm


Return to Algorithms

Who is online

Users browsing this forum: No registered users and 3 guests

cron