MEng Project - cores only demonstrate 200MHz performance

Hi everyone,
I'm finishing up my MEng project, which involved porting an artificial developmental system (ADS) to the Parallella. The original software implementation represents an organism as a 2D array of cells and performs development sequentially, with nested loops running over the x and y axes. The model is based on an artificial gene regulatory network (GRN) and is basically a more complex version of a cellular automaton. The idea was that by porting this ADS onto hardware, where a single core represents a single cell, the x and y loops could be discarded and a performance boost should result.
After successfully porting the model I discovered that the base time, i.e. the time taken for development to occur internally within a cell, is just ever so slightly longer than in the original software implementation. If I then enable the core-to-core data transfers (so as to exchange chemicals and perform gene regulation), even more time is added on.
My question is this: the eCores should run at around 600 to 700 MHz, correct? If the Parallella version takes approximately the same time (ever so slightly longer) as the software implementation, which ran on a 3.2 GHz CPU, then that means the eCores demonstrated a performance of only 200 MHz each (3.2 GHz / 16 cores). Would anyone know why this is? The developmental model is complex, in that a lot goes on: it revolves around modulo operations, divisions, additions, subtractions and moving data about, but nothing out of the ordinary that would explain only 200 MHz performance. Have I missed something or done something wrong, such as using inadequate command line options while compiling? I used the matmul-16 template and simply copied and pasted the compiler command lines from there. Or is my comparison inaccurate?
Any hints as to why this is would help me finalise the results section of my report.