Over in the "Single Board Computer" section of the Element14 community forum, we've been measuring the Ethernet throughput of as many ARM Linux boards as we can lay our hands on, and we've compiled a pretty interesting table of the results. We're using a single (and very well respected) measuring utility in all cases, in order to guarantee consistency. The table is maintained in the leading article of the thread, which also explains how to do the measurements.
When we get our own Parallella boards we'll of course put them under test ourselves, but in the meantime, since many of you already have the board, perhaps a few of you would like to follow the instructions and give us an early set of measurements to add to the table? I'll gladly record in the table any results posted either here in this thread or on the Element14 forum.
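For anyone curious about what such a measurement actually involves, here's a minimal sketch in Python of a raw TCP payload throughput test: one side listens and times the incoming stream, the other connects and sends a fixed amount of data. It's an illustration only, not the utility used for the table (please use the tool described in the thread for any results you want recorded, so the numbers stay comparable); the port number 5201 and the 1 GiB transfer size are arbitrary choices for this sketch.

import socket
import sys
import time

TOTAL_BYTES = 1 << 30   # send 1 GiB of payload in total (arbitrary choice)
CHUNK = 1 << 16         # 64 KiB per send/recv call

def receiver(host, port):
    # Listen, accept one connection, and time how fast payload arrives.
    # The receiver's figure is the meaningful one: the sender only measures
    # how fast it can hand data to its own kernel buffers.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _peer = srv.accept()
    total = 0
    start = None
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        if start is None:
            start = time.monotonic()   # start the clock at the first byte
        total += len(data)
    elapsed = time.monotonic() - start
    conn.close()
    srv.close()
    print("received %d bytes in %.2f s = %.1f Mbps payload"
          % (total, elapsed, total * 8 / elapsed / 1e6))

def sender(host, port):
    # Connect and push TOTAL_BYTES of zeros as fast as the link allows.
    buf = bytes(CHUNK)
    conn = socket.create_connection((host, port))
    start = time.monotonic()
    sent = 0
    while sent < TOTAL_BYTES:
        conn.sendall(buf)
        sent += CHUNK
    elapsed = time.monotonic() - start
    conn.close()
    print("sent %d bytes in %.2f s = %.1f Mbps payload"
          % (sent, elapsed, sent * 8 / elapsed / 1e6))

if __name__ == "__main__":
    # usage:  python3 tcp_throughput.py recv 0.0.0.0 5201
    #         python3 tcp_throughput.py send <receiver-address> 5201
    mode, host, port = sys.argv[1], sys.argv[2], int(sys.argv[3])
    receiver(host, port) if mode == "recv" else sender(host, port)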
A lot of applications use the common architectural pattern of feeding data to a machine over Ethernet, passing it to an on-board computation engine or accelerator for processing (CPUs, GPUs, FPGAs, or in our case here, Epiphany), and then passing it back out over Ethernet again. To make the most of the available computational resources, you need to know where the I/O bottlenecks lie, so that you can avoid hitting them and maximize the time spent computing, and that means measuring those bottlenecks first. This matters for any board, but especially for one like Parallella that is designed as a strong computational engine.
The maximum read and write throughputs to the host over Ethernet are only two of several limiting I/O parameters to be measured, and we'll have to quantify the others in due course: direct memory copy between host and Epiphany, DMA throughput, and Epiphany inter-core throughput will very often be of interest as well.
For now, if anyone wishes to run the measurements described in the link above, it'll be very interesting to read your results. Note that the ARM boards with gigabit Ethernet measured so far have had trouble achieving high utilization of the link, rarely reaching half of the maximum theoretical 941.482 Mbps TCP payload throughput with TCP timestamps enabled and no jumbo frames. In contrast, x86 machines frequently achieve very close to the maximum theoretical throughput. The Zynq is an unknown quantity to us at present, but hopefully it'll do better than the cheaper SoCs tested so far. Whether it turns out better or worse, its limiting throughput needs to be known either way.
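For reference, that 941.482 Mbps ceiling comes straight from the framing overhead: with the standard 1500-byte MTU, every full-size frame occupies 1538 bytes on the wire (1500 + 14-byte Ethernet header + 4-byte FCS + 8-byte preamble + 12-byte inter-frame gap) and carries 1448 bytes of TCP payload once the 20-byte IP header, 20-byte TCP header and 12-byte timestamp option are subtracted, so the best possible payload rate is 1000 Mbps x 1448 / 1538, which is approximately 941.482 Mbps.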
Happy measuring!
Morgaine.