Ethernet throughput measurements


Postby Morgaine » Thu Oct 10, 2013 1:13 am

Over in the "Single Board Computer" section of the Element14 community forum, we've been measuring the Ethernet throughput of as many ARM Linux boards as we can lay our hands on, and we've compiled a pretty interesting table of the results. We're using the single (and very well respected) nuttcp measuring utility in all cases, in order to guarantee consistency. The table is maintained in the leading article of the thread "SBC Network Throughput", which also explains how to do the measurements.

When we get our own Parallella boards then of course we'll put them under test ourselves, but in the meantime, since many of you already have the board, perhaps a few of you would like to follow the instructions and give us an early set of measurements to add to the table? I'll gladly record in the table any results posted either here in this thread or on the Element14 forum.
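For anyone who wants to dive straight in, the procedure boils down to running a nuttcp server on one end and a transmit/receive pair of runs from the other; full details are in the Element14 thread. A minimal sketch (the 192.168.1.50 address below is just a placeholder for your Parallella's IP):

Code: Select all
# On the Parallella, start nuttcp in server mode:
nuttcp -S

# On the other machine, transmit to the board, then receive from it:
nuttcp -t 192.168.1.50
nuttcp -r 192.168.1.50
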

A lot of applications use the common architectural pattern of feeding data to a machine over Ethernet, passing it to an on-board computation engine or accelerator for processing (CPUs, GPUs, FPGAs, or in our case here, Epiphany), and then passing the results back out over Ethernet again. To make the most of the available computational resources, it's important to know where the I/O bottlenecks lie so that they can be avoided and the time spent computing maximized, and to do that those bottlenecks have to be measured first. That's important generally, but especially so when a board like Parallella is designed as a strong computational engine.

The maximum read and write throughputs to the host over Ethernet are only two of several limiting I/O parameters to be measured, and we'll have to quantify the others too in due course --- direct memory copy between host and Epiphany, DMA throughput and Epiphany inter-core throughput will very often be of interest too.

For now, if anyone wishes to run the measurements described in the link above, it'll be very interesting to read your results. Note that the ARM boards with gigabit Ethernet measured so far have had trouble achieving high utilization of the link, rarely reaching half of the maximum theoretical 941.482 Mbps TCP payload throughput with TCP TimeStamps enabled and no jumbo frames. In contrast, x86 machines frequently achieve very close to the maximum theoretical throughput. The Zynq is an unknown quantity to us at present, but hopefully it'll be better than the cheaper SoCs tested. Whether better or poorer, its limiting throughput needs to be known anyway.
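(For anyone wondering where 941.482 Mbps comes from: with the standard 1500-byte MTU, each frame carries 1448 bytes of TCP payload once the 20-byte IP header, 20-byte TCP header and 12 bytes of timestamp options are subtracted, while occupying 1538 bytes on the wire once the Ethernet header, FCS, preamble and inter-frame gap are added. A quick back-of-envelope check:)

Code: Select all
# TCP payload fraction of on-the-wire bytes, scaled to a 1000 Mbps link
echo "scale=3; 1448 * 1000 / 1538" | bc
# => 941.482
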

Happy measuring! :-)

Morgaine.
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

Re: Ethernet throughput measurements

Postby notzed » Thu Oct 10, 2013 4:49 am

The software didn't run very well - it wouldn't talk between certain machines in some directions. I used the 6.1.2 version I compiled from source on each machine. The parallella compiler spewed a whole pile of pointless warnings which are on by default there for some reason.

But FWIW here's what I got.

Everything is plugged into the same 5 port switch.

Server: workstation class desktop w/ 64-bit os, client: parallella 16:

Code: Select all
xx@linaro-ubuntu-desktop:~/tar/nuttcp-6.1.2$ ./nuttcp-6.1.2 -t 192.xx
  429.8125 MB /  11.11 sec =  324.4873 Mbps 100 %TX 4 %RX 0 retrans 0.53 msRTT
xx@linaro-ubuntu-desktop:~/tar/nuttcp-6.1.2$ ./nuttcp-6.1.2 -r 192.xx
  398.9078 MB /   9.00 sec =  371.9240 Mbps 2 %TX 92 %RX 0 retrans 0.35 msRTT


But the -r result is wildly variable; on a few runs I had results ranging from about 130 to 400 Mbps. On CPU-related benchmarking I notice a fairly wide range of results too, but it's normally between a 'faster' and a 'slower' result.

Server: parallella 16, client: 64-bit workstation:

Code: Select all
approx 380 / 325Mbps respectively.


Server: parallella 16, client: thinkpad laptop running 32-bit os:

Code: Select all
400-500Mbps or thereabouts but only the -t test worked.


Server: 64-bit workstation, client thinkpad laptop (just to check switch):

Code: Select all
about 935Mbps but again only the -t test worked.
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Ethernet throughput measurements

Postby aolofsson » Thu Oct 10, 2013 5:50 am

morgaine,
Great suggestion. Thanks for joining the forum! Xilinx has published an app note showing ethernet performance for the zynq chip: http://www.xilinx.com/support/documenta ... nq-eth.pdf

notzed,
Thanks! These agree with the numbers we have seen on the Parallella with 'iperf'. I don't think jumbo frames are turned on by default, so hopefully we will be able to push the performance closer to the 700 Mbps+ in the Xilinx app note. (Haven't gotten to the fun part of tuning the platform yet.. :( )

Andreas
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Ethernet throughput measurements

Postby Morgaine » Thu Oct 10, 2013 5:52 am

That's extremely odd, notzed. I'm currently using both the stable release nuttcp-6.1.2 and the latest development version nuttcp-7.2.1, and none of your issues appear here across a wide range of equipment (I've reported on only a fraction of my tests in the thread). What's more, I've been using the predecessors of nuttcp all the way back to early ttcp for decades, and nothing like what you report has ever appeared. I also have both native and cross-compiled versions, and they're all well behaved and show no significant variation in reported figures.

So, I'm puzzled. :-(

notzed wrote:The software didn't run very well - it wouldn't talk between certain machines in some directions. I used the 6.1.2 version I compiled from source on each machine. The parallella compiler spewed a whole pile of pointless warnings which are on by default there for some reason.


The latter is probably something to ask Embecosm. I'm not sure if they did the compiler port for the Zynq's ARM or if it's just generic gcc for Cortex-A9, but I'm sure they would know.

The issue of not working between pairs of machines is almost certainly a local networking problem though. That utility has been used on such a huge range of equipment all the way up to real HPC supercomputers using 10gig ether that software problems would not be expected, at least for the stable 6.1.2.


Code: Select all
xx@linaro-ubuntu-desktop:~/tar/nuttcp-6.1.2$ ./nuttcp-6.1.2 -t 192.xx
  429.8125 MB /  11.11 sec =  324.4873 Mbps 100 %TX 4 %RX 0 retrans 0.53 msRTT
xx@linaro-ubuntu-desktop:~/tar/nuttcp-6.1.2$ ./nuttcp-6.1.2 -r 192.xx
  398.9078 MB /   9.00 sec =  371.9240 Mbps 2 %TX 92 %RX 0 retrans 0.35 msRTT


Thanks, that gives us an initial 371.92 / 324.49 Mbps pair for Rx/Tx, but it's in the ballpark of the i.MX6 and AM3359, which I find surprising since they're both low-end mass-market SoCs. Still, we can take that as an initial data point and work on it. But read on ...

But the -r result is wildly variable, on a few runs I had results ranging from about 130 to 400. On cpu related benchmarking I notice a fairly wide range of results too but it's normally between a 'faster' and 'slower' result.


If I were a betting person, I would bet that you have something unknown running on one of the two machines or on both, and/or something unknown is using your LAN bandwidth, or possibly that your switch is starting to die or one of your NICs is going postal and affecting the switch. That sort of variation just does not happen with nuttcp under controlled conditions. Variations are normally only in the 3rd significant digit, somewhere around 0.4% here.

Server: parallella 16, client: 64-bit workstation:

Code: Select all
approx 380 / 325Mbps respectively.


Server: parallella 16, client: thinkpad laptop running 32-bit os:

Code: Select all
400-500Mbps or thereabouts but only the -t test worked.


Server: 64-bit workstation, client thinkpad laptop (just to check switch):

Code: Select all
about 935Mbps but again only the -t test worked.


Well, you have a complete networking block in one direction that needs resolving, but leaving that aside for now, at least the last test proves that both your x86 machines can receive at full gigabit rate in one direction, so they're not artificially limiting the earlier tests in the same direction. It also suggests that the Parallella tests can be trusted once whatever is causing the variability is eliminated.

I'll add your figures to the table, but frankly I don't think they'll stand once you figure out what's causing the variation and retest. The fact that ThinkPad to Parallella yields 400-500 Mbps immediately casts doubt on the initial pair of Rx/Tx figures.
Last edited by Morgaine on Thu Oct 10, 2013 7:23 am, edited 1 time in total.
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

Re: Ethernet throughput measurements

Postby Morgaine » Thu Oct 10, 2013 6:26 am

Thanks Andreas, I've posted on the blog before, but here I've been just reading and learning. :-)

That's a good application note from Xilinx, thank you. It seems that we'll be getting less than half gigabit link utilization out of the Zynq without jumbo frames, which I find a bit surprising for this high-end device, but in the end it just means that the Epiphany workload will need a compute-to-communicate time ratio twice as high as it would have been with full gigabit link utilization. It's important to know this, and it will have to be factored into usage plans, otherwise Epiphany would be starved by the external link. (This applies only to that specific architecture of data flow, albeit a common one.)
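(To put a rough number on that: moving a given buffer over a ~470 Mbps link takes twice as long as over a ~941 Mbps one, so the computation on that buffer has to last at least twice as long to hide the transfer. A back-of-envelope example for a hypothetical 100 MB (800 Mb) working set:)

Code: Select all
echo "scale=3; 800 / 941.482" | bc   # ~0.85 s to move it at full gigabit utilization
echo "scale=3; 800 / 470" | bc       # ~1.70 s at roughly half utilization
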

That's a pretty evil blip at 1494 bytes as the packet gets split. Some applications may benefit from avoiding it explicitly.
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

Re: Ethernet throughput measurements

Postby Morgaine » Thu Oct 10, 2013 6:42 am

notzed, if you manage to get the other direction to work as well between your x86 and amd64 machines, and if it yields a solid 935 or so Mbps in that direction as well using the same switch, then the situation changes dramatically. It then becomes much more likely that something is periodically taking time away from Ethernet handling on the Parallella itself. On such a new board it could be anything: rogue interrupts, bus arbitration, DMA, FPGA polling if any, noise/RF on floating inputs, and many other things.

I think it will be very informative to see other measurements as they come in.

===

PS. Although it matters little when link utilization is this low, I ought to record whether you have TCP Timestamps enabled on the hosts at the two ends of the link:

Code: Select all
cat  /proc/sys/net/ipv4/tcp_timestamps


If they're enabled (1) at both ends or disabled (0) at both ends then it's clear what will happen (default in most distros is enabled), but if the two ends don't agree then it's not entirely certain whether the sender prevails or not. It's best to enable them at both ends since they provide important functionality, except perhaps when squeezing out the maximum performance figures purely for fun.
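(Toggling it is just a sysctl, for anyone who wants to match the two ends explicitly; a sketch, assuming root access:)

Code: Select all
# enable (1) or disable (0) TCP timestamps on a host
sudo sysctl -w net.ipv4.tcp_timestamps=1
# or equivalently:
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_timestamps
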
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

Re: Ethernet throughput measurements

Postby notzed » Thu Oct 10, 2013 10:13 am

Looks like my laptop is running with the firewall enabled, so its results should be ignored. I installed it 3 years ago and obviously haven't touched it since, because it just works.

As I said, even CPU-bound benchmarks on the ARM on the Parallella show major differences in runtime that should not be there. I do not know why this is so, but as it is not practically interfering with my experiments I haven't followed it up. Could be a dud board for all I know.

e.g. two runs against localhost.
$ for v in -r -t ; do echo $v ; for n in 1 2 3 4 5 ; do ./nuttcp-6.1.2 $v localhost ; done ; done
-r
3085.6250 MB / 10.04 sec = 2579.3697 Mbps 99 %TX 78 %RX 0 retrans 0.53 msRTT
3094.8750 MB / 10.03 sec = 2587.4759 Mbps 98 %TX 75 %RX 0 retrans 0.17 msRTT
3379.4375 MB / 10.03 sec = 2825.3472 Mbps 99 %TX 69 %RX 0 retrans 0.18 msRTT
3364.9375 MB / 10.03 sec = 2812.9616 Mbps 100 %TX 71 %RX 0 retrans 0.18 msRTT
3380.8750 MB / 10.03 sec = 2826.5820 Mbps 100 %TX 72 %RX 0 retrans 0.17 msRTT
-t
1925.1875 MB / 10.02 sec = 1612.2870 Mbps 69 %TX 100 %RX 0 retrans 0.17 msRTT
1911.5625 MB / 10.02 sec = 1600.8923 Mbps 70 %TX 100 %RX 0 retrans 0.17 msRTT
1937.5469 MB / 10.02 sec = 1622.6185 Mbps 70 %TX 100 %RX 0 retrans 0.17 msRTT
3573.5000 MB / 10.04 sec = 2987.1275 Mbps 100 %TX 73 %RX 0 retrans 0.18 msRTT
1925.7969 MB / 10.02 sec = 1612.8107 Mbps 70 %TX 100 %RX 0 retrans 0.17 msRTT

$ for v in -r -t ; do echo $v ; for n in 1 2 3 4 5 ; do ./nuttcp-6.1.2 $v localhost ; done ; done
-r
1891.8750 MB / 10.02 sec = 1584.2628 Mbps 69 %TX 100 %RX 0 retrans 0.41 msRTT
3432.9375 MB / 10.03 sec = 2870.1040 Mbps 99 %TX 73 %RX 0 retrans 0.17 msRTT
3361.6875 MB / 10.03 sec = 2810.5122 Mbps 100 %TX 69 %RX 0 retrans 0.17 msRTT
3542.5625 MB / 10.03 sec = 2961.4592 Mbps 100 %TX 73 %RX 0 retrans 0.17 msRTT
3667.4375 MB / 10.03 sec = 3066.1727 Mbps 99 %TX 72 %RX 0 retrans 0.18 msRTT
-t
1931.8750 MB / 10.02 sec = 1617.8333 Mbps 69 %TX 100 %RX 0 retrans 0.17 msRTT
1946.2500 MB / 10.02 sec = 1629.8443 Mbps 68 %TX 100 %RX 0 retrans 0.15 msRTT
1928.4688 MB / 10.02 sec = 1615.0483 Mbps 68 %TX 99 %RX 0 retrans 0.15 msRTT
1928.4844 MB / 10.02 sec = 1615.0561 Mbps 68 %TX 100 %RX 0 retrans 0.17 msRTT
1910.0625 MB / 10.02 sec = 1599.5827 Mbps 67 %TX 100 %RX 0 retrans 0.15 msRTT
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Ethernet throughput measurements

Postby Morgaine » Thu Oct 10, 2013 7:31 pm

Thanks for the tests against localhost on the Parallella. Although I'm not recording them because they don't measure Ethernet throughput, they do give some idea of the overheads in the protocol stack when the communications hardware is eliminated from the path, and so they're a worthwhile sanity check to perform. If these numbers observed for localhost were close to those measured over Ethernet then the latter would be suspect, since throughput would be limited by the protocol stack instead of by the external path.

I think these last results mean that your measurements are safe, but still affected by whatever is causing your variability.

Obvious interfering candidates such as known running processes should be eliminated first when performing measurements. The big elephant in the room is the browser, since in today's Javascript-infested web both the CPU and the network will be in continual use unless Javascript has been turned off or blockers like NoScript are used well to control it. The security-conscious won't be allowing Javascript to run anyway, but when performing measurements involving the local machine it's best to go the whole way and terminate the browser completely. Even better is to use bare machines without desktops for the test.

If you're using headless machines over ssh, be aware that this dual use of the link means keepalive packets may add a small amount of variability to the results, and any output sent over ssh during measurement will cause process rescheduling. In your one-liner loops, add a short sleep after each nuttcp terminates, to add confidence that the result line isn't being sent over the link while the next iteration is already running another nuttcp.
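(Something along these lines, with $HOST standing in for the target address and an arbitrary two-second pause, is what I mean:)

Code: Select all
for v in -r -t ; do
    echo $v
    for n in 1 2 3 4 5 ; do
        ./nuttcp-6.1.2 $v "$HOST"
        sleep 2   # let the result line drain over ssh before the next run
    done
done
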

No firewall should be running anywhere in the path since that naturally adds overhead to communications. Using an in-service firewall machine directly as one endpoint for the tests would be especially inappropriate unless one is trying to evaluate the impact of service traffic on purpose. (Never leave the tool around on a public facing machine anyway, for obvious security reasons.)
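(On a Linux endpoint, a quick way to confirm nothing is filtering is to list the active iptables rules; empty chains with ACCEPT policies mean no firewall is in the way:)

Code: Select all
sudo iptables -L -n -v
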

CPU governors can have an impact on measurements as well, so be sure that both ends used for a test are using the performance governor. (Check with "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".)
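(On most Linux systems you can switch governors through the same sysfs interface; a sketch, assuming the usual cpufreq layout and more than one core:)

Code: Select all
# check, then switch every core to the performance governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do
    echo performance | sudo tee $g
done
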

With a tool like htop showing a load average of 0.0x and no hidden use of bandwidth such as by forwarded packets, only the third digit of precision would normally show significant variation. (Terminate htop before running the test.) Variation in the 2nd digit of precision can still give us useful ballpark figures, but variation in the 1st digit, as you have seen, is a sign of an uncontrolled test environment and isn't actually measuring the Ethernet performance.
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

Re: Ethernet throughput measurements

Postby Morgaine » Thu Oct 10, 2013 9:43 pm

Andreas, since my last reply to you, we've been analysing Xilinx's application note in more depth over in that Element14 thread and I now agree with you that we should be expecting nuttcp to report 700Mbps+ or thereabouts using the standard MTU. The way that Xilinx was using Netperf generates curves that are asymptotic to the single figure produced by nuttcp (nuttcp keeps the pipe full and doesn't suffer from mixed-size frames on the link).

Consequently it's the top end of those Xilinx curves that are most meaningful when quantifying Ethernet throughput, and hence the figure of 700Mbps+ without jumbo frames. This matches Xilinx's bar graphs for 1500-byte MTU on page 13, so I'm happier with my understanding of the application note now. 700Mbps+ is indeed our target for standard frames on Parallella.

We really need more nuttcp results to eliminate variables from notzed's measurements. By the way, iperf is in good agreement with nuttcp on the boards I've tested, although I prefer to record nuttcp results for all boards to eliminate one possible variable --- just good practice.
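(For anyone wanting to cross-check one tool against the other, the classic iperf2 invocation has much the same shape as the nuttcp one; a sketch, with <server-ip> as a placeholder:)

Code: Select all
iperf -s                      # on the server end
iperf -c <server-ip> -t 10    # on the client end, 10-second run
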
Morgaine
 
Posts: 42
Joined: Tue Jul 02, 2013 8:29 pm

