However, unlike desktop and server CPUs, mobile CPU and GPU vendors tend to do very little architectural disclosure - a fact that we've been working hard to change over the past few years. Oftentimes all that's available are marketing slides with fuzzy performance claims. This situation frustrates me to no end personally. We've done quite a bit of low-level mobile CPU analysis at AnandTech in pursuit of understanding architectures for which there is no publicly available documentation. In this spirit, I wrote a few synthetic tests to better understand the performance of current-gen ARM CPU cores without having to rely upon vendor-supplied information.
For this article I'm focusing exclusively on floating point performance. We will look at five CPU cores today; the test devices are listed below. I wanted to test the instruction throughput of various floating point instructions, so I wrote a simple benchmark consisting of a loop with a large number of iterations. The loop body consisted of many (say, 20) floating point instructions with no data dependence between them. There were no memory instructions inside the loop, and thus memory performance was not an issue.
There were minimal dependencies in the loop body. I tested the performance of scalar addition, multiplication and multiply-accumulate for 32-bit and 64-bit floating point datatypes. Apart from testing the throughput of individual instructions, I also wrote a test for measuring the throughput of a program consisting of two types of instructions, e.g. additions mixed with multiplications. The instructions were interleaved, i.e. the additions and multiplications alternated, and there were no dependencies between the additions and the following multiplies.
You may be wondering about the reasoning behind this mixed test. Some designs split floating point work across multiple execution units, and each unit may handle only a subset of operations: for example, one floating point unit may only support addition while another may only support multiplication. Thus, if we only test the additions and multiplications separately, we will not see the peak throughput on such a machine.
We perform the mixed test to identify such cases. All the tests mentioned above measure the amount of time taken to execute a particular number of instructions, from which we get the instructions executed per second. Knowing the peak frequency of the device is not enough, because CPUs have multiple frequency states and the tests may not be running at the advertised peak speeds. Thus, I also wrote code to monitor the percentage of time spent in each frequency state as reported by the kernel.
However, as it turns out, the method I used for measuring the time spent in each frequency state does not work on aSMP designs like the two Krait based Snapdragons tested here. For the Snapdragon S4's Krait, the results reported here are for an MSM chip which shouldn't really have thermal throttling issues.
My results on that MSM chip also line up quite neatly with the assumption that the CPU spent most or all of its time in the test in the peak frequency state. Brian also ran the test on a Nexus 4 and the results were essentially identical, as both have the same peak, which is additional confirmation that our results are likely correct. Thus, in the calculations below, I will assume the chip ran at its peak frequency. Results on the other Krait based Snapdragon, however, are more mixed. I am not sure it is reaching peak frequency in all the tests, and thus I am less sure of the per-cycle estimates on this chip.
Brian also ran the tests on another handset (the LG Optimus G Pro) with the same Snapdragon, and the results were qualitatively very similar. First up, the raw data collected from the tests, in gigaflops:
Before we discuss the results, it is important to keep in mind that the results and per-cycle timing estimates reported here are what I observed from my tests.
I did my best to ensure that the design of the tests was very conducive to achieving high throughput. However, it is possible there may be some cases where an architecture can achieve higher performance than what I was able to get out of my tests. With that out of the way, let's look at the results. In the data, we need to distinguish between the number of instructions and the number of flops. I count a scalar addition or multiply as one flop and a scalar MAC as two flops. Thus, we get the following per-cycle instruction throughput estimates:
We start with the Cortex A9. The mixed test reveals that though fp64 muls can only be issued every two cycles, the Cortex A9 can issue an fp64 add in the otherwise empty pipeline slot. Qualcomm's scalar floating point implementation in Scorpion is similar to the Cortex A9's, except that it seems unable to issue fp64 adds immediately after fp64 muls in the mixed test. Interestingly, Krait has half the per-cycle throughput for MAC instructions, which is a regression compared to Scorpion.
An alternate explanation is that the Snapdragon reduced its frequency in the MAC tests for some unknown reason. Without accurate frequency information, it is currently difficult to make that judgment.
In the big picture, readers may want to know how the floating point capabilities of these cores compare to x86 cores. Current ARM cores are noticeably behind in this case. Apart from the usual reasons (power and area constraints, very client-focused designs), current ARM cores also particularly lag behind here because NEON currently does not have vector instructions for fp64. I end on a cautionary note: the tests here are synthetic tests that only stress the FP units.
Floating point ALU peaks are only a part of a microarchitecture. Performance of real-world applications will depend upon the rest of the microarchitecture: the cache hierarchy, out-of-order execution capabilities and so on. We will continue to make further investigations into these CPUs to understand them better.