Why is there a large performance impact when looping over an array with 240 or more elements?


When running a summing loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240; with CAPACITY = 239 it is about 80 times faster.



Is there a special compilation optimization that Rust is doing for "short" arrays?



Compiled with rustc -C opt-level=3.
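
(For a Cargo project, the release profile uses opt-level = 3 by default, so cargo run --release should reproduce this, assuming default profile settings.)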



use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum: {} time: {:?}", sum, now.elapsed());
}










arrays performance rust llvm-codegen

asked 2 days ago by Guy Korland, edited 1 hour ago by RonJohn

  • github.com/gkorland/benchmark-rust

    – Guy Korland
    2 days ago






  • Maybe with 240 you are overflowing a CPU cache line? If that is the case, your results would be very CPU specific.

    – rodrigo
    2 days ago







  • Reproduced here. Now I'm guessing that it has something to do with loop unrolling.

    – rodrigo
    2 days ago

















2 Answers

TL:DR: Below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.




You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, so the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization -- performed only for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}



This simple code will produce roughly the assembly one would expect: a loop adding up the elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I will only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management instructions", i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back.



In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.



Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.



LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.
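

To make the multiple-accumulator pattern concrete in source form, here is a minimal hand-written sketch (my own illustration, not the compiler's output -- the real transformation happens on LLVM IR): the two running sums are independent of each other, so the CPU can execute both additions in the same cycle.

pub fn sum_two_accumulators(arr: &[usize]) -> usize {
    let mut s0: usize = 0;
    let mut s1: usize = 0;
    for chunk in arr.chunks_exact(2) {
        s0 += chunk[0]; // independent of the s1 addition below,
        s1 += chunk[1]; // so both adds can execute in parallel
    }
    // add a possible leftover element, then do the final "horizontal" reduction
    s0 + s1 + arr.chunks_exact(2).remainder().iter().sum::<usize>()
}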




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts one loop inside another:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite a bit of code just for initialization, which we will ignore.) For 239, however, it looks very different! We see that the initialization loop and the inner loop got unrolled: so far, so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I added comments to explain the assembly; the comments marked with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
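

In Rust terms, the rewritten program behaves roughly like this hand-written sketch (my own illustration, not literal compiler output). Note that LLVM still loops IN_LOOPS times doing one add per iteration; it does not strength-reduce further to a single multiplication:

pub fn foo_as_rewritten(arr: &[usize; 239]) -> usize {
    // the (unrolled, vectorized) inner loop now runs exactly once ...
    let inner: usize = arr.iter().sum();

    // ... and the outer loop degenerates to one add per iteration
    let mut sum = 0;
    for _ in 0..500000 {
        sum += inner;
    }
    sum
}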




An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds, as without them LLVM's optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial: in practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So there is no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array, and changing this in the second example does not lead to any notable difference in the emitted assembly. You should use the short and idiomatic version unless you have measured that it hurts performance.
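

For reference, here is the inner loop of the second example rewritten with the iterator (the surrounding benchmark is unchanged):

for _ in 0..IN_LOOPS {
    let s: usize = arr.iter().sum();
    sum += s;
}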






answered 2 days ago by Lukas Kalbertodt, edited yesterday by Peter Cordes

  • @lukas-kalbertodt thanks for the great answer! Now I also understand why the original code, which updated sum directly and not a local s, was running much slower: for i in 0..arr.len() { sum += arr[i]; }

    – Guy Korland
    2 days ago






  • @LukasKalbertodt Something else is going on in LLVM; turning on AVX2 shouldn't make that big of a difference. Repro'd in Rust too.

    – Mgetz
    2 days ago






  • @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    2 days ago






  • Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

    – Uncreative Name
    yesterday






  • @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

    – Peter Cordes
    yesterday



















In addition to Lukas' answer, if you want to use an iterator, try this:



const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize {
    (0..CAPACITY).sum::<usize>() * IN_LOOPS
}



Thanks @Chris Morgan for the suggestion about the range pattern.



The optimized assembly is quite good:



example::bar:
movabs rax, 14340000000
ret
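

(Indeed, 0 + 1 + ... + 239 = 239 * 240 / 2 = 28680, and 28680 * 500000 = 14340000000, so LLVM has constant-folded the entire computation into a single immediate.)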





answered yesterday by mja, edited yesterday

  • Or better still, (0..CAPACITY).sum::<usize>() * IN_LOOPS, which yields the same result.

    – Chris Morgan
    yesterday






  • I would actually explain that the assembly is not actually doing the calculation, but LLVM has precomputed the answer in this case.

    – Josep
    yesterday











  • I’m kind of surprised that rustc is missing the opportunity to do this strength-reduction. In this specific context, though, this appears to be a timing loop, and you deliberately want it not to be optimized out. The whole point is to repeat the computation that number of times from scratch and divide by the number of repetitions. In C, the (unofficial) idiom for that is to declare the loop counter as volatile, e.g. the BogoMIPS counter in the Linux kernel. Is there a way to achieve this in Rust? There might be, but I don’t know it. Calling an external fn might help.

    – Davislor
    11 hours ago
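

Regarding the last comment: a minimal sketch of one way to get that effect in Rust, using std::hint::black_box. This is an assumption on my part -- black_box was stabilized only in later Rust versions; the nightly-only test::black_box or a call through an #[inline(never)] external function are older workarounds.

use std::hint::black_box;
use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let arr: Vec<usize> = (0..CAPACITY).collect();
    let now = Instant::now();
    let mut sum: usize = 0;
    for _ in 0..IN_LOOPS {
        // black_box makes the slice opaque to the optimizer, so the inner
        // sum can neither be hoisted out of the repeat loop nor constant-folded
        sum += black_box(&arr[..]).iter().sum::<usize>();
    }
    println!("sum: {} time: {:?}", black_box(sum), now.elapsed());
}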














Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f57458460%2fwhy-is-there-a-large-performance-impact-when-looping-over-an-array-with-240-or-m%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









233














TL:DR: Below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.




You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization -- only performed for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize 
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

s



This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. the increment of loop variable, check if loop ended and the jump.



In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.



Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.



LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY
arr[i] = i;


let mut sum = 0;
for _ in 0..IN_LOOPS
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

sum += s;


sum



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly, the comments with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.




An additional note: can you do anything about this? Not really LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to a any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.






share|improve this answer



























  • @lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

    – Guy Korland
    2 days ago






  • 2





    @LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

    – Mgetz
    2 days ago






  • 3





    @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    2 days ago






  • 4





    Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

    – Uncreative Name
    yesterday






  • 2





    @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

    – Peter Cordes
    yesterday
















233














TL:DR: Below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.




You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization -- only performed for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize 
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

s



This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. the increment of loop variable, check if loop ended and the jump.



In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.



Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.



LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY
arr[i] = i;


let mut sum = 0;
for _ in 0..IN_LOOPS
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

sum += s;


sum



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly, the comments with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.




An additional note: can you do anything about this? Not really LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to a any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.






share|improve this answer



























  • @lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

    – Guy Korland
    2 days ago






  • 2





    @LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

    – Mgetz
    2 days ago






  • 3





    @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    2 days ago






  • 4





    Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

    – Uncreative Name
    yesterday






  • 2





    @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

    – Peter Cordes
    yesterday














233












233








233







TL:DR: Below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.




You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization -- only performed for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize 
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

s



This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. the increment of loop variable, check if loop ended and the jump.



In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.



Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.



LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY
arr[i] = i;


let mut sum = 0;
for _ in 0..IN_LOOPS
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

sum += s;


sum



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly, the comments with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.




An additional note: can you do anything about this? Not really LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to a any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.






share|improve this answer















TL:DR: Below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.




You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization -- only performed for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize 
let arr = [0; 240];
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

s



This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. the increment of loop variable, check if loop ended and the jump.



In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.



Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.



LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize
let mut arr = [0; CAPACITY];
for i in 0..CAPACITY
arr[i] = i;


let mut sum = 0;
for _ in 0..IN_LOOPS
let mut s = 0;
for i in 0..arr.len()
s += arr[i];

sum += s;


sum



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly, the comments with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.




An additional note: can you do anything about this? Not really LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to a any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.







share|improve this answer














share|improve this answer



share|improve this answer








edited yesterday









Peter Cordes

154k21 gold badges244 silver badges393 bronze badges




154k21 gold badges244 silver badges393 bronze badges










answered 2 days ago









Lukas KalbertodtLukas Kalbertodt

30.2k5 gold badges78 silver badges141 bronze badges




30.2k5 gold badges78 silver badges141 bronze badges















  • @lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

    – Guy Korland
    2 days ago






  • 2





    @LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

    – Mgetz
    2 days ago






  • 3





    @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    2 days ago






  • 4





    Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

    – Uncreative Name
    yesterday






  • 2





    @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

    – Peter Cordes
    yesterday


















  • @lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

    – Guy Korland
    2 days ago






  • 2





    @LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

    – Mgetz
    2 days ago






  • 3





    @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    2 days ago






  • 4





    Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

    – Uncreative Name
    yesterday






  • 2





    @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

    – Peter Cordes
    yesterday

















@lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

– Guy Korland
2 days ago





@lukas-kalbertodt thanks for the great answer! now I also understand why original code that updated sum directly on not a local s was running much slower. for i in 0..arr.len() sum += arr[i];

– Guy Korland
2 days ago




2




2





@LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

– Mgetz
2 days ago





@LukasKalbertodt Something else is going on in LLVM turning on AVX2 shouldn't make that big of a difference. Repro'd in rust too

– Mgetz
2 days ago




3




3





@Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

– Lukas Kalbertodt
2 days ago





@Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

– Lukas Kalbertodt
2 days ago




4




4





Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

– Uncreative Name
yesterday





Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of Instant preventing that?

– Uncreative Name
yesterday




2




2





@JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

– Peter Cordes
yesterday






@JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe.

– Peter Cordes
yesterday














15














In addition to Lukas' answer, if you want to use an iterator, try this:



const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize
(0..CAPACITY).sum::<usize>() * IN_LOOPS



Thanks @Chris Morgan for the suggestion about range pattern.



The optimized assembly is quite good:



example::bar:
movabs rax, 14340000000
ret





share|improve this answer






















  • 3





    Or better still, (0..CAPACITY).sum::<usize>() * IN_LOOPS, which yields the same result.

    – Chris Morgan
    yesterday






  • 6





    I would actually explain that the assembly is not actually doing the calculation, but LLVM has precomputed the answer in this case.

    – Josep
    yesterday











  • I’m kind of surprised that rustc is missing the opportunity to do this strength-reduction. In this specific context, though, this appears to be a timing loop, and you deliberately want it not to be optimized out. The whole point is to repeat the computation that number of times from scratch and divide by the number of repetitions. In C, the (unofficial) idiom for that is to declare the loop counter as volatile, e.g. the BogoMIPS counter in the Linux kernel. Is there a way to achieve this in Rust? There might be, but I don’t know it. Calling an external fn might help.

    – Davislor
    11 hours ago
















15














In addition to Lukas' answer, if you want to use an iterator, try this:



const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize
(0..CAPACITY).sum::<usize>() * IN_LOOPS



Thanks @Chris Morgan for the suggestion about range pattern.



The optimized assembly is quite good:



example::bar:
movabs rax, 14340000000
ret





share|improve this answer






















  • 3





    Or better still, (0..CAPACITY).sum::<usize>() * IN_LOOPS, which yields the same result.

    – Chris Morgan
    yesterday






  • 6





    I would actually explain that the assembly is not actually doing the calculation, but LLVM has precomputed the answer in this case.

    – Josep
    yesterday











  • I’m kind of surprised that rustc is missing the opportunity to do this strength-reduction. In this specific context, though, this appears to be a timing loop, and you deliberately want it not to be optimized out. The whole point is to repeat the computation that number of times from scratch and divide by the number of repetitions. In C, the (unofficial) idiom for that is to declare the loop counter as volatile, e.g. the BogoMIPS counter in the Linux kernel. Is there a way to achieve this in Rust? There might be, but I don’t know it. Calling an external fn might help.

    – Davislor
    11 hours ago














15












15








15







In addition to Lukas' answer, if you want to use an iterator, try this:



const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize
(0..CAPACITY).sum::<usize>() * IN_LOOPS



Thanks @Chris Morgan for the suggestion about range pattern.



The optimized assembly is quite good:



example::bar:
movabs rax, 14340000000
ret





share|improve this answer















In addition to Lukas' answer, if you want to use an iterator, try this:



const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize
(0..CAPACITY).sum::<usize>() * IN_LOOPS



Thanks @Chris Morgan for the suggestion about range pattern.



The optimized assembly is quite good:



example::bar:
movabs rax, 14340000000
ret






share|improve this answer














share|improve this answer



share|improve this answer








edited yesterday

























answered yesterday









mjamja

48410 silver badges19 bronze badges




48410 silver badges19 bronze badges










  • 3





    Or better still, (0..CAPACITY).sum::<usize>() * IN_LOOPS, which yields the same result.

    – Chris Morgan
    yesterday






  • 6





    I would actually explain that the assembly is not actually doing the calculation, but LLVM has precomputed the answer in this case.

    – Josep
    yesterday











  • I’m kind of surprised that rustc is missing the opportunity to do this strength-reduction. In this specific context, though, this appears to be a timing loop, and you deliberately want it not to be optimized out. The whole point is to repeat the computation that number of times from scratch and divide by the number of repetitions. In C, the (unofficial) idiom for that is to declare the loop counter as volatile, e.g. the BogoMIPS counter in the Linux kernel. Is there a way to achieve this in Rust? There might be, but I don’t know it. Calling an external fn might help.

    – Davislor
    11 hours ago
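To Davislor's question about suppressing this optimization in Rust: the usual tool is an optimizer barrier rather than a volatile counter. A minimal sketch, assuming a toolchain where std::hint::black_box is available (when this thread was written it existed only as the unstable test::black_box; criterion::black_box was the common stable substitute):

use std::hint::black_box;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bench() -> usize {
    let arr = [1usize; CAPACITY];
    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        // black_box is an identity function that the optimizer must treat
        // as opaque, so the inner sum is recomputed on every pass instead
        // of being folded into a compile-time constant.
        for x in black_box(&arr) {
            sum += *x;
        }
    }
    black_box(sum)
}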












