Why is there a large performance impact when looping over an array over 240 elements?


When running a sum loop over an array in Rust, I noticed a huge performance difference when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.



Is there a special compilation optimization that Rust is doing for "short" arrays?



Compiled with rustc -C opt-level=3.



use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum: {} time: {:?}", sum, now.elapsed());
}










Tags: arrays, performance, rust, llvm-codegen


























  • github.com/gkorland/benchmark-rust

    – Guy Korland
    13 hours ago






  • Maybe with 240 you are overflowing a CPU cache line? If that is the case, your results would be very CPU specific.

    – rodrigo
    12 hours ago







  • Reproduced here. Now I'm guessing that it has something to do with loop unrolling.

    – rodrigo
    12 hours ago











  • thanks @rodrigo

    – Guy Korland
    11 hours ago

















1 Answer














You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization -- only performed for length 239 -- is responsible for the huge speed difference. But let's start slowly:



(All code in this answer is compiled with -C opt-level=3)



pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}



This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. I only paste a small part of the assembly here:



movdqa xmm1, xmmword ptr [rsp + 32]
movdqa xmm0, xmmword ptr [rsp + 48]
paddq xmm1, xmmword ptr [rsp]
paddq xmm0, xmmword ptr [rsp + 16]
paddq xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq xmm0, xmmword ptr [rsp + 1840]
paddq xmm1, xmmword ptr [rsp + 1856]
paddq xmm0, xmmword ptr [rsp + 1872]
paddq xmm0, xmm1
pshufd xmm1, xmm0, 78
paddq xmm1, xmm0


This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management instructions", i.e. the increment of the loop variable, the check whether the loop has ended, and the jump.



(In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two SIMD registers (xmm0 and xmm1) are used in parallel so that the CPU's instruction-level parallelism can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are combined.)
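
In source terms, this unrolled and vectorized loop behaves roughly like the hand-written sketch below. This is only my own illustration of the idea (the real transformation happens on LLVM IR, not on Rust source, and the name foo_unrolled is made up); it uses two independent accumulators, mirroring the two xmm registers:

pub fn foo_unrolled(arr: &[usize]) -> usize {
    let mut s0 = 0;
    let mut s1 = 0;
    let mut i = 0;
    // Four elements per step instead of one, with two independent accumulators
    // so the additions can execute in parallel.
    while i + 4 <= arr.len() {
        s0 += arr[i] + arr[i + 1];
        s1 += arr[i + 2] + arr[i + 3];
        i += 4;
    }
    // Leftover elements that don't fill a full unrolled step.
    let mut s = s0 + s1;
    while i < arr.len() {
        s += arr[i];
        i += 1;
    }
    s
}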




But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the summing loop inside another one:



const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}



(On Godbolt Compiler Explorer)



The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.



The important difference is that for 239 LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!



First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I added comments to explain the assembly; the comments with * are especially important):



 ; at the start of the function, `rbx` was set to 0

movq rax, xmm1 ; result of SIMD summing up stored in `rax`
add rax, 711 ; add up missing terms from loop unrolling
mov ecx, 500000 ; * init loop variable outer loop
.LBB0_1:
add rbx, rax ; * rbx += rax
add rcx, -1 ; * decrement loop variable
jne .LBB0_1 ; * if loop variable != 0 jump to LBB0_1
mov rax, rbx ; move rbx (the sum) back to rax
; two unimportant instructions omitted
ret ; the return value is stored in `rax`


As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.



This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
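
To make the transformation concrete, here is a rough source-level sketch of what the optimized code is equivalent to (again my own illustration, not the compiler's actual output; the name foo_hoisted is made up). The expensive inner loop runs exactly once, and the outer loop degenerates into repeated additions of the precomputed result, just like the add rbx, rax loop in the assembly above:

pub fn foo_hoisted() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    // The inner sum is computed a single time ...
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }

    // ... and the outer loop only repeats the cheap addition IN_LOOPS times.
    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        sum += s;
    }
    sum
}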




An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds, because without them LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So there is no need to worry about real use cases.



As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. Changing this in the second example does not lead to any notable differences in the emitted assembly. You should use the short and idiomatic version unless you have measured that it hurts performance.
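
For illustration, the inner part of the benchmark rewritten with the iterator version (a small sketch following the code above):

let mut sum = 0;
for _ in 0..IN_LOOPS {
    // Same semantics as the indexed loop; compiles to essentially the same assembly.
    let s: usize = arr.iter().sum();
    sum += s;
}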

































  • @lukas-kalbertodt thanks for the great answer! Now I also understand why the original code that updated sum directly, and not a local s, was running much slower: for i in 0..arr.len() { sum += arr[i]; }

    – Guy Korland
    10 hours ago











  • @LukasKalbertodt Something else is going on in LLVM; turning on AVX2 shouldn't make that big of a difference. Repro'd in Rust too.

    – Mgetz
    10 hours ago






  • @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this.

    – Lukas Kalbertodt
    9 hours ago











  • @LukasKalbertodt I'm curious whether this whole thing is related to some issues the LLVM team has discussed in regards to unsigned iterators and UB cases in C/C++ that they've done talks on. But that's purely speculation.

    – Mgetz
    9 hours ago









