Skip to content

bumpalo-based arenas#2

Open
fitzgen wants to merge 45 commits intomainfrom
bumpalo-based-arenas
Open

bumpalo-based arenas#2
fitzgen wants to merge 45 commits intomainfrom
bumpalo-based-arenas

Conversation

@fitzgen
Copy link
Copy Markdown
Member

@fitzgen fitzgen commented Sep 27, 2022

Alternative to #1

If we ever need them we can add them, but this is just going to make things go a
little faster here.
@fitzgen fitzgen requested a review from cfallin September 27, 2022 22:37
@Amanieu
Copy link
Copy Markdown

Amanieu commented Sep 29, 2022

It turns out my benchmarks from #1 were wrong, I had accidentally left debug assertions on in release mode (I was debugging another issue and forgot to turn them back off afterwards). The std BTreeMap wasn't affected. Here are a new set of benchmark results with debug assertions disabled.

std BTreeMap:

          2,139.54 msec task-clock                #    1.000 CPUs utilized          
                 4      context-switches          #    1.870 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            35,008      page-faults               #   16.362 K/sec                  
     5,642,728,968      cycles                    #    2.637 GHz                    
    11,399,984,081      instructions              #    2.02  insn per cycle         
     1,716,455,724      branches                  #  802.254 M/sec                  
        38,997,435      branch-misses             #    2.27% of all branches        

arena BTreeMap #1

          2,289.05 msec task-clock                #    1.000 CPUs utilized          
                 5      context-switches          #    2.184 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            38,999      page-faults               #   17.037 K/sec                  
     5,986,327,420      cycles                    #    2.615 GHz                    
    11,958,129,918      instructions              #    2.00  insn per cycle         
     1,752,578,062      branches                  #  765.636 M/sec                  
        39,874,075      branch-misses             #    2.28% of all branches        

arena BTreeMap #2

          2,166.25 msec task-clock                #    1.000 CPUs utilized          
                 4      context-switches          #    1.847 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            34,786      page-faults               #   16.058 K/sec                  
     5,709,502,207      cycles                    #    2.636 GHz                    
    11,388,274,391      instructions              #    1.99  insn per cycle         
     1,717,719,537      branches                  #  792.946 M/sec                  
        39,164,113      branch-misses             #    2.28% of all branches        

So it seems that (at least in my case) #2 has no performance difference compared to the std BTreeMap while #1 is about 5% slower (rather than 20% as previously reported).

@cfallin
Copy link
Copy Markdown
Member

cfallin commented Sep 29, 2022

Interesting to note from your numbers:

Option 1, with u32 indices and a Vec of nodes:

        38,999      page-faults               #   17.037 K/sec 

Option 2, with borrows (i.e., real 64-bit pointers) and a bumpalo arena:

        34,786      page-faults               #   16.058 K/sec       

I've found the page-faults metric from perf to be a reasonable proxy for total memory allocations in a heap-grows-monotonically-ish batch program (like a compiler); and the difference here suggest to me that the approach using a Vec is consuming more memory, leading to more faulted-in zeroed pages and more cache-misses as well.

I'm curious if the Vec growth is to blame, and what would happen if we did a Vec::with_capacity at the beginning with a guess for the eventual node count?

@Amanieu
Copy link
Copy Markdown

Amanieu commented Sep 29, 2022

What is strange is that in regalloc2 (bytecodealliance/regalloc2#88), #1 is much faster than both std and #2 (which seem to have similar performance bytecodealliance/regalloc2#92). Perhaps regalloc2 is much more cache-bound than my benchmark and the u32 indices are helping a lot?

@Amanieu
Copy link
Copy Markdown

Amanieu commented Sep 29, 2022

Using Vec::with_capacity does seem to reduce the page faults. No impact on performance though.

          2,142.89 msec task-clock                #    1.000 CPUs utilized          
                 3      context-switches          #    1.400 /sec                   
                 1      cpu-migrations            #    0.467 /sec                   
            33,161      page-faults               #   15.475 K/sec                  
     6,003,358,962      cycles                    #    2.802 GHz                    
    11,936,071,446      instructions              #    1.99  insn per cycle         
     1,748,444,800      branches                  #  815.928 M/sec                  
        39,567,381      branch-misses             #    2.26% of all branches        

@fitzgen fitzgen force-pushed the bumpalo-based-arenas branch from 25350d9 to 1839826 Compare October 14, 2022 16:11
@fitzgen fitzgen force-pushed the bumpalo-based-arenas branch from 1839826 to 0ae687f Compare October 14, 2022 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants