add dummy fields to trees to test cache locality #3133

Closed
wants to merge 3 commits into from

Conversation

liufengyun
Contributor

add dummy fields to trees to test cache locality

@liufengyun
Contributor Author

test performance please

@dottybot
Member

performance test scheduled: 1 job(s) in queue, 0 running.

@dottybot
Member

Performance test finished successfully:

Visit http://dotty-bench.epfl.ch/3133 to see the changes.

Benchmarks is based on merge(s) with master

@odersky
Contributor

odersky commented Sep 17, 2017

Very very strange. Can you try to have half of the fields before myType and the rest after? Also, maybe change RefinedPrinter to print these fields under -verbose so that we make sure the JVM does allocate them.

@smarter
Member

smarter commented Sep 17, 2017

FWIW, you can use http://openjdk.java.net/projects/code-tools/jol/ to check the layout of objects on the JVM.

@liufengyun
Contributor Author

test performance please

@dottybot
Member

performance test scheduled: 1 job(s) in queue, 0 running.

@smarter
Member

smarter commented Sep 17, 2017

Here's how to use jol:

$ wget http://central.maven.org/maven2/org/openjdk/jol/jol-cli/0.8/jol-cli-0.8-full.jar
$ java -jar jol-cli-0.8-full.jar internals -cp $HOME/.ivy2/cache/org.scala-sbt/interface/jars/interface-0.13.15.jar:$HOME/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.12.3.jar:interfaces/target/dotty-interfaces-0.4.0-bin-SNAPSHOT.jar:out/bootstrap/dotty-library-bootstrapped/scala-0.4/classes:out/bootstrap/dotty-compiler-bootstrapped/scala-0.4/classes 'dotty.tools.dotc.ast.Trees$Ident'
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via public dotty.tools.dotc.ast.Trees$Ident(dotty.tools.dotc.core.Names$Name)

dotty.tools.dotc.ast.Trees$Ident object internals:
 OFFSET  SIZE                                    TYPE DESCRIPTION                               VALUE
      0     4                                         (object header)                           01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4                                         (object header)                           00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4                                         (object header)                           74 a0 01 f8 (01110100 10100000 00000001 11111000) (-134111116)
     12     4                                         (alignment/padding gap)                  
     16     8                                    long Positioned.curPos                         -4503599627370495
     24     4                                     int Tree.myUniqueId                           3
     28     4   dotty.tools.dotc.util.Attachment.Link Tree.next                                 null
     32     4                        java.lang.Object Tree.myTpe                                null
     36     4        dotty.tools.dotc.core.Names.Name Ident.name                                null
Instance size: 40 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

@liufengyun
Contributor Author

The output:

# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# WARNING | Compressed references base/shifts are guessed by the experiment!
# WARNING | Therefore, computed addresses are just guesses, and ARE NOT RELIABLE.
# WARNING | Make sure to attach Serviceability Agent to get the reliable addresses.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via public dotty.tools.dotc.ast.Trees$Ident(dotty.tools.dotc.core.Names$Name)

dotty.tools.dotc.ast.Trees$Ident object internals:
 OFFSET  SIZE                                    TYPE DESCRIPTION                               VALUE
      0     4                                         (object header)                           01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4                                         (object header)                           00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4                                         (object header)                           da 0c 02 f8 (11011010 00001100 00000010 11111000) (-134083366)
     12     4                                         (alignment/padding gap)
     16     8                                    long Positioned.curPos                         -4503599627370495
     24     4                                     int Tree.myUniqueId                           3
     28     4                                     int Tree.x1                                   0
     32     4                                     int Tree.x2                                   0
     36     4                                     int Tree.x3                                   0
     40     4                                     int Tree.x4                                   0
     44     4                                     int Tree.x5                                   0
     48     4                                     int Tree.x6                                   0
     52     4                                     int Tree.x7                                   0
     56     4                                     int Tree.x8                                   0
     60     4                                     int Tree.x9                                   0
     64     4                                     int Tree.x10                                  0
     68     4                                     int Tree.x11                                  0
     72     4                                     int Tree.x12                                  0
     76     4                                     int Tree.x13                                  0
     80     4                                     int Tree.x14                                  0
     84     4                                     int Tree.x15                                  0
     88     4                                     int Tree.x16                                  0
     92     4   dotty.tools.dotc.util.Attachment.Link Tree.next                                 null
     96     4                        java.lang.Object Tree.myTpe                                null
    100     4        dotty.tools.dotc.core.Names.Name Ident.name                                null
Instance size: 104 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
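
For anyone cross-checking the two layouts, here is a minimal sketch of the size arithmetic. The field sizes, the 16-byte start of the fields, and the 8-byte alignment are read off the jol dumps in this thread. The 64-byte cache line is an assumption (jol does not report it), and the helper names are made up for illustration. The model only sums sizes and rounds up, so it ignores any field reordering the JVM might do.

```python
# Values are read off the two jol dumps; CACHE_LINE is an assumption.
ALIGNMENT = 8      # jol: "Objects are 8 bytes aligned"
FIELDS_START = 16  # 12-byte header + 4-byte padding gap before the long
CACHE_LINE = 64    # assumed line size, not reported anywhere in this thread


def instance_size(field_sizes):
    """Add field sizes to the header and round up to the object alignment."""
    end = FIELDS_START + sum(field_sizes)
    return -(-end // ALIGNMENT) * ALIGNMENT


def min_cache_lines(size):
    """Fewest cache lines an object of this size can occupy."""
    return -(-size // CACHE_LINE)


# long curPos, int myUniqueId, then next / myTpe / name as 4-byte compressed oops
baseline = [8, 4, 4, 4, 4]
# the same fields with the 16 dummy ints x1..x16 inserted after myUniqueId
padded = [8, 4] + [4] * 16 + [4, 4, 4]

baseline_size = instance_size(baseline)
padded_size = instance_size(padded)

print(baseline_size, min_cache_lines(baseline_size))
print(padded_size, min_cache_lines(padded_size))
```

The sketch reproduces the 40-byte and 104-byte instance sizes that jol reports. Under the 64-byte assumption, a padded Ident can no longer fit in a single cache line.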

@dottybot
Member

Performance test finished successfully:

Visit http://dotty-bench.epfl.ch/3133 to see the changes.

Benchmarks is based on merge(s) with master

@liufengyun
Contributor Author

test performance please

@dottybot
Member

performance test scheduled: 1 job(s) in queue, 0 running.

@dottybot
Member

Performance test finished successfully:

Visit http://dotty-bench.epfl.ch/3133/ to see the changes.

Benchmarks is based on merge(s) with master

@liufengyun
Contributor Author

The perf statistics show that the change indeed doesn't affect the cache miss rate much.

With padding

      75264.597199      task-clock (msec)         #    1.208 CPUs utilized            (100.00%)
              4598      context-switches          #    0.061 K/sec                    (100.00%)
               154      cpu-migrations            #    0.002 K/sec                    (100.00%)
              1146      page-faults               #    0.015 K/sec
      164457779248      cycles                    #    2.185 GHz                      (37.67%)
       99589660335      stalled-cycles-frontend   #   60.56% frontend cycles idle     (49.84%)
   <not supported>      stalled-cycles-backend
      156095467211      instructions              #    0.95  insns per cycle
                                                  #    0.64  stalled cycles per insn  (62.36%)
       28666269927      branches                  #  380.873 M/sec                    (74.56%)
         663047179      branch-misses             #    2.31% of all branches          (74.13%)
       48448451430      L1-dcache-loads           #  643.708 M/sec                    (62.33%)
        2812906608      L1-dcache-load-misses     #    5.81% of all L1-dcache hits    (40.22%)
        1017558427      LLC-loads                 #   13.520 M/sec                    (25.14%)
   <not supported>      LLC-load-misses

      62.310249391 seconds time elapsed

Without padding

      34869.544850      task-clock (msec)         #    1.234 CPUs utilized            (100.00%)
              2210      context-switches          #    0.063 K/sec                    (100.00%)
                91      cpu-migrations            #    0.003 K/sec                    (100.00%)
               593      page-faults               #    0.017 K/sec
       76064701479      cycles                    #    2.181 GHz                      (37.87%)
       47049824316      stalled-cycles-frontend   #   61.86% frontend cycles idle     (49.55%)
   <not supported>      stalled-cycles-backend
       70420313759      instructions              #    0.93  insns per cycle
                                                  #    0.67  stalled cycles per insn  (61.96%)
       12947704456      branches                  #  371.318 M/sec                    (74.00%)
         313598975      branch-misses             #    2.42% of all branches          (73.88%)
       21664368951      L1-dcache-loads           #  621.298 M/sec                    (61.12%)
        1308236458      L1-dcache-load-misses     #    6.04% of all L1-dcache hits    (38.44%)
         489415127      LLC-loads                 #   14.036 M/sec                    (25.35%)
   <not supported>      LLC-load-misses

      28.258925592 seconds time elapsed
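
The derived rates in the two perf outputs above can be recomputed from the raw counters. This is a rough sketch: the dictionary keys and the `derived` helper are illustrative names, not part of perf. The percentages in parentheses in the perf output show that the counters were multiplexed, so every ratio is an estimate.

```python
# Raw counters copied from the two perf stat outputs above.
RUNS = {
    "with padding": {
        "cycles": 164457779248,
        "instructions": 156095467211,
        "branches": 28666269927,
        "branch_misses": 663047179,
        "l1d_loads": 48448451430,
        "l1d_load_misses": 2812906608,
    },
    "without padding": {
        "cycles": 76064701479,
        "instructions": 70420313759,
        "branches": 12947704456,
        "branch_misses": 313598975,
        "l1d_loads": 21664368951,
        "l1d_load_misses": 1308236458,
    },
}


def derived(c):
    """Turn raw counters into instructions per cycle and miss percentages."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "branch_miss_pct": 100 * c["branch_misses"] / c["branches"],
        "l1d_miss_pct": 100 * c["l1d_load_misses"] / c["l1d_loads"],
    }


METRICS = {name: derived(c) for name, c in RUNS.items()}

for name, m in METRICS.items():
    print(f"{name:16s} IPC {m['ipc']:.2f}  "
          f"branch miss {m['branch_miss_pct']:.2f}%  "
          f"L1d miss {m['l1d_miss_pct']:.2f}%")
```

The two runs retire very different instruction counts (about 156e9 versus 70e9), so only the ratios are comparable between them, not the absolute counts or the elapsed times.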

@liufengyun liufengyun closed this Sep 21, 2017
@liufengyun liufengyun deleted the cache-line branch September 21, 2017 11:54
@liufengyun
Contributor Author

The ~6% L1d cache miss rate corresponds to the bench values in @DarkDimius's Miniphase paper.

The branch miss rate is also low, 2.3-2.5% (no such data in the Miniphase paper).

The current bench machine doesn't support the last-level-cache miss counter, LLC-load-misses. I'll use another machine to check LLC-load-misses, which may explain why stalled-cycles-frontend is high.

@liufengyun
Contributor Author

liufengyun commented Sep 26, 2017

LLC-load-misses on another machine is about 20% (unfortunately, the testing hardware has no counter for L1-icache-misses).

Given that L1-dcache-load-misses is only 6.04%, the branch miss rate is only about 2%, and the L3 cache is much larger than the L1d/L1i caches, I think this partially confirms @DarkDimius's hypothesis: data competing for space in L2/L3 evicts instructions from L2/L3, which, because the caches are inclusive, also evicts them from L1i. That would explain the high stalled-cycles-frontend figure of about 60%.

Quoting the Miniphase paper (section 5.3):

We believe that this is explained by the fact that CPU caches are inclusive and eviction from last level cache would also trigger eviction from lower level caches. By improving the hit rate in data caches, Miniphases also indirectly reduce evictions from the L1-instruction cache.

     147824.838093      task-clock (msec)         #    0.524 CPUs utilized            (100.00%)
            15,312      context-switches          #    0.104 K/sec                    (100.00%)
               339      cpu-migrations            #    0.002 K/sec                    (100.00%)
             8,508      page-faults               #    0.058 K/sec
   558,864,101,786      cycles                    #    3.781 GHz                      (49.45%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   460,052,067,149      instructions              #    0.82  insns per cycle          (61.99%)
    84,910,140,048      branches                  #  574.397 M/sec                    (62.14%)
     1,782,151,177      branch-misses             #    2.10% of all branches          (62.01%)
   145,949,205,409      L1-dcache-loads           #  987.312 M/sec                    (56.10%)
     8,820,323,783      L1-dcache-load-misses     #    6.04% of all L1-dcache hits    (27.33%)
     2,880,226,509      LLC-loads                 #   19.484 M/sec                    (26.50%)
       584,228,389      LLC-load-misses           #   20.28% of all LL-cache hits     (37.07%)
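
To avoid copying counters by hand, the perf text can be parsed directly. The following is a minimal sketch, assuming the column layout shown above (a count, possibly with thousands separators, followed by the event name). `parse_counters` is a hypothetical helper, and the embedded text is an excerpt of the output above.

```python
import re

# Excerpt of the perf stat output above, including one unsupported event.
PERF_OUTPUT = """\
   <not supported>      stalled-cycles-frontend
   145,949,205,409      L1-dcache-loads           #  987.312 M/sec                    (56.10%)
     8,820,323,783      L1-dcache-load-misses     #    6.04% of all L1-dcache hits    (27.33%)
     2,880,226,509      LLC-loads                 #   19.484 M/sec                    (26.50%)
       584,228,389      LLC-load-misses           #   20.28% of all LL-cache hits     (37.07%)
"""

# A count (digits, optional commas, optional decimals), then the event name.
LINE = re.compile(r"\s*([\d,]+(?:\.\d+)?)\s+([\w-]+)")


def parse_counters(text):
    """Map event name to numeric value, skipping '<not supported>' lines."""
    counters = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counters[match.group(2)] = float(match.group(1).replace(",", ""))
    return counters


counters = parse_counters(PERF_OUTPUT)
llc_miss_pct = 100 * counters["LLC-load-misses"] / counters["LLC-loads"]
l1d_miss_pct = 100 * counters["L1-dcache-load-misses"] / counters["L1-dcache-loads"]

print(f"LLC miss {llc_miss_pct:.2f}%  L1d miss {l1d_miss_pct:.2f}%")
```

Lines that start with `<not supported>` do not match the pattern and are left out of the result, so a missing counter shows up as a missing key rather than a zero.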
