Skip to content

Use less stack for HttpResponseHeaders.CopyToFast #7724

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 26, 2019

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Feb 19, 2019

The Jit seems unhappy to optimize this method due to the number of locals and tmps this method creates (585 locals/tmps; including all the new ReadOnlySpan<byte>s).

This change slims the method by 9638 bytes of asm from 13112 bytes to 3474 bytes; and reduces the stack clearing from 416 bytes to 8 bytes,

  Method |                       Type |     Mean |         Op/s |
 ------- |--------------------------- |---------:|-------------:|
- Output |                 LiveAspNet | 587.5 ns |  1,702,161.8 |
+ Output |                 LiveAspNet | 462.2 ns |  2,163,172.9 |
- Output |           PlaintextChunked | 139.8 ns |  7,151,853.6 |
+ Output |           PlaintextChunked |  84.3 ns | 11,850,576.2 |
- Output | PlaintextChunkedWithCookie | 276.3 ns |  3,618,669.2 |
+ Output | PlaintextChunkedWithCookie | 179.5 ns |  5,570,065.5 |
- Output |        PlaintextWithCookie | 289.5 ns |  3,454,397.7 |
+ Output |        PlaintextWithCookie | 169.2 ns |  5,908,899.3 |
- Output |       TechEmpowerPlaintext | 175.4 ns |  5,700,681.7 |
+ Output |       TechEmpowerPlaintext |  86.1 ns | 11,607,037.7 |

It does this by using a single output point for the headers (other than the unusual ones Raw and ContentLength) where a single ReadOnlySpan is created. To achieve this it is a little convoluted using goto and switch...

Before

; Lcl frame size = 1704

G_M45189_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4881ECA8060000       sub      rsp, 0x6A8
       C5F877               vzeroupper 
       488BF1               mov      rsi, rcx
       488D7C2428           lea      rdi, [rsp+28H]
       B9A0010000           mov      ecx, 416           ; zero 416 bytes of stack
       33C0                 xor      rax, rax
       F3AB                 rep stosd 
       
       ; ...

; Total bytes of code 13112, prolog size 42 for method HttpResponseHeaders:CopyToFast(byref):this
; Lcl frame size = 88

G_M44993_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC58             sub      rsp, 88
       488BF1               mov      rsi, rcx
       488D7C2430           lea      rdi, [rsp+30H]
       B908000000           mov      ecx, 8          ; Zero 8 bytes of stack
       33C0                 xor      rax, rax
       F3AB                 rep stosd 
       
       ; ...
       
; Total bytes of code 3474, prolog size 42 for method HttpResponseHeaders:CopyToFast(byref):this

/cc @AndyAyersMS

Copy link
Contributor

@jkotalik jkotalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change the ordering of headers for HTTP responses. Seems like a fairly decent breaking change, especially in tests that verify response content in customer applications.

@Tratcher
Copy link
Member

Tratcher commented Feb 19, 2019

@jkotalik header ordering is not meaningful, especially if you avoid re-ordering multiple instances of the same header. It is quite a bit of test churn.

@AndyAyersMS
Copy link
Member

Keeping the number of jit temps under 512 is advisable in perf-sensitive code.

512 is the maximum number of things the jit will model for liveness (which is a "dense" data flow problem that requires using bitvectors, so lots of jit time & memory).

See dotnet/coreclr#13280 for some context.

When you go beyond this number, the jit marks 512 temps as "tracked" temps and the remainder as "untracked" temps. It chooses the set of tracked temps greedily, using block-weighed aggregate appearances as a priority function.

Untracked GC ref temps prolog zeroing. In a method like this where average path length is fairly small (even though method itself is quite large) the prolog costs can predominate. There are a bunch of potential mitigation strategies in the JIT for this, but nothing that is being actively worked on right now.

I wonder though -- if the JIT sees we're in the untracked regime, perhaps it should somewhat prioritize tracking GC refs over non-GC temps... hmm (alternatively: consider an untracked GC ref incurs an implicit extra "prolog" weight from the zeroing that we don't account for -- perhaps we should make that explicit). The goal should really be to minimize the overall post-sort cost, and we don't quite do that right now. Let me play around with this idea for a bit.

@benaadams
Copy link
Member Author

benaadams commented Feb 19, 2019

@jkotalik changed to maintain ContentLength's header position ; so no test churn

@AndyAyersMS
Copy link
Member

At least in the before example, better temp sorting by the JIT would not help reduce prolog cost. Despite there being 585 temps, all the ref type temps, and all but one of the byref type temps end up getting tracked. EG the T197 below is the tracking ID for the temp.

;  V581 tmp472      [V581,T197] (  2,  8   )   byref  ->  rcx         "argument with side effect"
...
;  V353 tmp244      [V353    ] (  4,  1.75)   byref  ->  [rsp+0x1D0]   must-init pinned "Inline stloc first use temp"

V353 is the only untracked scalar GC ref.

The bulk of of the prolog zeroing thus comes from GC refs within structs. There are 118 struct temps, all marked as must-init (the result of .init locals on the method), and this accounts for the large amount of zeroing. This can be removed by the current jit only if the struct is promoted, but it doesn't look like any of these are promoted., they are all marked "do not enregister." Probably worth understanding why that happens.

@benaadams
Copy link
Member Author

A fair boost on the ResponseHeadersWritingBenchmark

  Method |                       Type |     Mean |         Op/s |
 ------- |--------------------------- |---------:|-------------:|
- Output |                 LiveAspNet | 587.5 ns |  1,702,161.8 |
+ Output |                 LiveAspNet | 511.5 ns |  1,955,163.5 |
- Output |           PlaintextChunked | 139.8 ns |  7,151,853.6 |
+ Output |           PlaintextChunked |  86.4 ns | 11,574,316.8 |
- Output | PlaintextChunkedWithCookie | 276.3 ns |  3,618,669.2 |
+ Output | PlaintextChunkedWithCookie | 194.3 ns |  5,148,107.5 |
- Output |        PlaintextWithCookie | 289.5 ns |  3,454,397.7 |
+ Output |        PlaintextWithCookie | 181.2 ns |  5,520,292.8 |
- Output |       TechEmpowerPlaintext | 175.4 ns |  5,700,681.7 |
+ Output |       TechEmpowerPlaintext | 101.1 ns |  9,891,448.7 |

@AndyAyersMS
Copy link
Member

And as for why there are so many unpromoted structs in the before version -- perhaps no great mystery: we run into the too many temps throttling in the inliner, after 328 successful inlines.

Similar to the kinds of things I'm tracking in dotnet/coreclr#22240.

  [328 IL=1167 TR=003047 06000071] [profitable inline] StringValues:get_Count():int:this
  [0 IL=1213 TR=003105 06001527] [FAILED: too many locals] ReadOnlySpan`1:.ctor(ref,int,int):this
  [0 IL=1218 TR=003111 06000017] [FAILED: too many locals] BufferWriter`1:Write(struct):this
  [0 IL=1226 TR=003120 060006E2] [FAILED: too many locals] PipelineExtensions:WriteAsciiNoValidation(byref,ref)
  [0 IL=1278 TR=002935 06000071] [FAILED: too many locals] StringValues:get_Count():int:this
  ... and many more ....
  [0 IL=4338 TR=000426 06001527] [FAILED: too many locals] ReadOnlySpan`1:.ctor(ref,int,int):this
  [0 IL=4343 TR=000432 06000017] [FAILED: too many locals] BufferWriter`1:Write(struct):this
  [0 IL=4351 TR=000441 060006E2] [FAILED: too many locals] PipelineExtensions:WriteAsciiNoValidation(byref,ref)

Probably worth running TechEmpower scenarios with JIT inline ETL enabled to see if this is a common failure reason, and if so, where we might want to reduce method complexity.

Am going to temporarily remove this check and let the inliner run free, just to see what happens:

https://github.com/dotnet/coreclr/blob/e2081d0e67a1d7fedaf6303f576acef316c7bd66/src/jit/compiler.hpp#L1531-L1535

@benaadams
Copy link
Member Author

Plaintext inlines https://gist.github.com/benaadams/1e08f4fe89b3a07a64a915d388a63336

"FAILED: too many locals" seems limited to this method

@AndyAyersMS
Copy link
Member

Inliner can handle it, but not the rest of the jit. Way more temps, lots of untracked, etc.

Top method regressions by size (bytes):
       17684 (134.87% of base) : Microsoft.AspNetCore.Server.Kestrel.Core.dasm - HttpResponseHeaders:CopyToFast(byref):this

;  V1249 tmp1140    [V1249,T506] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1250 tmp1141    [V1250,T507] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1251 tmp1142    [V1251,T508] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1252 tmp1143    [V1252,T509] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1253 cse0       [V1253,T01] (207,364.50)   byref  ->  [rsp+0x28]   "ValNumCSE"
;
; Lcl frame size = 7024

G_M45944_IG01:
       push     r15
       push     r14
       push     r12
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       test     qword ptr [rsp-1000H], rax
       sub      rsp, 0x1B70
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+30H]
       mov      ecx, 0x6D0

That looks like the only method in the core assembly that hits this limit.

Simplifying it is the right thing to do.

We may get the jit to the point where it can handle things like the before case better someday, but that day is a ways off.

@benaadams
Copy link
Member Author

Thanks for checking

@benaadams
Copy link
Member Author

Cleaned up the other switch jump tables to use named labels rather than just numbers so its easier to compare the label name to what its outputting

@davidfowl davidfowl dismissed jkotalik’s stale review February 20, 2019 05:09

Header order doesn't matter

public string TestNotBit() => $"(_bits & {1L << Index}L) == 0";
public string SetBit() => $"_bits |= {1L << Index}L";
public string ClearBit() => $"_bits &= ~{1L << Index}L";
public string TestBit() => $"(_bits & {"0x" + (1L << Index).ToString("x")}L) != 0";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@benaadams
Copy link
Member Author

Changed to loop; perf is very nice (also has ROS opt etc, so isn't like for like on before loop)

 Method |                       Type |      Mean |         Op/s |
------- |--------------------------- |----------:|-------------:|
 Output |                 LiveAspNet | 462.28 ns |  2,163,172.9 |
 Output |           PlaintextChunked |  84.38 ns | 11,850,576.2 |
 Output | PlaintextChunkedWithCookie | 179.53 ns |  5,570,065.5 |
 Output |        PlaintextWithCookie | 169.24 ns |  5,908,899.3 |
 Output |       TechEmpowerPlaintext |  86.15 ns | 11,607,037.7 |

@benaadams
Copy link
Member Author

AspNetCore-ci failed due to weird (Code_check) issues; raised issue for it #7839

@pakrym
Copy link
Contributor

pakrym commented Feb 22, 2019

Any changes in public APIs? Rebase and run .\eng\scripts\GenerateReferenceAssemblies.ps1

@benaadams
Copy link
Member Author

Any changes in public APIs?

Don't think so

Rebase and run...

Will do

@benaadams
Copy link
Member Author

CI issues #7839 and #7867

@halter73 halter73 merged commit 423de42 into dotnet:master Feb 26, 2019
@halter73
Copy link
Member

Thanks!

@benaadams benaadams deleted the Header-stack-space branch February 26, 2019 01:42
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants