Use less stack for HttpResponseHeaders.CopyToFast #7724

benaadams · 2019-02-19T21:36:17Z

The JIT struggles to optimize this method because of the number of locals and temps it creates (585 locals/temps, including all the new ReadOnlySpan<byte>s).

This change slims the method by 9638 bytes of asm, from 13112 bytes to 3474 bytes, and reduces the stack clearing from 416 bytes to 8 bytes:

  Method |                       Type |     Mean |         Op/s |
 ------- |--------------------------- |---------:|-------------:|
- Output |                 LiveAspNet | 587.5 ns |  1,702,161.8 |
+ Output |                 LiveAspNet | 462.2 ns |  2,163,172.9 |
- Output |           PlaintextChunked | 139.8 ns |  7,151,853.6 |
+ Output |           PlaintextChunked |  84.3 ns | 11,850,576.2 |
- Output | PlaintextChunkedWithCookie | 276.3 ns |  3,618,669.2 |
+ Output | PlaintextChunkedWithCookie | 179.5 ns |  5,570,065.5 |
- Output |        PlaintextWithCookie | 289.5 ns |  3,454,397.7 |
+ Output |        PlaintextWithCookie | 169.2 ns |  5,908,899.3 |
- Output |       TechEmpowerPlaintext | 175.4 ns |  5,700,681.7 |
+ Output |       TechEmpowerPlaintext |  86.1 ns | 11,607,037.7 |

It does this by using a single output point for the headers (other than the unusual ones, Raw and ContentLength), where a single ReadOnlySpan is created. Achieving this is a little convoluted, using goto and switch...
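For readers unfamiliar with the pattern, here is a minimal sketch of a "single output point" driven by goto and switch. It is not the generated Kestrel code: the class name, the three header bits and the string-based output are all hypothetical, and the real method writes a ReadOnlySpan<byte> to a buffer writer rather than building strings. The point is only that each header sets a couple of shared locals and jumps to one shared write site, so the per-header temps are not duplicated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; names and bit values are assumptions.
public static class SingleOutputPointSketch
{
    private const long ConnectionBit = 0x1L;
    private const long DateBit = 0x2L;
    private const long ServerBit = 0x4L;

    public static List<string> CopyTo(long bits, string connection, string date, string server)
    {
        var output = new List<string>();
        string name = "";   // shared locals, filled in by whichever header is selected
        string value = "";
        int next = 0;       // which header to examine when control returns to the switch

    NextHeader:
        switch (next)
        {
            case 0:
                next = 1;
                if ((bits & ConnectionBit) != 0) { name = "Connection"; value = connection; goto OutputHeader; }
                goto case 1;
            case 1:
                next = 2;
                if ((bits & DateBit) != 0) { name = "Date"; value = date; goto OutputHeader; }
                goto case 2;
            case 2:
                next = 3;
                if ((bits & ServerBit) != 0) { name = "Server"; value = server; goto OutputHeader; }
                goto default;
            default:
                return output;
        }

    OutputHeader:
        // The only place a header is written, so only one set of output temps exists.
        output.Add(name + ": " + value);
        goto NextHeader;
    }
}
```

Calling `SingleOutputPointSketch.CopyTo(0x5, "keep-alive", "unused", "Kestrel")` visits the Connection and Server cases and skips Date, while every write still goes through the one `OutputHeader` label.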

Before

; Lcl frame size = 1704

G_M45189_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4881ECA8060000       sub      rsp, 0x6A8
       C5F877               vzeroupper 
       488BF1               mov      rsi, rcx
       488D7C2428           lea      rdi, [rsp+28H]
       B9A0010000           mov      ecx, 416           ; zero 416 bytes of stack
       33C0                 xor      rax, rax
       F3AB                 rep stosd 
       
       ; ...

; Total bytes of code 13112, prolog size 42 for method HttpResponseHeaders:CopyToFast(byref):this

After

; Lcl frame size = 88

G_M44993_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC58             sub      rsp, 88
       488BF1               mov      rsi, rcx
       488D7C2430           lea      rdi, [rsp+30H]
       B908000000           mov      ecx, 8          ; Zero 8 bytes of stack
       33C0                 xor      rax, rax
       F3AB                 rep stosd 
       
       ; ...
       
; Total bytes of code 3474, prolog size 42 for method HttpResponseHeaders:CopyToFast(byref):this

/cc @AndyAyersMS

jkotalik

I don't think we should change the ordering of headers for HTTP responses. Seems like a fairly decent breaking change, especially in tests that verify response content in customer applications.

Tratcher · 2019-02-19T22:10:10Z

@jkotalik header ordering is not meaningful, especially if you avoid re-ordering multiple instances of the same header. It is quite a bit of test churn.

AndyAyersMS · 2019-02-19T22:12:09Z

Keeping the number of jit temps under 512 is advisable in perf-sensitive code.

512 is the maximum number of things the jit will model for liveness (which is a "dense" data flow problem that requires using bitvectors, so lots of jit time & memory).

See dotnet/coreclr#13280 for some context.

When you go beyond this number, the jit marks 512 temps as "tracked" temps and the remainder as "untracked" temps. It chooses the set of tracked temps greedily, using block-weighted aggregate appearances as a priority function.

Untracked GC ref temps require prolog zeroing. In a method like this, where the average path length is fairly small (even though the method itself is quite large), the prolog costs can predominate. There are a bunch of potential mitigation strategies in the JIT for this, but nothing that is being actively worked on right now.

I wonder though -- if the JIT sees we're in the untracked regime, perhaps it should somewhat prioritize tracking GC refs over non-GC temps... hmm (alternatively: consider an untracked GC ref incurs an implicit extra "prolog" weight from the zeroing that we don't account for -- perhaps we should make that explicit). The goal should really be to minimize the overall post-sort cost, and we don't quite do that right now. Let me play around with this idea for a bit.
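To make the idea concrete, here is a toy model of the greedy selection and of the proposed adjustment. It is emphatically not the JIT's implementation (the JIT is written in C++ and its sort is more involved); the `Temp` type, the `gcPrologWeight` parameter and the scoring formula are all assumptions made for illustration. Each temp is ranked by its block-weighted appearance count, optionally boosted when it is a GC ref to account for the zeroing it would otherwise cost in the prolog, and the top `limit` temps become "tracked".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a temp; not a JIT data structure.
public sealed class Temp
{
    public string Name { get; }
    public double WeightedRefs { get; }   // block-weighted appearance count
    public bool IsGcRef { get; }

    public Temp(string name, double weightedRefs, bool isGcRef)
    {
        Name = name;
        WeightedRefs = weightedRefs;
        IsGcRef = isGcRef;
    }
}

public static class TrackedSetSketch
{
    // Greedily picks the 'limit' highest-priority temps.
    // gcPrologWeight = 0 models ranking by weighted appearances alone;
    // a positive value models charging untracked GC refs for prolog zeroing.
    public static HashSet<string> ChooseTracked(IEnumerable<Temp> temps, int limit, double gcPrologWeight)
    {
        var ranked = temps
            .OrderByDescending(t => t.WeightedRefs + (t.IsGcRef ? gcPrologWeight : 0.0))
            .Take(limit)
            .Select(t => t.Name);
        return new HashSet<string>(ranked);
    }
}
```

With temps A (weight 10, non-GC), B (weight 1, GC ref) and C (weight 2, non-GC) and a limit of 2, a `gcPrologWeight` of 0 tracks A and C, while a weight of 5 tracks A and B instead.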

benaadams · 2019-02-19T22:59:42Z

@jkotalik changed to maintain ContentLength's header position ; so no test churn

AndyAyersMS · 2019-02-19T23:27:02Z

At least in the before example, better temp sorting by the JIT would not help reduce prolog cost. Despite there being 585 temps, all the ref type temps and all but one of the byref type temps end up getting tracked. E.g., the T197 below is the tracking ID for the temp.

;  V581 tmp472      [V581,T197] (  2,  8   )   byref  ->  rcx         "argument with side effect"
...
;  V353 tmp244      [V353    ] (  4,  1.75)   byref  ->  [rsp+0x1D0]   must-init pinned "Inline stloc first use temp"

V353 is the only untracked scalar GC ref.

The bulk of the prolog zeroing thus comes from GC refs within structs. There are 118 struct temps, all marked as must-init (the result of .locals init on the method), and this accounts for the large amount of zeroing. This can be removed by the current jit only if the struct is promoted, but it doesn't look like any of these are promoted; they are all marked "do not enregister." Probably worth understanding why that happens.

benaadams · 2019-02-20T00:44:11Z

A fair boost on the ResponseHeadersWritingBenchmark

  Method |                       Type |     Mean |         Op/s |
 ------- |--------------------------- |---------:|-------------:|
- Output |                 LiveAspNet | 587.5 ns |  1,702,161.8 |
+ Output |                 LiveAspNet | 511.5 ns |  1,955,163.5 |
- Output |           PlaintextChunked | 139.8 ns |  7,151,853.6 |
+ Output |           PlaintextChunked |  86.4 ns | 11,574,316.8 |
- Output | PlaintextChunkedWithCookie | 276.3 ns |  3,618,669.2 |
+ Output | PlaintextChunkedWithCookie | 194.3 ns |  5,148,107.5 |
- Output |        PlaintextWithCookie | 289.5 ns |  3,454,397.7 |
+ Output |        PlaintextWithCookie | 181.2 ns |  5,520,292.8 |
- Output |       TechEmpowerPlaintext | 175.4 ns |  5,700,681.7 |
+ Output |       TechEmpowerPlaintext | 101.1 ns |  9,891,448.7 |

AndyAyersMS · 2019-02-20T01:02:29Z

And as for why there are so many unpromoted structs in the before version -- perhaps no great mystery: we run into the too many temps throttling in the inliner, after 328 successful inlines.

Similar to the kinds of things I'm tracking in dotnet/coreclr#22240.

  [328 IL=1167 TR=003047 06000071] [profitable inline] StringValues:get_Count():int:this
  [0 IL=1213 TR=003105 06001527] [FAILED: too many locals] ReadOnlySpan`1:.ctor(ref,int,int):this
  [0 IL=1218 TR=003111 06000017] [FAILED: too many locals] BufferWriter`1:Write(struct):this
  [0 IL=1226 TR=003120 060006E2] [FAILED: too many locals] PipelineExtensions:WriteAsciiNoValidation(byref,ref)
  [0 IL=1278 TR=002935 06000071] [FAILED: too many locals] StringValues:get_Count():int:this
  ... and many more ....
  [0 IL=4338 TR=000426 06001527] [FAILED: too many locals] ReadOnlySpan`1:.ctor(ref,int,int):this
  [0 IL=4343 TR=000432 06000017] [FAILED: too many locals] BufferWriter`1:Write(struct):this
  [0 IL=4351 TR=000441 060006E2] [FAILED: too many locals] PipelineExtensions:WriteAsciiNoValidation(byref,ref)

Probably worth running TechEmpower scenarios with JIT inline ETL enabled to see if this is a common failure reason, and if so, where we might want to reduce method complexity.

Am going to temporarily remove this check and let the inliner run free, just to see what happens:

https://github.com/dotnet/coreclr/blob/e2081d0e67a1d7fedaf6303f576acef316c7bd66/src/jit/compiler.hpp#L1531-L1535

benaadams · 2019-02-20T01:21:06Z

Plaintext inlines https://gist.github.com/benaadams/1e08f4fe89b3a07a64a915d388a63336

"FAILED: too many locals" seems limited to this method

AndyAyersMS · 2019-02-20T01:23:53Z

Inliner can handle it, but not the rest of the jit. Way more temps, lots of untracked, etc.

Top method regressions by size (bytes):
       17684 (134.87% of base) : Microsoft.AspNetCore.Server.Kestrel.Core.dasm - HttpResponseHeaders:CopyToFast(byref):this

;  V1249 tmp1140    [V1249,T506] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1250 tmp1141    [V1250,T507] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1251 tmp1142    [V1251,T508] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1252 tmp1143    [V1252,T509] (  2,  8   )   byref  ->  rcx         "argument with side effect"
;  V1253 cse0       [V1253,T01] (207,364.50)   byref  ->  [rsp+0x28]   "ValNumCSE"
;
; Lcl frame size = 7024

G_M45944_IG01:
       push     r15
       push     r14
       push     r12
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       test     qword ptr [rsp-1000H], rax
       sub      rsp, 0x1B70
       vzeroupper 
       mov      rsi, rcx
       lea      rdi, [rsp+30H]
       mov      ecx, 0x6D0

That looks like the only method in the core assembly that hits this limit.

Simplifying it is the right thing to do.

We may get the jit to the point where it can handle things like the before case better someday, but that day is a ways off.

benaadams · 2019-02-20T01:58:08Z

Thanks for checking

benaadams · 2019-02-20T03:20:18Z

Cleaned up the other switch jump tables to use named labels rather than just numbers, so it's easier to compare the label name to what it's outputting

Header order doesn't matter

halter73 · 2019-02-21T23:51:45Z

src/Servers/Kestrel/tools/CodeGenerator/KnownHeaders.cs

-            public string TestNotBit() => $"(_bits & {1L << Index}L) == 0";
-            public string SetBit() => $"_bits |= {1L << Index}L";
-            public string ClearBit() => $"_bits &= ~{1L << Index}L";
+            public string TestBit() => $"(_bits & {"0x" + (1L << Index).ToString("x")}L) != 0";
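As an illustration of what the hex formatting changes in the generated source, the sketch below reproduces the interpolation from the added `TestBit` line next to a decimal form in the style of the removed lines. The class name and the `index` parameter are hypothetical stand-ins for the generator's `Index` property; only the two format expressions are taken from the diff above.

```csharp
using System;

// Hypothetical stand-in for the code generator's per-header helper.
public static class KnownHeaderBitSketch
{
    // Decimal style, as in the removed lines: the mask is emitted as a base-10 literal.
    public static string TestBitDecimal(int index) => $"(_bits & {1L << index}L) != 0";

    // Hex style, as in the added line: the mask is emitted as a 0x... literal.
    public static string TestBitHex(int index) => $"(_bits & {"0x" + (1L << index).ToString("x")}L) != 0";
}
```

For index 10 the decimal form emits `(_bits & 1024L) != 0` and the hex form emits `(_bits & 0x400L) != 0`, which makes the single set bit easier to spot when reading the generated file.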


benaadams · 2019-02-22T00:59:28Z

Changed to a loop; perf is very nice (it also includes the ReadOnlySpan (ROS) optimization etc., so it isn't a like-for-like comparison with the pre-loop version)

 Method |                       Type |      Mean |         Op/s |
------- |--------------------------- |----------:|-------------:|
 Output |                 LiveAspNet | 462.28 ns |  2,163,172.9 |
 Output |           PlaintextChunked |  84.38 ns | 11,850,576.2 |
 Output | PlaintextChunkedWithCookie | 179.53 ns |  5,570,065.5 |
 Output |        PlaintextWithCookie | 169.24 ns |  5,908,899.3 |
 Output |       TechEmpowerPlaintext |  86.15 ns | 11,607,037.7 |
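The thread does not show the loop itself, so the following is only a rough sketch of the two ideas mentioned: walking the set bits in a loop with one shared span local, and exposing constant header bytes through a `ReadOnlySpan<byte>` property initialized from a constant `new byte[] { ... }`, which recent C# compilers can back with static data instead of allocating an array per call. The class, the three header names and the string-based output are assumptions for illustration and do not mirror Kestrel's generated code.

```csharp
using System;
using System.Text;

// Hypothetical illustration only; not the generated Kestrel code.
public static class HeaderLoopSketch
{
    // Constant byte data exposed as spans (the "ROS<byte> optimization" pattern).
    private static ReadOnlySpan<byte> AgeName => new byte[] { (byte)'A', (byte)'g', (byte)'e' };
    private static ReadOnlySpan<byte> DateName => new byte[] { (byte)'D', (byte)'a', (byte)'t', (byte)'e' };
    private static ReadOnlySpan<byte> ViaName => new byte[] { (byte)'V', (byte)'i', (byte)'a' };

    private static ReadOnlySpan<byte> NameOf(int index)
    {
        switch (index)
        {
            case 0: return AgeName;
            case 1: return DateName;
            default: return ViaName;
        }
    }

    // Writes every header whose bit is set, using a single span local per iteration.
    public static string Write(long bits, string[] values)
    {
        var sb = new StringBuilder();
        for (int index = 0; index < values.Length && index < 3; index++)
        {
            if ((bits & (1L << index)) == 0)
            {
                continue; // header not present
            }

            ReadOnlySpan<byte> name = NameOf(index);
            foreach (byte b in name)
            {
                sb.Append((char)b);
            }
            sb.Append(": ").Append(values[index]).Append("\r\n");
        }
        return sb.ToString();
    }
}
```

For example, `HeaderLoopSketch.Write(0x5, new[] { "10", "today", "proxy" })` emits the Age and Via lines and skips Date.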

benaadams · 2019-02-22T03:26:42Z

AspNetCore-ci failed due to weird (Code_check) issues; raised issue for it #7839

pakrym · 2019-02-22T03:48:07Z

Any changes in public APIs? Rebase and run .\eng\scripts\GenerateReferenceAssemblies.ps1

benaadams · 2019-02-22T04:44:04Z

> Any changes in public APIs?

Don't think so

> Rebase and run...

Will do

benaadams · 2019-02-22T16:50:24Z

CI issues #7839 and #7867

halter73 · 2019-02-26T01:41:25Z

Thanks!

benaadams requested review from jkotalik and Tratcher as code owners February 19, 2019 21:36

jkotalik previously requested changes Feb 19, 2019

benaadams force-pushed the Header-stack-space branch from 9104076 to 3934001 Compare February 19, 2019 22:59

Eilon added the area-servers label Feb 19, 2019

benaadams force-pushed the Header-stack-space branch from 3ff109b to 220a502 Compare February 20, 2019 03:38

davidfowl requested review from halter73 and davidfowl February 20, 2019 05:23

halter73 reviewed Feb 21, 2019

src/Servers/Kestrel/tools/CodeGenerator/KnownHeaders.cs

halter73 reviewed Feb 21, 2019

halter73 approved these changes Feb 22, 2019

benaadams force-pushed the Header-stack-space branch from e2ca205 to 80d514f Compare February 22, 2019 04:51

halter73 reviewed Feb 22, 2019

src/Servers/Kestrel/Core/ref/Microsoft.AspNetCore.Server.Kestrel.Core.netcoreapp3.0.cs

Tratcher reviewed Feb 22, 2019

src/Servers/Kestrel/Core/src/Internal/Http/HttpResponseTrailers.cs

benaadams added 14 commits February 23, 2019 03:10

Use less stack for HttpResponseHeaders.CopyToFast

efd18bf

Preserve ContentLength header position

654e110

Move OutputHeader first

78ba1ba

Improve ResponseHeadersWritingBenchmark

de3f537

Use named labels rather than numbered

041d258

Clarity and fix

9a1f14f

Fix enumerators and Trailers for Content Length

789e5c2

Use hex for flags

907b6f0

Oops

295254c

Use C#7 ROS<byte> optimization

23178d9

Moar hex

5661f05

Change to loop

6033e54

Ref assemblies

c7dab37

Remove ContentLength from trailers

87182af

benaadams force-pushed the Header-stack-space branch from 5aace8e to 87182af Compare February 23, 2019 03:11

halter73 merged commit 423de42 into dotnet:master Feb 26, 2019

benaadams deleted the Header-stack-space branch February 26, 2019 01:42

benaadams mentioned this pull request Apr 2, 2019

Reuse previous materialized strings #8374

Merged

This was referenced Jan 31, 2020

Examples where heavy intrinsics usage runs into internal jit limits on optimization dotnet/runtime#11905

Open

JIT: consider boosting priority of GC type temps when sorting to determine tracked set dotnet/runtime#12073

Open

amcasey added area-networking and removed area-runtime labels Jun 6, 2023
