Enhancing indexOfDiff efficiency in large input slices #24097

Open: wants to merge 9 commits into base: master

@MINGtoMING commented Jun 6, 2025

Background

The previous std.mem.indexOfDiff was implemented as a naive while loop, so its performance depends on the compiler's auto-vectorization. On large inputs it does not make full use of the CPU, which leads to longer execution times. I therefore tried to optimize std.mem.indexOfDiff along the lines of std.mem.eql so that it uses the hardware more effectively.

fn indexOfDiffBytes(a: []const u8, b: []const u8) ?usize {
    const shortest = @min(a.len, b.len);
    const vec_len = std.simd.suggestVectorLength(u8) orelse 0;
    // ......
}

The optimization strategy is as follows:

  • shortest < @sizeOf(usize): Use a while loop.
  • shortest <= @sizeOf(usize) * 2 or vec_len == 0: Use SWAR.
  • shortest < vec_len: Choose a smaller but appropriate vector length.
  • shortest >= vec_len: Use that vector length and perform loop unrolling.
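
To illustrate the SWAR and vector cases above, here is a minimal sketch for byte slices. It is not the code from this PR: the name indexOfDiffSketch and the `lanes` constant are hypothetical, the sketch falls back to 16 lanes when the target reports no preferred vector length (the PR routes that case to SWAR instead), and it omits the loop unrolling and the smaller-vector case.

```zig
const std = @import("std");

// Assumption for this sketch only: use 16 lanes when the target has no
// suggested vector length. The PR uses SWAR in that situation instead.
const lanes = std.simd.suggestVectorLength(u8) orelse 16;

/// Hypothetical helper, not the PR's implementation.
/// Returns the index of the first differing byte, the shorter length if one
/// slice is a strict prefix of the other, or null if the slices are equal.
fn indexOfDiffSketch(a: []const u8, b: []const u8) ?usize {
    const shortest = @min(a.len, b.len);
    const word = @sizeOf(usize);
    var i: usize = 0;

    // Vector path: compare `lanes` bytes per iteration.
    while (i + lanes <= shortest) : (i += lanes) {
        const va: @Vector(lanes, u8) = a[i..][0..lanes].*;
        const vb: @Vector(lanes, u8) = b[i..][0..lanes].*;
        const ne = va != vb;
        // firstTrue gives the lowest lane index whose bytes differ.
        if (@reduce(.Or, ne)) return i + std.simd.firstTrue(ne).?;
    }

    // SWAR path: compare one machine word at a time.
    while (i + word <= shortest) : (i += word) {
        // Reading as little-endian puts the first byte in memory into the
        // lowest bits, so the trailing zero count locates the first mismatch.
        const x = std.mem.readInt(usize, a[i..][0..word], .little);
        const y = std.mem.readInt(usize, b[i..][0..word], .little);
        const diff = x ^ y;
        if (diff != 0) return i + @ctz(diff) / 8;
    }

    // Scalar tail for the remaining bytes.
    while (i < shortest) : (i += 1) {
        if (a[i] != b[i]) return i;
    }
    return if (a.len == b.len) null else shortest;
}
```

The sketch chains the paths from widest to narrowest rather than dispatching on `shortest` up front as the list describes; that keeps the example short but may cost a few extra branches on very small inputs.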

Benchmark

  • bench code: click
  • indexOfDiff_V1: the existing std.mem.indexOfDiff
  • indexOfDiff_V2: the implementation in this PR
  • std.mem.eql: included as a reference point

cpu: AMD Ryzen 7 3750H with Radeon Vega Mobile Gfx

           fn/T/len            elapsed
---------------------------------------
     indexOfDiff_V1/u8/1         1ns
     indexOfDiff_V2/u8/1         3ns
       std.mem.eql/u8/1          1ns
---------------------------------------
     indexOfDiff_V1/u8/5         3ns
     indexOfDiff_V2/u8/5         3ns
       std.mem.eql/u8/5          2ns
---------------------------------------
     indexOfDiff_V1/u8/10        4ns
     indexOfDiff_V2/u8/10        2ns
      std.mem.eql/u8/10          2ns
---------------------------------------
     indexOfDiff_V1/u8/20        9ns
     indexOfDiff_V2/u8/20        3ns
      std.mem.eql/u8/20          1ns
---------------------------------------
     indexOfDiff_V1/u8/50        31ns
     indexOfDiff_V2/u8/50        6ns
      std.mem.eql/u8/50          3ns
---------------------------------------
    indexOfDiff_V1/u8/100        56ns
    indexOfDiff_V2/u8/100        9ns
      std.mem.eql/u8/100         5ns
---------------------------------------
    indexOfDiff_V1/u8/1000      411ns
    indexOfDiff_V2/u8/1000      117ns
     std.mem.eql/u8/1000        126ns
---------------------------------------
   indexOfDiff_V1/u8/10000     3.226us
   indexOfDiff_V2/u8/10000      820ns
     std.mem.eql/u8/10000       814ns
---------------------------------------
     indexOfDiff_V1/u32/1        0ns
     indexOfDiff_V2/u32/1        1ns
      std.mem.eql/u32/1          3ns
---------------------------------------
     indexOfDiff_V1/u32/5        3ns
     indexOfDiff_V2/u32/5        3ns
      std.mem.eql/u32/5          2ns
---------------------------------------
    indexOfDiff_V1/u32/10        4ns
    indexOfDiff_V2/u32/10        7ns
      std.mem.eql/u32/10         2ns
---------------------------------------
    indexOfDiff_V1/u32/20        15ns
    indexOfDiff_V2/u32/20        9ns
      std.mem.eql/u32/20         3ns
---------------------------------------
    indexOfDiff_V1/u32/50        32ns
    indexOfDiff_V2/u32/50        14ns
      std.mem.eql/u32/50         15ns
---------------------------------------
    indexOfDiff_V1/u32/100       56ns
    indexOfDiff_V2/u32/100       38ns
     std.mem.eql/u32/100         37ns
---------------------------------------
   indexOfDiff_V1/u32/1000      546ns
   indexOfDiff_V2/u32/1000      349ns
     std.mem.eql/u32/1000       385ns
---------------------------------------
   indexOfDiff_V1/u32/10000    4.555us
   indexOfDiff_V2/u32/10000    3.363us
    std.mem.eql/u32/10000       3.42us
---------------------------------------
    indexOfDiff_V1/u128/1        0ns
    indexOfDiff_V2/u128/1        0ns
      std.mem.eql/u128/1         3ns
---------------------------------------
    indexOfDiff_V1/u128/5        6ns
    indexOfDiff_V2/u128/5        9ns
      std.mem.eql/u128/5         4ns
---------------------------------------
    indexOfDiff_V1/u128/10       15ns
    indexOfDiff_V2/u128/10       17ns
     std.mem.eql/u128/10         8ns
---------------------------------------
    indexOfDiff_V1/u128/20       29ns
    indexOfDiff_V2/u128/20       27ns
     std.mem.eql/u128/20         26ns
---------------------------------------
    indexOfDiff_V1/u128/50       84ns
    indexOfDiff_V2/u128/50       60ns
     std.mem.eql/u128/50         62ns
---------------------------------------
   indexOfDiff_V1/u128/100      154ns
   indexOfDiff_V2/u128/100      149ns
     std.mem.eql/u128/100       169ns
---------------------------------------
   indexOfDiff_V1/u128/1000    1.567us
   indexOfDiff_V2/u128/1000    1.595us
    std.mem.eql/u128/1000      1.772us
---------------------------------------
  indexOfDiff_V1/u128/10000    15.069us
  indexOfDiff_V2/u128/10000    14.597us
    std.mem.eql/u128/10000     13.783us

@MINGtoMING force-pushed the opt-index-of-diff branch from ad384f4 to dcf6ff4 on June 8, 2025 13:39

@MINGtoMING (Author)

cpu: AArch64 Processor rev 0 (aarch64), vendor Kirin820

           fn/T/len            elapsed
---------------------------------------
     indexOfDiff_V1/u8/5         19ns
     indexOfDiff_V2/u8/5         21ns
       std.mem.eql/u8/5          9ns
---------------------------------------
     indexOfDiff_V1/u8/10        31ns
     indexOfDiff_V2/u8/10        16ns
      std.mem.eql/u8/10          9ns
---------------------------------------
     indexOfDiff_V1/u8/20        55ns
     indexOfDiff_V2/u8/20        27ns
      std.mem.eql/u8/20          13ns
---------------------------------------
     indexOfDiff_V1/u8/50       129ns
     indexOfDiff_V2/u8/50        39ns
      std.mem.eql/u8/50          26ns
---------------------------------------
    indexOfDiff_V1/u8/100       254ns
    indexOfDiff_V2/u8/100        52ns
      std.mem.eql/u8/100         41ns
---------------------------------------
    indexOfDiff_V1/u8/1000     2.667us
    indexOfDiff_V2/u8/1000      359ns
     std.mem.eql/u8/1000        356ns
---------------------------------------
   indexOfDiff_V1/u8/10000     15.768us
   indexOfDiff_V2/u8/10000     3.449us
     std.mem.eql/u8/10000      3.591us
---------------------------------------
     indexOfDiff_V1/u32/5        7ns
     indexOfDiff_V2/u32/5        14ns
      std.mem.eql/u32/5          7ns
---------------------------------------
    indexOfDiff_V1/u32/10        14ns
    indexOfDiff_V2/u32/10        36ns
      std.mem.eql/u32/10         10ns
---------------------------------------
    indexOfDiff_V1/u32/20        27ns
    indexOfDiff_V2/u32/20        20ns
      std.mem.eql/u32/20         15ns
---------------------------------------
    indexOfDiff_V1/u32/50        66ns
    indexOfDiff_V2/u32/50        44ns
      std.mem.eql/u32/50         38ns
---------------------------------------
    indexOfDiff_V1/u32/100      132ns
    indexOfDiff_V2/u32/100       72ns
     std.mem.eql/u32/100         68ns
---------------------------------------
   indexOfDiff_V1/u32/1000     1.363us
   indexOfDiff_V2/u32/1000     1.038us
     std.mem.eql/u32/1000       980ns
---------------------------------------
   indexOfDiff_V1/u32/10000    11.881us
   indexOfDiff_V2/u32/10000    5.994us
    std.mem.eql/u32/10000      6.089us
---------------------------------------
    indexOfDiff_V1/u128/5        10ns
    indexOfDiff_V2/u128/5        17ns
      std.mem.eql/u128/5         12ns
---------------------------------------
    indexOfDiff_V1/u128/10       18ns
    indexOfDiff_V2/u128/10       28ns
     std.mem.eql/u128/10         31ns
---------------------------------------
    indexOfDiff_V1/u128/20       35ns
    indexOfDiff_V2/u128/20       50ns
     std.mem.eql/u128/20         45ns
---------------------------------------
    indexOfDiff_V1/u128/50       90ns
    indexOfDiff_V2/u128/50      115ns
     std.mem.eql/u128/50        113ns
---------------------------------------
   indexOfDiff_V1/u128/100      204ns
   indexOfDiff_V2/u128/100      232ns
     std.mem.eql/u128/100       228ns
---------------------------------------
   indexOfDiff_V1/u128/1000    2.279us
   indexOfDiff_V2/u128/1000     2.45us
    std.mem.eql/u128/1000      2.323us
---------------------------------------
  indexOfDiff_V1/u128/10000    26.107us
  indexOfDiff_V2/u128/10000    25.562us
    std.mem.eql/u128/10000     25.135us
