Speed up JSON Parsing #181


Merged
merged 22 commits on Dec 29, 2015
Conversation

argon
Contributor

@argon argon commented Dec 29, 2015

I ran 2 benchmarks against the current JSON parser implementation on master, comparing it to Darwin Foundation on large inputs, and found it was 5-10x slower (72s vs 8s for a single decode). This new code reduces that to a factor of 2-4x (17s).

While JSON is a text-based serialization, all of its control characters are ASCII characters. The data can therefore be parsed most efficiently as a stream of bytes, avoiding the initial overhead of converting the bytes to a String.
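To illustrate the idea (this is a minimal sketch, not the PR's actual parser; the function and constant names here are hypothetical), a scanner can skip whitespace and classify the next token by inspecting raw UTF-8 bytes, since every JSON structural character is a single ASCII byte:

```swift
// Hypothetical sketch: JSON structural characters are all ASCII, so a
// tokenizer can work on UInt8 values without ever building a String.
let openBrace: UInt8 = 0x7B   // '{'
let openBracket: UInt8 = 0x5B // '['

/// Returns the first non-whitespace byte, or nil if the input is all
/// whitespace. JSON whitespace is space, tab, newline, carriage return.
func firstStructuralByte(_ bytes: [UInt8]) -> UInt8? {
    let whitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0D]
    for byte in bytes where !whitespace.contains(byte) {
        return byte
    }
    return nil
}
```

Because the comparison is a byte equality test rather than a Unicode-aware character comparison, the hot loop never touches String machinery.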

The byte -> string conversion in the current code accounts for only 3s of the time, but parsing numbers is incredibly expensive: Double.init?(_ text: String) converts the string back to bytes in order to call strtod(). This technique uses strtod() directly on the byte array. There is a penalty for non-UTF-8 encoded data, which is in line with the Darwin implementation.
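A sketch of the strtod-on-bytes technique (the function name `parseDouble` is hypothetical, and this is a simplified illustration, not the PR's code): copy the digits into a NUL-terminated buffer and hand strtod a pointer into it, skipping the String round-trip entirely:

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Hypothetical sketch: parse a Double directly from UTF-8 bytes using
/// strtod(), returning the value and how many bytes were consumed.
/// Returns nil if no number could be parsed at the start of the input.
func parseDouble(_ bytes: [UInt8]) -> (value: Double, consumed: Int)? {
    var buffer = bytes
    buffer.append(0) // strtod requires NUL termination
    return buffer.withUnsafeBufferPointer { buf -> (Double, Int)? in
        buf.baseAddress!.withMemoryRebound(to: CChar.self, capacity: buf.count) { start in
            var end: UnsafeMutablePointer<CChar>? = nil
            let value = strtod(start, &end)
            guard let endPointer = end, endPointer != start else {
                return nil // strtod consumed nothing: not a number
            }
            return (value, start.distance(to: UnsafePointer(endPointer)))
        }
    }
}
```

The `consumed` count lets the caller advance its byte index past the number, which is exactly what a single-pass byte parser needs.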

The most efficient encoding to use for parsing is UTF-8, so if you have a choice in encoding the data passed to this method, use UTF-8.

Another modification removes the intermediate parser values, since the heap allocation for each intermediate parser added significant overhead.
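The allocation-avoidance idea can be sketched as follows (a hypothetical illustration, assuming a `ByteParser` type that is not the PR's actual implementation): instead of allocating a new parser object for each nested value, a single value-type parser mutates its own index as it consumes input:

```swift
/// Hypothetical sketch: a value-type parser that advances an index over
/// a byte array in place, so parsing nested values requires no per-value
/// heap allocation for intermediate parser objects.
struct ByteParser {
    let bytes: [UInt8]
    var index: Int = 0

    /// Consumes a run of ASCII digits and returns how many were read.
    mutating func consumeDigits() -> Int {
        let start = index
        while index < bytes.count, (0x30...0x39).contains(bytes[index]) {
            index += 1
        }
        return index - start
    }
}
```

Because `ByteParser` is a struct living on the stack, recursing into arrays and objects only passes an index around rather than allocating and releasing parser instances on the heap.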

This also includes a commit to implement .AllowFragments.

argon and others added 22 commits December 22, 2015 13:09

  • JavaScript numbers are doubles, so we should use that as the basis for validation.
  • Only supports UTF-8. This is an experiment to see if it is the right solution.
  • Other encodings still have undefined behaviour, and error locations are broken for input containing any multi-byte characters.
  • A character with non-zero higher-order bytes would previously be interpreted as ASCII and decoding would break.
  • One niggling use remains in `takeString`.
  • Needs to reference code unit locations instead of specific character locations.
@phausler
Contributor

From a cursory glance this looks like a reasonable change; builds pass on Ubuntu.

Do you happen to have the performance benchmarks handy? That might be something interesting to check in so that we could potentially build up a performance test suite.

Additionally, what other hotspots in Foundation did you find through this?

phausler added a commit that referenced this pull request Dec 29, 2015
@phausler phausler merged commit db4b395 into swiftlang:master Dec 29, 2015
@argon
Contributor Author

argon commented Dec 29, 2015

The performance "benchmark" wasn't automated in any way, unfortunately. I simply created a main.swift that reads the JSON file from disk and decodes it, then profiled it with Instruments.

The main hotspots in this code now are:

  • Array and Dictionary size increases, leading to a large number of copies
  • Bridging from NSString -> String in parseString

In general, String parsing for sf-city-lots-json carries a huge time cost (~5000ms vs 250ms) compared with Darwin Foundation.

@argon argon deleted the experiment/JSONParseBuffer branch December 29, 2015 13:01
atrick pushed a commit to atrick/swift-corelibs-foundation that referenced this pull request Jan 12, 2021
Improve BuildServerBuildSystemTests error handling
kateinoigakukun pushed a commit to kateinoigakukun/swift-corelibs-foundation that referenced this pull request Oct 11, 2023