Skip to content

Replace curl with reqwest #1499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 12, 2018
Merged

Conversation

sgrif
Copy link
Contributor

@sgrif sgrif commented Sep 24, 2018

The diff stats may look scary, but the vast majority of the added lines is because reqwest depends on tokio, which caused 448 lines to be added to Cargo.lock. The actual code changes are not particularly large

Why should we do this?

curl::Easy is anything but. Its API is verbose, and consistently misunderstood in our codebase. These commits replace the curl crate with reqwest, which is designed to be a higher level HTTP client. It doesn't have every random nicety we might want (like setting Date on POST/PUT requests), but does have a ton of things we want built in like helpers for JSON request/response bodies that assumes the use of serde, and "turn 4xx/5xx response codes into an error" helper functions.

At best the curl crate required some very verbose code, or confusing APIs to make it possible to work with stack data, but at worst it led to some code that was confusing or even dangerous. A few things that were weird or gave me pause while writing this were:

  • Setting the Host header manually, which is at best pointless and at worst a major latent bug waiting to strike. I suspect that the first instance of this was in the S3 code, which was done because Host is mentioned in the S3 docs for common headers for some reason, and the other places this was done were due to that code being copied from the S3 code
  • Specifically calling a CURL function that opted into Expect: 100-Continue, and then manually overriding the Expect header with a comment that it's not clear what it's purpose is
  • Setting the Date header on requests for no reason (which on requests with no body, is actually a violation of the HTTP/1.1 spec)
  • Choosing "hello!" as a User-Agent string for endpoints which require it

What changes occur in our HTTP requests?

The http-data files are basically impossible to review, so here are the changes that were made:

All requests have the following headers added:

  • User-Agent: reqwest/0.9.1
  • Accept-Encoding: gzip

All requests had the following header removed:

  • Proxy-Connection: Keep-Alive
    • Uncyclopedia describes this header as "Implemented as a misunderstanding of the HTTP specifications. Common because of mistakes in implementations of early HTTP versions." It has been removed from Firefox and more recent versions of curl.

Some requests had the following header removed:

  • User-Agent: hello!
    • The HTTP specifications states that all user agents SHOULD send this header. All requests that we make are a direct result of user interaction, making us a user agent (and it's pretty standard to expect this header even from non-user-agents like crawlers). Arguably we should be setting this to crates.io, but this is at least marginally useful unlike "hello!" which is not appropriate here.

What notable changes happened in the code?

For the most part, the code is just replacing manual stuff with built-in stuff. I've removed anything that set a Date or User-Agent header manually (unless Date was required, which is the case for S3, since the date is used as part of the auth signature). I've removed all cases that were manually setting a Host header, since if this ever differed from the URL we used to lookup the domain, it would be very bad.

While making this change, a few refactorings occurred.

The S3 code was taking a file_length parameter as a u64. This was expected to be the length of the bytes passed in, but... we have a slice so we can just ask it the length. Due to a combination of the API reqwest wants to expose, and the fact that its backed by tokio, it can only work with static data. This was only relevant for S3 uploads, but changing the relevant functions from taking &[u8] to Vec<u8> was a trivial change.

The functions for interacting with the GitHub API were split into two pieces, one which made the request, and one which processed it. The only reason for this split was so that one consumer could intercept 404 responses and return Ok(false). I've opted to merge all of this into a single function, and return NotFound for 404, which we can check for in that one place. I was expecting to use Any::is there, but due to the way its written, it required duplicating its implementation (see the commit message for that commit for more details)

The only other point of note is that the changes to GhUser::init are not tested, since they only run if some files in this repo are not present, and for that code to run we'd need the passwords for two GitHub users that we don't have access to. I suspect we should probably just delete that code and panic if we don't have the file, but I haven't done so here.

There's some minor differences in behavior here -- We're no longer
setting the `Date` header, as clients aren't supposed to send it for
requests without bodies (and even for post/put it's optional). We're
also not manually setting the host header because there's literally no
reason for us to be doing so.
This behavior was split into two functions so that one of the 5 places
we're calling it from could intercept a 404 response code. Since the
happy path otherwise is just calling `.json`, this felt really awkward
to me. Instead I've opted to return `NotFound`, and downcast to it in
the error path for that one place.

I expected to just be able to call `Any::is`, but it turns out this
method is an inherent method not a trait method (which makes sense,
otherwise `Any` wouldn't be object safe). However, a side effect of that
is that even though `Any` is a supertrait of `CargoError`, we can't call
`Any::is` since `dyn CargoError` can't be cast to `dyn Any`. This may
change at some point in the future (see
rust-lang/rfcs#2035), but we have to duplicate
the body of `Any::is` for now.

Except we can't even just duplicate the body of `Any::is`, because the
only trait method for `Any` is unstable so we have to duplicate that
method, too..........

Other notable changes:

- We're no longer sending `Proxy-Connection: Keep-Alive`. According to
  Uncyclopedia, this is "Implemented as a misunderstanding of the HTTP
  specifications. Common because of mistakes in implementations of early
  HTTP versions". Firefox recently removed it, as have recent versions
  of curl.
- We will accept gzipped responses now. This is good.
- We send an actual user agent, instead of "hello!"
Notable differences in behavior:

- We now send "User-Agent: reqwest/0.9.1" to S3
- We no longer send "Proxy-Connection: Keep-Alive" (which was useless)
- We now send "Accept-Encoding: gzip", which has no effect on these
  requests, as we do not receive any response body

We are no longer manually specifying the `Date` header, since this is at
best completely pointless, and at worse a latent bug waiting to crop up.
We are still setting the `Date` header, as S3 requires either that
header or `x-amz-date` when `Authorization` is set, since the date is
part of the signature used.

This commit has the largest effect from removing curl, as this was the
oldest code using it, had accumulated the most cruft, and had an API
most affected by its use.

Going through this code was a fun jump through history. There were a few
notable interesting things about the old code.

The return type of `Transfer` appeared to be to work around some
lifetime issues from taking a slice. Notably, `reqwest` actually
requires a static lifetime on the request body (due to the API it wants
to expose, and the fact that it's backed by tokio, it cannot deal with
non-static lifetimes). Luckily, there were no callers that actually
needed to pass a slice.

My favorite part of this code though, was using a curl method that has
no purpose other than to opt into `Expect: 100-continue`, and then
explicitly overriding the `Expect:` header due to not understanding it's
purpose. (For the record, `Expect: 100-continue` is used when sending
the request body is "inappropriate or inefficient", and specifies that
the server should reply with 100 CONTINUE before the client will send
the request body). I'm actually surprised this interfered with the
recorder. The HTTP/1.1 specification points out that older servers may
not recognize the `Expect` header, and that clients should not wait an
indefinite period of time before sending the request body.
And for our final replacement, we fix some code that never actually
runs, and would fail if it did because running it requires credentials
for two specific github users that none of us have! Hooray!
@sgrif sgrif mentioned this pull request Sep 24, 2018
@sgrif sgrif requested a review from jtgeibel October 11, 2018 22:05
@sgrif
Copy link
Contributor Author

sgrif commented Oct 12, 2018

bors: r+

bors-voyager bot added a commit that referenced this pull request Oct 12, 2018
1499: Replace `curl` with `reqwest` r=sgrif a=sgrif

**The diff stats may look scary, but the vast majority of the added lines is because `reqwest` depends on `tokio`, which caused 448 lines to be added to `Cargo.lock`. The actual code changes are not particularly large**

Why should we do this?
---

`curl::Easy` is anything but. Its API is verbose, and consistently misunderstood in our codebase. These commits replace the `curl` crate with `reqwest`, which is designed to be a higher level HTTP client. It doesn't have every random nicety we might want (like setting `Date` on `POST`/`PUT` requests), but does have a ton of things we want built in like helpers for JSON request/response bodies that assumes the use of serde, and "turn 4xx/5xx response codes into an error" helper functions.

At best the curl crate required some very verbose code, or confusing APIs to make it possible to work with stack data, but at worst it led to some code that was confusing or even dangerous. A few things that were weird or gave me pause while writing this were:

- Setting the `Host` header manually, which is at best pointless and at worst a major latent bug waiting to strike. I suspect that the first instance of this was in the S3 code, which was done because `Host` is mentioned in the S3 docs for common headers for some reason, and the other places this was done were due to that code being copied from the S3 code
- Specifically calling a CURL function that opted into `Expect: 100-Continue`, and then manually overriding the `Expect` header with a comment that it's not clear what it's purpose is
- Setting the `Date` header on requests for no reason (which on requests with no body, is actually a violation of the HTTP/1.1 spec)
- Choosing "hello!" as a User-Agent string for endpoints which require it

What changes occur in our HTTP requests?
----

The `http-data` files are basically impossible to review, so here are the changes that were made:

All requests have the following headers added:

- `User-Agent: reqwest/0.9.1`
- `Accept-Encoding: gzip`

All requests had the following header removed:

- `Proxy-Connection: Keep-Alive`
  - Uncyclopedia describes this header as "Implemented as a misunderstanding of the HTTP specifications. Common because of mistakes in implementations of early HTTP versions." It has been removed from Firefox and more recent versions of curl.

Some requests had the following header removed:
 
- `User-Agent: hello!`
  - The HTTP specifications states that all user agents SHOULD send this header. All requests that we make are a direct result of user interaction, making us a user agent (and it's pretty standard to expect this header even from non-user-agents like crawlers). Arguably we should be setting this to `crates.io`, but this is at least marginally useful unlike "hello!" which is not appropriate here.

What notable changes happened in the code?
---

For the most part, the code is just replacing manual stuff with built-in stuff. I've removed anything that set a `Date` or `User-Agent` header manually (unless `Date` was required, which is the case for S3, since the date is used as part of the auth signature). I've removed all cases that were manually setting a `Host` header, since if this ever differed from the URL we used to lookup the domain, it would be very bad.

While making this change, a few refactorings occurred.

The S3 code was taking a `file_length` parameter as a `u64`. This was expected to be the length of the bytes passed in, but... we have a slice so we can just ask it the length. Due to a combination of the API reqwest wants to expose, and the fact that its backed by tokio, it can only work with static data. This was only relevant for S3 uploads, but changing the relevant functions from taking `&[u8]` to `Vec<u8>` was a trivial change.

The functions for interacting with the GitHub API were split into two pieces, one which made the request, and one which processed it. The only reason for this split was so that one consumer could intercept 404 responses and return `Ok(false)`. I've opted to merge all of this into a single function, and return `NotFound` for 404, which we can check for in that one place. I was expecting to use `Any::is` there, but due to the way its written, it required duplicating its implementation (see the commit message for that commit for more details)

The only other point of note is that the changes to `GhUser::init` are not tested, since they only run if some files in this repo are not present, and for that code to run we'd need the passwords for two GitHub users that we don't have access to. I suspect we should probably just delete that code and panic if we don't have the file, but I haven't done so here.

Co-authored-by: Sean Griffin <[email protected]>
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Oct 12, 2018

Build succeeded

@bors-voyager bors-voyager bot merged commit 0d5ed7a into rust-lang:master Oct 12, 2018
@sgrif sgrif deleted the sg-remove-curl branch October 23, 2018 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants