Skip to content

Add uuid validator #772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 24, 2023
Merged

Conversation

JeanArhancet
Copy link
Contributor

@JeanArhancet JeanArhancet commented Jul 14, 2023

Change Summary

Adds a Rust implementation of UUID validation and serialization.

Related issue number

Resolves #554

Checklist

  • Unit tests for the changes exist
  • Documentation reflects the changes where applicable
  • Pydantic tests pass with this pydantic-core (except for expected changes)
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @davidhewitt

@JeanArhancet JeanArhancet changed the title feat: add uuid validator Add uuid validator Jul 14, 2023
@JeanArhancet JeanArhancet marked this pull request as draft July 14, 2023 07:17
@codspeed-hq
Copy link

codspeed-hq bot commented Jul 14, 2023

CodSpeed Performance Report

Merging #772 will not alter performance

Comparing JeanArhancet:feat/add-uuid-validator (de5c8bf) with main (f38bea6)

Summary

✅ 126 untouched benchmarks

🆕 9 new benchmarks

Benchmarks breakdown

Benchmark main JeanArhancet:feat/add-uuid-validator Change
🆕 test_uuid_from_string_core N/A 60.6 µs N/A
🆕 test_uuid_from_string_pyd N/A 87.5 µs N/A
🆕 test_uuid_from_uuid_core N/A 14.9 µs N/A
🆕 test_uuid_from_uuid_pyd N/A 15.8 µs N/A
🆕 test_core_python N/A 41.3 µs N/A
🆕 test_model_core_json N/A 85.6 µs N/A
🆕 test_core_raw N/A 14.9 µs N/A
🆕 test_core_str N/A 60.7 µs N/A
🆕 test_uuid N/A 66.3 µs N/A

@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch from b80f4d6 to 69fa15e Compare July 16, 2023 19:03
@codecov
Copy link

codecov bot commented Jul 16, 2023

Codecov Report

Merging #772 (de5c8bf) into main (9b540d0) will decrease coverage by 0.03%.
The diff coverage is 88.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #772      +/-   ##
==========================================
- Coverage   93.53%   93.50%   -0.03%     
==========================================
  Files         100      102       +2     
  Lines       14376    14587     +211     
  Branches       25       25              
==========================================
+ Hits        13446    13639     +193     
- Misses        924      942      +18     
  Partials        6        6              
Impacted Files Coverage Δ
src/serializers/shared.rs 80.54% <0.00%> (-0.44%) ⬇️
src/serializers/infer.rs 95.39% <33.33%> (-1.56%) ⬇️
src/errors/types.rs 84.23% <77.77%> (-0.16%) ⬇️
src/input/return_enums.rs 85.08% <80.00%> (-0.05%) ⬇️
src/validators/uuid.rs 90.35% <90.35%> (ø)
src/serializers/type_serializers/uuid.rs 94.00% <94.00%> (ø)
python/pydantic_core/core_schema.py 96.70% <100.00%> (+0.03%) ⬆️
src/input/input_python.rs 98.11% <100.00%> (+0.01%) ⬆️
src/serializers/ob_type.rs 86.03% <100.00%> (+0.65%) ⬆️
src/validators/function.rs 90.65% <100.00%> (ø)
... and 1 more

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f38bea6...de5c8bf. Read the comment docs.

@JeanArhancet JeanArhancet marked this pull request as ready for review July 17, 2023 07:27
@JeanArhancet
Copy link
Contributor Author

please review

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This has been on my list of things to tackle so I'm grateful to have this picked up.

I've given a few initial thoughts. I guess we need to be careful to keep the pydantic behaviour the same. Maybe create a corresponding PR in pydantic which uses pydantic-core from this branch so we can see how the whole thing fits together end-to-end?

Also while reading around I saw uuid-rs/uuid#406, so it sounds like we need to be careful with endianness with the integer conversions going on here.

@JeanArhancet
Copy link
Contributor Author

Thanks! This has been on my list of things to tackle so I'm grateful to have this picked up.

I've given a few initial thoughts. I guess we need to be careful to keep the pydantic behaviour the same. Maybe create a corresponding PR in pydantic which uses pydantic-core from this branch so we can see how the whole thing fits together end-to-end?

Also while reading around I saw uuid-rs/uuid#406, so it sounds like we need to be careful with endianness with the integer conversions going on here.

I've opened this MR pydantic/pydantic#6711 to test integration @davidhewitt

@davidhewitt davidhewitt mentioned this pull request Jul 17, 2023
Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Overall it looks like this implementation is good to me, I've added some final comments and then otherwise I think this is mergeable.

Before we merge it would be great to get the downstream PR in pydantic passing against this git branch to prove that this behaves correctly end-to-end. That will also have the effect of getting us a patch ready to merge when we update pydantic to pull in a future pydantic-core with this validator.

Comment on lines +720 to +724
@pytest.mark.benchmark(group='uuid str')
def test_core_str(self, benchmark, uuid_str):
v = SchemaValidator({'type': 'uuid'})

benchmark(v.validate_python, uuid_str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to compare these benchmarks against the current Pydantic implementation (in Python), if you're willing to do that? That would confirm our assumption that migrating this to Rust is worth it :)

It should be possible to do that by building the same core schema as Pydantic currently does (maybe with some copy-paste) and feeding that to a validator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, do you have an example to create this benchmark?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update #763 later with a similar benchmark and get back to you :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done d9e45c1, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! So I see we're about 33% faster to have a Rust implementation, makes sense to me as the heavy lifting in the Python UUID implementation will be implemented in C anyway; what we're shaving off is the dynamic overheads of checking & constructing a UUID.

@davidhewitt
Copy link
Contributor

Looks great, please give me a bit of time to produce an example benchmark for the decimal case and loop back here!

@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch from f16aed7 to 4cc1c93 Compare July 18, 2023 14:48
@davidhewitt
Copy link
Contributor

(responding to comment about hex, which I can't find above)

Yes, we would need to keep hex string support if the pydantic test suite already suggests we accept hex. I've used the hex crate before, should be suitable.

@JeanArhancet
Copy link
Contributor Author

(responding to comment about hex, which I can't find above)

Yes, we would need to keep hex string support if the pydantic test suite already suggests we accept hex. I've used the hex crate before, should be suitable.

I deleted the comment, I made a mistake the line of code works fine, sorry

@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch from 9cfb7eb to da1f5f6 Compare July 19, 2023 18:34
@JeanArhancet
Copy link
Contributor Author

@davidhewitt I fixed a regression in this commit 8dffaeb (compare version when it's an UUID object)

@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch from da1f5f6 to 79eb871 Compare July 19, 2023 18:55
@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch 3 times, most recently from e85c501 to 42e40c0 Compare July 20, 2023 11:41
@JeanArhancet JeanArhancet force-pushed the feat/add-uuid-validator branch from 42e40c0 to 98f4a4a Compare July 20, 2023 12:33
@JeanArhancet
Copy link
Contributor Author

I've seen another regression with this python input b'ebcdab58-6eb8-46fb-a190-d07a33e9eac8'. Do you have any idea how to transform this? I can add a size condition but maybe there's a better solution:

 None => {
                let either_bytes = input
                    .validate_bytes(true)
                    .map_err(|_| ValError::new(ErrorType::UuidType, input))?;
                let py_object = either_bytes.into_py(py);
                let py_bytes = py_object.downcast::<PyBytes>(py)?;
                if py_bytes.len()? > 16 {
                    let uuid_str = from_utf8(py_bytes.as_bytes()).map_err(|e| ValError::new(ErrorType::UuidParsing { error: e.to_string() }, input))?;
                    Uuid::parse_str(&uuid_str)
                        .map_err(|e| ValError::new(ErrorType::UuidParsing { error: e.to_string() }, input))?
                } else {
                    Uuid::from_slice(py_bytes.as_bytes())
                        .map_err(|e| ValError::new(ErrorType::UuidParsing { error: e.to_string() }, input))?
                }
            }

@davidhewitt
Copy link
Contributor

Looks like this is the relevant equivalent in our Python implementation:

try:
    return UUID(input_value.decode())
except ValueError:
    # 16 bytes in big-endian order as the bytes argument fail
    # the above check
    return UUID(bytes=input_value)

So we currently try to convert bytes to a string, and if that fails, parse as bytes.

I think what you're doing with size is probably better, maybe the following is most efficient to get the data into Rust fastest:

// Add this method to get the data
impl<'a> EitherBytes<'a> {
    pub fn as_slice(&'a self) -> &[u8] {
        match self {
            EitherBytes::Cow(bytes) => Ok(bytes),
            EitherBytes::Py(py_bytes) => py_bytes.as_bytes(),
        }
    }
}

// and later on
let either_bytes = input
    .validate_bytes(true)
    .map_err(|_| ValError::new(ErrorType::UuidType, input))?;
let bytes_slice = either_bytes.as_bytes();
let uuid = 'parse: {
    // Try parsing as utf8, but don't care if it fails
    if let Ok(utf8_str) = from_utf8(bytes_slice) {
        if let Ok(uuid) = Uuid::parse_str(utf8_str) {
            break 'parse uuid;
        }
    }
    Uuid::from_slice(bytes_slice)
        .map_err(|e| ValError::new(ErrorType::UuidParsing { error: e.to_string() }, input))?
}

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, thanks for all the effort on this. I think we're pretty much feature complete here now?

I have just one observation about a serializer branch that I think needs adjusting, otherwise let's merge! We can make a pydantic-core release today and then integrate this upstream into pydantic shortly after 🚀

@@ -187,6 +188,7 @@ pub(crate) fn infer_to_python_known(
let py_url: PyMultiHostUrl = value.extract()?;
py_url.__str__().into_py(py)
}
ObType::Uuid => serialize_dict(uuid_to_dict(value)?)?,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be uuid_to_string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, WDYT?

@@ -483,6 +485,7 @@ pub(crate) fn infer_serialize_known<S: Serializer>(
pydantic_serializer.serialize(serializer)
}
ObType::Dataclass => serialize_dict!(dataclass_to_dict(value).map_err(py_err_se_err)?),
ObType::Uuid => serialize_dict!(uuid_to_dict(value).map_err(py_err_se_err)?),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here I think we just want the string serialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 401 to 411
pub(super) fn uuid_to_dict(dc: &PyAny) -> PyResult<&PyDict> {
let py = dc.py();
let dc_slots: &PyDict = dc.getattr(intern!(py, "__slots__"))?.downcast()?;
let dict = PyDict::new(py);

for (field_name, _) in dc_slots.iter() {
let field_name: &PyString = field_name.downcast()?;
dict.set_item(field_name, dc.getattr(field_name)?)?;
}
Ok(dict)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And in general I guess I don't understand why we'd want this conversion instead of just going to string?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do the same as dataclass, would you rather transform into a string?

Copy link
Contributor

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, thanks so much for all your work on it!

@davidhewitt davidhewitt merged commit c18c875 into pydantic:main Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a built-in validator for UUID
2 participants