-
Notifications
You must be signed in to change notification settings - Fork 292
Rework def ref system to fix various bugs #571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/build_context.rs
Outdated
// Mapping of reference strings to indexes in the definitions Vec | ||
// As we recurse through the input when we find: | ||
// - A definitions-reference (`{'type': 'definitions-reference', 'schema_ref': '...'}`) | ||
// then we store that `schema_ref` -> `definitions.len()` | ||
// so that the value is the index into definitions. | ||
// We also set a None value in definitions. | ||
// Then we create a DefinitionsReferenceValidator that stores the index (value) from this mapping. | ||
// At runtime if it needs to call into the validator it simply looks it up by index. | ||
// - Any schema with a `'ref'`: before recursing we store the ref here and set a None | ||
// value in definitions. | ||
// As we recurse if a definitions-reference to this | ||
// Then after recursing and completing the CombinedValidator we replace that None value with the | ||
// newly built validator. Now anything that | ||
refs: AHashMap<String, usize>, | ||
|
||
// Here we hold an *optional* reference to a definition validator | ||
// Any time we encounter a schema with a `'ref'` we build a DefinitionsReferenceValidator that points | ||
// to an index in this Vector | ||
// At the end of building we check if any values are None, which indicates an unsatisfied reference | ||
definitions: Vec<Option<T>>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't too hard to add a basic ref-count system here so we know which references have more than one referrer and hence which can be inlined or not. But I got bogged down in implementation multiple times.
fn extract_used_refs(schema: &PyAny, refs: &mut AHashSet<String>) -> PyResult<()> { | ||
if let Ok(dict) = schema.downcast::<PyDict>() { | ||
if is_definition_ref(dict)? { | ||
refs.insert(dict.get_as_req(intern!(schema.py(), "schema_ref"))?); | ||
} else { | ||
for (key, value) in dict.iter() { | ||
if !key.eq(intern!(schema.py(), "metadata"))? { | ||
extract_used_refs(value, refs)?; | ||
} | ||
} | ||
} | ||
} else if let Ok(list) = schema.downcast::<PyList>() { | ||
for item in list.iter() { | ||
extract_used_refs(item, refs)?; | ||
} | ||
} | ||
Ok(()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a lot of the dragons were in here, and it was slow (hence the building schema speedup).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess this is how we did the reference counting.
return serializer.to_python( | ||
py, | ||
value, | ||
extra.mode.to_object(py).extract(py)?, | ||
include, | ||
exclude, | ||
extra.by_alias, | ||
extra.exclude_unset, | ||
extra.exclude_defaults, | ||
extra.exclude_none, | ||
extra.round_trip, | ||
extra.warnings.is_active(), | ||
extra.fallback, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solves a bug where we were calling into the next serializer without any slots
, so if it had any def refs it would crash.
src/validators/definitions.rs
Outdated
fn validate_assignment<'s, 'data: 's>( | ||
&'s self, | ||
py: Python<'data>, | ||
obj: &'data PyAny, | ||
field_name: &'data str, | ||
field_value: &'data PyAny, | ||
extra: &Extra, | ||
slots: &'data [CombinedValidator], | ||
recursion_guard: &'s mut RecursionGuard, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solves a bug where validate_assignment
straight up didn't work with any def refs because they didn't implement it.
@@ -13,7 +11,7 @@ def test_custom_ser(): | |||
) | |||
) | |||
assert s.to_python([1, 2, 3]) == ['1', '2', '3'] | |||
assert plain_repr(s).endswith('slots=[])') | |||
# assert plain_repr(s).endswith('slots=[])') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What to do with these tests? Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is testing that slots weren't used, I guess because we previously identified that the validator was only used once, so inlined it.
src/validators/mod.rs
Outdated
if build_context.ref_used(&schema_ref) { | ||
// the ref is used somewhere | ||
// check the ref is unique | ||
if build_context.ref_already_used(&schema_ref) { | ||
return py_err!("Duplicate ref: `{}`", schema_ref); | ||
} | ||
|
||
return if build_context.ref_used_within(schema_dict, &schema_ref)? { | ||
// the ref is used within itself, so we have to store the validator in slots | ||
// and return a DefinitionRefValidator | ||
let slot_id = build_context.prepare_slot(schema_ref)?; | ||
let inner_val = T::build(schema_dict, config, build_context)?; | ||
let name = inner_val.get_name().to_string(); | ||
build_context.complete_slot(slot_id, inner_val)?; | ||
Ok(definitions::DefinitionRefValidator::from_id(slot_id, name)) | ||
} else { | ||
// ref is used, but only out side itself, we want to clone it everywhere it's used | ||
let validator = T::build(schema_dict, config, build_context)?; | ||
build_context.store_reusable(schema_ref, validator.clone()); | ||
Ok(validator) | ||
}; | ||
} | ||
let validator_id = build_context.get_ref_id(schema_ref); | ||
let inner_val = T::build(schema_dict, config, build_context)?; | ||
build_context.add_definition(validator_id, inner_val)?; | ||
return Ok(DefinitionRefValidator::from_id(validator_id).into()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The big question here is: what is the cost of this change? In theory, it's large. As you can see anything with a ref
permanently becomes a DefinitionRefValidator
, which also always use slots so they'll be "slow".
However this is not what needs to happen in practice: it would be quite easy to, in Python, remove any unused refs
. Certainly much much easier than maintaining the complex system of refs and slots and whatnot that we had here. And in the end, it will be just as efficient because we can give pydantic-core
the most efficient arrangement of references to begin with instead of trying to jump through hoops in Rust to get the borrow checker to accept circular references.
If we do that in Pydantic we can even enforce it here if we want to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine to me, but are we sure it's okay for core-schema's validator? If that's not effected, I'm fine to move it to pydantic.
The other option is to have another pass at reference counting in pydantic-core after this is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular, I think we can copy this stuff into pydantic-core
: https://github.com/pydantic/pydantic/blob/de0cf4bca502aad9f7e747cae60298e654bc5a75/pydantic/_internal/_core_utils.py#L222-L423. At which point we're just moving some complex logic from Rust into Python. I don't think it was much faster in Rust to begin with (in fact, I think it may have been slower because it was recursing into Python objects more) and it's certainly easier to debug in Python. And we wouldn't need to do any fancy stuff in Rust.
That said, yes I see the possibility of coming back after this PR and adding back complexity to do the schema inlining / reference counting.
By the way, this is not based on the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall I think this change is good.
I'm marginally in favour of implementing reference counting in pydantic-core rather than changing pydantic, but:
- I can see both arguments
- I think it should be done after this PR.
src/validators/mod.rs
Outdated
if build_context.ref_used(&schema_ref) { | ||
// the ref is used somewhere | ||
// check the ref is unique | ||
if build_context.ref_already_used(&schema_ref) { | ||
return py_err!("Duplicate ref: `{}`", schema_ref); | ||
} | ||
|
||
return if build_context.ref_used_within(schema_dict, &schema_ref)? { | ||
// the ref is used within itself, so we have to store the validator in slots | ||
// and return a DefinitionRefValidator | ||
let slot_id = build_context.prepare_slot(schema_ref)?; | ||
let inner_val = T::build(schema_dict, config, build_context)?; | ||
let name = inner_val.get_name().to_string(); | ||
build_context.complete_slot(slot_id, inner_val)?; | ||
Ok(definitions::DefinitionRefValidator::from_id(slot_id, name)) | ||
} else { | ||
// ref is used, but only out side itself, we want to clone it everywhere it's used | ||
let validator = T::build(schema_dict, config, build_context)?; | ||
build_context.store_reusable(schema_ref, validator.clone()); | ||
Ok(validator) | ||
}; | ||
} | ||
let validator_id = build_context.get_ref_id(schema_ref); | ||
let inner_val = T::build(schema_dict, config, build_context)?; | ||
build_context.add_definition(validator_id, inner_val)?; | ||
return Ok(DefinitionRefValidator::from_id(validator_id).into()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine to me, but are we sure it's okay for core-schema's validator? If that's not effected, I'm fine to move it to pydantic.
The other option is to have another pass at reference counting in pydantic-core after this is merged?
@@ -13,7 +11,7 @@ def test_custom_ser(): | |||
) | |||
) | |||
assert s.to_python([1, 2, 3]) == ['1', '2', '3'] | |||
assert plain_repr(s).endswith('slots=[])') | |||
# assert plain_repr(s).endswith('slots=[])') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is testing that slots weren't used, I guess because we previously identified that the validator was only used once, so inlined it.
7ce340e
to
061b721
Compare
CodSpeed Performance ReportMerging #571 Summary
Benchmarks breakdown
|
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## main #571 +/- ##
==========================================
- Coverage 94.24% 94.14% -0.11%
==========================================
Files 95 95
Lines 12904 12877 -27
Branches 24 24
==========================================
- Hits 12162 12123 -39
- Misses 736 748 +12
Partials 6 6
Continue to review full report in Codecov by Sentry.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few small things, after that I think this is ready to merge.
src/definitions.rs
Outdated
// Get a ReferenceId for the given reference string. | ||
// This ReferenceId can later be used to retrieve a definition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Get a ReferenceId for the given reference string. | |
// This ReferenceId can later be used to retrieve a definition | |
/// Get a ReferenceId for the given reference string. | |
/// This ReferenceId can later be used to retrieve a definition |
and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://doc.rust-lang.org/reference/comments.html#doc-comments TIL, thank you!
// Get a ReferenceId for the given reference string. | ||
// This ReferenceId can later be used to retrieve a definition | ||
pub fn get_reference_id(&mut self, reference: &str) -> ReferenceId { | ||
let next_id = self.definitions.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be moved down to line 45?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not because we borrow definitions as mutable. I can do it here because usize will just get copied freeing up definitions to be borrowed
|
||
// Add a definition, returning the ReferenceId that maps to it | ||
pub fn add_definition(&mut self, reference: String, value: T) -> PyResult<ReferenceId> { | ||
let next_id = self.definitions.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem above, I don’t think we can, or at least I wasn’t able to; I tried
4cd975a
to
9d5ad36
Compare
This resolves the bugs in the tests and some others.