Skip to content

Rework def ref system to fix various bugs #571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 3, 2023
Merged

Rework def ref system to fix various bugs #571

merged 11 commits into from
May 3, 2023

Conversation

adriangb
Copy link
Member

@adriangb adriangb commented May 2, 2023

This resolves the bugs in the tests and some others.

@adriangb adriangb self-assigned this May 2, 2023
Comment on lines 14 to 34
// Mapping of reference strings to indexes in the definitions Vec
// As we recurse through the input when we find:
// - A definitions-reference (`{'type': 'definitions-reference', 'schema_ref': '...'}`)
// then we store that `schema_ref` -> `definitions.len()`
// so that the value is the index into definitions.
// We also set a None value in definitions.
// Then we create a DefinitionsReferenceValidator that stores the index (value) from this mapping.
// At runtime if it needs to call into the validator it simply looks it up by index.
// - Any schema with a `'ref'`: before recursing we store the ref here and set a None
// value in definitions.
// As we recurse if a definitions-reference to this
// Then after recursing and completing the CombinedValidator we replace that None value with the
// newly built validator. Now anything that
refs: AHashMap<String, usize>,

// Here we hold an *optional* reference to a definition validator
// Any time we encounter a schema with a `'ref'` we build a DefinitionsReferenceValidator that points
// to an index in this Vector
// At the end of building we check if any values are None, which indicates an unsatisfied reference
definitions: Vec<Option<T>>,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't too hard to add a basic ref-count system here so we know which references have more than one referrer and hence which can be inlined or not. But I got bogged down in implementation multiple times.

Comment on lines -166 to -183
fn extract_used_refs(schema: &PyAny, refs: &mut AHashSet<String>) -> PyResult<()> {
if let Ok(dict) = schema.downcast::<PyDict>() {
if is_definition_ref(dict)? {
refs.insert(dict.get_as_req(intern!(schema.py(), "schema_ref"))?);
} else {
for (key, value) in dict.iter() {
if !key.eq(intern!(schema.py(), "metadata"))? {
extract_used_refs(value, refs)?;
}
}
}
} else if let Ok(list) = schema.downcast::<PyList>() {
for item in list.iter() {
extract_used_refs(item, refs)?;
}
}
Ok(())
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a lot of the dragons were in here, and it was slow (hence the building schema speedup).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this is how we did the reference counting.

Comment on lines +102 to +116
return serializer.to_python(
py,
value,
extra.mode.to_object(py).extract(py)?,
include,
exclude,
extra.by_alias,
extra.exclude_unset,
extra.exclude_defaults,
extra.exclude_none,
extra.round_trip,
extra.warnings.is_active(),
extra.fallback,
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solves a bug where we were calling into the next serializer without any slots, so if it had any def refs it would crash.

Comment on lines 101 to 108
fn validate_assignment<'s, 'data: 's>(
&'s self,
py: Python<'data>,
obj: &'data PyAny,
field_name: &'data str,
field_value: &'data PyAny,
extra: &Extra,
slots: &'data [CombinedValidator],
recursion_guard: &'s mut RecursionGuard,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solves a bug where validate_assignment straight up didn't work with any def refs because they didn't implement it.

@@ -13,7 +11,7 @@ def test_custom_ser():
)
)
assert s.to_python([1, 2, 3]) == ['1', '2', '3']
assert plain_repr(s).endswith('slots=[])')
# assert plain_repr(s).endswith('slots=[])')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What to do with these tests? Not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is testing that slots weren't used, I guess because we previously identified that the validator was only used once, so inlined it.

Comment on lines 320 to 325
if build_context.ref_used(&schema_ref) {
// the ref is used somewhere
// check the ref is unique
if build_context.ref_already_used(&schema_ref) {
return py_err!("Duplicate ref: `{}`", schema_ref);
}

return if build_context.ref_used_within(schema_dict, &schema_ref)? {
// the ref is used within itself, so we have to store the validator in slots
// and return a DefinitionRefValidator
let slot_id = build_context.prepare_slot(schema_ref)?;
let inner_val = T::build(schema_dict, config, build_context)?;
let name = inner_val.get_name().to_string();
build_context.complete_slot(slot_id, inner_val)?;
Ok(definitions::DefinitionRefValidator::from_id(slot_id, name))
} else {
// ref is used, but only out side itself, we want to clone it everywhere it's used
let validator = T::build(schema_dict, config, build_context)?;
build_context.store_reusable(schema_ref, validator.clone());
Ok(validator)
};
}
let validator_id = build_context.get_ref_id(schema_ref);
let inner_val = T::build(schema_dict, config, build_context)?;
build_context.add_definition(validator_id, inner_val)?;
return Ok(DefinitionRefValidator::from_id(validator_id).into());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big question here is: what is the cost of this change? In theory, it's large. As you can see anything with a ref permanently becomes a DefinitionRefValidator, which also always use slots so they'll be "slow".

However this is not what needs to happen in practice: it would be quite easy to, in Python, remove any unused refs. Certainly much much easier than maintaining the complex system of refs and slots and whatnot that we had here. And in the end, it will be just as efficient because we can give pydantic-core the most efficient arrangement of references to begin with instead of trying to jump through hoops in Rust to get the borrow checker to accept circular references.
If we do that in Pydantic we can even enforce it here if we want to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine to me, but are we sure it's okay for core-schema's validator? If that's not effected, I'm fine to move it to pydantic.

The other option is to have another pass at reference counting in pydantic-core after this is merged?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, I think we can copy this stuff into pydantic-core: https://github.com/pydantic/pydantic/blob/de0cf4bca502aad9f7e747cae60298e654bc5a75/pydantic/_internal/_core_utils.py#L222-L423. At which point we're just moving some complex logic from Rust into Python. I don't think it was much faster in Rust to begin with (in fact, I think it may have been slower because it was recursing into Python objects more) and it's certainly easier to debug in Python. And we wouldn't need to do any fancy stuff in Rust.

That said, yes I see the possibility of coming back after this PR and adding back complexity to do the schema inlining / reference counting.

@adriangb adriangb requested review from samuelcolvin and dmontagu and removed request for samuelcolvin May 2, 2023 06:14
@adriangb adriangb marked this pull request as ready for review May 2, 2023 06:14
@adriangb
Copy link
Member Author

adriangb commented May 2, 2023

By the way, this is not based on the current main because I needed to do a lot of testing against pydantic proper and the current main is quite broken when used with pydantic 😞

@adriangb adriangb requested a review from samuelcolvin May 2, 2023 06:15
Copy link
Member

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall I think this change is good.

I'm marginally in favour of implementing reference counting in pydantic-core rather than changing pydantic, but:

  1. I can see both arguments
  2. I think it should be done after this PR.

Comment on lines 320 to 325
if build_context.ref_used(&schema_ref) {
// the ref is used somewhere
// check the ref is unique
if build_context.ref_already_used(&schema_ref) {
return py_err!("Duplicate ref: `{}`", schema_ref);
}

return if build_context.ref_used_within(schema_dict, &schema_ref)? {
// the ref is used within itself, so we have to store the validator in slots
// and return a DefinitionRefValidator
let slot_id = build_context.prepare_slot(schema_ref)?;
let inner_val = T::build(schema_dict, config, build_context)?;
let name = inner_val.get_name().to_string();
build_context.complete_slot(slot_id, inner_val)?;
Ok(definitions::DefinitionRefValidator::from_id(slot_id, name))
} else {
// ref is used, but only out side itself, we want to clone it everywhere it's used
let validator = T::build(schema_dict, config, build_context)?;
build_context.store_reusable(schema_ref, validator.clone());
Ok(validator)
};
}
let validator_id = build_context.get_ref_id(schema_ref);
let inner_val = T::build(schema_dict, config, build_context)?;
build_context.add_definition(validator_id, inner_val)?;
return Ok(DefinitionRefValidator::from_id(validator_id).into());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine to me, but are we sure it's okay for core-schema's validator? If that's not effected, I'm fine to move it to pydantic.

The other option is to have another pass at reference counting in pydantic-core after this is merged?

@@ -13,7 +11,7 @@ def test_custom_ser():
)
)
assert s.to_python([1, 2, 3]) == ['1', '2', '3']
assert plain_repr(s).endswith('slots=[])')
# assert plain_repr(s).endswith('slots=[])')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is testing that slots weren't used, I guess because we previously identified that the validator was only used once, so inlined it.

@adriangb adriangb force-pushed the rework-references branch 2 times, most recently from 7ce340e to 061b721 Compare May 2, 2023 18:55
@codspeed-hq
Copy link

codspeed-hq bot commented May 2, 2023

CodSpeed Performance Report

Merging #571 rework-references (04c2936) will not alter performances.

Summary

🔥 0 improvements
❌ 1 regressions
✅ 111 untouched benchmarks

🆕 0 new benchmarks
⁉️ 0 dropped benchmarks

Benchmarks breakdown

Benchmark main rework-references Change
test_definition_out_of_tree 28.3 ms 32.1 ms -13.38%

@codecov-commenter
Copy link

Codecov Report

Merging #571 (7b33040) into main (817af72) will decrease coverage by 0.11%.
The diff coverage is 91.82%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #571      +/-   ##
==========================================
- Coverage   94.24%   94.14%   -0.11%     
==========================================
  Files          95       95              
  Lines       12904    12877      -27     
  Branches       24       24              
==========================================
- Hits        12162    12123      -39     
- Misses        736      748      +12     
  Partials        6        6              
Impacted Files Coverage Δ
src/lib.rs 100.00% <ø> (ø)
src/validators/dataclass.rs 94.11% <66.66%> (+0.22%) ⬆️
src/validators/is_subclass.rs 80.48% <66.66%> (ø)
src/validators/url.rs 93.41% <66.66%> (ø)
src/validators/lax_or_strict.rs 80.76% <71.42%> (ø)
src/validators/chain.rs 85.07% <75.00%> (ø)
src/validators/custom_error.rs 87.32% <75.00%> (ø)
src/validators/function.rs 90.59% <76.92%> (ø)
src/validators/bytes.rs 90.24% <80.00%> (+1.21%) ⬆️
src/validators/json.rs 97.87% <80.00%> (ø)
... and 59 more

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 817af72...7b33040. Read the comment docs.

Copy link
Member

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few small things, after that I think this is ready to merge.

Comment on lines 35 to 48
// Get a ReferenceId for the given reference string.
// This ReferenceId can later be used to retrieve a definition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Get a ReferenceId for the given reference string.
// This ReferenceId can later be used to retrieve a definition
/// Get a ReferenceId for the given reference string.
/// This ReferenceId can later be used to retrieve a definition

and below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Get a ReferenceId for the given reference string.
// This ReferenceId can later be used to retrieve a definition
pub fn get_reference_id(&mut self, reference: &str) -> ReferenceId {
let next_id = self.definitions.len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved down to line 45?

Copy link
Member Author

@adriangb adriangb May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not because we borrow definitions as mutable. I can do it here because usize will just get copied freeing up definitions to be borrowed


// Add a definition, returning the ReferenceId that maps to it
pub fn add_definition(&mut self, reference: String, value: T) -> PyResult<ReferenceId> {
let next_id = self.definitions.len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem above, I don’t think we can, or at least I wasn’t able to; I tried

@adriangb adriangb force-pushed the rework-references branch from 4cd975a to 9d5ad36 Compare May 3, 2023 02:21
@adriangb adriangb enabled auto-merge (squash) May 3, 2023 02:23
@adriangb adriangb merged commit c655030 into main May 3, 2023
@adriangb adriangb deleted the rework-references branch May 3, 2023 04:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants