Skip to content

Add RegexSet functionality to C API #288

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 8, 2016
Merged

Add RegexSet functionality to C API #288

merged 7 commits into from
Nov 8, 2016

Conversation

tiehuis
Copy link
Contributor

@tiehuis tiehuis commented Oct 19, 2016

These functions implement a C interface to the RegexSet api.

Some notes:

  • These do not include start offsets as the standard regex functions
    do. The reason being is down to how these are implemented in the core
    regex crate. The RegexSet api does not expose a public is_match_at
    whilst the Regex api does. I would actually like to change this if possible
    as the inconsistency this makes between rure and rure_set types is
    not great.
  • This only tests a complete compile/match mainly for sanity. One or
    two more tests targeting the specific areas would be preferred.
  • Set matches take a mutable array to fill with results. This is more
    C-like and allows the caller to manage the memory on the stack if
    they want.

Wanted to make a request before targeting these changes to clarify.

These functions implement a C interface to the RegexSet api.

Some notes:
 * These do not include start offsets as the standard regex functions
   do. The reason being is down to how these are implemented in the core
   regex crate. The RegexSet api does not expose a public is_match_at
   whilst the Regex api does.

 * This only tests a complete compile/match mainly for sanity. One or
   two more tests targetting the specific areas would be preferred.

 * Set matches take a mutuable array to fill with results. This is more
   C-like and allows the caller to manage the memory on the stack if
   they want.
@@ -165,7 +172,7 @@ rure *rure_compile(const uint8_t *pattern, size_t length,
/*
* rure_free frees the given compiled regular expression.
*
* This must be called at most once.
* This must be called at most once for any rure.
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly more clear I think. rure_set_free should match this.


/*
* rure_set_matches compares each regex in the set against the haystack and
* returns an array of bools which correspond to if a match was found for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "to if a" -> "to a"

* caring which, use rure_set_is_match.
*/
bool rure_set_matches(rure_set *re, const uint8_t *haystack, size_t length,
const bool *matches);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be bool *matches instead of const bool *matches? (Since the data pointed to by matches will be modified.)

unsafe { Box::from_raw(re as *mut RegexSet); }
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to expose RegexSet::len here so that callers can use it to allocate matches.

@BurntSushi
Copy link
Member

These functions implement a C interface to the RegexSet api.

Thanks for doing this! At a first glance, this looks pretty good with a few nits.

So one thing that we should consider before merging this is how to incorporate rure_options into RegexSet compilation. We probably also need to think about flags as well.

The core problem you'll run into is that neither of these things is actually exposed in the regex crate proper.

Maybe we can come up with a straw man proposal for the C API and convince ourselves that it can be added later, so that we can merge this sooner.

These do not include start offsets as the standard regex functions
do. The reason being is down to how these are implemented in the core
regex crate. The RegexSet api does not expose a public is_match_at
whilst the Regex api does. I would actually like to change this if possible
as the inconsistency this makes between rure and rure_set types is
not great.

The inconsistency is in the regex crate as well.

The inconsistency exists for a reason. Getting the match offsets is hard.

This only tests a complete compile/match mainly for sanity. One or
two more tests targeting the specific areas would be preferred.

This is probably OK. Mostly, I think the C tests should just make sure that FFI is working. Testing the semantics of RegexSet is better suited for the regex crate proper. But I won't stop you from adding more tests. :-)

Set matches take a mutable array to fill with results. This is more
C-like and allows the caller to manage the memory on the stack if
they want.

Perfect.

@tiehuis
Copy link
Contributor Author

tiehuis commented Oct 20, 2016

So one thing that we should consider before merging this is how to incorporate rure_options into RegexSet compilation. We probably also need to think about flags as well.

The core problem you'll run into is that neither of these things is actually exposed in the regex crate proper.

Maybe we can come up with a straw man proposal for the C API and convince ourselves that it can be added later, so that we can merge this sooner.

I had a bit more of a look into RegexSet and would the following work for the actual crate implementation. Correct me if I've overlooked something:

  • Extend RegexBuilder with two new functions from_set and compile_set. from_set just takes a slice of patterns instead of a single. compile_set is similar but maps using From::RegexSet instead of From::Regex. This seems compatible since the types Regex and RegexSet just wrap an Exec type, with the bulk of the work already implemented within Exec.

New function names are just placeholders.

And actually, just that seems like it would be sufficient for the lack of options to me. This can easily be exposed to the C API. This would be a backwards compatible API change, so would adjust the versioning however you are doing it. A corresponding compile_must for the rure_set type may be of use then at this point, too.

The inconsistency is in the regex crate as well.

The inconsistency exists for a reason. Getting the match offsets is hard.

If that is the case and I can just omit the start argument from the match functions.

This is probably OK. Mostly, I think the C tests should just make sure that FFI is working. Testing the semantics of RegexSet is better suited for the regex crate proper. But I won't stop you from adding more tests. :-)

I'll just add a regex failure test for sanity.

Thanks for the quick response, let me know if the RegexBuilder modifications outline sounds fine and if so, I'll implement those changes hopefully tomorrow.

@tiehuis
Copy link
Contributor Author

tiehuis commented Oct 20, 2016

I noted previously that we would need some extra functions in the regex crate. This seems like an incompatible change however after reading the regex 1.0 RFC (at least in the short-term as you previously said). It looks like enough of the interface is public that we can just construct the RegexOptions ourselves and call use the ExecBuilder directly, though, so should still be feasible.

@BurntSushi
Copy link
Member

Sorry I'm dragging, but the past few days have been action packed (including a trip to the animal ER). Using RegexOptions and ExecBuilder is regrettable (they are technically exposed, but not really part of the public API), but I'm OK with it as long as there is a straight-forward way to expand the C API.

These options do need to get exposed in the regex crate proper though too. Your from_set and compile_set idea is one I've tossed around too, but my problem with it is that if you build with from_set, then it'd be nice to not call compile. I guess we could just return an error instead of ruling it out statically.

To be clear, I'm happy merging this PR without these options in the C API, as long as there's a reasonable path to adding them in the future.

@tiehuis
Copy link
Contributor Author

tiehuis commented Oct 21, 2016

Not a problem.

I'd like to get something thought out in that would minimize any breaking future api changes. I'd also like to avoid modifying the regex crate if possible right now.

With these goals in mind, here is what I suggest.

  • Add options to the c api regex set by using RegexOptions directly. This would be an implementation detail and can be adjusted to use a proper interface when one lands in the core crate.
  • Add a dummy start variable to the rure_set match functions. This is supported internally but no public way of acessing this is present. This unused field can be updated when support lands in the core crate. For the moment, it would require callers of this to use a 0 value to ensure future compatibility.

This should be flexible enough for future improvements without any breaking api changes.

Also, as an aside regarding the RegexBuilder and its compatibility with RegexSet. The cleanest way I see is to just create a seperate builder for a RegexSet itself. That is a topic for another time though.

I'll be offline till Monday but that day (which is thankfully a public holiday for me) I'll implement these and see how it all fits.

@BurntSushi
Copy link
Member

Sounds good. Two things:

  1. You'll need to add RegexOptions to the internal module, which is in src/lib.rs.
  2. You can implement the start functionality by using Exec and calling searcher.

In fact, I think both of these things means that you end up working around the use of RegexSet proper entirely.

@tiehuis
Copy link
Contributor Author

tiehuis commented Oct 25, 2016

Just a note that I've finished adding these features.

@BurntSushi BurntSushi merged commit 7968aad into rust-lang:master Nov 8, 2016
@BurntSushi
Copy link
Member

@tiehuis Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants