Skip to content

New query condition to match all strings that starts with some other given strings #673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

eduardbosch
Copy link
Contributor

@eduardbosch eduardbosch commented May 25, 2017

Hi to all! 🎉

This is my first contribution to Parse, and I hope it could be useful for the community.

The problem

I need a way to perform efficient searches in my database. I have researched, but I haven't found any way to perform efficient partial searches. The only way to perform efficient searches with Parse I think is by doing full text match or start with string matching.

I've read an article about performing keyword partial case insensitive. This solution is based on searching strings starting with another given strings which is really efficient with MongoDB. http://jam.sg/blog/efficient-partial-keyword-searches/

The solution

My solution was to split the string field that I want to search, in different keywords. Then, I take each of this keyword and I create different keyword versions by striping the firsts characters of each keyword with a minimum of 3 characters.

Suppose that we have this collection:

{ name: "LöRem ípsÜM-dÓlôr" }

Then I create the keywords without diacritics (https://jsperf.com/diacritics/43) and in lowercase. To create the keywords, I split the original name field by white space, comas, dots, dashes, etc:

{
    name: "LöRem ípsÜM-dÓlôr",
    nameSearch: ["lorem", "ipsum", "dolor"]
}

Finally I create all the keywords representations for partial search by matchings strings that starts with given strings:

{
    name: "LöRem ípsÜM-dÓlôr",
    nameSearch: [
        "lorem", "orem", "rem",
        "ipsum", "psum", "sum",
        "dolor", "olor", "lor
    ]
}

Now, I create an index for the "nameSearch" field, and I can do partial matches really efficient with this query:

{ nameSearch: { $all: [ /^\Qrem\E/, /^\Qpsum\E/ ] }

This increases the database size a little bit, but I think it's really worth it because it allows really good partial text searches. This is a good approach for searching short names, but not for searching in large amount of words. That will be better with full text matching and containsAll condition I think.

But the problem comes that parse does not allow calling ParseQuery.containsAll with regular expressions. So here it comes my feature.

The feature

I've created a new query condition to match all strings that starts with the given strings. I think this is really useful to make the partial keyword matching explained above.

This new condition is called containsAllStartsWith. Basically is a combination of containsAll and startsWith conditions. Given an array of strings, they are converted to regular expressions to match at the beginning of strings like in startsWith and delivered to containsAll condition.

The problem comes that the SDK sends the query as a JSON like this:
{ nameSearch: { $all: [ { $regex: "^\Qrem\E"}, {$regex: "^\Qpsum\E" ] }

This command is not supported by MongoDB, because $ is not allowed inside the $all condition. So I've updated the Parse Server to convert from regular expression JSON representation to a regular expression itself. { $regex: "^\Qrem\E"} to /^\Qrem\E/.

So Parse Server converts from this:
{ nameSearch: { $all: [ { $regex: "^\Qrem\E"}, {$regex: "^\Qpsum\E" ] }
to this:
{ nameSearch: { $all: [ /^\Qrem\E/, /^\Qpsum\E/ ] }
which runs perfectly on MongoDB.

Application use

To use this kind of search in an application it's really easy.

Suppose the user enters a text in a search field
"lÒr Psû"

The application should take this string and follow the same process to remove the diacritics and split as when creating the keywords. That will result in this:
["lor", "psu"]

Finally, that array should be delivered to containsAllStartWith condition like this:

List<String> stringQueries = Arrays.asList("lor", "psu");
ParseQuery<> query = ...
query.containsAllStartsWith("nameSearch", stringQueries);

This call will end with a query in MongoDB like this:
{ nameSearch: { $all: [ /^\Qlor\E/, /^\Qpsu\E/ ] }

Thanks!

I'm happy to try to contribute to this library. I hope that this can be useful for some other people.

Please, let me know if you want to know anything else 😃

Other SDK contributions and necessary changes

Parse-SDK-JS: parse-community/Parse-SDK-JS#437
parse server: parse-community/parse-server#3864

@codecov
Copy link

codecov bot commented May 25, 2017

Codecov Report

Merging #673 into master will increase coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##             master    #673      +/-   ##
===========================================
+ Coverage     52.62%   52.9%   +0.28%     
+ Complexity     1680    1679       -1     
===========================================
  Files           131     131              
  Lines         10120   10159      +39     
  Branches       1409    1409              
===========================================
+ Hits           5326    5375      +49     
+ Misses         4353    4341      -12     
- Partials        441     443       +2
Impacted Files Coverage Δ Complexity Δ
Parse/src/main/java/com/parse/ParseQuery.java 78.13% <100%> (+4.2%) 86 <4> (+6) ⬆️
.../src/main/java/com/parse/ParseRESTPushCommand.java 88% <0%> (-0.89%) 8% <0%> (-1%)
Parse/src/main/java/com/parse/ParsePush.java 92.85% <0%> (-0.57%) 29% <0%> (-1%)
...src/main/java/com/parse/ParseRESTQueryCommand.java 100% <0%> (ø) 13% <0%> (ø) ⬇️
.../src/main/java/com/parse/ParseFieldOperations.java
...e/src/main/java/com/parse/ParseFieldOperation.java 33.33% <0%> (ø) 0% <0%> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0516d24...229f5d1. Read the comment docs.

@@ -1699,6 +1701,31 @@ public void getInBackground(final String objectId, final GetCallback<T> callback
* {@code ParseQuery}.
* <p/>
* This only works on keys whose values are {@link ParseObject}s or lists of {@link ParseObject}s.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this javadoc is wrong? Add a constraint ... to match another ParseQuery. Same for whereContainsAll

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so also when I copied the comment from containsAll, but was not sure to change it.

Thanks to comment this 👍 Someone could write a good documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you can remove/reword both to their real meaning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments updated.

* The values that will match.
* @return this, so you can chain this call.
*/
public ParseQuery<T> whereContainsAllStartsWith(String key, Collection<String> values) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestions for a different name? This seems obscure at first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the beginning I was thinking about containsAllStartingWith. Maybe this could be a better approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand what this does. Personally I think the cleanest and simplest wording is to keep whereStartsWith:

  • whereStartsWith(key, String prefix): check that value in key (must be String) starts with prefix
  • whereStartsWith(key, Collection<String> prefixes): check that values in key (must be Array) start with prefixes in the given collection (implicit: all of the constraints must be satisfied)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a good name, but I think it's good to have the All word to know that all strings in collection must match. But I like your option too.

Let's see what other prefer.

@@ -120,6 +122,12 @@ public Object encode(Object object) {
return ((ParseQuery.RelationConstraint) object).encode(this);
}

if (object instanceof Pattern) {
JSONObject json = new JSONObject();
json.put("$regex", object.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid Patterns and move this logic to ParseQuery just like all other $ constraints?

I mean using ParseQuery.addCondition(key, “$regex”, string). That way we keep consistencies and leave encoder untouched.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could be better to avoid update the ParseEncoder. I didn't researched a lot.

Now, I think could be better to update the constraintsAllStartsWith to:

  public ParseQuery<T> whereContainsAllStartsWith(String key, Collection<String> values) {
    checkIfRunning();

    ArrayList<KeyConstraints> startsWithConstraints = new ArrayList<>();
    for (String value : values) {
      KeyConstraints keyConstraints = new KeyConstraints();
      keyConstraints.put("$regex", buildStartsWithRegex(value));
      startsWithConstraints.add(keyConstraints);
    }

    return whereContainsAll(key, startsWithConstraints);
  }

This might auto compile as expected. If people prefer to revert the ParseEncoder changes, I'll update the PR 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense. Also since whereContainsAll is documented to work only with ParseObjects, you could call directly builder.addCondition(key, "$all", startsWithConstraints) and return this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the function to use KeyConstraints.

I keep using whereContsinaAll because was not restricted to ParseObjects. The docs were wrong.

@natario1
Copy link
Contributor

Left some comments but other than this, nice work IMO. Let’s see how the other PRs are received as well, I guess parse-server should be the first.

@eduardbosch
Copy link
Contributor Author

I've updated this section with correct links.

Other SDK contributions and necessary changes

Parse-SDK-JS: parse-community/Parse-SDK-JS#437
parse server: parse-community/parse-server#3864

@eduardbosch
Copy link
Contributor Author

@natario1 I've added containsAllStartingWith compatibility for offline search in LDS, but obviously, it's terribly slow. I think should be here only for compatibility.

The code could be optimized a little bit to avoid some regexp calls, but I think it will continue being so slow.

I think that for offline search, the best approach is to use Full Text matching (whereEqualsTo) or at most, partial text matching with (whereContains). Even using this approach, LDS is really slow for searching in thousands of records as parse loads all the records into memory (#279) and performs the matches from Java code.

I think that Parse it not really designed for offline search of efficient partial text matching 😞

@flovilmart
Copy link
Contributor

It’s more an issue with mongodb than Parse. Regular expressions are very costly

@eduardbosch
Copy link
Contributor Author

Well, maybe this could be improved by saving ParseObjects in LDS in a way that data could be queried without having to load all data into memory and perform the matches in plain Java.

Is SQLite LIKE expression more efficient than performing the matches with plain Java? I don't know. Or maybe use another offline data management approach like Realm that it's really efficient in that.

You know better Parse SDK than me, and how it's built, and I'm sure that will be really difficult to implement a more efficient LDS data management for a huge amount of data. But that will make working with thousands of ParseObjects in LDS great. Now is a little bit a pain, but it works 👍

Thanks for all your help as always!! 😃

@flovilmart
Copy link
Contributor

Sorry, I thought we were talking about regular expressions on the server, (too many threads). So to go back to the client, the LDS is a blob store (on iOS) and all objects are just stored by ID, so when running a query, everything from the LDS is loaded in memory which is ineffient AF.
The main issue is that SQLIte required proper schemas, which are not known on the client. What would be perhaps an avenue of optimization, would be extracting some keys in particular stores on which we could index.

@eduardbosch
Copy link
Contributor Author

Yeah, indexing will be great.

Yes, I've been looking deeply at Parse LDS SQLite db file and Parse offline store to know how data is saved and queried in LDS. I've seen that is saved as plain JSON, so the only way to perform queries right now is by loading into memory and perform the matching as it's done.

It's a pity that this makes a huge volume of data nearly unusable with some kind of queries in LDS. Maybe a good approach is to maintain another more efficient offline db that only saves the texts that could be searched with the associated ParseObject ids and its Class names so the search it's done outside of LDS and then we use the matching ParseObject ids to query into LDS.

@flovilmart
Copy link
Contributor

Yeah, or any list of keys that you wanna index for future searches. They could be pre-emptively set like: LDS.prepareStoreForQuery(storeName, query)

In this method we’d create a new table (if needed) with json objects as well as the columns extracted from the JSON. This way, you could use your query on that store and benefit from the indexing

@eduardbosch
Copy link
Contributor Author

Well, I think this approach will still be slow if there are thousands of records in LDS.prepareStoreForQuery(storeName, query), right?

One nice approach will be to use an offline db like realm as LDS, but as you said, you need to know the schema, but maybe the user can create the schema and then integrate it automatically. That will be great.

Now I've tested the approach to save objects with [String textForSearching, String parseObjectId, String parseClassName] into realm and then query for objects that contain a given text inside textForSearching in the realm db. Then, with the matching results, I create ParseObject.createWithoutData(parseClassName, parseObjectId) and I fetch from local fetchFromLocalDatastore() and the results are instant, compared to 19 seconds if I use ParseQuery.fromPin() for aproximatelly 5k objects.

@flovilmart
Copy link
Contributor

Rhat’s Pretty much what I suggested, using a separate properly indexed storage for enduring fast queries and then resolving the objects either because they are stored in the optimized storage, or on the main pin. Wether you use SQLite or realm is an implementation detail at that point, I’m sure you could do the same with both

@flovilmart
Copy link
Contributor

And the prepareQuery would basically lookup the schema required for running the query the most efficiently because it knows the schema right?

@eduardbosch
Copy link
Contributor Author

I don't understand your last question @flovilmart sorry 😅

Do you mean that prepareQuery looks the query constraints and builds a schema that matches these constraints and saves objects with that conditions, so later on, when user performs the same query, it's done in this new table and not like is done right now loading all objects and performing the matches in Java?

@flovilmart
Copy link
Contributor

Thst’s The idea! Then the storage need probably an abstraction later do we can plug realm or SQLite :)

@codecov-io
Copy link

codecov-io commented Jul 28, 2017

Codecov Report

Merging #673 into master will increase coverage by 0.14%.
The diff coverage is 86%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #673      +/-   ##
============================================
+ Coverage     54.09%   54.24%   +0.14%     
- Complexity     1682     1697      +15     
============================================
  Files           123      123              
  Lines          9736     9783      +47     
  Branches       1361     1371      +10     
============================================
+ Hits           5267     5307      +40     
- Misses         4040     4042       +2     
- Partials        429      434       +5
Impacted Files Coverage Δ Complexity Δ
Parse/src/main/java/com/parse/ParseQuery.java 75.18% <100%> (+0.37%) 87 <4> (+3) ⬆️
...rse/src/main/java/com/parse/OfflineQueryLogic.java 65.13% <82.92%> (+1.58%) 127 <9> (+12) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff3e27b...6ab5c59. Read the comment docs.

@eduardbosch eduardbosch force-pushed the feat/contains-all-starting-with branch from 229f5d1 to 7c8a423 Compare July 28, 2017 15:14
Eduard Bosch Bertran added 2 commits September 13, 2017 11:37
…tains-all-starting-with

# Conflicts:
#	Parse/src/main/java/com/parse/OfflineQueryLogic.java
@dplewis
Copy link
Member

dplewis commented May 19, 2018

@rogerhu This was added in the latest version of parse-server. How does it look?

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 58.694% when pulling 6ab5c59 on eduardbosch:feat/contains-all-starting-with into ff3e27b on parse-community:master.

@rogerhu
Copy link
Contributor

rogerhu commented May 31, 2018

Thanks for PR'ng. Merging..

@rogerhu rogerhu mentioned this pull request May 31, 2018
@rogerhu
Copy link
Contributor

rogerhu commented Jun 1, 2018

rebased #839

@rogerhu rogerhu closed this Jun 1, 2018
@eduardbosch
Copy link
Contributor Author

Great! Thanks for merging @rogerhu and all the others that helped address this PR 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants