Skip to content

Configure Elasticsearch _id dynamically from document #272

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 16, 2020

Conversation

jfschaff
Copy link
Contributor

@jfschaff jfschaff commented May 7, 2020

Implements #271

CL:

  • added optional method document_id(self, instance) to Document class to customize Elasticsearch document ids
  • documented that in sections fields

I wasn't sure where to put the doc, so don't hesitate to move it around it you think it shouldn't be in this section.

@@ -156,11 +156,19 @@ def parallel_bulk(self, actions, **kwargs):
# the result is currently not used upstream anyway.
return (1, [])

def document_id(self, object_instance):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the method name to get_document_id. document_id seems like a property from developer point of view.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point concerning the method name, but I have a feeling that get_document_id may be misleading, as this is not really a getter. It's more like a setter. It is basically the same as the prepare_<field name> methods, but for the document id instead of one of its fields.

I did not use the name prepare_document_id or prepare_id, as this would conflict with the methods that would set the document_id or id field inside the document.

Would you be ok with one of the following names?

  • set_id
  • set_document_id
  • determine_document_id
  • determine_id

Tell me which one you prefer and I'll do the changes.

And I'll add tests as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not setter as it is not setting any atribute to the object itself. the method is returning the id so from independent prespective, it is a getter. You can change it to something like generate_id, which maybe more explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right, it's not a setter either. Ok for generate_id then.

@safwanrahman
Copy link
Collaborator

@jfschaff Thanks a lot for the PR. I have added 2 improvements which I think are helpful. Let me know if you have concerns.

@@ -156,11 +156,19 @@ def parallel_bulk(self, actions, **kwargs):
# the result is currently not used upstream anyway.
return (1, [])

def generate_id(self, object_instance):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to have static method so it can be called directly without initiating the object. Like Document.generate_id(model_object)

@safwanrahman
Copy link
Collaborator

I will check the tests and add some more review.

@jfschaff
Copy link
Contributor Author

OK, then I wait for your review before I make the final fixes.

@safwanrahman
Copy link
Collaborator

@jfschaff The tests looks pretty good. Can you add a unittest in the tests_document.py? Which will test whether the generate_id is called while indexing as well as the generate_id is returning correct value by default.

@safwanrahman
Copy link
Collaborator

@jfschaff I was thinking about adding a configuration in the class Django named something like id_field_name which can be set to easily configure the Elastic ID without overwriting the generate_id method. So in the generate_id method, we can call getattr(obj, self.django.id_field_name) and return the value. What do you think?

It is not necessary to have this in this patch, we can implement that in separate PR. But are you interested to add that as well in this PR?

@jfschaff
Copy link
Contributor Author

jfschaff commented May 26, 2020

Hi, and thank you for your feedback.

I could not find a way to retrieve the ES _id of a document (what you wrote es_obj._id in your suggestion).

Therefore, the integration tests use try / except blocks to get a doc by its ES _id. The idea of using try / except is to "assert that the code does not error". I could not find another way to assert that (and it looks like assertDoesNotError is kinda missing, see e.g. this SO discussion). Doing the same without a try / except would error instead of failing, which would be ambiguous: we want the test to fail when the feature is broken, not to error.

As for the other points:

  • I turned the method into a class method, which makes sense. A static method would not work ;
  • I've tried to add unit tests, but I could not make it work. I wanted to check that generate_id is called but did not manage to. I'm not so used to the mock module... I've left the commented (broken) tests at the very end of test_documents.py if you want to have a look at it.

And yes, your suggestion about adding id_field_name is relevant. I did not think of that because it would not help for my use case. But can we first merge this PR before adding that? You can attribute that feature to me so I don't forget :-)

@nicholasamorim
Copy link

Are we able to merge this in?

@safwanrahman
Copy link
Collaborator

@nicholasamorim @jfschaff Sorry I could not manage time to fix the tests. I will look on this upcoming weekend for sure.

@safwanrahman
Copy link
Collaborator

@jfschaff Thanks for your patience. I have fixed the tests in my repo safwanrahman@4c88715d can you fetch and cherry-pick the commit and push this to your branch? Or you may mark the "allow edit from maintainers" so I can push the changes directly to your repo.

The test fix is quite interesting and complex. The method is not called untill the low level elasticsearch API execute it. So the tests was not picking up.

@jfschaff
Copy link
Contributor Author

@safwanrahman, I've just pushed your commit to my branch, but I don't see the tests running...

@safwanrahman
Copy link
Collaborator

Seems like the tests got passed! Going to merge this! Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants