
Implement delete operation #1324


Merged: 4 commits merged into elastic:master on Mar 13, 2020

Conversation

vincentarnaud90
Contributor

Implement delete operation and tests. This is a duplicate of #1242, with testing added.

Thank you for submitting a pull request!

Please make sure you have signed our Contributor License Agreement (CLA).
We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code.
You only need to sign the CLA once.

@vincentarnaud90 vincentarnaud90 changed the title from "Implement delete operation (#1)" to "Implement delete operation" on Jul 22, 2019
@jbaiera jbaiera self-requested a review July 22, 2019 18:06
@vincentarnaud90
Contributor Author

Hello, is the pull request still missing something?

@jbaiera
Member

jbaiera commented Aug 8, 2019

@vincentarnaud90 Thanks for re-commenting on this PR, as it slipped through my todo list. I'll get a review up for it soon.

@jbaiera jbaiera left a comment


Sorry for the late review here. This is a larger lift than it looks, especially with our serialization code being as complicated as it is.

I left a few code style nitpicks, but most importantly I think that using the ValueWriter to throw out the document source on the delete operation might be the wrong way forward here. I think we should make a new implementation of BulkCommand and have AbstractBulkFactory allow subclasses to return a different command (in this case, all the factories other than Delete would use the existing logic in the super class)
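
Roughly, the shape being suggested is something like the sketch below. The interfaces here are simplified stand-ins for the real BulkFactory / BulkCommand types (the actual classes carry settings, templates, and metadata extractors), so treat it as an illustration of the hook, not connector code:

    // Simplified stand-ins for the real interfaces; illustration only.
    interface BulkCommand {
        String write(Object document);
    }

    interface BulkFactory {
        BulkCommand createBulk();
    }

    abstract class AbstractBulkFactory implements BulkFactory {
        @Override
        public BulkCommand createBulk() {
            // Existing behaviour shared by the index/create/update factories:
            // an action metadata line followed by the serialized document source.
            return document -> actionMetadata() + "\n" + serializeSource(document) + "\n";
        }

        protected abstract String actionMetadata();

        // Default source serialization used by everything except delete.
        protected String serializeSource(Object document) {
            return String.valueOf(document);
        }
    }

    class DeleteBulkFactory extends AbstractBulkFactory {
        @Override
        public BulkCommand createBulk() {
            // Delete-specific command: action metadata only, the source is never serialized.
            return document -> actionMetadata() + "\n";
        }

        @Override
        protected String actionMetadata() {
            return "{\"delete\":{}}";
        }
    }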

public static final class NoDataWriter implements ValueWriter<Object> {

    @Override
    public Result write(Object writable, Generator generator) {

It looks like it only supports Hadoop Writable objects here. For instance, Spark uses regular Java and Scala objects for data types.

Since integrations can be added that have different types than described here, it makes me wonder if this is really the right place to have the connector ignore the document content for the bulk request. I don't think a ValueWriter will have enough wisdom at runtime to make the decision on a call-by-call basis. Maybe there needs to be work done in order to have a new BulkCommand that is returned from the DeleteBulkFactory ignore the document instead.
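
For context on why the command level is the natural place for this: in the Elasticsearch _bulk format a delete entry is metadata-only, so the command can simply skip the source line rather than asking a ValueWriter to suppress it. A minimal illustration (index/id values made up):

    public class BulkFormatExample {
        public static void main(String[] args) {
            // An index operation in a _bulk body: action line followed by a source line.
            String indexEntry = "{\"index\":{\"_id\":\"1\"}}\n"
                    + "{\"title\":\"Some Book\"}\n";
            // A delete operation: action line only, there is no document source to send.
            String deleteEntry = "{\"delete\":{\"_id\":\"1\"}}\n";
            System.out.print(indexEntry);
            System.out.print(deleteEntry);
        }
    }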

@@ -221,6 +221,11 @@ public String getSerializerValueWriterClassName() {
        return getProperty(ES_SERIALIZATION_WRITER_VALUE_CLASS);
    }

    public Settings setSerializerValueWriterClassName(String className) {

Since the ValueWriter doesn't seem like the right place to ignore the document source for delete operations, this won't be needed.

@@ -253,10 +253,18 @@ public static void validateSettings(Settings settings) {
            Assert.isTrue(settings.getMappingExcludes().isEmpty(), "When writing data as JSON, the field exclusion feature is ignored. This is most likely not what the user intended. Bailing out...");
        }

        // check the configuration is coherent in order to use the delete operation
        if (ConfigurationOptions.ES_OPERATION_DELETE.equals(settings.getOperation())) {

👍 thanks for adding this
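
Only the opening line of that check appears in the hunk above; the idea is to fail fast when delete is requested without the settings it depends on. Roughly along these lines — the getMappingId() accessor and the message text are assumptions here, not necessarily what the PR actually checks:

    // Sketch only; the actual assertions in the PR may differ.
    if (ConfigurationOptions.ES_OPERATION_DELETE.equals(settings.getOperation())) {
        // Deletes are addressed purely by document id, so an id mapping must be configured.
        Assert.isTrue(settings.getMappingId() != null && !settings.getMappingId().isEmpty(),
                "When using the delete operation, a document id (es.mapping.id) must be specified. Bailing out...");
    }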


    public DeleteBulkFactory(Settings settings, MetadataExtractor metaExtractor, EsMajorVersion version) {
        // we only want a specific serializer for this particular bulk factory
        super(settings.copy().setSerializerValueWriterClassName(NoDataWriter.class.getName()), metaExtractor, version);

Also not required if we use a new implementation of BulkCommand

@vincentarnaud90
Contributor Author

Hello,
Thanks for the reply. From what I understand from your comment, you are asking me to do this:

    public class DeleteBulkFactory extends AbstractBulkFactory {

        @Override
        public BulkCommand createBulk() {
            // Inside this BulkCommand, I can ignore the document source.
            return new DeleteTemplateBulk();
        }
    }

Instead of overriding the createBulk method in the DeleteBulkFactory class, can I just add my logic inside the createBulk method of the AbstractBulkFactory class so that it returns the DeleteTemplateBulk when the appropriate settings are set?
The high-level code would look like this:

    public abstract class AbstractBulkFactory implements BulkFactory {
        @Override
        public BulkCommand createBulk() {
            // ... the createBulk code that is already written ...

            if (isDeleteOp) {
                // Inside this BulkCommand, I can ignore the document source.
                return new DeleteTemplateBulk();
            }
            // ... otherwise fall through to the existing command ...
        }
    }

@jbaiera
Member

jbaiera commented Sep 9, 2019

Instead of overriding the createBulk method in the DeleteBulkFactory class, can I just add my logic inside the createBulk method of the AbstractBulkFactory class so that it returns the DeleteTemplateBulk when the appropriate settings are set?

I think that should be fine considering it already detects if the operation is an update vs an index operation.

vincentarnaud90 and others added 4 commits January 8, 2020 10:15
Implement delete operation and tests
Revert changes from codestyle and generate data in CommandTest with loops
@vincentarnaud90
Contributor Author

Hello @jbaiera, sorry for the really late reply.
I have made the suggested changes. Can you have a look at them?
Furthermore, it seems that your CI is not working anymore.

@jbaiera
Member

jbaiera commented Mar 10, 2020

Thanks for putting in those changes! I'll run the full test suite on my own since TravisCI is busted.


@jbaiera jbaiera left a comment


LGTM! Thanks very much for your dedication on getting this in! I'll go ahead and merge it into master and backport it to the 7.x branch. It should be available in the 7.8.0 release when that lands.

@jbaiera jbaiera merged commit 41cfc5e into elastic:master Mar 13, 2020
jbaiera pushed a commit that referenced this pull request May 14, 2020
This PR adds support for delete operations. Document IDs must be present for each document for this to work. A new bulk command factory has been added which will render a delete operation with the id and no document body in the bulk output.
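
For anyone landing here later, enabling this from a job configuration should look roughly like the sketch below, assuming the connector's standard es.write.operation and es.mapping.id settings (the index name and id field are made up):

    import org.apache.hadoop.conf.Configuration;

    public class DeleteJobConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("es.resource", "library");        // made-up index name
            conf.set("es.write.operation", "delete");  // the operation added by this PR
            conf.set("es.mapping.id", "id");           // deletes are addressed by document id
            // ... hand conf to the usual es-hadoop job setup from here
        }
    }
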
@krish2code

krish2code commented Jun 23, 2020

Hi @jbaiera, not sure if this is the right place to post my question, but it's relevant to your closing comment. We are using Elasticsearch and need the delete feature. As you mentioned above, this feature was to be backported for 7.8.0, but that doesn't seem to be the case: delete is still an unknown operation on the 7.8 codeline. https://github.com/elastic/elasticsearch-hadoop/blob/7.8/mr/src/main/java/org/elasticsearch/hadoop/serialization/bulk/BulkCommands.java
Can you please take a look?

krish2code pushed a commit to krish2code/elasticsearch-hadoop that referenced this pull request Jun 24, 2020
This PR adds support for delete operations. Document IDs must be present for each document for this to work. A new bulk command factory has been added which will render a delete operation with the id and no document body in the bulk output.
@krish2code

krish2code commented Jun 24, 2020

@jbaiera As I mentioned in my comment above, we are using Elasticsearch 7.4.2 and it does not have delete support. I have created a PR against the 7.4.2 branch to backport the delete feature. Please review and let me know if I need to make any changes to the code or process, since this is my first commit here. #1492

krish2code pushed a commit to krish2code/elasticsearch-hadoop that referenced this pull request Jun 24, 2020
This PR adds support for delete operations. Document IDs must be present for each document for this to work. A new bulk command factory has been added which will render a delete operation with the id and no document body in the bulk output.
@jbaiera jbaiera added v7.9.0 and removed v7.8.0 labels Jun 26, 2020
@jbaiera
Member

jbaiera commented Jun 26, 2020

@krish2code This backport fell between the cracks and ended up not making it in for the 7.8 release train. It should be available in 7.9. Sorry for the confusion!

@satyaki-mallick

satyaki-mallick commented Jul 18, 2022

Hi, is there documentation on how to implement this?
And can one implement this using the Python Elasticsearch client?

@sinban04

@jbaiera

AFAIK, from searching the code in this repository (on the call path of the DataFrame sink), this PR is merged but is not actually referenced by the DataFrame sink.
I followed the code and found that when EsDataFrameWriter (via EsRDDWriter) sends bulk data to an ES index, it only uses the PUT REST API:

    Response response = execute(PUT, resource.bulk(), data);

https://github.com/elastic/elasticsearch-hadoop/blob/main/mr/src/main/java/org/elasticsearch/hadoop/rest/RestClient.java#L236

So far as I can tell, the delete operation is not yet available for RDDs or DataFrames. Is that right?
Is there any plan to support the delete operation here?
