
Redirect partition queues *after* reassignment. #107


Status: Closed (1 commit, not merged)

Conversation

shlevy (Contributor) commented Sep 4, 2019

This matches the example given in https://gitter.im/edenhill/librdkafka/archives/2017/12/12?at=5a2fad9dcc1d527f6b20114d, and fixes duplicate message issues we've seen.

AlexeyRaga (Member)

I think I tried it this way a long time ago and it was problematic: librdkafka was able to pull some messages from Kafka and put them into the default queue after the assignment happened but before the redirection happened. Those messages were lost, since we only expect messages from the specific queue.

Do you have any evidence that it won't happen anymore?

AlexeyRaga (Member)

Ping Magnus @edenhill, can you advise whether redirecting the partition queue should happen before or after acknowledging the assignment? Can doing it after assignment cause missed or lost messages?

edenhill commented Sep 4, 2019

If you redirect after assign(), it means some messages may be forwarded to the single consumer queue. So either do the redirect before assign(), or do: assign(); pause(); redirect; resume()
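The four-step sequence above can be sketched in a rebalance callback using librdkafka's C API. This is a minimal sketch, not code from this PR: `my_queue` (the queue the application actually polls) is an assumed name passed via the opaque pointer, and error handling is elided. It will not do anything useful without a running broker; it only illustrates the ordering edenhill recommends.

```c
#include <librdkafka/rdkafka.h>

/* Sketch of edenhill's assign(); pause(); redirect; resume() ordering.
 * `opaque` is assumed to carry the queue we poll (an assumption for
 * this example, not part of the PR). */
static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
    rd_kafka_queue_t *my_queue = (rd_kafka_queue_t *)opaque;

    if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
        /* 1. Acknowledge the assignment first... */
        rd_kafka_assign(rk, partitions);

        /* 2. ...but immediately pause, so nothing can land on the
         *    default consumer queue while we redirect. */
        rd_kafka_pause_partitions(rk, partitions);

        /* 3. Forward each partition's queue to the queue we poll. */
        for (int i = 0; i < partitions->cnt; i++) {
            rd_kafka_topic_partition_t *p = &partitions->elems[i];
            rd_kafka_queue_t *pq =
                rd_kafka_queue_get_partition(rk, p->topic, p->partition);
            if (pq) {
                rd_kafka_queue_forward(pq, my_queue);
                rd_kafka_queue_destroy(pq);
            }
        }

        /* 4. Resume delivery now that redirection is in place. */
        rd_kafka_resume_partitions(rk, partitions);
    } else {
        /* Revoke: drop the current assignment. */
        rd_kafka_assign(rk, NULL);
    }
}
```

The pause in step 2 is what closes the race AlexeyRaga describes: between assign() and the per-partition rd_kafka_queue_forward() calls, fetched messages would otherwise be routed to the default consumer queue and never seen by a reader of the forwarded queue.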

AlexeyRaga (Member)

@edenhill thanks!

@shlevy I think we can't just do the redirect after assign. We can try assign, pause, redirect, resume, but since our consumer is asynchronous with respect to message polling, we need to be sure that nothing "bad" can happen between assign and pause.

But just accepting this PR as is would introduce a change that loses messages, which I don't think we want.

shlevy (Contributor, Author) commented Sep 16, 2019

Hi @AlexeyRaga! Unfortunately, we've had an extremely hard time reproducing our issue in a satisfactory way. We have a Haskell reproduction which shows very high rates of commit errors (largely, but not exclusively, "no offset to commit" right after a fresh poll!) without this change, and a significantly reduced (but not zero!) error rate with either this or #108 applied. We also have a C reproduction that I believe matches the Haskell logic exactly (except for pthreads vs. GHC's green threads), and yet it does not exhibit the issue even once under any configuration we tried.

The most successful configuration for us has been bumping to librdkafka 1.1.0 and using #108. Any ideas for how we can help make sure we're actually doing the right thing?

AlexeyRaga (Member)

Parking it for now

AlexeyRaga closed this Oct 21, 2019