Skip to content

Recreate channel after reconnect. Message confirmation feature #418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 19 commits into from

Conversation

dbalabka
Copy link

@dbalabka dbalabka commented Dec 22, 2016

Fixed issue with missed channels after reconnect

Long running producer job may lead to following exception:

PHP Fatal error:  Uncaught exception 'PhpAmqpLib\Exception\AMQPProtocolConnectionException' with message 'CHANNEL_ERROR - expected 'channel.open'' in /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php:714
Stack trace:
#0 [internal function]: PhpAmqpLib\Connection\AbstractConnection->connection_close(Object(PhpAmqpLib\Wire\AMQPReader))
#1 /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(166): call_user_func(Array, Object(PhpAmqpLib\Wire\AMQPReader))
#2 /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Channel/AbstractChannel.php(338): PhpAmqpLib\Channel\AbstractChannel->dispatch('10,50', '\x01\xF8'CHANNEL_ERRO...', NULL)
#3 /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php(617): PhpAmqpLib\Channel\AbstractChannel->wait()
#4 /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib in /home/sites/dmitry.balabka/src/opcart_test/vendor/videlalvaro/php-amqplib/PhpAmqpLib/Connection/AbstractConnection.php on line 714

This exception appears right after job finish.
It happen in case when during job is sending messages catch some network issues and tried to reconnect. Use use reconnect like self-healing feature instead of just start job from scratch, because multiple messages may already consumed or/and restarting job may took few hours and it is inappropriate. Current reconnect implementation do not recreate channels. It leads to issue that messages are sending to no where and are not fall into queue. Also you will get above mentioned exception right after job finish process messages, because we can't close channels on destruct that do not exists.

We can simulate this case:

  1. Start job that sends messages and have self-healing feature that use reconnect on some network exception
  2. Open RabbbitMQ admin panel
  3. Goto Connections tab
  4. Open Active connection details
  5. Click Force close connection
  6. Than job will get following exception:
exception 'PhpAmqpLib\Exception\AMQPRuntimeException' with message 'Broken pipe or closed connection' in .../vendor/videlalvaro/php-amqplib/PhpAmqpLib/Wire/IO/StreamIO.php:174

AR: Job will continue to send messages, but messages will not fall into queue and will return exception at the end
ER: Queue size will contine grow

Implemented message confirmantion feature enabling for channels

Your RabbitMQ server may block connections. For example when qeueue is out of memory. If you use default producer settings it will lead that all your messages will be dropped. The main issue is that right after connection was blocked Rabbit isn't able to accept any messages, but publisher is still can publish and catch connection exception only after some time. So some messages will be dropped. Publisher job will sends messages until you will get followinge exception

Error sending data. Socket connection timed out

According to documentation rabbit sends notification to all clients that server supports connection blocking and you must be ready:
https://www.rabbitmq.com/connection-blocked.html

connection.blocked:	true

To catch this situation we must use special method:
https://github.com/php-amqplib/php-amqplib/blob/2bc16b9f0859357a26e6b253c4552b61071dfdd9/PhpAmqpLib/Helper/Protocol/Protocol091.php#L152
according to php-amqplib/php-amqplib and php-amqplib/RabbitMqBundle this method never used. So blocked connection case isn't covered by bundle and library.

Unfortunatly Rabbit by default do not garantee that some of messages will be delivered if you have network problem or Rabbit have blocked connections because of hardware resources issues. The only way how to garantee it is to use transations. But transactions are unnecessarily heavyweight and decrease throughput by a factor of 250. To remedy this, a confirmation mechanism was introduced. Confirmation is more lightweight.
Unfortunalty PHP isn't async and such message confirmation will lead to some overhead. But currently it is a only way how to garantee message delivery. Confirmation garantee that not only connection exists, but also channel and exchange that will consume message.

To reproduce this issue:

  1. Run long runnnig publisher job
  2. Set low memory limit using RabbitMQ console admin tool
PS C:\Program Files (x86)\RabbitMQ Server\rabbitmq_server-3.1.3\sbin> .\rabbitmqctl.bat set_vm_memory_high_watermark 0.2
Setting memory threshold on 'rabbit@DMITRYB-PC' to 0.2 ...
...done.
  1. Restore memory limit
    ER: job must catch exception and reconnect w/o lost any messages
    Please check messages count in queue

Сonclusions

To properly catch blocking connection reconnect must work properly and you must use message confirmation.

When connection is blocked it may happen that we send some messages twice and duplicates messages may appear in queue, because we can't confirm that last message before blocking exception appears isn't sent and we must send it once again.
This case may affect only small amount of messages, because in cases when Rabbit blocks connection it means that Rabbit not accepts any new connection as well so reconnect method will hung untill Rabbit outages will be solved. So we asume that only one duplicate message will be send to one customer in case when connection has been blocked.
I didn't found how this case may be handled.

@oleg-andreyev
Copy link

tests are failing

* see https://github.com/php-amqplib/php-amqplib/commit/2ccc97ca5b1229f9b12ea47fbab6c16fad26df41
* see https://github.com/php-amqplib/php-amqplib/commit/c87469ecbbf38fdc18688d3216c1e253d640ba32
*/
$this->conn->reconnect();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.6.3 is released

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need time to recheck why it was done

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry not fixed yet

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now fixed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleg-andreyev 2.6.3 and even 2.7.0-rc1 doesn't help :(

/*
* This file is part of the OpCart software.
*
* (c) 2015, OpticsPlanet, Inc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use proper header

* file that was distributed with this source code.
*/

namespace OldSound\RabbitMqBundle\Tests\RabbitMq;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old namespace

Copy link
Author

@dbalabka dbalabka Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

composer.json Outdated
@@ -31,7 +31,8 @@
},
"extra": {
"branch-alias": {
"dev-master": "1.10.x-dev"
"dev-master": "1.10.x-dev",
"dev-reconnect-fix-upstream": "1.13.x-dev"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it required?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was done for our needs. Have to be removed.

@skafandri
Copy link
Collaborator

Can you explain a bit what issue are you fixing? Use case, what exception/behaviour did you encountered, under which settings..
If you received an exception for example, including it here will be very helpful for everyone that woyld just copy/paste it and google it.

@dbalabka
Copy link
Author

dbalabka commented Dec 22, 2016 via email

@dbalabka dbalabka changed the title Recreate channel after reconnect and message confimration feature Recreate channel after reconnect and message confirmation feature Jan 12, 2017
@dbalabka dbalabka changed the title Recreate channel after reconnect and message confirmation feature Recreate channel after reconnect. Message confirmation feature Jan 12, 2017
@dbalabka
Copy link
Author

dbalabka commented Jan 12, 2017

@skafandri take a look on description

@skafandri
Copy link
Collaborator

This is interesting. Can you please verify your notes about php-amqplib 2.6.3 and add a new line at the end of Tests/RabbitMq/ProducerTest.php?

@dbalabka
Copy link
Author

dbalabka commented May 5, 2017

@skafandri I have update code with accordance of php-amqplib version 2.6.3. I have added version constraint into compose.json.

It will be good to add some integration tests using real RabbitMQ instance like php-amqplib have to cover reconnect and confirmation features.

Also it will be good to add my exaplanation how to handle server outages and missed messages into documentation of this bundle, because it is not simple things and developers must be aware.

@dbalabka
Copy link
Author

dbalabka commented May 5, 2017

@oleg-andreyev please take a look on PR

@dbalabka
Copy link
Author

dbalabka commented Jun 2, 2017

@skafandri what we have to do to merge this PR?

$this->ch = null;
} catch (\Exception $e) {
// ignore exception on Channel object destructor
// TODO: this workaround can be removed after php-amqplib will be updated till 2.6.3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could adjust composer.json to force version of the lib too: ^2.6.3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stloyd removed

*/
public function setWaitConfirmationTimeout($timeout)
{
if (!is_int($timeout) || $timeout < 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!$timeout || !is_int($timeout)) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stloyd not sure why this condition need to be changed. Please explain.
I have added few test cases for this this method test. By changing this condition, as you propose, it lead to two test cases fail:

  1. When timeout is zero
  2. Timeout is less then zero

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stloyd ping

composer.json Outdated
@@ -15,7 +15,7 @@
"symfony/config": "~2.3 || ~3.0",
"symfony/yaml": "~2.3 || ~3.0",
"symfony/console": "~2.3 || ~3.0",
"php-amqplib/php-amqplib": "~2.6",
"php-amqplib/php-amqplib": "~2.6 >=2.6.3",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, should be ^2.6.3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stloyd fixed

@chernecov
Copy link
Contributor

@torinaki @stloyd
Hello, can you share updates on this pull request?
We really need it to be merged.
Long running jobs with publishing messages fail when one RabbitMQ node restarts. (by puppet for example)

@dbalabka
Copy link
Author

dbalabka commented Sep 8, 2017

@stloyd, @skafandri we have started migration from videlalvaro/php-amqplib to version 2.6.3 of php-amqplib/php-amqplib and noticed that reconnect still doesn't work

My initial fix has addresed this issue in previous commit:
https://github.com/ecentria/RabbitMqBundle/blob/36cafbcbee5891df3b9aaef49b805a852fc166f8/RabbitMq/BaseAmqp.php#L89
and assumed that following PR must be merged:
php-amqplib/php-amqplib@2ccc97c
php-amqplib/php-amqplib@c87469e
I have missed that one of this PR was merge into 2.7, but not in 2.6, so composer deps are incorrect, but anyway reconnect doesn't work even with 2.7.0-rc1.

Currenlty we are investigating with @chernecov what must be done to make it work.

@@ -15,7 +15,7 @@
"symfony/config": "~2.3 || ~3.0",
"symfony/yaml": "~2.3 || ~3.0",
"symfony/console": "~2.3 || ~3.0",
"php-amqplib/php-amqplib": "~2.6",
"php-amqplib/php-amqplib": "^2.6.3",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconnect fix doesn't work even in 2.7

Note that originaly I have assumed that following PRs must be merged, but it is still not enought.
php-amqplib/php-amqplib@2ccc97c
php-amqplib/php-amqplib@c87469e

@dbalabka
Copy link
Author

@stloyd, @skafandri can you please review our conclusions?

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions
Copy link

This PR was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this Mar 30, 2021
@dbalabka
Copy link
Author

dbalabka commented Apr 1, 2021

There no many changes. I can update the PR and test with latest version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants