
Use systemd for opendistro/kibana/filebeat #52


Merged: 9 commits merged into main, Mar 31, 2021

Conversation

@sjpb (Collaborator) commented Mar 24, 2021

NB: this PR is for main.

Fixes #45 - see discussion there.

Key aspects:

  • Use systemd to define and control the container services so that they start on boot.
  • We can't use systemd --user, because the systemd version in CentOS doesn't let us override resource limits for user services, so the containers run as root-managed system services with a User= directive etc.
  • In this configuration podman by default creates its tmpdir on /tmp (not a tmpfs) rather than /run (a tmpfs). Podman's tmpdir must be on a tmpfs so that podman can tell a reboot has happened, so the tmpfiles daemon is used to create suitable tmpdirs on /run with the right ownership and permissions; a rough sketch of both pieces follows below.
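As a rough illustration of the shape this takes (the unit name, container name, user and tmpdir path below are placeholders, not the actual templates added by this PR):

# /etc/systemd/system/opendistro.service -- illustrative sketch only
[Unit]
Description=opendistro container
Wants=network-online.target
After=network-online.target

[Service]
User=podman
Group=podman
# raising limits like these is exactly what a --user unit can't do on CentOS's systemd
LimitNOFILE=65536
LimitMEMLOCK=infinity
ExecStart=/usr/bin/podman start --attach opendistro
ExecStop=/usr/bin/podman stop -t 10 opendistro
Restart=always

[Install]
WantedBy=multi-user.target

# /etc/tmpfiles.d/podman.conf -- entry shown is illustrative; it recreates
# podman's tmpdir on the /run tmpfs at every boot
d /run/podman-tmp-podman 0770 podman podman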

@jovial (Collaborator) commented Mar 25, 2021

Kibana failing with:

Status code was 503 and not [200]: HTTP Error 503: Service Unavailable

https://github.com/stackhpc/openhpc-demo/pull/52/checks?check_run_id=2183262961#step:7:722

retrying...

@sjpb (Collaborator, author) commented Mar 26, 2021

> Kibana failing with:
>
> Status code was 503 and not [200]: HTTP Error 503: Service Unavailable
>
> https://github.com/stackhpc/openhpc-demo/pull/52/checks?check_run_id=2183262961#step:7:722
>
> retrying...

Having finally got vagrant working and squashed some other bugs, the issue is that elasticsearch is refusing connections:

[vagrant@testohpc-control-0 ~]$ curl 10.109.0.140:9200
curl: (7) Failed to connect to 10.109.0.140 port 9200: Connection refused

Is this something to do with networking inside of vagrant?

@jovial (Collaborator) commented Mar 26, 2021

>> Kibana failing with:
>>
>> Status code was 503 and not [200]: HTTP Error 503: Service Unavailable
>>
>> https://github.com/stackhpc/openhpc-demo/pull/52/checks?check_run_id=2183262961#step:7:722
>> retrying...
>
> Having finally got vagrant working and squashed some other bugs, the issue is that elasticsearch is refusing connections:
>
> [vagrant@testohpc-control-0 ~]$ curl 10.109.0.140:9200
> curl: (7) Failed to connect to 10.109.0.140 port 9200: Connection refused
>
> Is this something to do with networking inside of vagrant?

I've got a feeling elastic isn't starting up correctly. Do you see it listening on port 9200 on testohpc-control-0? Something like sudo ss -nlpt should do the trick for that. We'll need to check the elastic logs to see why it is unhappy.
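Concretely, something like this (run on testohpc-control-0; the container name "opendistro" is an assumption, adjust to whatever the container is actually called):

sudo ss -nlpt | grep 9200     # is anything bound to 9200, and on which address?
podman logs opendistro        # run as the user that owns the container; name assumed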

@sjpb (Collaborator, author) commented Mar 26, 2021

Sorry, I should have said: systemd shows opendistro running OK, no errors. ss shows:

[vagrant@testohpc-control-0 ~]$ sudo ss -nlpt
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
<snip>
LISTEN   0        128      *:9200               *:*                 users:(("exe",pid=39009,fd=14))
<snip>

@sjpb (Collaborator, author) commented Mar 26, 2021

I'm not sure the IP is right:

[podman@testohpc-control-0 ~]$ podman attach kibana
{"type":"log","@timestamp":"2021-03-26T12:13:28Z","tags":["error","elasticsearch","data"],"pid":1,"message":"[ConnectionError]: connect ECONNREFUSED 10.109.0.140:9200"}

but

[podman@testohpc-control-0 vagrant]$ curl -XGET --insecure https://localhost:9200 -u admin:"<password>"
{
  "name" : "opendistro",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "WUxldXjGRJSEGUZa5fyqtQ",
  "version" : {
    "number" : "7.10.0",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "51e9d6f22758d0374a0f3f5c6e8f3a7997850f96",
    "build_date" : "2020-11-09T21:30:33.964949Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

and

[podman@testohpc-control-0 vagrant]$ curl -XGET --insecure https://10.109.0.140:9200 -u admin:"<password>"
curl: (7) Failed to connect to 10.109.0.140 port 9200: Connection refused
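Two checks that might narrow this down (these are suggestions with an assumed container name, not output from this run): whether 10.109.0.140 is actually configured on this node, and what podman thinks it has published:

ip -br addr              # is 10.109.0.140 an address on this host at all?
podman port opendistro   # which host address/port mappings did podman publish? (name assumed)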

sjpb requested a review from jovial March 26, 2021 14:55
@jovial (Collaborator) left a comment:
Looking pretty good, nice effort.

@@ -15,6 +15,55 @@
tasks_from: config.yml
tags: config

- name: Define tmp directories on tmpfs
@jovial (Collaborator):

Not in reference to this line in particular, but could this fit better in the podman role?

@sjpb (Collaborator, author):

The problem is that the user we're running podman as isn't defined in the role, only at the appliance level. I agree it feels like this (along with the validation and podman_tmp_dir_root) should all really be in the podman role, so if you can see a way of achieving that, let me know.

owner: "{{ item.name }}"
group: "{{ item.name }}"
become: yes
loop: "{{ appliances_local_users_podman }}"
@jovial (Collaborator):

The other possibility is to make this code into a utility role that the podman roles all use, passing in the relevant user, e.g. opendistro_user, kibana_user, filebeat_user (see the sketch below). That would make the roles more usable in isolation. I don't think this is critical, mind...
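For example, the call site might look roughly like this (the role name and the podman_tmpdir_user variable are invented here to illustrate the idea; only opendistro_user comes from the comment above):

- name: Create podman tmp directories for this service's user
  include_role:
    name: podman_tmpdir                             # hypothetical utility role
  vars:
    podman_tmpdir_user: "{{ opendistro_user }}"     # per-role user passed in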

@sjpb (Collaborator, author):

That would be possible. I felt like this is all a bit of a hack to work around systemd/rhel/podman limits and interactions, so I'd hope it disappears entirely when either a) we have user services working or b) the podman patch to remove /tmp/containers-users-* files on reboot lands.

@@ -15,6 +15,55 @@
tasks_from: config.yml
tags: config

- name: Define tmp directories on tmpfs
blockinfile:
path: /etc/tmpfiles.d/podman.conf
@jovial (Collaborator):

Is it worth adding this to another file so that it is easier to remove?
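e.g. a dedicated drop-in holding only the generated entries, so removal is just deleting one file (filename and usernames below are illustrative only):

# /etc/tmpfiles.d/podman-tmpdirs.conf -- hypothetical dedicated file
d /run/podman-tmp-opendistro 0770 opendistro opendistro
d /run/podman-tmp-kibana 0770 kibana kibana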

@@ -2,6 +2,23 @@

# Fail early if configuration is invalid

- name: Validate podman configuration
@jovial (Collaborator):

Put in podman role?

@sjpb (Collaborator, author):

See comment above as to why all of this isn't in the role.

jovial closed this Mar 26, 2021
jovial reopened this Mar 26, 2021
@jovial (Collaborator) commented Mar 26, 2021

Closing and reopening to re-run pull_request workflow with latest code on main.

@@ -15,3 +15,60 @@

- name: reset ssh connection to allow user changes to affect 'current login user'
meta: reset_connection

- name: Ensure podman users exist
@jovial (Collaborator):

I still reckon we should only do this in one place and assume the users exist in this role, but as this is essentially a no-op at the cost of running a few extra tasks, it's probably not one to bike-shed over; the overall patch looks good to me.
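For reference, the duplicated task would be roughly of this shape (a sketch reusing the loop variable visible in the diff earlier in this review, not the exact task in the PR):

- name: Ensure podman users exist
  user:
    name: "{{ item.name }}"
  loop: "{{ appliances_local_users_podman }}"
  become: yes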

sjpb merged commit d4cfa54 into main Mar 31, 2021
sjpb deleted the fix/containers branch March 31, 2021 06:12
sjpb mentioned this pull request Jan 16, 2024
Successfully merging this pull request may close these issues:

opendistro role not robust to node reboots