
Replace opendistro #197


Merged
sjpb merged 56 commits into main from fix/elasticsearch
Jan 26, 2023

Conversation

sjpb
Collaborator

@sjpb sjpb commented Jul 21, 2022

Ticket: https://stackhpc.atlassian.net/browse/DEV-855

OpenDistro is EOL.

This PR:

  • Replaces OpenDistro with OpenSearch.

  • Updates filebeat to the newest-supported version.

  • Adds the required version faking to enable filebeat.

  • Configures important opensearch settings for production use.

  • Updates Grafana version

  • Updates the opensearch Grafana datasource plugin definition

  • Removes the appliance's grafana-datasources role, as we can use Grafana's provisioning mode with the cloudalchemy.grafana role rather than requiring a customised API-based approach. <-- TODO CHECK: think this was merged already.

  • Adds a test in CI that expected jobs from the hpctests runs are found via Grafana (NB: for slurm-stats this has to be 5 mins past job completion, so may add some delay)

  • Changes the storage used for open{distro,search} from a podman volume to a host directory to enable easier future upgrades/migration/backups.

  • Adds a playbook ansible/adhoc/migrate-opendistro.yml to migrate opendistro data to opensearch (checked by upgrading a running cluster from main 7bcacb0)

  • Uses a new "prebuilt" image in arcus with the updated Grafana version (note actually CI was using this before this PR, so grafana's been getting downgraded during CI deployments)

  • Merge workaround for ohpc-base-compute dependency on singularity.

  • Update image build with correct grafana version, in a PR, and test that.

  • TODO: fix/document migration - currently it will always run if opendistro service even exists.

Once merged and passed:

  • [ ] Move appropriate image to release bucket

Closes #70.
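
The CI check listed above (expected hpctests jobs are found via Grafana, allowing for the slurm-stats delay after job completion) could be sketched as a polling loop. This is a minimal Python sketch under stated assumptions, not the check this PR implements: `fetch_job_names` is a hypothetical callable standing in for whatever query CI actually runs, and the timeout and interval defaults are illustrative.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_jobs(
    fetch_job_names: Callable[[], Iterable[str]],  # hypothetical query hook
    expected: Iterable[str],
    timeout: float = 600.0,   # assumed upper bound, covers a ~5 min delay
    interval: float = 30.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poll until every expected job name is visible.

    Returns the set of job names still missing when polling stops;
    an empty set means all expected jobs were found.
    """
    missing = set(expected)
    deadline = clock() + timeout
    while True:
        # Drop every job name that the data source now reports.
        missing -= set(fetch_job_names())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval)
```

Returning the missing names rather than a boolean lets a CI step print exactly which jobs never appeared.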

@sjpb sjpb marked this pull request as draft July 21, 2022 16:12
@sjpb sjpb mentioned this pull request Aug 8, 2022
11 tasks
Collaborator Author

sjpb commented Aug 9, 2022

It is leaving massive gaps between jobs:
[image: screenshot]

ETA: fixed by 3a25308

sjpb added 6 commits August 11, 2022 08:35
slurm-stats datasource was not getting the "database" (=index) set, hence
in opensearch which adds additional 'security-auditlog*' indices not present
in opendistro, the dashboard query was returning non-slurm-stats documents
without the fields expected => empty rows
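
The failure mode in this commit message can be illustrated with a small Python sketch. It is only a model of the behaviour described: the index names and document fields are made up for illustration, an unset "database" is modelled as matching every index, and `fnmatch` merely approximates index-pattern matching.

```python
from fnmatch import fnmatch
from typing import Dict, List, Optional

# Hypothetical documents keyed by index name; names and fields are illustrative only.
DOCS: Dict[str, List[dict]] = {
    "slurm-stats-example": [{"JobID": "1", "State": "COMPLETED"}],
    "security-auditlog-example": [{"audit_category": "AUTHENTICATED"}],
}


def query_rows(index_pattern: Optional[str], fields: List[str]) -> List[List[Optional[str]]]:
    """Return one row per document in matching indices; absent fields become None."""
    # Assumption: with no "database" (index) set, the query spans all indices.
    pattern = index_pattern or "*"
    rows = []
    for index, docs in DOCS.items():
        if fnmatch(index, pattern):
            rows.extend([doc.get(field) for field in fields] for doc in docs)
    return rows
```

With no pattern, the audit-log document contributes a row of empty values; restricting the pattern to `slurm-stats*` leaves only rows that carry the expected fields.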
Collaborator Author

sjpb commented Aug 15, 2022

3a25308 fails on checking that the hpctests jobs exist in grafana/opensearch. Turns out neither dashboard nor datasources have been provisioned in grafana after rebuilding control node with packer-built image (although grafana is running). Possibly this has always been broken, just not checked for until this PR.

direct configuration:

2022-08-11T12:48:19.5059996Z TASK [cloudalchemy.grafana : Create/Update datasources file (provisioning)] ****
2022-08-11T12:48:19.5062754Z task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/roles/cloudalchemy.grafana/tasks/datasources.yml:26
2022-08-11T12:48:23.6044928Z NOTIFIED HANDLER cloudalchemy.grafana : restart grafana for ci2839370468-control
2022-08-11T12:48:23.6047794Z changed: [ci2839370468-control] => {
2022-08-11T12:48:23.6048876Z     "changed": true,
2022-08-11T12:48:23.6050009Z     "checksum": "c3246446b05c316d4a9dfde60badfa068fd3936c",
2022-08-11T12:48:23.6051407Z     "dest": "/etc/grafana/provisioning/datasources/ansible.yml",
2022-08-11T12:48:23.6052598Z     "gid": 979,
2022-08-11T12:48:23.6053560Z     "group": "grafana",
2022-08-11T12:48:23.6054671Z     "md5sum": "ed1ebc1e5e73ffca64f835dea1145202",
2022-08-11T12:48:23.6055784Z     "mode": "0640",
2022-08-11T12:48:23.6056768Z     "owner": "root",
2022-08-11T12:48:23.6057824Z     "secontext": "system_u:object_r:etc_t:s0",
2022-08-11T12:48:23.6058901Z     "size": 569,
2022-08-11T12:48:23.6060242Z     "src": "/var/lib/rocky/.ansible/tmp/ansible-tmp-1660222100.111777-5114-195439064785952/source",
2022-08-11T12:48:23.6061512Z     "state": "file",
2022-08-11T12:48:23.6062449Z     "uid": 0
2022-08-11T12:48:23.6063363Z }

control image build:

2022-08-11T12:57:56.6329342Z     openstack.control: TASK [cloudalchemy.grafana : Create/Update datasources file (provisioning)] ****
2022-08-11T12:57:56.6339799Z     openstack.control: task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/roles/cloudalchemy.grafana/tasks/datasources.yml:26
2022-08-11T12:57:56.6530631Z     openstack.control: skipping: [default] => {
2022-08-11T12:57:56.6532331Z     openstack.control:     "changed": false,
2022-08-11T12:57:56.6533879Z     openstack.control:     "skip_reason": "Conditional result was False"
2022-08-11T12:57:56.6535269Z     openstack.control: }
2022-08-11T12:57:56.7564322Z     openstack.control:
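
Comparing the two excerpts by eye works for one task; for a whole run, a small script can list which tasks were skipped in one log but changed in the other. The following is a minimal Python sketch, assuming log lines shaped like the excerpts above (a `TASK [...] ****` header followed by a `changed:`, `skipping:`, `ok:`, `failed:` or `fatal:` result line). It records only the first result after each header, so it does not distinguish between hosts.

```python
import re
from typing import Dict, Iterable, Optional

# Matches the task header regardless of any timestamp or builder prefix.
TASK_RE = re.compile(r"TASK \[(?P<name>.+?)\] \*+")
# Matches result lines such as 'changed: [host]' but not '"changed": true'.
STATUS_RE = re.compile(r"\b(?P<status>ok|changed|skipping|failed|fatal): \[")


def task_statuses(lines: Iterable[str]) -> Dict[str, str]:
    """Map each task name to the first result status that follows its header."""
    statuses: Dict[str, str] = {}
    current: Optional[str] = None
    for line in lines:
        task = TASK_RE.search(line)
        if task:
            current = task.group("name")
            continue
        status = STATUS_RE.search(line)
        if status and current is not None and current not in statuses:
            statuses[current] = status.group("status")
    return statuses
```

Running `task_statuses` over each log and diffing the two dictionaries would surface every task where the image build took a different path from the direct configuration.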

Collaborator Author

sjpb commented Dec 20, 2022

Note certs have a hardcoded 2yr life.

Collaborator Author

sjpb commented Dec 21, 2022

@m-bull I tried using community.crypto.x509_certificate_info to extract validity and delete if necessary, but as podman chowns everything in certs/ the ansible loops/logic were just getting really messy. Put that on hold as not best use of effort at the moment.
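
For reference, the "delete if necessary" decision itself is small once the expiry date is available. The following is a minimal Python sketch of that decision only, not of the Ansible logic that was attempted: it assumes the expiry arrives as an ASN.1 TIME string such as `20241220120000Z`, and the 30-day margin is an illustrative choice.

```python
from datetime import datetime, timedelta, timezone


def parse_asn1_time(value: str) -> datetime:
    """Parse an ASN.1 TIME string such as '20241220120000Z' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def needs_regeneration(
    not_after: str,
    now: datetime,
    margin: timedelta = timedelta(days=30),  # assumed safety margin
) -> bool:
    """True when the certificate has expired or expires within `margin` of `now`."""
    return parse_asn1_time(not_after) - now <= margin
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise against dates just before and after the two-year lifetime.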

@sjpb sjpb marked this pull request as ready for review December 21, 2022 15:59
@sjpb sjpb requested a review from m-bull December 21, 2022 17:02
Collaborator Author

sjpb commented Jan 10, 2023

FIXED: that merge won't be right as we need an image using updated grafana etc.

@sjpb sjpb marked this pull request as draft January 20, 2023 09:36
@sjpb sjpb marked this pull request as ready for review January 24, 2023 11:14
m-bull previously approved these changes Jan 24, 2023
@sjpb sjpb merged commit bdeda03 into main Jan 26, 2023
@sjpb sjpb deleted the fix/elasticsearch branch January 26, 2023 10:56
antonycleave pushed a commit to eschercloudai/ansible-slurm-appliance that referenced this pull request Mar 30, 2023
Successfully merging this pull request may close these issues.

Versions of docker images not easily configurable