
Commit a58037c

Add cephadm docs
1 parent: e532b26

5 files changed: 171 additions, 35 deletions

source/ceph_storage.rst

Lines changed: 9 additions & 0 deletions
@@ -21,10 +21,19 @@ Ceph Operations and Troubleshooting
 
 .. include:: include/ceph_troubleshooting.rst
 
+Working with Ceph deployment tool
+=================================
+
 .. ifconfig:: deployment['ceph_ansible']
 
    Ceph Ansible
    ============
 
    .. include:: include/ceph_ansible.rst
 
+.. ifconfig:: deployment['cephadm']
+
+   cephadm
+   =======
+
+   .. include:: include/cephadm.rst

source/data/deployment.yml

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@ ceph: true
 # Whether the Ceph deployment is done via Ceph-Ansible
 ceph_ansible: false
 
+# Whether the Ceph deployment is done via cephadm
+cephadm: true
+
 # Whether the Ceph deployment is managed by StackHPC
 ceph_managed: false

source/include/ceph_ansible.rst

Lines changed: 35 additions & 0 deletions
@@ -4,6 +4,41 @@ Making a Ceph-Ansible Checkout
 Invoking Ceph-Ansible
 =====================
 
+Removing a Failed Ceph Drive
+============================
+
+If a drive is verified dead, stop and eject the osd (eg. `osd.4`)
+from the cluster:
+
+.. code-block:: console
+
+   storage-0# systemctl stop ceph-osd@4.service
+   storage-0# systemctl disable ceph-osd@4.service
+   ceph# ceph osd out osd.4
+
+.. ifconfig:: deployment['ceph_ansible']
+
+   Before running Ceph-Ansible, also remove vestigial state directory
+   from `/var/lib/ceph/osd` for the purged OSD, for example for OSD ID 4:
+
+   .. code-block:: console
+
+      storage-0# rm -rf /var/lib/ceph/osd/ceph-4
+
+Remove Ceph OSD state for the old OSD, here OSD ID `4` (we will
+backfill all the data when we reintroduce the drive).
+
+.. code-block:: console
+
+   ceph# ceph osd purge --yes-i-really-mean-it 4
+
+Unset noout for osds when hardware maintenance has concluded - eg.
+while waiting for the replacement disk:
+
+.. code-block:: console
+
+   ceph# ceph osd unset noout
+
 Replacing a Failed Ceph Drive
 =============================
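The removal procedure added above can be sanity-checked before and after purging the OSD. A minimal check, using the same example OSD ID 4 and standard Ceph CLI commands:

.. code-block:: console

   ceph# ceph osd tree
   ceph# ceph health detail

After ``ceph osd purge``, the purged ID should no longer appear in the ``ceph osd tree`` output.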

source/include/ceph_troubleshooting.rst

Lines changed: 9 additions & 35 deletions
@@ -5,6 +5,15 @@ After deployment, when a drive fails it may cause OSD crashes in Ceph.
 If Ceph detects crashed OSDs, it will go into `HEALTH_WARN` state.
 Ceph can report details about failed OSDs by running:
 
+.. ifconfig:: deployment['cephadm']
+
+   .. note ::
+
+      Remember to run ceph/rbd commands after issuing ``cephadm shell`` or
+      installing ceph clients.
+      It is also important to run the commands on the hosts with _admin label
+      (Ceph monitors by default).
+
 .. code-block:: console
 
    ceph# ceph health detail
@@ -26,41 +35,6 @@ The failed hardware device is logged by the Linux kernel:
 Cross-reference the hardware device and OSD ID to ensure they match.
 (Using `pvs` and `lvs` may help make this connection).
 
-Removing a Failed Ceph Drive
-----------------------------
-
-If a drive is verified dead, stop and eject the osd (eg. `osd.4`)
-from the cluster:
-
-.. code-block:: console
-
-   storage-0# systemctl stop ceph-osd@4.service
-   storage-0# systemctl disable ceph-osd@4.service
-   ceph# ceph osd out osd.4
-
-.. ifconfig:: deployment['ceph_ansible']
-
-   Before running Ceph-Ansible, also remove vestigial state directory
-   from `/var/lib/ceph/osd` for the purged OSD, for example for OSD ID 4:
-
-   .. code-block:: console
-
-      storage-0# rm -rf /var/lib/ceph/osd/ceph-4
-
-Remove Ceph OSD state for the old OSD, here OSD ID `4` (we will
-backfill all the data when we reintroduce the drive).
-
-.. code-block:: console
-
-   ceph# ceph osd purge --yes-i-really-mean-it 4
-
-Unset noout for osds when hardware maintenance has concluded - eg.
-while waiting for the replacement disk:
-
-.. code-block:: console
-
-   ceph# ceph osd unset noout
-
 Inspecting a Ceph Block Device for a VM
 ---------------------------------------
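To illustrate the note added above, the health commands can be run either from an interactive ``cephadm shell`` on a host holding the admin keyring, or as a one-off invocation; a minimal sketch (the host prompt is illustrative):

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph health detail

   ceph# cephadm shell -- ceph health detail

The second form runs a single command inside the utility container without starting an interactive shell.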

source/include/cephadm.rst

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+cephadm configuration location
+==============================
+
+In the kayobe-config repository, cephadm configuration lives under
+``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using
+multiple environments, e.g. ``etc/kayobe/environments/production/cephadm.yml``).
+
+StackHPC's cephadm Ansible collection relies on multiple inventory groups:
+
+- ``mons``
+- ``mgrs``
+- ``osds``
+- ``rgws`` (optional)
+
+Those groups are usually defined in ``etc/kayobe/inventory/groups``.
+
+Running cephadm playbooks
+=========================
+
+In the kayobe-config repository, under ``etc/kayobe/ansible``, there is a set
+of cephadm-based playbooks utilising the stackhpc.cephadm Ansible Galaxy
+collection:
+
+- ``cephadm.yml`` - runs the end-to-end process, starting with deployment and
+  then defining EC profiles, crush rules, pools and users
+- ``cephadm-crush-rules.yml`` - defines Ceph crush rules
+- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the
+  additional playbooks
+- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles
+- ``cephadm-gather-keys.yml`` - gathers Ceph configuration and keys and
+  populates kayobe-config
+- ``cephadm-keys.yml`` - defines Ceph users/keys
+- ``cephadm-pools.yml`` - defines Ceph pools
+
+Running Ceph commands
+=====================
+
+Ceph commands can be run via the ``cephadm shell`` utility container:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+
+This command will only be successful on ``mons`` group members (the admin key
+is copied only to those nodes).
+
+Adding a new storage node
+=========================
+
+Add the node to the respective group (e.g. osds) and run the
+``cephadm-deploy.yml`` playbook.
+
+.. note::
+   To add node types other than osds (mons, mgrs, etc.) you need to specify
+   ``-e cephadm_bootstrap=True`` on the playbook run.
+
+Removing a storage node
+=======================
+
+First drain the node:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch host drain <host>
+
+Once all daemons are removed, you can remove the host:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch host rm <host>
+
+Then remove the host from the inventory (usually in
+``etc/kayobe/inventory/overcloud``).
+
+Additional options/commands may be found in
+`Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_.
+
+Replacing a Failed Ceph Drive
+=============================
+
+Once an OSD has been identified as having a hardware failure,
+the affected drive will need to be replaced.
+
+If rebooting a Ceph node, first set ``noout`` to prevent excess data
+movement:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph osd set noout
+
+Reboot the node and replace the drive.
+
+Unset ``noout`` after the node is back online:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph osd unset noout
+
+Remove the OSD using the Ceph orchestrator command:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch osd rm <ID> --replace
+
+After removing OSDs, if the drives the OSDs were deployed on once again become
+available, cephadm may automatically try to deploy more OSDs on these drives if
+they match an existing drivegroup spec.
+If this is not your desired action plan, it is best to modify the drivegroup
+spec beforehand (the ``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
+Either set ``unmanaged: true`` to stop cephadm from picking up new disks, or
+modify the spec so that it no longer matches the drives you want to remove.
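For reference, the playbooks listed above are normally run through Kayobe from the Ansible control host. A minimal sketch, assuming the usual ``kayobe-env`` setup in which ``$KAYOBE_CONFIG_PATH`` points at ``etc/kayobe`` (the prompt is illustrative):

.. code-block:: console

   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm.yml

As noted above, append ``-e cephadm_bootstrap=True`` when adding node types other than osds.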
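When adding or removing hosts as described above, progress can be checked from the orchestrator; a minimal sketch using standard ``ceph orch`` commands:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch host ls
   ceph# ceph orch osd rm status

``ceph orch host ls`` should list a newly added host once ``cephadm-deploy.yml`` has run, and ``ceph orch osd rm status`` shows the progress of a drain or OSD removal.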
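For the drivegroup behaviour described in the last paragraph, the OSD service spec that cephadm is currently applying can be inspected from within ``cephadm shell`` (the spec itself should be changed via the ``cephadm_osd_spec`` variable mentioned above rather than edited directly):

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch ls osd --export

The exported spec shows whether ``unmanaged: true`` is set and which device filters are in place, which helps confirm whether a reintroduced drive will be picked up automatically.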
