
Commit a58037c

Add cephadm docs
1 parent: e532b26

5 files changed: 171 additions, 35 deletions

source/ceph_storage.rst

Lines changed: 9 additions & 0 deletions
@@ -21,10 +21,19 @@ Ceph Operations and Troubleshooting
 
 .. include:: include/ceph_troubleshooting.rst
 
+Working with Ceph deployment tool
+=================================
+
 .. ifconfig:: deployment['ceph_ansible']
 
    Ceph Ansible
    ============
 
    .. include:: include/ceph_ansible.rst
 
+.. ifconfig:: deployment['cephadm']
+
+   cephadm
+   =======
+
+   .. include:: include/cephadm.rst

source/data/deployment.yml

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@ ceph: true
 # Whether the Ceph deployment is done via Ceph-Ansible
 ceph_ansible: false
 
+# Whether the Ceph deployment is done via cephadm
+cephadm: true
+
 # Whether the Ceph deployment is managed by StackHPC
 ceph_managed: false

source/include/ceph_ansible.rst

Lines changed: 35 additions & 0 deletions
@@ -4,6 +4,41 @@ Making a Ceph-Ansible Checkout
 Invoking Ceph-Ansible
 =====================
 
+Removing a Failed Ceph Drive
+============================
+
+If a drive is verified dead, stop and eject the osd (eg. `osd.4`)
+from the cluster:
+
+.. code-block:: console
+
+   storage-0# systemctl stop ceph-osd@4.service
+   storage-0# systemctl disable ceph-osd@4.service
+   ceph# ceph osd out osd.4
+
+.. ifconfig:: deployment['ceph_ansible']
+
+   Before running Ceph-Ansible, also remove vestigial state directory
+   from `/var/lib/ceph/osd` for the purged OSD, for example for OSD ID 4:
+
+   .. code-block:: console
+
+      storage-0# rm -rf /var/lib/ceph/osd/ceph-4
+
+Remove Ceph OSD state for the old OSD, here OSD ID `4` (we will
+backfill all the data when we reintroduce the drive).
+
+.. code-block:: console
+
+   ceph# ceph osd purge --yes-i-really-mean-it 4
+
+Unset noout for osds when hardware maintenance has concluded - eg.
+while waiting for the replacement disk:
+
+.. code-block:: console
+
+   ceph# ceph osd unset noout
+
 Replacing a Failed Ceph Drive
 =============================
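The removal procedure added above can be sanity-checked before and after purging the OSD. A minimal check, using the same example OSD ID 4 and standard Ceph CLI commands:

.. code-block:: console

   ceph# ceph osd tree
   ceph# ceph health detail

After ``ceph osd purge``, the purged ID should no longer appear in the ``ceph osd tree`` output.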

source/include/ceph_troubleshooting.rst

Lines changed: 9 additions & 35 deletions
@@ -5,6 +5,15 @@ After deployment, when a drive fails it may cause OSD crashes in Ceph.
 If Ceph detects crashed OSDs, it will go into `HEALTH_WARN` state.
 Ceph can report details about failed OSDs by running:
 
+.. ifconfig:: deployment['cephadm']
+
+   .. note ::
+
+      Remember to run ceph/rbd commands after issuing ``cephadm shell`` or
+      installing ceph clients.
+      It is also important to run the commands on the hosts with _admin label
+      (Ceph monitors by default).
+
 .. code-block:: console
 
    ceph# ceph health detail
@@ -26,41 +35,6 @@ The failed hardware device is logged by the Linux kernel:
 Cross-reference the hardware device and OSD ID to ensure they match.
 (Using `pvs` and `lvs` may help make this connection).
 
-Removing a Failed Ceph Drive
-----------------------------
-
-If a drive is verified dead, stop and eject the osd (eg. `osd.4`)
-from the cluster:
-
-.. code-block:: console
-
-   storage-0# systemctl stop ceph-osd@4.service
-   storage-0# systemctl disable ceph-osd@4.service
-   ceph# ceph osd out osd.4
-
-.. ifconfig:: deployment['ceph_ansible']
-
-   Before running Ceph-Ansible, also remove vestigial state directory
-   from `/var/lib/ceph/osd` for the purged OSD, for example for OSD ID 4:
-
-   .. code-block:: console
-
-      storage-0# rm -rf /var/lib/ceph/osd/ceph-4
-
-Remove Ceph OSD state for the old OSD, here OSD ID `4` (we will
-backfill all the data when we reintroduce the drive).
-
-.. code-block:: console
-
-   ceph# ceph osd purge --yes-i-really-mean-it 4
-
-Unset noout for osds when hardware maintenance has concluded - eg.
-while waiting for the replacement disk:
-
-.. code-block:: console
-
-   ceph# ceph osd unset noout
-
 Inspecting a Ceph Block Device for a VM
 ---------------------------------------
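To illustrate the note added above, the health commands can be run either from an interactive ``cephadm shell`` on a host holding the admin keyring, or as a one-off invocation; a minimal sketch (the host prompt is illustrative):

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph health detail

   ceph# cephadm shell -- ceph health detail

The second form runs a single command inside the utility container without starting an interactive shell.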

source/include/cephadm.rst

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+cephadm configuration location
+==============================
+
+In the kayobe-config repository, cephadm configuration lives under
+``etc/kayobe/cephadm.yml`` (or in a specific Kayobe environment when using
+multiple environments, e.g. ``etc/kayobe/environments/production/cephadm.yml``).
+
+StackHPC's cephadm Ansible collection relies on multiple inventory groups:
+
+- ``mons``
+- ``mgrs``
+- ``osds``
+- ``rgws`` (optional)
+
+Those groups are usually defined in ``etc/kayobe/inventory/groups``.
+
+Running cephadm playbooks
+=========================
+
+In the kayobe-config repository, under ``etc/kayobe/ansible``, there is a set
+of cephadm-based playbooks utilising the stackhpc.cephadm Ansible Galaxy
+collection:
+
+- ``cephadm.yml`` - runs the end-to-end process, starting with deployment and
+  then defining EC profiles, crush rules, pools and users
+- ``cephadm-crush-rules.yml`` - defines Ceph crush rules
+- ``cephadm-deploy.yml`` - runs the bootstrap/deploy playbook without the
+  additional playbooks
+- ``cephadm-ec-profiles.yml`` - defines Ceph EC profiles
+- ``cephadm-gather-keys.yml`` - gathers Ceph configuration and keys and
+  populates kayobe-config
+- ``cephadm-keys.yml`` - defines Ceph users/keys
+- ``cephadm-pools.yml`` - defines Ceph pools
+
+Running Ceph commands
+=====================
+
+Ceph commands can be run via the ``cephadm shell`` utility container:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+
+This command will only be successful on ``mons`` group members (the admin key
+is copied only to those nodes).
+
+Adding a new storage node
+=========================
+
+Add the node to the respective group (e.g. osds) and run the
+``cephadm-deploy.yml`` playbook.
+
+.. note::
+   To add node types other than osds (mons, mgrs, etc.) you need to specify
+   ``-e cephadm_bootstrap=True`` on the playbook run.
+
+Removing a storage node
+=======================
+
+First drain the node:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch host drain <host>
+
+Once all daemons are removed, you can remove the host:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch host rm <host>
+
+Then remove the host from the inventory (usually in
+``etc/kayobe/inventory/overcloud``).
+
+Additional options/commands may be found in
+`Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_.
+
+Replacing a Failed Ceph Drive
+=============================
+
+Once an OSD has been identified as having a hardware failure,
+the affected drive will need to be replaced.
+
+If rebooting a Ceph node, first set ``noout`` to prevent excess data
+movement:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph osd set noout
+
+Reboot the node and replace the drive.
+
+Unset ``noout`` after the node is back online:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph osd unset noout
+
+Remove the OSD using the Ceph orchestrator command:
+
+.. code-block:: console
+
+   ceph# cephadm shell
+   ceph# ceph orch osd rm <ID> --replace
+
+After removing OSDs, if the drives the OSDs were deployed on once again become
+available, cephadm may automatically try to deploy more OSDs on these drives if
+they match an existing drivegroup spec.
+If this is not your desired action plan, it is best to modify the drivegroup
+spec beforehand (the ``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
+Either set ``unmanaged: true`` to stop cephadm from picking up new disks, or
+modify the spec so that it no longer matches the drives you want to remove.
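For reference, the playbooks listed above are normally run through Kayobe from the Ansible control host. A minimal sketch, assuming the usual ``kayobe-env`` setup in which ``$KAYOBE_CONFIG_PATH`` points at ``etc/kayobe`` (the prompt is illustrative):

.. code-block:: console

   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
   kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm.yml

As noted above, append ``-e cephadm_bootstrap=True`` when adding node types other than osds.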
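When adding or removing hosts as described above, progress can be checked from the orchestrator; a minimal sketch using standard ``ceph orch`` commands:

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch host ls
   ceph# ceph orch osd rm status

``ceph orch host ls`` should list a newly added host once ``cephadm-deploy.yml`` has run, and ``ceph orch osd rm status`` shows the progress of a drain or OSD removal.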
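For the drivegroup behaviour described in the last paragraph, the OSD service spec that cephadm is currently applying can be inspected from within ``cephadm shell`` (the spec itself should be changed via the ``cephadm_osd_spec`` variable mentioned above rather than edited directly):

.. code-block:: console

   ceph# cephadm shell
   ceph# ceph orch ls osd --export

The exported spec shows whether ``unmanaged: true`` is set and which device filters are in place, which helps confirm whether a reintroduced drive will be picked up automatically.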
