Commit f38f547

Merge drive replacement related sections into one
1 parent 81148a7 commit f38f547

File tree

1 file changed: +53 −63 lines

doc/source/operations/ceph-management.rst

Lines changed: 53 additions & 63 deletions
@@ -89,11 +89,60 @@ And then remove the host from inventory (usually in
 Additional options/commands may be found in
 `Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_

-Replacing a Failed Ceph Drive
------------------------------
+Replacing a failing drive
+-------------------------

-Once an OSD has been identified as having a hardware failure,
-the affected drive will need to be replaced.
+A failing drive in a Ceph cluster will cause the OSD daemon to crash.
+In this case, Ceph will go into the `HEALTH_WARN` state.
+Ceph can report details about failed OSDs by running:
+
+.. code-block:: console
+
+   # From storage host
+   sudo cephadm shell
+   ceph health detail
+
+.. note::
+
+   Remember to run ceph/rbd commands from within ``cephadm shell``
+   (the preferred method) or after installing the Ceph client. Details are in
+   the official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
+   The host where commands are executed must also have the admin Ceph keyring
+   present - this is easiest to achieve by applying the
+   `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
+   label (Ceph MON servers have it by default when using the
+   `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
+
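If a host is missing the ``_admin`` label, it can be applied via the orchestrator. A minimal sketch, assuming a hypothetical host named ``storage-01``:

.. code-block:: console

   ceph orch host label add storage-01 _admin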
+A failed OSD will also be reported as down by running:
+
+.. code-block:: console
+
+   ceph osd tree
+
+Note the ID of the failed OSD.
+
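To see which host a given OSD lives on, the ID can be looked up directly. A minimal sketch, assuming a hypothetical OSD ID of ``7``; the output reports the OSD's CRUSH location, including the host it runs on:

.. code-block:: console

   ceph osd find 7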
+The failed disk is usually logged by the Linux kernel too:
+
+.. code-block:: console
+
+   # From storage host
+   dmesg -T
+
+Cross-reference the hardware device and OSD ID to ensure they match
+(using `pvs` and `lvs` may help make this connection).
+
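To tie an OSD ID back to a physical device, the LVM tags written by ceph-volume can be inspected on the storage host; each OSD's logical volume carries a ``ceph.osd_id`` tag. A minimal sketch:

.. code-block:: console

   # From storage host
   sudo pvs
   sudo lvs -o lv_name,vg_name,devices,lv_tags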
+See the upstream documentation:
+https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
+
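The upstream procedure boils down to removing the OSD with the ``--replace`` flag, which keeps its ID reserved for the new drive, then letting cephadm redeploy once the disk has been swapped. A minimal sketch, assuming a hypothetical OSD ID of ``7``:

.. code-block:: console

   ceph orch osd rm 7 --replace
   # Monitor the draining/removal progress
   ceph orch osd rm status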
+In the case where the disk holding the DB and/or WAL fails, it is necessary
+to recreate all OSDs that are associated with this disk - usually an NVMe
+drive. The following single command is sufficient to identify which OSDs are
+tied to which physical disks:
+
+.. code-block:: console
+
+   ceph device ls
+
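On larger clusters the listing can be narrowed to a single host or daemon. A minimal sketch, assuming a hypothetical host ``storage-01`` and OSD ``osd.7``:

.. code-block:: console

   ceph device ls-by-host storage-01
   ceph device ls-by-daemon osd.7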
+Once the OSDs on failed disks are identified, follow the procedure below.

 If rebooting a Ceph node, first set ``noout`` to prevent excess data
 movement:
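Setting and later clearing the flag uses the standard OSD flag commands; a minimal sketch:

.. code-block:: console

   ceph osd set noout
   # ... perform the reboot or maintenance ...
   ceph osd unset noout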
@@ -130,25 +179,6 @@ spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
 Either set ``unmanaged: true`` to stop cephadm from picking up new disks or
 modify it in some way that it no longer matches the drives you want to remove.
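One way to make that change is to export the live OSD spec, edit it, and re-apply it. A minimal sketch assuming the spec is adjusted directly through cephadm (``osd-spec.yml`` is an arbitrary file name); in a Kayobe-managed deployment the same edit would normally go through the ``cephadm_osd_spec`` variable instead:

.. code-block:: console

   ceph orch ls osd --export > osd-spec.yml
   # Edit osd-spec.yml (e.g. add "unmanaged: true"), then re-apply it
   ceph orch apply -i osd-spec.yml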

-
-Operations
-==========
-
-Replacing drive
----------------
-
-See upstream documentation:
-https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
-
-In case where disk holding DB and/or WAL fails, it is necessary to recreate
-(using replacement procedure above) all OSDs that are associated with this
-disk - usually NVMe drive. The following single command is sufficient to
-identify which OSDs are tied to which physical disks:
-
-.. code-block:: console
-
-   ceph device ls
-
 Host maintenance
 ----------------

@@ -163,46 +193,6 @@ https://docs.ceph.com/en/latest/cephadm/upgrade/
 Troubleshooting
 ===============

-Investigating a Failed Ceph Drive
----------------------------------
-
-A failing drive in a Ceph cluster will cause OSD daemon to crash.
-In this case Ceph will go into `HEALTH_WARN` state.
-Ceph can report details about failed OSDs by running:
-
-.. code-block:: console
-
-   ceph health detail
-
-.. note ::
-
-   Remember to run ceph/rbd commands from within ``cephadm shell``
-   (preferred method) or after installing Ceph client. Details in the
-   official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
-   It is also required that the host where commands are executed has admin
-   Ceph keyring present - easiest to achieve by applying
-   `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
-   label (Ceph MON servers have it by default when using
-   `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
-
-A failed OSD will also be reported as down by running:
-
-.. code-block:: console
-
-   ceph osd tree
-
-Note the ID of the failed OSD.
-
-The failed disk is usually logged by the Linux kernel too:
-
-.. code-block:: console
-
-   # From storage host
-   dmesg -T
-
-Cross-reference the hardware device and OSD ID to ensure they match.
-(Using `pvs` and `lvs` may help make this connection).
-
 Inspecting a Ceph Block Device for a VM
 ---------------------------------------
