@@ -89,11 +89,60 @@ And then remove the host from inventory (usually in
Additional options/commands may be found in
`Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_

- Replacing a Failed Ceph Drive
- -----------------------------
+ Replacing a failing drive
+ -------------------------

- Once an OSD has been identified as having a hardware failure,
- the affected drive will need to be replaced.
+ A failing drive in a Ceph cluster will cause the OSD daemon to crash.
+ In this case, Ceph will go into the ``HEALTH_WARN`` state.
+ Ceph can report details about failed OSDs by running:
+
+ .. code-block:: console
+
+    # From storage host
+    sudo cephadm shell
+    ceph health detail
+
+ .. note::
+
+    Remember to run ceph/rbd commands from within ``cephadm shell``
+    (preferred method) or after installing the Ceph client. Details are in
+    the official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
+    The host where commands are executed must also have the admin Ceph
+    keyring present - the easiest way to achieve this is to apply the
+    `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
+    label (Ceph MON servers have it by default when using the
+    `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
+
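+ If a host does not yet have the admin keyring, one way to get it there is to
+ apply the ``_admin`` label mentioned above. A minimal sketch, assuming a
+ storage host named ``storage-01`` (a placeholder - use your own hostname):
+
+ .. code-block:: console
+
+    # Run from a host that already has admin access (e.g. a MON);
+    # cephadm then distributes the admin keyring to storage-01
+    ceph orch host label add storage-01 _admin
+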
+ A failed OSD will also be reported as down by running:
+
+ .. code-block:: console
+
+    ceph osd tree
+
+ Note the ID of the failed OSD.
+
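+ To confirm which host and device that OSD maps to, its metadata can be
+ queried. A minimal sketch, assuming the failed OSD has ID ``11`` (substitute
+ the ID noted above):
+
+ .. code-block:: console
+
+    # Reports the hostname and the underlying device node for the OSD
+    ceph osd metadata 11
+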
+ The failed disk is usually logged by the Linux kernel too:
+
+ .. code-block:: console
+
+    # From storage host
+    dmesg -T
+
+ Cross-reference the hardware device and OSD ID to ensure they match.
+ (Using ``pvs`` and ``lvs`` may help make this connection).
+
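+ For example, each OSD created by ceph-volume is backed by an LVM logical
+ volume, so listing logical volumes together with their backing devices and
+ tags can tie an OSD ID to a physical disk. A minimal sketch using standard
+ LVM tooling:
+
+ .. code-block:: console
+
+    # From storage host: physical volumes, then logical volumes with their
+    # backing devices and the ceph.osd_id tags set by ceph-volume
+    sudo pvs
+    sudo lvs -o +devices,lv_tags
+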
+ For the replacement procedure itself, see the upstream documentation:
+ https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
+
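+ In outline, that workflow removes the OSD while preserving its ID so that the
+ replacement drive can reuse it. A minimal sketch, again assuming OSD ID
+ ``11``:
+
+ .. code-block:: console
+
+    # Drain and remove the OSD, keeping its ID free for the replacement drive
+    ceph orch osd rm 11 --replace
+
+    # Check progress of the removal
+    ceph orch osd rm status
+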
+ In the case where the disk holding the DB and/or WAL fails, it is necessary
+ to recreate all OSDs that are associated with this disk - usually an NVMe
+ drive. The following single command is sufficient to identify which OSDs are
+ tied to which physical disks:
+
+ .. code-block:: console
+
+    ceph device ls
+
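+ On a large cluster it may help to narrow the listing to a single host. A
+ minimal sketch (``storage-01`` is again a placeholder hostname):
+
+ .. code-block:: console
+
+    # List only the devices, and the daemons using them, on one host
+    ceph device ls-by-host storage-01
+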
+ Once OSDs on failed disks are identified, follow the procedure below.

If rebooting a Ceph node, first set ``noout`` to prevent excess data
movement:
@@ -130,25 +179,6 @@ spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
Either set ``unmanaged: true`` to stop cephadm from picking up new disks or
modify it in some way that it no longer matches the drives you want to remove.

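+ To review the OSD spec that cephadm is currently applying before changing it,
+ the active specification can be exported. A minimal sketch (for reference
+ only - the copy managed by Kayobe lives in ``etc/kayobe/cephadm.yml``):
+
+ .. code-block:: console
+
+    # Show the OSD service specification currently held by cephadm
+    ceph orch ls --service-type osd --export
+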
-
- Operations
- ==========
-
- Replacing drive
- ---------------
-
- See upstream documentation:
- https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
-
- In case where disk holding DB and/or WAL fails, it is necessary to recreate
- (using replacement procedure above) all OSDs that are associated with this
- disk - usually NVMe drive. The following single command is sufficient to
- identify which OSDs are tied to which physical disks:
-
- .. code-block:: console
-
- ceph device ls
-
Host maintenance
----------------
@@ -163,46 +193,6 @@ https://docs.ceph.com/en/latest/cephadm/upgrade/
Troubleshooting
===============

- Investigating a Failed Ceph Drive
- ---------------------------------
-
- A failing drive in a Ceph cluster will cause OSD daemon to crash.
- In this case Ceph will go into `HEALTH_WARN` state.
- Ceph can report details about failed OSDs by running:
-
- .. code-block:: console
-
- ceph health detail
-
- .. note::
-
- Remember to run ceph/rbd commands from within ``cephadm shell``
- (preferred method) or after installing Ceph client. Details in the
- official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
- It is also required that the host where commands are executed has admin
- Ceph keyring present - easiest to achieve by applying
- `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
- label (Ceph MON servers have it by default when using
- `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
-
- A failed OSD will also be reported as down by running:
-
- .. code-block:: console
-
- ceph osd tree
-
- Note the ID of the failed OSD.
-
- The failed disk is usually logged by the Linux kernel too:
-
- .. code-block:: console
-
- # From storage host
- dmesg -T
-
- Cross-reference the hardware device and OSD ID to ensure they match.
- (Using `pvs` and `lvs` may help make this connection).
-
Inspecting a Ceph Block Device for a VM
---------------------------------------