| 1 | +====================== |
| 2 | +ioctl based interfaces |
| 3 | +====================== |
| 4 | + |
| 5 | +ioctl() is the most common way for applications to interface |
| 6 | +with device drivers. It is flexible and easily extended by adding new |
| 7 | +commands and can be passed through character devices, block devices as |
| 8 | +well as sockets and other special file descriptors. |
| 9 | + |
| 10 | +However, it is also very easy to get ioctl command definitions wrong, |
| 11 | +and hard to fix them later without breaking existing applications, |
| 12 | +so this documentation tries to help developers get it right. |
| 13 | + |
| 14 | +Command number definitions |
| 15 | +========================== |
| 16 | + |
| 17 | +The command number, or request number, is the second argument passed to |
| 18 | +the ioctl system call. While this can be any 32-bit number that uniquely |
| 19 | +identifies an action for a particular driver, there are a number of |
| 20 | +conventions around defining them. |
| 21 | + |
| 22 | +``include/uapi/asm-generic/ioctl.h`` provides four macros for defining |
| 23 | +ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, |
| 24 | +``_IOW``, and ``_IOWR``. These should be used for all new commands, |
| 25 | +with the correct parameters: |
| 26 | + |
| 27 | +_IO/_IOR/_IOW/_IOWR |
| 28 | + The macro name specifies how the argument will be used. It may be a |
| 29 | + pointer to data to be passed into the kernel (_IOW), out of the kernel |
| 30 | + (_IOR), or both (_IOWR). _IO can indicate either commands with no |
| 31 | + argument or those passing an integer value instead of a pointer. |
| 32 | + It is recommended to only use _IO for commands without arguments, |
| 33 | + and use pointers for passing data. |
| 34 | + |
| 35 | +type |
| 36 | + An 8-bit number, often a character literal, specific to a subsystem |
| 37 | + or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number`.
| 38 | + |
| 39 | +nr |
| 40 | + An 8-bit number identifying the specific command, unique for a given
| 41 | + value of 'type' |
| 42 | + |
| 43 | +data_type |
| 44 | + The name of the data type pointed to by the argument. The command number
| 45 | + encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer,
| 46 | + leading to a limit of 8191 bytes for the maximum size of the argument.
| 47 | + Note: do not pass sizeof(data_type) into _IOR/_IOW/_IOWR, as that
| 48 | + will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t).
| 49 | + _IO does not have a data_type parameter.
| 50 | + |
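As an illustration of the macros described above, the following user-space sketch defines four commands for a hypothetical driver. The names ``struct foo_config``, ``FOO_IOC_MAGIC`` and the ``FOO_IOC_*`` commands are invented for this example, and the type value ``'F'`` is not a registered allocation. The last definition shows the ``sizeof()`` mistake mentioned in the note.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical argument structure; all members are naturally aligned. */
struct foo_config {
	__u32 mode;
	__u32 flags;
	__u64 buffer_size;
};

/* 'F' is an assumption for this sketch, not a registered type value. */
#define FOO_IOC_MAGIC		'F'

#define FOO_IOC_RESET		_IO(FOO_IOC_MAGIC, 0)
#define FOO_IOC_GET_CONFIG	_IOR(FOO_IOC_MAGIC, 1, struct foo_config)
#define FOO_IOC_SET_CONFIG	_IOW(FOO_IOC_MAGIC, 2, struct foo_config)
#define FOO_IOC_XFER		_IOWR(FOO_IOC_MAGIC, 3, struct foo_config)

/* Wrong: this encodes sizeof(size_t), not sizeof(struct foo_config). */
#define FOO_IOC_BAD		_IOW(FOO_IOC_MAGIC, 4, sizeof(struct foo_config))

/* The size field of a correct definition matches the argument type. */
static_assert(_IOC_SIZE(FOO_IOC_GET_CONFIG) == sizeof(struct foo_config),
	      "command must encode the size of its argument");
```

The ``_IOC_DIR()``, ``_IOC_TYPE()``, ``_IOC_NR()`` and ``_IOC_SIZE()`` helpers from the same header can be used to decode a command number when checking a definition.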
| 51 | + |
| 52 | +Interface versions |
| 53 | +================== |
| 54 | + |
| 55 | +Some subsystems use version numbers in data structures to overload |
| 56 | +commands with different interpretations of the argument. |
| 57 | + |
| 58 | +This is generally a bad idea, since changes to existing commands tend |
| 59 | +to break existing applications. |
| 60 | + |
| 61 | +A better approach is to add a new ioctl command with a new number. The |
| 62 | +old command still needs to be implemented in the kernel for compatibility, |
| 63 | +but this can be a wrapper around the new implementation. |
| 64 | + |
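The wrapper approach can be sketched roughly as follows. This is a plain C model rather than real driver code: it assumes the arguments have already been copied from user space, and every name in it (``foo_params_v1``, ``foo_params_v2``, ``FOO_IOC_SET_PARAMS``, ``foo_do_ioctl()`` and so on) is hypothetical. The default of one channel for the old command is likewise an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Original argument layout (hypothetical). */
struct foo_params_v1 {
	__u32 rate;
	__u32 reserved;
};

/* Extended layout, introduced together with a new command number. */
struct foo_params_v2 {
	__u32 rate;
	__u32 channels;
	__u64 flags;
};

#define FOO_IOC_SET_PARAMS	_IOW('F', 0x10, struct foo_params_v1)
#define FOO_IOC_SET_PARAMS_V2	_IOW('F', 0x11, struct foo_params_v2)

static_assert(FOO_IOC_SET_PARAMS != FOO_IOC_SET_PARAMS_V2,
	      "a new layout needs a new command number");

/* Stand-in for driver state. */
static struct foo_params_v2 foo_state;

/* New implementation: the only place that applies parameters. */
static long foo_set_params_v2(const struct foo_params_v2 *p)
{
	if (p->rate == 0)
		return -EINVAL;
	foo_state = *p;
	return 0;
}

/* Old command kept for compatibility as a wrapper around the new one. */
static long foo_set_params_v1(const struct foo_params_v1 *p)
{
	struct foo_params_v2 p2 = {
		.rate = p->rate,
		.channels = 1,	/* assumed default for old callers */
		.flags = 0,
	};

	return foo_set_params_v2(&p2);
}

/* Dispatcher; 'arg' points to data already copied from user space. */
static long foo_do_ioctl(unsigned int cmd, const void *arg)
{
	switch (cmd) {
	case FOO_IOC_SET_PARAMS:
		return foo_set_params_v1(arg);
	case FOO_IOC_SET_PARAMS_V2:
		return foo_set_params_v2(arg);
	default:
		return -ENOTTY;	/* unknown command */
	}
}
```

Existing applications keep using the old command unchanged, while the behaviour lives in a single function.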
| 65 | +Return code |
| 66 | +=========== |
| 67 | + |
| 68 | +ioctl commands can return negative error codes as documented in errno(3); |
| 69 | +these get turned into errno values in user space. On success, the return |
| 70 | +code should be zero. It is also possible but not recommended to return |
| 71 | +a positive 'long' value. |
| 72 | + |
| 73 | +When the ioctl callback is called with an unknown command number, the |
| 74 | +handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in |
| 75 | +-ENOTTY being returned from the system call. Some subsystems return |
| 76 | +-ENOSYS or -EINVAL here for historic reasons, but this is wrong. |
| 77 | + |
| 78 | +Prior to Linux 5.5, compat_ioctl handlers were required to return |
| 79 | +-ENOIOCTLCMD in order to use the fallback conversion into native |
| 80 | +commands. As all subsystems are now responsible for handling compat |
| 81 | +mode themselves, this is no longer needed, but it may be important to |
| 82 | +consider when backporting bug fixes to older kernels. |
| 83 | + |
| 84 | +Timestamps |
| 85 | +========== |
| 86 | + |
| 87 | +Traditionally, timestamps and timeout values are passed as ``struct |
| 88 | +timespec`` or ``struct timeval``, but these are problematic because of |
| 89 | +incompatible definitions of these structures in user space after the |
| 90 | +move to 64-bit time_t. |
| 91 | + |
| 92 | +The ``struct __kernel_timespec`` type can be used instead to be embedded |
| 93 | +in other data structures when separate second/nanosecond values are |
| 94 | +desired, or passed to user space directly. This is still not ideal though, |
| 95 | +as the structure matches neither the kernel's timespec64 nor the user |
| 96 | +space timespec exactly. The get_timespec64() and put_timespec64() helper |
| 97 | +functions can be used to ensure that the layout remains compatible with |
| 98 | +user space and the padding is treated correctly. |
| 99 | + |
| 100 | +As it is cheap to convert seconds to nanoseconds, but the opposite |
| 101 | +requires an expensive 64-bit division, a plain __u64 nanosecond value
| 102 | +can be simpler and more efficient. |
| 103 | + |
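The cost difference between the two directions can be seen in a small user-space sketch. The helper names ``foo_ts_to_ns()`` and ``foo_ns_to_ts()`` and the ``FOO_NSEC_PER_SEC`` constant are made up for this example, and the conversion assumes non-negative timestamps that fit into 64 bits of nanoseconds.

```c
#include <assert.h>
#include <linux/time_types.h>
#include <linux/types.h>

#define FOO_NSEC_PER_SEC	1000000000ULL

/* Both members are 64-bit, so the layout does not depend on time_t. */
static_assert(sizeof(struct __kernel_timespec) == 16,
	      "same size for 32-bit and 64-bit user space");

/* Seconds/nanoseconds to nanoseconds: one multiplication and one add. */
static inline __u64 foo_ts_to_ns(const struct __kernel_timespec *ts)
{
	return (__u64)ts->tv_sec * FOO_NSEC_PER_SEC + (__u64)ts->tv_nsec;
}

/* The opposite direction needs a 64-bit division and remainder. */
static inline struct __kernel_timespec foo_ns_to_ts(__u64 ns)
{
	struct __kernel_timespec ts = {
		.tv_sec = (long long)(ns / FOO_NSEC_PER_SEC),
		.tv_nsec = (long long)(ns % FOO_NSEC_PER_SEC),
	};

	return ts;
}
```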
| 104 | +Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, |
| 105 | +as returned by ktime_get_ns() or ktime_get_ts64(). Unlike |
| 106 | +CLOCK_REALTIME, this makes the timestamps immune from jumping backwards |
| 107 | +or forwards due to leap second adjustments and clock_settime() calls. |
| 108 | + |
| 109 | +ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that |
| 110 | +need to be persistent across a reboot or between multiple machines. |
| 111 | + |
| 112 | +32-bit compat mode |
| 113 | +================== |
| 114 | + |
| 115 | +In order to support 32-bit user space running on a 64-bit machine, each |
| 116 | +subsystem or driver that implements an ioctl callback handler must also |
| 117 | +implement the corresponding compat_ioctl handler. |
| 118 | + |
| 119 | +As long as all the rules for data structures are followed, this is as |
| 120 | +easy as setting the .compat_ioctl pointer to a helper function such as |
| 121 | +compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). |
| 122 | + |
| 123 | +compat_ptr() |
| 124 | +------------ |
| 125 | + |
| 126 | +On the s390 architecture, 31-bit user space has ambiguous representations |
| 127 | +for data pointers, with the upper bit being ignored. When running such |
| 128 | +a process in compat mode, the compat_ptr() helper must be used to |
| 129 | +clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit |
| 130 | +pointer. On other architectures, this macro only performs a cast to a |
| 131 | +``void __user *`` pointer. |
| 132 | + |
| 133 | +In a compat_ioctl() callback, the last argument is an unsigned long,
| 134 | +which can be interpreted as either a pointer or a scalar depending on |
| 135 | +the command. If it is a scalar, then compat_ptr() must not be used, to |
| 136 | +ensure that the 64-bit kernel behaves the same way as a 32-bit kernel |
| 137 | +for arguments with the upper bit set. |
| 138 | + |
| 139 | +The compat_ptr_ioctl() helper can be used in place of a custom |
| 140 | +compat_ioctl file operation for drivers that only take arguments that |
| 141 | +are pointers to compatible data structures. |
| 142 | + |
| 143 | +Structure layout |
| 144 | +---------------- |
| 145 | + |
| 146 | +Compatible data structures have the same layout on all architectures, |
| 147 | +avoiding all problematic members: |
| 148 | + |
| 149 | +* ``long`` and ``unsigned long`` are the size of a register, so |
| 150 | + they can be either 32-bit or 64-bit wide and cannot be used in portable |
| 151 | + data structures. Fixed-length replacements are ``__s32``, ``__u32``, |
| 152 | + ``__s64`` and ``__u64``. |
| 153 | + |
| 154 | +* Pointers have the same problem, in addition to requiring the |
| 155 | + use of compat_ptr(). The best workaround is to use ``__u64`` |
| 156 | + in place of pointers, which requires a cast to ``uintptr_t`` in user |
| 157 | + space, and the use of u64_to_user_ptr() in the kernel to convert |
| 158 | + it back into a user pointer. |
| 159 | + |
| 160 | +* On the x86-32 (i386) architecture, the alignment of 64-bit variables |
| 161 | + is only 32-bit, but they are naturally aligned on most other |
| 162 | + architectures including x86-64. This means a structure like:: |
| 163 | + |
| 164 | + struct foo { |
| 165 | + __u32 a; |
| 166 | + __u64 b; |
| 167 | + __u32 c; |
| 168 | + }; |
| 169 | + |
| 170 | + has four bytes of padding between a and b on x86-64, plus another four |
| 171 | + bytes of padding at the end, but no padding on i386, and it needs a |
| 172 | + compat_ioctl conversion handler to translate between the two formats. |
| 173 | + |
| 174 | + To avoid this problem, all structures should have their members |
| 175 | + naturally aligned, or explicit reserved fields added in place of the |
| 176 | + implicit padding. The ``pahole`` tool can be used for checking the |
| 177 | + alignment. |
| 178 | + |
| 179 | +* On ARM OABI user space, structures are padded to multiples of 32-bit, |
| 180 | + making some structs incompatible with modern EABI kernels if they |
| 181 | + do not end on a 32-bit boundary. |
| 182 | + |
| 183 | +* On the m68k architecture, struct members are not guaranteed to have an |
| 184 | + alignment greater than 16-bit, which is a problem when relying on |
| 185 | + implicit padding. |
| 186 | + |
| 187 | +* Bitfields and enums generally work as one would expect them to, |
| 188 | + but some properties of them are implementation-defined, so it is better |
| 189 | + to avoid them completely in ioctl interfaces. |
| 190 | + |
| 191 | +* ``char`` members can be either signed or unsigned, depending on |
| 192 | + the architecture, so the __u8 and __s8 types should be used for 8-bit |
| 193 | + integer values, though char arrays are clearer for fixed-length strings. |
| 194 | + |
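The layout rules above can be checked at build time. The sketch below shows one possible way to fix the ``struct foo`` example with explicit reserved members, plus a hypothetical ``struct foo_buffer`` that carries a user pointer as ``__u64``. The names ``foo_fixed``, ``foo_buffer`` and ``foo_buffer_init()`` are assumptions for illustration only, and the code covers only the user-space side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/types.h>

/* Explicit reserved members replace the implicit padding of struct foo. */
struct foo_fixed {
	__u32 a;
	__u32 reserved1;	/* was a four-byte hole on x86-64 */
	__u64 b;
	__u32 c;
	__u32 reserved2;	/* was tail padding on x86-64 */
};

/* These hold regardless of whether __u64 is 4-byte or 8-byte aligned. */
static_assert(offsetof(struct foo_fixed, b) == 8, "b is naturally aligned");
static_assert(sizeof(struct foo_fixed) == 24, "no implicit tail padding");

/* A user pointer passed as __u64 instead of a pointer member. */
struct foo_buffer {
	__u64 addr;
	__u32 len;
	__u32 reserved;
};

static_assert(sizeof(struct foo_buffer) == 16, "no implicit padding");

/* User-space side: go through uintptr_t when storing the pointer. */
static inline void foo_buffer_init(struct foo_buffer *buf, void *data,
				   __u32 len)
{
	buf->addr = (__u64)(uintptr_t)data;
	buf->len = len;
	buf->reserved = 0;
}
```

With this layout the same structure definition is shared by 32-bit and 64-bit user space, so no conversion between two formats is needed for it.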
| 195 | +Information leaks |
| 196 | +================= |
| 197 | + |
| 198 | +Uninitialized data must not be copied back to user space, as this can |
| 199 | +cause an information leak, which can be used to defeat kernel address |
| 200 | +space layout randomization (KASLR), helping in an attack. |
| 201 | + |
| 202 | +For this reason (and for compat support) it is best to avoid any |
| 203 | +implicit padding in data structures. Where there is implicit padding |
| 204 | +in an existing structure, kernel drivers must be careful to fully |
| 205 | +initialize an instance of the structure before copying it to user |
| 206 | +space. This is usually done by calling memset() before assigning to |
| 207 | +individual members. |
| 208 | + |
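The memset() pattern might look like the following sketch. It is written as ordinary C so it can be compiled in user space; ``struct foo_info`` and ``foo_fill_info()`` are hypothetical, and a real driver would follow this with a copy to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/types.h>

/* Existing structure with implicit padding between id and value. */
struct foo_info {
	__u8 id;
	__u32 value;
};

static_assert(offsetof(struct foo_info, value) > sizeof(__u8),
	      "example assumes implicit padding after id");

/* Clear the whole object first so the padding bytes are zeroed too. */
static void foo_fill_info(struct foo_info *info, __u8 id, __u32 value)
{
	memset(info, 0, sizeof(*info));
	info->id = id;
	info->value = value;
}
```

Initializing the members one by one without the memset() would leave the padding bytes with whatever the memory held before.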
| 209 | +Subsystem abstractions |
| 210 | +====================== |
| 211 | + |
| 212 | +While some device drivers implement their own ioctl function, most |
| 213 | +subsystems implement the same command for multiple drivers. Ideally the |
| 214 | +subsystem has an .ioctl() handler that copies the arguments from and |
| 215 | +to user space, passing them into subsystem specific callback functions |
| 216 | +through normal kernel pointers. |
| 217 | + |
| 218 | +This helps in various ways: |
| 219 | + |
| 220 | +* Applications written for one driver are more likely to work for |
| 221 | + another one in the same subsystem if there are no subtle differences |
| 222 | + in the user space ABI. |
| 223 | + |
| 224 | +* The complexity of user space access and data structure layout is handled
| 225 | + in one place, reducing the potential for implementation bugs. |
| 226 | + |
| 227 | +* It is more likely to be reviewed by experienced developers |
| 228 | + that can spot problems in the interface when the ioctl is shared |
| 229 | + between multiple drivers than when it is only used in a single driver. |
| 230 | + |
| 231 | +Alternatives to ioctl |
| 232 | +===================== |
| 233 | + |
| 234 | +There are many cases in which ioctl is not the best solution for a |
| 235 | +problem. Alternatives include: |
| 236 | + |
| 237 | +* System calls are a better choice for a system-wide feature that |
| 238 | + is not tied to a physical device or constrained by the file system |
| 239 | + permissions of a character device node.
| 240 | + |
| 241 | +* netlink is the preferred way of configuring any network related |
| 242 | + objects through sockets. |
| 243 | + |
| 244 | +* debugfs is used for ad-hoc interfaces for debugging functionality |
| 245 | + that does not need to be exposed as a stable interface to applications. |
| 246 | + |
| 247 | +* sysfs is a good way to expose the state of an in-kernel object |
| 248 | + that is not tied to a file descriptor. |
| 249 | + |
| 250 | +* configfs can be used for more complex configuration than sysfs.
| 251 | + |
| 252 | +* A custom file system can provide extra flexibility with a simple |
| 253 | + user interface but adds a lot of complexity to the implementation. |