| 1 | +# BSON Binary Subtype 9 - Vector |
| 2 | + |
| 3 | +- Status: Pending |
| 4 | +- Minimum Server Version: N/A |
| 5 | + |
| 6 | +______________________________________________________________________ |
| 7 | + |
| 8 | +## Abstract |
| 9 | + |
| 10 | +This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors |
| 11 | +here refer to densely packed arrays of numbers, all of the same type. |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +The vector representations defined here correspond to the numeric types supported by popular numerical libraries for |
| 16 | +vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data in the same |
| 17 | +densely packed format used by these libraries can yield significant memory savings and processing efficiency. |
| 18 | + |
| 19 | +### META |
| 20 | + |
| 21 | +The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and |
| 22 | +"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). |
| 23 | + |
| 24 | +## Specification |
| 25 | + |
| 26 | +This specification introduces a new BSON binary subtype, the vector, with value `9`. |
| 27 | + |
| 28 | +Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification. |
| 29 | + |
| 30 | +### Data Types (dtypes) |
| 31 | + |
| 32 | +Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented. |
| 33 | + |
| 34 | +| Vector data type | Alias | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) | |
| 35 | +| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- | |
| 36 | +| `0x03` | INT8 | 8 | INT8 | |
| 37 | +| `0x27` | FLOAT32 | 32 | FLOAT | |
| 38 | +| `0x10` | PACKED_BIT | 1 `*` | BOOL | |
| 39 | + |
| 40 | +`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of |
| 41 | +integers in \[0, 255\], each holding 8 bits. So, for example, the vector `[0, 255]` is shorthand for the 16-bit vector |
| 42 | +`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. Each number (a uint8) can be stored as a single byte. Some languages, Python for |
| 43 | +one, have no uint8 type, so each value must be represented as an int in memory, though it still occupies one byte on disk. |
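
The expansion described above can be sketched in Python as follows. The helper name `expand_packed_bits` is hypothetical and not part of this specification. The sketch assumes the most-significant bit of each byte comes first, which is what the `[0, 255]` example implies.

```python
def expand_packed_bits(packed):
    """Expand uint8 values into a flat list of bits, most-significant bit first."""
    bits = []
    for value in packed:
        if not 0 <= value <= 255:
            raise ValueError(f"{value} does not fit in a uint8")
        # Walk from bit 7 (most significant) down to bit 0.
        bits.extend((value >> shift) & 1 for shift in range(7, -1, -1))
    return bits


print(expand_packed_bits([0, 255]))
```

Each input integer contributes exactly 8 bits to the result, so a list of `n` integers describes a bit vector of length `8 * n` before any padding is taken into account.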
| 44 | + |
| 45 | +### Byte padding |
| 46 | + |
| 47 | +Not all data types have a bit length that is a multiple of 8, so a vector does not always fill a whole number of |
| 48 | +bytes. A second piece of metadata, the "padding", is therefore included. It tells the driver how many bits in the |
| 49 | +final byte are to be ignored. The least-significant bits are ignored. |
| 50 | + |
| 51 | +### Binary structure |
| 52 | + |
| 53 | +Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers. |
| 54 | + |
| 55 | +- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may |
| 56 | + grow. dtype is an unsigned integer. |
| 57 | + |
| 58 | +- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative |
| 59 | + integer. It must always be present, and is set to zero in cases where it is not applicable. |
| 60 | + |
| 61 | +- The remainder contains the actual vector elements packed according to dtype. |
| 62 | + |
| 63 | +All values use the little-endian format. |
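
As a rough illustration of this layout, the following Python sketch builds the bytes that follow the subtype marker using only the standard `struct` module. The function name `pack_vector` and the `_FORMATS` mapping are assumptions made for this example, not names defined by the specification. The sketch performs no validation.

```python
import struct

# dtype byte values taken from the table above.
INT8 = 0x03
FLOAT32 = 0x27
PACKED_BIT = 0x10

# struct format character for each dtype (an assumption of this sketch).
_FORMATS = {INT8: "b", FLOAT32: "f", PACKED_BIT: "B"}


def pack_vector(values, dtype, padding=0):
    """Return the payload of a subtype 9 Binary: dtype, padding, then elements."""
    header = struct.pack("<BB", dtype, padding)
    # "<" selects little-endian byte order with no alignment between elements.
    body = struct.pack(f"<{len(values)}{_FORMATS[dtype]}", *values)
    return header + body


print(pack_vector([1.0, -2.5], FLOAT32).hex())
```

Byte order only matters for FLOAT32 here, since INT8 and PACKED_BIT elements each occupy a single byte.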
| 64 | + |
| 65 | +#### Example |
| 66 | + |
| 67 | +Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`. |
| 68 | + |
| 69 | +In hex, it looks like this: `b"\x10\x04\xee\xe0"`, that is, 1 byte for dtype, 1 for padding, and 1 for each uint8. |
| 70 | + |
| 71 | +We can visualize the binary representation like so: |
| 72 | + |
| 73 | +<table border="1" cellspacing="0" cellpadding="5"> |
| 74 | + <tr> |
| 75 | + <td colspan="8">1st byte: dtype (from list in previous table) </td> |
| 76 | + <td colspan="8">2nd byte: padding (values in [0,7])</td> |
| 77 | + <td colspan="8">1st uint8: 238</td> |
| 78 | + <td colspan="8">2nd uint8: 224</td> |
| 79 | + </tr> |
| 80 | + <tr> |
| 81 | + <td>0</td> |
| 82 | + <td>0</td> |
| 83 | + <td>0</td> |
| 84 | + <td>1</td> |
| 85 | + <td>0</td> |
| 86 | + <td>0</td> |
| 87 | + <td>0</td> |
| 88 | + <td>0</td> |
| 89 | + <td>0</td> |
| 90 | + <td>0</td> |
| 91 | + <td>0</td> |
| 92 | + <td>0</td> |
| 93 | + <td>0</td> |
| 94 | + <td>1</td> |
| 95 | + <td>0</td> |
| 96 | + <td>0</td> |
| 97 | + <td>1</td> |
| 98 | + <td>1</td> |
| 99 | + <td>1</td> |
| 100 | + <td>0</td> |
| 101 | + <td>1</td> |
| 102 | + <td>1</td> |
| 103 | + <td>1</td> |
| 104 | + <td>0</td> |
| 105 | + <td>1</td> |
| 106 | + <td>1</td> |
| 107 | + <td>1</td> |
| 108 | + <td>0</td> |
| 109 | + <td>0</td> |
| 110 | + <td>0</td> |
| 111 | + <td>0</td> |
| 112 | + <td>0</td> |
| 113 | + </tr> |
| 114 | +</table> |
| 115 | + |
| 116 | +Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this: |
| 117 | + |
| 118 | +| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | |
| 119 | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | |
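
The same walk-through can be reproduced with a short Python sketch. The function name `unpack_packed_bit` is hypothetical. It reads the two metadata bytes, expands the remaining bytes most-significant bit first, and drops the ignored trailing bits. Validation of the padding value is omitted for brevity.

```python
def unpack_packed_bit(payload):
    """Split a PACKED_BIT payload into its bits, dropping the padding bits."""
    dtype, padding, data = payload[0], payload[1], payload[2:]
    if dtype != 0x10:
        raise ValueError("not a PACKED_BIT vector")
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    # The padding counts the least-significant bits of the final byte to ignore.
    return bits[: len(bits) - padding]


print(unpack_packed_bit(b"\x10\x04\xee\xe0"))
```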
| 120 | + |
| 121 | +## API Guidance |
| 122 | + |
| 123 | +Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while |
| 124 | +following idioms of the language of the driver. |
| 125 | + |
| 126 | +### Encoding |
| 127 | + |
| 128 | +``` |
| 129 | +Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary
| 130 | + # Converts a numeric vector into a binary representation based on the specified dtype and padding.
| 131 | + |
| 132 | + # :param vector: A sequence or iterable of numbers (either float or int)
| 133 | + # :param dtype: Data type for binary conversion (from DtypeEnum)
| 134 | + # :param padding: Optional integer specifying how many bits to ignore in the final byte
| 135 | + # :return: A binary representation of the vector
| 136 | + |
| 137 | + Declare binary_data as Binary, starting with the two metadata bytes: dtype, then padding
| 138 | + |
| 139 | + # Process each number in vector and convert according to dtype |
| 140 | + For each number in vector |
| 141 | + binary_element = convert_to_binary(number, dtype) |
| 142 | + binary_data.append(binary_element) |
| 143 | + End For |
| 144 | + |
| 145 | + # Apply padding to the binary data if needed |
| 146 | + If padding > 0 |
| 147 | + apply_padding(binary_data, padding) |
| 148 | + End If |
| 149 | + |
| 150 | + Return binary_data |
| 151 | +End Function |
| 152 | +``` |
| 153 | + |
| 154 | +Note: If a driver chooses to implement a `Vector` type (or several) like the one suggested in the Data Structures |
| 155 | +subsection below, it MAY decide that `from_vector` takes a single argument, a Vector. |
| 156 | + |
| 157 | +### Decoding |
| 158 | + |
| 159 | +``` |
| 160 | +Function as_vector() -> Vector
| 161 | + # Unpacks binary data (BSON or similar) into a Vector structure.
| 162 | + # This process involves extracting numeric values, the data type, and padding information.
| 163 | + |
| 164 | + # :return: A Vector containing the unpacked numeric values, dtype, and padding.
| 165 | + |
| 166 | + Declare binary_vector as Vector # Struct to hold the unpacked data
| 167 | + |
| 168 | + # Extract dtype (data type) from the binary data
| 169 | + binary_vector.dtype = extract_dtype_from_binary()
| 170 | + |
| 171 | + # Extract padding from the binary data
| 172 | + binary_vector.padding = extract_padding_from_binary()
| 173 | + |
| 174 | + # Unpack the actual numeric values from the binary data according to the dtype
| 175 | + binary_vector.data = unpack_numeric_values(binary_vector.dtype)
| 176 | + |
| 177 | + Return binary_vector
| 178 | +End Function |
| 179 | +``` |
| 180 | + |
| 181 | +#### Validation |
| 182 | + |
| 183 | +Drivers MUST validate vector metadata and raise an error if any invariant is violated: |
| 184 | + |
| 185 | +- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within \[0, 7\] for PACKED_BIT. |
| 186 | +- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\]. |
| 187 | + |
| 188 | +Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking |
| 189 | +binary data (BSON or similar) into a Vector structure. |
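
A minimal sketch of these checks in Python might look like the following. The function name `validate_metadata` and its `data_length` parameter (the number of bytes after the two metadata bytes) are assumptions of this example. How a driver reports the error is left to its own idioms.

```python
INT8, FLOAT32, PACKED_BIT = 0x03, 0x27, 0x10


def validate_metadata(dtype, padding, data_length):
    """Raise ValueError if dtype, padding and byte length break an invariant."""
    if dtype != PACKED_BIT and padding != 0:
        raise ValueError("padding must be 0 unless dtype is PACKED_BIT")
    if not 0 <= padding <= 7:
        raise ValueError("padding must be within [0, 7]")
    if padding > 0 and data_length == 0:
        raise ValueError("an empty PACKED_BIT vector cannot have padding")


validate_metadata(PACKED_BIT, 4, 2)  # passes silently
```

The same function can be called on both paths named above: before encoding user-supplied values, and after reading the two metadata bytes while decoding.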
| 190 | + |
| 191 | +#### Data Structures |
| 192 | + |
| 193 | +Drivers MAY find the following structures useful for representing the dtype and the vector. |
| 194 | + |
| 195 | +``` |
| 196 | +Enum DtypeEnum
| 197 | + # Enum for data types (dtype)
| 198 | + |
| 199 | + # FLOAT32: Represents packing of list of floats as float32
| 200 | + # Value: 0x27 (hexadecimal byte value)
| 201 | + |
| 202 | + # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8
| 203 | + # Value: 0x03 (hexadecimal byte value)
| 204 | + |
| 205 | + # PACKED_BIT: Special case where vector values are 0 or 1, packed as uint8 in range [0, 255]
| 206 | + # Packed into groups of 8 (a byte)
| 207 | + # Value: 0x10 (hexadecimal byte value)
| 208 | + |
| 209 | + # Documentation:
| 210 | + # Each value is a byte (length of one), a convenient choice for decoding.
| 211 | +End Enum
| 212 | + |
| 213 | +Struct Vector
| 214 | + # Numeric vector with metadata for binary interoperability
| 215 | + |
| 216 | + # Fields:
| 217 | + # data: Sequence of numeric values (either float or int)
| 218 | + # dtype: Data type of vector (from enum DtypeEnum)
| 219 | + # padding: Number of bits to ignore in the final byte for alignment
| 220 | + |
| 221 | + data # Sequence of float or int
| 222 | + dtype # Type: DtypeEnum
| 223 | + padding # Integer: Number of padding bits
| 224 | +End Struct
| 225 | +``` |
| 226 | + |
| 227 | +## Reference Implementation |
| 228 | + |
| 229 | +- PYTHON (PYTHON-4577) |
| 230 | + |
| 231 | +## Test Plan |
| 232 | + |
| 233 | +See the [README](tests/README.md) for tests. |
| 234 | + |
| 235 | +## FAQ |
| 236 | + |
| 237 | +- What MongoDB Server version does this apply to? |
| 238 | + - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version. |
| 239 | +- In PACKED_BIT, why would one choose to use integers in \[0, 256)? |
| 240 | + - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is |
| 241 | + widely used across different fields, such as data compression, communication protocols, and file formats, where you |
| 242 | + want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example |
| 243 | + in Python, see |
| 244 | + [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits). |