Commit 58de9a7

DRIVERS-2926 BSON Binary Vector Subtype Support (#1658)
1 parent d1bdb68 commit 58de9a7

File tree

7 files changed

+539
-0
lines changed

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# BSON Binary Subtype 9 - Vector

- Status: Pending
- Minimum Server Version: N/A

______________________________________________________________________

## Abstract

This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors
here refer to densely packed arrays of numbers, all of the same type.

## Motivation

The vector data types defined in this specification correspond to the numeric types supported by popular numerical
libraries for vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data
using the same densely packed format used by these libraries can result in significant memory savings and processing
efficiency.

### META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification

This specification introduces a new BSON binary subtype, the vector, with value `9`.

Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification.

### Data Types (dtypes)

Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented.

| Vector data type | Alias      | Bits per vector element | [Arrow Data Type](https://arrow.apache.org/docs/cpp/api/datatype.html) (for illustration) |
| ---------------- | ---------- | ----------------------- | ----------------------------------------------------------------------------------------- |
| `0x03`           | INT8       | 8                       | INT8                                                                                      |
| `0x27`           | FLOAT32    | 32                      | FLOAT                                                                                     |
| `0x10`           | PACKED_BIT | 1 `*`                   | BOOL                                                                                      |

`*` A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of
integers in \[0, 255\]. So, for example, the vector `[0, 255]` is shorthand for the 16-bit vector
`[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]`. Each number (a uint8) can be stored as a single byte. Some languages, Python for
one, have no uint8 type, so each value must be represented as a wider int in memory, though it still occupies a single
byte on disk.
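
The shorthand above can be illustrated with a minimal Python sketch. The helper name `unpack_bits` is hypothetical and not part of this specification; it assumes the most-significant bit of each byte comes first, which matches the worked example later in this document.

```python
def unpack_bits(packed):
    """Expand ints in [0, 255] into a flat list of bits, most-significant bit first."""
    bits = []
    for value in packed:
        if not 0 <= value <= 255:
            raise ValueError(f"{value} does not fit in a uint8")
        # Walk from bit 7 (most significant) down to bit 0.
        bits.extend((value >> shift) & 1 for shift in range(7, -1, -1))
    return bits


print(unpack_bits([0, 255]))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```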

### Byte padding

As not all data types have a bit length equal to a multiple of 8, and hence do not always fit exactly into a whole number
of bytes, a second piece of metadata, the "padding", is included. It tells the driver how many bits in the final byte
are to be ignored. The ignored bits are the least-significant ones.

### Binary structure

Following the binary subtype `9`, a two-element byte array of metadata precedes the packed numbers.

- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may
  grow. dtype is an unsigned integer.

- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative
  integer. It must always be present, and is set to zero in cases where it does not apply.

- The remainder contains the actual vector elements packed according to dtype.

All values use the little-endian format.
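
As a rough illustration of this layout, the following Python sketch builds the bytes that follow the subtype for a FLOAT32 vector using the standard `struct` module. The function name `pack_float32_vector` is hypothetical; only the dtype value `0x27`, the zero padding byte, and the little-endian element encoding come from this specification.

```python
import struct

FLOAT32 = 0x27  # dtype byte from the table above


def pack_float32_vector(values):
    """Return dtype byte + padding byte + little-endian float32 elements."""
    header = bytes([FLOAT32, 0])  # padding does not apply to FLOAT32, so it is 0
    body = struct.pack(f"<{len(values)}f", *values)  # "<" selects little-endian
    return header + body


payload = pack_float32_vector([127.0, 7.0])
print(payload.hex())
# 27000000fe420000e040
```

The resulting ten bytes are the same ones that appear after the subtype byte `09` in the `canonical_bson` of the "Simple Vector FLOAT32" test case.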

#### Example

Let's take a vector `[238, 224]` of dtype PACKED_BIT (`\x10`) with a padding of `4`.

In hex, it looks like this: `b"\x10\x04\xee\xe0"`: 1 byte for dtype, 1 for padding, and 1 for each uint8.

We can visualize the binary representation like so:

<table border="1" cellspacing="0" cellpadding="5">
  <tr>
    <td colspan="8">1st byte: dtype (from list in previous table) </td>
    <td colspan="8">2nd byte: padding (values in [0,7])</td>
    <td colspan="8">1st uint8: 238</td>
    <td colspan="8">2nd uint8: 224</td>
  </tr>
  <tr>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
</table>

Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this:

| 1   | 1   | 1   | 0   | 1   | 1   | 1   | 0   | 1   | 1   | 1   | 0   |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
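
The same result can be reproduced with a short Python sketch. The function name `decode_packed_bit` is hypothetical; the sketch only assumes the layout described above (dtype byte, padding byte, then the packed bytes) and drops the least-significant `padding` bits of the final byte.

```python
PACKED_BIT = 0x10


def decode_packed_bit(payload):
    """Return the bit vector encoded in dtype + padding + packed bytes."""
    dtype, padding = payload[0], payload[1]
    if dtype != PACKED_BIT:
        raise ValueError(f"expected dtype 0x10, got {dtype:#04x}")
    bits = []
    for byte in payload[2:]:
        # Most-significant bit first within each byte.
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    # The ignored bits are the trailing (least-significant) ones of the final byte.
    return bits[: len(bits) - padding]


print(decode_packed_bit(b"\x10\x04\xee\xe0"))
# [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
```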

## API Guidance

Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while
following idioms of the language of the driver.

### Encoding

```
Function from_vector(vector: Iterable<Number>, dtype: Dtype, padding: Integer = 0) -> Binary
    # Converts a numeric vector into a binary representation based on the specified dtype and padding.

    # :param vector: A sequence or iterable of numbers (either float or int)
    # :param dtype: Data type for binary conversion (from the Dtype enum)
    # :param padding: Optional integer specifying how many bits to ignore in the final byte
    # :return: A binary representation of the vector

    Declare binary_data as Binary

    # Write the two metadata bytes (dtype, padding) that precede the packed elements
    binary_data.append(dtype)
    binary_data.append(padding)

    # Process each number in vector and convert according to dtype
    For each number in vector
        binary_element = convert_to_binary(number, dtype)
        binary_data.append(binary_element)
    End For

    # Apply padding to the binary data if needed
    If padding > 0
        apply_padding(binary_data, padding)
    End If

    Return binary_data
End Function
```

Note: If a driver chooses to implement a `Vector` type (or several) like the one suggested in the Data Structures
subsection below, it MAY define `from_vector` to take a single argument, a Vector.
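
For illustration only, the encoding pattern might look as follows in Python. This is a sketch under assumptions, not the reference implementation: the `Dtype` enum and the `from_vector` body are written here for the example, validation of padding is deliberately left out (see the Validation subsection), and PACKED_BIT input is assumed to already be a sequence of uint8 values.

```python
import struct
from enum import Enum


class Dtype(Enum):
    INT8 = 0x03
    FLOAT32 = 0x27
    PACKED_BIT = 0x10


def from_vector(vector, dtype, padding=0):
    """Encode numbers as dtype byte + padding byte + little-endian elements."""
    values = list(vector)
    if dtype is Dtype.INT8:
        body = struct.pack(f"<{len(values)}b", *values)  # signed 8-bit integers
    elif dtype is Dtype.FLOAT32:
        body = struct.pack(f"<{len(values)}f", *values)  # 32-bit floats
    elif dtype is Dtype.PACKED_BIT:
        body = bytes(values)  # each value is already one packed byte
    else:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return bytes([dtype.value, padding]) + body


print(from_vector([238, 224], Dtype.PACKED_BIT, padding=4))
# b'\x10\x04\xee\xe0'
```

Out-of-range elements are rejected by `struct.pack` and `bytes` themselves, which raise an error rather than silently truncating.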

### Decoding

```
Function as_vector() -> Vector
    # Unpacks binary data (BSON or similar) into a Vector structure.
    # This process involves extracting numeric values, the data type, and padding information.

    # :return: A Vector containing the unpacked numeric values, dtype, and padding.

    Declare binary_vector as Vector  # Struct to hold the unpacked data

    # Extract dtype (data type) from the binary data
    binary_vector.dtype = extract_dtype_from_binary()

    # Extract padding from the binary data
    binary_vector.padding = extract_padding_from_binary()

    # Unpack the actual numeric values from the binary data according to the dtype
    binary_vector.data = unpack_numeric_values(binary_vector.dtype)

    Return binary_vector
End Function
```
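
A possible Python rendering of the decoding pattern is sketched below. The `Vector` dataclass and the free function `as_vector(payload)` are assumptions made for this example (a driver would more likely expose this as a method on its Binary type); the dtype values and byte layout are those given in this specification.

```python
import struct
from dataclasses import dataclass


@dataclass
class Vector:
    data: list
    dtype: int
    padding: int


def as_vector(payload):
    """Decode dtype byte + padding byte + packed elements into a Vector."""
    if len(payload) < 2:
        raise ValueError("payload must contain the dtype and padding bytes")
    dtype, padding = payload[0], payload[1]
    body = payload[2:]
    if dtype == 0x03:  # INT8
        data = list(struct.unpack(f"<{len(body)}b", body))
    elif dtype == 0x27:  # FLOAT32
        if len(body) % 4:
            raise ValueError("FLOAT32 data must be a multiple of 4 bytes")
        data = list(struct.unpack(f"<{len(body) // 4}f", body))
    elif dtype == 0x10:  # PACKED_BIT: keep the packed uint8 values
        data = list(body)
    else:
        raise ValueError(f"unsupported dtype: {dtype:#04x}")
    return Vector(data=data, dtype=dtype, padding=padding)


print(as_vector(b"\x10\x04\xee\xe0"))
# Vector(data=[238, 224], dtype=16, padding=4)
```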

#### Validation

Drivers MUST validate vector metadata and raise an error if any invariant is violated:

- Padding MUST be 0 for all dtypes where padding does not apply, and MUST be within \[0, 7\] for PACKED_BIT.
- A PACKED_BIT vector MUST NOT be empty if padding is in the range \[1, 7\].

Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking
binary data (BSON or similar) into a Vector structure.
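
These invariants could be checked with a helper along the following lines. The name `validate_vector` and its signature are hypothetical; `length` stands for the number of packed bytes (for PACKED_BIT) or elements in the vector.

```python
PACKED_BIT = 0x10


def validate_vector(dtype, padding, length):
    """Raise ValueError if the metadata violates the invariants listed above."""
    if dtype == PACKED_BIT:
        if not 0 <= padding <= 7:
            raise ValueError("padding must be within [0, 7] for PACKED_BIT")
        if length == 0 and padding != 0:
            raise ValueError("an empty PACKED_BIT vector must have padding 0")
    elif padding != 0:
        raise ValueError("padding must be 0 for dtypes other than PACKED_BIT")


validate_vector(0x10, 4, 2)  # valid: two packed bytes, last 4 bits ignored
validate_vector(0x27, 0, 2)  # valid: FLOAT32 without padding
```

A call such as `validate_vector(0x27, 3, 2)` raises, which corresponds to the invalid "FLOAT32 with padding" case in the test corpus.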

#### Data Structures

Drivers MAY find the following structures useful for representing the dtype and the vector.

```
Enum Dtype
    # Enum for data types (dtype)

    # FLOAT32: Represents packing of list of floats as float32
    # Value: 0x27 (hexadecimal byte value)

    # INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8
    # Value: 0x03 (hexadecimal byte value)

    # PACKED_BIT: Special case where vector values are 0 or 1, packed as uint8 in range [0, 255]
    # Packed into groups of 8 (a byte)
    # Value: 0x10 (hexadecimal byte value)

    # Documentation:
    # Each value is a byte (length of one), a convenient choice for decoding.
End Enum

Struct Vector
    # Numeric vector with metadata for binary interoperability

    # Fields:
    # data: Sequence of numeric values (either float or int)
    # dtype: Data type of vector (from enum Dtype)
    # padding: Number of bits to ignore in the final byte for alignment

    data     # Sequence of float or int
    dtype    # Type: Dtype
    padding  # Integer: Number of padding bits
End Struct
```

## Reference Implementation

- PYTHON (PYTHON-4577)

## Test Plan

See the [README](tests/README.md) for tests.

## FAQ

- What MongoDB Server version does this apply to?
  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
- In PACKED_BIT, why would one choose to use integers in \[0, 256)?
  - This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is
    widely used across different fields, such as data compression, communication protocols, and file formats, where you
    want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example
    in Python, see
    [numpy.unpackbits](https://numpy.org/doc/2.0/reference/generated/numpy.unpackbits.html#numpy.unpackbits).
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Testing Binary subtype 9: Vector

The JSON files in this directory tree are platform-independent tests that drivers can use to prove their conformance to
the specification.

These tests focus on the roundtrip of the list of numbers as input/output, along with their data type and byte padding.

Additional tests exist in `bson_corpus/tests/binary.json` but do not sufficiently test the end-to-end process of Vector
to BSON. For this reason, drivers must create a bespoke test runner for the vector subtype.

## Format

The test data corpus consists of a JSON file for each data type (dtype). Each file contains a number of test cases,
under the top-level key "tests". Each test case pertains to a single vector. The keys provide the specification of the
vector. Valid cases also include the Canonical BSON format of a document {test_key: binary}. The "test_key" is common,
and specified at the top level.

#### Top level keys

Each JSON file contains three top-level keys.

- `description`: human-readable description of what is in the file
- `test_key`: name used for key when encoding/decoding a BSON document containing the single BSON Binary for the test
  case. Applies to *every* case.
- `tests`: array of test case objects, each of which has the following keys. Valid cases will also contain additional
  binary and json encoding values.

#### Keys of individual test cases

- `description`: string describing the test.
- `valid`: boolean indicating if the vector, dtype, and padding should be considered a valid input.
- `vector`: list of numbers
- `dtype_hex`: string defining the data type in hex (e.g. "0x10", "0x27")
- `dtype_alias`: (optional) string defining the dtype, perhaps as Enum.
- `padding`: (optional) integer for byte padding. Defaults to 0.
- `canonical_bson`: (required if valid is true) an (uppercase) big-endian hex representation of a BSON byte string.
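
A test runner might normalize each case before use, for example as in the Python sketch below. The helper `load_case` is hypothetical. Treating the strings `"inf"` and `"-inf"` in `vector` as floating-point infinities is an assumption based on the FLOAT32 test file, and the sample case is a trimmed copy of a corpus case with `padding` omitted to show the default.

```python
import json


def load_case(case):
    """Normalize one test case dict: parse the dtype, default the padding, convert infinities."""
    # Assumption: string entries such as "inf" / "-inf" stand for float infinities.
    vector = [float(x) if isinstance(x, str) else x for x in case["vector"]]
    return {
        "vector": vector,
        "dtype": int(case["dtype_hex"], 16),  # "0x27" -> 39
        "padding": case.get("padding", 0),    # optional key, defaults to 0
        "valid": case["valid"],
        "canonical_bson": case.get("canonical_bson"),  # absent for invalid cases
    }


raw = '{"description": "Infinity Vector FLOAT32", "valid": true, "vector": ["-inf", 0.0, "inf"], "dtype_hex": "0x27"}'
case = load_case(json.loads(raw))
print(case["dtype"], case["padding"], case["vector"])
# 39 0 [-inf, 0.0, inf]
```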

## Required tests

#### To prove correct in a valid case (`valid: true`), one MUST

- encode a document from the numeric values, dtype, and padding, along with the "test_key", and assert this matches the
  canonical_bson string.
- decode the canonical_bson into its binary form, and then assert that the numeric values, dtype, and padding all match
  those provided in the JSON.

Note: For floating point number types, exact numerical matches may not be possible. Drivers that natively support the
floating-point type being tested (e.g., when testing float32 vector values in a driver that natively supports float32)
MUST assert that the input float array is the same after encoding and decoding.

#### To prove correct in an invalid case (`valid: false`), one MUST

- raise an exception when attempting to encode a document from the numeric values, dtype, and padding.
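
To make the decode step concrete, the following Python sketch pulls the vector payload out of a `canonical_bson` string by hand. It is an illustration under assumptions, not a conformance runner: the helper `extract_vector_payload` is hypothetical, it only handles a document holding a single Binary element, and a real driver would use its own BSON library instead.

```python
import struct


def extract_vector_payload(canonical_bson_hex, test_key):
    """Return the bytes after the subtype byte of the single Binary element in the document."""
    doc = bytes.fromhex(canonical_bson_hex)
    (doc_len,) = struct.unpack_from("<i", doc, 0)  # int32 total document length
    if doc_len != len(doc):
        raise ValueError("document length mismatch")
    pos = 4
    if doc[pos] != 0x05:  # 0x05 is the BSON Binary element type
        raise ValueError("expected a Binary element")
    pos += 1
    key_end = doc.index(b"\x00", pos)  # element name is a NUL-terminated string
    if doc[pos:key_end].decode("utf-8") != test_key:
        raise ValueError("unexpected key")
    pos = key_end + 1
    (bin_len,) = struct.unpack_from("<i", doc, pos)  # int32 payload length
    pos += 4
    if doc[pos] != 0x09:  # vector subtype
        raise ValueError("expected binary subtype 9")
    pos += 1
    return doc[pos : pos + bin_len]


payload = extract_vector_payload(
    "1C00000005766563746F72000A0000000927000000FE420000E04000", "vector"
)
dtype, padding = payload[0], payload[1]
values = list(struct.unpack(f"<{(len(payload) - 2) // 4}f", payload[2:]))
print(hex(dtype), padding, values)
# 0x27 0 [127.0, 7.0]
```

The decoded dtype, padding, and values match the `dtype_hex`, `padding`, and `vector` fields of the "Simple Vector FLOAT32" case.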

## FAQ

- What MongoDB Server version does this apply to?
  - Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
{
  "description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32",
  "test_key": "vector",
  "tests": [
    {
      "description": "Simple Vector FLOAT32",
      "valid": true,
      "vector": [127.0, 7.0],
      "dtype_hex": "0x27",
      "dtype_alias": "FLOAT32",
      "padding": 0,
      "canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000"
    },
    {
      "description": "Vector with decimals and negative value FLOAT32",
      "valid": true,
      "vector": [127.7, -7.7],
      "dtype_hex": "0x27",
      "dtype_alias": "FLOAT32",
      "padding": 0,
      "canonical_bson": "1C00000005766563746F72000A0000000927006666FF426666F6C000"
    },
    {
      "description": "Empty Vector FLOAT32",
      "valid": true,
      "vector": [],
      "dtype_hex": "0x27",
      "dtype_alias": "FLOAT32",
      "padding": 0,
      "canonical_bson": "1400000005766563746F72000200000009270000"
    },
    {
      "description": "Infinity Vector FLOAT32",
      "valid": true,
      "vector": ["-inf", 0.0, "inf"],
      "dtype_hex": "0x27",
      "dtype_alias": "FLOAT32",
      "padding": 0,
      "canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00"
    },
    {
      "description": "FLOAT32 with padding",
      "valid": false,
      "vector": [127.0, 7.0],
      "dtype_hex": "0x27",
      "dtype_alias": "FLOAT32",
      "padding": 3
    }
  ]
}
