Commit 0feea2f

[𝘀𝗽𝗿] changes to main this commit is based on
Created using spr 1.3.5 [skip ci]
1 parent b510cdb commit 0feea2f

19 files changed, +2056 -0 lines changed
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Content Addressable Storage

## Introduction to CAS

Content Addressable Storage, or `CAS`, is a storage system that assigns
unique addresses to the data it stores. It is very useful for data
deduplication and for creating unique identifiers.

Unlike other kinds of storage systems, such as file systems, CAS is immutable.
This makes it more reliable to model a computation by representing the inputs
and outputs of the computation as objects stored in CAS.

The basic unit of the CAS library is a CASObject, which contains:

* Data: arbitrary data
* References: references to other CASObjects

It can be conceptually modeled as something like:

```
struct CASObject {
  ArrayRef<char> Data;
  ArrayRef<CASObject*> Refs;
};
```

Such an abstraction allows simple composition of CASObjects into a DAG to
represent complicated data structures while still allowing data deduplication.
Note that you can compare two DAGs by just comparing the CASObject hashes of
the two root nodes.
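
As a rough illustration of the model above, the following self-contained sketch builds a tiny in-memory content-addressed store. Everything in it (`ToyCAS`, `ToyObject`, `ToyHash`, `fnv1a`) is hypothetical and not part of LLVM. A 64-bit FNV-1a hash stands in for a real cryptographic hash, so collisions are possible and the code is only suitable as a demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy types for illustration only; not the LLVM API.
using ToyHash = std::uint64_t;

// 64-bit FNV-1a, a simple non-cryptographic stand-in for a real hash.
inline ToyHash fnv1a(const std::string &Bytes) {
  ToyHash H = 14695981039346656037ULL;
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

struct ToyObject {
  std::string Data;          // arbitrary data
  std::vector<ToyHash> Refs; // addresses of the referenced objects
};

class ToyCAS {
  std::map<ToyHash, ToyObject> Objects;

public:
  // Stores an object and returns its address, derived from data and refs.
  ToyHash store(const std::string &Data,
                const std::vector<ToyHash> &Refs = {}) {
    // Serialize the ref count, each ref, the data size, then the data.
    std::string Buf = std::to_string(Refs.size()) + ":";
    for (ToyHash R : Refs)
      Buf += std::to_string(R) + ",";
    Buf += std::to_string(Data.size()) + ":" + Data;
    ToyHash H = fnv1a(Buf);
    Objects.emplace(H, ToyObject{Data, Refs}); // no-op if already present
    return H;
  }

  std::size_t size() const { return Objects.size(); }
};
```

In this sketch, storing the same leaf twice yields the same address and only one stored object, and two roots built from equal children can be compared by address alone without walking the DAG.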

## LLVM CAS Library User Guide

The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
To reference a CASObject, there are a few different abstractions provided
with different trade-offs:

### ObjectRef

`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
This is the most commonly used abstraction and it is cheap to copy and pass
along. It has the following properties:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
  `ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
  compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
  explicit load is required before accessing the data stored in the CASObject.
  This load can also fail, for reasons including but not limited to: the object
  does not exist, the CAS storage is corrupted, or the operation timed out.
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
  (if they exist) are identical. If they are not equal, the underlying objects
  are guaranteed to be different.

### ObjectProxy

`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
underlying stored data and references can be accessed without the need
for error handling. The class APIs also provide convenient methods to
access the underlying data. The lifetime of the underlying data is equal to
the lifetime of the `ObjectStore` instance unless the data is explicitly copied.

### CASID

`CASID` is the hash identifier for CASObjects. It owns the underlying
storage for the hash value, so it can be expensive to copy and compare depending
on the hash algorithm. `CASID` is generally only useful in rare situations
like printing the raw hash value or exchanging hash values between different
CAS instances with the same hashing schema.

### ObjectStore

`ObjectStore` is the CAS-like object storage. It provides APIs to save
and load CASObjects, for example:

```
ObjectRef A, B, C;
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
```

It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
`CASID`.
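
To make the distinction between the three abstractions more concrete, here is a minimal toy model that does not depend on LLVM. The names `ToyStore`, `ToyRef`, `refFromID`, `getID` and `load` are assumptions made for this sketch and do not match the real `llvm::cas::ObjectStore` signatures. A plain string plays the role of the hash, and `std::optional` plays the role of `Expected`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy model; names mirror the concepts above, not the LLVM API.
class ToyStore;

// Lightweight handle: an index that is only meaningful for its owner store.
struct ToyRef {
  const ToyStore *Owner;
  std::size_t Index;
  bool operator==(const ToyRef &RHS) const {
    assert(Owner == RHS.Owner && "refs from different stores are not comparable");
    return Index == RHS.Index;
  }
};

class ToyStore {
  struct Entry {
    std::string ID;                  // stands in for the hash (the "CASID")
    std::optional<std::string> Data; // empty if the object is not available
  };
  std::vector<Entry> Entries;

  // Returns the ref for ID, creating an entry without data if needed.
  ToyRef intern(const std::string &ID) {
    for (std::size_t I = 0; I != Entries.size(); ++I)
      if (Entries[I].ID == ID)
        return ToyRef{this, I};
    Entries.push_back(Entry{ID, std::nullopt});
    return ToyRef{this, Entries.size() - 1};
  }

public:
  // Stores data and returns a ref. The "hash" is just the data with a
  // prefix, which is enough to make equal content share one entry.
  ToyRef store(const std::string &Data) {
    ToyRef Ref = intern("id:" + Data);
    Entries[Ref.Index].Data = Data;
    return Ref;
  }

  // Creates a ref from an ID alone; the object itself may be absent.
  ToyRef refFromID(const std::string &ID) { return intern(ID); }

  // Converts a ref back to its ID.
  const std::string &getID(ToyRef Ref) const { return Entries[Ref.Index].ID; }

  // Loading is the step that can fail.
  std::optional<std::string> load(ToyRef Ref) const {
    return Entries[Ref.Index].Data;
  }
};
```

The design point this sketch tries to capture is that comparing two `ToyRef`s is a cheap integer comparison, while the ID is only materialized on request and loading is the only operation that has to report failure.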

## CAS Library Implementation Guide

The LLVM ObjectStore APIs are designed so that it is easy to add
customized CAS implementations that are interchangeable with the builtin
CAS implementations.

To add your own implementation, you just need to add a subclass of
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with the LLVM ObjectStore, the new CAS implementation
needs to conform to the following contracts:

* Different CASObjects stored in the ObjectStore need to have different hashes
  and result in different `ObjectRef`s. Conversely, the same CASObject should
  have the same hash and the same `ObjectRef`. Note that two CASObjects with
  identical data but different references are considered different objects.
* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
  be used to determine the equality of the underlying CASObjects.
* The objects loaded from the ObjectStore need to have a lifetime at
  least as long as the ObjectStore itself.

If not specified, the behavior can be implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so that
`ObjectStore` never fails to load. It is also legal to use a stricter model
than required. For example, an `ObjectRef` that can be used to compare
objects between different `ObjectStore` instances is legal, but users
of the ObjectStore should not depend on this behavior.

For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
the ObjectStore that knows about the loaded CASObject.
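
The contracts listed above can be turned into a small conformance check. The sketch below uses a hypothetical abstract interface, `ToyObjectStore`, rather than the real `llvm::cas::ObjectStore`, whose virtual methods differ. `MapBackedStore` and `satisfiesContracts` are likewise names invented for this example, and the backend keys objects on their full content instead of a hash to keep the code short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface for illustration; the real base class is
// llvm::cas::ObjectStore, whose virtual methods differ from these.
class ToyObjectStore {
public:
  using Ref = std::size_t;
  virtual ~ToyObjectStore() = default;
  virtual Ref store(const std::vector<Ref> &Refs, const std::string &Data) = 0;
  virtual const std::string &loadData(Ref R) const = 0;
};

// One possible backend: keys objects on (refs, data) so that equal objects
// always map to the same Ref and different objects never do.
class MapBackedStore final : public ToyObjectStore {
  using Key = std::pair<std::vector<Ref>, std::string>;
  std::map<Key, Ref> Index;
  std::vector<const Key *> Objects; // std::map nodes are address-stable

public:
  Ref store(const std::vector<Ref> &Refs, const std::string &Data) override {
    auto [It, Inserted] = Index.try_emplace(Key{Refs, Data}, Objects.size());
    if (Inserted)
      Objects.push_back(&It->first);
    return It->second;
  }

  const std::string &loadData(Ref R) const override {
    return Objects.at(R)->second;
  }
};

// Checks the contracts listed above against any implementation.
inline bool satisfiesContracts(ToyObjectStore &S) {
  auto Leaf = S.store({}, "leaf");
  auto Same = S.store({}, "leaf");        // same object, same ref
  auto WithRef = S.store({Leaf}, "leaf"); // same data, different refs
  const std::string &Loaded = S.loadData(Leaf);
  S.store({}, "more");                    // loaded data must stay valid
  return Leaf == Same && Leaf != WithRef && Loaded == "leaf";
}
```

Because `satisfiesContracts` only talks to the abstract interface, the same check could be reused for any other backend that derives from the toy base class.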

llvm/docs/Reference.rst

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,7 @@ LLVM and API reference documentation.
 BranchWeightMetadata
 Bugpoint
 CommandGuide/index
+ContentAddressableStorage
 ConvergenceAndUniformity
 ConvergentOperations
 Coroutines
@@ -232,3 +233,6 @@ Additional Topics
 :doc:`ConvergenceAndUniformity`
 A description of uniformity analysis in the presence of irreducible
 control flow, and its implementation.
+
+:doc:`ContentAddressableStorage`
+A reference guide for using LLVM's CAS library.
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINCASCONTEXT_H
#define LLVM_CAS_BUILTINCASCONTEXT_H

#include "llvm/CAS/CASID.h"
#include "llvm/Support/BLAKE3.h"
#include "llvm/Support/Error.h"

namespace llvm::cas::builtin {

/// Current hash type for the builtin CAS.
///
/// FIXME: This should be configurable via an enum to allow configuring the hash
/// function. The enum should be sent into \a createInMemoryCAS() and \a
/// createOnDiskCAS().
///
/// This is important (at least) for future-proofing, when we want to make new
/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
///
/// Even just for BLAKE3, it would be useful to have these values:
///
///     BLAKE3     => 32B hash from BLAKE3
///     BLAKE3_16B => 16B hash from BLAKE3 (truncated)
///
/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
///
/// Motivation for a truncated hash is that it's cheaper to store. It's not
/// clear if we always (or ever) need the full 32B, and for an ephemeral
/// in-memory CAS, we almost certainly don't need it.
///
/// Note that the cost is linear in the number of objects for the builtin CAS,
/// since we're using internal offsets and/or pointers as an optimization.
///
/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
/// a distributed generic hash map to use as an ActionCache. In that scenario,
/// the transitive closure of the structured objects that are the results of
/// the cached actions would need to be serialized into the map, something
/// like:
///
///     "action:<schema>:<key>" -> "0123"
///     "object:<schema>:0123"  -> "3,4567,89AB,CDEF,9,some data"
///     "object:<schema>:4567"  -> ...
///     "object:<schema>:89AB"  -> ...
///     "object:<schema>:CDEF"  -> ...
///
/// These references would be full cost.
using HasherT = BLAKE3;
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

class BuiltinCASContext : public CASContext {
  void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
  void anchor() override;

public:
  /// Get the name of the hash for any table identifiers.
  ///
  /// FIXME: This should be configurable via an enum, with at least the
  /// following values:
  ///
  ///     "BLAKE3"    => 32B hash from BLAKE3
  ///     "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
  ///
  /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
  static StringRef getHashName() { return "BLAKE3"; }
  StringRef getHashSchemaIdentifier() const final {
    static const std::string ID =
        ("llvm.cas.builtin.v2[" + getHashName() + "]").str();
    return ID;
  }

  static const BuiltinCASContext &getDefaultContext();

  BuiltinCASContext() = default;

  static Expected<HashType> parseID(StringRef PrintedDigest);
  static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS);
};

} // namespace llvm::cas::builtin

#endif // LLVM_CAS_BUILTINCASCONTEXT_H
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
#define LLVM_CAS_BUILTINOBJECTHASHER_H

#include "llvm/CAS/ObjectStore.h"
#include "llvm/Support/Endian.h"

namespace llvm::cas {

template <class HasherT> class BuiltinObjectHasher {
public:
  using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

  static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ObjectRef &Ref : Refs)
      H.updateRef(CAS, Ref);
    H.updateArray(Data);
    return H.finish();
  }

  static HashT hashObject(ArrayRef<ArrayRef<uint8_t>> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ArrayRef<uint8_t> &Ref : Refs)
      H.updateID(Ref);
    H.updateArray(Data);
    return H.finish();
  }

private:
  HashT finish() { return Hasher.final(); }

  void updateRef(const ObjectStore &CAS, ObjectRef Ref) {
    updateID(CAS.getID(Ref));
  }

  void updateID(const CASID &ID) { updateID(ID.getHash()); }

  void updateID(ArrayRef<uint8_t> Hash) {
    // NOTE: Does not hash the size of the hash. That's a CAS implementation
    // detail that shouldn't leak into the UUID for an object.
    assert(Hash.size() == sizeof(HashT) &&
           "Expected object ref to match the hash size");
    Hasher.update(Hash);
  }

  void updateArray(ArrayRef<uint8_t> Bytes) {
    updateSize(Bytes.size());
    Hasher.update(Bytes);
  }

  void updateArray(ArrayRef<char> Bytes) {
    updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
                         Bytes.size()));
  }

  void updateSize(uint64_t Size) {
    Size = support::endian::byte_swap(Size, endianness::little);
    Hasher.update(
        ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
  }

  BuiltinObjectHasher() = default;
  ~BuiltinObjectHasher() = default;
  HasherT Hasher;
};

} // namespace llvm::cas

#endif // LLVM_CAS_BUILTINOBJECTHASHER_H
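
The order in which `hashObject` feeds bytes to the hasher can be easier to see when written out as a flat buffer. The sketch below is a standalone illustration, not code from this commit: `Bytes`, `appendSize` and `buildPreimage` are hypothetical names, and the real class streams the bytes into `HasherT` incrementally instead of materializing them. It follows the same sequence as the header above: reference count, each reference hash with no per-reference size, data size, then data, with sizes encoded as 8 little-endian bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper types and functions for illustration only.
using Bytes = std::vector<std::uint8_t>;

// Appends Size as 8 little-endian bytes, independent of host byte order.
inline void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int I = 0; I != 8; ++I)
    Out.push_back(static_cast<std::uint8_t>(Size >> (8 * I)));
}

// Builds the byte stream in the same order hashObject feeds the hasher:
// ref count, each ref hash (no per-ref size), data size, data.
inline Bytes buildPreimage(const std::vector<Bytes> &RefHashes,
                           const std::string &Data) {
  Bytes Out;
  appendSize(Out, RefHashes.size());
  for (const Bytes &Hash : RefHashes)
    Out.insert(Out.end(), Hash.begin(), Hash.end());
  appendSize(Out, Data.size());
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

The size prefixes are what keep the encoding unambiguous in this sketch: an object with no references and two bytes of data produces a buffer of the same length as an object with one two-byte reference and no data, but the leading count and size fields differ, so the two buffers (and therefore their hashes) are distinct.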
