Commit 4f0c6be

[𝘀𝗽𝗿] initial version
Created using spr 1.3.5
2 parents b510cdb + 424bd23 commit 4f0c6be

43 files changed: 7523 additions, 10 deletions

llvm/CMakeLists.txt

Lines changed: 7 additions & 0 deletions
@@ -837,6 +837,13 @@ option (LLVM_ENABLE_SPHINX "Use Sphinx to generate llvm documentation." OFF)
 option (LLVM_ENABLE_OCAMLDOC "Build OCaml bindings documentation." ON)
 option (LLVM_ENABLE_BINDINGS "Build bindings." ON)
 
+if(UNIX AND CMAKE_SIZEOF_VOID_P GREATER_EQUAL 8)
+  set(LLVM_ENABLE_ONDISK_CAS_default ON)
+else()
+  set(LLVM_ENABLE_ONDISK_CAS_default OFF)
+endif()
+option(LLVM_ENABLE_ONDISK_CAS "Build OnDiskCAS." ${LLVM_ENABLE_ONDISK_CAS_default})
+
 set(LLVM_INSTALL_DOXYGEN_HTML_DIR "${CMAKE_INSTALL_DOCDIR}/llvm/doxygen-html"
   CACHE STRING "Doxygen-generated HTML documentation install directory")
 set(LLVM_INSTALL_OCAMLDOC_HTML_DIR "${CMAKE_INSTALL_DOCDIR}/llvm/ocaml-html"
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Content Addressable Storage

## Introduction to CAS

Content Addressable Storage, or `CAS`, is a storage system that assigns a
unique address to each piece of stored data. It is very useful for data
deduplication and for creating unique identifiers.

Unlike other kinds of storage systems, such as file systems, a CAS is
immutable. Representing the inputs and outputs of a computation as objects
stored in a CAS makes the computation more reliable to model.

The basic unit of the CAS library is a CASObject, which contains:

* Data: arbitrary data
* References: references to other CASObjects

It can be conceptually modeled as something like:

```
struct CASObject {
  ArrayRef<char> Data;
  ArrayRef<CASObject*> Refs;
};
```

This abstraction allows CASObjects to be composed into a DAG that represents
complicated data structures while still allowing data deduplication.
Note that you can compare two DAGs just by comparing the CASObject hashes of
their root nodes.
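
The deduplication and root-comparison properties can be illustrated with a
small standalone model. This is only a sketch under stated assumptions:
`ToyCAS`, `ToyObject`, `ToyID` and `toyHash` are hypothetical names that do not
exist in LLVM, and a 64-bit FNV-1a hash stands in for the cryptographic hash a
real CAS would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; none of these names are part of llvm::cas.
using ToyID = std::uint64_t;

struct ToyObject {
  std::string Data;
  std::vector<ToyID> Refs;
};

// 64-bit FNV-1a. A stand-in only: it is NOT collision resistant the way a
// cryptographic hash is.
inline ToyID toyHash(const std::string &Bytes) {
  std::uint64_t H = 14695981039346656037ULL;
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

class ToyCAS {
  std::map<ToyID, ToyObject> Objects;

public:
  // The address covers both the data and the addresses of the referenced
  // objects, so the ID of a root node summarizes the whole DAG below it.
  ToyID store(std::string Data, std::vector<ToyID> Refs = {}) {
    std::string Buffer = std::to_string(Refs.size()) + ":";
    for (ToyID Ref : Refs)
      Buffer += std::to_string(Ref) + ",";
    Buffer += Data;
    ToyID ID = toyHash(Buffer);
    // emplace() is a no-op when the ID already exists: that is deduplication.
    Objects.emplace(ID, ToyObject{std::move(Data), std::move(Refs)});
    return ID;
  }

  std::size_t size() const { return Objects.size(); }
};
```

In this model, storing the same leaf twice yields one entry and one ID, and two
roots compare equal exactly when their data and all transitively referenced
objects are the same.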


## LLVM CAS Library User Guide

The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
To reference a CASObject, a few different abstractions are provided,
each with different trade-offs:

### ObjectRef

`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
This is the most commonly used abstraction, and it is cheap to copy and pass
along. It has the following properties:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
  `ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced
  or compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
  explicit load is required before accessing the data stored in the CASObject.
  This load can also fail, for reasons including but not limited to: the
  object does not exist, the CAS storage is corrupted, or the operation timed
  out.
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point
  to (if they exist) are identical. If they are not equal, the underlying
  objects are guaranteed to be different.
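
The split between holding a reference and loading the object can be sketched
with a standalone model. Everything here is hypothetical (`ToyRef`, `ToyStore`
and their methods are illustration names, not the `llvm::cas` API), and the
digest is assumed to be computed elsewhere and passed in as a string.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; this is not the llvm::cas API.
struct ToyRef {
  std::size_t Index; // Only meaningful inside the store that issued it.
  bool operator==(const ToyRef &Other) const { return Index == Other.Index; }
};

class ToyStore {
  std::map<std::string, std::size_t> IndexByDigest; // digest -> ref index
  std::vector<std::optional<std::string>> Payloads; // nullopt: not available

public:
  // Hand out a ref for a digest whether or not the payload is present, so
  // holding a ref says nothing about whether the object can be loaded.
  ToyRef getReference(const std::string &Digest) {
    auto [It, Inserted] = IndexByDigest.try_emplace(Digest, Payloads.size());
    if (Inserted)
      Payloads.push_back(std::nullopt);
    return ToyRef{It->second};
  }

  // Make the payload for a digest available and return its ref.
  ToyRef store(const std::string &Digest, std::string Data) {
    ToyRef Ref = getReference(Digest);
    Payloads[Ref.Index] = std::move(Data);
    return Ref;
  }

  // Loading is a separate step that can fail.
  std::optional<std::string> load(ToyRef Ref) const {
    if (Ref.Index >= Payloads.size())
      return std::nullopt;
    return Payloads[Ref.Index];
  }
};
```

Because a ref is just a small index into one store's table, copying it is
cheap, but the same index handed to a different store would name an unrelated
object or nothing at all.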

### ObjectProxy

`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
underlying stored data and references can be accessed without the need
for error handling. The class APIs also provide convenient methods to
access the underlying data. The lifetime of the underlying data is equal to
the lifetime of the `ObjectStore` instance unless the data is explicitly
copied.

### CASID

`CASID` is the hash identifier for CASObjects. It owns the underlying
storage for the hash value, so it can be expensive to copy and compare,
depending on the hash algorithm. `CASID` is generally only useful in rare
situations, such as printing the raw hash value or exchanging hash values
between different CAS instances that use the same hashing schema.

### ObjectStore

`ObjectStore` is the CAS-like object storage. It provides APIs to save
and load CASObjects, for example:

```
ObjectRef A, B, C;
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
```

It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
`CASID`.



## CAS Library Implementation Guide

The LLVM ObjectStore APIs are designed so that it is easy to add
customized CAS implementations that are interchangeable with the builtin
CAS implementations.

To add your own implementation, you just need to add a subclass of
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with the LLVM ObjectStore, the new CAS implementation
needs to conform to the following contracts:

* Different CASObjects stored in the ObjectStore need to have different hashes
  and result in different `ObjectRef`s. Conversely, the same CASObject should
  have the same hash and the same `ObjectRef`. Note that two CASObjects with
  identical data but different references are considered different objects.
* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
  be used to determine the equality of the underlying CASObjects.
* Objects loaded from the ObjectStore need to have a lifetime at least as long
  as the ObjectStore itself.
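
The shape of such a subclass, and the lifetime contract in particular, can be
sketched with a much smaller standalone interface. This is an illustration
only: `ToyObjectStore`, `ToyInMemoryStore` and `createToyInMemoryStore` are
hypothetical names, references between objects are left out for brevity, and
FNV-1a again stands in for a real hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <string_view>

// Hypothetical interface, far smaller than llvm::cas::ObjectStore.
class ToyObjectStore {
public:
  virtual ~ToyObjectStore() = default;

  // Same content must always produce the same ID.
  virtual std::uint64_t store(std::string_view Data) = 0;

  // Returns nullptr if the object is unknown. A non-null result must stay
  // valid for as long as the store itself is alive.
  virtual const std::string *load(std::uint64_t ID) const = 0;
};

class ToyInMemoryStore final : public ToyObjectStore {
  // std::map never relocates its elements, so pointers handed out by load()
  // remain valid until the store is destroyed.
  std::map<std::uint64_t, std::string> Objects;

  // 64-bit FNV-1a, used here only as a stand-in for a cryptographic hash.
  static std::uint64_t hashBytes(std::string_view Bytes) {
    std::uint64_t H = 14695981039346656037ULL;
    for (unsigned char C : Bytes) {
      H ^= C;
      H *= 1099511628211ULL;
    }
    return H;
  }

public:
  std::uint64_t store(std::string_view Data) override {
    std::uint64_t ID = hashBytes(Data);
    Objects.try_emplace(ID, std::string(Data));
    return ID;
  }

  const std::string *load(std::uint64_t ID) const override {
    auto It = Objects.find(ID);
    return It == Objects.end() ? nullptr : &It->second;
  }
};

// Callers only see the abstract interface, which keeps implementations
// interchangeable.
inline std::unique_ptr<ToyObjectStore> createToyInMemoryStore() {
  return std::make_unique<ToyInMemoryStore>();
}
```

A container that moves its elements on growth, such as `std::vector`, would
break the lifetime contract in this sketch, which is why a node-based map is
used.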

Behavior that is not specified here can be implementation defined. For
example, an `ObjectRef` can be used to point to a loaded CASObject so that
`ObjectStore` never fails to load. It is also legal to use a stricter model
than required. For example, an `ObjectRef` that can be used to compare
objects between different `ObjectStore` instances is legal, but users
of the ObjectStore should not depend on this behavior.

For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
the ObjectStore that knows about the loaded CASObject.

llvm/docs/Reference.rst

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,7 @@ LLVM and API reference documentation.
    BranchWeightMetadata
    Bugpoint
    CommandGuide/index
+   ContentAddressableStorage
    ConvergenceAndUniformity
    ConvergentOperations
    Coroutines
@@ -232,3 +233,6 @@ Additional Topics
 :doc:`ConvergenceAndUniformity`
    A description of uniformity analysis in the presence of irreducible
    control flow, and its implementation.
+
+:doc:`ContentAddressableStorage`
+   A reference guide for using LLVM's CAS library.

llvm/include/llvm/CAS/ActionCache.h

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
//===- llvm/CAS/ActionCache.h -----------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_CASACTIONCACHE_H
#define LLVM_CAS_CASACTIONCACHE_H

#include "llvm/ADT/StringRef.h"
#include "llvm/CAS/CASID.h"
#include "llvm/CAS/CASReference.h"
#include "llvm/Support/Error.h"

namespace llvm::cas {

class ObjectStore;
class CASID;
class ObjectProxy;

/// A key for caching an operation.
/// It is implemented as a bag of bytes and provides a convenient constructor
/// for CAS types.
class CacheKey {
public:
  StringRef getKey() const { return Key; }

  // TODO: Support CacheKeys that are arbitrary arrays of bytes rather than
  // only CASIDs. To do that, ActionCache needs to be able to rehash the key
  // into the index, and a `getOrCompute` method could then be used to avoid
  // multiple calls to the hash function.
  CacheKey(const CASID &ID);
  CacheKey(const ObjectProxy &Proxy);
  CacheKey(const ObjectStore &CAS, const ObjectRef &Ref);

private:
  std::string Key;
};

/// A cache from a key describing an action to the result of doing it.
///
/// Actions are expected to be pure (collision is an error).
class ActionCache {
  virtual void anchor();

public:
  /// Get a previously computed result for \p ActionKey.
  ///
  /// \param Globally if true, this is a hint to the underlying implementation
  /// that it is profitable to do the lookup at a distributed caching level,
  /// not just locally. The implementation is free to ignore this flag.
  Expected<std::optional<CASID>> get(const CacheKey &ActionKey,
                                     bool Globally = false) const {
    return getImpl(arrayRefFromStringRef(ActionKey.getKey()), Globally);
  }

  /// Cache \p Result for the \p ActionKey computation.
  ///
  /// \param Globally if true, this is a hint to the underlying implementation
  /// that it is profitable to record the association at a distributed caching
  /// level, not just locally. The implementation is free to ignore this flag.
  Error put(const CacheKey &ActionKey, const CASID &Result,
            bool Globally = false) {
    assert(Result.getContext().getHashSchemaIdentifier() ==
               getContext().getHashSchemaIdentifier() &&
           "Hash schema mismatch");
    return putImpl(arrayRefFromStringRef(ActionKey.getKey()), Result, Globally);
  }

  virtual ~ActionCache() = default;

protected:
  virtual Expected<std::optional<CASID>> getImpl(ArrayRef<uint8_t> ResolvedKey,
                                                 bool Globally) const = 0;

  virtual Error putImpl(ArrayRef<uint8_t> ResolvedKey, const CASID &Result,
                        bool Globally) = 0;

  ActionCache(const CASContext &Context) : Context(Context) {}

  const CASContext &getContext() const { return Context; }

private:
  const CASContext &Context;
};

/// Create an action cache in memory.
std::unique_ptr<ActionCache> createInMemoryActionCache();

} // end namespace llvm::cas

#endif // LLVM_CAS_CASACTIONCACHE_H
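
The "collision is an error" rule in the `ActionCache` comment can be
illustrated with a standalone sketch. The names below (`ToyActionCache` and its
string-based `get`/`put`) are hypothetical and only model the idea: a key maps
to at most one result, repeating the same association is harmless, and
associating a different result with an existing key is rejected. How any real
implementation reports that error is not shown here.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical toy cache; keys and result IDs are plain strings here.
class ToyActionCache {
  std::map<std::string, std::string> Results; // action key -> result ID

public:
  // A miss is not an error; it just yields an empty optional.
  std::optional<std::string> get(const std::string &ActionKey) const {
    auto It = Results.find(ActionKey);
    if (It == Results.end())
      return std::nullopt;
    return It->second;
  }

  // Returns false on a collision, that is, when the key is already
  // associated with a different result. Actions are assumed to be pure, so
  // the same key should always produce the same result.
  bool put(const std::string &ActionKey, const std::string &ResultID) {
    auto [It, Inserted] = Results.try_emplace(ActionKey, ResultID);
    return Inserted || It->second == ResultID;
  }
};
```

A rejected `put` leaves the original association in place, so later lookups in
this sketch keep returning the first result that was recorded.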
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINCASCONTEXT_H
#define LLVM_CAS_BUILTINCASCONTEXT_H

#include "llvm/CAS/CASID.h"
#include "llvm/Support/BLAKE3.h"
#include "llvm/Support/Error.h"

namespace llvm::cas::builtin {

/// Current hash type for the builtin CAS.
///
/// FIXME: This should be configurable via an enum to allow configuring the hash
/// function. The enum should be sent into \a createInMemoryCAS() and \a
/// createOnDiskCAS().
///
/// This is important (at least) for future-proofing, when we want to make new
/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
///
/// Even just for BLAKE3, it would be useful to have these values:
///
///     BLAKE3     => 32B hash from BLAKE3
///     BLAKE3_16B => 16B hash from BLAKE3 (truncated)
///
/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
///
/// Motivation for a truncated hash is that it's cheaper to store. It's not
/// clear if we always (or ever) need the full 32B, and for an ephemeral
/// in-memory CAS, we almost certainly don't need it.
///
/// Note that the cost is linear in the number of objects for the builtin CAS,
/// since we're using internal offsets and/or pointers as an optimization.
///
/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
/// a distributed generic hash map to use as an ActionCache. In that scenario,
/// the transitive closure of the structured objects that are the results of
/// the cached actions would need to be serialized into the map, something
/// like:
///
///     "action:<schema>:<key>" -> "0123"
///     "object:<schema>:0123"  -> "3,4567,89AB,CDEF,9,some data"
///     "object:<schema>:4567"  -> ...
///     "object:<schema>:89AB"  -> ...
///     "object:<schema>:CDEF"  -> ...
///
/// These references would be full cost.
using HasherT = BLAKE3;
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

class BuiltinCASContext : public CASContext {
  void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
  void anchor() override;

public:
  /// Get the name of the hash for any table identifiers.
  ///
  /// FIXME: This should be configurable via an enum, with at least the
  /// following values:
  ///
  ///     "BLAKE3"    => 32B hash from BLAKE3
  ///     "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
  ///
  /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
  static StringRef getHashName() { return "BLAKE3"; }
  StringRef getHashSchemaIdentifier() const final {
    static const std::string ID =
        ("llvm.cas.builtin.v2[" + getHashName() + "]").str();
    return ID;
  }

  static const BuiltinCASContext &getDefaultContext();

  BuiltinCASContext() = default;

  static Expected<HashType> parseID(StringRef PrintedDigest);
  static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS);
};

} // namespace llvm::cas::builtin

#endif // LLVM_CAS_BUILTINCASCONTEXT_H
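
The key/value layout sketched in the `HasherT` comment can be made concrete
with a small standalone example. This rests on an assumption: the value
`"3,4567,89AB,CDEF,9,some data"` is read here as a reference count, the
references, the data size, and then the data, which is one plausible
interpretation of the comment and not a documented format. `serializeToyObject`
and `makeToyKey` are hypothetical helper names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: encodes an object as
// "<ref count>,<ref>...,<data size>,<data>". This layout is an assumed
// reading of the example in the header comment.
inline std::string serializeToyObject(const std::vector<std::string> &Refs,
                                      const std::string &Data) {
  std::string Out = std::to_string(Refs.size());
  for (const std::string &Ref : Refs) {
    Out += ',';
    Out += Ref;
  }
  Out += ',';
  Out += std::to_string(Data.size());
  Out += ',';
  Out += Data;
  return Out;
}

// Hypothetical helper: builds a "<kind>:<schema>:<id>" map key, where the
// schema string keeps entries from different hash schemas apart.
inline std::string makeToyKey(const std::string &Kind,
                              const std::string &Schema,
                              const std::string &ID) {
  return Kind + ":" + Schema + ":" + ID;
}
```

Storing the data size before the data means the payload may itself contain
commas without making the encoding ambiguous.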
