Commit 3166cc0

[CAS] Add LLVMCAS library with InMemoryCAS implementation
Add llvm::cas::ObjectStore abstraction and InMemoryCAS as an in-memory CAS
object store implementation.

The ObjectStore models its objects as:
* Content: An array of bytes for the data to be stored.
* Refs: An array of references to other objects in the ObjectStore.

Each CAS object can be identified by a unique ID/hash.

ObjectStore supports the following general actions:
* Expected<ID> store(Content, ArrayRef<Ref>)
* Expected<Ref> get(ID)

It also introduces the following types to interact with a CAS ObjectStore:
* CASID: Hash representation for a CAS object, with its context to help
  print/compare CASIDs.
* ObjectRef: A lightweight ref for an object in the ObjectStore. It is
  implementation defined so it can be optimized for read/store/references
  depending on the implementation.
* ObjectHandle: A CAS-internal lightweight handle to a loaded object in the
  ObjectStore. Underlying data for the object is guaranteed to be available
  and no error handling is required to access data. This is not exposed to
  the users of CAS from ObjectStore APIs.
* ObjectProxy: A proxy for the users of CAS to interact with the data inside
  a CAS object. It bundles an ObjectHandle and an ObjectStore instance.

Differential Revision: https://reviews.llvm.org/D133716
1 parent c3ce218 commit 3166cc0

17 files changed, +2067 -0 lines changed
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# Content Addressable Storage

## Introduction to CAS

Content Addressable Storage, or `CAS`, is a storage system that assigns a
unique address to each piece of data it stores. It is very useful for data
deduplication and creating unique identifiers.

Unlike other kinds of storage systems, such as file systems, a CAS is
immutable. A computation is more reliable to model when its inputs and outputs
are represented as objects stored in a CAS.

The basic unit of the CAS library is a CASObject, which contains:

* Data: arbitrary data
* References: references to other CASObjects

It can be conceptually modeled as something like:

```
struct CASObject {
  ArrayRef<char> Data;
  ArrayRef<CASObject*> Refs;
};
```

This abstraction allows simple composition of CASObjects into a DAG to
represent complicated data structures while still allowing data deduplication.
Note that you can compare two DAGs by just comparing the CASObject hashes of
their root nodes.
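
The root-hash comparison can be illustrated with a small standalone sketch. It
is not the LLVM implementation: `ToyCAS`, `ToyID`, `mix` and `hashNode` are
hypothetical names, and `std::hash` stands in for the cryptographic hash a real
CAS would need. The point is only that a node's ID covers its data and the IDs
of its references, so equal roots imply equal DAGs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy identifier: a 64-bit hash. A real CAS would use a cryptographic hash.
using ToyID = std::uint64_t;

// Mix one value into a running hash (hash_combine-style mixing).
inline ToyID mix(ToyID Seed, ToyID Value) {
  return Seed ^ (Value + 0x9e3779b97f4a7c15ULL + (Seed << 6) + (Seed >> 2));
}

// The ID of a node depends on its data and on the IDs of the nodes it
// references, so the ID of a root summarizes the whole DAG below it.
inline ToyID hashNode(const std::string &Data, const std::vector<ToyID> &Refs) {
  ToyID H = std::hash<std::string>{}(Data);
  H = mix(H, Refs.size());
  for (ToyID Ref : Refs)
    H = mix(H, Ref);
  return H;
}

// Minimal store keyed by ID; storing the same node twice keeps one copy.
class ToyCAS {
public:
  ToyID store(const std::string &Data, const std::vector<ToyID> &Refs = {}) {
    ToyID ID = hashNode(Data, Refs);
    auto Result = Objects.emplace(ID, Node{Data, Refs}); // No-op if present.
    // A 64-bit non-cryptographic hash can collide; a real CAS must not.
    assert(Result.first->second.Data == Data && "toy hash collision");
    (void)Result;
    return ID;
  }

  std::size_t size() const { return Objects.size(); }

private:
  struct Node {
    std::string Data;
    std::vector<ToyID> Refs;
  };
  std::map<ToyID, Node> Objects;
};
```

Building the same tree twice from scratch yields the same root ID and adds no
new entries to the store, which is the deduplication property described above.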



## LLVM CAS Library User Guide

The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
To reference a CASObject, there are a few different abstractions provided
with different trade-offs:

### ObjectRef

`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
This is the most commonly used abstraction and it is cheap to copy and pass
along. It has the following properties:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
  `ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
  compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
  explicit load is required before accessing the data stored in the CASObject.
  This load can also fail, for reasons including but not limited to: the object
  does not exist, the CAS storage is corrupted, the operation timed out, etc.
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
  (if they exist) are identical. If they are not equal, the underlying objects
  are guaranteed to be different.
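
These properties can be sketched with a toy model in plain C++. Everything here
is an assumption made for illustration: `ToyRef`, `ToyStore`, `drop` and `load`
are hypothetical names and do not mirror the real `llvm::cas` API, which reports
failures through `Expected` rather than `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for ObjectRef: a small, cheap-to-copy index that is
// only meaningful to the store that issued it.
struct ToyRef {
  std::size_t Index;
  friend bool operator==(ToyRef L, ToyRef R) { return L.Index == R.Index; }
  friend bool operator!=(ToyRef L, ToyRef R) { return !(L == R); }
};

class ToyStore {
public:
  // Storing identical content twice yields the same ref (deduplication), so
  // ref equality within one store implies object equality.
  ToyRef store(const std::string &Data) {
    auto [It, Inserted] = IndexByData.try_emplace(Data, Objects.size());
    if (Inserted)
      Objects.push_back(Data);
    return ToyRef{It->second};
  }

  // Simulate an object whose data became unavailable (evicted, corrupted, ...).
  void drop(ToyRef Ref) {
    assert(Ref.Index < Objects.size() && "ref was not issued by this store");
    Objects[Ref.Index].reset();
  }

  // Loading is the fallible step; holding a ref alone proves nothing.
  std::optional<std::string> load(ToyRef Ref) const {
    if (Ref.Index >= Objects.size())
      return std::nullopt;
    return Objects[Ref.Index];
  }

private:
  std::map<std::string, std::size_t> IndexByData;
  std::vector<std::optional<std::string>> Objects;
};
```

Because a `ToyRef` is just an index, passing one to a different `ToyStore` would
either fail to load or silently name an unrelated object, which is why refs from
different stores must not be mixed.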

### ObjectProxy

`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
underlying stored data and references can be accessed without the need
for error handling. The class APIs also provide convenient methods to
access the underlying data. The lifetime of the underlying data is equal to
the lifetime of the `ObjectStore` instance unless explicitly copied.

### CASID

`CASID` is the hash identifier for CASObjects. It owns the underlying
storage for the hash value, so it can be expensive to copy and compare
depending on the hash algorithm. `CASID` is generally only useful in rare
situations like printing the raw hash value or exchanging hash values between
different CAS instances with the same hashing schema.

### ObjectStore

`ObjectStore` is the CAS-like object storage. It provides APIs to save
and load CASObjects, for example:

```
ObjectRef A, B, C;
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
```

It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
`CASID`.



## CAS Library Implementation Guide

The LLVM ObjectStore APIs are designed so that it is easy to add
customized CAS implementations that are interchangeable with the builtin
CAS implementations.

To add your own implementation, you just need to add a subclass of
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with the LLVM ObjectStore, the new CAS implementation
needs to conform to the following contracts:

* Different CASObjects stored in the ObjectStore need to have different hashes
  and result in different `ObjectRef`s. Conversely, the same CASObject should
  have the same hash and the same `ObjectRef`. Note that two CASObjects with
  identical data but different references are considered different objects.
* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can
  be used to determine the equality of the underlying CASObjects.
* The objects loaded from the ObjectStore need to have a lifetime at least as
  long as the ObjectStore itself.

If not specified, the behavior can be implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so that
`ObjectStore` never fails to load. It is also legal to use a stricter model
than required. For example, an `ObjectRef` that can be used to compare
objects between different `ObjectStore` instances is legal, but users
of the ObjectStore should not depend on this behavior.

For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
the ObjectStore that knows about the loaded CASObject.
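
As a rough illustration of the three contracts, here is a toy abstract store
with one in-memory implementation. This is a sketch under stated assumptions:
`ToyObjectStore`, `ToyInMemoryStore` and `ToyObject` are hypothetical, refs are
modeled as plain indices, and the real `llvm::cas::ObjectStore` has a different
and larger set of virtual methods.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical object layout: data plus refs issued by the same store.
struct ToyObject {
  std::string Data;
  std::vector<std::size_t> Refs;
};

// Hypothetical, simplified interface that implementations subclass.
class ToyObjectStore {
public:
  virtual ~ToyObjectStore() = default;
  virtual std::size_t store(std::string Data,
                            std::vector<std::size_t> Refs) = 0;
  // Returns nullptr on failure; the pointer stays valid while the store lives.
  virtual const ToyObject *load(std::size_t Ref) const = 0;
};

class ToyInMemoryStore final : public ToyObjectStore {
public:
  std::size_t store(std::string Data, std::vector<std::size_t> Refs) override {
    for (std::size_t Ref : Refs)
      assert(Ref < Objects.size() && "ref was not issued by this store");
    // Key on data *and* refs: equal data with different refs is a new object.
    auto Key = std::make_pair(Data, Refs);
    auto It = Index.find(Key);
    if (It != Index.end())
      return It->second;
    Objects.push_back(ToyObject{std::move(Data), std::move(Refs)});
    Index.emplace(std::move(Key), Objects.size() - 1);
    return Objects.size() - 1;
  }

  const ToyObject *load(std::size_t Ref) const override {
    return Ref < Objects.size() ? &Objects[Ref] : nullptr;
  }

private:
  // std::deque does not relocate existing elements on push_back, so pointers
  // handed out by load() remain valid for the lifetime of the store.
  std::deque<ToyObject> Objects;
  std::map<std::pair<std::string, std::vector<std::size_t>>, std::size_t> Index;
};
```

The choice of `std::deque` over `std::vector` is what satisfies the lifetime
contract in this sketch: growing the store never moves objects that have
already been loaded.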

llvm/docs/Reference.rst

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,7 @@ LLVM and API reference documentation.
    BranchWeightMetadata
    Bugpoint
    CommandGuide/index
+   ContentAddressableStorage
    ConvergenceAndUniformity
    ConvergentOperations
    Coroutines
@@ -228,3 +229,6 @@ Additional Topics
 :doc:`ConvergenceAndUniformity`
    A description of uniformity analysis in the presence of irreducible
    control flow, and its implementation.
+
+:doc:`ContentAddressableStorage`
+   A reference guide for using LLVM's CAS library.

llvm/include/llvm/CAS/CASID.h

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
//===- llvm/CAS/CASID.h -----------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_CASID_H
#define LLVM_CAS_CASID_H

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMapInfo.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include <cassert>
#include <optional>
#include <string>

namespace llvm {

class raw_ostream;

namespace cas {

class CASID;

/// Context for CAS identifiers.
class CASContext {
  virtual void anchor();

public:
  virtual ~CASContext() = default;

  /// Get an identifier for the schema used by this CAS context. Two CAS
  /// instances should return the same identifier if and only if their
  /// CASIDs are safe to compare by hash. This is used when comparing
  /// CASIDs from different contexts.
  virtual StringRef getHashSchemaIdentifier() const = 0;

protected:
  /// Print \p ID to \p OS.
  virtual void printIDImpl(raw_ostream &OS, const CASID &ID) const = 0;

  friend class CASID;
};

/// Unique identifier for a CAS object.
///
/// Locally, stores an internal CAS identifier that's specific to a single CAS
/// instance. It's guaranteed not to change across the view of that CAS, but
/// might change between runs.
///
/// It also has a \a CASContext pointer to allow comparison of these
/// identifiers. If two CASIDs are from the same CASContext, they can be
/// compared directly. If they are not, then \a
/// CASContext::getHashSchemaIdentifier() is compared to see if they can be
/// compared by hash, in which case the result of \a getHash() is compared.
class CASID {
public:
  void dump() const;
  void print(raw_ostream &OS) const {
    return getContext().printIDImpl(OS, *this);
  }
  friend raw_ostream &operator<<(raw_ostream &OS, const CASID &ID) {
    ID.print(OS);
    return OS;
  }
  std::string toString() const;

  ArrayRef<uint8_t> getHash() const {
    return arrayRefFromStringRef<uint8_t>(Hash);
  }

  friend bool operator==(const CASID &LHS, const CASID &RHS) {
    if (LHS.Context == RHS.Context)
      return LHS.Hash == RHS.Hash;

    // EmptyKey or TombstoneKey.
    if (!LHS.Context || !RHS.Context)
      return false;

    // CASIDs are equal when they have the same hash schema and same hash value.
    return LHS.Context->getHashSchemaIdentifier() ==
               RHS.Context->getHashSchemaIdentifier() &&
           LHS.Hash == RHS.Hash;
  }

  friend bool operator!=(const CASID &LHS, const CASID &RHS) {
    return !(LHS == RHS);
  }

  friend hash_code hash_value(const CASID &ID) {
    ArrayRef<uint8_t> Hash = ID.getHash();
    return hash_combine_range(Hash.begin(), Hash.end());
  }

  const CASContext &getContext() const {
    assert(Context && "Tombstone or empty key for DenseMap?");
    return *Context;
  }

  static CASID getDenseMapEmptyKey() {
    return CASID(nullptr, DenseMapInfo<StringRef>::getEmptyKey());
  }
  static CASID getDenseMapTombstoneKey() {
    return CASID(nullptr, DenseMapInfo<StringRef>::getTombstoneKey());
  }

  CASID() = delete;

  static CASID create(const CASContext *Context, StringRef Hash) {
    return CASID(Context, Hash);
  }

private:
  CASID(const CASContext *Context, StringRef Hash)
      : Context(Context), Hash(Hash) {}

  const CASContext *Context;
  SmallString<32> Hash;
};

/// This is used to work around the issue of MSVC needing default-constructible
/// types for \c std::promise/future.
template <typename T> struct AsyncValue {
  Expected<std::optional<T>> take() { return std::move(Value); }

  AsyncValue() : Value(std::nullopt) {}
  AsyncValue(Error &&E) : Value(std::move(E)) {}
  AsyncValue(T &&V) : Value(std::move(V)) {}
  AsyncValue(std::nullopt_t) : Value(std::nullopt) {}
  AsyncValue(Expected<std::optional<T>> &&Obj) : Value(std::move(Obj)) {}

private:
  Expected<std::optional<T>> Value;
};

} // namespace cas

template <> struct DenseMapInfo<cas::CASID> {
  static cas::CASID getEmptyKey() { return cas::CASID::getDenseMapEmptyKey(); }

  static cas::CASID getTombstoneKey() {
    return cas::CASID::getDenseMapTombstoneKey();
  }

  static unsigned getHashValue(cas::CASID ID) {
    return (unsigned)hash_value(ID);
  }

  static bool isEqual(cas::CASID LHS, cas::CASID RHS) { return LHS == RHS; }
};

} // namespace llvm

#endif // LLVM_CAS_CASID_H
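
The three-step equality rule in `CASID::operator==` can be modeled without any
LLVM dependency. The sketch below is illustrative only: `ToyContext` and
`ToyCASID` are hypothetical stand-ins, the schema is reduced to a plain string,
and the schema names used when exercising it are made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for CASContext: only the schema name matters here.
struct ToyContext {
  std::string Schema;
};

// Hypothetical stand-in for CASID: a context pointer plus raw hash bytes.
struct ToyCASID {
  const ToyContext *Context; // nullptr models a DenseMap empty/tombstone key.
  std::string Hash;

  const ToyContext &getContext() const {
    assert(Context && "sentinel key has no context");
    return *Context;
  }
};

inline bool operator==(const ToyCASID &L, const ToyCASID &R) {
  // Same context (including both null): the hash alone decides.
  if (L.Context == R.Context)
    return L.Hash == R.Hash;
  // Exactly one side is a sentinel key: never equal to a real ID.
  if (!L.Context || !R.Context)
    return false;
  // Different contexts: equal only if both schema and hash match.
  return L.Context->Schema == R.Context->Schema && L.Hash == R.Hash;
}

inline bool operator!=(const ToyCASID &L, const ToyCASID &R) {
  return !(L == R);
}
```

Two IDs from distinct contexts with the same schema and hash compare equal,
while the same hash under a different schema does not, matching the comment in
the header about comparing by hash only when the schemas agree.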
