Skip to content

Commit 30bc61f

Browse files
author
Matthew Sackman
committed
wrote a short essay
1 parent 8e0df68 commit 30bc61f

File tree

1 file changed

+80
-1
lines changed

1 file changed

+80
-1
lines changed

src/rabbit_disk_queue.erl

Lines changed: 80 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@
5959

6060
-define(SERVER, ?MODULE).
6161

62-
-record(dqstate, {msg_location, % where are messages?
62+
-record(dqstate, {msg_location, %% where are messages?
6363
file_summary, %% what's in the files?
6464
sequences, %% next read and write for each q
6565
current_file_num, %% current file name as number
@@ -71,6 +71,85 @@
7171
read_file_handles_limit %% how many file handles can we open?
7272
}).
7373

74+
%% The components:
75+
%%
76+
%% MsgLocation: this is a dets table which contains:
77+
%% {MsgId, RefCount, File, Offset, TotalSize}
78+
%% FileSummary: this is an ets table which contains:
79+
%% {File, ValidTotalSize, ContiguousTop, Left, Right}
80+
%% Sequences: this is an ets table which contains:
81+
%% {Q, ReadSeqId, WriteSeqId}
82+
%% rabbit_disk_queue: this is an mnesia table which contains:
83+
%% #dq_msg_loc { queue_and_seq_id = {Q, SeqId},
84+
%% is_delivered = IsDelivered,
85+
%% msg_id = MsgId
86+
%% }
87+
%%
88+
89+
%% The basic idea is that messages are appended to the current file up
90+
%% until that file becomes too big (> file_size_limit). At that point,
91+
%% the file is closed and a new file is created on the _right_ of the
92+
%% old file which is used for new messages. Files are named
93+
%% numerically ascending, thus the file with the lowest name is the
94+
%% eldest file.
95+
%%
96+
%% We need to keep track of which messages are in which files (this is
97+
%% the MsgLocation table); how much useful data is in each file and
98+
%% which files are on the left and right of each other. This is the
99+
%% purpose of the FileSummary table.
100+
%%
101+
%% As messages are removed from files, holes appear in these
102+
%% files. The field ValidTotalSize contains the total amount of useful
103+
%% data left in the file, whilst ContiguousTop contains the amount of
104+
%% valid data right at the start of each file. These are needed for
105+
%% garbage collection.
106+
%%
107+
%% On publish, we write the message to disk, record the changes to
108+
%% FileSummary and MsgLocation, and, should this be either a plain
109+
%% publish, or followed by a tx_commit, we record the message in the
110+
%% mnesia table. Sequences exists to enforce ordering of messages as
111+
%% they are published within a queue.
112+
%%
113+
%% On delivery, we read the next message to be read from disk
114+
%% (according to the ReadSeqId for the given queue) and record in the
115+
%% mnesia table that the message has been delivered.
116+
%%
117+
%% On ack we remove the relevant entry from MsgLocation, update
118+
%% FileSummary and delete from the mnesia table.
119+
%%
120+
%% In order to avoid extra mnesia searching, we return the SeqId
121+
%% during delivery which must be returned in ack - it is not possible
122+
%% to ack from MsgId alone.
123+
124+
%% As messages are ack'd, holes develop in the files. When we discover
125+
%% that either a file is now empty or that it can be combined with the
126+
%% useful data in either its left or right file, we compact the two
127+
%% files together. This keeps disk utilisation high and aids
128+
%% performance.
129+
%%
130+
%% Given the compaction between two files, the left file is considered
131+
%% the ultimate destination for the good data in the right file. If
132+
%% necessary, the good data in the left file which is fragmented
133+
%% throughout the file is written out to a temporary file, then read
134+
%% back in to form a contiguous chunk of good data at the start of the
135+
%% left file. Thus the left file is garbage collected and
136+
%% compacted. Then the good data from the right file is copied onto
137+
%% the end of the left file. MsgLocation and FileSummary tables are
138+
%% updated.
139+
%%
140+
%% On startup, we scan the files we discover, dealing with the
141+
%% possibilites of a crash have occured during a compaction (this
142+
%% consists of tidyup - the compaction is deliberately designed such
143+
%% that data is duplicated on disk rather than risking it being lost),
144+
%% and rebuild the dets and ets tables (MsgLocation, FileSummary,
145+
%% Sequences) from what we find. We ensure that the messages we have
146+
%% discovered on disk match exactly with the messages recorded in the
147+
%% mnesia table.
148+
149+
%% MsgLocation is deliberately a dets table, and the mnesia table is
150+
%% set to be a disk_only_table in order to ensure that we are not RAM
151+
%% constrained.
152+
74153
%% ---- PUBLIC API ----
75154

76155
start_link(FileSizeLimit, ReadFileHandlesLimit) ->

0 commit comments

Comments
 (0)