ColumnString and ColumnFixedString performance fix #29

Enmk · 2020-01-22T08:37:50Z

Improved performance of ColumnFixedString and ColumnString loading (reading values from a binary stream, typically done when reading result values from SELECT) and saving (writing values to a binary stream, typically done when writing input values with INSERT).

This was achieved by:

Reducing the number of allocations (using large blocks of memory to store data) and not pre-filling target memory with zeroes (see std::string::resize) for ColumnString.
Streamlining serialization/deserialization for ColumnFixedString.

Updated the API of ColumnFixedString/ColumnString to return/take std::string_view instead of std::string to allow those optimizations.

Final results, excluding networking and other I/O effects as much as possible by encoding/decoding in-memory buffers corresponding to 1 million items of each kind:

FixedString(8)	OLD time	NEW time	DIFF
Saving:	12023.4	1645.3	-86.32%
Loading:	11508.1	1389.3	-87.93%

String	OLD time	NEW time	DIFF
Saving:	23925.7	23008.1	-3.84%
Loading:	20335.9	14474.1	-28.82%

No client-server protocol modifications were made, since the simplest one, prepending the column with its binary size on the stream, led to only about a 1-2% improvement in load time but was too invasive.

However, in order to achieve the best performance when reading strings from the server, we may want to change the serialization format for strings significantly by writing first the stream of lengths and then a continuous stream of data. That would reduce the operations required from approximately:

(L + S + z) * N (this pr)
to
(L + z) * N + X (two continuous streams: lengths and data)
and even to
Y + X (if lengths are serialized as 64-bit values, but not as VarInts)

where:

N is column size
X is time required to copy string data with memcpy (entire column)
Y is time to copy lengths data with memcpy (entire column)
L is time to decode UInt64_t from VarInt binary representation
S is time to read single string value (boils down to cost of memcpy + dust)
z is loop iteration overhead + other minor factors
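To make the cost terms above concrete, here is a minimal sketch of both layouts. It assumes lengths are encoded as LEB128-style variable-length integers (7 payload bits per byte, high bit set on every byte except the last). The names ReadVarUInt, ReadInterleaved and ReadSplit are hypothetical and are not taken from this PR, and the split layout is only the proposal discussed above, not an existing wire format.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Decodes one variable-length unsigned integer starting at `pos` and advances
// `pos` past it. Returns false on truncated or over-long input. This is "L".
bool ReadVarUInt(std::string_view buf, std::size_t& pos, std::uint64_t& out) {
    out = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (pos >= buf.size()) return false;
        const auto byte = static_cast<unsigned char>(buf[pos++]);
        out |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

// Interleaved layout: [len][bytes][len][bytes]... Every row pays for one
// length decode, one string read and the loop overhead: (L + S + z) * N.
// For brevity the rows are returned as views into `buf` instead of copies.
bool ReadInterleaved(std::string_view buf, std::size_t rows,
                     std::vector<std::string_view>& out) {
    std::size_t pos = 0;
    out.clear();
    out.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        std::uint64_t len = 0;
        if (!ReadVarUInt(buf, pos, len) || len > buf.size() - pos) return false;
        out.push_back(buf.substr(pos, static_cast<std::size_t>(len)));
        pos += static_cast<std::size_t>(len);
    }
    return true;
}

// Hypothetical split layout: all lengths first, then all bytes back to back.
// The loop only decodes lengths, and the data is copied once: (L + z) * N + X.
// The views in `out` point into `data` and stay valid while it is unmodified.
bool ReadSplit(std::string_view buf, std::size_t rows, std::string& data,
               std::vector<std::string_view>& out) {
    std::size_t pos = 0;
    std::vector<std::uint64_t> lens(rows);
    std::uint64_t total = 0;
    for (auto& len : lens) {
        if (!ReadVarUInt(buf, pos, len)) return false;
        if (len > buf.size() || (total += len) > buf.size()) return false;
    }
    if (total > buf.size() - pos) return false;
    data.assign(buf.substr(pos, static_cast<std::size_t>(total)));  // one bulk copy
    out.clear();
    out.reserve(rows);
    std::size_t offset = 0;
    for (const auto len : lens) {
        out.emplace_back(data.data() + offset, static_cast<std::size_t>(len));
        offset += static_cast<std::size_t>(len);
    }
    return true;
}
```

With fixed-width 64-bit lengths, the first loop in ReadSplit would collapse into a single bulk copy as well, which is the Y + X case.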

traceon

Couple of questions are pending.

traceon · 2020-02-03T13:17:19Z

clickhouse/columns/string.cpp

    }

-    return result;
+    return std::move(result);


Are you sure this std::move is the right thing to do here? Wouldn't it prevent the "natural" copy elision? Take a look at this: https://stackoverflow.com/questions/17473753/c11-return-value-optimization-or-move

Hmmm, let me check

good find, thank you!
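For readers following this thread, a small sketch of the difference. The Tracked type and both functions are hypothetical and exist only to count constructor calls; they are not part of this PR. Compilers may warn about the second function (for example with -Wpessimizing-move), which is the point of the example.

```cpp
#include <utility>

// Instrumented type that counts how often it is copied or moved.
struct Tracked {
    static inline int copies = 0;
    static inline int moves = 0;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
};

// Plain return: the compiler may construct `result` directly in the caller's
// storage (NRVO). If it does not, `result` is moved, never copied.
Tracked MakePlain() {
    Tracked result;
    return result;
}

// Explicit std::move: the returned expression is no longer just the name of a
// local object, so NRVO cannot apply and a move construction always happens.
Tracked MakeMoved() {
    Tracked result;
    return std::move(result);
}
```

Calling MakeMoved() always increments Tracked::moves, while MakePlain() typically leaves both counters untouched on mainstream optimizing compilers, although that elision is permitted rather than required.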

traceon · 2020-02-03T13:54:22Z

clickhouse/columns/string.cpp

-        }
-
-        data_.push_back(std::move(s));
+    data_.resize(string_size_ * rows);


The original code suggests appending to data_ (i.e., as a reader, I'll assume that there already could be some data in data_), the new code writes from the beginning as if there is no data in data_ ever, at this point. Pointing this out to just get a confirmation, that this new behavior is correct conceptually.

Also, resizing a string assumes implicit zero-filling, which could become significant, but I am not sure there is a quick and easy way to avoid it.

Unfortunately there is no easy way (not even a hard way) to avoid pre-filling std::string, and that is why I use Block for ColumnString below.

As for the assumptions: the rest of the column types assume the opposite and erase previous data on load, so I had to unify the behaviors.

Right, but here, that zero-filling still takes place.

If not now, maybe in future, this could be used. Looks like the header itself is standalone:
https://github.com/facebook/folly/blob/master/folly/memory/UninitializedMemoryHacks.h

I guess this is not a big problem for FixedString, but yes, we might want to use Block here too. That facebook hack...
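A minimal sketch of the load behavior discussed in this thread, assuming a column that keeps all fixed-size rows back to back in one std::string. FixedStringColumnSketch and its members are hypothetical names, not the actual class from clickhouse/columns/string.cpp.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical stand-in for a FixedString(N) column: row i occupies bytes
// [i * N, (i + 1) * N) of a single contiguous buffer.
class FixedStringColumnSketch {
public:
    explicit FixedStringColumnSketch(std::size_t string_size)
        : string_size_(string_size) {}

    // Replaces (does not append to) the column contents with `rows` values
    // taken from the front of `input`. Returns false if `input` is too short.
    bool Load(std::string_view input, std::size_t rows) {
        // Division-based check also guards against overflow of N * rows.
        if (string_size_ != 0 && rows > input.size() / string_size_) return false;
        const std::size_t bytes = string_size_ * rows;
        // resize() zero-fills any bytes it adds, and memcpy then overwrites
        // them. That redundant fill is the cost mentioned in the review.
        data_.resize(bytes);
        if (bytes != 0) std::memcpy(data_.data(), input.data(), bytes);
        return true;
    }

    // Returns a non-owning view of one row; valid until the next Load().
    std::string_view At(std::size_t row) const {
        return std::string_view(data_.data() + row * string_size_, string_size_);
    }

    std::size_t Size() const {
        return string_size_ != 0 ? data_.size() / string_size_ : 0;
    }

private:
    std::size_t string_size_;
    std::string data_;
};
```

Loading a second time discards the earlier rows, which matches the "erase previous data" behavior described above for the other column types.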

traceon · 2020-02-03T16:56:20Z

clickhouse/columns/string.cpp

 }

-void ColumnString::Append(const std::string& str) {
-    data_.push_back(str);
+ColumnString::~ColumnString()


This looks like an unused leftover from initial edits?

Nope, just a trick to explicitly instantiate d-tor (and hence d-tor of std::unique_ptr<Block[]>) in the context where Block is defined. Otherwise said unique_ptr fails to compile.

Hm, strange. Doesn't ~ColumnString() = default help?

IIRC that does not prescribe where the d-tor is instantiated, which might be anywhere, like in every other translation unit.
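A single-file sketch of the pattern, with hypothetical names (ColumnSketch, BlockSketch) standing in for the real classes. The destructor is only declared where the block type is incomplete and is defined after the type is complete. Writing `= default` on that out-of-line definition works; writing it inside the class body would make the destructor inline, so it would be defined implicitly wherever it is used, possibly where the block type is still incomplete.

```cpp
#include <cstddef>
#include <memory>

// --- what the header would contain -----------------------------------------
struct BlockSketch;  // forward declaration only: incomplete type at this point

class ColumnSketch {
public:
    explicit ColumnSketch(std::size_t blocks);
    ~ColumnSketch();  // declared here, defined where BlockSketch is complete

    std::size_t BlockCount() const { return count_; }

private:
    std::size_t count_;
    std::unique_ptr<BlockSketch[]> blocks_;
};

// --- what the source file would contain -------------------------------------
struct BlockSketch {
    char payload[64];
};

// The constructor is out of line for the same reason: it can destroy
// `blocks_` if construction fails, so it also needs the complete type.
ColumnSketch::ColumnSketch(std::size_t blocks)
    : count_(blocks), blocks_(new BlockSketch[blocks]) {}

// The deleter of std::unique_ptr<BlockSketch[]> is instantiated here.
ColumnSketch::~ColumnSketch() = default;
```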

traceon · 2020-02-03T17:28:24Z

clickhouse/columns/string.cpp

+
+    size_t size;
+    const size_t capacity;
+    std::unique_ptr<CharT[]> data_;


Why not std::string? With its capacity and size management.

I wish I could use std::string here, the problem is that there is no way of pre-allocating buffer without pre-filling it with data.
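A rough sketch of what such a block could look like, mirroring the three members in the diff above. CharBlockSketch and TryAppend are hypothetical names, and the real Block in this PR may differ. Note that `new char[n]` leaves the bytes uninitialized, whereas `std::make_unique<char[]>(n)` would zero them.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>

// Hypothetical fixed-capacity byte block. Strings are appended back to back
// and handed out as non-owning views, so one allocation serves many rows.
struct CharBlockSketch {
    explicit CharBlockSketch(std::size_t cap)
        : size(0), capacity(cap), data_(new char[cap]) {}  // not zero-filled

    std::size_t Available() const { return capacity - size; }

    // Copies `s` into the unused tail and stores a view of the copy in
    // `stored`. Returns false, leaving the block unchanged, if `s` does not
    // fit. Views remain valid for the lifetime of the block.
    bool TryAppend(std::string_view s, std::string_view& stored) {
        if (s.size() > Available()) return false;
        char* const dst = data_.get() + size;
        if (!s.empty()) std::memcpy(dst, s.data(), s.size());
        size += s.size();
        stored = std::string_view(dst, s.size());
        return true;
    }

    std::size_t size;
    const std::size_t capacity;
    std::unique_ptr<char[]> data_;
};
```

A column built on this idea would presumably keep a list of such blocks and start a new one when TryAppend fails, but that part is an assumption and is not shown in the diff. In C++20, std::make_unique_for_overwrite<char[]>(n) offers the same uninitialized allocation.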

traceon

Questions answered.

Enmk added 3 commits January 22, 2020 11:06

Perf tests for ColumnString and ColumnFixedString, some minor fixes to other things

4f55319

Performance fixes for ColumnString and ColumnFixedString

c2dd41b

Fixed ClientCase fixture not to crash tests if server is offline

9280c21

Enmk changed the title ~~ColumnString and ColumnFixedString performance fix~~ WIP: ColumnString and ColumnFixedString performance fix Jan 22, 2020

Enmk added 2 commits January 28, 2020 14:08

Reverted notion of size_hint since it does not brings any significant improvements.

591fe15

Minor fixes: empty lines and comments.

0ec5bfd

Enmk changed the title ~~WIP: ColumnString and ColumnFixedString performance fix~~ ColumnString and ColumnFixedString performance fix Jan 28, 2020

Enmk requested a review from alexey-milovidov January 29, 2020 00:30

Cleanup: removed unused code and outdated comments

5eea214

traceon suggested changes Feb 4, 2020

removed excess std::move

367958f

traceon approved these changes Feb 4, 2020

Enmk merged commit 7c67c6d into ClickHouse:master Feb 4, 2020
