Skip to content
This repository was archived by the owner on Sep 11, 2020. It is now read-only.

difftree: merkletrie internal package with iterator #133

Merged
merged 1 commit into from
Nov 23, 2016
Merged

difftree: merkletrie internal package with iterator #133

merged 1 commit into from
Nov 23, 2016

Conversation

alcortesm
Copy link
Contributor

@alcortesm alcortesm commented Nov 22, 2016

This patch is the first in a series of patches to make our current implementation of difftree more efficient. The current version is slow for repositories with lots of files, see https://github.com/alcortesm/git-diff-tree-benchmarks/blob/master/README.md for details.

The plan is to improve difftree performance by taking advantage of git trees being Merkle trees, and skip comparing whole directories by just checking that their hash has not change.

This patch is the first step towards that, it introduces a new merkletrie package with support for Radix-Merkle trees and its comparison.

Currently only the tree iterator is implemented, the comparison will come in a future patch. For the iterator to be useful for the future comparison implementation it has to be able to go into directories or skip them completely, depending on what the user wants in each moment.

How to review this PR:

  • read the new doc.go file, refresh merkle trees and radix trees on wikipedia if needed.
  • read noder.go, this is, the node interface for the tree. It should be easy in the future to change git.Tree to implement this interface.
  • read node.go and node_test.go for a simple implementation of Noder that helps me test the package.
  • read frame.go and frame_test.go to learn how to represent siblings in a nice way.
  • read the iter.go for the iterator. Warning: It does heavy use of nested stacks.
  • read iter_test.go and iter_fixtures.go for some combinatorial forest fun.

Notes: please take special attention to:

  • the iterator implementation (which is the complicated part)
  • the iterator tests: test coverage is ok (98%) but test legibility is aweful. I can improve test legibility a lot in about 4 o 5 more hours, just let me know if it is worth it.

@mcuadros
Copy link
Contributor

Great job!

The Noder interface doesn't fit the TreeEntry entity, how you plan to use this with a real Tree? If you plan to have a translator, this will affect the performance?

Maybe you can you modify this to work directly with Tree and create a fixture generator form the current fixtures?

I know you said that the main reason to do this Noder interface was having easy testing, if the performance is good, compare to a direct implementation, I think is ok, actually more than ok. If not maybe worth to do a implementation for our Tree entity, or at least with a interface that fits the struct

@alcortesm
Copy link
Contributor Author

alcortesm commented Nov 23, 2016

Thank you very much @mcuadros .

You raised a very good point. I have not yet decided how I'm going to implement this. As you have already introduced, there are two ways:

A) Make git.Tree implement merkletrie.Noder.
B) Make a "translator" from one type to another.

Both ways are going to be equally efficient as both of them do the same amount of work. In all the cases the children creation is deferred until it is needed, so the tree will be built (and destroyed) at the same time as the iterator goes over it. The rest of fields needed by the Noder are already in the Tree so we will not waste any time, nor memory.

So, in my opinion is a matter of taste and style.

Solution A introduces some new functionality to git.Tree, which is already quite crowded. On the bright side, the new methods are pretty simple to understand just from their name, so the end result would be a type with more methods, but with the same underlying concepts and complexity it has now.

Solution B helps to keep responsibilities separated, but also hides the relationship between git.Trees and merkletrie.Noders. Also, if we want to be coherent in the future, new types "implementing" merkletrie.Noder should also have translators instead, which might be sub-optimal.

Unrelated question: the Noder.Hash() method might change in the future to a func Noder.IsDifferentTo(other Noder) bool to accommodate using modification dates for checking directory equality.

Copy link
Collaborator

@smola smola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

package merkletrie

// The Noder interface is implemented by the elements of a Merkle Trie.
type Noder interface {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be in utils/iterators, but maybe we can wait until we want to actually use it for other algorithms (e.g. commit walk, tree walk).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree.

I'll probably move it in the next patch, after having used it a little bit for some real work.

@alcortesm alcortesm merged commit ccf5bb1 into src-d:master Nov 23, 2016
@alcortesm alcortesm deleted the diff-tree-must-ignore-unmodified-trees branch November 23, 2016 14:20
@alcortesm alcortesm restored the diff-tree-must-ignore-unmodified-trees branch November 23, 2016 14:20
@alcortesm alcortesm deleted the diff-tree-must-ignore-unmodified-trees branch November 23, 2016 14:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants