
gix-status-improvements #1030


Merged
merged 23 commits into from
Oct 5, 2023

Conversation

Byron
Member

@Byron Byron commented Sep 23, 2023

Based on #1021

Improve gix status to the point where it's suitable for use in reset functionality.
This leads to a proper worktree reset implementation, and eventually to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't carry modifications and aren't untracked files. In order to know what would need to be done, we have to diff the current-index with the target-index. The resulting set of files to touch can then be used to look up information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. This is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what would otherwise be a very complex algorithm. Having a set of changes as output also makes it possible to one day parallelize applying these changes.
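To make the two steps concrete, here is a minimal sketch of the idea, not the gix implementation. All names in it (`Index`, `Change`, `Status`, `diff_indices`, `blockers`) are hypothetical and exist only for illustration; an index is reduced to a map from path to object id, and the reset-modes are left out entirely.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, heavily simplified index: path -> object id (as hex string).
type Index = BTreeMap<String, String>;

/// What a reset would have to do to a single path in the worktree.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Change {
    Add,
    Remove,
    Modify,
}

/// Hypothetical summary of what a status run has found.
#[derive(Default)]
struct Status {
    worktree_modified: BTreeSet<String>,
    index_modified: BTreeSet<String>,
    untracked: BTreeSet<String>,
}

/// Step 1: diff the current index with the target index to learn which paths to touch.
fn diff_indices(current: &Index, target: &Index) -> BTreeMap<String, Change> {
    let mut changes = BTreeMap::new();
    for (path, target_id) in target {
        match current.get(path) {
            None => {
                changes.insert(path.clone(), Change::Add);
            }
            Some(current_id) if current_id != target_id => {
                changes.insert(path.clone(), Change::Modify);
            }
            Some(_) => {} // identical in both indices, nothing to do
        }
    }
    for path in current.keys() {
        if !target.contains_key(path) {
            changes.insert(path.clone(), Change::Remove);
        }
    }
    changes
}

/// Step 2: look up each path to touch in the status information.
/// Any path returned here would be clobbered by the reset.
fn blockers(changes: &BTreeMap<String, Change>, status: &Status) -> Vec<String> {
    changes
        .keys()
        .filter(|path| {
            status.worktree_modified.contains(*path)
                || status.index_modified.contains(*path)
                || status.untracked.contains(*path)
        })
        .cloned()
        .collect()
}
```

In this sketch a reset would only proceed if `blockers()` returns an empty list. Since `diff_indices()` produces a plain set of changes, applying them could later be split across threads without changing either step.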

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

On the way to reset() it's a valid choice to warm up more with the matter by improving on the current gix status implementation and assuring the correctness of what's there, which currently doesn't seem to be the case when compared with git. Further, implementing gix status similarly to git status should be made possible.

Tasks

  • correctness: symlink check (see this commit for motivation)
  • add submodule support to index-as-worktree in gix-status - changes to the entry itself should be detected (i.e. must be dir)
  • provide statistics via Outcome (helps to determine early aborts)
  • allow early aborts (for is_dirty() checks for instance)
  • handle filters: comparing worktree changes with odb requires a filter pipeline to 'clean' it. Make sure worktree files can be streamed. Needs actual gix-filter due to ownership/streaming semantics
  • stream all files (leverage 'len' of metadata)
  • why does git still think that an entry is changed even after we have updated it? Use index debug mode?
    • It's the CTime - gix seems to not set it correctly, even though it should be able to. This is why git thinks the file changed, it trusts the CTime by default.
  • figure out why filters don't work exactly as expected, see [warn]: in the working copy of './Source/ThirdParty/ANGLE/src/android_system_settings/res/drawable/icon.png', CRLF will be replaced by LF next time git touches it. This may be due to undoing a filter which isn't applied in the first place?
  • fix 'attributes' performance bottleneck - only get attributes when needed. Unfortunately it still doesn't manage to be fast with 20 threads, as if something else is going wrong. Maybe this has to be even closer to what git does… no chunking on MacOS maybe. Try some other time… it's rather involved but worked (even though it wasn't faster). Maybe it needs to be implemented exactly like git?
  • add tests with newly created index that needs to calculate the correct hash as the time is off/not present
  • optionally include unmodified like git2 can do? Or is there a way to implement it on top maybe by providing indices? Indices are now provided with each listed entry, which allows to both enrich conflict information and list unchanged statuses efficiently.
  • what about submodule paths with conflicts in them? Conflict handling should work just the same for all types of entries
  • gix status with statistics
  • handle stage-mask similar to git, making the algorithm a two-stage process. This is also related to rename-tracking, which would need access to all entries to try to find matches.
  • keep track of postponed features in gix-status in crates-status.md
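The early-abort and statistics tasks above can be pictured with a small sketch. It is an assumption-laden illustration rather than the gix-status API: `EntryStatus`, `Outcome`, `visit_entries`, and `is_dirty` are hypothetical names, and the per-entry comparison is assumed to have happened already.

```rust
use std::ops::ControlFlow;

/// Hypothetical result of comparing one index entry with the worktree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EntryStatus {
    Unchanged,
    Modified,
}

/// Hypothetical statistics, loosely in the spirit of the `Outcome` mentioned above.
#[derive(Debug, Default, PartialEq, Eq)]
struct Outcome {
    entries_seen: usize,
    aborted_early: bool,
}

/// Visit entries in order and stop as soon as the visitor asks to break.
fn visit_entries(
    entries: &[EntryStatus],
    mut visit: impl FnMut(EntryStatus) -> ControlFlow<()>,
) -> Outcome {
    let mut outcome = Outcome::default();
    for &status in entries {
        outcome.entries_seen += 1;
        if visit(status).is_break() {
            outcome.aborted_early = true;
            break;
        }
    }
    outcome
}

/// An `is_dirty()`-style check only needs the first modification, then it can stop.
fn is_dirty(entries: &[EntryStatus]) -> (bool, Outcome) {
    let mut dirty = false;
    let outcome = visit_entries(entries, |status| {
        if status == EntryStatus::Modified {
            dirty = true;
            ControlFlow::Break(())
        } else {
            ControlFlow::Continue(())
        }
    });
    (dirty, outcome)
}
```

The statistics double as a way to confirm that the abort actually happened: with three entries of which the second is modified, `entries_seen` is 2 and `aborted_early` is set.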

Next PR

  • status in gix crate
  • diff index with index to learn what we would want to do in the worktree
  • reset() that checks if it's allowed to perform a worktree modification, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • what about index/worktree rename tracking? git2 can do that. Needs generalization of what's available for tree/tree diffs, at least learn from it.
  • gix status with actual submodule support - needs status in gix (crate) effectively
  • gix status with actual conflict support
  • a way to obtain untracked files to learn if changes can be made

Limitations

  • It seems that when CTime is newer than MTime, the Rust std implementation sets CTime to MTime. This causes us to do extra work and 'fight' git: we write an index with the normalized CTime, but git rewrites it the next time it runs. This can be fixed with core.trustCTime=false
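Why the normalized CTime matters can be shown with a simplified stat comparison. This is a sketch under stated assumptions, not how gix or git compare entries: `Stat` and `stat_matches` are hypothetical, timestamps are reduced to whole seconds, and only three fields are considered.

```rust
/// Hypothetical subset of the stat information stored with an index entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Stat {
    mtime_secs: u64,
    ctime_secs: u64,
    size: u64,
}

/// Returns true if the worktree file still looks like the index entry.
/// With `trust_ctime` disabled, differences in CTime are ignored.
fn stat_matches(index: Stat, worktree: Stat, trust_ctime: bool) -> bool {
    index.mtime_secs == worktree.mtime_secs
        && index.size == worktree.size
        && (!trust_ctime || index.ctime_secs == worktree.ctime_secs)
}
```

If the index stores a CTime that was normalized to the MTime while the file on disk reports a newer CTime, the comparison fails whenever CTime is trusted, so the entry looks changed. Turning that trust off makes the same pair compare as equal.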

Research

  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications: the former changes HEAD's referent, the latter changes HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth maintaining two versions because of that. If so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking.
  • for git status: How to deal with detailed conflict messages? Right now we only know if there is a conflict or not and it seems we would need access to the other entries (or condense that knowledge to be status-suitable).
  • submodule states
    • rm -Rf dir -> deleted dir
    • touch dir -> typechange
    • rm dir && mkdir dir -> no change
    • rm -Rf .git/modules/dir -> no change
    • otherwise the submodule HEAD can be compared to the desired HEAD in the superproject, along with fine-grained stats similar to what git status can detect: sha1collisiondetection (new commits, modified content, untracked content)
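The first three submodule states listed above only depend on what is found at the submodule's path, which a short sketch can express. It is an illustration using only the standard library, with hypothetical names (`SubmoduleEntryChange`, `classify_submodule_path`); comparing the submodule HEAD with the one recorded in the superproject is deliberately left out.

```rust
use std::{fs, io, path::Path};

/// Hypothetical classification of a submodule entry, looking only at its path on disk.
#[derive(Debug, PartialEq, Eq)]
enum SubmoduleEntryChange {
    /// A directory exists, e.g. after `rm dir && mkdir dir`.
    Unchanged,
    /// Nothing exists at the path, e.g. after `rm -Rf dir`.
    Deleted,
    /// Something that isn't a directory exists, e.g. a file created with `touch dir`.
    TypeChange,
}

fn classify_submodule_path(path: &Path) -> io::Result<SubmoduleEntryChange> {
    // `symlink_metadata` doesn't follow symlinks, so a symlink counts as a type change.
    match fs::symlink_metadata(path) {
        Ok(meta) if meta.is_dir() => Ok(SubmoduleEntryChange::Unchanged),
        Ok(_) => Ok(SubmoduleEntryChange::TypeChange),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(SubmoduleEntryChange::Deleted),
        Err(err) => Err(err),
    }
}
```

Because only the path itself is inspected, removing `.git/modules/dir` doesn't affect the result, which matches the "no change" case listed above.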

@Byron Byron force-pushed the reset branch 13 times, most recently from fa7fef7 to ba307f5 Compare September 28, 2023 20:59
That way it's possible to hash entire files as object.
Previously it wasn't possible to read more than u32::MAX bytes
on 32 bit systems, even though we are streaming the data.
@Byron Byron force-pushed the reset branch 9 times, most recently from af488c1 to 8d76af7 Compare October 4, 2023 05:30
Byron added 5 commits October 4, 2023 09:29
Previously, submodules were ignored. Now they are treated correctly
as 'directory' which is compared to what's in the worktree.

We also simplify blob handling.
That way, `is_dirty()` scenarios can be done without wasting too much time.
Byron added 7 commits October 4, 2023 09:30
It adds `Stack::from_state_and_ignore_case()` as utility to more easily instantiate
a stack that is configured correctly.
This also removes the `stack::State::for_status()` method as it's not actually
suitable for status retrieval per se.
…onvert::to_git::IndexObjectFn()`.

It implies that one has to be ready to fetch any kind of path from the index, even though it's always the path to
the file that is currently converted.

Also fix a bug that could cause it to return input as unchanged even though it was read into a buffer already.
…bit systems.

Previously, files larger than 4GB weren't supported, which caused problems when
generating hashes even when streaming data.
This is important as it allows streaming reads from the worktree, making it possible to
correctly change, for example, `git-lfs` files back into their manifests,
and to arrive at the correct hash.
That way it's possible to look up other, surrounding entries in case
of conflicts or easily find entries that didn't change.
@Byron Byron force-pushed the reset branch 2 times, most recently from 0de13de to 7e82b92 Compare October 5, 2023 10:47
Byron added 7 commits October 5, 2023 13:23
We also adjust the returned data structure to allow the input to be immutable,
which delegates entry updates to the caller.

This also paves the way for rename tracking, which requires free access to entries
for searching renames among the added and removed items, and/or copies among the added ones.
This is useful if a missing index should mean it's empty.
…n what happened.

This is useful for understanding performance characteristics in detail.
This codepath was never tested and its function is more subtle than one could have known.
Also fix incorrect configuration handling which could lead to binary files with `text=auto`
to be identified as text, which would then require conversion.
This prevents expensive operations from reoccurring.
It seems to work now, but let's keep an eye on it.
It seems windows now has a windows-unspecific `echo` program
and one can't really rely on it producing windows style newlines.

Now we use printf which is more standard and can be used to validate
multiple arguments as well.
@Byron Byron merged commit b842691 into main Oct 5, 2023
@Byron Byron deleted the reset branch October 5, 2023 13:27
@Byron Byron mentioned this pull request Oct 5, 2023
14 tasks