-
Notifications
You must be signed in to change notification settings - Fork 45
Implement approximate mode #440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Initial performance numbers are encouraging. I generated a video with the following command:
That produced a 141 MB video file. I then ran our standard benchmark with:
And that yields:
Some explanations of what the options are:
For the full matrix, we could change the seek modes for the core options. Approximate mode is basically meeting our performance expectations here. |
AVCodecContext* codecContext = avcodec_alloc_context3(codec); | ||
codecContext->thread_count = options.ffmpegThreadCount.value_or(0); | ||
TORCH_CHECK(codecContext != nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this TORCH_CHECK
was almost certainly optimized out. Because the check happens after a use, the compiler is allowed to infer that the pointer can never be null, so codecContext != nullptr
can just become true
. The check itself will then be removed as dead code. Null checks must happen before the first use.
Ready for review. Some of the high-level things to agree one that I would ordinarily have written up in a design:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting!! I made a first pass below.
Is there a better name than seek_mode?
Do "exact" and "approximate" convey what we want, or are there better terms to use?
I think your choice of parametrization is good. I have a comment below related to that with get_video_metadata
but for the decoder I think this is good.
As I called out in my summary, this PR makes "exact" the default at all levels. I think there's a valid argument to make the Python VideoDecoder default to "exact", but the Python core API and the C++ layer to default to "approximate".
Consistency across APIs seems more important, so I'm happy with making "exact" the default at all levels. Apart for VideoDecoder
where the choice was deliberate, I don't think we had a strong reasoning towards the choice of default mode in the other APIs (like the C++ APIs being closer to "approximate" by default), so I wouldn't index on that too much.
} | ||
} | ||
|
||
int64_t VideoDecoder::getFramesSize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function and the ones below seem to implement a similar logic to the one we have on the Python's side (within the @properties
). This shouldn't be done in this PR but longer term, do you think it could make sense to aggressively unify these logics and have only one source of truth (in C++)? This is related to #125.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, yes. The C++ code predates the abstractions we built in the public API on the Python side, and I think it's a good idea to try to unify them.
TORCH_CHECK( | ||
frameIndex >= 0 && frameIndex < stream.allFrames.size(), | ||
frameIndex >= 0 && frameIndex < framesSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[warning: not very fleshed out thoughts below]
In approximate mode, do we actually want to check for frameIndex < framesSize
? That translates to checking that frameIndex < numFrames[FromHeader]
which relies on the metadata being somewhat correct.
Assuming the metadata is wrong or inexistant, this check would error loudly, when perhaps such an index is in fact valid and could be returned?
Assuming we don't do the check in approximate mode, what's the worst thing that can happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the worst thing that could happen is that we hit an error when we call avformat_seek_file()
in maybeSeekToBeforeDesiredPts()
:
torchcodec/src/torchcodec/decoders/_core/VideoDecoder.cpp
Lines 729 to 740 in d8dde5c
int ffmepgStatus = avformat_seek_file( | |
formatContext_.get(), | |
firstStreamInfo.streamIndex, | |
INT64_MIN, | |
desiredPts, | |
desiredPts, | |
0); | |
if (ffmepgStatus < 0) { | |
throw std::runtime_error( | |
"Could not seek file to pts=" + std::to_string(desiredPts) + ": " + | |
getFFMPEGErrorStringFromErrorCode(ffmepgStatus)); | |
} |
That's not terrible, because from a user's perspective, I don't think one exception is better or worse from the other. But, for now, I'd like to keep this check for a few reasons:
- Simplifies our code, as we have few behavior divergences between exact and approximate.
- I've been loosely adhering to the principle that if we have a valid value for metadata and we don't have anything better (such as number of frames when we haven't scanned) we just trust it.
- It's easier to relax this sort of thing in the future then make stricter.
If we get instances where people have lying metadata and we can't access valid frames, we can revisit.
return frame - streamInfo.allFrames.begin(); | ||
} | ||
case SeekMode::approximate: | ||
return std::ceil(seconds * streamMetadata.averageFps.value()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for self: I'm always scared when ceil
is used to define an index upper bound... I need to convince myself that this is safe
Co-authored-by: Nicolas Hug <[email protected]>
On the question of segfaults: I don't think approximate mode adds any more opportunities for segfaults. The problem we'll run into is metadata that is wrong, which will lead us to seek past the end of a file. That will throw an exception, but it won't segfault. |
@NicolasHug and I discussed the torchcodec/src/torchcodec/decoders/_core/VideoDecoder.cpp Lines 238 to 246 in d26bfbc
Right now, we call it to improve the quality of our metadata. It does so by decoding "some" frames at the beginning of the file to infer what is going on. The value of "some" is something that we can control, although we do not. Reading the code for FFmepg 6, the default length is The question is: should we call I wanted to quantify the potential gain of not calling, so I used the same video as above to do some experiments. Note that for each grouping, I called the benchmark three times independently. Both groupings have exact and approximate mode; I'm treating exact as a control. The command I used, with the same video as above:
Experiments where approximate does call
Experiments where approximate does not call
Some performance observations:
Notably, two of our metadata tests now fail when we don't call There is an argument to continue calling The question then becomes: is the gain in performance worth the potential loss of usefulness? Based on the results above, my answer is a soft no. That is, the scenarios where we see the gain are not our primary use-cases, and I worry we'd make approximate mode much less useful. But I don't hold that position strongly, and I'm happy to discuss it more. |
@@ -66,7 +65,7 @@ def seek(self, pts: float): | |||
class TestOps: | |||
@pytest.mark.parametrize("device", cpu_and_cuda()) | |||
def test_seek_and_next(self, device): | |||
decoder = create_from_file(str(NASA_VIDEO.path)) | |||
decoder = create_from_file(str(NASA_VIDEO.path), seek_mode="approximate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you applied the logic of "use exact when we called scan_all_streams_to_update_metadata()
, and approximate otherwise".
But in these tests, I don't think we were calling scan_all_streams_to_update_metadata()
with a strong purpose, i.e. I don't think we should read too much into the presence or the absence of a call to scan_all_streams_to_update_metadata()
. Specifically because for this specific test video, both modes are equivalent since the framerate is constant.
So I would suggest to not set the seek_mode
parameter anywhere in these tests: it shouldn't be relevant for the test video, and we weren't asking for a specific mode in the past anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fair. You're correct in what my intention was. In particular, when working on basic correctness, I felt keeping this consistent was important. But you're also correct that's there's not much intention in either situation, so I'm happy to just not set the seek mode. That will be more clear in the future that this test is not about seek modes.
|
||
@property | ||
def begin_stream_seconds(self) -> float: | ||
"""TODO.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just flagging this TODO (and the one below)
"""TODO.""" | ||
if self.begin_stream_seconds_from_content is None: | ||
return 0 | ||
return self.begin_stream_seconds_from_content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, I guess this is personal, but I prefer having an else
clause in such cases
if cond:
return ...
else:
return ...
as I find it easier from a quick glance that there are 2 possible return values (instead of just one at the bottom).
The only case I allow myself to write
if cond:
return ...
# rest here
is for early-return when there is a clear asymetry between the logic within the if
block, and the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an easy way to please both of us here would be to just write
return self.begin_stream_seconds_from_content or 0
:D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually fine with your first suggestion. The second suggestion trips me up because my C++ brain wants the type of the expression to be a bool.
test/samplers/test_samplers.py
Outdated
decoder.metadata.begin_stream_seconds = None | ||
decoder.metadata.end_stream_seconds_from_content = None | ||
decoder.metadata.duration_seconds_from_header = None | ||
decoder.metadata.duration_seconds_from_content = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this one line isn't needed for this check
test/samplers/test_samplers.py
Outdated
decoder.metadata.average_fps_from_header = None | ||
decoder.metadata.duration_seconds_from_header = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this one, it should be enough to set these 2 to None:
decoder.metadata.end_stream_seconds_from_content = None
decoder.metadata.average_fps_from_header = None
Thanks a lot for the benchmarks! I think we should be able to reduce the variance by averaging over a bunch of calls. But I agree with the conclusion that we can keep the call to
I just observed that before calling |
Comprehensive implementation of approximate mode. Modifies the C++ and Python sides to match. Adds parameters to existing tests, but does not add any new tests.
I know that doing all of the changes in one go - C++, core ops, Python public interface - is a bear to review. But I wanted to quickly get to a state where we could compare performance. Also, because the approach I took makes the decoder have two different modes of operation, it requires changes everywhere.
Starting with the C++
VideoDecoder
layer:enum
calledSeekMode
:exact
implies doing a full file scan, and we can use all of the prior algorithms for getting ranges of frames.approximate
implies not doing a full file scan, and we have to rely on the average FPS to calculate indices. We also assume that the minimum pts in seconds is 0, and the maximum is the duration. If there is no duration or average FPS in the metadata, that's an error. (TODO: make sure the C++ throws on all of those conditions.)VideoDecoder
, and it applies to all streams. I considered trying to make it apply only when we add a stream, but because we want it to control whether or not we do a full file scan, I made it a property of the entire decoder. (We could potentially make it apply per stream if we moved the file scanning from object initialization to when we add the first stream.)exact
. I modified a bunch of our core ops tests to reflect this fact, as it is now redundant to requestexact
seek mode and manually do a full file scan.[x, y, z]
, we will turn that into indices[a, b, c]
based on the average fps. If the average fps is wrong, or if the video has a variable frame rate, then that mapping may be wrong. In principle, we could have returned the correct frames.In the Python public API, we add:
To the Python
VideoDecoder
constructor. Not much logic actually changes here.We do have a lot of changes in the Python metadata to support the
VideoDecoder
:begin_stream_seconds
->begin_stream_seconds_from_content
: This was always only from the full scan. We need to be explicit now.begin_stream_seconds
is now a property that ifbegin_stream_seconds_from_content
is none, is just 0.end_stream_seconds
->end_stream_seconds_from_content
: Same reasoning as above.end_stream_seconds
is now a property that ifend_stream_seconds_from_content
is none returns theduration_seconds
.By changing the metadata in that manner, the Python
VideoDecoder
class remains mostly unchanged as the real logic is in resolving what metadata to use. Note that the metadata class has no knowledge of the seek mode; it just knows if some values are missing.On defaults, the current implementation has
exact
mode as the default everywhere. We could, however, decide to apply different defaults at each level. The levels we can apply defaults are:VideoDecoder
class. In some ways, the old behavior wasapproximate
as default, since scanning the file did not automatically happen.create_decoder_from
family of functions. Similar to the C++, the old behavior was closer toapproximate
as default.VideoDecoder
constructor. The old behavior was definitelyexact
as we always scanned.