Skip to content

Implement approximate mode #440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Jan 22, 2025
Merged

Implement approximate mode #440

merged 36 commits into from
Jan 22, 2025

Conversation

scotts
Copy link
Contributor

@scotts scotts commented Dec 20, 2024

Comprehensive implementation of approximate mode. Modifies the C++ and Python sides to match. Adds parameters to existing tests, but does not add any new tests.

I know that doing all of the changes in one go - C++, core ops, Python public interface - is a bear to review. But I wanted to quickly get to a state where we could compare performance. Also, because the approach I took makes the decoder have two different modes of operation, it requires changes everywhere.

Starting with the C++ VideoDecoder layer:

  • Adds a new enum called SeekMode:
    • exact implies doing a full file scan, and we can use all of the prior algorithms for getting ranges of frames.
    • approximate implies not doing a full file scan, and we have to rely on the average FPS to calculate indices. We also assume that the minimum pts in seconds is 0, and the maximum is the duration. If there is no duration or average FPS in the metadata, that's an error. (TODO: make sure the C++ throws on all of those conditions.)
  • The seek mode is passed in as a constructor parameter to VideoDecoder, and it applies to all streams. I considered trying to make it apply only when we add a stream, but because we want it to control whether or not we do a full file scan, I made it a property of the entire decoder. (We could potentially make it apply per stream if we moved the file scanning from object initialization to when we add the first stream.)
  • The default seek mode is exact. I modified a bunch of our core ops tests to reflect this fact, as it is now redundant to request exact seek mode and manually do a full file scan.
  • The pts-based member functions that converted the requested pts values to indices still do so. I initially implemented different algorithms for approximate mode where we used the actual pts values as is. That lead to poor performance because we could not allocate our output tensor up front and we could not take advantage of our dedupe logic. Since the point of approximate mode is performance, I switched it to what we currently have: helper functions which are seek mode aware, but the main algorithms are unaware of seek mode. However, this does mean that in approximate mode, we may return inaccurate frames even in cases when in principle we could return exact frames. That is, if a user requests a batch of frames at pts values [x, y, z], we will turn that into indices [a, b, c] based on the average fps. If the average fps is wrong, or if the video has a variable frame rate, then that mapping may be wrong. In principle, we could have returned the correct frames.

In the Python public API, we add:

  seek_mode: Literal["exact", "approximate"] = "exact",

To the Python VideoDecoder constructor. Not much logic actually changes here.

We do have a lot of changes in the Python metadata to support the VideoDecoder:

  • begin_stream_seconds -> begin_stream_seconds_from_content: This was always only from the full scan. We need to be explicit now.
  • begin_stream_seconds is now a property that if begin_stream_seconds_from_content is none, is just 0.
  • end_stream_seconds -> end_stream_seconds_from_content: Same reasoning as above.
  • end_stream_seconds is now a property that if end_stream_seconds_from_content is none returns the duration_seconds.

By changing the metadata in that manner, the Python VideoDecoder class remains mostly unchanged as the real logic is in resolving what metadata to use. Note that the metadata class has no knowledge of the seek mode; it just knows if some values are missing.

On defaults, the current implementation has exact mode as the default everywhere. We could, however, decide to apply different defaults at each level. The levels we can apply defaults are:

  • During construction of the C++ VideoDecoder class. In some ways, the old behavior was approximate as default, since scanning the file did not automatically happen.
  • In the Python core API create_decoder_from family of functions. Similar to the C++, the old behavior was closer to approximate as default.
  • In the Python VideoDecoder constructor. The old behavior was definitely exact as we always scanned.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 20, 2024
@scotts scotts mentioned this pull request Dec 20, 2024
@scotts
Copy link
Contributor Author

scotts commented Dec 20, 2024

Initial performance numbers are encouraging. I generated a video with the following command:

ffmpeg -y -f lavfi -i mandelbrot=s=1920x1080 -t 120 -c:v h264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot_1920x1080_120s.mp4

That produced a 141 MB video file. I then ran our standard benchmark with:

python benchmarks/decoders/benchmark_decoders.py --decoders torchcodec_public:seek_mode=exact,torchcodec_public:seek_mode=approximate,torchcodec_public_nonbatch:seek_mode=exact,torchcodec_public_nonbatch:seek_mode=approximate,decord,decord_batch,torchaudio,torchcodec_core_batch,torchcodec_core_nonbatch --min-run-seconds 40 --video-paths mandelbrot_1920x1080_120s.mp4

And that yields:

[-------------------------------------------------- video=mandelbrot_1920x1080_120s.mp4 h264 1920x1080, 120.0s 60.0fps -------------------------------------------------]
                                                      |  decode 10 uniform frames  |  decode 10 random frames  |  first 1 frames  |  first 10 frames  |  first 100 frames
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------
      TorchAudio                                      |           1590.5           |           2112.7          |       29.3       |        49.0       |       253.2      
      DecordAccurate                                  |           1798.6           |           2201.3          |       95.0       |       171.6       |       701.8      
      DecordAccurateBatch                             |           1756.8           |           2164.2          |       92.9       |       135.7       |       576.7      
      TorchCodecCoreBatch                             |           1605.8           |           1334.6          |       79.6       |        92.6       |       480.1      
      TorchCodecCoreNonBatch                          |           1587.3           |           1971.6          |       25.7       |        38.9       |       155.9      
      TorchCodecPublicNonBatch:seek_mode=exact        |           1620.3           |           2011.6          |       79.3       |        92.7       |       218.8      
      TorchCodecPublicNonBatch:seek_mode=approximate  |           1541.5           |           1981.8          |       26.4       |        39.5       |       166.5      
      TorchCodecPublic:seek_mode=exact                |           1591.5           |           1330.2          |       78.6       |        92.1       |       214.2      
      TorchCodecPublic:seek_mode=approximate          |           1562.3           |           1292.8          |       25.9       |        39.0       |       159.9      

Times are in milliseconds (ms).

Some explanations of what the options are:

  • TorchCodecCoreBatch: core API, uses exact seek mode and batch APIs.
  • TorchCodecCoreNonBatch: core API, uses approximate seek mode and non-batch APIs.
  • Public means using VideoDecoder.
  • NonBatch means using the single frame API, even when getting multiple frames.
  • Batch means using the batch API when getting multiple frames.
  • seek_mode= is changing the seek mode.

For the full matrix, we could change the seek modes for the core options. Approximate mode is basically meeting our performance expectations here.

AVCodecContext* codecContext = avcodec_alloc_context3(codec);
codecContext->thread_count = options.ffmpegThreadCount.value_or(0);
TORCH_CHECK(codecContext != nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this TORCH_CHECK was almost certainly optimized out. Because the check happens after a use, the compiler is allowed to infer that the pointer can never be null, so codecContext != nullptr can just become true. The check itself will then be removed as dead code. Null checks must happen before the first use.

@scotts scotts changed the title [WIP] Implement approximate mode Implement approximate mode Jan 9, 2025
@scotts scotts marked this pull request as ready for review January 9, 2025 20:29
@scotts
Copy link
Contributor Author

scotts commented Jan 9, 2025

Ready for review. Some of the high-level things to agree one that I would ordinarily have written up in a design:

  • Is there a better name than seek_mode?
  • Do "exact" and "approximate" convey what we want, or are there better terms to use?
  • As I called out in my summary, this PR makes "exact" the default at all levels. I think there's a valid argument to make the Python VideoDecoder default to "exact", but the Python core API and the C++ layer to default to "approximate".

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting!! I made a first pass below.

Is there a better name than seek_mode?
Do "exact" and "approximate" convey what we want, or are there better terms to use?

I think your choice of parametrization is good. I have a comment below related to that with get_video_metadata but for the decoder I think this is good.

As I called out in my summary, this PR makes "exact" the default at all levels. I think there's a valid argument to make the Python VideoDecoder default to "exact", but the Python core API and the C++ layer to default to "approximate".

Consistency across APIs seems more important, so I'm happy with making "exact" the default at all levels. Apart for VideoDecoder where the choice was deliberate, I don't think we had a strong reasoning towards the choice of default mode in the other APIs (like the C++ APIs being closer to "approximate" by default), so I wouldn't index on that too much.

}
}

int64_t VideoDecoder::getFramesSize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function and the ones below seem to implement a similar logic to the one we have on the Python's side (within the @properties). This shouldn't be done in this PR but longer term, do you think it could make sense to aggressively unify these logics and have only one source of truth (in C++)? This is related to #125.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, yes. The C++ code predates the abstractions we built in the public API on the Python side, and I think it's a good idea to try to unify them.

TORCH_CHECK(
frameIndex >= 0 && frameIndex < stream.allFrames.size(),
frameIndex >= 0 && frameIndex < framesSize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning: not very fleshed out thoughts below]

In approximate mode, do we actually want to check for frameIndex < framesSize? That translates to checking that frameIndex < numFrames[FromHeader] which relies on the metadata being somewhat correct.

Assuming the metadata is wrong or inexistant, this check would error loudly, when perhaps such an index is in fact valid and could be returned?
Assuming we don't do the check in approximate mode, what's the worst thing that can happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the worst thing that could happen is that we hit an error when we call avformat_seek_file() in maybeSeekToBeforeDesiredPts():

int ffmepgStatus = avformat_seek_file(
formatContext_.get(),
firstStreamInfo.streamIndex,
INT64_MIN,
desiredPts,
desiredPts,
0);
if (ffmepgStatus < 0) {
throw std::runtime_error(
"Could not seek file to pts=" + std::to_string(desiredPts) + ": " +
getFFMPEGErrorStringFromErrorCode(ffmepgStatus));
}

That's not terrible, because from a user's perspective, I don't think one exception is better or worse from the other. But, for now, I'd like to keep this check for a few reasons:

  • Simplifies our code, as we have few behavior divergences between exact and approximate.
  • I've been loosely adhering to the principle that if we have a valid value for metadata and we don't have anything better (such as number of frames when we haven't scanned) we just trust it.
  • It's easier to relax this sort of thing in the future then make stricter.

If we get instances where people have lying metadata and we can't access valid frames, we can revisit.

return frame - streamInfo.allFrames.begin();
}
case SeekMode::approximate:
return std::ceil(seconds * streamMetadata.averageFps.value());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for self: I'm always scared when ceil is used to define an index upper bound... I need to convince myself that this is safe

@scotts
Copy link
Contributor Author

scotts commented Jan 15, 2025

On the question of segfaults: I don't think approximate mode adds any more opportunities for segfaults. The problem we'll run into is metadata that is wrong, which will lead us to seek past the end of a file. That will throw an exception, but it won't segfault.

@scotts
Copy link
Contributor Author

scotts commented Jan 17, 2025

@NicolasHug and I discussed the avformat_find_stream_info() call offline. In our current code base, we call it here:

void VideoDecoder::initializeDecoder() {
TORCH_CHECK(!initialized_, "Attempted double initialization.");
// In principle, the AVFormatContext should be filled in by the call to
// avformat_open_input() which reads the header. However, some formats do not
// store enough info in the header, so we call avformat_find_stream_info()
// which decodes a few frames to get missing info. For more, see:
// https://ffmpeg.org/doxygen/7.0/group__lavf__decoding.html
int ffmpegStatus = avformat_find_stream_info(formatContext_.get(), nullptr);

Right now, we call it to improve the quality of our metadata. It does so by decoding "some" frames at the beginning of the file to infer what is going on. The value of "some" is something that we can control, although we do not. Reading the code for FFmepg 6, the default length is 5 * AV_TIME_BASE. I think this will tend to be on the order of less than ten frames.

The question is: should we call avformat_find_stream_info() in approximate mode?

I wanted to quantify the potential gain of not calling, so I used the same video as above to do some experiments. Note that for each grouping, I called the benchmark three times independently. Both groupings have exact and approximate mode; I'm treating exact as a control. The command I used, with the same video as above:

python benchmarks/decoders/benchmark_decoders.py --decoders torchcodec_public:seek_mode=exact,torchcodec_public:seek_mode=approximate --min-run-seconds 40 --video-paths mandelbrot_1920x1080_120s.mp4

Experiments where approximate does call avformat_find_stream_info():

                                              |  decode 10 uniform frames  |  decode 10 random frames  |  first 1 frames  |  first 10 frames  |  first 100 frames
1:
      TorchCodecPublic:seek_mode=exact        |           1630.6           |           1346.0          |       78.4       |       111.4       |       458.1      
      TorchCodecPublic:seek_mode=approximate  |           1564.4           |           1323.2          |       26.3       |        60.0       |       403.0      
2:
      TorchCodecPublic:seek_mode=exact        |           1621.3           |           1348.9          |       78.1       |        93.2       |       214.0      
      TorchCodecPublic:seek_mode=approximate  |           1551.1           |           1270.7          |       25.7       |        38.9       |       162.4      
3:
      TorchCodecPublic:seek_mode=exact        |           1585.9           |           1349.1          |       79.2       |       117.7       |       469.2      
      TorchCodecPublic:seek_mode=approximate  |           1543.1           |           1271.6          |       25.6       |        59.2       |       410.8      

Experiments where approximate does not call avformat_find_stream_info():

                                              |  decode 10 uniform frames  |  decode 10 random frames  |  first 1 frames  |  first 10 frames  |  first 100 frames
1:
      TorchCodecPublic:seek_mode=exact        |           1639.1           |           1365.9          |       78.9       |        92.2       |       389.2      
      TorchCodecPublic:seek_mode=approximate  |           1537.3           |           1264.1          |       16.8       |        49.5       |       160.3      
2:
      TorchCodecPublic:seek_mode=exact        |           1633.0           |           1350.5          |       78.7       |        91.8       |       210.4      
      TorchCodecPublic:seek_mode=approximate  |           1559.0           |           1318.0          |       17.1       |        30.1       |       151.0      
3:
      TorchCodecPublic:seek_mode=exact        |           1605.4           |           1353.9          |       78.3       |        93.5       |       217.9      
      TorchCodecPublic:seek_mode=approximate  |           1547.6           |           1286.0          |       17.1       |        30.4       |       166.1      

Some performance observations:

  1. There is a surprising level of variation in the first 100 frames case. I don't have an explanation for it, but I note that it seems to affect exact and approximate equally. Because of that, I have difficulty using it draw any conclusions. At the least, the effect of calling avformat_find_stream_info() seems to be lost in the noise.
  2. There is some variation in the first 10 frames case. If we compare the similar experiments, we may have a gain of about 20%.
  3. There is a clear improvement in the decode 1 frame case by about 35%.
  4. There is no meaningful difference for 10 uniform and random frames.

Notably, two of our metadata tests now fail when we don't call avformat_find_stream_info(); duration_seconds_from_header is not being set anymore. All of the other tests pass because we already have duration_seconds as a property that computes the value from the begin and end values. Presumably, avformat_find_stream_info() is doing something similar to what our metadata class in Python is.

There is an argument to continue calling avformat_find_stream_info() in approximate mode: we are actually more reliant on accurate metadata in approximate mode than we are in exact mode. In exact mode, we do a more extreme version of what that call does, and we scan the entire file. That means exact mode is less reliant on the metadata. But all seeking in approximate mode depends on that metadata. If we don't call avformat_find_stream_info(), then approximate mode will not be useful for files that don't have good metadata directly in their header.

The question then becomes: is the gain in performance worth the potential loss of usefulness? Based on the results above, my answer is a soft no. That is, the scenarios where we see the gain are not our primary use-cases, and I worry we'd make approximate mode much less useful. But I don't hold that position strongly, and I'm happy to discuss it more.

@@ -66,7 +65,7 @@ def seek(self, pts: float):
class TestOps:
@pytest.mark.parametrize("device", cpu_and_cuda())
def test_seek_and_next(self, device):
decoder = create_from_file(str(NASA_VIDEO.path))
decoder = create_from_file(str(NASA_VIDEO.path), seek_mode="approximate")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you applied the logic of "use exact when we called scan_all_streams_to_update_metadata(), and approximate otherwise".

But in these tests, I don't think we were calling scan_all_streams_to_update_metadata() with a strong purpose, i.e. I don't think we should read too much into the presence or the absence of a call to scan_all_streams_to_update_metadata(). Specifically because for this specific test video, both modes are equivalent since the framerate is constant.

So I would suggest to not set the seek_mode parameter anywhere in these tests: it shouldn't be relevant for the test video, and we weren't asking for a specific mode in the past anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fair. You're correct in what my intention was. In particular, when working on basic correctness, I felt keeping this consistent was important. But you're also correct that's there's not much intention in either situation, so I'm happy to just not set the seek mode. That will be more clear in the future that this test is not about seek modes.


@property
def begin_stream_seconds(self) -> float:
"""TODO."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just flagging this TODO (and the one below)

"""TODO."""
if self.begin_stream_seconds_from_content is None:
return 0
return self.begin_stream_seconds_from_content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I guess this is personal, but I prefer having an else clause in such cases

if cond:
    return ...
else:
    return ...

as I find it easier from a quick glance that there are 2 possible return values (instead of just one at the bottom).

The only case I allow myself to write

if cond:
    return ...

# rest here

is for early-return when there is a clear asymetry between the logic within the if block, and the rest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an easy way to please both of us here would be to just write

return self.begin_stream_seconds_from_content or 0

:D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually fine with your first suggestion. The second suggestion trips me up because my C++ brain wants the type of the expression to be a bool.

decoder.metadata.begin_stream_seconds = None
decoder.metadata.end_stream_seconds_from_content = None
decoder.metadata.duration_seconds_from_header = None
decoder.metadata.duration_seconds_from_content = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this one line isn't needed for this check

decoder.metadata.average_fps_from_header = None
decoder.metadata.duration_seconds_from_header = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this one, it should be enough to set these 2 to None:

        decoder.metadata.end_stream_seconds_from_content = None
        decoder.metadata.average_fps_from_header = None

@NicolasHug
Copy link
Member

Thanks a lot for the benchmarks! I think we should be able to reduce the variance by averaging over a bunch of calls. But I agree with the conclusion that we can keep the call to avformat_find_stream_info(). At least for now, and definitely for this PR anyways. Let's just keep in mind as a potential source of improvement (e.g. if the user trusts their metadata, we might allow the to bypass it).

Notably, two of our metadata tests now fail when we don't call avformat_find_stream_info(); duration_seconds_from_header is not being set anymore. All of the other tests pass because we already have duration_seconds as a property that computes the value from the begin and end values. Presumably, avformat_find_stream_info() is doing something similar to what our metadata class in Python is.

I just observed that before calling avformat_find_stream_info() on our favorite test video, the duration of the AVFormatContext was wrongly set to 0, even though the AVStream's duration was set to a correct value. After calling avformat_find_stream_info(), the AVFormatContext's duration was correct. So you're right, it does something kinda similar to what we do with our metadata (i.e. it overrides with a more correct value). That being said, we should be able to set our ContainerMetadata.durationSeconds field to that of a stream, instead of relying on that of AVFormatContext, which is potentially wrong. That's for another PR anyway.

@scotts scotts merged commit 0ad3f01 into pytorch:main Jan 22, 2025
44 of 47 checks passed
@scotts scotts deleted the approx branch January 22, 2025 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants