
[llava][18/N] Move token generation loop to a class #4652


Merged
merged 1,328 commits into gh/kimishpatel/47/base on Aug 13, 2024

Conversation

larryliu0820
Contributor

@larryliu0820 larryliu0820 commented Aug 9, 2024

Stack from ghstack (oldest at bottom):

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.
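
For context, below is a minimal sketch of what such a reusable generation-loop class could look like. The `Decoder`, `Tokenizer`, and `TokenGenerator` interfaces are illustrative assumptions for this sketch, not the actual ExecuTorch APIs.

```cpp
// Illustrative sketch only -- the interfaces below are hypothetical
// stand-ins, not the actual ExecuTorch runner APIs.
#include <cstdint>
#include <functional>
#include <string>

struct Decoder {
  // Runs one decode step and returns the next token id.
  virtual uint64_t step(uint64_t token, int64_t pos) = 0;
  virtual ~Decoder() = default;
};

struct Tokenizer {
  virtual std::string decode(uint64_t prev, uint64_t cur) = 0;
  virtual uint64_t eos_id() const = 0;
  virtual ~Tokenizer() = default;
};

// The piece this PR factors out of the llama2 runner: a self-contained
// token generation loop that other runners (e.g. llava) can reuse.
class TokenGenerator {
 public:
  TokenGenerator(Decoder* decoder, Tokenizer* tokenizer)
      : decoder_(decoder), tokenizer_(tokenizer) {}

  // Generates up to `max_new_tokens` tokens starting from `cur_token` at
  // position `start_pos`, streaming each decoded piece to `on_piece`.
  int64_t generate(
      uint64_t cur_token,
      int64_t start_pos,
      int64_t max_new_tokens,
      const std::function<void(const std::string&)>& on_piece) {
    int64_t pos = start_pos;
    uint64_t prev = cur_token;
    for (int64_t i = 0; i < max_new_tokens; ++i) {
      uint64_t next = decoder_->step(prev, pos++);
      on_piece(tokenizer_->decode(prev, next));
      if (next == tokenizer_->eos_id()) {
        break;  // stop on end-of-sequence
      }
      prev = next;
    }
    return pos - start_pos;  // number of tokens generated
  }

 private:
  Decoder* decoder_;
  Tokenizer* tokenizer_;
};
```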

Differential Revision: D61047601

mcremon-meta and others added 30 commits July 16, 2024 13:41
…#4108)

Summary:
Pull Request resolved: #4108

We want to be able to run the reference implementations on x86, so we don't want any intrinsics or anything like that in the reference kernels.

In the end, this change has a lot of things:
- introduce a `reference` folder for reference implementations
- move the primary cmake flow from HiFi to reference, so that the default mode can run on x86
- that means we will need a proper flag to use HiFi optimized ops, which we will add later
- add a `quantized_matmul` reference kernel (see the illustrative sketch below)
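
As a rough illustration of what a reference (no-intrinsics) kernel looks like, here is a naive quantized matmul sketch; the int8 layout, per-tensor scale/zero-point scheme, and function signature are assumptions for illustration, not the actual Cadence kernel interface.

```cpp
#include <cmath>
#include <cstdint>

// Naive reference quantized matmul: C(MxN) = A(MxK) * B(KxN), all int8 with
// per-tensor scale/zero-point. Plain loops, no intrinsics, so it runs
// unchanged on x86. Signature and layout are illustrative assumptions.
void quantized_matmul_reference(
    const int8_t* a, float a_scale, int32_t a_zero_point,
    const int8_t* b, float b_scale, int32_t b_zero_point,
    int8_t* c, float c_scale, int32_t c_zero_point,
    int m, int k, int n) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int32_t acc = 0;
      for (int p = 0; p < k; ++p) {
        acc += (static_cast<int32_t>(a[i * k + p]) - a_zero_point) *
               (static_cast<int32_t>(b[p * n + j]) - b_zero_point);
      }
      // Requantize the int32 accumulator into the output's int8 domain.
      float real_value = static_cast<float>(acc) * a_scale * b_scale;
      int32_t q = static_cast<int32_t>(std::lround(real_value / c_scale)) +
                  c_zero_point;
      if (q > 127) q = 127;
      if (q < -128) q = -128;
      c[i * n + j] = static_cast<int8_t>(q);
    }
  }
}
```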

Reviewed By: dulinriley

Differential Revision: D59238748

fbshipit-source-id: 830c89fe9ee8dd87ece963e1174ca3cbd1e0fbc6
Summary:
Pull Request resolved: #4272

`MethodMeta` is the new way to get this information.

Reviewed By: tarun292

Differential Revision: D59782278

fbshipit-source-id: 1e1df006ee95886aa80b9704bfda488e1ad93dcf
Summary:
Pull Request resolved: #4264

This diff brings in the latest export serializer changes from `torch/_export/serde` and then fixes the exir serializer to use these changes. This is a temporary workaround until the serialization extensibility that zhxchen17 is working on is completed. Then we won't have to copy over changes from the export serializer manually; instead we'll just extend it to handle things like delegate calls and other edge dialect specific things.

For now we need this to unblock the ASR use case, hence I'm manually syncing and updating it for now.

Reviewed By: JacobSzwejbka

Differential Revision: D57071033

fbshipit-source-id: 4408e2a0740b661e3a3555f800d1567ef10d4ea8
Summary:
This PR moves the tiktoken and bpe tokenizers into `extension/llm/tokenizer` so that they can be reused by other models.

Note: currently tiktoken has two sets of unit tests based on llama2's tokenizer:
- default
- multimodal

This PR only moves the default unit test into extension and keeps the multimodal unit tests inside llama2/tokenizer.

Pull Request resolved: #4278

Test Plan:
- test/run_oss_cpp_tests.sh examples/models/llama2/tokenizer/test
- test/run_oss_cpp_tests.sh extension/llm/tokenizer/test

Reviewed By: larryliu0820

Differential Revision: D59822702

Pulled By: helunwencser

fbshipit-source-id: 5d51ba3e44c9b2d9dc77b9f4349b58947ed68502
Summary: Pull Request resolved: facebookincubator/fizz#142

Reviewed By: namanahuja

Differential Revision: D59832580

Pulled By: ahornby

fbshipit-source-id: 1b936a007e5d08f7bc959d5775bce36b107f4bb3
Summary:
Our [stable branch](https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50) has been falling behind for a month. It seems like the Android job keeps failing to consume the artifacts fetched from S3. See #4285 for details.

To unblock the stable branch, this PR temporarily disables the S3 workflow; it will be re-enabled as a periodic job in #4286 once #4285 is fixed.

Pull Request resolved: #4287

Reviewed By: dbort

Differential Revision: D59839152

Pulled By: guangy10

fbshipit-source-id: 5ab85aa592c32e7a9048845cf9088eb39573b7ce
Summary:
Pull Request resolved: #4269

as title
ghstack-source-id: 234090425

Reviewed By: dbort

Differential Revision: D59770172

fbshipit-source-id: 4bb63b2497cd3ddb04726da6fb8cefb0a2add391
Summary:
Pull Request resolved: #4290

.

Reviewed By: helunwencser

Differential Revision: D59865664

fbshipit-source-id: 2dfcfc90194dc366ab9811bfa73d5f1b44872255
Summary:
Pull Request resolved: #4273

## Motivation
`run_decompositions()` has a new preserve_ops functionality which allows us to specify which ops we want to refrain from decomposing. This is super helpful for the to_edge_transform_and_lower API because it allows us to preserve ops that only appear beyond the first level of decomposition.

For example, consider LSTM. When exported using torch.export, we see a torch.ops.aten.LSTM() operator in the graph. When running decompositions, this is decomposed into linear, which is then further decomposed into addmm. Since the linear op is produced by decomposing LSTM and does not exist until after we run_decompositions(), we cannot use our trick of changing the namespace to prevent its decomposition. However, using `_preserve_ops=(torch.ops.aten.linear.default,)` we are now able to prevent this second-level decomposition.

## API Implementation Change
The implementation does two passes. In the first pass, we call run_decompositions(), preserving all aten ops specified by our partitioners via `_preserve_ops`. In the second pass, we further filter which aten ops should be preserved using the check_op_fn given to us by partitioners, and then use our namespace trick to prevent the decomposition of all aten ops that pass check_op_fn.

## Testing Changes
To strengthen our tests, I first change the functionality of the NonDecompPartitioner: we partition only pre-decomp aten ops, and each of these ops lives within its own delegate (giving us a 1:1 mapping between call_delegate and pre-decomp aten nodes). In testing, this lets us verify that the number of preserved ops is correct by counting the number of call_delegate nodes.

In testing we then count the number of aten ops which should be preserved, and check after the fact that each of these ops:
1. is no longer in the graph after to_edge_transform_and_lower, and
2. has been transformed into a call_delegate node.

Reviewed By: tarun292

Differential Revision: D59786323

fbshipit-source-id: 7ea946e0d5afc8ebddd26913f6e843305116ad3b
…4163)

Summary:
- add utilities for loading context binary generated from qnn tools
- align env variable naming with qnn
- fix bug in online prepare and extend coverage to support bitwise quantization
- llama7b e2e example from qualcomm ai_hub
- minor fixes for style & typos

Pull Request resolved: #4163

Reviewed By: swolchok, kirklandsign

Differential Revision: D59737140

Pulled By: cccclai

fbshipit-source-id: 16e98d7f5da7204a2d04258fd75dabd8aa1eaa7d
Summary:
Pull Request resolved: #4293

This diff parses the logged intermediate outputs in etdump into Inspector objects. It pretty much automatically works because the infra has already been built out for non-delegate intermediate output logging. The only change needed here is to add delegate debug id to `InstructionEventSignature` and `DebugEventSignature` so they group delegated ops correctly.
Design doc: https://docs.google.com/document/d/1qGHsgd-roqtxPz4CrUlqGrKaAtbaf9bArziMuwHD0So/edit

Reviewed By: Jack-Khuu

Differential Revision: D59840296

fbshipit-source-id: 04f22d4520584090f3b37b83386f704cc7a2c271
Summary:
Pull Request resolved: #4256

Llava uses HF RoPE, so this adds a config to switch between our stock RoPE and HF RoPE. We may be able to consolidate them later.

Reviewed By: helunwencser

Differential Revision: D59759975

fbshipit-source-id: 9c3a1825b82f0f32e15fb06e2f73d73e2bacba0c
Summary:
Pull Request resolved: #4262

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from memory, using the following shader, where A and B are readonly and writeonly buffers, respectively.

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.
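
To make the `addr_mask` trick concrete, here is a small host-side sketch (plain C++, not the shader itself) showing how masking the running offset confines it to a power-of-two number of unique addresses; the `nvec` and stride values are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // For a power of two m, (x & (m - 1)) == (x % m), so the mask confines
  // the running offset to the first `nvec` addresses of the buffer.
  const uint32_t nvec = 8;              // unique addresses we allow
  const uint32_t addr_mask = nvec - 1;  // 0b0111
  const uint32_t stride = 3;            // stand-in for local_group_size

  uint32_t offset = 0;
  for (int i = 0; i < 12; ++i) {
    std::printf("iter %2d: offset = %u\n", i, offset);  // always in [0, 7]
    offset = (offset + stride) & addr_mask;  // same update as the shader
  }
  return 0;
}
```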

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see a limitation when buffers reach 32 KB in size.

{F1759406621}

Reviewed By: SS-JIA

Differential Revision: D59687299

fbshipit-source-id: 5a97a2c2b0bf077c575de55d23061d5597ba385d
Summary:
Pull Request resolved: #4270

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from UBOs, using the following shader, where A is a UBO and B is a writeonly buffer.

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see a decline proportional to the size of the buffer, until it plateaus at 32 KB.

{F1759559978}

Comparing it with the read-only buffer profiler, we can immediately see the superior read bandwidth of UBOs whenever the hardware supports them.

Samsung S22
{F1759560675}

Redmi Note
{F1759445004}

Reviewed By: copyrightly

Differential Revision: D59776899

fbshipit-source-id: 0f93186833bbe3610c5b5bf68cda519a7b467aca
Summary:
Pull Request resolved: #4277

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from shared memory, using the following shader, where A is a shared-memory array and B is a writeonly buffer.

  shared vec4 A[nvec];

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see that accessing the shared memory has a constant latency, until it reaches the Maximum Shared Memory size.

NOTE: The graph is extended for visualization purposes; the experiment stops before the drop because otherwise it would crash.

{F1759597657}

Comparing it to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth.

{F1759600867}

Reviewed By: copyrightly

Differential Revision: D59811152

fbshipit-source-id: 537be13dbec1a02cb55e689db2a0fd548613c729
…nts (#4292)

Summary:
Pull Request resolved: #4292

Some simple improvements to the SPIR-V compilation script:

1. Allow `layout_declare_tensor` to create a scalar buffer instead of always creating a vectorized buffer
2. Allow handling of non-string (i.e. int) values in shader codegen YAML configurations.

Reviewed By: jorgep31415

Differential Revision: D59877805

fbshipit-source-id: 579888fbc19d19a0d24f2fbd831e74f4ba32f033
Summary:
changed `rm- -rf` to `rm -rf`

Pull Request resolved: #4296

Reviewed By: lucylq

Differential Revision: D59921910

Pulled By: dbort

fbshipit-source-id: bab21a39faae4db53ff4b04c02598f27c535d3ce
Summary:
We want to reuse the same demo app to benchmark as many models as possible. It may not be easy to create a super generic app for all types of models, but we can reuse our existing demo apps and swap in different models performing the same task; e.g., our llama demo should be able to benchmark different causal LLMs without problems. To do this, we need to organize the build vertically by demo app. Currently we have two demo apps for Android (the iOS demo app will follow the same rule); this PR addresses the llama demo. The android job 'build-llm-demo' builds different flavors of the same app by android-abi and tokenizer library. Downstream, an app built for arm with the bpe tokenizer could be used to benchmark all LLMs using the bpe tokenizer on a physical Android device.

Pull Request resolved: #4288

Reviewed By: huydhn, kirklandsign

Differential Revision: D59874919

Pulled By: guangy10

fbshipit-source-id: 11bf280765af9ddd4e5459e47c859cc8d37b3848
Summary:
Pull Request resolved: #4299

Remove Buck2 reference

Reviewed By: lucylq

Differential Revision: D59922659

fbshipit-source-id: 12f7c59e9ea23afd743435115e9bd5b5afc825d4
…4279)

Summary:
Pull Request resolved: #4279

I left review feedback for this diff late; applying it (and broadening use of kwargs.get() while I'm here).
ghstack-source-id: 234276753

Reviewed By: kimishpatel

Differential Revision: D59823492

fbshipit-source-id: 0fc8ec2861d2eb2f19bb38ba885024d532f20f44
Summary:
Per the comment in #4288, add the upload step back so that TorchChat can consume the artifacts from S3.

Pull Request resolved: #4300

Reviewed By: huydhn, kirklandsign

Differential Revision: D59925505

Pulled By: guangy10

fbshipit-source-id: ce389fb16adb30d51240fdff655111580f07130b
Summary: Pull Request resolved: #4259

Reviewed By: helunwencser

Differential Revision: D59759978

fbshipit-source-id: 8ff8a5b24481b28e0814b45f60b4b0fdbfd47e4e
Summary:
Pull Request resolved: #4295

As titled. Pending CI job

Reviewed By: helunwencser

Differential Revision: D59901269

fbshipit-source-id: 0f32357830a677736ac3123526653bff70c8c7af
Summary:
Pull Request resolved: #4291

Starts to completely deprecate DataLoader::Load by converting existing callsites to DataLoader::load, and making DataLoader::load pure virtual.

Searched for all relevant callsites by doing the following:
- Grep for all instances of `core/data_loader.h>` to find all direct imports of DataLoader, finding "Load(" in the file
- Grep for all instances of `executorch\/.*_data_loader.h>` to find all imports of subclasses derived from DataLoader, finding "Load(" in the file
- Grep for all instances of "->Load(" in files with "executorch" in the path
- Grep for all instances of "loader->Load(" and "loader_->Load(" in the entire codebase
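
To illustrate the rename described above, here is a minimal sketch of the resulting interface shape and a converted callsite; the argument list, return type, and helper names are assumptions for illustration, not copied from the actual headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real buffer type returned by a loader.
struct FreeableBuffer {
  std::vector<uint8_t> data;
};

// After this change, the lowercase `load` is the only (pure virtual) entry
// point; the capitalized `Load` spelling is being deprecated.
class DataLoader {
 public:
  virtual ~DataLoader() = default;
  virtual FreeableBuffer load(size_t offset, size_t size) const = 0;
};

// Callsite conversion, as done throughout the codebase:
//   before: FreeableBuffer buf = loader->Load(offset, size);
//   after:  FreeableBuffer buf = loader->load(offset, size);
```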

Reviewed By: dbort

Differential Revision: D59767505

fbshipit-source-id: d8a6d998f957dcba291815312f8bde54f84c3100
Summary:
Return an array with size 0 so that the code below can get the array length.

Pull Request resolved: #4266

Reviewed By: cccclai

Differential Revision: D59764473

Pulled By: kirklandsign

fbshipit-source-id: f00eac32879c31a983156c4651714ccf1e1ec280
Summary: Pull Request resolved: #4265

Reviewed By: guangy10

Differential Revision: D59763471

Pulled By: kirklandsign

fbshipit-source-id: 33c3fbe2d0434daab5504856f91f15407e58d463
Summary:
Pull Request resolved: #4298

This very simple metric runs a kernel across an increasing number of workgroups, until there is a noticeable increase in latency, as seen in the following graph:

 {F1762497995}

The shader uses an integer division as its metric, because it is a multi-cycle operation that puts the ALU to work and stops the SM from context switching.

As with the other metrics, we start by obtaining the minimum number of iterations, NITER, that can run in 1000 us, so as to have a baseline for comparison and reduce timing noise. With this number of iterations, we run the kernel with an increasing number of threads. We also use a multidimensional global workgroup with a Y size of 1024, in hopes of saturating the ALUs and having a better point of reference for the latency caused by adding warps.

Once we detect a jump in latency, we can assume that the thread count at which it occurs is the warp size.

More information can be found [here](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf) on page 5.
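
As a rough host-side sketch of the detection logic only (not the actual profiler code): `time_kernel_us` here is a hypothetical callback that dispatches the divide-heavy shader with the given thread count and returns its latency in microseconds.

```cpp
#include <functional>

// Sketch of the jump-detection logic. `time_kernel_us` is a hypothetical
// callback that runs the divide-heavy shader for NITER iterations with
// `threads` invocations and returns the measured latency in microseconds.
int detect_warp_size(
    const std::function<double(int)>& time_kernel_us,
    int max_threads = 128,
    double jump_factor = 1.5) {
  double baseline = time_kernel_us(1);
  for (int threads = 2; threads <= max_threads; ++threads) {
    double latency = time_kernel_us(threads);
    // Latency stays roughly flat while every added thread still fits in one
    // warp; the first clear jump means a second warp had to be scheduled.
    if (latency > baseline * jump_factor) {
      return threads - 1;
    }
    baseline = (baseline + latency) / 2.0;  // smooth out timing noise
  }
  return max_threads;  // no jump observed within the tested range
}
```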

Reviewed By: jorgep31415

Differential Revision: D59920169

fbshipit-source-id: 4ac9324e10f0ab1a72433fd7ce98ad5f5ab839e9
Summary: Pull Request resolved: #4306

Reviewed By: kirklandsign

Differential Revision: D59977405

Pulled By: guangy10

fbshipit-source-id: 2ce889e6f49ade545a668244db8aab6e7f7bef01
Summary:
This PR is auto-generated nightly by [this action](https://github.com/pytorch/executorch/blob/main/.github/workflows/nightly.yml).
Update the pinned pytorch hash.

Pull Request resolved: #4313

Reviewed By: kirklandsign

Differential Revision: D59983250

Pulled By: guangy10

fbshipit-source-id: eec0a71936aad9642b958ce4ac222011a8d0025d
Summary:
Pull Request resolved: #4322

We retrofitted flash attention cpu from aten. The retrofit we did was to make it calculate attention for a) batched prefill and b) decode with different start_pos. For b), there was a bug when the kv cache's seqlen dim is split; as a result, the attention calculation was not right. There is a detailed explanation of the issue in the code.

bypass-github-export-checks
ghstack-source-id: 234634902

Reviewed By: larryliu0820

Differential Revision: D60011925

fbshipit-source-id: 50921846b329e449a4a767cf28c7a55d507217bd
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

haowhsu-quic and others added 8 commits August 12, 2024 11:28
Differential Revision: D60967580

Pull Request resolved: #4560
Differential Revision: D61057535

Pull Request resolved: #4657
Differential Revision: D61141396

Pull Request resolved: #4670
Differential Revision: D60424030

Pull Request resolved: #4453
Differential Revision: D60399589

Pull Request resolved: #4443
Differential Revision: D61044259

Pull Request resolved: #4649
Differential Revision: D61150844

Pull Request resolved: #4673
* allow models to use customized token ids during export (#4649)

Summary:
Llama3.1's [bos and eos](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/tokenizer_config.json) are different from what is hardcoded in the code. This PR updates the export flow to allow reading customized token ids instead of hardcoded ones.

It also deletes a few metadata entries that are not used by the runner.

Pull Request resolved: #4649

Differential Revision: D61044259

Pulled By: helunwencser

* Do not print eos

Summary: We don't want to print eos in the response because some eos tokens could be `<|end_of_text|>`.

Differential Revision: D61048254

---------

Co-authored-by: Lunwen He <[email protected]>
Contributor

@lucylq lucylq left a comment


LGTM after the linter errors are fixed.

JacobSzwejbka and others added 5 commits August 12, 2024 15:47
Differential Revision: D61147536

Pull Request resolved: #4671
Differential Revision: D61054615

Pull Request resolved: #4642
Differential Revision: D61166041

Pull Request resolved: #4678

---------

Co-authored-by: helunwencser <[email protected]>
Differential Revision: D61108863

Pull Request resolved: #4664
Differential Revision: D61141050

Pull Request resolved: #4662
@larryliu0820 larryliu0820 changed the base branch from gh/larryliu0820/47/base to main August 13, 2024 04:27
@larryliu0820 larryliu0820 changed the base branch from main to gh/kimishpatel/47/base August 13, 2024 04:28
…o a class"

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
larryliu0820 added a commit that referenced this pull request Aug 13, 2024
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

ghstack-source-id: 1108ada
Pull Request resolved: #4652
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…o a class"

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
larryliu0820 added a commit that referenced this pull request Aug 13, 2024
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

ghstack-source-id: 92ef9f2
Pull Request resolved: #4652
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot merged commit 3226daa into gh/kimishpatel/47/base Aug 13, 2024
71 of 79 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.