[llava][18/N] Move token generation loop to a class #4652
Conversation
…#4108) Summary: Pull Request resolved: #4108 We want to be able to run the reference implementations on x86, so we don't want any intrinsics or anything like that in the reference kernels. In the end, this change does several things:
- introduce a `reference` folder for reference implementations
- move the primary cmake flow from HiFi to reference, so that the default mode can run on x86
  - that means we will need a proper flag to use HiFi optimized ops, which we will add later
- add a `quantized_matmul` reference kernel

Reviewed By: dulinriley Differential Revision: D59238748 fbshipit-source-id: 830c89fe9ee8dd87ece963e1174ca3cbd1e0fbc6
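To make the idea of a reference kernel concrete, here is a minimal sketch of the arithmetic a `quantized_matmul` reference implementation typically performs, written in NumPy for illustration; the function name, quantization scheme, and signature are assumptions for this sketch, not the actual kernel's API:

```python
import numpy as np

def quantized_matmul_ref(x_q, x_scale, x_zero,
                         w_q, w_scale, w_zero,
                         out_scale, out_zero):
    # Dequantize, compute, requantize: clarity over speed, so it runs
    # anywhere (including plain x86) with no intrinsics.
    x = (x_q.astype(np.int32) - x_zero) * x_scale
    w = (w_q.astype(np.int32) - w_zero) * w_scale
    y = x @ w
    y_q = np.round(y / out_scale) + out_zero
    return np.clip(y_q, -128, 127).astype(np.int8)

x_q = np.random.randint(-128, 128, (4, 8), dtype=np.int8)
w_q = np.random.randint(-128, 128, (8, 3), dtype=np.int8)
y_q = quantized_matmul_ref(x_q, 0.05, 0, w_q, 0.02, 0, 0.1, 0)
print(y_q.shape)  # (4, 3)
```

The point of a reference kernel is exactly this shape: plain matrix math with no intrinsics, so it runs on x86 and serves as a correctness baseline for the optimized HiFi ops.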
Summary: Pull Request resolved: #4272 `MethodMeta` is the new way to get this information. Reviewed By: tarun292 Differential Revision: D59782278 fbshipit-source-id: 1e1df006ee95886aa80b9704bfda488e1ad93dcf
Summary: Pull Request resolved: #4264 This diff brings in the latest export serializer changes from `torch/_export/serde` and then fixes the exir serializer to use these changes. This is a temporary workaround until the serialization extensibility that zhxchen17 is working on is completed. After that, we won't have to copy over changes from the export serializer manually; instead, we'll just extend it to handle things like delegate calls and other edge dialect specific things. For now we need this to unblock the ASR use case, hence I'm manually syncing and updating. Reviewed By: JacobSzwejbka Differential Revision: D57071033 fbshipit-source-id: 4408e2a0740b661e3a3555f800d1567ef10d4ea8
Summary: This PR moves the tiktoken and BPE tokenizers into `extension/llm/tokenizer` so that they can be reused by other models. Note: currently tiktoken has two sets of unit tests based on llama2's tokenizer:
- default
- multimodal

This PR only moves the default unit tests into extension and keeps the multimodal unit tests inside llama2/tokenizer. Pull Request resolved: #4278 Test Plan:
- test/run_oss_cpp_tests.sh examples/models/llama2/tokenizer/test
- test/run_oss_cpp_tests.sh extension/llm/tokenizer/test

Reviewed By: larryliu0820 Differential Revision: D59822702 Pulled By: helunwencser fbshipit-source-id: 5d51ba3e44c9b2d9dc77b9f4349b58947ed68502
Summary: Pull Request resolved: facebookincubator/fizz#142 Reviewed By: namanahuja Differential Revision: D59832580 Pulled By: ahornby fbshipit-source-id: 1b936a007e5d08f7bc959d5775bce36b107f4bb3
Summary: Our [stable branch](https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50) has been falling behind for a month. It seems like the android job keeps failing to consume the artifacts fetched from S3; see #4285 for details. To unblock the stable branch, this PR temporarily disables the S3 workflow; it will be re-enabled as a periodic job in #4286 once #4285 is fixed. Pull Request resolved: #4287 Reviewed By: dbort Differential Revision: D59839152 Pulled By: guangy10 fbshipit-source-id: 5ab85aa592c32e7a9048845cf9088eb39573b7ce
Summary: Pull Request resolved: #4269 as title ghstack-source-id: 234090425 Reviewed By: dbort Differential Revision: D59770172 fbshipit-source-id: 4bb63b2497cd3ddb04726da6fb8cefb0a2add391
Summary: Pull Request resolved: #4290. Reviewed By: helunwencser Differential Revision: D59865664 fbshipit-source-id: 2dfcfc90194dc366ab9811bfa73d5f1b44872255
Summary: Pull Request resolved: #4273

## Motivation
`run_decompositions()` has new preserve_ops functionality which allows us to specify which ops we want to refrain from decomposing. This is super helpful for the to_edge_transform_and_lower api because it allows us to preserve ops produced by decompositions that occur beyond the first level. For example, consider LSTM. When exported using torch.export, we see a torch.ops.aten.LSTM() operator in the graph. When running decompositions, this is decomposed into linear, and then further decomposed into addmm. Since the linear op is produced from decomposing LSTM and does not exist until after we run_decompositions(), we cannot perform our trick of changing the namespace to prevent its decomposition. However, now using `_preserve_ops=(torch.ops.aten.linear.default,)` we are able to prevent this second-level decomposition.

## API Implementation Change
In the implementation we do two passes. In the first pass, we run_decompositions() preserving all aten ops specified by our partitioners using `_preserve_ops`. In the second pass, we further filter which aten ops should be preserved by using the check_op_fn given to us by partitioners. We then use our namespace trick to prevent the decomposition of all aten ops which pass our check_op_fn.

## Testing Changes
To strengthen our tests, I first change the functionality of the NonDecompPartitioner. We partition only pre-decomp aten ops, and each of these ops lives within its own delegate (this allows us to have a 1:1 mapping between call_delegate and pre-decomp aten nodes). In testing, this will allow us to ensure that the number of ops to be preserved is correct by counting the number of delegate calls. We then count the number of aten ops which should correctly be preserved, and check after the fact that each of these preserved ops is (1) no longer in the graph after to_edge_transform_and_lower and (2) transformed into a call_delegate node.

Reviewed By: tarun292 Differential Revision: D59786323 fbshipit-source-id: 7ea946e0d5afc8ebddd26913f6e843305116ad3b
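A hedged sketch of the LSTM/linear example described above, using the APIs named in this summary; whether the private `_preserve_ops` kwarg is still spelled this way depends on the torch version, so treat the exact argument as an assumption taken from the summary:

```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out

ep = export(M(), (torch.randn(2, 4, 8),))

# Default decomposition: LSTM is lowered to linear, which is then lowered to
# addmm, so aten.linear never survives to the edge dialect.
fully_decomposed = ep.run_decompositions()

# With _preserve_ops, aten.linear is kept even though it only appears after the
# first level of decomposition (it is produced by decomposing LSTM).
partially_decomposed = ep.run_decompositions(
    _preserve_ops=(torch.ops.aten.linear.default,)
)
```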
…4163) Summary:
- add utilities for loading context binary generated from qnn tools
- align env variable naming with qnn
- fix bug in online prepare and extend coverage to support bitwise quantization
- llama7b e2e example from qualcomm ai_hub
- minor fixes for style & typos

Pull Request resolved: #4163 Reviewed By: swolchok, kirklandsign Differential Revision: D59737140 Pulled By: cccclai fbshipit-source-id: 16e98d7f5da7204a2d04258fd75dabd8aa1eaa7d
Summary: Pull Request resolved: #4293 This diff parses the logged intermediate outputs in etdump into Inspector objects. It pretty much automatically works because the infra has already been built out for non-delegate intermediate output logging. The only change needed here is to add delegate debug id to `InstructionEventSignature` and `DebugEventSignature` so they group delegated ops correctly. Design doc: https://docs.google.com/document/d/1qGHsgd-roqtxPz4CrUlqGrKaAtbaf9bArziMuwHD0So/edit Reviewed By: Jack-Khuu Differential Revision: D59840296 fbshipit-source-id: 04f22d4520584090f3b37b83386f704cc7a2c271
Summary: Pull Request resolved: #4256 Llava uses HF RoPE, so we are adding a config to switch between our stock RoPE and HF RoPE. We may be able to consolidate them, but that can come later. Reviewed By: helunwencser Differential Revision: D59759975 fbshipit-source-id: 9c3a1825b82f0f32e15fb06e2f73d73e2bacba0c
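For context on why a RoPE switch is needed at all: the original llama-style RoPE rotates adjacent (interleaved) pairs of channels, while the Hugging Face implementation rotates pairs split across the two halves of the head dimension. The NumPy sketch below illustrates that commonly cited difference; whether this is exactly what the new config toggles here is an assumption.

```python
import numpy as np

def rope_interleaved(x, theta):
    # "Stock" llama-style RoPE: treat consecutive pairs (x0, x1), (x2, x3), ...
    # as complex numbers and rotate each pair by its angle.
    pairs = x.reshape(-1, 2)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                    pairs[:, 0] * sin + pairs[:, 1] * cos], axis=-1)
    return out.reshape(-1)

def rope_half_split(x, theta):
    # HF-style RoPE: split the head dim into two halves and rotate
    # (x_i, x_{i + d/2}) pairs instead of adjacent pairs.
    d = x.shape[-1] // 2
    x1, x2 = x[:d], x[d:]
    cos, sin = np.cos(theta), np.sin(theta)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

x = np.arange(8, dtype=np.float32)
theta = np.linspace(0.1, 0.4, 4)   # one angle per rotated pair
print(rope_interleaved(x, theta))  # differs from the half-split convention
print(rope_half_split(x, theta))
```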
Summary: Pull Request resolved: #4262 This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from memory, using the following shader, where A and B are readonly and writeonly buffers, respectively.

    void main() {
      vec4 sum = vec4(0);
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i) {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
      }

      vec4 zero = vec4(i >> 31);
      B[gl_LocalInvocationID[0]] = sum + zero;
    }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID. Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops. For a Samsung S22, the bandwidth behaves like this. We can see a limitation when buffers reach 32 KB in size. {F1759406621} Reviewed By: SS-JIA Differential Revision: D59687299 fbshipit-source-id: 5a97a2c2b0bf077c575de55d23061d5597ba385d
Summary: Pull Request resolved: #4270 This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from UBOs, using the following shader, where A is a UBO and B is a writeonly buffer.

    void main() {
      vec4 sum = vec4(0);
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i) {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
      }

      vec4 zero = vec4(i >> 31);
      B[gl_LocalInvocationID[0]] = sum + zero;
    }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID. Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops. For a Samsung S22, the bandwidth behaves like this. We can see a decline proportional to the size of the buffer, until it plateaus at 32 KB. {F1759559978} Comparing it with the readonly-buffer profiler, we can immediately see the superior reading speed of UBOs, whenever the hardware supports it. Samsung S22: {F1759560675} Redmi Note: {F1759445004} Reviewed By: copyrightly Differential Revision: D59776899 fbshipit-source-id: 0f93186833bbe3610c5b5bf68cda519a7b467aca
Summary: Pull Request resolved: #4277 This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from shared memory, using the following shader, where A is a shared buffer and B is a writeonly buffer.

    shared vec4 A[nvec];

    void main() {
      vec4 sum = vec4(0);
      const uint workgroup_width = local_group_size * niter * ${NUNROLL};
      uint offset = (gl_WorkGroupID[0] * workgroup_width + gl_LocalInvocationID[0]) & addr_mask;

      int i = 0;
      for (; i < niter; ++i) {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
      }

      vec4 zero = vec4(i >> 31);
      B[gl_LocalInvocationID[0]] = sum + zero;
    }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID. Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops. For a Samsung S22, the bandwidth behaves like this. We can see that accessing the shared memory has a constant latency, until it reaches the maximum shared memory size. NOTE: the graph is extended for visualization purposes; the experiment stops before the drop, because otherwise it would crash. {F1759597657} Comparing it to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth. {F1759600867} Reviewed By: copyrightly Differential Revision: D59811152 fbshipit-source-id: 537be13dbec1a02cb55e689db2a0fd548613c729
…nts (#4292) Summary: Pull Request resolved: #4292 Some simple improvements to the SPIR-V compilation script:
1. Allow `layout_declare_tensor` to create a scalar buffer instead of always creating a vectorized buffer
2. Allow handling of non-string (e.g. int) values in shader codegen YAML configurations.

Reviewed By: jorgep31415 Differential Revision: D59877805 fbshipit-source-id: 579888fbc19d19a0d24f2fbd831e74f4ba32f033
Summary: changed `rm- -rf` to `rm -rf` Pull Request resolved: #4296 Reviewed By: lucylq Differential Revision: D59921910 Pulled By: dbort fbshipit-source-id: bab21a39faae4db53ff4b04c02598f27c535d3ce
Summary: We want to reuse the same demo app to benchmark as many models as possible. It may not be easy to create a super generic app for all types of models, but we can reuse our existing demo apps to swap in different models performing the same task, e.g. our llama demo should be able to benchmark different causal LLMs without problems. To do this, we need to organize the build vertically by demo app. Currently we have two demo apps for android (the ios demo app will follow the same rule); this PR addresses the llama demo. The android job 'build-llm-demo' is going to build different flavors of the same app by android-abi and tokenizer library. Downstream, an app built for arm with the BPE tokenizer could be used to benchmark all LLMs using the BPE tokenizer on a physical android device. Pull Request resolved: #4288 Reviewed By: huydhn, kirklandsign Differential Revision: D59874919 Pulled By: guangy10 fbshipit-source-id: 11bf280765af9ddd4e5459e47c859cc8d37b3848
Summary: Pull Request resolved: #4299 Remove Buck2 reference Reviewed By: lucylq Differential Revision: D59922659 fbshipit-source-id: 12f7c59e9ea23afd743435115e9bd5b5afc825d4
…4279) Summary: Pull Request resolved: #4279 I left review feedback for this diff late; applying it (and broadening use of kwargs.get() while I'm here). ghstack-source-id: 234276753 Reviewed By: kimishpatel Differential Revision: D59823492 fbshipit-source-id: 0fc8ec2861d2eb2f19bb38ba885024d532f20f44
Summary: According to the comment in #4288, add the uploading step back so that TorchChat can consume the artifacts from S3. Pull Request resolved: #4300 Reviewed By: huydhn, kirklandsign Differential Revision: D59925505 Pulled By: guangy10 fbshipit-source-id: ce389fb16adb30d51240fdff655111580f07130b
Summary: Pull Request resolved: #4259 Reviewed By: helunwencser Differential Revision: D59759978 fbshipit-source-id: 8ff8a5b24481b28e0814b45f60b4b0fdbfd47e4e
Summary: Pull Request resolved: #4295 As titled. Pending CI job Reviewed By: helunwencser Differential Revision: D59901269 fbshipit-source-id: 0f32357830a677736ac3123526653bff70c8c7af
Summary: Pull Request resolved: #4291 Starts to completely deprecate DataLoader::Load by converting existing callsites to DataLoader::load, and making DataLoader::load pure virtual. Searched for all relevant callsites by doing the following:
- Grep for all instances of `core/data_loader.h>` to find all direct imports of DataLoader, finding "Load(" in the file
- Grep for all instances of `executorch\/.*_data_loader.h>` to find all imports of subclasses derived from DataLoader, finding "Load(" in the file
- Grep for all instances of "->Load(" in files with "executorch" in the path
- Grep for all instances of "loader->Load(" and "loader_->Load(" in the entire codebase

Reviewed By: dbort Differential Revision: D59767505 fbshipit-source-id: d8a6d998f957dcba291815312f8bde54f84c3100
Summary: Return an array with size 0, to allow getting array length in code below. Pull Request resolved: #4266 Reviewed By: cccclai Differential Revision: D59764473 Pulled By: kirklandsign fbshipit-source-id: f00eac32879c31a983156c4651714ccf1e1ec280
Summary: Pull Request resolved: #4265 Reviewed By: guangy10 Differential Revision: D59763471 Pulled By: kirklandsign fbshipit-source-id: 33c3fbe2d0434daab5504856f91f15407e58d463
Summary: Pull Request resolved: #4298 This very simple metric runs a kernel across an increasing number of workgroups, until there is a noticeable increase in latency, as seen in the following graph: {F1762497995} The shader uses an integer division as its metric, because it is a multi-cycle operation that puts the ALU to work and stops the SM from context switching. As with other metrics, we start by obtaining the minimum number of iterations, NITER, that can run in 1000us, so as to have a baseline for comparison and reduce timing noise. With this number of iterations, we run the kernel with an increasing number of threads. We also use a multidimensional global workgroup with a Y size of 1024 in hopes of saturating the ALUs and having a better point of reference for the latency caused by adding warps. Once we detect a jump in latency, we can assume that that thread count is the warp size. More information can be found [here](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf) on page 5. Reviewed By: jorgep31415 Differential Revision: D59920169 fbshipit-source-id: 4ac9324e10f0ab1a72433fd7ce98ad5f5ab839e9
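A small sketch of the post-processing step this describes, i.e. taking the measured (thread count, latency) pairs and picking the first noticeable jump; the 1.5x threshold and the data are illustrative assumptions, not the profiler's actual heuristic:

```python
def estimate_warp_size(samples):
    """samples: list of (num_threads, latency_us) pairs, sorted by num_threads."""
    for (_, prev_latency), (threads, latency) in zip(samples, samples[1:]):
        if latency > 1.5 * prev_latency:  # first noticeable jump in latency
            return threads
    return None  # no jump observed in the measured range

# Illustrative data: latency stays roughly flat until the jump at one warp.
print(estimate_warp_size([(8, 10.1), (16, 10.2), (32, 10.4), (64, 20.3)]))  # 64
```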
Summary: Pull Request resolved: #4306 Reviewed By: kirklandsign Differential Revision: D59977405 Pulled By: guangy10 fbshipit-source-id: 2ce889e6f49ade545a668244db8aab6e7f7bef01
Summary: This PR is auto-generated nightly by [this action](https://github.com/pytorch/executorch/blob/main/.github/workflows/nightly.yml). Update the pinned pytorch hash. Pull Request resolved: #4313 Reviewed By: kirklandsign Differential Revision: D59983250 Pulled By: guangy10 fbshipit-source-id: eec0a71936aad9642b958ce4ac222011a8d0025d
Summary: Pull Request resolved: #4322 We retrofitted flash attention cpu from aten. The retrofit we did was to make it calculate attention for (a) batched prefill and (b) decode with different start_pos. For (b), there was a bug when the KV cache's seqlen dim is split; as a result, the attention calculation was not correct. There is a detail in the code to explain the issue. bypass-github-export-checks ghstack-source-id: 234634902 Reviewed By: larryliu0820 Differential Revision: D60011925 fbshipit-source-id: 50921846b329e449a4a767cf28c7a55d507217bd
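Not the flash-attention kernel itself, but a minimal NumPy sketch of the computation case (b) has to reproduce: single-token decode attending over a pre-allocated KV cache addressed by start_pos (shapes and names are illustrative assumptions):

```python
import numpy as np

def decode_attention(q, k_cache, v_cache, start_pos):
    # q: (head_dim,) query for the token at position start_pos
    # k_cache, v_cache: (max_seq_len, head_dim); rows [0, start_pos] are valid
    k = k_cache[: start_pos + 1]
    v = v_cache[: start_pos + 1]
    scores = k @ q / np.sqrt(q.shape[-1])  # (start_pos + 1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                      # (head_dim,)

head_dim, max_seq_len = 8, 16
q = np.random.randn(head_dim)
k_cache = np.random.randn(max_seq_len, head_dim)
v_cache = np.random.randn(max_seq_len, head_dim)
print(decode_attention(q, k_cache, v_cache, start_pos=5).shape)  # (8,)
```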
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Differential Revision: D60967580 Pull Request resolved: #4560
Differential Revision: D61057535 Pull Request resolved: #4657
Differential Revision: D61141396 Pull Request resolved: #4670
Differential Revision: D60424030 Pull Request resolved: #4453
Differential Revision: D60399589 Pull Request resolved: #4443
Differential Revision: D61044259 Pull Request resolved: #4649
Differential Revision: D61150844 Pull Request resolved: #4673
* allow models to use customized token ids during export (#4649) Summary: Llama 3.1's [bos and eos](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/tokenizer_config.json) are different from what is hardcoded in the code. This PR updates the export flow to allow reading customized token ids instead of hardcoded ones. It also deletes a few metadata entries that are not used by the runner. Pull Request resolved: #4649 Differential Revision: D61044259 Pulled By: helunwencser
* Do not print eos Summary: We don't want to print eos in the response because some eos tokens could be `<|end_of_text|>`. Differential Revision: D61048254
---------
Co-authored-by: Lunwen He <[email protected]>
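A hedged illustration of the idea (deriving bos/eos ids from the model's own tokenizer metadata rather than hardcoding llama2's values); the file name, keys, and fallbacks below are assumptions for the sketch, not the actual export flow:

```python
import json

# Illustrative file: some per-model config that records the special token ids.
with open("special_tokens.json") as f:
    cfg = json.load(f)

# Fall back to llama2's well-known hardcoded values if the model
# does not override them.
bos_id = cfg.get("bos_token_id", 1)
eos_id = cfg.get("eos_token_id", 2)

print(f"exporting with bos={bos_id}, eos={eos_id}")
```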
LGTM after the linter errors are fixed.
Differential Revision: D61147536 Pull Request resolved: #4671
Differential Revision: D61054615 Pull Request resolved: #4642
Differential Revision: D61166041 Pull Request resolved: #4678 --------- Co-authored-by: helunwencser <[email protected]>
Differential Revision: D61108863 Pull Request resolved: #4664
Differential Revision: D61141050 Pull Request resolved: #4662
…o a class" As titled. This PR moves the token generation loop in llama2 runner into a new class so it can be reused. Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601) [ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into a new class so it can be reused. Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601) [ghstack-poisoned]
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…o a class" As titled. This PR moves the token generation loop in llama2 runner into a new class so it can be reused. Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601) [ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into a new class so it can be reused. Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601) [ghstack-poisoned]
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Commit 3226daa merged into gh/kimishpatel/47/base
Stack from ghstack (oldest at bottom):
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.
Differential Revision: D61047601
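For a sense of the shape of the new class, here is a minimal sketch of a reusable token generation loop. The actual runner is C++; this Python sketch only illustrates the structure being factored out, and every name in it is illustrative rather than the real API:

```python
class TokenGenerator:
    """Owns the decode loop so multiple runners (llama2, llava, ...) can reuse it."""

    def __init__(self, model, tokenizer, eos_ids, max_seq_len):
        self.model = model            # assumed to expose forward(tokens, start_pos) -> logits
        self.tokenizer = tokenizer    # assumed to expose decode(token_id) -> str
        self.eos_ids = set(eos_ids)
        self.max_seq_len = max_seq_len

    def generate(self, prompt_tokens, start_pos, on_token=print):
        tokens = list(prompt_tokens)
        pos = start_pos
        cur = tokens[-1]
        while pos < self.max_seq_len:
            logits = self.model.forward([cur], pos)  # one decode step against the KV cache
            cur = int(logits.argmax())               # greedy sampling keeps the sketch simple
            pos += 1
            if cur in self.eos_ids:
                break
            on_token(self.tokenizer.decode(cur))
            tokens.append(cur)
        return tokens
```

Factoring the loop out this way lets a multimodal runner like llava reuse the same stopping, sampling, and streaming logic after its own prefill step, which is the motivation stated in this PR.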