
[llava][18/N] Move token generation loop to a class #4652


Merged
merged 1,328 commits into gh/kimishpatel/47/base on Aug 13, 2024

Conversation

larryliu0820
Contributor

@larryliu0820 larryliu0820 commented Aug 9, 2024

Stack from ghstack (oldest at bottom):

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.
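
For context, below is a minimal sketch of what such a reusable generation-loop class could look like. The `Decoder`, `Tokenizer`, and `TokenGenerator` interfaces are illustrative assumptions for this sketch, not the actual ExecuTorch APIs.

```cpp
// Illustrative sketch only -- the interfaces below are hypothetical
// stand-ins, not the actual ExecuTorch runner APIs.
#include <cstdint>
#include <functional>
#include <string>

struct Decoder {
  // Runs one decode step and returns the next token id.
  virtual uint64_t step(uint64_t token, int64_t pos) = 0;
  virtual ~Decoder() = default;
};

struct Tokenizer {
  virtual std::string decode(uint64_t prev, uint64_t cur) = 0;
  virtual uint64_t eos_id() const = 0;
  virtual ~Tokenizer() = default;
};

// The piece this PR factors out of the llama2 runner: a self-contained
// token generation loop that other runners (e.g. llava) can reuse.
class TokenGenerator {
 public:
  TokenGenerator(Decoder* decoder, Tokenizer* tokenizer)
      : decoder_(decoder), tokenizer_(tokenizer) {}

  // Generates up to `max_new_tokens` tokens starting from `cur_token` at
  // position `start_pos`, streaming each decoded piece to `on_piece`.
  int64_t generate(
      uint64_t cur_token,
      int64_t start_pos,
      int64_t max_new_tokens,
      const std::function<void(const std::string&)>& on_piece) {
    int64_t pos = start_pos;
    uint64_t prev = cur_token;
    for (int64_t i = 0; i < max_new_tokens; ++i) {
      uint64_t next = decoder_->step(prev, pos++);
      on_piece(tokenizer_->decode(prev, next));
      if (next == tokenizer_->eos_id()) {
        break;  // stop on end-of-sequence
      }
      prev = next;
    }
    return pos - start_pos;  // number of tokens generated
  }

 private:
  Decoder* decoder_;
  Tokenizer* tokenizer_;
};
```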

Differential Revision: D61047601

mcremon-meta and others added 30 commits July 16, 2024 13:41
…#4108)

Summary:
Pull Request resolved: #4108

We want to be able to run the reference implementations on x86, so we don't want any intrinsics or anything like that in the reference kernels.

In the end, this change has a lot of things:
- introduce a `reference` folder for reference implementations
- move the primary cmake flow from HiFi to reference, so that the default mode can run on x86
- that means we will need a proper flag to use HiFi optimized ops, which we will add later
- add a `quantized_matmul` reference kernel (see the illustrative sketch below)
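
As a rough illustration of what a reference (no-intrinsics) kernel looks like, here is a naive quantized matmul sketch; the int8 layout, per-tensor scale/zero-point scheme, and function signature are assumptions for illustration, not the actual Cadence kernel interface.

```cpp
#include <cmath>
#include <cstdint>

// Naive reference quantized matmul: C(MxN) = A(MxK) * B(KxN), all int8 with
// per-tensor scale/zero-point. Plain loops, no intrinsics, so it runs
// unchanged on x86. Signature and layout are illustrative assumptions.
void quantized_matmul_reference(
    const int8_t* a, float a_scale, int32_t a_zero_point,
    const int8_t* b, float b_scale, int32_t b_zero_point,
    int8_t* c, float c_scale, int32_t c_zero_point,
    int m, int k, int n) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int32_t acc = 0;
      for (int p = 0; p < k; ++p) {
        acc += (static_cast<int32_t>(a[i * k + p]) - a_zero_point) *
               (static_cast<int32_t>(b[p * n + j]) - b_zero_point);
      }
      // Requantize the int32 accumulator into the output's int8 domain.
      float real_value = static_cast<float>(acc) * a_scale * b_scale;
      int32_t q = static_cast<int32_t>(std::lround(real_value / c_scale)) +
                  c_zero_point;
      if (q > 127) q = 127;
      if (q < -128) q = -128;
      c[i * n + j] = static_cast<int8_t>(q);
    }
  }
}
```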

Reviewed By: dulinriley

Differential Revision: D59238748

fbshipit-source-id: 830c89fe9ee8dd87ece963e1174ca3cbd1e0fbc6
Summary:
Pull Request resolved: #4272

`MethodMeta` is the new way to get this information.

Reviewed By: tarun292

Differential Revision: D59782278

fbshipit-source-id: 1e1df006ee95886aa80b9704bfda488e1ad93dcf
Summary:
Pull Request resolved: #4264

This diff brings in the latest export serializer changes from `torch/_export/serde` and then fixes the exir serializer to use these changes. This is a temporary workaround until the serialization extensibility that zhxchen17 is working on is completed. Then we won't have to copy over changes from the export serializer manually; instead we'll just extend it to handle things like delegate calls and other edge dialect specific things.

For now we need this to unblock the ASR use case, hence I'm manually syncing and updating it for now.

Reviewed By: JacobSzwejbka

Differential Revision: D57071033

fbshipit-source-id: 4408e2a0740b661e3a3555f800d1567ef10d4ea8
Summary:
This PR moves the tiktoken and bpe tokenizers into `extension/llm/tokenizer` so that they can be reused by other models.

Note: currently tiktoken has two sets of unit tests based on llama2's tokenizer:
- default
- multimodal

This PR only moves the default unit test into extension and keeps the multimodal unit tests inside llama2/tokenizer.

Pull Request resolved: #4278

Test Plan:
- test/run_oss_cpp_tests.sh examples/models/llama2/tokenizer/test
- test/run_oss_cpp_tests.sh extension/llm/tokenizer/test

Reviewed By: larryliu0820

Differential Revision: D59822702

Pulled By: helunwencser

fbshipit-source-id: 5d51ba3e44c9b2d9dc77b9f4349b58947ed68502
Summary: Pull Request resolved: facebookincubator/fizz#142

Reviewed By: namanahuja

Differential Revision: D59832580

Pulled By: ahornby

fbshipit-source-id: 1b936a007e5d08f7bc959d5775bce36b107f4bb3
Summary:
Our [stable branch](https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50) has been falling behind for a month. It seems like the Android job keeps failing to consume the artifacts fetched from S3. See #4285 for details.

To unblock the stable branch, this PR temporarily disables the S3 workflow; it will be re-enabled as a periodic job in #4286 once #4285 is fixed.

Pull Request resolved: #4287

Reviewed By: dbort

Differential Revision: D59839152

Pulled By: guangy10

fbshipit-source-id: 5ab85aa592c32e7a9048845cf9088eb39573b7ce
Summary:
Pull Request resolved: #4269

as title
ghstack-source-id: 234090425

Reviewed By: dbort

Differential Revision: D59770172

fbshipit-source-id: 4bb63b2497cd3ddb04726da6fb8cefb0a2add391
Summary:
Pull Request resolved: #4290

.

Reviewed By: helunwencser

Differential Revision: D59865664

fbshipit-source-id: 2dfcfc90194dc366ab9811bfa73d5f1b44872255
Summary:
Pull Request resolved: #4273

## Motivation
`run_decompositions()` has a new preserve_ops functionality which allows us to specify which ops we want to refrain from decomposing. This is super helpful for the to_edge_transform_and_lower API because it allows us to preserve ops that only appear beyond the first level of decomposition.

For example, consider LSTM. When exported using torch.export, we see a torch.ops.aten.LSTM() operator in the graph. When running decompositions, this is decomposed into linear, which is then further decomposed into addmm. Since the linear op is produced by decomposing LSTM and does not exist until after we run_decompositions(), we cannot use our trick of changing the namespace to prevent its decomposition. However, using `_preserve_ops=(torch.ops.aten.linear.default,)` we are now able to prevent this second-level decomposition.

## API Implementation Change
The implementation does two passes. In the first pass, we call run_decompositions(), preserving all aten ops specified by our partitioners via `_preserve_ops`. In the second pass, we further filter which aten ops should be preserved using the check_op_fn given to us by partitioners, and then use our namespace trick to prevent the decomposition of all aten ops that pass check_op_fn.

## Testing Changes
To strengthen our tests, I first change the functionality of the NonDecompPartitioner: we partition only pre-decomp aten ops, and each of these ops lives within its own delegate (giving us a 1:1 mapping between call_delegate and pre-decomp aten nodes). In testing, this lets us verify that the number of preserved ops is correct by counting the number of call_delegate nodes.

In testing we then count the number of aten ops which should be preserved, and check after the fact that each of these ops:
1. is no longer in the graph after to_edge_transform_and_lower, and
2. has been transformed into a call_delegate node.

Reviewed By: tarun292

Differential Revision: D59786323

fbshipit-source-id: 7ea946e0d5afc8ebddd26913f6e843305116ad3b
…4163)

Summary:
- add utilities for loading context binary generated from qnn tools
- align env variable naming with qnn
- fix bug in online prepare and extend coverage to support bitwise quantization
- llama7b e2e example from qualcomm ai_hub
- minor fixes for style & typos

Pull Request resolved: #4163

Reviewed By: swolchok, kirklandsign

Differential Revision: D59737140

Pulled By: cccclai

fbshipit-source-id: 16e98d7f5da7204a2d04258fd75dabd8aa1eaa7d
Summary:
Pull Request resolved: #4293

This diff parses the logged intermediate outputs in etdump into Inspector objects. It pretty much automatically works because the infra has already been built out for non-delegate intermediate output logging. The only change needed here is to add delegate debug id to `InstructionEventSignature` and `DebugEventSignature` so they group delegated ops correctly.
Design doc: https://docs.google.com/document/d/1qGHsgd-roqtxPz4CrUlqGrKaAtbaf9bArziMuwHD0So/edit

Reviewed By: Jack-Khuu

Differential Revision: D59840296

fbshipit-source-id: 04f22d4520584090f3b37b83386f704cc7a2c271
Summary:
Pull Request resolved: #4256

Llava uses HF RoPE, so this adds a config to switch between our stock RoPE and HF RoPE. We may be able to consolidate them later.

Reviewed By: helunwencser

Differential Revision: D59759975

fbshipit-source-id: 9c3a1825b82f0f32e15fb06e2f73d73e2bacba0c
Summary:
Pull Request resolved: #4262

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from memory, using the following shader, where A and B are readonly and writeonly buffers, respectively.

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.
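
To make the `addr_mask` trick concrete, here is a small host-side sketch (plain C++, not the shader itself) showing how masking the running offset confines it to a power-of-two number of unique addresses; the `nvec` and stride values are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // For a power of two m, (x & (m - 1)) == (x % m), so the mask confines
  // the running offset to the first `nvec` addresses of the buffer.
  const uint32_t nvec = 8;              // unique addresses we allow
  const uint32_t addr_mask = nvec - 1;  // 0b0111
  const uint32_t stride = 3;            // stand-in for local_group_size

  uint32_t offset = 0;
  for (int i = 0; i < 12; ++i) {
    std::printf("iter %2d: offset = %u\n", i, offset);  // always in [0, 7]
    offset = (offset + stride) & addr_mask;  // same update as the shader
  }
  return 0;
}
```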

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see a limitation when buffers reach 32 KB in size.

{F1759406621}

Reviewed By: SS-JIA

Differential Revision: D59687299

fbshipit-source-id: 5a97a2c2b0bf077c575de55d23061d5597ba385d
Summary:
Pull Request resolved: #4270

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from UBOs, using the following shader, where A is a UBO and B is a writeonly buffer.

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see a decline proportional to the size of the buffer, until it plateaus at 32 KB.

{F1759559978}

Comparing it with the read-only buffer profiler, we can immediately see the superior read bandwidth of UBOs whenever the hardware supports them.

Samsung S22
{F1759560675}

Redmi Note
{F1759445004}

Reviewed By: copyrightly

Differential Revision: D59776899

fbshipit-source-id: 0f93186833bbe3610c5b5bf68cda519a7b467aca
Summary:
Pull Request resolved: #4277

This diff introduces a profiler that obtains the maximum and minimum bandwidth for reading unique addresses from shared memory, using the following shader, where A is a shared-memory array and B is a writeonly buffer.

  shared vec4 A[nvec];

  void main() {
    vec4 sum = vec4(0);
    const uint workgroup_width = local_group_size * niter * ${NUNROLL};
    uint offset = (gl_WorkGroupID[0] * workgroup_width  + gl_LocalInvocationID[0]) & addr_mask;

    int i = 0;
    for (; i < niter; ++i)
    {
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
        ...
        ...
        sum *= A[offset];
        offset = (offset + local_group_size) & addr_mask;
    }

    vec4 zero = vec4(i>>31);

    B[gl_LocalInvocationID[0]] = sum + zero;
  }

The address mask allows us to control how many unique addresses we are accessing. If the number of unique vectors we want to read is 3, the offset will jump between three unique addresses throughout the iterations, giving us the bandwidth for that specific size of data. If the size of the unique data read is larger than the work group size, then each run will have its own block of data to read, defined by the initial offset calculation, where the offset is obtained through the workgroup ID and the local invocation ID.

Finally, we make sure to use the `sum` and `i` variables so that the compiler's optimizer does not flatten the loops.

For a Samsung S22, the bandwidth behaves like this. We can see that accessing the shared memory has a constant latency, until it reaches the Maximum Shared Memory size.

NOTE: The graph is extended for visualization purposes; the experiment stops before the drop because otherwise it would crash.

{F1759597657}

Comparing it to OpenCL, we can observe that, although the behavior is the same, Vulkan has an increased bandwidth.

{F1759600867}

Reviewed By: copyrightly

Differential Revision: D59811152

fbshipit-source-id: 537be13dbec1a02cb55e689db2a0fd548613c729
…nts (#4292)

Summary:
Pull Request resolved: #4292

Some simple improvements to the SPIR-V compilation script:

1. Allow `layout_declare_tensor` to create a scalar buffer instead of always creating a vectorized buffer
2. Allow handling of non-string (i.e. int) values in shader codegen YAML configurations.

Reviewed By: jorgep31415

Differential Revision: D59877805

fbshipit-source-id: 579888fbc19d19a0d24f2fbd831e74f4ba32f033
Summary:
changed `rm- -rf` to `rm -rf`

Pull Request resolved: #4296

Reviewed By: lucylq

Differential Revision: D59921910

Pulled By: dbort

fbshipit-source-id: bab21a39faae4db53ff4b04c02598f27c535d3ce
Summary:
We want to reuse the same demo app to benchmark as many models as possible. It may not be easy to create a super generic app for all types of models, but we can reuse our existing demo apps and swap in different models performing the same task; e.g., our llama demo should be able to benchmark different causal LLMs without problems. To do this, we need to organize the build vertically by demo app. Currently we have two demo apps for Android (the iOS demo app will follow the same rule); this PR addresses the llama demo. The android job 'build-llm-demo' builds different flavors of the same app by android-abi and tokenizer library. Downstream, an app built for arm with the bpe tokenizer could be used to benchmark all LLMs using the bpe tokenizer on a physical Android device.

Pull Request resolved: #4288

Reviewed By: huydhn, kirklandsign

Differential Revision: D59874919

Pulled By: guangy10

fbshipit-source-id: 11bf280765af9ddd4e5459e47c859cc8d37b3848
Summary:
Pull Request resolved: #4299

Remove Buck2 reference

Reviewed By: lucylq

Differential Revision: D59922659

fbshipit-source-id: 12f7c59e9ea23afd743435115e9bd5b5afc825d4
…4279)

Summary:
Pull Request resolved: #4279

I left review feedback for this diff late; applying it (and broadening use of kwargs.get() while I'm here).
ghstack-source-id: 234276753

Reviewed By: kimishpatel

Differential Revision: D59823492

fbshipit-source-id: 0fc8ec2861d2eb2f19bb38ba885024d532f20f44
Summary:
Per the comment in #4288, add the upload step back so that TorchChat can consume the artifacts from S3.

Pull Request resolved: #4300

Reviewed By: huydhn, kirklandsign

Differential Revision: D59925505

Pulled By: guangy10

fbshipit-source-id: ce389fb16adb30d51240fdff655111580f07130b
Summary: Pull Request resolved: #4259

Reviewed By: helunwencser

Differential Revision: D59759978

fbshipit-source-id: 8ff8a5b24481b28e0814b45f60b4b0fdbfd47e4e
Summary:
Pull Request resolved: #4295

As titled. Pending CI job

Reviewed By: helunwencser

Differential Revision: D59901269

fbshipit-source-id: 0f32357830a677736ac3123526653bff70c8c7af
Summary:
Pull Request resolved: #4291

Starts to completely deprecate DataLoader::Load by converting existing callsites to DataLoader::load, and making DataLoader::load pure virtual.

Searched for all relevant callsites by doing the following:
- Grep for all instances of `core/data_loader.h>` to find all direct imports of DataLoader, finding "Load(" in the file
- Grep for all instances of `executorch\/.*_data_loader.h>` to find all imports of subclasses derived from DataLoader, finding "Load(" in the file
- Grep for all instances of "->Load(" in files with "executorch" in the path
- Grep for all instances of "loader->Load(" and "loader_->Load(" in the entire codebase
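
To illustrate the rename described above, here is a minimal sketch of the resulting interface shape and a converted callsite; the argument list, return type, and helper names are assumptions for illustration, not copied from the actual headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real buffer type returned by a loader.
struct FreeableBuffer {
  std::vector<uint8_t> data;
};

// After this change, the lowercase `load` is the only (pure virtual) entry
// point; the capitalized `Load` spelling is being deprecated.
class DataLoader {
 public:
  virtual ~DataLoader() = default;
  virtual FreeableBuffer load(size_t offset, size_t size) const = 0;
};

// Callsite conversion, as done throughout the codebase:
//   before: FreeableBuffer buf = loader->Load(offset, size);
//   after:  FreeableBuffer buf = loader->load(offset, size);
```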

Reviewed By: dbort

Differential Revision: D59767505

fbshipit-source-id: d8a6d998f957dcba291815312f8bde54f84c3100
Summary:
Return an array with size 0 so that the code below can get the array length.

Pull Request resolved: #4266

Reviewed By: cccclai

Differential Revision: D59764473

Pulled By: kirklandsign

fbshipit-source-id: f00eac32879c31a983156c4651714ccf1e1ec280
Summary: Pull Request resolved: #4265

Reviewed By: guangy10

Differential Revision: D59763471

Pulled By: kirklandsign

fbshipit-source-id: 33c3fbe2d0434daab5504856f91f15407e58d463
Summary:
Pull Request resolved: #4298

This very simple metric runs a kernel across an increasing number of workgroups, until there is a noticeable increase in latency, as seen in the following graph:

 {F1762497995}

The shader uses an integer division as its metric, because it is a multi-cycle operation that puts the ALU to work and stops the SM from context switching.

As with the other metrics, we start by obtaining the minimum number of iterations, NITER, that can run in 1000 us, so as to have a baseline for comparison and reduce timing noise. With this number of iterations, we run the kernel with an increasing number of threads. We also use a multidimensional global workgroup with a Y size of 1024, in hopes of saturating the ALUs and having a better point of reference for the latency caused by adding warps.

Once we detect a jump in latency, we can assume that the thread count at which it occurs is the warp size.

More information can be found [here](https://www.microsoft.com/en-us/research/uploads/prod/2022/02/mobigpu_mobicom22_camera.pdf) on page 5.
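
As a rough host-side sketch of the detection logic only (not the actual profiler code): `time_kernel_us` here is a hypothetical callback that dispatches the divide-heavy shader with the given thread count and returns its latency in microseconds.

```cpp
#include <functional>

// Sketch of the jump-detection logic. `time_kernel_us` is a hypothetical
// callback that runs the divide-heavy shader for NITER iterations with
// `threads` invocations and returns the measured latency in microseconds.
int detect_warp_size(
    const std::function<double(int)>& time_kernel_us,
    int max_threads = 128,
    double jump_factor = 1.5) {
  double baseline = time_kernel_us(1);
  for (int threads = 2; threads <= max_threads; ++threads) {
    double latency = time_kernel_us(threads);
    // Latency stays roughly flat while every added thread still fits in one
    // warp; the first clear jump means a second warp had to be scheduled.
    if (latency > baseline * jump_factor) {
      return threads - 1;
    }
    baseline = (baseline + latency) / 2.0;  // smooth out timing noise
  }
  return max_threads;  // no jump observed within the tested range
}
```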

Reviewed By: jorgep31415

Differential Revision: D59920169

fbshipit-source-id: 4ac9324e10f0ab1a72433fd7ce98ad5f5ab839e9
Summary: Pull Request resolved: #4306

Reviewed By: kirklandsign

Differential Revision: D59977405

Pulled By: guangy10

fbshipit-source-id: 2ce889e6f49ade545a668244db8aab6e7f7bef01
Summary:
This PR is auto-generated nightly by [this action](https://github.com/pytorch/executorch/blob/main/.github/workflows/nightly.yml).
Update the pinned pytorch hash.

Pull Request resolved: #4313

Reviewed By: kirklandsign

Differential Revision: D59983250

Pulled By: guangy10

fbshipit-source-id: eec0a71936aad9642b958ce4ac222011a8d0025d
Summary:
Pull Request resolved: #4322

We retrofitted flash attention cpu from aten. The retrofit we did was to make it calculate attention for a) batched prefill and b) decode with different start_pos. For b), there was a bug when the kv cache's seqlen dim is split; as a result, the attention calculation was not right. There is a detailed explanation of the issue in the code.

bypass-github-export-checks
ghstack-source-id: 234634902

Reviewed By: larryliu0820

Differential Revision: D60011925

fbshipit-source-id: 50921846b329e449a4a767cf28c7a55d507217bd
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

haowhsu-quic and others added 8 commits August 12, 2024 11:28
Differential Revision: D60967580

Pull Request resolved: #4560
Differential Revision: D61057535

Pull Request resolved: #4657
Differential Revision: D61141396

Pull Request resolved: #4670
Differential Revision: D60424030

Pull Request resolved: #4453
Differential Revision: D60399589

Pull Request resolved: #4443
Differential Revision: D61044259

Pull Request resolved: #4649
Differential Revision: D61150844

Pull Request resolved: #4673
* allow models to use customized token ids during export (#4649)

Summary:
Llama3.1's [bos and eos](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/tokenizer_config.json) are different from what is hardcoded in the code. This PR updates the export flow to allow reading customized token ids instead of hardcoded ones.

It also deletes a few metadata entries that are not used by the runner.

Pull Request resolved: #4649

Differential Revision: D61044259

Pulled By: helunwencser

* Do not print eos

Summary: We don't want to print eos in the response because some eos tokens could be `<|end_of_text|>`.

Differential Revision: D61048254

---------

Co-authored-by: Lunwen He <[email protected]>
Contributor

@lucylq lucylq left a comment


LGTM after the linter errors are fixed.

JacobSzwejbka and others added 5 commits August 12, 2024 15:47
Differential Revision: D61147536

Pull Request resolved: #4671
Differential Revision: D61054615

Pull Request resolved: #4642
Differential Revision: D61166041

Pull Request resolved: #4678

---------

Co-authored-by: helunwencser <[email protected]>
Differential Revision: D61108863

Pull Request resolved: #4664
Differential Revision: D61141050

Pull Request resolved: #4662
@larryliu0820 larryliu0820 changed the base branch from gh/larryliu0820/47/base to main August 13, 2024 04:27
@larryliu0820 larryliu0820 changed the base branch from main to gh/kimishpatel/47/base August 13, 2024 04:28
…o a class"

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
larryliu0820 added a commit that referenced this pull request Aug 13, 2024
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

ghstack-source-id: 1108ada
Pull Request resolved: #4652
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…o a class"

As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

Differential Revision: [D61047601](https://our.internmc.facebook.com/intern/diff/D61047601)

[ghstack-poisoned]
larryliu0820 added a commit that referenced this pull request Aug 13, 2024
As titled. This PR moves the token generation loop in llama2 runner into
a new class so it can be reused.

ghstack-source-id: 92ef9f2
Pull Request resolved: #4652
@larryliu0820
Contributor Author

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot merged commit 3226daa into gh/kimishpatel/47/base Aug 13, 2024
71 of 79 checks passed
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.