Conversation
This is absolutely huge @eaplatanios. I've been working on BERT for the past few months and can't wait to check this code out :).
@eaplatanios What do you think about restructuring the encoding techniques under a |
I'd like to see what can be done to help drive this to completion. I forget where we left things after the community meeting, but are the compilation errors around TangentVector and the activation functions in Attention.swift the biggest blocker right now? If there are areas you'd like me to look into, I'd be glad to do so.
Thanks @BradLarson ! The only blocker is the |
Please also let me know what you guys think should be pushed to swift-apis and what kind of restructuring and testing would be useful. :)
Models/Text/WeightDecayedAdam.swift (outdated)

extension Dense: Regularizable {
    public var regularizationValue: TangentVector {
        // TODO: This initializer is currently internal.
I filed TF-1077 to track the non-public `TangentVector` memberwise initializer issue.
That's great! Thanks a lot Dan! Also, what do you think of this solution for regularization? I don't really like it too much, but I also couldn't think of another easy way to support it. :/
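For readers skimming the thread, here is a rough sketch of the approach being discussed, assuming `Regularizable` exposes a `regularizationValue` of the layer's `TangentVector` type; the protocol shape is inferred from the diff above and may differ from the PR's actual code.

```
import TensorFlow

// Rough sketch of the regularization approach under discussion; the protocol
// shape is an assumption based on the diff above, not the PR's exact code.
public protocol Regularizable: Differentiable {
    /// The portion of this value that a weight-decay term should penalize.
    var regularizationValue: TangentVector { get }
}

extension Dense: Regularizable {
    public var regularizationValue: TangentVector {
        // Penalize the weight matrix but not the bias. This needs the memberwise
        // `TangentVector` initializer, which is currently internal (TF-1077).
        TangentVector(weight: weight, bias: Tensor(Scalar(0)))
    }
}
```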
/// The URL where this pre-trained model can be downloaded from.
public var url: URL {
    let bertPrefix = "https://storage.googleapis.com/bert_models/2018_"
    let robertaPrefix = "https://www.dropbox.com/s"
So that you don't have to store these large weights in Dropbox, I've re-uploaded them to a hosted GCS bucket we've created for weights / datasets:
https://storage.googleapis.com/s4tf-hosted-binaries/checkpoints/Text/RoBERTa/base.zip
https://storage.googleapis.com/s4tf-hosted-binaries/checkpoints/Text/RoBERTa/large.zip
If there are others you'd like me to place there, just let me know and I'll add them.
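For illustration, a hypothetical sketch of pointing a `url` property at the hosted bucket; `RoBERTaVariant` and its layout here are made up, not the PR's actual API.

```
import Foundation

// Hypothetical sketch: computing download URLs from the hosted GCS bucket
// instead of Dropbox. `RoBERTaVariant` and its `url` property are illustrative.
enum RoBERTaVariant: String {
    case base
    case large

    /// The URL where this pre-trained model can be downloaded from.
    var url: URL {
        let prefix = "https://storage.googleapis.com/s4tf-hosted-binaries/checkpoints/Text/RoBERTa"
        return URL(string: "\(prefix)/\(rawValue).zip")!
    }
}

// Example: RoBERTaVariant.large.url points at .../RoBERTa/large.zip.
```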
@BradLarson would this bucket also be useful for adding support for pre-trained models and their weights?
@Shashi456 - Yes, that's one of the goals, in addition to being a reliable backup for sometimes-flaky dataset download locations. I'm working on a quick addition to the checkpoint loaders to make it easy to download from here, so that we simplify the process of working with models that need pretrained checkpoints (the existing Transformer and MiniGo models, as well as BERT and family) and start CI testing inference accuracy using real pretrained models.
Now that the TangentVector visibility issues have been resolved, is there anything else blocking the core model? If so, I'd love to see what we could do to resolve that. Beyond the core model, do you have an example of BERT in action, or a unit test we can use to verify correct operation? If those are too tied up in your internal infrastructure, we could potentially pull this in and I could add a demo and/or tests as a follow-on, but simple demo code from you would really ease the process of pulling this in. We can merge the existing Transformer demo with the new Transformer model and utility functions you have here to migrate our text generation demo over to this new structure, but I didn't know if there was a similar demo available for BERT.
…functions. Change `@differentiable` function default arguments from closures to function references. Related issues:

- https://bugs.swift.org/browse/TF-690
- https://bugs.swift.org/browse/TF-1030
Fix non-differentiability error:

```
swift-models/Models/Text/BERT.swift:292:6: error: function is not differentiable
@differentiable(wrt: self)
~^~~~~~~~~~~~~~~~~~~~~~~~~
swift-models/Models/Text/BERT.swift:293:17: note: when differentiating this function definition
public func callAsFunction(_ input: TextBatch) -> Tensor<Scalar> {
                ^
swift-models/Models/Text/BERT.swift:299:58: note: cannot differentiate through 'inout' arguments
let positionPaddingIndex = withoutDerivative(at: { () -> Int in
                                                          ^
```

By using `withoutDerivative(at:)` at the correct location.
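As a minimal sketch of the pattern behind this fix (a toy model, not the BERT code itself), the non-differentiable integer bookkeeping is wrapped in `withoutDerivative(at:)` so the differentiator skips it:

```
import TensorFlow

// Toy model illustrating the pattern behind the fix above: the integer index
// computation involves local mutation, so it is wrapped in
// `withoutDerivative(at:)` and excluded from differentiation.
struct ToyEncoder: Differentiable {
    var dense = Dense<Float>(inputSize: 4, outputSize: 4)

    @differentiable(wrt: self)
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // Non-differentiable bookkeeping (a stand-in for the position padding index).
        let paddingIndex = withoutDerivative(at: input) { input -> Int in
            var index = 0
            index += input.shape[1]  // mutation the differentiator must not trace
            return index
        }
        // Only the tensor computation below participates in differentiation.
        return dense(input) * Float(paddingIndex)
    }
}
```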
Add code and data utilities for the CoLA task. Code shared by eaplatanios@. Original sources are listed in comments at the top of each file. This is progress towards end-to-end BERT training. Todo: implement a main function with data loading and training loop.
The BERT for CoLA training loop compiles: https://i.imgur.com/5KyewAg.png

Todo:
- Fine-tune training so that loss decreases.
- Generalize dataset utilities to work with the CoLA remote URL.
Loss still does not steadily decrease:

```
[Epoch: 0] Loss: 0.50369537
[Epoch: 1] Loss: 0.7813513
[Epoch: 2] Loss: 1.0023696
[Epoch: 3] Loss: 0.8235911
[Epoch: 4] Loss: 0.621686
[Epoch: 5] Loss: 0.93954027
[Epoch: 6] Loss: 0.76672614
[Epoch: 7] Loss: 0.45236698
[Epoch: 8] Loss: 0.6538984
[Epoch: 9] Loss: 0.7307098
[Epoch: 10] Loss: 0.90539706
[Epoch: 11] Loss: 0.6684798
[Epoch: 12] Loss: 0.5408703
[Epoch: 13] Loss: 1.113673
```
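Since this commit mentions gradient clipping, here is a hedged sketch of clip-by-global-norm on a single gradient tensor; the actual commit clips the full model gradient, and the threshold below is arbitrary.

```
import TensorFlow

// Sketch of clip-by-global-norm on a single gradient tensor. The actual change
// applies clipping to the whole model gradient; the threshold here is arbitrary.
func clippedByGlobalNorm(_ gradient: Tensor<Float>, clipNorm: Float) -> Tensor<Float> {
    // Global norm of the gradient: sqrt(sum of squared entries).
    let globalNorm = sqrt((gradient * gradient).sum()).scalarized()
    // Scale down only when the norm exceeds the threshold.
    let scale = min(1, clipNorm / max(globalNorm, 1e-6))
    return gradient * scale
}

// Example: a random gradient clipped so its norm is at most 1.
let gradient = Tensor<Float>(randomNormal: [128])
let clipped = clippedByGlobalNorm(gradient, clipNorm: 1)
```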
The training loop iterates over minibatches rather than full passes over the dataset, so "step" is the correct term, not "epoch".
Evaluation reveals the model is not actually learning:

```
True positives: 0
True negatives: 322
False positives: 0
False negatives: 322
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```

We ought to debug the loss function and BERT classifier class count.
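For reference, a small sketch of the standard Matthews correlation coefficient computation that this metric refers to; the function name and signature are illustrative, not the repository's evaluation code.

```
/// Standard Matthews correlation coefficient from a binary confusion matrix.
/// Illustrative only; the repository's evaluation code may differ.
func matthewsCorrelationCoefficient(
    truePositives tp: Double,
    trueNegatives tn: Double,
    falsePositives fp: Double,
    falseNegatives fn: Double
) -> Double {
    let numerator = tp * tn - fp * fn
    let denominator = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)).squareRoot()
    // When the model predicts only one class (as above), the denominator is 0;
    // treat that degenerate case as an MCC of 0.
    return denominator == 0 ? 0 : numerator / denominator
}

// With the counts above (tp = 0, tn = 322, fp = 0, fn = 322) this returns 0.0.
```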
Change class count to 2 and use softmax cross entropy. The evaluation metric now improves but sometimes decreases back to zero. The model isn't very stable; perhaps there's more room for improvement.

After 80 steps:

```
True positives: 567
True negatives: 170
False positives: 152
False negatives: 170
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.3192948
```

After 130 steps:

```
True positives: 717
True negatives: 0
False positives: 322
False negatives: 0
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```
Improvement todo: make training loop print epochs.
Fix various issues:

1. The sigmoid cross entropy loss was applied to logits of shape `[B, 1]` and labels of shape `[B]`. This forced a silent broadcast of the logits to shape `[B, B]`, which resulted in the loss not being informative for training (illustrated in the sketch below).
2. The batch size was too small. I added a comment in the main script explaining how batching works in my data pipelines.
3. A minor one: there was a bug in how I was copying the prefetching iterator. As a temporary solution, I disabled prefetching for the dev and test sets so that the dev set is the same across runs.

This is not currently tuned, but it's working: after ~20 steps the MCC should be at about 0.28, and after ~200 steps it should be getting close to 0.50.
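To illustrate item 1, a small sketch of the shape fix using made-up batch data rather than the CoLA pipeline:

```
import TensorFlow

// Illustration of the loss-shape fix in item 1 above, with made-up data.
// With logits of shape [B, 1] and labels of shape [B], element-wise sigmoid
// cross entropy silently broadcasts to [B, B]; using two-class logits of shape
// [B, 2] with integer labels of shape [B] and softmax cross entropy avoids that.
let batchSize = 4
let labels = Tensor<Int32>([0, 1, 1, 0])                  // shape [B]
let logits = Tensor<Float>(randomNormal: [batchSize, 2])  // shape [B, 2]
let loss = softmaxCrossEntropy(logits: logits, labels: labels)
print("Loss:", loss)
```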
- Remove extraneous comment.
- Remove trailing whitespace.
- Change `dump` to `print`.
I merged the
There are a bunch of todos:
$ swift run TextModels
...
[Step: 41] Loss: 0.3939771
[Step: 42] Loss: 0.7205323
[Step: 43] Loss: 0.4848549
[Step: 44] Loss: 0.2727359
[Step: 45] Loss: 0.3460037
[Step: 46] Loss: 0.9574833
[Step: 47] Loss: 0.7397041
[Step: 48] Loss: 0.35861257
[Step: 49] Loss: 0.4990799
[Step: 50] Loss: 0.92075384
Evaluate BERT for the CoLA task:
Total predictions: 1043
True positives: 719
True negatives: 34
False positives: 288
False negatives: 34
["matthewsCorrelationCoefficient": 0.26019016] If there are no objections, now seems like a good point to merge this PR and continue incremental improvements. Thanks @eaplatanios for driving this work! |
In order to enable further work on BERT and other text models, I'm going to merge this in and we'll continue our work on this with the model inside the repository. I've created a tracking issue for the remaining to-do's that Dan has identified above, and we'll work to knock those down in the near term.
Regarding the failing CI test, we're trying out some CMake-related build options and those are currently failing. I've built this locally, so I'm going to bring this in even without a green Kokoro build.
Once again, this is fantastic work, and everyone really appreciates the time and effort you put into building this, as well as your guidance on how to use and improve it. This is going to be tremendously useful to us and to the broader community.
Thanks @dan-zheng and @BradLarson for getting this in! I'm sorry I haven't had too much time lately to address some of the mentioned todos.
* Added initial support for BERT.
* Renamed 'LayerNormalization' to 'LayerNorm'.
* Added a 'TextModels' SwiftPM target.
* Fixed some of the compilation errors.
* Added 'Optimizer' protocol.
* Removed 'truncatedNormalInitializer'.
* Minor cleanup.
* Change `@differentiable` function default arguments from closures to functions.
* Fix non-differentiability error using `withoutDerivative(at:)`.
* Add code for CoLA task.
* Add working main function.
* Tune learning rate schedule, add gradient clipping.
* Made some minor edits to get the BERT classifier training to work for CoLA. (tensorflow#293)
* Rename "epoch" to "step" in training loop.
* Add CoLA evaluation.
* Fix BERT training.
* Make training loop an infinite loop.
* Fixed BERT. (tensorflow#294)
* Minor edits.
* Temporarily disabled bucketing.
* Delete extraneous file.

Co-authored-by: Dan Zheng <[email protected]>
This PR adds support for BERT. It is not ready to merge yet.

I believe that the following features should be moved to swift-apis, if people agree:

- An `Embedding` layer that ought to support using matrix multiplications instead of gather ops for faster execution on TPUs (this is currently commented out due to an AutoDiff bug). This can replace the existing layer in swift-apis and should be backwards compatible.
- A `MultiHeadAttention` layer that can be used for other models too (other than BERT or Transformers).
- A `Regularizable` protocol that I don't really like, but it's a temporary solution for supporting weight decay in BERT. We probably shouldn't move this to swift-apis until we have thought out a nicer solution.
- An `Optimizer` protocol that supports learning rate schedules and a `WeightDecayedAdam` implementation.
- `ScheduledParameter`, which corresponds to "Added initial support for learning rate schedules" (swift-apis#431); a sketch of the scheduling idea appears at the end of this description.

The following features are also added:
Note that for RoBERTa I had to convert the published PyTorch checkpoints to TF checkpoints compatible with this model. I have uploaded these to my Dropbox account, but they take too much space, so I would really appreciate it if we could move them to Google Cloud Storage.
This is not ready to merge yet; rather, it is aimed at getting feedback to improve the API and at making sure that the code style is compatible with this repository. A big open question is: how do we want to test this? I know it's working fine, since I'm using it for my own research projects, but I don't know what kind of tests would be appropriate.
Sorry if I missed something in this list of changes. I'll keep updating it as we refine the PR.
NOTE: The new layers are not generic over the `Scalar` type due to a compiler bug (TF-427).

NOTE: Compilation is currently failing because the synthesized `TangentVector` initializers are internal and I cannot declare conformances to `Regularizable` without using those initializers.

cc @saeta @dan-zheng @BradLarson
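As referenced in the feature list above, a hedged sketch of the scheduled-parameter idea; the protocol and type names are illustrative, not the PR's actual `ScheduledParameter` API.

```
// Hedged sketch of the scheduled-parameter idea; the protocol and type names
// here are illustrative, not the PR's actual `ScheduledParameter` API.
protocol ParameterSchedule {
    associatedtype Scalar: BinaryFloatingPoint
    /// The parameter value to use at the given training step.
    func callAsFunction(forStep step: UInt64) -> Scalar
}

/// Linear warm-up from 0 to `endValue` over `warmUpSteps`, constant afterwards.
struct LinearWarmUp<Scalar: BinaryFloatingPoint>: ParameterSchedule {
    var endValue: Scalar
    var warmUpSteps: UInt64

    func callAsFunction(forStep step: UInt64) -> Scalar {
        if step >= warmUpSteps { return endValue }
        return endValue * Scalar(step) / Scalar(warmUpSteps)
    }
}

// Usage: query the schedule each step and feed the result to the optimizer,
// e.g. `optimizer.learningRate = schedule(forStep: step)`.
let schedule = LinearWarmUp<Float>(endValue: 2e-5, warmUpSteps: 100)
print(schedule(forStep: 50))  // 1e-5, halfway through warm-up
```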