This repository was archived by the owner on Apr 23, 2025. It is now read-only.

[WIP] Added support for BERT. #231

Merged 29 commits on Feb 14, 2020

Conversation

@eaplatanios (Contributor) commented Nov 26, 2019

This PR adds support for BERT. It is not ready to merge yet.

I believe that the following features should be moved to swift-apis if people agree:

  • A new Embedding layer that supports using matrix multiplications instead of gather ops for faster execution on TPUs (this is currently commented out due to an AutoDiff bug). It can replace the existing layer in swift-apis and should be backwards compatible.
  • A MultiHeadAttention layer that can be used for other models too (other than BERT or Transformers).
  • A Regularizable protocol that I don't really like, but it's a temporary solution for supporting weight decay in BERT. We probably shouldn't move this to swift-apis until we have thought out a nicer solution.
  • A new Optimizer protocol that supports learning rate schedules and a WeightDecayedAdam implementation.
  • A ScheduledParameter that corresponds to swift-apis#431 ("Added initial support for learning rate schedules").
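The learning-rate-schedule idea behind `ScheduledParameter` can be sketched in plain Swift. This is an illustrative stand-in, not the actual swift-apis#431 API; the type and member names here are hypothetical.

```swift
// Hypothetical sketch of a learning rate schedule: linear warmup to a peak
// rate, then linear decay to zero. Names are illustrative only.
struct LinearWarmupSchedule {
    let baseRate: Float   // peak learning rate reached at the end of warmup
    let warmupSteps: Int  // steps over which the rate ramps up linearly
    let totalSteps: Int   // step at which the rate has decayed to zero

    // Returns the learning rate to use at a given training step.
    func callAsFunction(step: Int) -> Float {
        if step < warmupSteps {
            return baseRate * Float(step + 1) / Float(warmupSteps)
        }
        let remaining = Float(totalSteps - step) / Float(totalSteps - warmupSteps)
        return baseRate * max(0, remaining)
    }
}

let schedule = LinearWarmupSchedule(baseRate: 0.01, warmupSteps: 10, totalSteps: 100)
print(schedule(step: 4))   // mid-warmup: 0.01 * 5/10
print(schedule(step: 9))   // end of warmup: the peak rate
print(schedule(step: 100)) // fully decayed: 0.0
```

An optimizer conforming to the proposed protocol would query such a schedule once per step instead of holding a fixed scalar learning rate.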

The following features are also added:

  • Multiple text tokenization approaches, including a byte-pair encoding (BPE) tokenizer as well as a WordPiece tokenizer.
  • A Transformer implementation.
  • A BERT implementation that supports multiple variants (i.e., BERT, RoBERTa, and ALBERT).
  • Support for automatically downloading and loading pre-trained models for all these variants.
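For a flavor of what the WordPiece tokenizer does, here is a minimal greedy longest-match-first sketch. It is a toy version for a single lowercase word, not the PR's actual implementation (which also handles word-length caps, casing, and punctuation); the `##` continuation convention follows the original BERT tokenizer.

```swift
// Greedy longest-match-first WordPiece tokenization (toy version).
// Pieces that continue a word are prefixed with "##", as in BERT.
func wordPiece(_ word: String, vocab: Set<String>, unknown: String = "[UNK]") -> [String] {
    var pieces: [String] = []
    var start = word.startIndex
    while start < word.endIndex {
        var end = word.endIndex
        var match: String? = nil
        // Try the longest remaining substring first, shrinking until a
        // vocabulary entry matches.
        while start < end {
            var candidate = String(word[start..<end])
            if start != word.startIndex { candidate = "##" + candidate }
            if vocab.contains(candidate) { match = candidate; break }
            end = word.index(before: end)
        }
        guard let piece = match else { return [unknown] }
        pieces.append(piece)
        start = end
    }
    return pieces
}

let vocab: Set<String> = ["un", "##aff", "##able", "aff"]
print(wordPiece("unaffable", vocab: vocab)) // ["un", "##aff", "##able"]
```

The byte-pair-encoding tokenizer used for the RoBERTa variant follows a different merge-based algorithm, but the greedy subword decomposition above is the core of the WordPiece path.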

Note that for RoBERTa I had to convert the published PyTorch checkpoints to TF checkpoints compatible with this model. I have uploaded these to my Dropbox account, but they take up too much space, so I would really appreciate it if we could move them to Google Cloud Storage.

This is not ready to merge yet; rather, it is aimed at getting feedback to improve the API and making sure the code style is compatible with this repository. A big open question is how we want to test this. I know it's working fine, since I'm using it for my own research projects, but I don't know what kinds of tests would be appropriate.

Sorry if I missed something in this list of changes. I'll keep updating it as we refine the PR.

NOTE: The new layers are not generic over the Scalar type due to a compiler bug (TF-427).

NOTE: Compilation is currently failing because the synthesized TangentVector initializers are internal and I cannot declare conformances to Regularizable without using those initializers.
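The `Regularizable` shape being discussed can be illustrated in plain Swift. Everything below is a toy stand-in (not the real swift-apis `Dense` or its synthesized `TangentVector`); a hand-written public tangent type sidesteps the internal-initializer problem described above.

```swift
// Illustrative sketch: a layer exposes the portion of its tangent vector that
// weight decay should apply to (typically weights, but not biases).
protocol Regularizable {
    associatedtype TangentVector
    var regularizationValue: TangentVector { get }
}

struct ToyDense {
    var weight: [Float]
    var bias: [Float]

    // A hand-written tangent type with an accessible memberwise initializer,
    // unlike the synthesized `TangentVector` whose initializer is internal.
    struct TangentVector {
        var weight: [Float]
        var bias: [Float]
    }
}

extension ToyDense: Regularizable {
    var regularizationValue: TangentVector {
        // Decay the weights, but leave the bias untouched (zeros).
        TangentVector(weight: weight, bias: Array(repeating: 0, count: bias.count))
    }
}

let layer = ToyDense(weight: [0.5, -0.25], bias: [0.1])
print(layer.regularizationValue.weight)
print(layer.regularizationValue.bias)
```

A weight-decayed optimizer can then subtract `learningRate * decay * regularizationValue` from the parameters without decaying the biases.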

cc @saeta @dan-zheng @BradLarson

@eaplatanios eaplatanios changed the title Added support for BERT. [WIP] Added support for BERT. Nov 26, 2019
@Shashi456 (Contributor)

This is absolutely huge @eaplatanios. I've been working on BERT for the past few months and can't wait to check this code out :).

@Shashi456 (Contributor)

@eaplatanios What do you think about restructuring the encoding techniques under a preprocessing directory, either in this repo or in swift-apis? These techniques are used everywhere.
I'm also suggesting a preprocessing directory because, down the line, if we add image preprocessing techniques (including augmentation), this structure would benefit those as well.
cc: @BradLarson @saeta

@Shashi456 (Contributor)

I think we could take some inspiration on how to test these models from pytorch-transformers. As @saeta mentioned in the design meeting, they just load the model and check that the tensor shapes are right. You can take a look at those tests here.

@BradLarson (Contributor)

I'd like to see what can be done to help drive this to completion. I forget where we left things after the community meeting, but is the biggest blocker right now the compilation errors around TangentVector and the activation functions in Attention.swift?

If there are areas you'd like me to look into, I'd be glad to do so.

@eaplatanios (Contributor, Author)

Thanks @BradLarson! The only remaining blocker is the TangentVector initializer. The attention-related issues have to do with supporting generic BERT layers (currently they're Float-only).

@eaplatanios (Contributor, Author)

Please also let me know what you guys think should be pushed to swift-apis and what kind of restructuring and testing would be useful. :)


extension Dense: Regularizable {
    public var regularizationValue: TangentVector {
        // TODO: This initializer is currently internal.
@dan-zheng (Member) commented:

I filed TF-1077 to track the non-public TangentVector memberwise initializer issue.

@eaplatanios (Contributor, Author) replied:

That's great! Thanks a lot, Dan! Also, what do you think of this solution for regularization? I don't really like it, but I couldn't think of another easy way to support it. :/

/// The URL where this pre-trained model can be downloaded from.
public var url: URL {
    let bertPrefix = "https://storage.googleapis.com/bert_models/2018_"
    let robertaPrefix = "https://www.dropbox.com/s"
@BradLarson (Contributor) commented:

So that you don't have to store these large weights in Dropbox, I've re-uploaded them to a hosted GCS bucket we've created for weights / datasets:

https://storage.googleapis.com/s4tf-hosted-binaries/checkpoints/Text/RoBERTa/base.zip
https://storage.googleapis.com/s4tf-hosted-binaries/checkpoints/Text/RoBERTa/large.zip

If there are others you'd like me to place there, just let me know and I'll add them.

@Shashi456 (Contributor) commented:

@BradLarson would this bucket also be useful for adding support for pre-trained models and their weights?

@BradLarson (Contributor) commented Jan 6, 2020:

@Shashi456 - Yes, that's one of the goals, in addition to being a reliable backup for sometimes-flaky dataset download locations. I'm working on a quick addition to the checkpoint loaders to make it easy to download from here, so that we simplify the process of working with models that need pretrained checkpoints (the existing Transformer and MiniGo models, as well as BERT and family) and start CI testing inference accuracy using real pretrained models.

@BradLarson (Contributor)

Now that the TangentVector visibility issues have been resolved, is there anything else blocking the core model? If so, I'd love to see what we could do to resolve that.

Beyond the core model, do you have an example of BERT in action or a unit test of it that we can use to verify correct operation? If those are too tied up in any internal infrastructure you have, we could potentially pull this in and I could add a demo and / or tests as a follow-on, but if you have simple demo code for this that would really ease the process of pulling this in.

We can merge together the existing Transformer demo and the new Transformer model and utility functions you have here to migrate our text generation demo over to this new structure, but I didn't know whether you had a similar demo available for BERT.

eaplatanios and others added 11 commits January 28, 2020 14:05
Change `@differentiable` function default arguments from closures to function references.

Related issues:
- https://bugs.swift.org/browse/TF-690
- https://bugs.swift.org/browse/TF-1030
Fix non-differentiability error:
```
swift-models/Models/Text/BERT.swift:292:6: error: function is not differentiable
    @differentiable(wrt: self)
    ~^~~~~~~~~~~~~~~~~~~~~~~~~
swift-models/Models/Text/BERT.swift:293:17: note: when differentiating this function definition
    public func callAsFunction(_ input: TextBatch) -> Tensor<Scalar> {
                ^
swift-models/Models/Text/BERT.swift:299:58: note: cannot differentiate through 'inout' arguments
        let positionPaddingIndex = withoutDerivative(at: { () -> Int in
                                                         ^
```

By using `withoutDerivative(at:)` at the correct location.
Add code and data utilities for the CoLA task.

Code shared by eaplatanios@.
Original sources are listed in comments at the top of each file.

This is progress towards end-to-end BERT training.
Todo: implement a main function with data loading and training loop.
The BERT for CoLA training loop compiles:
https://i.imgur.com/5KyewAg.png

Todo:
- Fine-tune training so that loss decreases.
- Generalize dataset utilities to work with CoLA remote URL.
dan-zheng and others added 12 commits January 29, 2020 21:08
Loss still does not steadily decrease:
```
[Epoch: 0]	Loss: 0.50369537
[Epoch: 1]	Loss: 0.7813513
[Epoch: 2]	Loss: 1.0023696
[Epoch: 3]	Loss: 0.8235911
[Epoch: 4]	Loss: 0.621686
[Epoch: 5]	Loss: 0.93954027
[Epoch: 6]	Loss: 0.76672614
[Epoch: 7]	Loss: 0.45236698
[Epoch: 8]	Loss: 0.6538984
[Epoch: 9]	Loss: 0.7307098
[Epoch: 10]	Loss: 0.90539706
[Epoch: 11]	Loss: 0.6684798
[Epoch: 12]	Loss: 0.5408703
[Epoch: 13]	Loss: 1.113673
```
The training loop operates over minibatches, not full passes over the dataset.
Thus, "step" is the correct term, not "epoch".
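The step/epoch distinction in arithmetic form: one step processes one minibatch, and one epoch is a full pass over the dataset. The sizes below are illustrative, not the actual CoLA configuration.

```swift
// Number of optimizer steps in one epoch, rounding up so that a final
// partial batch still counts as a step.
func stepsPerEpoch(datasetSize: Int, batchSize: Int) -> Int {
    (datasetSize + batchSize - 1) / batchSize
}

print(stepsPerEpoch(datasetSize: 1000, batchSize: 32)) // 32 steps per epoch
```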
Evaluation reveals the model is not actually learning:
```
True positives: 0
True negatives: 322
False positives: 0
False negatives: 322
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```

We ought to debug the loss function and BERT classifier class count.
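The metric printed above can be computed directly from the confusion-matrix counts. This helper is a hypothetical sketch, not the repository's actual metric code; note that when a factor of the denominator is zero (here, no positive predictions at all: TP = 0 and FP = 0), MCC is conventionally defined as 0, which matches the degenerate result shown.

```swift
// Matthews correlation coefficient from confusion-matrix counts.
// Ranges from -1 (total disagreement) to +1 (perfect prediction);
// 0 means no better than chance.
func matthewsCorrelationCoefficient(tp: Int, tn: Int, fp: Int, fn: Int) -> Double {
    let numerator = Double(tp * tn - fp * fn)
    let denominator = (Double(tp + fp) * Double(tp + fn)
                     * Double(tn + fp) * Double(tn + fn)).squareRoot()
    // By convention, MCC is 0 when any marginal count is zero.
    return denominator == 0 ? 0 : numerator / denominator
}

print(matthewsCorrelationCoefficient(tp: 0, tn: 322, fp: 0, fn: 322)) // 0.0
print(matthewsCorrelationCoefficient(tp: 50, tn: 50, fp: 0, fn: 0))   // 1.0
```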
Change class count to 2 and use softmax cross entropy.

The evaluation metric now improves but sometimes decreases back to zero.
The model isn't very stable; perhaps there's more room for improvement.

After 80 steps:
```
True positives: 567
True negatives: 170
False positives: 152
False negatives: 170
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.3192948
```

After 130 steps:
```
True positives: 717
True negatives: 0
False positives: 322
False negatives: 0
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```
Improvement todo: make training loop print epochs.
Fix various issues:

1.  The sigmoid cross entropy loss was applied on logits of shape `[B, 1]` and
    labels of shape `[B]`. This forced a silent broadcast of logits to shape
    `[B, B]`, which resulted in the loss not being informative for training.

2.  The batch size was too small. I added a comment in the main script code
    explaining how batching works in my data pipelines.

3.  This is a minor one, but there was a bug with how I was copying the
    prefetching iterator. As a temporary solution, I disabled prefetching for
    the dev and test sets so that the dev set is the same across runs.

This is not currently tuned but it's working. After ~20 steps MCC should be at
about 0.28 and after ~200 steps it should be getting close to 0.50.
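The silent-broadcast bug in point 1 above, reduced to shape arithmetic: NumPy/TensorFlow-style broadcasting aligns shapes from the right and stretches size-1 dimensions, so logits of shape `[B, 1]` against labels of shape `[B]` quietly become `[B, B]` instead of failing loudly. A minimal sketch of the rule (illustrative, not the library's actual shape machinery):

```swift
// Computes the broadcast shape of two tensor shapes, or nil if they are
// incompatible. Dimensions are compared right-to-left; a missing or size-1
// dimension is stretched to match the other.
func broadcastShape(_ a: [Int], _ b: [Int]) -> [Int]? {
    let rank = max(a.count, b.count)
    var result: [Int] = []
    for i in 0..<rank {
        let da = i < a.count ? a[a.count - 1 - i] : 1
        let db = i < b.count ? b[b.count - 1 - i] : 1
        guard da == db || da == 1 || db == 1 else { return nil }
        result.append(Swift.max(da, db))
    }
    return Array(result.reversed())
}

let batch = 4
print(broadcastShape([batch, 1], [batch]) ?? []) // [4, 4]: the silent blowup
print(broadcastShape([batch], [batch]) ?? [])    // [4]: the intended shape
```

Squeezing the logits to shape `[B]` (or expanding the labels to `[B, 1]`) before applying the loss avoids the mismatch.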
- Remove extraneous comment.
- Remove trailing whitespace.
- Change `dump` to `print`.
@dan-zheng (Member)

I merged the tensorflow:bert-wip branch into eaplatanios:bert. This includes the following changes:

  • Some merge conflicts were fixed.
  • Some differentiation-related compilation errors were fixed.
  • A training loop for BERT was added in Models/Text/main.swift.

There are a bunch of todos:

  • Rewrite utilities for downloading/extracting data in this PR using unified ModelSupport APIs.
  • Improve code organization.
    • Currently, most BERT code lives at the top-level in Models/Text. Models/Text could be better organized, like Models/ImageClassification.
  • Verify that BERT training converges, and that results match a reference implementation.
  • Verify that other BERT variants work.
    • BERT variants like RoBERTa and ALBERT were added in this PR but are untested.

swift run TextModels works with the latest Swift for TensorFlow toolchains:

$ swift run TextModels
...
[Step: 41]	Loss: 0.3939771
[Step: 42]	Loss: 0.7205323
[Step: 43]	Loss: 0.4848549
[Step: 44]	Loss: 0.2727359
[Step: 45]	Loss: 0.3460037
[Step: 46]	Loss: 0.9574833
[Step: 47]	Loss: 0.7397041
[Step: 48]	Loss: 0.35861257
[Step: 49]	Loss: 0.4990799
[Step: 50]	Loss: 0.92075384
Evaluate BERT for the CoLA task:
Total predictions: 1043
True positives: 719
True negatives: 34
False positives: 288
False negatives: 34
["matthewsCorrelationCoefficient": 0.26019016]

If there are no objections, now seems like a good point to merge this PR and continue incremental improvements. Thanks @eaplatanios for driving this work!

@BradLarson (Contributor) left a comment:

In order to enable further work on BERT and other text models, I'm going to merge this in and we'll continue our work on this with the model inside the repository. I've created a tracking issue for the remaining to-dos that Dan has identified above, and we'll work to knock those down in the near term.

Regarding the failing CI test, we're trying out some CMake-related build options and those are currently failing. I've built this locally, so I'm going to bring this in even without a green Kokoro build.

Once again, this is fantastic work and everyone really appreciates the time and effort you put into building this, as well as your guidance on how to use and improve this. This is going to be tremendously useful to us and to the broader community.

@BradLarson BradLarson merged commit dd3e547 into tensorflow:master Feb 14, 2020
@eaplatanios (Contributor, Author)

Thanks @dan-zheng and @BradLarson for getting this in! I'm sorry I haven't had much time lately to address some of the mentioned todos.

Shashi456 pushed a commit to Shashi456/swift-models that referenced this pull request Feb 20, 2020
* Added initial support for BERT.

* Renamed 'LayerNormalization' to 'LayerNorm'.

* Added a 'TextModels' SwiftPM target.

* Fixed some of the compilation errors.

* Added 'Optimizer' protocol.

* Removed 'truncatedNormalInitializer'.

* Minor cleanup.

* Change `@differentiable` function default arguments from closures to functions.

Change `@differentiable` function default arguments from closures to function
references.

Related issues:
- https://bugs.swift.org/browse/TF-690
- https://bugs.swift.org/browse/TF-1030

* Fix non-differentiability error using `withoutDerivative(at:)`.

Fix non-differentiability error:
```
swift-models/Models/Text/BERT.swift:292:6: error: function is not differentiable
    @differentiable(wrt: self)
    ~^~~~~~~~~~~~~~~~~~~~~~~~~
swift-models/Models/Text/BERT.swift:293:17: note: when differentiating this function definition
    public func callAsFunction(_ input: TextBatch) -> Tensor<Scalar> {
                ^
swift-models/Models/Text/BERT.swift:299:58: note: cannot differentiate through 'inout' arguments
        let positionPaddingIndex = withoutDerivative(at: { () -> Int in
                                                         ^
```

By using `withoutDerivative(at:)` at the correct location.

* Add code for CoLA task.

Add code and data utilities for the CoLA task.

Code shared by eaplatanios@.
Original sources are listed in comments at the top of each file.

This is progress towards end-to-end BERT training.
Todo: implement a main function with data loading and training loop.

* Add working main function.

The BERT for CoLA training loop compiles:
https://i.imgur.com/5KyewAg.png

Todo:
- Fine-tune training so that loss decreases.
- Generalize dataset utilities to work with CoLA remote URL.

* Tune learning rate schedule, add gradient clipping.

Loss still does not steadily decrease:
```
[Epoch: 0]	Loss: 0.50369537
[Epoch: 1]	Loss: 0.7813513
[Epoch: 2]	Loss: 1.0023696
[Epoch: 3]	Loss: 0.8235911
[Epoch: 4]	Loss: 0.621686
[Epoch: 5]	Loss: 0.93954027
[Epoch: 6]	Loss: 0.76672614
[Epoch: 7]	Loss: 0.45236698
[Epoch: 8]	Loss: 0.6538984
[Epoch: 9]	Loss: 0.7307098
[Epoch: 10]	Loss: 0.90539706
[Epoch: 11]	Loss: 0.6684798
[Epoch: 12]	Loss: 0.5408703
[Epoch: 13]	Loss: 1.113673
```

* Made some minor edits to get the BERT classifier training to work for CoLA. (tensorflow#293)

* Rename "epoch" to "step" in training loop.

The training loop operates over minibatches, not full passes over the dataset.
Thus, "step" is the correct term, not "epoch".

* Add CoLA evaluation.

Evaluation reveals the model is not actually learning:
```
True positives: 0
True negatives: 322
False positives: 0
False negatives: 322
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```

We ought to debug the loss function and BERT classifier class count.

* Fix BERT training.

Change class count to 2 and use softmax cross entropy.

The evaluation metric now improves but sometimes decreases back to zero.
The model isn't very stable; perhaps there's more room for improvement.

After 80 steps:
```
True positives: 567
True negatives: 170
False positives: 152
False negatives: 170
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.3192948
```

After 130 steps:
```
True positives: 717
True negatives: 0
False positives: 322
False negatives: 0
▿ 1 key/value pair
  ▿ (2 elements)
    - key: "matthewsCorrelationCoefficient"
    - value: 0.0
```

* Make training loop an infinite loop.

Improvement todo: make training loop print epochs.

* Fixed BERT. (tensorflow#294)

Fix various issues:

1.  The sigmoid cross entropy loss was applied on logits of shape `[B, 1]` and
    labels of shape `[B]`. This forced a silent broadcast of logits to shape
    `[B, B]`, which resulted in the loss not being informative for training.

2.  The batch size was too small. I added a comment in the main script code
    explaining how batching works in my data pipelines.

3.  This is a minor one, but there was a bug with how I was copying the
    prefetching iterator. As a temporary solution, I disabled prefetching for
    the dev and test sets so that the dev set is the same across runs.

This is not currently tuned but it's working. After ~20 steps MCC should be at
about 0.28 and after ~200 steps it should be getting close to 0.50.

* Minor edits.

- Remove extraneous comment.
- Remove trailing whitespace.
- Change `dump` to `print`.

* Temporarily disabled bucketing.

* Delete extraneous file.

Co-authored-by: Dan Zheng <[email protected]>
Shashi456 pushed a commit to Shashi456/swift-models that referenced this pull request Feb 20, 2020