Skip to content

[TM DownloadDirectory Part2] Implement download directory in transfer manager #3010

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

zoewangg
Copy link
Contributor

@zoewangg zoewangg commented Feb 2, 2022

Modifications

Implement download directory in transfer manager

Implementation Notes:

  • ListObjects is performed using DFS. See ListObjectsRecursivelyHelper
  • The SDK sends out individual downloadFileRequest as it iterates the s3 objects returned from ListObjectsV2

### Pending Items:
- Add tests for ListObjectsRecursivelyHelper
- Add more Javadocs

Testing

Added unit tests and integ tests

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have added a changelog entry. Adding a new entry must be accomplished by running the scripts/new-change script and following the instructions. Commit the new file created by the script in .changes/next-release with your changes.
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

@zoewangg zoewangg force-pushed the zoewang/tm-downloadDirectoryPart2 branch from 9f07296 to 994a820 Compare February 3, 2022 03:51
@@ -281,8 +281,11 @@ default DirectoryUpload uploadDirectory(Consumer<UploadDirectoryRequest.Builder>
* bucket will be downloaded.
*
* <p>
* The SDK will create the destination directory if it does not already exist.
Copy link
Contributor

@Bennett-Lynch Bennett-Lynch Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a note for what happens if (some) files already exist?

On that same note, should we provide an option to overwrite-existing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the transfer will fail if the corresponding file already exists. We can support overwrite-existing in the future if there's a ask for it. We'd need to make changes in FileAsyncTransformer to expose that option

Will add a note

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be almost guaranteed that there will be user-demand for this feature on something like download directory. It allows users to treat it like "sync" and it would allow it to behave symmetrical to upload directory (where S3 will not reject overwrites by default).

To be honest, I wish the default behavior were to overwrite-if-exists. I know that's not easily changeable, but I wonder if we should reconsider the default for new features like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we have to add overwrite-if-exists to support pause and resume before GA anyway. We can revisit the default behavior for download directory once we have that.



downloadDirectoryRequest.overrideConfiguration()
.map(DownloadDirectoryOverrideConfiguration::downloadFileRequestTransformer)
Copy link
Contributor

@Bennett-Lynch Bennett-Lynch Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<Bit off topic but want to write down my thoughts so we can capture them somewhere>

Even though it's not an immediately planned feature/concern, I'm trying to think what it would look like to extend this download logic to simulate "sync" logic. E.g.,

A s3 object will require copying if the sizes of the two s3 objects differ, the last modified time of the source is newer than the last modified time of the destination, or the s3 object does not exist under the specified bucket and prefix destination.

https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html#examples

I don't think the current downloadFileRequestTransformer is extensible enough to support something like this. Should we consider something like a BiPredicate<S3Object, Path> that allows users to declare if an individual download should be allowed to execute or not?

If we supported something like that, then in theory, implementing sync-download logic would be as simple as implementing the BiPredicate to replicate the behavior quoted above. And users might be able to benefit from it for other similar use cases, beyond just sync.

It's slightly more complicated for the sync-upload case, because it probably requires performing an extra HEAD request per-object, but sync-download seems easier to achieve merely by extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, downloadFileRequestTransformer works for the case where customers want to add the same configurations or parameters to all downloadFileRequests, but it doesn't support filtering. Something like execution interceptor might be more flexible where they can apply the transformer for certain requests and filter out some requests. I'll remove it for now and create a task for it

@zoewangg zoewangg marked this pull request as ready for review February 8, 2022 05:32
@zoewangg zoewangg requested a review from a team as a code owner February 8, 2022 05:32
log.debug(() -> "s3Object key: " + s3Object.key());
futures.add(downloadSingleFile(downloadDirectoryRequest,
failedFileDownloads,
s3Object));
}).whenComplete((r, t) -> {
if (t != null) {
returnFuture.completeExceptionally(SdkClientException.create("Failed to call ListObjectsV2 ", t));
returnFuture.completeExceptionally(SdkClientException.create("Failed to call ListObjectsV2", t));
cleanUpDirectory(downloadDirectoryRequest.destinationDirectory());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably only call this if the SDK was responsible for creating the directory. If the directory already existed, I don't think we should delete it.

Copy link
Contributor Author

@zoewangg zoewangg Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop, discussed offline and agreed that we should move the logic to create directories to FileAsyncResponseTransformer and FileResponseTransformer. #3018

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 9, 2022

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 20 Code Smells

85.1% 85.1% Coverage
3.4% 3.4% Duplication

@zoewangg zoewangg merged commit 103d2d4 into zoewang/tm-downloadDirectory Feb 9, 2022
zoewangg added a commit that referenced this pull request Feb 15, 2022
* [TM DownloadDirectory Part1] Create POJO classes for download directory (#2993)

* Create POJO classes for download directory

* Address feedback

* Address comments

* A couple of minor refactoring on the S3TransferManager (#2997)

* [TM DownloadDirectory Part2] Implement download directory in transfer manager (#3010)

* Implement download directory in transfer manager

* Add more tests and address comments

* Remove create parent directories logic and add changelog entry

* [TM DownloadDirectory Part3] Various updates on downloadDirectory (#3020)

* Implement download directory in transfer manager

* Add more tests and address comments

* Remove create parent directories logic and add changelog entry

* Various updates on downloadDirectory 1. limit the number of concurrent download file 2. create nonexistent parent directories 3. normalize key if prefix is not empty

* By default, set delimiter to null to avoid potentially excessive listObjectsV2 calls

* Move async buffering logic to SdkPublisher

* Move AsyncBufferingSubscriber back to s3-tranfer-manager module and address feedback

* remove unnecessary isDelivering.set(false)
@zoewangg zoewangg deleted the zoewang/tm-downloadDirectoryPart2 branch February 15, 2022 22:14
aws-sdk-java-automation added a commit that referenced this pull request Apr 30, 2024
…f26d07861

Pull request: release <- staging/a7ad223f-cc6b-4682-a3d9-babf26d07861
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants