Move deabstraction of non-tensorflow convention functions to the TFPartitioning pipeline. #21299

Closed
wants to merge 23 commits

Conversation


@bgogul bgogul commented Dec 13, 2018

This PR changes the invocation of deabstraction and partitioning passes as follows:

Diagnostic Pipeline:

  • Deabstraction of tensorflow-convention functions

Perf Pipeline:

  • Partitioning of tensorflow-convention functions (the next step is to move this to the diagnostic pipeline so that we can have a proper -Onone mode)
  • Deabstraction and partitioning of non-tensorflow-convention functions.

To recover most of the previous behavior with respect to sends and receives for non-tensorflow-convention functions, the following optimizations are explicitly invoked after deabstraction (see Deabstraction::doIt()):

  PredictableMemoryOptimizations()
  PropagateSSAValues()

Specifically, the TFPartition pipeline now looks like this:

SILPassPipelinePlan SILPassPipelinePlan::getTFPartitionPassPipeline() {
  SILPassPipelinePlan P;
  P.startPipeline("TensorFlow Partitioning");
  P.addTFDeabstraction();
  P.addTFPartition();
  return P;
}
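
For reference, a rough sketch of what the diagnostic-pipeline side could look like, reusing the existing addTFDeabstraction() hook; the enclosing function name and the exact insertion point below are assumptions, not part of this PR's diff:

// Sketch only: deabstraction of tensorflow-convention functions is registered
// in the mandatory/diagnostic pipeline, so it also runs at -Onone.
static void addMandatoryTensorFlowPasses(SILPassPipelinePlan &P) {
  P.addTFDeabstraction();
}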

Moreover, the performance inliner was modified to prevent the inlining of certain functions that deabstraction relies on.

Tests:

  • Ignore the regressions in send/recv warnings in the tests, as these warnings are going away.
  • In some tests, the basic blocks got reordered because deabstraction now sees the optimized SILFunction for non-tensorflow-convention functions.


bgogul commented Dec 13, 2018

@swift-ci please test tensorflow

@bgogul bgogul requested review from lattner and mhong December 13, 2018 20:44

@mhong mhong left a comment


A very nice step forward!

namespace tf {

/// A helper class to launch deabstraction on functions.
class TFDeabstractionHelper {

How about calling this class TFDeabstraction? That matches the file name, and "helper" also suggests its scope is more local (like an implementation detail).


Update: after reading the code changes in the cc file, I feel a better option is to move this helper class (it can keep the "helper" name) to the cc file.

The only external user is then isSpecialNoInlineCallee() -- can we move it to TFUtilities.h or some other suitable header file? If not, I'm fine with creating this new header file for it.

Also, given that the helper class seems to be used only in TFDeabstractionPass::run(), should we just define these public APIs in the TFDeabstractionPass class instead and avoid creating a new class? I can be convinced either way.

Contributor Author

Done.


bgogul commented Dec 14, 2018

Please hold off on further review of this PR. The changes I made to the performance inliner (unsurprisingly) cause regressions in the Swift optimizer tests. I am looking into it.


bgogul commented Dec 19, 2018

@swift-ci please test tensorflow


bgogul commented Dec 20, 2018

@swift-ci please test tensorflow

@@ -220,6 +220,14 @@ bool tf::flattenTensorFlowValueAggregate(Type ty,
/// parameter of result contains any TensorFlow value type.
bool TypeContainsTensorFlowValue::containsTensorFlowValue(
Type ty, bool checkHigherOrderFunctions) {
llvm::SmallPtrSet<NominalTypeDecl*, 8> parentDecls;
Contributor Author

I will split this out into a separate CL.

Contributor Author

@bgogul bgogul left a comment

@mhong, I have resolved all the comments and the optimization errors. PTAL.


bgogul commented Dec 20, 2018

@swift-ci please test tensorflow


@mhong mhong left a comment


Left another round of comments. Thanks.

@@ -92,7 +92,13 @@ namespace tf {
bool containsTensorFlowValue(Type ty, bool checkHigherOrderFunctions);

private:
bool structContainsTensorFlowValue(StructDecl *decl);
bool containsTensorFlowValueImpl(

Please document parentDecls -- what it does, and whether we mutate it. The function name suggests this is a predicate and would not have side effects.

IIUC, parentDecls is used as a cache. Should we call it a cache instead of "parent something"?

Contributor Author

No, it is not a cache. It keeps track of the nesting structure so that we can detect recursive data structures.

In any case, ignore the file in this PR. I have opened a separate PR for this: #21449

(I had to include the files here to make sure I can run the tests.)

bool TypeContainsTensorFlowValue::structContainsTensorFlowValue(
StructDecl *decl, llvm::SmallPtrSetImpl<NominalTypeDecl *> &parentDecls) {
// If we have a cycle, break it here.
if (parentDecls.count(decl) > 0) {

Can you give an example of a cycle? I thought structs cannot have such self-references.

Contributor Author

#21449 has an example.
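
For readers following the thread, here is a minimal sketch of the cycle-breaking pattern under discussion. The stored-property traversal and the containsTensorFlowValueImpl call are assumptions based on the declarations quoted above, not the exact code in this PR or in #21449:

bool TypeContainsTensorFlowValue::structContainsTensorFlowValue(
    StructDecl *decl, llvm::SmallPtrSetImpl<NominalTypeDecl *> &parentDecls) {
  // parentDecls tracks the chain of nominal types we are currently nested in.
  // Seeing `decl` again means the aggregate is (indirectly) recursive, so we
  // break the cycle here instead of recursing forever.
  if (parentDecls.count(decl) > 0)
    return false;
  parentDecls.insert(decl);
  bool found = false;
  for (VarDecl *field : decl->getStoredProperties()) {
    // Assumed helper signature; the real overload likely takes extra flags.
    if (containsTensorFlowValueImpl(field->getInterfaceType(), parentDecls)) {
      found = true;
      break;
    }
  }
  parentDecls.erase(decl);  // pop this nesting frame on the way out
  return found;
}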

@@ -154,7 +154,7 @@ namespace {

void promoteToSSA(ArrayRef<AllocStackInst *> allocs);
void prepareStackAllocForPromotion(AllocStackInst *alloc);
void propagateSSAValues();
void propagateSSAValues(SmallVectorImpl<SILInstruction *> &relevantInsts);

Please document relevantInsts. If it is read-only, consider using ArrayRef instead.

Contributor Author

Done.


Would ArrayRef work here?

Also, as a nit: the word "relevant" does not seem to carry much weight. How about just insts?
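
One possible shape for the declaration both comments are pointing at; the parameter name and doc comment are illustrative rather than the final code:

/// Propagate SSA values into their uses so that redundant loads/stores of the
/// given instructions do not survive into partitioning. `insts` is only read
/// here, so an ArrayRef suffices.
void propagateSSAValues(llvm::ArrayRef<SILInstruction *> insts);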

//
// e.g., the inlining of functions (like allocateUninitializedArray) may
// result in the following code snippet in a function:
// %9 = alloc_stack $Int32

Can you provide an example of why we should care about optimizing such scalar (non-tensor) instructions here? Is this for tensorflow-convention functions?


The larger question is: if we start applying more optimizations that used to belong only to the optimization pipeline, could that be a "slippery slope" where we keep adding more such logic to the DA pass over time?

It'd be nice if we could try to make the DA pass compose with (and be complementary to) the other compiler passes as much as possible.

Contributor Author

This specific optimization mainly avoids the introduction of sends and receives by eliminating redundant allocations and creating an SSA value for attribute values. It is actually quite similar to the promoteToSSA function. (Maybe it is sufficient to call promoteToSSA again; let me try it.)

When working on this PR, I simply tried to avoid introducing new sends and receives as much as possible in our tests, and in doing so I identified that these optimizations were necessary. Given that we won't be running the optimization pipeline for tensorflow-convention functions in -Onone mode but still need to be able to extract graphs, we will have to perform some of these transformations here.


For tensorflow-convention functions, I agree we should extend the DA pass to handle this rather than duplicating more optimizer-pipeline code. This is especially true for load/store-related issues, which DA is designed to handle.

Also, do we already have a test example that shows we need more optimization work before we can extract a graph function?

Contributor

I have the same question as Mingsheng. With partitioning running after the performance optimizer, it seems like these sorts of things should already be done. Why is this necessary? What is the bad thing that happens if we don't call into "optimizeMemoryAllocations" here? Is this a phase-ordering issue that can be resolved some other way?

Contributor Author

Why is this necessary?
What is the bad thing that happens if we don't call into "optimizeMemoryAllocations" here?

If we don't do this here, we introduce new sends and receives. This is mostly OK when correctness is not affected, as sending/receiving TensorHandles should be cheap. However, in some cases it is a problem. For example, when operations need to run on a TPU, the partitioner is sometimes unable to determine shapes because of an intervening send or receive.

Here are the technical reasons why it is needed here. We prevent the PerformanceInliner from inlining certain functions (such as array initializers), so function-level optimizations (like optimizeMemoryAllocations) never see the bodies of these functions and have no opportunity to optimize them. When we then inline these functions during deabstraction, code snippets such as the following appear in the function being partitioned:

//     %9 = alloc_stack $Int32
//     store %0 to %9 : $*Int32
//     %11 = load %9 : $*Int32
//     %12 = struct_extract %11 : $Int32, #Int32._value

Because no optimizations run afterwards, the loads and stores are not eliminated, and this causes sends/receives to be introduced.

I think parts of optimizeMemoryAllocations might be subsumed by PromotableMemoryFinder and PropagateSSAValues, but I haven't tried that yet.
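
To make the intended call order concrete, here is a hypothetical wrapper (the name cleanupAfterInlining is invented for illustration; promoteToSSA and propagateSSAValues are the existing helpers shown in the diff above):

// After deabstraction inlines callees such as array initializers, re-promote
// their stack slots and propagate the resulting SSA values so that the
// alloc_stack/store/load pattern above collapses into a direct use and no
// send/receive is introduced.
void Deabstraction::cleanupAfterInlining(
    ArrayRef<AllocStackInst *> promotedAllocs,
    ArrayRef<SILInstruction *> insts) {
  promoteToSSA(promotedAllocs);
  propagateSSAValues(insts);
}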

/// deabstracted. If the flag forceTFFunctions is true, forces partitioning of
/// functions that operate on Tensors even if it would have been rejected
/// otherwise.
bool deabstract(SILFunction &fn, bool forceTFFunctions);

Can deabstract be a private method instead?

Contributor Author

Done.

// - Does this operate on Tensor values?
//
// Helper that returns true if function belongs to TensorFlow module.
auto isTensorFlowFunction = [](SILFunction *func) {

Adding such domain-specific (tensor-specific) knowledge to this "general purpose" mechanism of the performance inliner does not feel ideal. It might prevent us from being able to upstream this patch later.

Is there a way to refactor the code so that the caller/constructor of SILInliner can pass in a callback (or a bit) to decide whether func is eligible for inlining? The TF call site could then pass in a callback that does the "tensor op check" as written here.
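
A sketch of the callback-based refactoring being suggested; the hook type and the way it is threaded through are assumptions, not the actual SILInliner API (requires <functional>):

// The inliner takes a caller-supplied veto predicate, so the TensorFlow
// pipeline can keep its "tensor op check" at the call site instead of baking
// domain knowledge into the general-purpose inliner.
using InlineVetoFn = std::function<bool(SILFunction *callee)>;

bool shouldAttemptInline(SILFunction *callee, const InlineVetoFn &veto) {
  if (veto && veto(callee))
    return false;  // e.g. TF passes a check like isSpecialNoInlineCallee here
  // ...otherwise fall through to the usual profitability heuristics...
  return true;
}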

Contributor Author

Yes, it is not ideal, but I prefer to leave this as is for now.


I feel this is an important issue concerning architecture ("layering") that we'll want to resolve ASAP. If you feel strongly about submitting this patch first, please go ahead, as long as we can follow up and address this one shortly.

Contributor

I agree with Mingsheng that this is a horrible hack, but I'm ok with it to unblock work - so long as there is a known path out of this. What is the plan to resolve this?

Contributor Author

Chris, are you worried about having tensorflow-specific logic here? That is easily fixed by folding this logic into TensorFunctionClassifier::isSpecialNoInlineCallee.

Or are you concerned about the fact that we prevent some inlining from happening? I don't have a good answer for that right now.

@@ -78,7 +78,7 @@ public func weighPetOnlyDefault(pet: Pet) {

// CHECK-LABEL: ---- ANALYSIS STATE FOR FUNCTION {{.*}}testCondBranch
// CHECK: bb0:
// CHECK: [Copy] cond_br {{.*}}, bb1, bb2
// CHECK: [Copy] cond_br {{.*}}, bb2, bb1

Do you know why BB numbers are changing for some tests? (Consider adding the explanation to the PR description.)

Contributor Author

I believe this is because deabstraction now sees the optimized SIL for non-tensorflow-convention functions.

@@ -6,6 +6,7 @@ import TensorFlow

var someGlobal = Tensor<Int32>(1)

// expected-warning @+1 {{value implicitly copied to the host}}

The changes in send/recv warnings are fairly distracting -- should I ignore all of them when reviewing this patch?

Ideally, we could send a patch to remove all such warnings first, as discussed.

Contributor Author

Just ignore these for now. The good thing is that the runtime tests pass, which gives me some confidence that the refactoring is still generating correct code.

(I will send out a separate PR removing these send/receive warnings along with adding an internal counter for debugging/testing purposes.)

@mhong mhong Dec 21, 2018

How do we confirm that this patch has not regressed the overall number of sends/recvs -- would it make sense to introduce some stats-counter infrastructure first?
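
One lightweight option for that counter infrastructure is LLVM's STATISTIC facility, sketched below; the counter name and the place where it is bumped are assumptions:

#include "llvm/ADT/Statistic.h"

#define DEBUG_TYPE "tfpartition"
STATISTIC(NumSendRecvInserted, "Number of sends/receives inserted");

// Wherever the partitioner materializes a host<->accelerator transfer:
//   ++NumSendRecvInserted;
// The totals are printed when LLVM statistics are enabled, so a test or a
// manual run can compare the counts before and after a change like this PR.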

@@ -49,7 +49,8 @@ public func testDevice() {
// should be a single copy-to-host compiler warning.
public func SR8412_CopyToHost() {
for _ in 0...10 {
let x = Tensor(1) // expected-warning {{value implicitly copied to the host}}
// This gets moved outside the loop by the compiler optimizations. So, no warnings.

Tabs are used here and in some other places in this patch.

Contributor Author

Oops...fixed now.

Contributor Author

@bgogul bgogul left a comment

I am investigating the failures in the GPU and Mac runs, but I have addressed all of your comments otherwise.

PTAL.

@mhong mhong left a comment

Left a few more comments on code behavior (verifying no regression in sends/recvs) and readability (the layering design). You can decide which ones to address before vs. after submitting this patch.

return;

TFDeabstractionHelper helper(*this, module);
if (PM->getStageName() == "TensorFlow Partitioning") {
Contributor

Instead of matching on the stage name, it would be cleaner to just have two passes and insert them at the right points in the pass pipeline. The optimizer version of the partitioning pass should be a function pass, and the mandatory version should be a module pass, right?

Contributor Author

Done.

Contributor

@lattner lattner left a comment

I'd really like to understand why hard integration of optimizeMemoryAllocations is required.

Please also split the deabstraction pass into two different passes (calling into a shared implementation).

Thanks!

Contributor Author

@bgogul bgogul left a comment

Given our recent focus on tracing, some of the changes in this PR (e.g., those related to optimizations and sends/receives) could be deferred until later. However, I wanted to reply to the comments and questions while I have everything fresh in my memory.

I'd really like to understand why hard integration of optimizeMemoryAllocations is required.

Please see my reply to your review comment.

Please also split the deabstraction pass into two different passes (calling into a shared implementation).

I have factored out deabstraction as a utility and I am calling into it from partitioning now.
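
For concreteness, a rough sketch of the two-pass structure discussed above; the class names, the isTensorFlowConvention predicate, and the deabstractFunction entry point are assumptions, with both passes calling into the shared deabstraction utility:

// Mandatory (module) pass: deabstracts tensorflow-convention functions in the
// diagnostic pipeline, so graph extraction also works at -Onone.
class TFMandatoryDeabstraction : public SILModuleTransform {
  void run() override {
    for (SILFunction &fn : *getModule())
      if (isTensorFlowConvention(fn))
        deabstractFunction(fn, *PM);
  }
};

// Optimizer (function) pass: handles non-tensorflow-convention functions late,
// from the TFPartitioning pipeline, after the performance optimizer has run.
class TFOptimizerDeabstraction : public SILFunctionTransform {
  void run() override {
    SILFunction *fn = getFunction();
    if (!isTensorFlowConvention(*fn))
      deabstractFunction(*fn, *PM);
  }
};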
