Skip to content

[SystemZ][z/OS] Query if a file is text and set the text flag accordingly #109664

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

abhina-sree
Copy link
Contributor

This patch sets the text flag correctly when opening files by querying the file tag on z/OS. For other platforms, there will be no change in behaviour.

@llvmbot
Copy link
Member

llvmbot commented Sep 23, 2024

@llvm/pr-subscribers-llvm-support

Author: Abhina Sree (abhina-sree)

Changes

This patch sets the text flag correctly when opening files by querying the file tag on z/OS. For other platforms, there will be no change in behaviour.


Full diff: https://github.com/llvm/llvm-project/pull/109664.diff

3 Files Affected:

  • (modified) llvm/include/llvm/Support/AutoConvert.h (+6)
  • (modified) llvm/lib/Support/AutoConvert.cpp (+26)
  • (modified) llvm/lib/Support/VirtualFileSystem.cpp (+9-1)
diff --git a/llvm/include/llvm/Support/AutoConvert.h b/llvm/include/llvm/Support/AutoConvert.h
index 6f45c4683f7775..8ef6d3e8cf973e 100644
--- a/llvm/include/llvm/Support/AutoConvert.h
+++ b/llvm/include/llvm/Support/AutoConvert.h
@@ -52,6 +52,12 @@ std::error_code restorezOSStdHandleAutoConversion(int FD);
 /// \brief Set the tag information for a file descriptor.
 std::error_code setzOSFileTag(int FD, int CCSID, bool Text);
 
+// Get the the tag ccsid for a file name or a file descriptor.
+ErrorOr<__ccsid_t> getzOSFileTag(const char *FileName, const int FD = -1);
+
+// Query the file tag to determine if the file is a text file.
+ErrorOr<bool> iszOSTextFile(const char *Filename, const int FD = -1);
+
 } // namespace llvm
 #endif // __cplusplus
 
diff --git a/llvm/lib/Support/AutoConvert.cpp b/llvm/lib/Support/AutoConvert.cpp
index 66570735f8fc88..06130876c67f05 100644
--- a/llvm/lib/Support/AutoConvert.cpp
+++ b/llvm/lib/Support/AutoConvert.cpp
@@ -116,4 +116,30 @@ std::error_code llvm::setzOSFileTag(int FD, int CCSID, bool Text) {
   return std::error_code();
 }
 
+ErrorOr<__ccsid_t> llvm::getzOSFileTag(const char *FileName, const int FD) {
+  // If we have a file descriptor, use it to find out file tagging. Otherwise we
+  // need to use stat() with the file path.
+  if (FD != -1) {
+    struct f_cnvrt Query = {
+        QUERYCVT, // cvtcmd
+        0,        // pccsid
+        0,        // fccsid
+    };
+    if (fcntl(FD, F_CONTROL_CVT, &Query) == -1)
+      return std::error_code(errno, std::generic_category());
+    return Query.fccsid;
+  }
+  struct stat Attr;
+  if (stat(FileName, &Attr) == -1)
+    return std::error_code(errno, std::generic_category());
+  return Attr.st_tag.ft_ccsid;
+}
+
+ErrorOr<bool> llvm::iszOSTextFile(const char *Filename, const int FD) {
+  ErrorOr<__ccsid_t> Ccsid = getzOSFileTag(Filename, FD);
+  if (std::error_code EC = Ccsid.getError())
+    return EC;
+  return *Ccsid != FT_BINARY;
+}
+
 #endif // __MVS__
diff --git a/llvm/lib/Support/VirtualFileSystem.cpp b/llvm/lib/Support/VirtualFileSystem.cpp
index 928c0b5a24ed65..50aa60363db166 100644
--- a/llvm/lib/Support/VirtualFileSystem.cpp
+++ b/llvm/lib/Support/VirtualFileSystem.cpp
@@ -325,8 +325,16 @@ ErrorOr<Status> RealFileSystem::status(const Twine &Path) {
 ErrorOr<std::unique_ptr<File>>
 RealFileSystem::openFileForRead(const Twine &Name) {
   SmallString<256> RealName, Storage;
+  auto OpenFlags = sys::fs::OF_None;
+#ifdef __MVS__
+  // If the file is tagged with a text ccsid, it may require autoconversion.
+  llvm::ErrorOr<bool> IsFileText =
+      llvm::iszOSTextFile(Name.str().c_str());
+  if (IsFileText && *IsFileText)
+    OpenFlags |= sys::fs::OF_Text;
+#endif
   Expected<file_t> FDOrErr = sys::fs::openNativeFileForRead(
-      adjustPath(Name, Storage), sys::fs::OF_None, &RealName);
+      adjustPath(Name, Storage), OpenFlags, &RealName);
   if (!FDOrErr)
     return errorToErrorCode(FDOrErr.takeError());
   return std::unique_ptr<File>(

Copy link

github-actions bot commented Sep 23, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@zibi2 zibi2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -325,8 +325,15 @@ ErrorOr<Status> RealFileSystem::status(const Twine &Path) {
ErrorOr<std::unique_ptr<File>>
RealFileSystem::openFileForRead(const Twine &Name) {
SmallString<256> RealName, Storage;
auto OpenFlags = sys::fs::OF_None;
#ifdef __MVS__
// If the file is tagged with a text ccsid, it may require autoconversion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so as discussed in #107906 (mostly leaving it for context of future travelers); it's still unclear to me why we need to perform this check only at this point (and not other places that reads files, e.g. deep down the system helper stack).

Can you expand the comments here a little bit? At high level it's not obvious what an autoconversion is and why it's needed. I think mentioning that in here would make it easier for maintenance going forward, since most people that touch this code, won't know about zOS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So automatic conversion is a z/OS specific functionality to help ASCII applications work on z/OS. The native encoding on z/OS is EBCDIC, not ASCII-based, so without this tool enabled, we aren't able to read files correctly. This is some documentation on this feature: https://www.ibm.com/docs/en/zos/3.1.0?topic=pages-enhanced-ascii

There are also other places that do require distinguishing between text and binary files as well, there were many places that were modified in the past for example https://reviews.llvm.org/D99426.
But since the openFileForRead() function mainly deals with reading existing files, so querying the file tag seems sufficient to determining whether its binary or text based instead of passing an IsText parameter based on where it was called from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation!

sorry if I am being dense, but I still don't understand why it has to happen in this layer, rather than in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1035-L1114:

  • it already has access to all of this information
  • it has logic that's already dealing with autoconversions
  • you can check if we're trying to open a file that exists, even better we've already stat'd the file, so you can check if it contains the tag, without performing any extra system calls on the way.

I assume you're already trying to make sure logic in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1098-L1105 triggers by inferring OF_TEXT flag for all callers here.

What I am asking for is, why we're not performing that inference at the layer that's shared for all such filesystem operations, i.e. https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1035-L1114. What breaks/doesn't work?

Copy link
Contributor Author

@abhina-sree abhina-sree Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because previously OF_None was being passed at this point unconditionally. In Path.inc in functions (like openNativeFileForRead, openFile, etc) we do not modify the OpenFlags and respect the flags that were passed. So it would seem wrong to pass OF_None here (which implies a binary file) and then decide it is a text file later on in the code path.

@@ -325,8 +325,15 @@ ErrorOr<Status> RealFileSystem::status(const Twine &Path) {
ErrorOr<std::unique_ptr<File>>
RealFileSystem::openFileForRead(const Twine &Name) {
SmallString<256> RealName, Storage;
auto OpenFlags = sys::fs::OF_None;
#ifdef __MVS__
// If the file is tagged with a text ccsid, it may require autoconversion.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation!

sorry if I am being dense, but I still don't understand why it has to happen in this layer, rather than in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1035-L1114:

  • it already has access to all of this information
  • it has logic that's already dealing with autoconversions
  • you can check if we're trying to open a file that exists, even better we've already stat'd the file, so you can check if it contains the tag, without performing any extra system calls on the way.

I assume you're already trying to make sure logic in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1098-L1105 triggers by inferring OF_TEXT flag for all callers here.

What I am asking for is, why we're not performing that inference at the layer that's shared for all such filesystem operations, i.e. https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Path.inc#L1035-L1114. What breaks/doesn't work?

Copy link
Contributor

@perry-ca perry-ca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use the actual file to determine if it should be read as text or binary. The context of where the file is being read determines if we need to treat it as a text file, not the file being read. I think one example is the #embed directive. The #embed is a binary file inclusion. If the file is a text file we need to embed the original bytes from the file (i.e. as binary). If we check the file, see it is text and then do a code page conversion when reading the file as text, we will include the wrong values into the program. The call to open the file has to say if it is text or binary.

@perry-ca
Copy link
Contributor

Looks like this will also fail with header map files. These are binary files and don't have any file tags. The compiler will read them as EBCDIC instead as binary.

@abhina-sree
Copy link
Contributor Author

I've also noticed that the lit tests that are testing #embed fail with this solution as well, but pass with the previous solution. So I think this points to needing an IsText flag as well

@abhina-sree
Copy link
Contributor Author

Closing this PR as there were some bugs related with this implementation

@abhina-sree abhina-sree closed this Oct 1, 2024
@abhina-sree abhina-sree deleted the abhina/set_text_flag_correctly branch June 5, 2025 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants