Skip to content

[libc] Change the puts implementation on the GPU #67189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 25, 2023
Merged

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Sep 22, 2023

Summary:
Normally, the implementation of puts simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn printf calls into
puts so it's a good idea to make this produce the expected output.

The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.

Summary:
Normally, the implementation of `puts` simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn `printf` calls into
`puts` so it's a good idea to make this produce the expected output.

The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.
@llvmbot
Copy link
Member

llvmbot commented Sep 22, 2023

@llvm/pr-subscribers-libc

Changes

Summary:
Normally, the implementation of puts simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn printf calls into
puts so it's a good idea to make this produce the expected output.

The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.


Full diff: https://github.com/llvm/llvm-project/pull/67189.diff

4 Files Affected:

  • (modified) libc/include/llvm-libc-types/rpc_opcodes_t.h (+11-10)
  • (modified) libc/src/stdio/gpu/puts.cpp (+3-5)
  • (modified) libc/utils/gpu/server/rpc_server.cpp (+13-9)
  • (modified) openmp/libomptarget/test/libc/puts.c (+1-1)
diff --git a/libc/include/llvm-libc-types/rpc_opcodes_t.h b/libc/include/llvm-libc-types/rpc_opcodes_t.h
index 9895269767d0037..fb0f19cf505e8dc 100644
--- a/libc/include/llvm-libc-types/rpc_opcodes_t.h
+++ b/libc/include/llvm-libc-types/rpc_opcodes_t.h
@@ -15,16 +15,17 @@ typedef enum : unsigned short {
   RPC_WRITE_TO_STDOUT = 2,
   RPC_WRITE_TO_STDERR = 3,
   RPC_WRITE_TO_STREAM = 4,
-  RPC_READ_FROM_STREAM = 5,
-  RPC_OPEN_FILE = 6,
-  RPC_CLOSE_FILE = 7,
-  RPC_MALLOC = 8,
-  RPC_FREE = 9,
-  RPC_HOST_CALL = 10,
-  RPC_ABORT = 11,
-  RPC_FEOF = 12,
-  RPC_FERROR = 13,
-  RPC_CLEARERR = 14,
+  RPC_WRITE_TO_STDOUT_NEWLINE = 5,
+  RPC_READ_FROM_STREAM = 6,
+  RPC_OPEN_FILE = 7,
+  RPC_CLOSE_FILE = 8,
+  RPC_MALLOC = 9,
+  RPC_FREE = 10,
+  RPC_HOST_CALL = 11,
+  RPC_ABORT = 12,
+  RPC_FEOF = 13,
+  RPC_FERROR = 14,
+  RPC_CLEARERR = 15,
 } rpc_opcode_t;
 
 #endif // __LLVM_LIBC_TYPES_RPC_OPCODE_H__
diff --git a/libc/src/stdio/gpu/puts.cpp b/libc/src/stdio/gpu/puts.cpp
index 58a3534c57ef99f..e50e2cc7d55d506 100644
--- a/libc/src/stdio/gpu/puts.cpp
+++ b/libc/src/stdio/gpu/puts.cpp
@@ -17,11 +17,9 @@ namespace __llvm_libc {
 
 LLVM_LIBC_FUNCTION(int, puts, (const char *__restrict str)) {
   cpp::string_view str_view(str);
-  auto written = file::write(stdout, str, str_view.size());
-  if (written != str_view.size())
-    return EOF;
-  written = file::write(stdout, "\n", 1);
-  if (written != 1)
+  auto written = file::write_impl<RPC_WRITE_TO_STDOUT_NEWLINE>(stdout, str,
+                                                               str_view.size());
+  if (written != str_view.size() + 1)
     return EOF;
   return 0;
 }
diff --git a/libc/utils/gpu/server/rpc_server.cpp b/libc/utils/gpu/server/rpc_server.cpp
index 7493ed66ceecb8c..a772cd1d22e5073 100644
--- a/libc/utils/gpu/server/rpc_server.cpp
+++ b/libc/utils/gpu/server/rpc_server.cpp
@@ -59,23 +59,27 @@ struct Server {
     switch (port->get_opcode()) {
     case RPC_WRITE_TO_STREAM:
     case RPC_WRITE_TO_STDERR:
-    case RPC_WRITE_TO_STDOUT: {
+    case RPC_WRITE_TO_STDOUT:
+    case RPC_WRITE_TO_STDOUT_NEWLINE: {
       uint64_t sizes[lane_size] = {0};
       void *strs[lane_size] = {nullptr};
       FILE *files[lane_size] = {nullptr};
-      if (port->get_opcode() == RPC_WRITE_TO_STREAM)
+      if (port->get_opcode() == RPC_WRITE_TO_STREAM) {
         port->recv([&](rpc::Buffer *buffer, uint32_t id) {
           files[id] = reinterpret_cast<FILE *>(buffer->data[0]);
         });
+      } else if (port->get_opcode() == RPC_WRITE_TO_STDERR) {
+        std::fill(files, files + lane_size, stderr);
+      } else {
+        std::fill(files, files + lane_size, stdout);
+      }
+
       port->recv_n(strs, sizes, [&](uint64_t size) { return new char[size]; });
       port->send([&](rpc::Buffer *buffer, uint32_t id) {
-        FILE *file =
-            port->get_opcode() == RPC_WRITE_TO_STDOUT
-                ? stdout
-                : (port->get_opcode() == RPC_WRITE_TO_STDERR ? stderr
-                                                             : files[id]);
-        uint64_t ret = fwrite(strs[id], 1, sizes[id], file);
-        std::memcpy(buffer->data, &ret, sizeof(uint64_t));
+        buffer->data[0] = fwrite(strs[id], 1, sizes[id], files[id]);
+        if (port->get_opcode() == RPC_WRITE_TO_STDOUT_NEWLINE &&
+            buffer->data[0] == sizes[id])
+          buffer->data[0] += fwrite("\n", 1, 1, files[id]);
         delete[] reinterpret_cast<uint8_t *>(strs[id]);
       });
       break;
diff --git a/openmp/libomptarget/test/libc/puts.c b/openmp/libomptarget/test/libc/puts.c
index 18d87ed1b36ae65..0e363f55296184b 100644
--- a/openmp/libomptarget/test/libc/puts.c
+++ b/openmp/libomptarget/test/libc/puts.c
@@ -31,5 +31,5 @@ int main() {
 // CHECK: PASS
 #pragma omp target teams num_teams(4)
 #pragma omp parallel num_threads(2)
-  { fputs("PASS\n", stdout); }
+  { puts("PASS\n"); }
 }

Copy link
Collaborator

@JonChesterfield JonChesterfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the rest of libc. I think it would be wise to change to a more data driven approach where adding a new opcode doesn't change existing code. Likewise at some point we need to stop renumbering all the opcodes and start putting version fields on it.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Sep 25, 2023

This is consistent with the rest of libc. I think it would be wise to change to a more data driven approach where adding a new opcode doesn't change existing code. Likewise at some point we need to stop renumbering all the opcodes and start putting version fields on it.

Yeah, I'm pushing that down the road until I can say that the libc is mostly "done", right now everything is in flux. Though I should probably just remove the numbers altogether since they keep changing, and I need to make this header C compliant since it uses :.

@jhuber6 jhuber6 merged commit 791b279 into llvm:main Sep 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants