Skip to content

Add Travis test to make sure text files are UTF-8 encoded #14701

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 2, 2021

Conversation

donatieng
Copy link
Contributor

@donatieng donatieng commented May 25, 2021

Summary of changes

Introducing text files not encoded with UTF-8 breaks our docs CI.
This PR cleans up the relevant file and adds a check to our CI to prevent issues.

Impact of changes

Migration actions required

Documentation

Nonee


Pull request type

[X] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[] Feature update (New feature / Functionality change / New API)
[] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[] No Tests required for this change (E.g docs only update)
[] Covered by existing mbed-os tests (Greentea or Unittest)
[X] Tests / results supplied as part of this PR

Reviewers

@ARMmbed/mbed-os-maintainers


@ciarmcom ciarmcom added the release-type: patch Indentifies a PR as containing just a patch label May 25, 2021
@ciarmcom ciarmcom requested a review from a team May 25, 2021 17:00
@ciarmcom
Copy link
Member

@donatieng, thank you for your changes.
@ARMmbed/mbed-os-maintainers please review.

@donatieng donatieng force-pushed the utf-8-check branch 7 times, most recently from 27ab346 to 9f20cc9 Compare May 25, 2021 20:00
@donatieng
Copy link
Contributor Author

@donatieng donatieng marked this pull request as ready for review May 25, 2021 20:01
Copy link
Contributor

@LDong-Arm LDong-Arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donatieng Thanks for the fix!

@@ -502,7 +502,7 @@ extern "C" {
* MaxConnEventLength
* This parameter determines the maximum duration of a slave connection event. When this duration is reached the slave closes
* the current connections event (whatever is the CE_length parameter specified by the master in HCI_CREATE_CONNECTION HCI command),
* expressed in units of 625/256 �s (~2.44 �s)
* expressed in units of 625/256 us (~2.44 us)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

µ is UTF-8. The codepoint is U+03BC, which is encoded in UTF-8 as: 0xCE 0xBC (https://www.compart.com/en/unicode/U+03BC)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think we should agree on what characters are allowed in Mbed OS. ASCII feels a bit restrictive but is more compatible. If we allow any UTF-8, then the CircleCI needs to be fixed instead.

This particular character was added a few weeks ago, so our code base was likely ASCII-only before that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What breaks exactly? What tool chokes on this? I didn't see it explained in the commit message.

Copy link
Contributor

@Patater Patater May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The title says "don't allow invalid UTF-8". Does this mean if encoded correctly (in a valid manner), µ would be OK?

Confusingly, the description says "Introducing non-UTF-8 characters in text files breaks our docs CI.", but then removes a UTF-8 character. I'm not sure what is breaking what. Maybe we need to correctly encode shci.h as UTF-8 rather than remove the µ?

Copy link
Contributor

@rwalton-arm rwalton-arm May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc_builder tool breaks when it tries to do a pathlib.Path.read_text. If the file it's reading from is saved with utf8 encoding, then µ should work ok.

>>> import pathlib
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> pathlib.Path("test.txt").read_text()
'μ\n'

Copy link
Contributor

@LDong-Arm LDong-Arm May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so the real issue is the encoding of that file rather than the presence of µ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, rather than replacing µ, we should convert the file to utf-8. For example, in vim you can do https://stackoverflow.com/a/66572180. This keeps the greek letter.

Copy link
Contributor Author

@donatieng donatieng May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup the issue is that the file is not encoded as UTF-8:

file -I shci.h 
shci.h: text/x-c; charset=unknown-8bit

Happy to try and convert the file to keep the "µ"s.

.travis.yml Outdated
- <<: *basic-vm
name: "UTF-8 Check"
script:
# Make sure we're not introducing any non-UTF-8 character in text files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "non-ASCII" character?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible utf-8 check is

file -I <filename>

and the output should contain charset=utf-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah see above 😄 ⬆️

@mergify mergify bot dismissed Patater’s stale review May 26, 2021 12:51

Pull request has been modified.

.travis.yml Outdated
- <<: *basic-vm
name: "UTF-8 Check"
script:
# Make sure we're not introducing any non-UTF-8 character in text files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't make sense. Pretty much any character is going to be encodable by UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rephrase - meant invalid encoding

name: "UTF-8 Check"
script:
# Make sure we're not introducing any non-UTF-8 character in text files
- git diff $TRAVIS_BRANCH...HEAD -U0 | ( grep -a '^+' || true ) | ( ! grep -axv '.*' )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check still valid or should we run file -I <filename> and check the encoding is one of the handful of acceptable options (at least ascii, and utf-8)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've read, file -I doesn't check the whole file and is therefore not always correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patater hopefully 15c9474 addresses your comment

Copy link
Contributor

@LDong-Arm LDong-Arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donatieng
Please squash the commits to one for shci.h and one for adding Travis test. And update the title & PR description to text encoding instead of character.

LGTM otherwise, thanks for identifying and fixing the issue! I'll unblock a number of docs PRs.

@donatieng donatieng changed the title Add Travis test to detect any invalid UTF-8 character Add Travis test to make sure text files are UTF-8 encoded Jun 2, 2021
@donatieng
Copy link
Contributor Author

Thanks @LDong-Arm!

Squashed and updated 🙂

Copy link
Contributor

@LDong-Arm LDong-Arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LDong-Arm
Copy link
Contributor

@ARMmbed/mbed-os-maintainers Shall we run CI and merge this, as it unblocks PRs in https://github.com/ARMmbed/mbed-os-5-docs/pulls, thanks

@@ -435,7 +435,7 @@ extern "C" {
* PrWriteListSize
* NOTE: This parameter is ignored by the CPU2 when the parameter "Options" is set to "LL_only" ( see Options description in that structure )
*
* Maximum number of supported prepare write request
* Maximum number of supported “prepare write request”
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ARMmbed/team-st-mcd Please forward this change upstream

@0xc0170
Copy link
Contributor

0xc0170 commented Jun 2, 2021

CI started

@mbed-ci
Copy link

mbed-ci commented Jun 2, 2021

Jenkins CI Test : ✔️ SUCCESS

Build Number: 1 | 🔒 Jenkins CI Job | 🌐 Logs & Artifacts

CLICK for Detailed Summary

jobs Status
jenkins-ci/mbed-os-ci_unittests ✔️
jenkins-ci/mbed-os-ci_cmake-example-GCC_ARM ✔️
jenkins-ci/mbed-os-ci_cmake-example-ARM ✔️
jenkins-ci/mbed-os-ci_build-example-ARM ✔️
jenkins-ci/mbed-os-ci_build-example-GCC_ARM ✔️
jenkins-ci/mbed-os-ci_build-greentea-ARM ✔️
jenkins-ci/mbed-os-ci_build-greentea-GCC_ARM ✔️
jenkins-ci/mbed-os-ci_cmake-cloud-example-ARM ✔️
jenkins-ci/mbed-os-ci_build-cloud-example-ARM ✔️
jenkins-ci/mbed-os-ci_cmake-cloud-example-GCC_ARM ✔️
jenkins-ci/mbed-os-ci_build-cloud-example-GCC_ARM ✔️
jenkins-ci/mbed-os-ci_greentea-test ✔️

@0xc0170 0xc0170 merged commit be9850b into ARMmbed:master Jun 2, 2021
@mergify mergify bot removed the ready for merge label Jun 2, 2021
@mbedmain mbedmain added release-version: 6.12.0 Release-pending and removed release-type: patch Indentifies a PR as containing just a patch Release-pending labels Jun 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants