Skip to content

Remove charset tests with race conditions #12571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mrsdizzie
Copy link
Member

@mrsdizzie mrsdizzie commented Aug 23, 2020

These tests don't seem reliable (with documented race conditions). Remove them since they regularly kill the CI.

For both

// "Hola, así cómo ños"
res = ToUTF8WithFallback([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63,
0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73})
assert.Equal(t, []byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xC3, 0xAD, 0x20, 0x63,
0xC3, 0xB3, 0x6D, 0x6F, 0x20, 0xC3, 0xB1, 0x6F, 0x73}, res)

and

// Latin1
// Hola, así cómo ños
res = ToUTF8(string([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63,
0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73}))
assert.Equal(t, []byte{0x48, 0x6f, 0x6c, 0x61, 0x2c, 0x20, 0x61, 0x73, 0xc3, 0xad, 0x20, 0x63,
0xc3, 0xb3, 0x6d, 0x6f, 0x20, 0xc3, 0xb1, 0x6f, 0x73}, []byte(res))

It is definitely hitting this race condition that we already 'accept' in the tests below:

// due to a race condition in `chardet` library, it could either detect
// "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so
// we accept either.

Both tests pass when we detect it as ISO-8859-1 but fail when they sometimes gets detected as ISO-8859-2

With these tests locally it fails one out of every 4 or 5 tries, after removing it never fails

These tests don't seem reliable (with documented race conditions). Remove them since they regularly kill the CI.
@codecov-commenter
Copy link

Codecov Report

Merging #12571 into master will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #12571   +/-   ##
=======================================
  Coverage   43.43%   43.43%           
=======================================
  Files         643      643           
  Lines       71165    71165           
=======================================
+ Hits        30910    30913    +3     
- Misses      35246    35247    +1     
+ Partials     5009     5005    -4     
Impacted Files Coverage Δ
models/unit.go 46.57% <0.00%> (-2.74%) ⬇️
modules/log/event.go 57.54% <0.00%> (-1.89%) ⬇️
modules/queue/unique_queue_disk_channel.go 53.84% <0.00%> (-1.54%) ⬇️
modules/indexer/stats/db.go 60.86% <0.00%> (+17.39%) ⬆️
modules/indexer/stats/queue.go 76.47% <0.00%> (+23.52%) ⬆️
modules/util/timer.go 85.71% <0.00%> (+42.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2026d88...2aa0b7e. Read the comment docs.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 23, 2020
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Aug 23, 2020
@silverwind
Copy link
Member

Could also just comment it out.

Copy link
Contributor

@zeripath zeripath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to disagree - if this race is still happening 4 out of 5 our preferred charset stuff isn't working correctly and people's documents will be being detected incorrectly.

We need to fix that not just paper over by deleting the test that is showing the problem.

For example, what is relative score of iso-8859-1 Vs the iso-8859-2? Would be as simple as adding a small boost to preferred charsets?

@silverwind
Copy link
Member

This test is causing like every fourth CI run to fail which wastes a lot of valuable time of contributors. We can keep an issue around to keep it a "known issue" but as it is now, this test randomly failing does not gain us any additional knowledge.

@silverwind
Copy link
Member

silverwind commented Aug 23, 2020

I really wonder how something like charset detection can be non-deterministic. Do we know if the bug comes from gogs/chardet or our code?

Golang provides charset#DetermineEncoding thought docs say its only for HTML but maybe we can use it. Nope, implementation far too simple.

@zeripath
Copy link
Contributor

zeripath commented Aug 23, 2020

That this test is still failing means that we are detecting the wrong character set frequently in certain cases. I cannot get it to happen on my local machine - if you can then we can finally do something about fixing it. I have replicated it - I was doing the wrong test!! d'oh.

For example adding this would allow us to investigate what is going on with scores:

diff --git a/modules/charset/charset.go b/modules/charset/charset.go
index a7e427db9..8a7fd90c5 100644
--- a/modules/charset/charset.go
+++ b/modules/charset/charset.go
@@ -149,6 +149,17 @@ func DetectEncoding(content []byte) (string, error) {
 		return "", err
 	}
 
+	if log.IsTrace() {
+		logText := &strings.Builder{}
+		logValues := []interface{}{}
+		_, _ = logText.WriteString("Detected Encodings:\n")
+		for _, result := range results {
+			_, _ = logText.WriteString("%d %s\n")
+			logValues = append(logValues, result.Confidence, result.Charset)
+		}
+		log.Trace(logText.String(), logValues...)
+	}
+
 	topConfidence := results[0].Confidence
 	topResult := results[0]
 	priority, has := setting.Repository.DetectedCharsetScore[strings.ToLower(strings.TrimSpace(topResult.Charset))]

@zeripath
Copy link
Contributor

OK the issue is that although I fixed this in Gitea proper the Test does not set the default detectedcharsetsorder.

The below would fix this:

diff --git a/modules/charset/charset_test.go b/modules/charset/charset_test.go
index 394a42c71..33f0c10a7 100644
--- a/modules/charset/charset_test.go
+++ b/modules/charset/charset_test.go
@@ -5,6 +5,7 @@
 package charset
 
 import (
+	"strings"
 	"testing"
 
 	"code.gitea.io/gitea/modules/setting"
@@ -12,6 +13,22 @@ import (
 	"github.com/stretchr/testify/assert"
 )
 
+func resetDefaultCharsetsOrder() {
+	defaultDetectedCharsetsOrder := make([]string, 0, len(setting.Repository.DetectedCharsetsOrder))
+	for _, charset := range setting.Repository.DetectedCharsetsOrder {
+		defaultDetectedCharsetsOrder = append(defaultDetectedCharsetsOrder, strings.ToLower(strings.TrimSpace(charset)))
+	}
+	setting.Repository.DetectedCharsetScore = map[string]int{}
+	i := 0
+	for _, charset := range defaultDetectedCharsetsOrder {
+		canonicalCharset := strings.ToLower(strings.TrimSpace(charset))
+		if _, has := setting.Repository.DetectedCharsetScore[canonicalCharset]; !has {
+			setting.Repository.DetectedCharsetScore[canonicalCharset] = i
+			i++
+		}
+	}
+}
+
 func TestRemoveBOMIfPresent(t *testing.T) {
 	res := RemoveBOMIfPresent([]byte{0xc3, 0xa1, 0xc3, 0xa9, 0xc3, 0xad, 0xc3, 0xb3, 0xc3, 0xba})
 	assert.Equal(t, []byte{0xc3, 0xa1, 0xc3, 0xa9, 0xc3, 0xad, 0xc3, 0xb3, 0xc3, 0xba}, res)
@@ -21,6 +38,7 @@ func TestRemoveBOMIfPresent(t *testing.T) {
 }
 
 func TestToUTF8WithErr(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	var res string
 	var err error
 
@@ -76,6 +94,7 @@ func TestToUTF8WithErr(t *testing.T) {
 }
 
 func TestToUTF8WithFallback(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	// "ABC"
 	res := ToUTF8WithFallback([]byte{0x41, 0x42, 0x43})
 	assert.Equal(t, []byte{0x41, 0x42, 0x43}, res)
@@ -116,7 +135,7 @@ func TestToUTF8WithFallback(t *testing.T) {
 }
 
 func TestToUTF8(t *testing.T) {
-
+	resetDefaultCharsetsOrder()
 	// Note: golang compiler seems so behave differently depending on the current
 	// locale, so some conversions might behave differently. For that reason, we don't
 	// depend on particular conversions but in expected behaviors.
@@ -165,6 +184,7 @@ func TestToUTF8(t *testing.T) {
 }
 
 func TestToUTF8DropErrors(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	// "ABC"
 	res := ToUTF8DropErrors([]byte{0x41, 0x42, 0x43})
 	assert.Equal(t, []byte{0x41, 0x42, 0x43}, res)
@@ -204,6 +224,7 @@ func TestToUTF8DropErrors(t *testing.T) {
 }
 
 func TestDetectEncoding(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	testSuccess := func(b []byte, expected string) {
 		encoding, err := DetectEncoding(b)
 		assert.NoError(t, err)
@@ -225,10 +246,7 @@ func TestDetectEncoding(t *testing.T) {
 	b = []byte{0x44, 0xe9, 0x63, 0x6f, 0x72, 0x0a}
 	encoding, err := DetectEncoding(b)
 	assert.NoError(t, err)
-	// due to a race condition in `chardet` library, it could either detect
-	// "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so
-	// we accept either.
-	assert.Contains(t, encoding, "ISO-8859")
+	assert.Contains(t, encoding, "ISO-8859-1")
 
 	old := setting.Repository.AnsiCharset
 	setting.Repository.AnsiCharset = "placeholder"

zeripath added a commit to zeripath/gitea that referenced this pull request Aug 23, 2020
TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.

The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.

This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.

Replaces go-gitea#12571
Close go-gitea#12571

Signed-off-by: Andrew Thornton <[email protected]>
@zeripath zeripath added the topic/build PR changes how Gitea is built, i.e. regarding Docker or the Makefile label Aug 23, 2020
@mrsdizzie
Copy link
Member Author

good :)

@mrsdizzie mrsdizzie closed this Aug 23, 2020
zeripath added a commit that referenced this pull request Aug 23, 2020
TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.

The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.

This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.

Replaces #12571
Close #12571

Signed-off-by: Andrew Thornton <[email protected]>
zeripath added a commit to zeripath/gitea that referenced this pull request Aug 23, 2020
…ea#12574)

Backport go-gitea#12574

TestToUTF8WithFallback is the cause of recurrent spurious test failures
even despite code to set the detected charset order.

The reason why this happens is because the preferred detected charset order
is not being initialised for these tests.

This PR simply ensures that this is set at the start of each test and would
allow different tests to be written to allow differing orders.

Replaces go-gitea#12571
Close go-gitea#12571

Signed-off-by: Andrew Thornton <[email protected]>
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged. topic/build PR changes how Gitea is built, i.e. regarding Docker or the Makefile
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants