Remove charset tests with race conditions #12571

mrsdizzie · 2020-08-23T00:34:44Z

These tests don't seem reliable (with documented race conditions). Remove them since they regularly kill the CI.

For both

Lines 91 to 95 in 2026d88

    
           // "Hola, así cómo ños" 
        
           res = ToUTF8WithFallback([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63, 
        
           	0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73}) 
        
           assert.Equal(t, []byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xC3, 0xAD, 0x20, 0x63, 
        
           	0xC3, 0xB3, 0x6D, 0x6F, 0x20, 0xC3, 0xB1, 0x6F, 0x73}, res)

and

gitea/modules/charset/charset_test.go

Lines 136 to 141 in 2026d88

    
           // Latin1 
        
           // Hola, así cómo ños 
        
           res = ToUTF8(string([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63, 
        
           	0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73})) 
        
           assert.Equal(t, []byte{0x48, 0x6f, 0x6c, 0x61, 0x2c, 0x20, 0x61, 0x73, 0xc3, 0xad, 0x20, 0x63, 
        
           	0xc3, 0xb3, 0x6d, 0x6f, 0x20, 0xc3, 0xb1, 0x6f, 0x73}, []byte(res))

It is definitely hitting this race condition that we already 'accept' in the tests below:

gitea/modules/charset/charset_test.go

Lines 228 to 230 in 2026d88

    
           // due to a race condition in `chardet` library, it could either detect 
        
           // "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so 
        
           // we accept either.

Both tests pass when we detect it as ISO-8859-1 but fail when they sometimes gets detected as ISO-8859-2

With these tests locally it fails one out of every 4 or 5 tries, after removing it never fails

These tests don't seem reliable (with documented race conditions). Remove them since they regularly kill the CI.

codecov-commenter · 2020-08-23T01:02:56Z

Codecov Report

Merging #12571 into master will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #12571   +/-   ##
=======================================
  Coverage   43.43%   43.43%           
=======================================
  Files         643      643           
  Lines       71165    71165           
=======================================
+ Hits        30910    30913    +3     
- Misses      35246    35247    +1     
+ Partials     5009     5005    -4

Impacted Files	Coverage Δ
models/unit.go	`46.57% <0.00%> (-2.74%)`	⬇️
modules/log/event.go	`57.54% <0.00%> (-1.89%)`	⬇️
modules/queue/unique_queue_disk_channel.go	`53.84% <0.00%> (-1.54%)`	⬇️
modules/indexer/stats/db.go	`60.86% <0.00%> (+17.39%)`	⬆️
modules/indexer/stats/queue.go	`76.47% <0.00%> (+23.52%)`	⬆️
modules/util/timer.go	`85.71% <0.00%> (+42.85%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2026d88...2aa0b7e. Read the comment docs.

silverwind · 2020-08-23T05:19:29Z

Could also just comment it out.

zeripath

I have to disagree - if this race is still happening 4 out of 5 our preferred charset stuff isn't working correctly and people's documents will be being detected incorrectly.

We need to fix that not just paper over by deleting the test that is showing the problem.

For example, what is relative score of iso-8859-1 Vs the iso-8859-2? Would be as simple as adding a small boost to preferred charsets?

silverwind · 2020-08-23T09:01:24Z

This test is causing like every fourth CI run to fail which wastes a lot of valuable time of contributors. We can keep an issue around to keep it a "known issue" but as it is now, this test randomly failing does not gain us any additional knowledge.

silverwind · 2020-08-23T09:50:26Z

I really wonder how something like charset detection can be non-deterministic. Do we know if the bug comes from gogs/chardet or our code?

~~Golang provides charset#DetermineEncoding thought docs say its only for HTML but maybe we can use it.~~ Nope, implementation far too simple.

zeripath · 2020-08-23T09:54:39Z

That this test is still failing means that we are detecting the wrong character set frequently in certain cases. ~~I cannot get it to happen on my local machine - if you can then we can finally do something about fixing it.~~ I have replicated it - I was doing the wrong test!! d'oh.

For example adding this would allow us to investigate what is going on with scores:

diff --git a/modules/charset/charset.go b/modules/charset/charset.go
index a7e427db9..8a7fd90c5 100644
--- a/modules/charset/charset.go
+++ b/modules/charset/charset.go
@@ -149,6 +149,17 @@ func DetectEncoding(content []byte) (string, error) {
 		return "", err
 	}
 
+	if log.IsTrace() {
+		logText := &strings.Builder{}
+		logValues := []interface{}{}
+		_, _ = logText.WriteString("Detected Encodings:\n")
+		for _, result := range results {
+			_, _ = logText.WriteString("%d %s\n")
+			logValues = append(logValues, result.Confidence, result.Charset)
+		}
+		log.Trace(logText.String(), logValues...)
+	}
+
 	topConfidence := results[0].Confidence
 	topResult := results[0]
 	priority, has := setting.Repository.DetectedCharsetScore[strings.ToLower(strings.TrimSpace(topResult.Charset))]

zeripath · 2020-08-23T11:04:39Z

OK the issue is that although I fixed this in Gitea proper the Test does not set the default detectedcharsetsorder.

The below would fix this:

diff --git a/modules/charset/charset_test.go b/modules/charset/charset_test.go
index 394a42c71..33f0c10a7 100644
--- a/modules/charset/charset_test.go
+++ b/modules/charset/charset_test.go
@@ -5,6 +5,7 @@
 package charset
 
 import (
+	"strings"
 	"testing"
 
 	"code.gitea.io/gitea/modules/setting"
@@ -12,6 +13,22 @@ import (
 	"github.com/stretchr/testify/assert"
 )
 
+func resetDefaultCharsetsOrder() {
+	defaultDetectedCharsetsOrder := make([]string, 0, len(setting.Repository.DetectedCharsetsOrder))
+	for _, charset := range setting.Repository.DetectedCharsetsOrder {
+		defaultDetectedCharsetsOrder = append(defaultDetectedCharsetsOrder, strings.ToLower(strings.TrimSpace(charset)))
+	}
+	setting.Repository.DetectedCharsetScore = map[string]int{}
+	i := 0
+	for _, charset := range defaultDetectedCharsetsOrder {
+		canonicalCharset := strings.ToLower(strings.TrimSpace(charset))
+		if _, has := setting.Repository.DetectedCharsetScore[canonicalCharset]; !has {
+			setting.Repository.DetectedCharsetScore[canonicalCharset] = i
+			i++
+		}
+	}
+}
+
 func TestRemoveBOMIfPresent(t *testing.T) {
 	res := RemoveBOMIfPresent([]byte{0xc3, 0xa1, 0xc3, 0xa9, 0xc3, 0xad, 0xc3, 0xb3, 0xc3, 0xba})
 	assert.Equal(t, []byte{0xc3, 0xa1, 0xc3, 0xa9, 0xc3, 0xad, 0xc3, 0xb3, 0xc3, 0xba}, res)
@@ -21,6 +38,7 @@ func TestRemoveBOMIfPresent(t *testing.T) {
 }
 
 func TestToUTF8WithErr(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	var res string
 	var err error
 
@@ -76,6 +94,7 @@ func TestToUTF8WithErr(t *testing.T) {
 }
 
 func TestToUTF8WithFallback(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	// "ABC"
 	res := ToUTF8WithFallback([]byte{0x41, 0x42, 0x43})
 	assert.Equal(t, []byte{0x41, 0x42, 0x43}, res)
@@ -116,7 +135,7 @@ func TestToUTF8WithFallback(t *testing.T) {
 }
 
 func TestToUTF8(t *testing.T) {
-
+	resetDefaultCharsetsOrder()
 	// Note: golang compiler seems so behave differently depending on the current
 	// locale, so some conversions might behave differently. For that reason, we don't
 	// depend on particular conversions but in expected behaviors.
@@ -165,6 +184,7 @@ func TestToUTF8(t *testing.T) {
 }
 
 func TestToUTF8DropErrors(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	// "ABC"
 	res := ToUTF8DropErrors([]byte{0x41, 0x42, 0x43})
 	assert.Equal(t, []byte{0x41, 0x42, 0x43}, res)
@@ -204,6 +224,7 @@ func TestToUTF8DropErrors(t *testing.T) {
 }
 
 func TestDetectEncoding(t *testing.T) {
+	resetDefaultCharsetsOrder()
 	testSuccess := func(b []byte, expected string) {
 		encoding, err := DetectEncoding(b)
 		assert.NoError(t, err)
@@ -225,10 +246,7 @@ func TestDetectEncoding(t *testing.T) {
 	b = []byte{0x44, 0xe9, 0x63, 0x6f, 0x72, 0x0a}
 	encoding, err := DetectEncoding(b)
 	assert.NoError(t, err)
-	// due to a race condition in `chardet` library, it could either detect
-	// "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so
-	// we accept either.
-	assert.Contains(t, encoding, "ISO-8859")
+	assert.Contains(t, encoding, "ISO-8859-1")
 
 	old := setting.Repository.AnsiCharset
 	setting.Repository.AnsiCharset = "placeholder"

TestToUTF8WithFallback is the cause of recurrent spurious test failures even despite code to set the detected charset order. The reason why this happens is because the preferred detected charset order is not being initialised for these tests. This PR simply ensures that this is set at the start of each test and would allow different tests to be written to allow differing orders. Replaces go-gitea#12571 Close go-gitea#12571 Signed-off-by: Andrew Thornton <[email protected]>

mrsdizzie · 2020-08-23T13:04:45Z

good :)

TestToUTF8WithFallback is the cause of recurrent spurious test failures even despite code to set the detected charset order. The reason why this happens is because the preferred detected charset order is not being initialised for these tests. This PR simply ensures that this is set at the start of each test and would allow different tests to be written to allow differing orders. Replaces #12571 Close #12571 Signed-off-by: Andrew Thornton <[email protected]>

…ea#12574) Backport go-gitea#12574 TestToUTF8WithFallback is the cause of recurrent spurious test failures even despite code to set the detected charset order. The reason why this happens is because the preferred detected charset order is not being initialised for these tests. This PR simply ensures that this is set at the start of each test and would allow different tests to be written to allow differing orders. Replaces go-gitea#12571 Close go-gitea#12571 Signed-off-by: Andrew Thornton <[email protected]>

Remove chardat tests with race conditions

2aa0b7e

These tests don't seem reliable (with documented race conditions). Remove them since they regularly kill the CI.

GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 23, 2020

silverwind approved these changes Aug 23, 2020

View reviewed changes

GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Aug 23, 2020

zeripath suggested changes Aug 23, 2020

View reviewed changes

zeripath mentioned this pull request Aug 23, 2020

Ensure that the detected charset order is set in chardet test #12574

Merged

zeripath added the topic/build PR changes how Gitea is built, i.e. regarding Docker or the Makefile label Aug 23, 2020

mrsdizzie closed this Aug 23, 2020

zeripath mentioned this pull request Aug 23, 2020

Ensure that the detected charset order is set in chardet test (#12574) #12578

Closed

go-gitea locked and limited conversation to collaborators Nov 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Remove charset tests with race conditions #12571

Remove charset tests with race conditions #12571

Uh oh!

mrsdizzie commented Aug 23, 2020 •

edited

Loading

Uh oh!

codecov-commenter commented Aug 23, 2020

Uh oh!

silverwind commented Aug 23, 2020

Uh oh!

zeripath left a comment •

edited

Loading

Uh oh!

silverwind commented Aug 23, 2020

Uh oh!

silverwind commented Aug 23, 2020 •

edited

Loading

Uh oh!

zeripath commented Aug 23, 2020 •

edited

Loading

Uh oh!

zeripath commented Aug 23, 2020

Uh oh!

mrsdizzie commented Aug 23, 2020

Uh oh!

Uh oh!

	// "Hola, así cómo ños"
	res = ToUTF8WithFallback([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63,
	0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73})
	assert.Equal(t, []byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xC3, 0xAD, 0x20, 0x63,
	0xC3, 0xB3, 0x6D, 0x6F, 0x20, 0xC3, 0xB1, 0x6F, 0x73}, res)

	// Latin1
	// Hola, así cómo ños
	res = ToUTF8(string([]byte{0x48, 0x6F, 0x6C, 0x61, 0x2C, 0x20, 0x61, 0x73, 0xED, 0x20, 0x63,
	0xF3, 0x6D, 0x6F, 0x20, 0xF1, 0x6F, 0x73}))
	assert.Equal(t, []byte{0x48, 0x6f, 0x6c, 0x61, 0x2c, 0x20, 0x61, 0x73, 0xc3, 0xad, 0x20, 0x63,
	0xc3, 0xb3, 0x6d, 0x6f, 0x20, 0xc3, 0xb1, 0x6f, 0x73}, []byte(res))

	// due to a race condition in `chardet` library, it could either detect
	// "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so
	// we accept either.

Uh oh!

Remove charset tests with race conditions #12571

Remove charset tests with race conditions #12571

Uh oh!

Conversation

mrsdizzie commented Aug 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 23, 2020

Codecov Report

Uh oh!

silverwind commented Aug 23, 2020

Uh oh!

zeripath left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

silverwind commented Aug 23, 2020

Uh oh!

silverwind commented Aug 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zeripath commented Aug 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zeripath commented Aug 23, 2020

Uh oh!

mrsdizzie commented Aug 23, 2020

Uh oh!

Uh oh!

mrsdizzie commented Aug 23, 2020 •

edited

Loading

zeripath left a comment •

edited

Loading

silverwind commented Aug 23, 2020 •

edited

Loading

zeripath commented Aug 23, 2020 •

edited

Loading