Why quote for string when use StreamWriterBuilder? #894

fseasy · 2019-03-06T11:38:26Z

It is nice to use StreamWriterBuilder, since I can set the precision of float values.
But I found that all non-ASCII characters are escaped.

After diving into the source code, I found static String valueToQuotedStringN(const char* value, unsigned length), and it scared me: it tries to convert non-ASCII characters to Unicode code-point escapes according to the UTF-8 rules, ignoring the actual string encoding.

I know the documentation says somewhere that jsoncpp only supports UTF-8, but I don't think forcibly escaping these bytes is the right approach. The original API, like FastWriter, just returns the raw bytes. Why not follow it?

Or we could follow Python's json library, like

import json

s = [u"中文有时常用GB18030编码",]; # unicode
json_dumps = json.dumps(s, ensure_ascii=False) # get unicode again

that is, let us get the raw bytes by providing a setting.


dota17 · 2019-05-29T12:36:27Z

@MeMeDa By "non-ASCII characters" and "the actual string encoding", are you referring to Chinese strings? Chinese characters are wide characters and are represented by Unicode code points on the network, and UTF-8 is one of the encodings of Unicode. As far as I know, jsoncpp's Reader and Writer also support Chinese characters.

fseasy · 2019-05-29T12:51:25Z

Hmm, maybe my bad English caused some confusion.
It's true that I ran into the problem when processing Chinese strings. The Chinese strings I processed are encoded in GB18030, not UTF-8, so when jsoncpp escapes the bytes according to UTF-8, it outputs the wrong result.

The right steps may be

1. decode the bytes to Unicode according to the right encoding (here GB18030)
2. output the Unicode code point

while current jsoncpp does

1. translate the bytes to Unicode code points according to the UTF-8 <=> Unicode mapping. [x] <- this step may be wrong
2. output the Unicode code point
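
To illustrate the difference between the two procedures, here is a minimal sketch of an escaper that assumes its input is UTF-8. The function name escapeAssumingUtf8 is hypothetical and the code is not taken from jsoncpp; it only handles 1- to 3-byte sequences and does not validate continuation bytes, which is enough to show what happens when the bytes are in another encoding.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not jsoncpp's code): treats `bytes` as UTF-8 and
// writes every non-ASCII code point as a \uXXXX escape.
// Simplifications: only 1- to 3-byte sequences are decoded, continuation
// bytes are not validated, and ASCII characters are copied without escaping.
std::string escapeAssumingUtf8(const std::string& bytes) {
    std::string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {  // plain ASCII: copy through
            out += static_cast<char>(b0);
            ++i;
            continue;
        }
        unsigned cp = 0xFFFD;  // replacement character for unsupported lead bytes
        std::size_t len = 1;
        if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        // Blindly fold in the low 6 bits of each following byte.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        char buf[8];
        std::snprintf(buf, sizeof buf, "\\u%04x", cp);
        out += buf;
        i += len;
    }
    return out;
}
```

Called on "\xE4\xB8\xAD" (中 encoded as UTF-8) this returns \u4e2d. Called on "\xD6\xD0" (the same character encoded as GB18030) it returns an escape for a different, unrelated code point, because the bytes were never decoded with the right encoding.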

dota17 · 2019-09-02T02:14:42Z

Yes, you are right.
For Chinese strings, or any others that are not encoded in UTF-8, we should first decode them to Unicode.
I think iconv_open() and iconv() can solve this problem quite well, because users can convert any encoding to UTF-8 themselves.
For example:

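#include <iconv.h>
#include <string>

// Converts `input` from the encoding named by `type` (e.g. "GB18030") to UTF-8.
// Returns 0 on success, -1 on failure.
int SometypeToUTF8(const std::string& input, std::string& output, const std::string& type)
{
   iconv_t cd = iconv_open("UTF-8", type.c_str()); // target encoding first, then source encoding
   if (cd == (iconv_t)-1) return -1;
   std::string buf(input.size() * 4 + 4, '\0'); // UTF-8 needs at most 4 bytes per character
   char* in = const_cast<char*>(input.data());
   char* out = &buf[0];
   size_t inLen = input.size(), outLen = buf.size();
   size_t rc = iconv(cd, &in, &inLen, &out, &outLen); // decode the source bytes and re-encode them as UTF-8
   iconv_close(cd);
   if (rc == (size_t)-1) return -1;
   output.assign(buf.data(), buf.size() - outLen); // keep only the bytes that were written
   return 0;
}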

fseasy · 2019-09-02T02:22:47Z

I did as you said. But I still can't understand why we should escape the bytes as Unicode code points instead of using the raw bytes directly. Is it required by some standard?
I find it inconvenient. I just want bytes in, bytes out. No code points, no iconv.

dota17 · 2019-09-10T07:04:00Z

@MeMeDa
Hi, can you give some examples using jsoncpp? Either FastWriter or StreamWriterBuilder is fine. Please also tell us your expected result.

I also ran some tests with a Chinese string, 中文有时常用GB18030编码.
My test code:

	const std::string uni = "中文有时常用GB18030编码"; // utf-8
	std::string styled;
	{
		Json::Value v;
		v["abc"] = uni;
		styled = v.toStyledString();
	}
	Json::Value tmp;
	Json::FastWriter writer;
	{
		JSONCPP_STRING errs;
		std::istringstream iss(styled);
		bool ok = parseFromStream(Json::CharReaderBuilder(), iss, &tmp, &errs);
		if (!ok) {
			std::cerr << "errs: " << errs << std::endl;
		}
		std::cout << "ori string: " << uni << std::endl;
		std::cout << "asString: " << tmp["abc"].asString() << std::endl;
		std::cout << "wirte: " << writer.write(tmp) << std::endl;
	}

The actual result:

asString: 中文有时常用GB18030编码
ori string: 中文有时常用GB18030编码
wirte: {"abc":"\u4e2d\u6587\u6709\u65f6\u5e38\u7528GB18030\u7f16\u7801"}

When I used StreamWriterBuilder to write, the result was the same.

You said: "The original Api, like FastWriter, just return the raw bytes,"

But FastWriter didn't return the raw bytes when processing non-ASCII characters.
The behavior of Json::Value's asString method and the writer's write method is inconsistent.

fseasy · 2019-09-18T10:55:48Z

Thanks for keeping track of this.
About "FastWriter didn't return the raw bytes when processing non-ASCII characters": I saw raw bytes because I'm using an old version of jsoncpp, something like 1.8.4 (maybe; I've tried several versions, including 0.x).

In the code you posted, the line

const std::string uni = "中文有时常用GB18030编码"; // utf-8

is the key point. The bytes of uni may not be encoded in UTF-8 (GB18030, for example, is another popular encoding for Chinese), and then the *Writer escapes them incorrectly.

I know the best way currently is to first convert other encodings to UTF-8, but sometimes I just need the raw bytes and don't need the escaped Unicode code points, and the conversion reduces efficiency.
So I just hope the *Writer can provide an option to keep the original bytes and disable escaping, like json in Python

json.dumps(json_data, ensure_ascii=False)

to let the result keep the raw bytes.
Thanks again 🚀
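
The "bytes in, bytes out" behavior requested above could look roughly like the sketch below. The function name quoteKeepingRawBytes is hypothetical and is not part of jsoncpp; it escapes only the double quote, the backslash, and control characters below 0x20, and copies every other byte unchanged.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of jsoncpp): quotes `value` as a JSON string,
// escaping only '"', '\\' and control characters below 0x20. Every other
// byte, including non-ASCII ones, is copied to the output unchanged.
std::string quoteKeepingRawBytes(const std::string& value) {
    std::string out = "\"";
    for (unsigned char c : value) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters get a \u00XX escape.
                    char buf[8];
                    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);  // raw byte, no decoding
                }
        }
    }
    out += '"';
    return out;
}
```

One caveat with this byte-wise approach: it is only safe when no multi-byte character contains a byte that looks like a quote, backslash, or control character. That holds for UTF-8, but in GBK-family encodings a trailing byte can be 0x5C, which this loop would treat as a backslash and escape.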

dota17 · 2019-10-11T02:14:20Z

@MeMeDa
Hi,
There is a PR, #1045, that adds emitUTF8.
I think it may be helpful for the UTF-8 case: it emits UTF-8 directly instead of escaped Unicode code points.
You can review it.

dota17 · 2019-10-18T06:38:36Z

PR #1045 for emitUTF8 has been merged.
If the problem persists, feel free to reopen.

baylesj added enhancement bug and removed enhancement labels Jul 9, 2019

dota17 closed this as completed Oct 18, 2019
