Why quote for string when use StreamWriterBuilder? #894

Closed
fseasy opened this issue Mar 6, 2019 · 8 comments

@fseasy

fseasy commented Mar 6, 2019

StreamWriterBuilder is nice to use, and I can set the precision of float values with it.
However, I found that all non-ASCII characters are escaped.

After diving into the source code, I found static String valueToQuotedStringN(const char* value, unsigned length), and it scared me: it tries to escape non-ASCII characters as Unicode code points according to the UTF-8 rules, ignoring the string's actual encoding.

I know the documentation says somewhere that jsoncpp only supports UTF-8, but I don't think unconditionally escaping these bytes is the proper approach. The original API, such as FastWriter, just returns the raw bytes. Why not follow it?

Or we could follow Python's json library, for example:

import json

s = [u"中文有时常用GB18030编码"]  # unicode
json_dumps = json.dumps(s, ensure_ascii=False)  # returns unicode again, with no \uXXXX escapes

That is, let us get the raw bytes by providing a setting.
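As a small illustration of the Python behavior referred to above (a sketch using only the standard json module; the sample list is made up for the example), ensure_ascii controls whether non-ASCII characters are escaped:

```python
import json

data = ["中文"]  # sample value, chosen only for illustration

# Default (ensure_ascii=True): every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(data)

# ensure_ascii=False: non-ASCII characters are written through unchanged.
raw = json.dumps(data, ensure_ascii=False)

print(escaped)  # ["\u4e2d\u6587"]
print(raw)      # ["中文"]

# Both forms are valid JSON and parse back to the same value.
roundtrip_ok = json.loads(escaped) == json.loads(raw) == data
```

In Python 3 the result is a str, so choosing the byte encoding is a separate raw.encode(...) step.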

@dota17
Member

dota17 commented May 29, 2019

@MeMeDa By non-ASCII characters and the actual string encoding, do you mean Chinese strings? Chinese characters are wide characters and are represented by Unicode code points on the network, and UTF-8 is one of the encodings of Unicode. As far as I know, jsoncpp's Reader and Writer also support Chinese characters.

@fseasy
Author

fseasy commented May 29, 2019

Hmm, maybe my bad English caused some confusion.
It's true that I ran into the problem when processing Chinese strings. The Chinese strings I process are encoded in GB18030, not UTF-8, so when jsoncpp escapes the bytes according to UTF-8, it outputs the wrong result.

The right steps might be:

1. decode the bytes to Unicode according to the right encoding (here GB18030)
2. output the Unicode code points

while the current jsoncpp does:

1. translate the bytes to Unicode code points according to the UTF-8 <=> Unicode mapping [x] <- this may be wrong
2. output the Unicode code points
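The difference between the two procedures can be sketched in Python (standard library only; this illustrates the encoding mismatch in general and is not jsoncpp's actual code path). The same GB18030 bytes decode cleanly with the right codec but are not valid UTF-8:

```python
import json

text = "中文"
gb_bytes = text.encode("gb18030")  # the bytes a GB18030 application holds

# Procedure 1: decode with the real encoding, then output the code points.
decoded = gb_bytes.decode("gb18030")
correct = json.dumps(decoded)  # escapes the true code points

# Procedure 2: treat the same bytes as UTF-8. Python's strict decoder
# rejects them; a writer that does not validate may emit wrong output instead.
try:
    gb_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(gb_bytes.hex())  # d6d0cec4
print(correct)         # "\u4e2d\u6587"
print(valid_as_utf8)   # False
```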

@dota17
Member

dota17 commented Sep 2, 2019

Yes, you are right.
Chinese strings, or any others that are not encoded in UTF-8, should be decoded to Unicode first.
I think iconv_open() and iconv() solve this problem quite well, because users can convert any encoding to UTF-8 themselves.
For example:

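#include <iconv.h>
#include <string>
//...
// Converts input from the encoding named by type (e.g. "GB18030") to UTF-8.
// Returns 0 on success, -1 on failure.
int SometypeToUTF8(const std::string& input, std::string& output, const std::string& type)
{
   iconv_t cd = iconv_open("UTF-8", type.c_str()); // arguments are (tocode, fromcode)
   if (cd == (iconv_t)-1) return -1;
   output.assign(input.size() * 4 + 4, '\0'); // generous upper bound for the UTF-8 output
   char* in = const_cast<char*>(input.data()); // glibc's iconv takes char**, some platforms const char**
   char* out = &output[0];
   size_t inLen = input.size(), outLen = output.size();
   size_t rc = iconv(cd, &in, &inLen, &out, &outLen); // decode the source bytes and re-encode them as UTF-8
   iconv_close(cd);
   if (rc == (size_t)-1) return -1;
   output.resize(output.size() - outLen); // drop the unused tail of the buffer
   return 0;
}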
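The same conversion can be sketched in Python, where the codec machinery plays the role of iconv (the helper name to_utf8 is hypothetical and not part of any library):

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from source_encoding to UTF-8."""
    # Decode to code points first, then re-encode as UTF-8.
    return data.decode(source_encoding).encode("utf-8")


gb_bytes = "中文".encode("gb18030")
utf8_bytes = to_utf8(gb_bytes, "gb18030")

print(utf8_bytes.hex())  # e4b8ade69687
```

The result can then be handed to a UTF-8-only JSON writer.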

@fseasy
Author

fseasy commented Sep 2, 2019

OK, I did as you said. But I still can't understand why we should escape the bytes to Unicode code points instead of using the raw bytes directly. Is it required by some standard?
I find it inconvenient. I just want bytes in, bytes out: no code points, no iconv.

@dota17
Member

dota17 commented Sep 10, 2019

@MeMeDa
Hi, can you give some examples using jsoncpp? Either FastWriter or StreamWriterBuilder is fine. Please also tell us your expected result.

I also ran some tests with a Chinese string, 中文有时常用GB18030编码.
My test code:

#include <iostream>
#include <sstream>
#include <string>
#include <json/json.h>

int main() {
	const std::string uni = "中文有时常用GB18030编码"; // utf-8 (source file saved as UTF-8)
	std::string styled;
	{
		Json::Value v;
		v["abc"] = uni;
		styled = v.toStyledString();
	}
	Json::Value tmp;
	Json::FastWriter writer;
	{
		JSONCPP_STRING errs;
		std::istringstream iss(styled);
		// Parse the styled output back into a Json::Value.
		bool ok = parseFromStream(Json::CharReaderBuilder(), iss, &tmp, &errs);
		if (!ok) {
			std::cerr << "errs: " << errs << std::endl;
		}
		std::cout << "ori string: " << uni << std::endl;
		std::cout << "asString: " << tmp["abc"].asString() << std::endl;
		std::cout << "write: " << writer.write(tmp) << std::endl;
	}
	return 0;
}

The actual result:

ori string: 中文有时常用GB18030编码
asString: 中文有时常用GB18030编码
write: {"abc":"\u4e2d\u6587\u6709\u65f6\u5e38\u7528GB18030\u7f16\u7801"}

When I used StreamWriterBuilder to write, the result was the same.

You wrote: "The original Api, like FastWriter, just return the raw bytes,"

FastWriter did not return the raw bytes when processing non-ASCII characters.
The behavior of Json::Value's asString method and the writer's write method is inconsistent.
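For comparison, the escaped output in the test result above and the original string are the same JSON value, so a conforming parser recovers the original text. A quick check with Python's standard json module (used here only as an independent parser, not as a statement about jsoncpp internals):

```python
import json

original = "中文有时常用GB18030编码"
# The writer output from the test above, copied as a Python string literal.
written = '{"abc":"\\u4e2d\\u6587\\u6709\\u65f6\\u5e38\\u7528GB18030\\u7f16\\u7801"}'

parsed = json.loads(written)
same_text = parsed["abc"] == original
print(same_text)  # True

# Python's default writer (ensure_ascii=True) can be compared against it as well.
compact = json.dumps({"abc": original}, separators=(",", ":"))
same_output = compact == written
print(same_output)
```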

@fseasy
Author

fseasy commented Sep 18, 2019

Thanks for following up.
About FastWriter not returning the raw bytes for non-ASCII characters: I saw a different result because I'm using an old version of jsoncpp, maybe 1.8.4 (I've tried several versions, including 0.x).


In the code you posted, the line

const std::string uni = "中文有时常用GB18030编码"; // utf-8

is the key point. The bytes of uni may not be encoded in UTF-8 (GB18030, for example, is another popular encoding for Chinese), and then the *Writer classes escape them incorrectly.

I know the best approach currently is to convert other encodings to UTF-8 first, but sometimes I just need the raw bytes and not the escaped Unicode code points, and the extra work hurts efficiency.
So I hope the *Writer classes can provide an option that keeps the original bytes in the result and disables escaping, like Python's json:

json.dumps(json_data, ensure_ascii=False)

which keeps the raw bytes in the result.
Thanks again 🚀
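To put a rough number on the overhead mentioned here, the following Python sketch (standard library only; the sample string is arbitrary) compares the serialized size of an escaped string with the raw forms in UTF-8 and GB18030:

```python
import json

text = "中文"

# Escaped form: pure ASCII, six bytes per escaped character.
escaped = json.dumps(text).encode("ascii")

# Raw forms: the characters are written through, then encoded.
raw_utf8 = json.dumps(text, ensure_ascii=False).encode("utf-8")
raw_gb = json.dumps(text, ensure_ascii=False).encode("gb18030")

print(len(escaped), len(raw_utf8), len(raw_gb))  # 14 8 6
```

For these two characters, each escape costs six ASCII bytes, against three bytes in UTF-8 and two in GB18030 (the remaining two bytes are the surrounding quotes).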

@dota17
Member

dota17 commented Oct 11, 2019

@MeMeDa
Hi,
PR #1045 adds an emitUTF8 option.
I think it may help in the UTF-8 case, where you don't want the output escaped to Unicode code points.
You can review it.

@dota17
Member

dota17 commented Oct 18, 2019

PR #1045 for emitUTF8 has been merged.
If the problem persists, feel free to reopen.

@dota17 dota17 closed this as completed Oct 18, 2019