Since boost 1.72 spirit::unicode::char_ fails to parse non-ASCII #678

Open
timo-schluessler opened this issue May 20, 2021 · 13 comments

@timo-schluessler

Sample code to reproduce the issue:

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test ⏳\"");
   qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;

   return 0;
}

Thanks to sehe, who bisected the issue down to commit 16159fb.
Maybe this behavior is intentional, in which case I simply don't understand the purpose and meaning of spirit::unicode and BOOST_SPIRIT_UNICODE.

@Kojoley
Collaborator

Kojoley commented May 20, 2021

  1. The code contains implementation-defined behavior ([lex/1.1]). Please rewrite it without unicode characters in the source code.
  2. Do you intentionally use std::string with unicode? That will encode it with the execution character set.
  3. '"' is not a unicode parser, which is probably the root cause of your issue.

@timo-schluessler
Author

timo-schluessler commented May 21, 2021

  1. Sorry, I didn't know this was implementation-defined. In the real program the string is read from a file that is encoded in UTF-8. Please find the updated example below.
  2. Yes. I would like to process the UTF-8 as plain 8-bit chars as the file format itself (say the control characters) is exclusively made up of ASCII characters, like the " in the example. The fields/values in the format though may contain any UTF-8 character. (And because of the way UTF-8 encodes unicode characters no single byte could be falsely interpreted as a valid ASCII control char.)
  3. Do you have an idea how to fix this?

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

int main()
{
   typedef std::string::const_iterator iterator_type;
   namespace qi = boost::spirit::qi;
   namespace unicode = boost::spirit::unicode;

   std::string input("\"Test \xe2\x8f\xb3\"");
   qi::rule<iterator_type, std::string(), unicode::space_type> quoted_string = qi::lexeme['"' >> +(unicode::char_ - '"') >> '"'];

   iterator_type iter = input.begin();
   iterator_type end = input.end();
   std::string output;
   bool r = phrase_parse(iter, end, quoted_string, unicode::space, output);

   if (r && iter == end)
      std::cout << "successfully parsed " << input << " to " << output << std::endl;
   else
      std::cout << "failed to parse " << input << std::endl;

   return 0;
}

Edit: Typo.

@Kojoley
Collaborator

Kojoley commented May 21, 2021

> Yes. I would like to process the UTF-8 as plain 8-bit chars as the file format itself (say the control characters) is exclusively made up of ASCII characters, like the " in the example. The fields/values in the format though may contain any UTF-8 character. (And because of the way UTF-8 encodes unicode characters no single byte could be falsely interpreted as a valid ASCII control char.)

This sounds like a duplicate of #675

> Do you have an idea how to fix this?

Use a UTF-8 to UTF-32 conversion iterator; if that does not help, try replacing '"' with unicode::lit('"').

@3dyd
Contributor

3dyd commented May 23, 2021

I suppose this is not a duplicate. While the reason is basically the same (and it is well described here), mine is about character classification checks, and this one happens before that, when both Qi and X3 implicitly convert a value of a signed type to boost::uint32_t while calling unicode::ischar:

static bool
ischar(char_type ch)
{
// unicode code points in the range 0x00 to 0x10FFFF
return ch <= 0x10FFFF;
}

(0xE2 char becomes 0xFFFFFFE2 uint32_t)

For example, standard::ischar accounts for the possibility of signed source values this way:

static bool
ischar(int ch)
{
// uses all 8 bits
// we have to watch out for sign extensions
return (0 == (ch & ~0xff) || ~0 == (ch | 0xff)) != 0;
}

@timo-schluessler your example would work if you used the standard encoding instead of unicode (simply change unicode to standard).

Also, as it turns out, my issue with the standard encoding has already been fixed for Qi (6821c82). So in Boost 1.76+ (where this fix landed) you can also use qi::standard::alpha and the other character classification parsers to match ASCII markup (assuming the default C locale is used).

@Kojoley
Collaborator

Kojoley commented May 23, 2021

Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported. I have been researching the problem and prototyping a solution (most recently when reviewing #655/#649), but there are behavior choices with no 'one size fits all' answer.

@timo-schluessler
Author

Thanks for your replies. unicode::lit() does not exist, but using standard instead of unicode fixes my issue. If I understand correctly: if the grammar itself only uses ASCII, I use standard; if the grammar contained special characters, I would have to use unicode and also work with wide characters. Does that make sense?

@Trigve

Trigve commented Jun 4, 2021

I was also hit by this while upgrading to boost 1.76.0.

I'm still a bit puzzled after reading the documentation, as there is minimal info about the character encoding namespaces. What's the difference between standard, standard_wide and unicode?

As I see it, standard uses char and standard_wide uses wchar_t. But can the parsed string be in any encoding when using standard?

@Kojoley
Collaborator

Kojoley commented Jun 4, 2021

> I'm still a bit puzzled after reading the documentation, as there is minimal info about the character encoding namespaces. What's the difference between standard, standard_wide and unicode?

Obviously the difference is encoding.

> As I see it, standard uses char and standard_wide uses wchar_t. But can the parsed string be in any encoding when using standard?

  • standard uses the standard classification functions from the ctype.h header, which use the global C locale, which in turn defines the encoding (J.4 Locale-specific behavior).
  • standard_wide uses the standard classification functions from the wctype.h header; the encoding is implementation-defined unless the __STDC_ISO_10646__ macro is defined (J.3.4 Characters).
  • unicode is UTF-32.

@hhaoao

hhaoao commented May 7, 2022

How do I print a unicode _attr? I get this error:

error: invalid operands to binary expression ('std::ostream' (aka 'basic_ostream<char>') and 'std::vector<char32_t>')

#define BOOST_SPIRIT_X3_UNICODE
#include <boost/spirit/home/x3.hpp>
#include <iostream>

auto f1 = [](auto& ctx){std::cout << _attr(ctx) << std::endl;};

x3::rule<class tree, ast::tree> const tree = "tree";

auto const tree_def =
    lexeme[+(char_ -(eol))][f1]
    >> int_
    ;

@tdauth

tdauth commented Sep 23, 2024

Any updates on this issue? Will this be fixed or is there any workaround? I use code like this:

typedef std::istreambuf_iterator<byte> IteratorType;
typedef boost::spirit::multi_pass<IteratorType> ForwardIteratorType;

ForwardIteratorType first = boost::spirit::make_default_multi_pass(IteratorType(istream));
ForwardIteratorType last;

// used for backtracking and more detailed error output
namespace classic = boost::spirit::classic;
typedef classic::position_iterator2<ForwardIteratorType> PositionIteratorType;
PositionIteratorType position_begin(first, last);
PositionIteratorType position_end;

try
{
	if (!client::parse(position_begin, position_end, this->sections()))
	{
		throw Exception(_("Parsing error."));
	}
}

...

typedef char byte;
typedef std::basic_istream<byte> InputStream;

Does this mean I have to change my char into some UTF8 type now to have a basic_istream for UTF8? While the design decision might make sense, it breaks older code.

@saki7
Contributor

saki7 commented Sep 26, 2024

@tdauth I would suggest porting your code to X3 and using the char-related parsers provided in the correct namespaces, as @Kojoley mentioned: #678 (comment)

Also: (quoting from #678 (comment))

> Mixing unicode with non-unicode parsers, using unicode parsers on non-unicode input, or non-unicode parsers on unicode input is not supported.

I feel that the problem in this issue is thoroughly explained here, and I think that there's no actual issue in Spirit's implementation.

As a sidenote:

  • I'm using X3's char parsers from the unicode namespaces, along with std::u32string::const_iterator, and they're working fine without any encoding-related issues.
  • I'm a native Japanese speaker who uses CJK (aka multibyte) characters on a daily basis, and I am quite familiar with the unicode-related implementation failures in western OSS. The unicode implementation of Spirit.X3 is solid, and it can certainly handle UTF-32 iterators.

@tdauth

tdauth commented Oct 2, 2024

So for UTF-8 files I could simply use std::u8string and char8_t instead of std::string and char in Boost Spirit 2? I don't know yet how to migrate to X3, so I am looking for the easiest solution.

@saki7
Contributor

saki7 commented Oct 3, 2024

@tdauth Yes, you should pass unicode iterators to Spirit.
Again, I would strongly recommend using X3, since Spirit.Qi is no longer actively maintained.
Feel free to open a new issue with your specific code, if you have further questions.
