Discussion:
[protobuf] Spec v2 int-lit snafu?
Michael Powell
2018-11-11 03:16:03 UTC
Hello,

I think 0 ought to count as a decimal-lit, don't you? However, the spec
reads as follows:

intLit = decimalLit | octalLit | hexLit
decimalLit = ( "1" … "9" ) { decimalDigit }
octalLit = "0" { octalDigit }
hexLit = "0" ( "x" | "X" ) hexDigit { hexDigit }

Is there a semantic reason why a decimal literal must be greater
than 0? And that is before accounting for the optional plus/minus sign
that the constant rule adds.

Of course, when parsing, the order of the alternatives matters, much as
it does for the escape character phrases in the string-literal:

hex-lit | oct-lit | dec-lit

And so on, since you have to rule out 0[xX][0-9a-fA-F]+ for hex first,
followed by 0[0-7]* for octal ...

Actually, now that I look at it, "0" (really, "decimal" 0) is lurking
in the oct-lit phrase.
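
To make the point concrete, here is a minimal Python sketch of the three rules tried in the order hex | oct | dec. The regular expressions are my own transcription of the intLit grammar quoted above, and the name classify_int_lit is hypothetical. Because the sketch matches the whole string, the three rules are disjoint and the order does not change the result; the order only matters for a parser that commits to the first alternative that matches a prefix.

```python
import re

# Alternatives listed in the order hex | oct | dec, as in the thread.
# Each pattern is a transcription of one rule of the quoted intLit grammar.
_INT_LIT_KINDS = [
    ("hexLit", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("octalLit", re.compile(r"0[0-7]*")),
    ("decimalLit", re.compile(r"[1-9][0-9]*")),
]


def classify_int_lit(text):
    """Return the name of the first rule that matches all of text, or None."""
    for kind, pattern in _INT_LIT_KINDS:
        if pattern.fullmatch(text):
            return kind
    return None
```

Under these patterns a bare "0" is classified as octalLit, never as decimalLit, which is exactly the case being asked about.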

Kind of a grammatical nit-pick, I know, but I just wanted to be clear
here. Seems like a possible source of confusion if you aren't paying
careful attention.

Thoughts?

Best regards,

Michael Powell
--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to protobuf+***@googlegroups.com.
To post to this group, send email to ***@googlegroups.com.
Visit this group at https://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.
Josh Humphries
2018-11-11 15:55:21 UTC
For the case of zero by itself, per the spec, it will be parsed as an octal
literal with value zero -- so functionally equivalent to a decimal literal
with value zero. And for values with multiple digits, a leading zero means
it is an octal literal. Decimal values will not have a leading zero.
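
A small Python sketch of that reading, assuming the text has already been matched against the intLit grammar (the helper name int_lit_value is hypothetical and it does no strict validation of its own):

```python
def int_lit_value(text):
    """Value of an unsigned intLit that already matched the grammar."""
    if text[:2] in ("0x", "0X"):
        return int(text[2:], 16)
    if text.startswith("0"):
        # A bare "0" lands here as well: octal zero is still zero.
        return int(text, 8)
    return int(text, 10)
```

So "0" yields 0 through the octal branch, "010" yields 8 rather than 10, and "10" yields 10.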

----
*Josh Humphries*
Michael Powell
2018-11-12 15:06:40 UTC
Hello,

A follow-up question: how about the sign character for hex and octal
integers? Is it necessary, or should it be discarded?

intLit = decimalLit | octalLit | hexLit
decimalLit = ( "1" … "9" ) { decimalDigit }
octalLit = "0" { octalDigit }
hexLit = "0" ( "x" | "X" ) hexDigit { hexDigit }

constant = fullIdent | ( [ "-" | "+" ] intLit ) | ( [ "-" | "+" ] floatLit ) | strLit | boolLit

https://developers.google.com/protocol-buffers/docs/reference/proto2-spec#integer_literals
https://developers.google.com/protocol-buffers/docs/reference/proto2-spec#constant

For instance, I am fairly certain the sign is already encoded in the
bits of a hex-encoded integer (two's complement). I am not sure about
octal, but I imagine that it is fairly consistent.

Case in point, the value 107026150751750362 gets encoded as
0X17C3BB7913C48DA (upper-case), whereas its negative counterpart,
-107026150751750362, really does get encoded as 0xFE83C4486EC3B726,
sign included, if memory serves.

In these cases, I think the sign bit falls in the "optional" category?
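
The two hex strings above are the 64-bit two's-complement bit patterns of the positive and negative value. A short Python sketch that reproduces them, assuming a 64-bit width (the helper names are hypothetical):

```python
MASK64 = (1 << 64) - 1


def to_twos_complement_hex(value):
    """Format a signed integer as its 64-bit two's-complement bit pattern."""
    return "0x{:016X}".format(value & MASK64)


def from_twos_complement(bits):
    """Reinterpret a 64-bit pattern as a signed integer."""
    return bits - (1 << 64) if bits & (1 << 63) else bits
```

Note that this is about how a signed value is stored. In the grammar quoted above, hexLit itself has no sign; the optional "-" or "+" belongs to the constant rule that wraps intLit.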

Cheers, thanks,

Michael
Michael Powell
2018-11-12 17:46:52 UTC
So... As far as I can determine, there are a couple of ways to
interpret this, semantically speaking. But this potentially informs
whatever parsing stack you are using as well.

I'm using Boost Spirit Qi, for instance, which supports radix-based
integer parsing well enough, but has its own set of issues when
dealing with signage. That being said...

1. Treat the value itself as positive one way or another, with an
optional sign attribute (i.e. '+' or '-'). This would potentially
work, especially when there is base 16 (hex) or base 8 (octal)
involved.

2. Otherwise, I am open to suggestions, subject to Qi's constraints: as
far as I know, Qi fails to parse negative signed hexadecimal/octal
encoded values.
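
Option 1 can be sketched in a few lines of Python (not Qi): parse the literal as a non-negative magnitude in whatever radix it uses, and carry the sign as a separate optional attribute. The names IntConstant and parse_int_constant are hypothetical, and the sketch assumes no whitespace between the sign and the literal.

```python
import re
from typing import NamedTuple, Optional


class IntConstant(NamedTuple):
    sign: Optional[str]  # '+', '-', or None when no sign was written
    magnitude: int       # always non-negative, whatever the radix

    @property
    def value(self):
        return -self.magnitude if self.sign == "-" else self.magnitude


# [ "-" | "+" ] intLit, with intLit tried as hex | oct | dec.
_CONSTANT = re.compile(
    r"(?P<sign>[+-])?"
    r"(?:0[xX](?P<hex>[0-9a-fA-F]+)|(?P<oct>0[0-7]*)|(?P<dec>[1-9][0-9]*))"
)


def parse_int_constant(text):
    m = _CONSTANT.fullmatch(text)
    if m is None:
        raise ValueError("not an integer constant: %r" % text)
    if m.group("hex") is not None:
        magnitude = int(m.group("hex"), 16)
    elif m.group("oct") is not None:
        magnitude = int(m.group("oct"), 8)
    else:
        magnitude = int(m.group("dec"), 10)
    return IntConstant(m.group("sign"), magnitude)
```

With this split, "-0x1F" parses to sign '-' and magnitude 31, so the radix parser never has to deal with a negative hex or octal number.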

Again, this is kind of a symptom of an imprecise grammar specification.
I can get a sense for how to handle it, but does that truly capture the
"intent"?

Thanks in advance for any light that can be shed.
Michael Powell
2018-11-13 03:29:40 UTC
Got it sorted out, I believe. Actually, the parser support Spirit
provides is quite nice and aligns pretty much perfectly with the grammar
specification. There's a bit of gymnastics involved in juggling whether
the AST has a sign or not and so forth, but other than that, it flows
well enough.
Josh Humphries
2018-11-13 14:57:31 UTC
If you haven't already, take a look at descriptor.proto
<https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/descriptor.proto>
-- FileDescriptorProto
<https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/descriptor.proto#L61>
therein is basically like an AST for the proto language (and is what protoc
produces as it parses). For parsing options, and the literal values in
particular, take a look at UninterpretedOption
<https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/descriptor.proto#L701>.
Options are first parsed into this structure, and then "interpreted" into
the attributes of *Options messages in a second pass. You'll see that the
approach there includes the negation in the literal integer value but also
distinguishes between positive and negative values
<https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/descriptor.proto#L716>
in the AST.
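
As a rough illustration of that shape, here is a Python sketch with a plain dataclass standing in for the message. Only the two field names positive_int_value (uint64) and negative_int_value (int64) are taken from UninterpretedOption in descriptor.proto; the dataclass, the helper store_int_constant, and the range checks are assumptions made for this example and are not protoc's code.

```python
from dataclasses import dataclass
from typing import Optional

UINT64_MAX = (1 << 64) - 1
INT64_MIN = -(1 << 63)


@dataclass
class IntOptionValue:
    """Hypothetical stand-in for two integer fields of UninterpretedOption."""
    positive_int_value: Optional[int] = None  # uint64 in descriptor.proto
    negative_int_value: Optional[int] = None  # int64 in descriptor.proto


def store_int_constant(negative, magnitude):
    """Store a parsed sign and magnitude, negation included in the value."""
    if negative:
        value = -magnitude
        if value < INT64_MIN:
            raise OverflowError("negative value does not fit in int64")
        return IntOptionValue(negative_int_value=value)
    if magnitude > UINT64_MAX:
        raise OverflowError("positive value does not fit in uint64")
    return IntOptionValue(positive_int_value=magnitude)
```

The stored negative value already carries its sign, while which field is populated records whether a minus sign was present. Edge cases such as "-0" are not addressed by this sketch.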