Lexical grammar


In computer science, a lexical grammar is a formal grammar defining the syntax of tokens. A program is written using characters drawn from the character set defined by the lexical structure of the language; this character set plays the same role as the alphabet of a written language. The lexical grammar specifies how a character sequence is divided into subsequences, each of which represents an individual token. It is frequently defined in terms of regular expressions.
For instance, the lexical grammar of many programming languages specifies that a string literal starts with a " character and continues until a matching " is found, that an identifier is an alphanumeric sequence, and that an integer literal is a sequence of digits. In the character sequence "abc" xyz1 23, the tokens are therefore a string, an identifier and a number, because the space character terminates the sequence of characters forming the identifier. Further, certain sequences are categorized as keywords. These generally have the same form as identifiers but are categorized separately; formally, they have a different token type.
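The division into tokens described above can be sketched in Python. This is a minimal illustration, not the lexer of any particular language: the token names, the patterns, the keyword set and the `tokenize` function are all assumptions made for the example.

```python
import re

# Hypothetical token specification; names and patterns are illustrative only.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"'),                     # quote, non-quotes, quote
    ("NUMBER", r"[0-9]+"),                      # one or more digits
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),  # no leading digit
    ("WHITESPACE", r"\s+"),                     # separates tokens, then discarded
]
KEYWORDS = {"if", "else", "while"}  # assumed keyword set

# Combine the rules into one alternation of named groups.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Split text into (token_type, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = match.lastgroup
        lexeme = match.group()
        pos = match.end()
        if kind == "WHITESPACE":
            continue
        # Keywords have the form of identifiers but a different token type.
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens


print(tokenize('"abc" xyz1 23'))
# [('STRING', '"abc"'), ('IDENTIFIER', 'xyz1'), ('NUMBER', '23')]
```

Matching keywords with the identifier rule and reclassifying them afterwards reflects the point that keywords share the form of identifiers and differ only in token type.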

Examples

Regular expressions for common lexical rules follow.
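Unescaped string literal (quote, followed by non-quotes, followed by quote):
"[^"]*"
Escaped string literal (quote, followed by a sequence of escaped characters or non-quotes, followed by quote):
"(\\.|[^\\"])*"
Integer literal:
[0-9]+
Decimal integer literal (no leading zero):
[1-9][0-9]*|0
Hexadecimal integer literal:
0[Xx][0-9A-Fa-f]+
Octal integer literal:
0[0-7]+
Identifier:
[A-Za-z_][A-Za-z0-9_]*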
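Rules of this kind can be checked against sample lexemes with Python's `re` module. The sketch below is illustrative only: the patterns are assumed C-style forms of the rules listed in this section, and the `RULES` table and `matching_rules` helper are hypothetical names introduced for the example.

```python
import re

# Assumed C-style patterns for the rules in this section (illustrative only).
RULES = {
    "unescaped_string": r'"[^"]*"',
    "escaped_string": r'"(\\.|[^\\"])*"',
    "integer": r"[0-9]+",
    "decimal_integer": r"[1-9][0-9]*|0",
    "hex_integer": r"0[Xx][0-9A-Fa-f]+",
    "octal_integer": r"0[0-7]+",
    "identifier": r"[A-Za-z_][A-Za-z0-9_]*",
}


def matching_rules(lexeme):
    """Return the names of all rules that match the entire lexeme."""
    return [name for name, pattern in RULES.items() if re.fullmatch(pattern, lexeme)]


print(matching_rules("42"))        # ['integer', 'decimal_integer']
print(matching_rules("007"))       # ['integer', 'octal_integer']
print(matching_rules("0x1F"))      # ['hex_integer']
print(matching_rules('"a\\"b"'))   # ['escaped_string']
print(matching_rules("xyz1"))      # ['identifier']
```

Several rules can match the same lexeme (for example, "42" is both an integer and a decimal integer literal), so a practical lexer also needs a policy, such as rule priority or longest match, for choosing among them.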