
Programming Languages and Translators
02 - Syntactic Analysis

Our translator's input is a string of characters that the programmer laboriously typed on the keyboard and then saved in a file. While coding, the programmer serialized the program structure that had emerged in their mind into this string of characters.

As a first step we want to convert the string into a linear token stream of language-specific elements such as keywords and identifiers. To do this, the scanner (= lexer) passes once (!) over the character string and emits a token whenever a complete language element, which often consists of more than one character, has been read. Each token has a token class (e.g. identifier or "opening parenthesis") and sometimes a payload. In this payload, the scanner transports further information about the token to the subsequent translation stages. For example, all identifiers have the same token class and carry the characters covered in the input (the lexeme) with them as a payload (e.g. the identifier name).
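As a small illustration (the concrete token class names here are made up, not taken from the lecture), an input such as counter = counter + 42 could be scanned into a stream of token class/payload pairs like this:

# Hypothetical token stream for the input "counter = counter + 42":
tokens = [
    ["identifier", "counter"],  # payload: the lexeme
    ["assign", None],           # no payload needed
    ["identifier", "counter"],
    ["plus", None],
    ["literal", 42],            # payload: the numeric value
]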

Through the scanning process we eliminate spaces and any code formatting, we standardize different spellings of the same language element, and we significantly reduce the number of objects to be processed. This makes the translator's life much easier in what follows, because the rest of the syntax analysis can run much faster and we have already dealt with many special cases of the language. For example, we would need a much more complex parser if it had to deal with comments that can appear anywhere in the source code.

To create a scanner for a language, we basically have two options: either we code a corresponding function manually, or we derive a scanner from a set of regular expressions with the help of a generator. The manual approach is also called the ad-hoc approach because the coding as a normal program does not follow any formalism. This has the disadvantage that it is difficult to write a correct scanner by hand, but it also has the advantage that special treatments can be built in at any point.

In either case, the scanner uses two operations on the input stream: read() consumes a character from the beginning and returns it, peek() returns the same character without consuming it. Especially the latter, looking ahead (look-ahead) in the character stream, is a concept that we will encounter again in the parser. An important property of scanners and parsers is how many characters of look-ahead are required. Most languages get by with a scanner that only needs a single character of look-ahead. Fortran is a sad counterexample: http://shape-of-code.coding-guidelines.com/2009/12/20/parsing-fortran-95/.
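A minimal sketch (not part of the lecture's example code) of how a single character of look-ahead is used: to distinguish = from ==, the scanner peeks at the next character and only consumes it if it completes the longer token.

def scan_equals(stream):
    # Assumes a stream object with read()/peek() as described above.
    ch = stream.read()
    assert ch == '='
    if stream.peek() == '=':   # look ahead without consuming
        stream.read()          # it really is '==': consume the second '='
        return ["equals", None]
    return ["assign", None]    # a single '=' stays an assignment token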

In the example (complete scanner: ./lst/02-scanner.py), the function scanner_next() returns a 2-element list on each call, consisting of the token type and the payload. The function first skips all spaces and tabs before the actual token recognition starts. This sample scanner only recognizes integers, but these numbers may be written in binary (0b...), octal (leading 0), or hexadecimal (0x...) notation. For this purpose, the scanner detects whether a corresponding prefix is present (e.g. 0x for hexadecimal) and then, according to the recognized notation, reads characters from the corresponding class of permitted characters (read_many()) until an impermissible character is reached.

class ScannerError(Exception):
    pass

class Stream:
    """An (inefficient) abstraction for an input stream of characters that
    allows looking a single character into the future (look-ahead)."""
    def __init__(self, data):
        self.data = list(data)

    def peek(self):
        if len(self.data) == 0:
            return '\0'
        return self.data[0]

    def read(self):
        if len(self.data) == 0:
            return '\0'
        return self.data.pop(0)

    def read_many(self, allowed_characters):
        ret = ""
        while self.peek() in allowed_characters:
            ret += self.read()
        return ret

def scanner_next(stream):
    # Throw away spaces and tabs from the beginning
    while stream.peek() in " \t":
        stream.read()
    ch = stream.read()
    if ch == '\0':                     # end of input
        return None
    if ch == '0':                      # literal: hex, octal, or binary
        ch = stream.read()
        if ch == 'x':
            hexdigits = "0123456789abcdefABCDEF"
            chars = stream.read_many(hexdigits)
            value = int(chars, base=16)
            return ["literal", value]
        elif ch == 'b':
            chars = stream.read_many('01')
            value = int(chars, base=2)
            return ["literal", value]
        elif ch in '01234567':
            chars = ch + stream.read_many('01234567')
            value = int(chars, base=8)
            return ["literal", value]
    raise ScannerError("invalid character")

stream = Stream("0x17 027 0b10111 0777")
while True:
    token = scanner_next(stream)
    if token is None:
        break
    print(repr(token))
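Assuming the end-of-input handling shown in the reconstruction above (scanner_next() returns None once the stream is exhausted), the example run prints the following; note that 0x17, 027 and 0b10111 all denote the same value 23:

['literal', 23]
['literal', 23]
['literal', 23]
['literal', 511]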

From this example it is important to recognize that the character 0 is a common prefix, and only the second character ('x', 'b', or an octal digit) makes it clear which notation is present at this point. With each additional character read, the number of possible notations is restricted further. It is already clear after reading the first character that it can no longer be a number in decimal notation.

For the step of recognizing tokens in a stream of characters, it has turned out that falling back on formal concepts from theoretical computer science is extremely worthwhile, and that the tokens of most programming languages can be described excellently with regular expressions. Regular expressions describe a regular language and have the same expressiveness as a regular grammar. One can say that regular expressions are a formalism with which one can construct regular grammars without accidentally sliding into the next more complex language class (context-free). Therefore we want to give the ad-hoc implementation from the example a more formal background.
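As a sketch (the concrete expressions are ours, not taken from the script), the integer literals of the example scanner can be described with regular expressions and tried out with Python's re module:

import re

# One regular expression per notation. "+" means one or more repetitions;
# "*" (the Kleene star, see below) would also allow zero repetitions.
HEX    = re.compile(r"0x[0-9a-fA-F]+")
BINARY = re.compile(r"0b[01]+")
OCTAL  = re.compile(r"0[0-7]+")

print(bool(HEX.fullmatch("0xa2c")))    # True
print(bool(OCTAL.fullmatch("0xa2c")))  # False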

To begin with, we need to remember what grammars were all about; the theoretical computer science lecture was a long time ago, and the memory is already blurry. First of all, a (formal) grammar is a set of rules that describes how to get from a start symbol, via an iterative text-replacement process, to a character string. At any point in time, the string consists of a series of terminal and non-terminal symbols, and the replacement is repeated until only terminal symbols are left. The replacement itself is done by a set of rules, each of which recognizes one pattern and replaces it with another pattern. All character strings that can be derived in this way form the associated (formal) language. Risk of confusion: formal languages are different from programming languages, and they have nothing to do with a machine model. The script therefore strictly distinguishes between formal languages and (just) languages. In the following we derive the character string 0xa2c from the start symbol, line by line:

hexliteral
→ hexprefix hexdigit hexdigit*
→ 0x hexdigit hexdigit*
→ 0xa hexdigit*
→ 0xa2 hexdigit*
→ 0xa2c hexdigit*
→ 0xa2c

Depending on how the rules of a grammar are structured, grammars are divided into different classes (Wikipedia: Chomsky hierarchy). The simplest class, which has particularly desirable properties, is that of regular grammars. In this class, all rules are applied context-free, and one can recognize with a finite automaton whether a word belongs to the language (word problem). Context-free means that there is always only a single non-terminal on the left-hand side of the replacement rules, so that every non-terminal, regardless of its context, is always replaced in the same way. The ability to map regular grammars to finite automata means that it is very easy to derive a program from the grammar that solves the word problem. In doing so, we reverse the direction of the grammar and no longer generate strings but consume them. That is exactly what we need for our scanner.

However, it is not easy to prove whether a given grammar really is regular. So we approach the problem from a different angle and fall back on regular expressions to describe the tokens of our language. This formalism guarantees that every regular expression describes a regular language. Therefore a corresponding regular grammar can be derived from every regular expression, and it can never happen that a regular expression accidentally stops being regular. This guarantee is achieved by prohibiting recursive replacements: in a regular expression, no sub-expression can refer to itself again. To still allow repetition, which a regular grammar expresses through a recursive rule, the Kleene star is introduced, with which the directly preceding element can be repeated 0 to N times (see the small example below). When parsing we will use context-free grammars, which do contain recursive replacements and therefore can no longer be described with regular expressions.
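As a hedged reconstruction of the kind of inline example meant here: repetition can be expressed in a regular grammar through a (right-linear) rule that refers to itself, while a regular expression uses the Kleene star instead.

Regular grammar:       hexdigits → hexdigit hexdigits
                       hexdigits → hexdigit
Regular expression:    hexdigit hexdigit*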

To recognize a regular language, one constructs from the grammar rules a finite automaton that consumes the character string element by element. Each edge of the automaton carries a condition (guard) that specifies for which read characters the transition may be taken. If at some point the decision which transition to take is not unambiguous, the automaton is non-deterministic. When we reach an accepting state while consuming, we have recognized a word of the formal language. However, accepting states can also be left again in order to recognize even longer words of the language; in the case of hexadecimal numbers, any number of characters from the hexdigit class may still appear at the end.
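A minimal sketch (our own, not from the script) of such an automaton for hexadecimal literals; the state names and the explicit guard sets on the edges are assumptions made for illustration:

HEXDIGITS = "0123456789abcdefABCDEF"

# Each edge: (source state, guard, target state). The guard is the set of
# characters for which the transition may be taken.
EDGES = [
    ("start", "0",       "zero"),
    ("zero",  "x",       "x"),
    ("x",     HEXDIGITS, "hex"),
    ("hex",   HEXDIGITS, "hex"),   # the accepting state is left and re-entered
]
ACCEPTING = {"hex"}

def accepts(word):
    state = "start"
    for ch in word:
        for src, guard, dst in EDGES:
            if src == state and ch in guard:
                state = dst
                break
        else:
            return False           # no edge matches: the automaton fails
    return state in ACCEPTING

print(accepts("0xa2c"))  # True
print(accepts("0x"))     # False: at least one hexdigit is required
print(accepts("0b11"))   # False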

In the context of a scanner, reaching an accepting state means that we have found a valid token. However, this token could also be the prefix of a longer token: 0xa2, for example, is a valid hexadecimal number, but it is also a prefix of the word 0xa2c. For scanners we therefore always want to find the token with the maximum length (longest possible match).

Since there are different types of tokens to be recognized in a programming language, we have to combine several automata: one automaton/grammar/regular expression/language per token type. To do this, we "simply" let all automata run in parallel and feed every read character to each automaton, so that each of them makes a transition. If an automaton has no valid transition at some point, it fails (on the slides: it turns gray) and is no longer fed with characters. If one of the automata reaches an accepting state, we memorize the recognized word (also called the lexeme). We keep feeding the automata with characters until the last of them has failed and then return the last word found as a token.
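A hedged sketch of this idea: instead of literally stepping the automata character by character, we let Python's re module stand in for each individual automaton, but the bookkeeping for the longest possible match is the same. The token types are made up for illustration:

import re

TOKEN_TYPES = [
    ("hexliteral", re.compile(r"0x[0-9a-fA-F]+")),
    ("octliteral", re.compile(r"0[0-7]+")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]

def next_token(text, pos):
    # "Run" all recognizers on the remaining input and keep the longest
    # lexeme that any of them accepts (longest possible match).
    best = None                        # (token type, lexeme)
    for name, regex in TOKEN_TYPES:
        m = regex.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("0x17abc+1", 0))  # ('hexliteral', '0x17abc')
print(next_token("0777+1", 0))     # ('octliteral', '0777')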

We can either actually carry out this parallel execution of finite automata, or we can, more efficiently, construct a combined automaton: we add a further start state that reaches the start states of the individual token automata via epsilon transitions (which always makes the combined automaton non-deterministic). The equivalent deterministic automaton is then derived with the power set construction.

Back to the less theoretical aspects of scanning. The example on the slides shows a situation in which a keyword is a prefix of an identifier and, at the same time, two automata would recognize the lexeme as a valid word. In such cases we have to specify with additional rules which of the token types has priority. For programming languages, keywords must always have a higher priority than identifiers, as long as the identifier is not longer. This is why keywords cannot be used as variable names and why they are therefore often called reserved words.
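A hedged sketch of such a priority rule (the keyword set is made up): when several rules accept the same lexeme, the rule listed first wins, so the keyword rule is placed before the identifier rule.

import re

# The order of the rules encodes the priority: keywords before identifiers.
RULES = [
    ("keyword",    re.compile(r"if|while|return")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]

def classify(lexeme):
    for name, regex in RULES:
        if regex.fullmatch(lexeme):
            return name               # the first rule that fully matches wins
    return None

print(classify("if"))     # 'keyword': reserved word, not usable as a name
print(classify("iffy"))   # 'identifier': 'if' is only a prefix here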

Since the manual construction of finite automata and their combination is tedious, there are programs that generate a scanner from a set of regular expressions. One such program is flex(1). Should you later need a lexer that breaks a character string into tokens quickly and correctly, do not hesitate to reach for a scanner generator! We will learn more about the security aspects of scanners and parsers in a small excursus at the end of this lecture.
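flex itself generates C code, so here is only a Python-flavored sketch of the same idea (the token definitions and names are made up, this is not flex output): a table of regular expressions is combined into one master expression that is driven over the input.

import re

# One named group per token type; with Python's re module the first
# alternative that matches wins, so more specific rules come first.
MASTER = re.compile(r"""
      (?P<hexliteral>0x[0-9a-fA-F]+)
    | (?P<octliteral>0[0-7]+)
    | (?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
    | (?P<skip>[ \t]+)
""", re.VERBOSE)

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("invalid character at position %d" % pos)
        if m.lastgroup != "skip":      # drop whitespace, keep real tokens
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("0x17 counter 0777")))
# [('hexliteral', '0x17'), ('identifier', 'counter'), ('octliteral', '0777')]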