Handling Associativity and Precedence in Handwritten Parsers

Introduction

Building parsers for programming or domain-specific languages (DSLs) involves many challenges, one of the most crucial being the correct handling of associativity and operator precedence. While these concepts may seem esoteric at first glance, they play a fundamental role in parsing expressions and ensuring the correct interpretation of code.

Imagine you’re writing code for a simple calculator application. You input an expression like “2 + 3 * 4” and expect the result to be 14, following the standard rules of arithmetic. However, without proper handling of associativity and precedence in the parser, the result might be unexpected. For instance, if the parser interprets the expression as “(2 + 3) * 4”, the result would be 20, as the addition is evaluated before the multiplication. This discrepancy can lead to incorrect behavior in the calculator application, frustrating users and undermining the reliability of the software.

  • Associativity defines the order in which operators of the same precedence level are evaluated.
  • Precedence determines the order in which different operators are evaluated.

For example, in the expression “2 + 3 * 4”, the multiplication operator has higher precedence than the addition operator, so it should be evaluated first to produce the correct result.
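To make the difference concrete, here are the two possible groupings of 2 + 3 * 4, written as nested tuples (the simplified AST form this article adopts later); this is purely illustrative:

# The two possible groupings of "2 + 3 * 4" as nested tuples:
correct = ('+', 2, ('*', 3, 4))   # '*' binds tighter: 2 + (3 * 4) = 14
wrong   = ('*', ('+', 2, 3), 4)   # '+' grouped first: (2 + 3) * 4 = 20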

In this series of articles, we’ll explore the importance of handling associativity and precedence in parsers, and how different parsing techniques address these challenges. We’ll illustrate these concepts using examples from everyday programming scenarios, demonstrating their practical relevance to software development. By understanding and implementing proper associativity and precedence handling in parsers, developers can ensure the accuracy and reliability of their language processing tools, paving the way for robust and error-free software systems.

Prerequisites

Before diving into parsing techniques, let’s start by building a few tools to be used throughout our experiments. First, we will develop a standard lexer to be used with all the parsers we will develop and experiment with in this series. The lexer will take our input text and output a single token for each call to its get_next_token() method. Each of our parsers will take our standard lexer as a constructor parameter and call its get_next_token() method to retrieve tokens. Our parsers will output a simplified Abstract Syntax Tree (AST), which we can view directly; the exception is the Shunting Yard parser, which outputs a Reverse Polish Notation (RPN) expression, so we will provide a parser method to convert the RPN expression to an AST. Our simplified AST will consist of groupings of tuples for nodes. We will cover the AST and a Pretty Printer for the AST in a moment.

Language Grammar

Extended Backus-Naur Form (EBNF) and Backus-Naur Form (BNF) are formal notations used to describe the syntax of programming languages and other formal languages. BNF was introduced by John Backus and Peter Naur in the late 1950s as a notation for describing the syntax of the Algol programming language. It consists of a set of production rules that define how valid sentences in the language can be constructed. Each production rule consists of a non-terminal symbol on the left-hand side and a sequence of terminals and/or non-terminals on the right-hand side, representing valid patterns in the language. EBNF extends BNF with additional features such as repetition, optional elements, and grouping, making it more expressive and easier to read. For example, in BNF, a production rule for a simple arithmetic expression might look like this:

expression ::= term | expression '+' term

figure: BNF Production Rule

Where “expression” and “term” are non-terminal symbols, and “+” is a terminal symbol representing addition. Note that plain BNF has no repetition operator; repetition must be expressed through recursion, as in the rule above, where an expression is either a single term or an expression plus a term.

In EBNF, the same rule could be written more concisely as:

expression = term { '+' term }

figure: EBNF Production Rule

Where “{” and “}” indicate zero or more repetitions of the enclosed elements. By understanding these grammar notations, developers can effectively describe the syntax of programming languages and design parsers to analyze and interpret code written in those languages.

Our simple expression language will include arithmetic operations for addition, subtraction, multiplication, division, exponentiation, and parenthesized expressions. Additionally, we will only handle integers; however, adding support for floats should be an easy exercise for anyone even vaguely familiar with parsing. While this may not seem like much of a language, I assure you it includes more complexity than one might first realize, and it has sufficient complexity for our purposes.

Here’s the EBNF grammar for the expression language we’ve been discussing. This grammar defines the syntax of the language using a formal notation:

expression ::= term { ( "+" | "-" ) term } 
term ::= factor { ( "*" | "/" ) factor } 
factor ::= power_expr 
power_expr ::= unary_expr | unary_expr "^" power_expr 
unary_expr ::= ( "+" | "-" )? primary_expr 
primary_expr ::= INTEGER | "(" expression ")"

figure 1: EBNF Expression Parser Grammar

In this grammar:

  • expression represents a mathematical expression, which consists of one or more terms separated by addition or subtraction operators.
  • term represents a term in the expression, which consists of one or more factors separated by multiplication or division operators.
  • factor simply refers to power_expr, which includes both unary expressions and the possibility of a power operation.
  • power_expr production handles the power operation recursively, allowing for expressions like 2^3^4 to be parsed correctly, associating the power operation from right to left.
  • unary_expr represents a unary expression, which can be a positive or negative number or a primary expression.
  • primary_expr represents a primary expression, which can be an integer or an expression enclosed in parentheses.

This grammar captures the basic structure of arithmetic expressions, including support for addition, subtraction, multiplication, division, exponentiation, and parenthesized expressions. It will be used as a basis for building various parsers for our expression language.
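As a quick illustration of why the right associativity of “^” matters, consider 2^3^4 in the same tuple-AST form (illustrative only):

# Right associativity of '^' illustrated with tuple ASTs:
right = ('^', 2, ('^', 3, 4))   # 2^(3^4) -- what this grammar produces
left  = ('^', ('^', 2, 3), 4)   # (2^3)^4 -- what a left-associative rule would give
# 2**(3**4) == 2417851639229258349412352, while (2**3)**4 == 4096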

If you are unfamiliar with language grammars like BNF, EBNF, ABNF, ANTLR, and others, I suggest you familiarize yourself with them. You can find some helpful info on Wikipedia, tomassetti.me, and A Practical Tutorial on Context Free Grammars.

Railroad Diagrams

In addition to formal grammars like EBNF, I find railroad diagrams very helpful for visualizing a language grammar and as a visual aid when coding the parser. My favorite railroad diagramming tool is an online tool located at bottlecaps.de/rr/ui. However, after many years of using this tool, it recently became unavailable in my area. A little investigating revealed that in my area our ISP is not yet supporting IPv6 and the bottlecaps site is now IPv6 only. However, the tool’s author, Gunther Rademacher, has the tool available on Github.com, and another fan with the same issue as myself has put the tool up on his IPv4-only site at rr.red-dove.com/ui.

If you search for “railroad diagramming tools” you will find many more tools. Most of these tools do not use EBNF, or at least not as defined by John and Peter. It seems everyone has their own idea about what BNF and EBNF should look like! Even the W3C came up with their own version of EBNF. This version, however, is very similar to the original, and it is what is used by Gunther’s tool. It’s easy to write a small Python script to convert between original EBNF and the W3C’s version.
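For example, the following rough sketch converts simple rules from the original style to the W3C style. It is deliberately naive (it assumes one rule per line and no nested braces), so treat it as a starting point rather than a finished tool:

import re

# Naive EBNF -> W3C EBNF converter sketch: rewrite the defining symbol
# '=' as '::=' and turn repetition braces '{ x }' into '( x )*'.
def ebnf_to_w3c(line):
    line = re.sub(r'^(\s*\w+)\s*=', r'\1 ::=', line)         # defining symbol
    line = re.sub(r'\{\s*([^{}]*?)\s*\}', r'( \1 )*', line)  # repetition
    return line

print(ebnf_to_w3c("expression = term { '+' term }"))
# prints: expression ::= term ( '+' term )*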

Below is the grammar in W3C format:

production ::= expression*
expression   ::= term ( ( '+' | '-' ) term )*
term         ::= factor ( ( '*' | '/' ) factor )*
factor       ::= power_expr
power_expr   ::= unary_expr | unary_expr '^' power_expr
unary_expr   ::= ('+' | '-')? primary_expr
primary_expr ::= INTEGER | '(' expression ')' 

figure 2: EBNF Grammar using W3C format

And here is the pretty little diagram from Gunther’s Railroad diagramming tool:

Railroad Diagram
figure 3: Railroad diagram of the expression grammar

Hint: I usually have a railroad diagram open on a second monitor when coding my parsers as it is very easy to mentally convert the railroad diagram to code. You can easily see the if branches and while loops in the diagram.
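For instance, the repetition in the expression rule of the diagram translates almost mechanically into a loop. Here is a minimal sketch of that mapping (it foreshadows the recursive descent parser developed later in this article, so the method names are the same ones used there):

# The rule  expression ::= term { ('+' | '-') term }  becomes a while loop;
# alternatives become if/elif branches, and optional parts become a single if.
def expression(self):
    node = self.term()
    while self.current_token.type in ('PLUS', 'MINUS'):
        op = '+' if self.current_token.type == 'PLUS' else '-'
        self.advance()
        node = (op, node, self.term())
    return node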

The Lexer

The lexer (also called a scanner) scans over the input stream of characters and assembles the characters into tokens. This is akin to scanning over a long list of characters, separating them into words, and labeling them as nouns, verbs, and adjectives.

graph TD;
    A[Input Expression] --> B[Lexer];
    B --> C[Token Stream];

figure 4: Lexer IPO (Input-Process-Output) Diagram

The lexer scans the input expression character by character and produces a stream of tokens. Each token represents a meaningful unit of the expression, such as numbers, identifiers, operators, and keywords. Here’s our token definition:

file: lexer.py

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

The token needs to communicate the token type (often called “kind” to avoid clashes with keywords in the development language) and the token value. Other helpful information may also be placed in the token, such as its location in the input stream, which is often used for error reporting. Because I want to keep the code as clean and simple as possible, I will limit our error handling to a few exceptions. In most “real” parsers you will want to handle errors in a manner that allows you to report the error, recover, and continue parsing. Here, for simplicity, we will just report and exit.
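For illustration, a slightly fuller token might carry its position as well. This is a hedged sketch only; the lexer below omits the pos field to stay simple:

# A token that also records where it came from, for better error messages.
# (Illustrative only; not used by the lexer in this series.)
class Token:
    def __init__(self, type, value, pos=None):
        self.type = type    # the token's kind, e.g. 'INTEGER' or 'PLUS'
        self.value = value  # the literal value, e.g. 7 or '+'
        self.pos = pos      # index into the input text where the token began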

Now let’s look at the complete lexer code: (file: lexer.py)

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __str__(self):
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        # Guard against empty input; None signals end of input.
        self.current_char = self.text[self.pos] if self.text else None

    def advance(self):
        self.pos += 1
        if self.pos < len(self.text):
            self.current_char = self.text[self.pos]
        else:
            self.current_char = None

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        while self.current_char is not None:
            if self.current_char.isspace():
                self.skip_whitespace()
                continue
            if self.current_char.isdigit():
                return Token('INTEGER', self.integer())
            if self.current_char == '+':
                self.advance()
                return Token('PLUS', '+')
            if self.current_char == '-':
                self.advance()
                return Token('MINUS', '-')
            if self.current_char == '*':
                self.advance()
                return Token('MULTIPLY', '*')
            if self.current_char == '/':
                self.advance()
                return Token('DIVIDE', '/')
            if self.current_char == '^':
                self.advance()
                return Token('POWER', '^')
            if self.current_char == '(':
                self.advance()
                return Token('LPAREN', '(')
            if self.current_char == ')':
                self.advance()
                return Token('RPAREN', ')')
            self.error()
        return Token('EOF', None)

    def error(self):
        raise Exception('Invalid character')

if __name__ == '__main__':
    while True:
        text = input('calc> ')
        lexer = Lexer(text)
        token = lexer.get_next_token()
        while token.type != 'EOF':
            print(token)
            token = lexer.get_next_token()

As stated earlier, the lexer is responsible for organizing the input characters into meaningful chunks or words. Simple handwritten lexers usually look at a character in the input stream and use this character to determine what type(s) of token(s) will likely be produced from it. For example, if the next character to process is a digit, then the next token to be produced is likely some type of number value, such as an integer or floating point value. Computer languages (programming languages and DSLs) often impose rules that limit language constructs to specific characters. This is why most languages forbid an identifier from starting with a digit. In some cases a lexer may need to look ahead a character or two to determine the type of the next token. This is usually referred to as peeking at the next character. Likewise, for some languages the parser may need to peek at the next token or two to determine the proper language construct being used. If you’ve seen the terms LL(1) or LL(k), the k value is simply the number of look-ahead symbols the parser uses to make a parsing decision.
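Our expression lexer never needs to look ahead, but a peek helper is a common addition to the Lexer class. Here is a minimal sketch (the method name is an assumption; it is not part of the lexer above):

# Look at the next character without consuming it.
def peek(self):
    peek_pos = self.pos + 1
    if peek_pos < len(self.text):
        return self.text[peek_pos]
    return None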

Lexer Description

The lexer is responsible for tokenizing input text into meaningful units called tokens. It works by scanning the input text character by character and identifying tokens based on predefined patterns.

  • The advance method moves the lexer’s position to the next character in the input text.
  • The skip_whitespace method skips over any whitespace characters, since whitespace carries no meaning in our language.
  • The integer method parses and returns an integer value by reading consecutive digits from the input text.
  • The get_next_token method is the main entry point of the lexer. It iterates through the input text and returns the next token encountered.
  • The error method is called when the lexer encounters an invalid character in the input text, signaling an error.

Often some special token types will be defined to aid in processing the input text. These are usually used to signal the end of input or to carry special error information.

Let’s see how the lexer processes the input stream: 1 + 2 * 3. Examine the Callgraph below to understand how the lexer works.

graph TD;
    A["Input: '1 + 2 * 3'"] --> B["get_next_token()"] --> |Token: INTEGER, Value: 1| C['1'];
    C --> D["advance()"] --> E["get_next_token()"] --> |Token: PLUS, Value: +| F['+'];
    F --> G["advance()"] --> H["get_next_token()"] --> |Token: INTEGER, Value: 2| I['2'];
    I --> J["advance()"] --> K["get_next_token()"] --> |Token: MULTIPLY, Value: *| L['*'];
    L --> M["advance()"] --> N["get_next_token()"] --> |Token: INTEGER, Value: 3| O['3'];
    O --> P["advance()"] --> Q["get_next_token()"] --> |Token: EOF, Value: None| R['EOF'];

figure 5: Lexer Callgraph for Expression 1 + 2 * 3

This callgraph demonstrates how the lexer processes the input stream “1 + 2 * 3”. Each step corresponds to a call to the get_next_token() method, which returns the next token encountered in the input stream. The arrows represent the flow of execution through the lexer, from reading the input characters to generating tokens. Finally, the lexer encounters the end of the input stream, indicated by the EOF token. We will use this lexer for all the different parser implementations to follow.
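If you run lexer.py directly and type the same expression, the output should look something like this:

calc> 1 + 2 * 3
Token(INTEGER, 1)
Token(PLUS, '+')
Token(INTEGER, 2)
Token(MULTIPLY, '*')
Token(INTEGER, 3)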

AST Construction

Once the input expression has been parsed using any of the available techniques, most of the parsers construct an Abstract Syntax Tree (AST) to represent the hierarchical structure of the expression. The odd man out is the Shunting Yard parser. This parser returns an RPN-formatted expression, so I’ve added a to_ast() method to that parser to convert the RPN expression to an AST.
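The conversion itself is a small stack exercise. Here is a minimal sketch of the idea (the real to_ast() method appears with the Shunting Yard parser later in this series, so the function name and details here are assumptions):

# Build a tuple AST from an RPN token list using a stack: operands are
# pushed; an operator pops its two operands and pushes a new node.
def rpn_to_ast(rpn):
    stack = []
    for tok in rpn:
        if tok in ('+', '-', '*', '/', '^'):
            right = stack.pop()
            left = stack.pop()
            stack.append((tok, left, right))
        else:
            stack.append(tok)
    return stack.pop()

print(rpn_to_ast([1, 2, 3, '*', '+']))  # ('+', 1, ('*', 2, 3))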

graph TD;
    A[Parsed Expression] --> B[AST Construction];
    B --> C[Abstract Syntax Tree];

figure 6: Abstract Syntax Tree IPO Diagram

The AST captures the syntactic structure of the expression, with each node representing an operator or operand. The shape of the AST reflects the precedence and associativity of operators in the expression. Lower-precedence operators appear closer to the root of the tree, while higher-precedence operators are deeper (further from the root) in the tree. Discussing precedence can be confusing in computer science as our trees are depicted upside down, i.e.: the root is at the top.

Note: In the context of Abstract Syntax Trees (ASTs) and parse trees, the terms “higher” and “lower” precedence, as well as tree “depth,” take on a different meaning due to the way parsers traverse the tree structure. In this context, “higher” precedence refers to nodes that are deeper in the tree, further away from the root. When parsers descend into the tree to evaluate expressions or execute algorithms, they typically start at the root and move downwards towards the leaves. Therefore, nodes that are deeper in the tree, or have a higher depth, are processed first, followed by nodes closer to the root. Conversely, nodes closer to the root have lower depth and are processed later in the parsing or evaluation process. This understanding is crucial for parsers and algorithms that rely on tree traversal to correctly interpret and evaluate expressions or perform other operations on tree structures. I have heard these terms used in reverse, and for trees in general that usage may be correct. Because this can cause confusion, it is worth clarifying these terms when conversing with others.

Tree Pretty Printer

To visualize the AST and better understand its structure, we can implement a tree printer. This tool traverses the AST recursively and prints each node along with its children in an indented “pretty” format.

graph TD;
    A[Abstract Syntax Tree] --> B[Tree Pretty Printer];
    B --> C[Formatted Tree];

figure 7: AST Pretty Print IPO Diagram

The tree pretty printer helps in debugging parsers and verifying that the AST has been constructed correctly. It provides insights into how operators and operands are grouped together based on precedence and associativity.

Here is our AST Pretty Printer Code: (file: ast_pretty_printer.py)

class AST:
    def __init__(self, root):
        self.root = root

    def pretty_print(self, node=None, indent=0, prefix='', is_last_child=True):
        # Start from the root on the initial call.
        if node is None:
            node = self.root

        if isinstance(node, tuple):
            if node:
                # node[0] is the operator/label; node[1] and node[2] are the children.
                print(prefix + ('└───' if is_last_child else '├───'), node[0])
                if len(node) > 1 and node[1] is not None:
                    if len(node) > 2 and node[2] is not None:
                        # A right sibling follows, so this child is not the last.
                        self.pretty_print(node[1], indent + 4, prefix + ('    ' if is_last_child else '│   '), False)
                    else:
                        self.pretty_print(node[1], indent + 4, prefix + ('    ' if is_last_child else '│   '), True)
                if len(node) > 2 and node[2] is not None:
                    self.pretty_print(node[2], indent + 4, prefix + ('    ' if is_last_child else '│   '), True)
            else:
                print(prefix + ('└───' if is_last_child else '├───') + " (Empty Tuple)")
        else:
            # Leaf node: a bare value such as an integer.
            print(prefix + ('└───' if is_last_child else '├───'), node)

    def __str__(self):
        return str(self.root)

    def __repr__(self):
        return self.__str__()

# Example usage:
if __name__ == '__main__':
    tree = ('Root', ('Left', ('Left', None, None), None), ('Right', None, None))
    ast = AST(tree)
    ast.pretty_print()

    empty_tree = ()
    empty_ast = AST(empty_tree)
    empty_ast.pretty_print()

    print(ast)

And here is an example of how it will be used with the parsers we develop later in this series:

# Shunting Yard Example
shunting_yard = ShuntingYard('3 + 4 * (2 - 1)')
postfix_expression = shunting_yard.shunting_yard()
print('Postfix Expression:', postfix_expression)

# Pratt Parsing Example
pratt_parser = PrattParser('3 + 4 * (2 - 1)')
ast = pratt_parser.parse()
print('AST:')
AST(ast).pretty_print()
...

We will use the ast_pretty_printer.py script to display the AST in a nice, easy-to-read format. Here is how an AST appears as it is output from the parser. This AST is for the expression: 7 + (3 + -5) * 12 / 2^3 + 29.

Simple AST:  ('+', ('+', 7, ('/', ('*', ('+', 3, ('-', 0, 5)), 12), ('^', 2, 3))), 29)

figure 8: Simple AST using Python Tuples

And here is how it looks after being printed by our pretty printer script:

AST
└─── +
    ├─── +
    │   ├─── 7
    │   └─── /
    │       ├─── *
    │       │   ├─── +
    │       │   │   ├─── 3
    │       │   │   └─── -
    │       │   │       ├─── 0
    │       │   │       └─── 5
    │       │   └─── 12
    │       └─── ^
    │           ├─── 2
    │           └─── 3
    └─── 29

figure 9: Pretty Printed AST

The AST pretty printer isn’t part of our lexer/parser system. It’s just a tool used to visualize the output of the parser. When writing parsers, it is important to have a toolset that allows you to easily visualize and confirm that the AST has been properly generated. It is also very helpful to have a method of verifying the tokens generated by the lexer. In this case, I built the lexer as a separate module and limited it to producing a single token for each call to get_next_token(). This allows us to easily verify each token as it is generated, simply by printing them out.

Now that we have a lexer and an AST printer, we are ready to move on to our first expression parsing algorithm. So next up, the Recursive Descent parser!

Recursive Descent Expression Parser

Recursive Descent parsing is a top-down parsing technique commonly used to build parsers for programming languages or domain-specific languages (DSLs). It follows the structure of the rules defined in the EBNF grammar. In this approach, the parser is organized into methods, each corresponding to a non-terminal symbol in the grammar. These methods recursively parse the input tokens according to the grammar rules, starting from the highest level of abstraction down to the individual tokens.

Below is the callgraph for parsing an expression using a recursive descent parser:

graph TD
    parse --> parseExpression;
    parseExpression --> term;
    parseExpression --> PLUS;
    term --> factor;
    term --> MULTIPLY;
    factor --> INTEGER;
    factor --> LPAREN;

figure 10: RDP Callgraph for Expression 1 + 2 * 3

In this call graph:

  • parseExpression is the entry point for parsing an expression.
  • term represents parsing the term part of the expression, which involves parsing factors.
  • factor represents parsing factors, which can be either integers or sub-expressions enclosed in parentheses.
  • PLUS represents the addition operation.
  • MULTIPLY represents the multiplication operation.
  • INTEGER represents parsing integer literals.
  • LPAREN represents encountering a left parenthesis, indicating the start of a sub-expression.

This call graph illustrates the flow of function calls and recursive nature of parsing expressions using a recursive descent parser.

In our recursive descent expression parser below, we define methods for parsing expressions, terms, power expressions, unary expressions, and factors. The expression() method handles the highest level of abstraction, representing arithmetic expressions. It calls the term() method to parse terms, which works its way down through power_expr() and unary_expr() to the factor() method, which parses integers and parenthesized sub-expressions. This recursive structure mirrors the grammar rules, where expressions consist of terms, which ultimately consist of factors.

To handle operator precedence and associativity, the parser ensures that higher-precedence operators are parsed deeper in the call chain than lower-precedence ones. This is achieved by giving each precedence level its own method: expression() handles the lowest-precedence operators (addition and subtraction) and delegates to term(), which handles the higher-precedence multiplication and division, and so on down to exponentiation, unary signs, and finally the primary factors.

The parser builds an Abstract Syntax Tree (AST) as it parses the input expression. The AST represents the hierarchical structure of the expression, with each node in the tree corresponding to an operation or operand. For example, the expression 1 + 2 * 3 would produce the AST ('+', 1, ('*', 2, 3)), where the top-level node represents addition (+), with 1 as the left operand and the result of the multiplication (('*', 2, 3)) as the right operand.
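To see how the tree’s shape drives evaluation, here is a minimal evaluator sketch for this tuple form (an illustration only; it is not part of the parser code):

# Evaluate a tuple AST: children first, then the operator at the node.
def evaluate(node):
    if not isinstance(node, tuple):
        return node  # a bare integer leaf
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == '+': return a + b
    if op == '-': return a - b
    if op == '*': return a * b
    if op == '/': return a / b
    if op == '^': return a ** b
    raise ValueError('Unknown operator: ' + op)

print(evaluate(('+', 1, ('*', 2, 3))))  # prints 7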

The following diagram represents the Abstract Syntax Tree (AST) for parsing the expression 1 + 2 * 3:

graph TD;
    Expr-- "+" -->Term1;
    Term1-- "1" -->INTEGER1;
    Term1-- "*" -->Factor;
    Factor-- "2" -->INTEGER2;
    Factor-- "3" -->INTEGER3;

figure 11: AST of parsed expression 1 + 2 * 3

In this AST:

  • Expr represents the top-level expression.
  • Term1 represents the first term in the expression.
  • INTEGER1, INTEGER2, and INTEGER3 represent integer literals.
  • Factor represents the second term, which is a multiplication operation.

The + and * nodes represent addition and multiplication operations, respectively.

This diagram illustrates the hierarchical structure of the expression 1 + 2 * 3 as an AST, where the multiplication operation has higher precedence than addition. As a result, 2 * 3 is evaluated before adding 1, because the multiplication node is placed deeper (further from the root) in the tree. Contrast this with the following diagram for the expression 1 * 2 + 3.

For the expression 1 * 2 + 3, the AST would look like this:

graph TD;
    Expr-- "+" -->Term2;
    Term2-- "*" -->Factor;
    Factor-- "1" -->INTEGER1;
    Factor-- "2" -->INTEGER2;
    Expr-- "3" -->INTEGER3;

figure 12: Tree built from parsing expression 1 * 2 + 3

In this AST:

  • Expr represents the top-level expression.
  • Term2 represents the first term in the expression.
  • INTEGER1 and INTEGER2 represent integer literals.
  • Factor represents the multiplication operation that forms the first term.

The + and * nodes represent addition and multiplication operations, respectively.

This diagram illustrates the hierarchical structure of the expression 1 * 2 + 3 as an AST, where the multiplication operation is evaluated before addition due to precedence rules. The change in the order of operators does not alter their precedence, but it does affect the shape of the tree. In the expression 1 * 2 + 3, where multiplication comes first in the source text, the multiplication node becomes the left child of the addition node; in 1 + 2 * 3 it becomes the right child. In both cases the multiplication node sits deeper in the tree (further from the root) than the addition node, so it is evaluated first. Higher-precedence operations end up deeper in the tree and are evaluated earlier, while lower-precedence operations end up closer to the root and are evaluated later. This adherence to precedence ensures that expressions are evaluated correctly according to the rules of arithmetic and operator precedence.
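In the tuple form, the two trees look like this (illustrative only):

# Same operators and operands, different order in the source text:
ast1 = ('+', ('*', 1, 2), 3)  # 1 * 2 + 3 -- '*' is the left child of '+'
ast2 = ('+', 1, ('*', 2, 3))  # 1 + 2 * 3 -- '*' is the right child of '+'
# In both trees '*' sits deeper than '+', so it is evaluated first.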

Now let’s look at the parser code: (file: parser_rdp.py)

from lexer import Lexer, Token

class ParserError(Exception):
    pass

class RecursiveDescentParser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.current_token = None
        self.advance()

    def advance(self):
        self.current_token = self.lexer.get_next_token()

    def expression(self):
        result = self.term()
        while self.current_token.type in ('PLUS', 'MINUS'):
            token = self.current_token
            self.advance()
            if token.type == 'PLUS':
                result = ('+', result, self.term())
            elif token.type == 'MINUS':
                result = ('-', result, self.term())
        return result

    def term(self):
        result = self.power_expr()
        while self.current_token.type in ('MULTIPLY', 'DIVIDE'):
            token = self.current_token
            self.advance()
            if token.type == 'MULTIPLY':
                result = ('*', result, self.power_expr())
            elif token.type == 'DIVIDE':
                result = ('/', result, self.power_expr())
        return result

    def power_expr(self):
        result = self.unary_expr()
        # '^' is right-associative, so recurse rather than loop:
        # 2^3^4 parses as ('^', 2, ('^', 3, 4)).
        if self.current_token.type == 'POWER':
            self.advance()
            result = ('^', result, self.power_expr())
        return result

    def unary_expr(self):
        token = self.current_token
        if token.type == 'PLUS':
            self.advance()
            return self.factor()
        elif token.type == 'MINUS':
            self.advance()
            return ('-', 0, self.factor())
        else:
            return self.factor()

    def factor(self):
        token = self.current_token
        if token.type == 'INTEGER':
            self.advance()
            return token.value
        elif token.type == 'LPAREN':
            self.advance()
            result = self.expression()
            if self.current_token.type != 'RPAREN':
                raise ParserError(f"SyntaxError: Unexpected token: {self.current_token}")  # SyntaxError("Expected ')'")
            self.advance()
            return result
        else:
            raise ParserError(f"SyntaxError: Unexpected token: {self.current_token}")  # SyntaxError("Invalid syntax")

    def parse(self):
        result = self.expression()
        if self.current_token.type != 'EOF':
            raise ParserError(f"SyntaxError: Unexpected token: {self.current_token}")
        return result

if __name__ == '__main__':
    while True:
        text = input('calc> ')
        if not text:
            continue
        lexer = Lexer(text)
        parser = RecursiveDescentParser(lexer)
        result = parser.parse()
        print(result)
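A quick session with the parser should look something like this (the second expression shows the right-associative power operator):

calc> 1 + 2 * 3
('+', 1, ('*', 2, 3))
calc> 2^3^2
('^', 2, ('^', 3, 2))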

Overall, recursive descent parsing is a powerful and flexible technique for building parsers, particularly suited for handwritten parsers where control over parsing decisions and error handling is important.

Conclusion

In conclusion, building parsers for programming languages or DSLs is a multifaceted endeavor that requires careful consideration of various aspects such as parsing techniques, error handling, extensibility, optimization, tooling support, and testing. By addressing these considerations thoughtfully, developers can create parsers that meet the needs of their target languages and provide a seamless development experience for users.

Next time we will look at another expression parsing technique, the Shunting Yard Algorithm by Dijkstra.

Resources

Here are additional resources that you may find helpful for understanding and implementing parsers, lexers, and related topics:

  1. Source Code: The source code for this article can be found on GitHub
  2. Books:
    • “Parsing Techniques: A Practical Guide” by Dick Grune and Ceriel J.H. Jacobs: This book provides a comprehensive introduction to parsing techniques, covering both theory and practical implementation aspects.
    • “Compiler Construction: Principles and Practice” by Kenneth C. Louden: This book covers various topics related to compiler construction, including parsing, lexical analysis, and syntax-directed translation.
  3. Online Courses:
    • Coursera: “Compilers” by Stanford University: This course covers fundamental concepts in compiler design, including lexical analysis, parsing, and code generation.
    • Udemy: “Compiler Design: Theory and Practice” by Udemy Instructor: This course provides a practical approach to learning compiler design concepts, including hands-on implementation of parsers and lexers.
  4. Tutorials and Articles:
    • “Building an Interpreter” series by Ruslan Spivak: This series of articles provides a step-by-step guide to building an interpreter for a simple programming language, covering topics such as parsing, lexical analysis, and evaluation.
    • “Let’s Build A Simple Interpreter” by Ruslan Spivak: This article series walks through the process of building a simple interpreter for a programming language, starting from lexical analysis and parsing.
  5. Tools and Libraries:
    • ANTLR (ANother Tool for Language Recognition): ANTLR is a powerful parser generator that supports multiple programming languages and provides advanced features for building parsers and lexers.
    • Ply (Python Lex-Yacc): Ply is a Python implementation of lex and yacc parsing tools, which allows for the creation of parsers and lexers in Python using familiar syntax.
  6. Online Resources:
    • Parsing Wiki: This wiki provides extensive resources and information on parsing techniques, algorithms, and tools, including articles, tutorials, and research papers.
    • Compiler Design and Construction Resources: Various universities and institutions offer online resources and lecture notes on compiler design and construction, covering topics such as lexical analysis, parsing, and code generation.

These resources should provide a comprehensive understanding of parsers, lexers, and related topics, and assist you in building robust language processing tools.
