Tree Rewriting And Shunting Yard Parsers

This entry is part 2 of 2 in the series Handling Associativity And Precedence in Handwritten Parser

Post Stastics

  • This post has 2917 words.
  • Estimated read time is 13.89 minute(s).

Introduction

Last time we discussed our mission, built a lexer and tree printer to be used throughout our experiments, and introduced the Recursive decent parser. Parsing mathematical expressions involves interpreting their structure, which can be complex due to the presence of operators with different precedence levels and associativity rules. In this article series, we delve into the intricacies of handling operator precedence and associativity in handwritten parsers. In this second installment, we explore the  techniques of Tree Rewriting and the Shunting Yard algorithm to address these challenges effectively.

Handling Operator Precedence and Association by Tree Rewriting

Mathematical expressions often contain operators like addition, subtraction, multiplication, division, and exponentiation, each with its own precedence and associativity rules. Before diving into the implementation details, it’s crucial to understand the problem at hand. Consider an expression like “3 + 4 * 2 / (1 – 5) ^ 2 ^ 3”. To accurately evaluate this expression, we must ensure that the operators are applied in the correct order, respecting their precedence and associativity.

To address this challenge, we’ll explore the concept of tree rewriting. By manipulating the abstract syntax tree (AST) of the expression, we can reorder nodes to enforce the desired precedence and associativity. This approach allows us to transform the initial AST into a modified version that accurately reflects the evaluation order of operators. Additionally, we’ll discuss the Shunting Yard algorithm, which converts infix expressions into postfix notation, facilitating the evaluation process.

The Initial AST

Before we delve into the rewriting process, let’s visualize the initial AST for the expression “3 + 4 * 2 / (1 – 5) ^ 2 ^ 3”:

graph TD
    A[+] --> B[3]
    A --> C[*]
    C --> D[4]
    C --> E["/"]
    E --> F[2]
    D --> G["( )"]
    G --> H[-]
    H --> I[1]
    H --> J[5]
    E --> K[^]
    K --> L[2]
    K --> M[^]
    M --> N[3]

figure 1: This initial AST reflects the structure of the expression before considering operator precedence and associativity.

The Rewriting Process

To handle operator precedence and associativity, we’ll rewrite the AST. Let’s examine the steps involved.

Step 1: Identify Nodes to Rewrite

Traverse the AST and identify nodes where the precedence needs adjustment. For example, in the expression “4 * 2”, multiplication has higher precedence than addition, so the AST node representing this operation needs adjustment.

Step 2: Perform Tree Rewriting

Perform the necessary adjustments to ensure that operators with higher precedence are evaluated first. For example, rewrite the AST node representing “4 * 2” to ensure it takes precedence over addition.

Step 3: Repeat Until No Adjustments Needed

Repeat the process until no further adjustments are required, ensuring that the AST accurately reflects the operator precedence and associativity rules of the expression.

The Final AST

After the rewriting process, the AST for the expression “3 + 4 * 2 / (1 – 5) ^ 2 ^ 3” would look like this:

graph TD
    A[+] --> B[3]
    A --> C["/"]
    C --> D[*]
    D --> E[4]
    D --> F[2]
    C --> G["( )"]
    G --> H[-]
    H --> I[1]
    H --> J[5]
    A --> K[^]
    K --> L[2]
    K --> M[^]
    M --> N[3]

figure 2: The rewritten tree demonstrating the proper precedence order

Understanding the RewriteParser Code

The provided code implements a parser called RewriteParser that parses mathematical expressions and constructs an abstract
syntax tree (AST). The parser employs a technique known as tree rewriting to adjust the AST to ensure correct operator
precedence and associativity. Let’s delve deeper into how this parser works, what it does, and compare it to other parsing
techniques, particularly other top-down parsers.

Here’s the code for our simple tree rewriting expression parser (parser_rewrite.py):

from Part_02.src.lexer import Lexer
from Part_02.src.ast_pretty_printer import AST

class RewriteParser:
    def __init__(self, lexer):
        self.lexer = lexer
        self.current_token = None
        self.advance()

    def advance(self):
        self.current_token = self.lexer.get_next_token()
        if self.current_token.type == "EOF":
            return ("EOF")

    def parse(self):
        return self.expression()

    def peek(self):
        # Save current position
        temp_pos = self.lexer.pos
        # Peek at the next token without consuming it.
        # This will advance the lexer position.
        tok = self.lexer.get_next_token()
        # Now restore previous lexer position
        self.lexer.pos = temp_pos
        # And return the peeked token
        return tok

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            raise ValueError("Unexpected token")

    def expression(self):
        result = self.term()

        while self.current_token.type in ('PLUS', 'MINUS'):
            token = self.current_token
            self.eat(token.type)
            if token.type == 'PLUS':
                result = ('+', result, self.term())
            elif token.type == 'MINUS':
                result = ('-', result, self.term())
            else:
                if self.peek() == 'INTEGER':
                    result = ('-', result, self.power())
                else:
                    result = ('-', result, self.term())

        return result

    def term(self):
        result = self.factor()

        while self.current_token.type in ('MULTIPLY', 'DIVIDE'):
            token = self.current_token
            self.eat(token.type)
            if token.type == 'MULTIPLY':
                result = ('*', result, self.factor())
            elif token.type == "DIVIDE":
                result = ('/', result, self.factor())
            elif token.type == 'POWER':
                result = ('^', result, self.factor())

        return result

    def factor(self):
        result = self.power()

        while self.current_token.type == 'POWER':
            self.eat('POWER')
            result = ('^', result, self.factor())

        return result

    def power(self):
        token = self.current_token

        if token.type == 'LPAREN':
            self.eat('LPAREN')
            result = self.expression()
            self.eat('RPAREN')
            return result
        elif token.type == 'INTEGER':
            self.eat('INTEGER')
            return int(token.value)
        elif token.type == "MINUS":
            self.eat('MINUS')
            return ('-', 0, self.power())
        elif token.type == "PLUS":
            self.eat('PLUS')
            return self.power()
        elif token.type == "EOF":
            return ('EOF')

        raise ValueError("Invalid syntax")

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            raise ValueError("Unexpected token")

    def fix_precedence(self, tree):
        precedence = {'^': 3, '*': 2, '/': 2, '+': 1, '-': 1}

        if len(tree) == 3 and tree[0] in precedence:
            left, operator, right = tree
            if isinstance(left, tuple) and left[0] in precedence and precedence[left[0]] > precedence[operator]:
                tree[1], tree[0] = tree[0], tree[1]
                tree[0] = self.fix_precedence(tree[0])

            if isinstance(right, tuple) and right[0] in precedence and precedence[right[0]] > precedence[operator]:
                tree[2], tree[1] = tree[1], tree[2]
                tree[2] = self.fix_precedence(tree[2])

        return tree

if __name__ == "__main__":
    import ast_pretty_printer

    #expr = "+3 + -14 * 2 / (1 - 5) ^ 27 ^ 3"
    expr = "2 3" # 2 + 3
    lexer = Lexer(expr)
    parser = RewriteParser(lexer)
    ast = parser.parse()
    printer = AST(ast)
    print("AST")
    printer.pretty_print()
    print(ast)
    # for inputs like "2 3" each expression (value) will be returned on additional calls to the parser.
    ast = parser.parse()
    printer = AST(ast)
    print("AST")
    printer.pretty_print()
    print(ast)
  1. What the Code Does:

The RewriteParser class initializes with a lexer object, which is responsible for tokenizing the input mathematical expression. The parse method initiates the parsing process by calling the expression method. The expression, term, factor, and power methods implement recursive descent parsing to construct the AST. The parser handles various operators such as +, -, *, /, and ^ along with parentheses to represent expression grouping. After constructing the initial AST, the fix_precedence method is called to rewrite the tree, ensuring correct operator precedence and associativity. Finally, the parser prints the AST using the AST class’s pretty_print method.

  1. How It Works:

The parser employs recursive descent parsing, a top-down parsing technique, to construct the AST.
As the parser traverses the input expression, it identifies and handles different tokens (operators, parentheses, integers) based on their grammar rules. After constructing the initial AST, the fix_precedence method recursively adjusts the tree to ensure that operators with higher precedence are evaluated first. The adjustment is made by comparing the precedence of operators in the tree and reorganizing the tree structure as needed. Comparison with Other Parsing Techniques

Benefits of Tree Rewriting Parser

Simplicity – The Tree Rewriting Parser is relatively simple and easy to understand, making it suitable for parsing simple mathematical expressions. Explicit Control: Developers have explicit control over the parsing process and can easily extend the parser to handle additional operators or tokens.

Correctness – By employing tree rewriting, the parser ensures correct operator precedence and associativity, leading to accurate evaluation of expressions.

Liabilities of Tree Rewriting Parser

Limited Expressiveness – The parser may be limited in its ability to handle more complex grammars or expressions compared to other parsing techniques like LR parsers.

Performance Overhead – Tree rewriting adds additional processing overhead, especially for large or deeply nested expressions, potentially impacting performance.

Comparison with Other Top-Down Parsers

The Tree Rewriting Parser, using recursive descent parsing, is a type of top-down parser. Both approaches start from the top of the grammar and work their way down to construct the parse tree. Tree Rewriting Parser offers more explicit control over the parsing process, making it easier to understand and extend for smaller grammars and expressions. In contrast, top-down parsers like LL parsers can handle more complex grammars efficiently but may require more complex machinery to implement.

Summary

The Tree Rewriting Parser provides a simple and effective solution for parsing mathematical expressions while ensuring correct operator precedence and associativity through tree rewriting. While it may not be as powerful as other parsing techniques for handling complex grammars, it offers simplicity, control, and correctness for simpler parsing tasks. Developers can leverage Tree Rewriting Parser for small-scale parsing tasks where explicit control over the parsing process is desired.

Shunting Yard Parser

The Shunting Yard algorithm, devised by Edsger Dijkstra, is a method for parsing mathematical expressions specified in infix notation. It converts the infix expression into a postfix (Reverse Polish Notation) form, which can then be evaluated efficiently. Shunting Yard is particularly useful for handling operator precedence.

Infix notation is a form of mathematical notation that places the operator in between the operands as in 1 + 2. Infix is what most people use every day. Prefix notation (Polish notation) places the operator before the operands ex: + 1 2 You may have seen this style used in the Lisp programming language or on some early calculators. There is also a Postfix notation (Reverse Polish Notation or RPN) that places the operator after the operands: 1 2 +.

Each notation has its own advantages and use cases. Infix notation is the most intuitive and commonly used by humans, but it can be ambiguous without the use of parentheses to clarify precedence. Prefix and postfix notations eliminate the need for parentheses and have a consistent and unambiguous structure, making them preferable for use in computer algorithms. Among the three, postfix notation is the most easily used by computers due to its simplicity and ease of evaluation using stack-based algorithms like the shunting yard algorithm. The shunting yard algorithm, as well as other parsing and
evaluation algorithms, typically operate on expressions in postfix notation due to its straightforward and efficient processing nature.

graph TD;
    A[Infix Expression] --> B[Shunting Yard Algorithm];
    B --> C[Postfix Expression];
    C --> D[to_AST];

figure 3: Shunting Yard IPO Diagram

The algorithm uses a stack to keep track of operators and operands while processing the input expression. It maintains the correct order of operators based on their precedence and associativity. Here’s a brief overview of how the Shunting Yard algorithm is used to convert the infix expression to the RPN format:

  1. Initialize an output queue for the RPN expression and an operator stack.
  2. Iterate through each token in the infix expression from left to right.
  3. If the token is an operand (number), add it to the output queue.
  4. If the token is an operator:
    • While the operator stack is not empty and the top operator has higher precedence than the current token (or equal precedence with left associativity), pop operators from the stack and add them to the output queue.
    • Push the current token onto the operator stack.
  5. If the token is a left parenthesis ‘(‘, push it onto the operator stack.
  6. If the token is a right parenthesis ‘)’:
    • Pop operators from the stack and add them to the output queue until a left parenthesis ‘(‘ is encountered (but do not add the ‘(‘ to the output queue).
    • Pop and discard the left parenthesis ‘(‘ from the stack.
  7. If there are no more tokens to read, pop any remaining operators from the stack and add them to the output queue.
  8. The output queue now contains the infix expression in RPN format.

These steps ensure that operators are arranged in the correct order of precedence and associativity in the RPN expression, ready for evaluation.

graph TD;
    Start(Start) --> A(Get next token);
    A --> |If INTEGER| B{Append to output};
    B --> |If yes| C(Append to output);
    B --> |If no| B;
    C --> A;
    A --> |If operator| D{Check operator precedence};
    D --> |If precedence higher| E{Pop operators to output};
    D --> |If precedence lower| D;
    E --> F{Push current operator to stack};
    F --> A;
    E --> A;
    A --> |If LPAREN| G{Push LPAREN to stack};
    G --> A;
    A --> |If RPAREN| H{Pop operators until LPAREN};
    H --> I{Discard LPAREN};
    I --> A;
    A --> |If EOF| J{Pop remaining operators};
    J --> K{Append to output};
    K --> A;
    A --> End(End);

figure 4: Shunting Yard Algorithm Implementation

Here is our implementation: (file: parser_shunting_yard.py).

from lexer import Lexer

class ShuntingYard:
    def __init__(self, lexer):
        self.lexer = lexer
        self.output = []
        self.operator_stack = []

    def precedence(self, op):
        precedences = {'+': 1, '-': 1, '*': 2, '/': 2, '^': 3}
        return precedences.get(op, 0)

    def parse(self):
        while True:
            token = self.lexer.get_next_token()
            if token.type == 'INTEGER':
                self.output.append(token.value)
            elif token.type in ('PLUS', 'MINUS', 'MULTIPLY', 'DIVIDE', 'POWER'):
                while (self.operator_stack and
                       self.precedence(self.operator_stack[-1]) >= self.precedence(token.value)):
                    self.output.append(self.operator_stack.pop())
                self.operator_stack.append(token.value)
            elif token.type == 'LPAREN':
                self.operator_stack.append(token.value)
            elif token.type == 'RPAREN':
                while self.operator_stack[-1] != '(':
                    self.output.append(self.operator_stack.pop())
                self.operator_stack.pop()
            elif token.type == 'EOF':
                while self.operator_stack:
                    self.output.append(self.operator_stack.pop())
                break
        return self.output

    def to_ast(self):
        stack = []

        for token in self.output:
            if token in ('+', '-', '*', '/', '^'):
                right = stack.pop()
                if token in ('+', '-'):
                    left = stack.pop() if stack else 0  # Handle unary operators
                    stack.append((token, left, right))
                else:
                    left = stack.pop()
                    stack.append((token, left, right))
            else:
                stack.append((token))
        return stack[0]  # Return the first element of the stack

if __name__ == '__main__':
    while True:
        text = input('calc> ')
        lexer = Lexer(text)
        parser = ShuntingYard(lexer)
        result = parser.parse()
        #print(result)
        print(parser.to_ast())

The Shunting Yard algorithm doesn’t produce an AST directly, but we can easily convert the postfix expression to an AST using a stack. See the _to_ast()_ method for details.

Conclusion

In this installment, we’ve explored the technique of tree rewriting and the Shunting Yard algorithm to handle operator precedence and associativity in handwritten parsers. By adjusting the structure of the abstract syntax tree (AST) and converting infix expressions to postfix notation, we can ensure that mathematical expressions are parsed and evaluated correctly. Both techniques provide powerful mechanisms for enforcing precedence rules and associativity, making them valuable tools in parsing complex expressions. In the next installment, we’ll delve deeper into other parsing techniques and compare their strengths and weaknesses.

Certainly! Here’s a resource section with a list of materials for learning more about tree rewriting parsers and Shunting Yard parsers:

Resources

Resources for Tree Rewriting Parsers:

  1. Source Code – for this article can be found on GitHub.
  2. Parsing Techniques – A Practical Guide (2nd Edition)

    By Dick Grune, Ceriel J.H. Jacobs, et al.
    This comprehensive guide provides a detailed overview of parsing techniques, including tree rewriting. It covers various parsing algorithms and their applications in practical scenarios.

  3. Crafting Interpreters: A Handbook for Making Programming Languages

    By Robert Nystrom
    Although primarily focused on interpreter construction, this book delves into the implementation of parsers and explores tree rewriting techniques for handling operator precedence and associativity.

  4. Compilers: Principles, Techniques, and Tools (2nd Edition)
    By Alfred V. Aho, Monica S. Lam, et al.
    Widely regarded as the “Dragon Book,” this classic textbook covers compiler design principles, including parsing techniques such as recursive descent and operator precedence parsing, which are foundational for understanding tree rewriting.

Resources for Shunting Yard Parsers:

  1. The Shunting Yard Algorithm

    By Edsger W. Dijkstra
    This seminal paper by Edsger Dijkstra introduces the Shunting Yard algorithm for converting infix expressions to postfix notation. It provides a thorough explanation of the algorithm’s principles and its significance in parsing and evaluating mathematical expressions.

  2. The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition)

    By Donald E. Knuth
    In this volume of his renowned series, Donald Knuth discusses various parsing algorithms, including the Shunting Yard algorithm. It provides detailed insights into the algorithm’s implementation and its role in expression parsing.

  3. Introduction to Algorithms (3rd Edition)
    By Thomas H. Cormen, Charles E. Leiserson, et al.
    This comprehensive textbook covers fundamental algorithms used in computer science, including parsing algorithms like the Shunting Yard. It offers clear explanations and pseudocode implementations for better understanding.

Online Resources:

  1. Parsing Expressions by Recursive Descent

    Recursive Decent Explained
    This online resource provides a detailed explanation of recursive descent parsing, which is often used in conjunction with tree rewriting techniques. It includes examples and code snippets for implementation.

  2. Shunting Yard Algorithm – Wikipedia

    Shunting Yard Algorithm
    The Wikipedia page on the Shunting Yard algorithm offers a concise overview of the algorithm, its history, and its applications. It also provides pseudocode and implementation examples for reference.

  3. Compiler Design – GeeksforGeeks
    Compiler Design
    GeeksforGeeks’ compiler design section offers tutorials and articles on various parsing techniques, including Shunting Yard parsing. It provides practical insights and examples to aid understanding.

These resources should provide a solid foundation for learning about Tree Rewriting and Shunting Yard parsers, and related parsing techniques.

Series Navigation<< Handling Associativity and Precedence in Handwritten Parsers

Leave a Reply

Your email address will not be published. Required fields are marked *