Implementing Stack Oriented Languages — Part 2

This entry is part 2 of 4 in the series Implementing Stack Oriented Languages

Part 2 – The Parser

It’s been some time since the first installment. It amazes me how quickly time flies by when you have so many projects at hand. So let’s jump right in and get coding!

Before we get started though, let's make some changes to our keywords file. First, we won't need the PSH (Push) keyword, since literal values are pushed onto the stack implicitly when the parser encounters them. We also won't need the ROT keyword. So remove them both.

    # Stack Operators
    'POP', 'PSH', 'SWAP', 'DUMP', 'DROP', 'ROT', 'OVER',
    'DUP', 'DUP2', 'DUP3', 'CLEAR',

We will now add a few new keywords to the file. Add SWAP2, ROL, ROR, and DUP4 to the stack operators section of the keywords file as shown below:

    # Stack Operators
    'POP', 'SWAP', 'SWAP2', 'DUMP', 'DROP', 'ROL', 'ROR', 'OVER',
    'DUP', 'DUP2', 'DUP3', 'DUP4', 'CLEAR',

The ROT keyword was going to be a stack rotation command. After some thought, however, I realized that having the ability to rotate left and right would be better. So, I decided to remove ROT and replace it with ROL and ROR for rotate left and rotate right. I’m building this language on the fly. So, there may be more changes to come.
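
Since the language is still evolving, the exact behavior of ROL and ROR isn't final. As a sketch of one possible interpretation, assume they rotate the entire stack by one position. The rol and ror helpers below are hypothetical; they operate directly on the store list of the Stack class shown at the end of this post:

def rol(stack):
    # Rotate left: the bottom value moves to the top.
    # store [1, 2, 3] (3 on top) becomes [2, 3, 1]
    stack.store.append(stack.store.pop(0))

def ror(stack):
    # Rotate right: the top value moves to the bottom.
    # store [1, 2, 3] (3 on top) becomes [3, 1, 2]
    stack.store.insert(0, stack.store.pop())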

Now that we have the changes out of the way, let’s move on to the parser. Create a new file called parser.py and add the code below:

The Parser

# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *
#######################################
# Parser                              #
#######################################
class Parser:
    pass

Our parser will ask the lexer for tokens. The lexer takes the input text and generates tokens to pass to the parser for further processing. The parser recognizes each token's type and handles it, ensuring it is an expected type given the current context of the program. For example, given the source " 4 5 + ", the lexer hands the parser a NUMBER token for 4, a NUMBER token for 5, and then a PLUS token. This all sounds rather complicated. However, once you get a feel for how to implement this, you'll find it rather easy to understand. Let's see some code. We will begin by adding an __init__ method to our parser.

...
...
#######################################
# Parser                              #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)

Notice the __init__ method takes in a lexer instance and an optional stack instance. We will need a method to get the next token in the input stream from the lexer, so add the next_token() method shown below. Notice how we test whether current_tok is an end of file (EOF) token by checking its type property. If it is EOF, we don't need to call the lexer any longer. If, however, current_tok is any other token, we call the lexer for the next token and save it in current_tok for use later.

    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()

The meat of our parser comes from the parse method. The parse method looks at the token stored in current_tok and decides how to process it. Let’s start simple. We will first parse integers. Add the following code to the parser.

    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue

Next, we will need a driver program, including a source string to parse. The following code will do fine for our first foray into parsing.

if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()

Just to ensure we're all on the same page, I'll show the complete parser file in its current state.

# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *
#######################################
# Parser                              #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)
    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()
    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()

If we’ve done everything correctly, we can run the parser and get the following output:

token: type: NUMBER, value: 1, 
                at line 1 in column 3
token: type: NUMBER, value: 2, 
                at line 1 in column 5
token: type: NUMBER, value: 3, 
                at line 1 in column 7
token: type: NUMBER, value: 5, 
                at line 1 in column 9
token: type: NUMBER, value: 4, 
                at line 1 in column 11
token: type: EOF, value: None, 
                at line 1 in column 11
Process finished with exit code 0

The next step is to figure out what to do with each token. In the case of a stack-based language, we simply take constant values and place them on the stack. So, let’s do that next.

Add the following code at the end of the parse method. We will fill the parse method with tests for each token type and code to handle each token type.

            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue

The parse method now looks like the following:

    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue

Recall that back in the __init__ method we set self.data to an empty stack. We will use this stack to store function parameters and results. Stack-oriented languages primarily operate on values stored on the stack and place the results of operations back onto the stack.

Let’s test what we have now. But first, let’s add code to the driver to print the stack contents so we can see if our data made it onto the stack. Add the following print statement to the end of the driver:

    print(parser.data) 

The driver should now look like the following:

if __name__ == '__main__':
    source = """ 1 2 3 5 4 """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
    print(parser.data)

Now, if we run the parser driver we should get the following output:

token: type: NUMBER, value: 1, 
                at line 1 in column 3
token: type: NUMBER, value: 2, 
                at line 1 in column 5
token: type: NUMBER, value: 3, 
                at line 1 in column 7
token: type: NUMBER, value: 5, 
                at line 1 in column 9
token: type: NUMBER, value: 4, 
                at line 1 in column 11
token: type: EOF, value: None, 
                at line 1 in column 11
[ '1', '2', '3', '5', '4' ]
Process finished with exit code 0

As you can see, the stack now contains the data values from our source code. Also, note the order of these values. The right-most value is in the head position of the stack and will be the first value popped off the stack.
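
You can verify the pop order directly with our Stack class:

from stack import Stack

s = Stack()
for v in (1, 2, 3, 5, 4):
    s.push(v)
print(s.pop())  # 4 -- pushed last, popped first
print(s.pop())  # 5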

The next thing to do is add some code so we can do something useful with the data on the stack. The easiest operations to code and understand are the arithmetic operations. So let’s code those next.

Arithmetic Operators

First, we will need tests for the arithmetic operator token types. Looking at our tokens in tokens.py we see that we need tests in our parser for the TT_PLUS, TT_MINUS, TT_MUL, and TT_DIV token types. These will be simple if tests.

When we find an arithmetic operator token, we handle the operation by popping the values it needs off the stack, performing the arithmetic operation, and then placing the result back on the stack. Note that the right operand is popped first, because it was pushed last; getting this order wrong would flip the results of subtraction and division.

After adding these tests the parse method should look as follows:

    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
            # Arithmetic
            elif tok.type == TT_PLUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left + right
                self.data.push(result)
            elif tok.type == TT_MINUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left - right
                self.data.push(result)
            elif tok.type == TT_MUL:
                right = self.data.pop()
                left = self.data.pop()
                result = left * right
                self.data.push(result)
            elif tok.type == TT_DIV:
                right = self.data.pop()
                left = self.data.pop()
                result = left // right
                self.data.push(result)

Let’s change the source code in the driver to:

    source = """ 4 5 + """

Now run the driver. On exit, the stack should contain the value 9, since 4 + 5 = 9. Next, change the operator from addition to subtraction and run the driver. The stack should now contain -1. Go on and test multiplication and division. If they work as expected, try some more complex problems. For example:

4 5 * 2 / 5 - 
127 16 - 2 *
197 2 * 4 /

For the above, your results should be 5, 222, and 98 respectively.
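
To see how the first of these evaluates, here is the stack after each token is processed (top of the stack on the right):

    4    -> [ 4 ]
    5    -> [ 4, 5 ]
    *    -> [ 20 ]      (4 * 5)
    2    -> [ 20, 2 ]
    /    -> [ 10 ]      (20 / 2)
    5    -> [ 10, 5 ]
    -    -> [ 5 ]       (10 - 5)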

Bitwise Operators

Next, let's add the code for raising a number to a power. This code follows the same pattern as the other arithmetic operators, so give it a try on your own before peeking at the sketch below.
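
If you want to check your work, a minimal sketch of the handler might look like this (the lexer already emits TT_POW for the '**' operator, as the final listing shows):

            elif tok.type == TT_POW:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push(left ** right)

The truth is, most of the operations we will perform follow this same pattern. So let's add the code to handle the bitwise operators: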

            # Bitwise Operations
            elif tok.type == TT_INV:
                left = self.data.pop()
                left = ~left
                self.data.push(left)
            elif tok.type == TT_XOR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left ^ right))
            elif tok.type == TT_AND:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left & right))
            elif tok.type == TT_OR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left | right))
            elif tok.type == TT_LSL:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push(left << right)
            elif tok.type == TT_LSR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left >> right))

The INV, or invert, operator simply flips all the ones in a number to zeros and all the zeros to ones. Most computers, and most computer languages, handle integers as 2's complement values. See this Computerphile episode for more info on 2's complement representation, or just trust me when I say that inverting all the bits of the value 1 in Python should result in -2.
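
To see why, look at the bit patterns (shown as 8-bit values for brevity; Python integers are arbitrary precision, but the pattern extends the same way):

     1 = 0b00000001
    ~1 = 0b11111110  = -2 (2's complement)

Now change the source in the driver to read: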

    source = """ 1 ~ """

Then run the driver. You should receive a result of -2.

Now let's try shifting 1 left by 2. The result should be 4. However, when I tried this I found a small bug in the lexer: it returned the TT_LSR token type for the '<<' operator. This can be fixed by changing the returned token type to TT_LSL.
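
The corrected '<' branch of the lexer's next_token() method looks like this (it matches the final lexer listing at the end of this post):

            elif self.current_char == '<':
                val = self.current_char
                if self.peek() == '<':
                    val += '<'
                    self.advance()
                    self.advance()
                    return Token(TT_LSL, val, self.line, self.col)  # was TT_LSR
                self.advance()
                return Token(TT_LESS, val, self.line, self.col)

Now, change the source line in the driver to read: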

    source = """ 1 2 << """

If you run the driver now you should get a result of 4. Next, change the shift distance from 2 to 3, 4, and 5. You should get results of 8, 16, and 32 respectively. If not, you have some debugging to do. Otherwise, try the other bitwise operators and ensure they work as expected. Remember that shifting a value right by 1 is the same as dividing it by 2, and shifting it right by 2 is the same as dividing it by 4.
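
Here are a few quick checks you can run through the driver (the \ line comments show the value each program should leave on the stack):

    1 2 <<     \ 4   (1 * 4)
    32 1 >>    \ 16  (32 / 2)
    32 2 >>    \ 8   (32 / 4)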

OK, we’ve covered enough ground for this post. Next time we will look at adding a front end. We will need this so that we can test our parser with real input source files. This will ease further development as we add more complex features like conditional statements, loops, variables, string handling, and input/output routines.

I hope you're enjoying this series of articles. If you want to develop your own languages, stack-oriented languages are some of the easiest to implement, and they are a great way to get your feet wet and learn the basics of language implementation.

As always, here’s the complete code for part 2 of the series:

Final Code

# parser.py
from lexer import Lexer
from stack import Stack
from tokens import *
#######################################
# Parser                              #
#######################################
class Parser:
    def __init__(self, lexer: Lexer, data_stack: Stack = None):
        self.lexer = lexer
        # Setup data stack
        if data_stack:
            self.data = data_stack
        else:
            self.data = Stack()
        self.current_tok = Token(TT_START, None)
    def next_token(self):
        if self.current_tok.type != TT_EOF:
            self.current_tok = self.lexer.next_token()
    def parse(self):
        # get token and decide what to do with it
        tok = self.current_tok
        while tok.type != TT_EOF:
            self.next_token()
            tok = self.current_tok
            print(f'token: {tok}')
            if tok.type == TT_START:
                continue
            if tok.type == TT_NUMBER:
                self.data.push(int(tok.value))
                continue
            # Arithmetic
            elif tok.type == TT_PLUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left + right
                self.data.push(result)
            elif tok.type == TT_MINUS:
                right = self.data.pop()
                left = self.data.pop()
                result = left - right
                self.data.push(result)
            elif tok.type == TT_MUL:
                right = self.data.pop()
                left = self.data.pop()
                result = left * right
                self.data.push(result)
            elif tok.type == TT_DIV:
                right = self.data.pop()
                left = self.data.pop()
                result = left // right
                self.data.push(result)
            # Bitwise Operations
            elif tok.type == TT_INV:
                left = self.data.pop()
                left = ~left
                self.data.push(left)
            elif tok.type == TT_XOR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left ^ right))
            elif tok.type == TT_AND:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left & right))
            elif tok.type == TT_OR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left | right))
            elif tok.type == TT_LSL:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push(left << right)
            elif tok.type == TT_LSR:
                right = self.data.pop()
                left = self.data.pop()
                self.data.push((left >> right))
if __name__ == '__main__':
    source = """ 16 2 >> """
    lexer = Lexer(source)
    parser = Parser(lexer)
    parser.parse()
    print(parser.data)
# keywords.py
KEYWORDS = [
    # Stack Operators
    'POP', 'SWAP', 'SWAP2', 'DUMP', 'DROP', 'ROL', 'ROR', 'OVER',
    'DUP', 'DUP2', 'DUP3', 'DUP4', 'CLEAR',
    # Conditional Operators
    'IF', 'ELSE', 'THEN', 'NOT',
    # Iterative
    'DO', 'LOOP',
    'WHILE', 'WEND',
    # I/O Operators
    'TYPE',  # Outputs a string to stdout
    'EMIT',  # Outputs a character to stdout
    'KEY',   # Takes input character from stdin
    # Special Character Values
    # See ASCII Character Table for values
    'CR', 'LF', 'NULL', 'BELL', 'BS', 'TAB',
    'VT', 'FF', 'ESC',
    # Logical Operators
    'NEQ', 'AND', 'OR',
]
# lexer.py
from keywords import KEYWORDS
from tokens import *
class Lexer:
    def __init__(self, text, line=None, col=None):
        self.text = text
        self.pos = 0
        self.line = line if line else 1
        self.col = col if col else 1
        self.current_char = self.text[0]
    #############################################
    # Utility Methods                           #
    #-------------------------------------------#
    # These methods help manage the internal    #
    # workings of the lexer.                    #
    #############################################
    def advance(self):
        """advance():
        Advances the current position in the character stream.
        """
        if not self.is_eoi():
            self.pos += 1
            self.current_char = self.text[self.pos]
            self.update_position()
        else:
            self.current_char = ''
    def update_position(self):
        """update_position
        Maintains the line and column positions for the current token scan.
        """
        if self.current_char == '\n':
            self.line += 1
            self.col = 0
        else:
            self.col += 1
    def is_eoi(self):
        """is_eoi():
        :return:
            True if current position is End Of Input stream.
            Returns False otherwise.
        """
        return self.pos >= (len(self.text) - 1)
    def expected(self, expected, found):
        """expected
        Raises an error if expected and found are different.
        """
        if expected == found:
            return
        else:
            raise ValueError(f'[Error] Expected {expected} but found {found} in line: {self.line} position: {self.col}')
    def peek(self):
        """peek
        Returns next character in input stream after the current character.
        Note: If we are at EOI (End Of Input) then a space character is returned.
        """
        if not self.is_eoi():
            return self.text[self.pos + 1]
        else:
            return ' '  # space
    #############################################
    # Recognizer Methods                        #
    #-------------------------------------------#
    # These methods collect the characters that #
    # belong to a single multi-character token  #
    # into a single unit, and return that unit  #
    # to the caller.                            #
    #############################################
    def char(self):
        if self.current_char != "'":
            self.expected("'", self.current_char)
        self.advance()  # Consume opening single quote
        char = self.current_char
        self.advance()  # Consume character
        # Now we must have a closing single quote mark
        if self.current_char != "'":
            self.expected("'", self.current_char)
        self.advance()  # Consume closing single quote
        return char
    def string(self):
        if self.current_char != '"':
            self.expected('"', self.peek())
        self.advance()  # Consume leading '"'
        string = ''
        while self.current_char != '"':
            string += self.current_char
            self.advance()
        self.advance()  # Consume trailing '"'
        return string
    def number(self):
        number = ''
        while self.current_char.isdigit():
            number += self.current_char
            self.advance()
        return number
    def is_ident(self):
        return self.current_char.isalpha() or self.current_char == '_'
    def _ident(self):
        _id = ''
        # identifiers begin with alpha or underscores
        if self.current_char.isalpha() or self.current_char == '_':
            while self.current_char.isalnum() or self.current_char == '_':
                _id += self.current_char
                self.advance()
        return _id
    def is_keyword(self, kw):
        return kw.upper() in KEYWORDS
    def eat_comment(self):
        if self.current_char == '(':
            while self.current_char != ')' and not self.is_eoi():
                self.advance()
            self.advance()  # skip trailing )
        elif self.current_char == '\\':
            while self.current_char != '\n' and not self.is_eoi():
                self.advance()
            self.advance()  # skip new line character
    #############################################
    # Tokenization                              #
    #-------------------------------------------#
    # The method next_token() is called by the  #
    # client to obtain the next token in the    #
    # input stream.                             #
    #############################################
    def next_token(self) -> Token:
        if not self.is_eoi():
            if self.current_char.isspace():
                while self.current_char.isspace():
                    self.advance()
            if self.current_char.isdigit():
                return Token(TT_NUMBER, self.number(), self.line, self.col)
            elif self.current_char == '+':
                val = self.current_char
                self.advance()
                return Token(TT_PLUS, val, self.line, self.col)
            elif self.current_char == '-':
                val = self.current_char
                self.advance()
                return Token(TT_MINUS, val, self.line, self.col)
            elif self.current_char == '*':
                val = self.current_char
                if self.peek() == '*':
                    val += '*'
                    self.advance()
                    self.advance()
                    return Token(TT_POW, val, self.line, self.col)
                self.advance()
                return Token(TT_MUL, val, self.line, self.col)
            elif self.current_char == '/':
                val = self.current_char
                self.advance()
                return Token(TT_DIV, val, self.line, self.col)
            elif self.current_char == '&':
                val = self.current_char
                self.advance()
                return Token(TT_AND, val, self.line, self.col)
            elif self.current_char == '|':
                val = self.current_char
                self.advance()
                return Token(TT_OR, val, self.line, self.col)
            elif self.current_char == '^':
                val = self.current_char
                self.advance()
                return Token(TT_XOR, val, self.line, self.col)
            elif self.current_char == '~':
                val = self.current_char
                self.advance()
                return Token(TT_INV, val, self.line, self.col)
            elif self.is_ident():
                _id = self._ident()
                if self.is_keyword(_id):
                    return Token(TT_KEYWORD, _id, self.line, self.col)
                else:
                    return Token(TT_ID, _id, self.line, self.col)
            elif self.current_char == '<':
                val = self.current_char
                if self.peek() == '<':
                    val += '<'
                    self.advance()
                    self.advance()
                    return Token(TT_LSL, val, self.line, self.col)
                self.advance()
                return Token(TT_LESS, val, self.line, self.col)
            elif self.current_char == '>':
                val = self.current_char
                if self.peek() == '>':
                    val += '>'
                    self.advance()
                    self.advance()
                    return Token(TT_LSR, val, self.line, self.col)
                self.advance()
                return Token(TT_GREATER, val, self.line, self.col)
            elif self.current_char == '=':
                val = self.current_char
                if self.peek() == '=':
                    val += '='
                    self.advance()
                    self.advance()
                    return Token(TT_EQ, val, self.line, self.col)
                self.advance()
                return Token(TT_ASSIGN, val, self.line, self.col)
            elif self.current_char == ':':
                val = self.current_char
                self.advance()
                return Token(TT_STARTDEF, val, self.line, self.col)
            elif self.current_char == ';':
                val = self.current_char
                self.advance()
                return Token(TT_ENDDEF, val, self.line, self.col)
            elif self.current_char == '(' or self.current_char == '\\':
                self.eat_comment()
                return self.next_token()
            elif self.current_char == "'":
                return Token(TT_CHAR, self.char(), self.line, self.col)
            elif self.current_char == '"':
                string = self.string()
                return Token(TT_STRING, string, self.line, self.col)
            elif self.is_eoi():
                return Token(TT_EOF, None, self.line, self.col)
            else:
                raise ValueError(f'Unexpected character: {self.current_char} in input stream at position: {str(self.pos)}.')
        else:
            return Token(TT_EOF, None, self.line, self.col)
def main(text: str):
    lexer = Lexer(text)
    tok = lexer.next_token()
    while tok.type != TT_EOF:
        print(tok)
        tok = lexer.next_token()
if __name__ == "__main__":
    text = """ 
    ( this is a multi-line 
       comment. ) 
    \ This is a line comment. 
    _myIdent """
    main(text)
# stack.py
class Stack:
    def __init__(self, report_errors=False):
        self.store = []
        self.report_error = report_errors
    def push(self, item):
        self.store.append(item)
    def pop(self):
        if self.store:
            return self.store.pop()
        else:
            if self.report_error:
                raise IndexError('Attempt to pop a value from empty stack!')
            else:
                return None
    def tail(self):
        if self.store:
            return self.store.pop(0)
        else:
            return None
    def clear(self):
        self.store.clear()
    def count(self):
        return len(self.store)
    def is_empty(self):
        return not bool(self.store)
    def __str__(self):
        # Render the stack as [ 'v1', 'v2', ... ]; also handles an empty stack.
        return '[ ' + ', '.join(f"'{v}'" for v in self.store) + ' ]'
    def __repr__(self):
        return self.__str__()
# tokens.py
# Data
TT_CHAR     = 'CHAR'    # 'c'
TT_STRING   = 'STRING'  # "string"
TT_NUMBER   = 'NUMBER'  # 1234 (integers only)
TT_ID       = 'ID'
# Arithmetic Operators
TT_ASSIGN   = 'ASSIGN'  #   '='
TT_PLUS     = 'PLUS'    #   '+'
TT_MINUS    = 'MINUS'   #   '-'
TT_MUL      = 'MUL'     #   '*'
TT_DIV      = 'DIV'     #   '/'
TT_POW      = 'POW'     #   '**'
# Bitwise Operators
TT_LSR      = 'LSR'     #   '>>'
TT_LSL      = 'LSL'     #   '<<'
TT_AND      = 'AND'     #   '&' Bitwise AND
TT_OR       = 'OR'      #   '|' Bitwise OR
TT_XOR      = 'XOR'     #   '^' Bitwise XOR
TT_INV      = 'INV'     #   '~' Bitwise NOT
# Relational Operators
TT_EQ       = 'EQ'      #   '==' Is Equal?
TT_LESS     = 'LESS'    #   '<'  Is Less?
TT_GREATER  = 'GREATER' #   '>' Is greater?
# Logical Operators
TT_LAND  = 'LAND'   #   AND Logical
TT_LOR   = 'LOR'    #   OR Logical
TT_NEQ   = 'NEQ'    #   NEQ Logical
# DEFINE Functions
TT_STARTDEF = 'STARTDEFINE' #   ':'
TT_ENDDEF   = 'ENDDEFINE'   #   ';'
# INTERNAL USE
TT_KEYWORD  = 'KEYWORD'
TT_START    = "START"
TT_EOF      = 'EOF'
class Token:
    def __init__(self, _type, _value, line=None, col=None):
        self.type = _type
        self.value = _value
        self.col = col
        self.line = line
    def __str__(self):
        return f'''type: {self.type}, value: {self.value}, 
                at line {self.line} in column {self.col}'''
    def __repr__(self):
        return self.__str__()
