Implementing Stack Oriented Languages

This entry is part 1 of 4 in the series Implementing Stack Oriented Languages

TL;DR: Warning, this is a long post.

While most software developers have heard of the Structural, Imperative, Object-Oriented, Prototypal, and Functional programming paradigms, and of the language types that support them, few have heard of Stack-Oriented Programming, even though it has been around for quite some time. While this seemingly obscure paradigm is outside the norm for most developers, many have heard of (and perhaps even used) popular languages that implement stack-oriented programming. Two of the most popular stack-oriented languages are PostScript and Forth. Most (though not all) stack-oriented languages are concatenative languages (see http://concatenative.org). However, a concatenative language is not always stack-based, nor are all stack-based languages concatenative.

In most popular imperative languages, a stack is employed to handle the passing of parameters and return values, as well as storing the return location (address). However, this stack is not directly exposed to the developer. In a stack-oriented language, the stack is not only exposed to the developer, but doing anything in the language requires manipulating it. Most stack-oriented languages include at least two stacks: one for storing parameters and results, and a second for storing return addresses.

Why would I need a Stack-Oriented language?

Stack-oriented languages are powerful, usually compact, often interactive, and are some of the simplest languages to implement. As general-purpose programming languages they find use mostly in space and aeronautics, process control, and embedded systems. While many will claim that Forth is dead, it is still used by NASA, SpaceX, Boeing, and others. These languages are also used to create DSLs (Domain-Specific Languages) and to provide scripting languages in games.

Why should I care?

Today I want to explore stack-oriented languages, their power, and a simple implementation of a custom stack-oriented language. Why should you be interested in them? For the same reason you should be interested in OOP or functional programming. For the professional developer, these paradigms are tools in our toolbox, and each has its own unique abilities, strengths, and weaknesses. If you don't have a hammer, or don't know how to use one effectively, driving nails becomes a major challenge.

If you have never programmed in a stack-based language and have never used an HP programmable calculator, things will look and feel very obscure at first. But I promise you, with a short introduction, it will all make sense.

Let's start by talking about stacks. Hopefully, you know that a stack is one of the fundamental data structures in Computer Science. Stacks are a simple storage facility for data, and are often referred to as LIFO (Last In, First Out) data structures. You can think of a stack much like one of those carts that hold plates at a buffet. The first plate placed on the cart is at the bottom of the stack. The next plate pushed onto the stack is placed on top of the previous one. If we need a plate, we cannot take a plate off the bottom of the stack; we must remove the top plate first, which was the last plate placed on the stack.

The stack has two fundamental operations: "push" to place an item on the stack, and "pop" to remove an item from the stack. When we push an item, it is always placed on top of all the items already on the stack. When we pop an item, it is always the last item that was placed on the stack. At first, this sounds very restrictive, but its power is deceiving!
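Before we wrap this behavior in a class, here is a quick illustration using a plain Python list, which already gives us push (via append) and pop:

plates = []
plates.append('plate 1')  # push
plates.append('plate 2')  # push
plates.append('plate 3')  # push
print(plates.pop())       # plate 3 -- the last plate in is the first out
print(plates.pop())       # plate 2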

Let's implement a simple Stack class in Python (or feel free to use your favorite language) so we can do a few simple experiments.

A rudimentary version of a stack might look like this:

class Stack:
    def __init__(self, report_errors=False):
        self.store = []
        self.report_error = report_errors

    def push(self, item):
        self.store.append(item)

    def pop(self):
        if self.store:
            return self.store.pop()
        else:
            if self.report_error:
                raise IndexError('Attempt to pop a value from empty stack!')
            else:
                return None

It will eventually come in handy to have a few more methods in our Stack class. So let’s go ahead and add the following methods. These are all pretty self-explanatory:

    def tail(self):
        if self.store:
            return self.store.pop(0)
        else:
            return None

    def clear(self):
        self.store.clear()

    def count(self):
        return len(self.store)

    def is_empty(self):
        return not bool(self.store)

    def __str__(self):
        # Guard against an empty stack so the slice below doesn't mangle the output.
        if not self.store:
            return '[ ]'
        s = '[ '
        for v in self.store:
            s += "'" + str(v) + "', "
        s = s[:-2]
        s += ' ]'
        return s

    def __repr__(self):
        return self.__str__()

This is my generic Stack implementation in Python. It has a few accessories not always found in a stack implementation. First, the "tail" method allows us to remove the first item placed on the stack; it is here because it allows me to use the stack as a queue when needed. The "clear" method allows us to empty the stack and start afresh. The "count" method returns a count of the items on the stack. The "is_empty" method returns True if the stack is empty and False otherwise. The last two methods are special methods in Python. The "__str__" method, pronounced "dunder str" (dunder is short for "double underscore"), is used to provide the Python print function with a string representation of the stack object. The "__repr__" method provides Python a representation of the stack object when needed; in this case, it simply calls "__str__" to get the string representation of the object. These methods allow us to simply write:

print(stack)

to display the stack contents. In a C implementation, this would require a loop.
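To see the class in action, here is a short usage example (the output format comes from the __str__ method above):

from stack import Stack

stack = Stack()
stack.push('a')
stack.push('b')
stack.push('c')
print(stack)          # [ 'a', 'b', 'c' ]
print(stack.pop())    # c
print(stack.count())  # 2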

So now that we have a stack let’s look at how we might use it. Let’s say we have a string we need to reverse. How would you go about this?

One way would be to use a loop:

def reverse(text: str) -> str:
    out_text = ''
    # Walk the indexes from the last character down to the first.
    for i in range(len(text) - 1, -1, -1):
        out_text += text[i]
    return out_text

A more pythonic version:

def reverse(text: str) -> str:
    return text[::-1]

In this example, we apply the slice operator directly to the input text. Languages like C don't support this, however, so there we would have to use a loop. But there is another way! You guessed it: it employs our Stack class. Let's see how we might use our stack to reverse the string.

from stack import Stack

def reverse(text: str) -> str:
    stack = Stack()
    s = ''
    [stack.push(char) for char in text]
    return s.join([stack.pop() for _ in range(stack.count())])


if __name__ == '__main__':
    print(reverse('This is a test.'))

This version of the reverse function may be a bit hard to follow if you are not used to list comprehensions in Python. Basically, they are shorthand for for-loops. The line:

 [stack.push(char) for char in text]

is a for loop that iterates over each character in the "text" variable, assigning each value to "char" and then calling the stack.push() method, passing the new value of char to it on each iteration. In C, this would be a loop that gets each character of the text string and pushes it onto the stack.
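Written as an explicit loop, that single line is equivalent to:

for char in text:
    stack.push(char)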

If you were to examine the stack now you would find that the last character of the “text” string is on top of the stack. The characters of the string are in reverse order. Now, all we need to do is get them back off the stack and assemble them back into a string. The following code does just that:

s.join([stack.pop() for _ in range(stack.count())])

All we have to do now is return it to the caller, which we do with the return statement.
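Again, the comprehensions are just compact loops. An equivalent version of the whole function written with explicit loops looks like this:

from stack import Stack

def reverse(text: str) -> str:
    stack = Stack()
    # Push every character; the last character ends up on top.
    for char in text:
        stack.push(char)

    # Pop them back off; they come out in reverse order.
    out_text = ''
    while not stack.is_empty():
        out_text += stack.pop()
    return out_text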

This was a simple use of a stack to demonstrate their power in a trivial example. Stacks can also be used to perform mathematical operations. (See: https://www.studytonight.com/post/arithmetic-expressioninfix-evaluation-using-stack for more info)
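For a taste of that, here is a minimal sketch of a postfix (reverse-Polish) evaluator built on our Stack class. It assumes space-separated, non-negative integer operands and only the four basic operators, and it does no error checking:

from stack import Stack

def eval_postfix(expression: str) -> int:
    """Evaluate a space-separated postfix expression such as '12 6 * 2 -'."""
    stack = Stack()
    for word in expression.split():
        if word.isdigit():
            stack.push(int(word))
        else:
            # An operator pops two operands and pushes the result back.
            right = stack.pop()
            left = stack.pop()
            if word == '+':
                stack.push(left + right)
            elif word == '-':
                stack.push(left - right)
            elif word == '*':
                stack.push(left * right)
            elif word == '/':
                stack.push(left // right)
    return stack.pop()

print(eval_postfix('12 6 * 2 - 5 *'))  # prints 350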

Now with our little demo out of the way, I hope you can see the power of a properly used stack! Let’s get on with implementing a simple interpreted Stack Oriented Language.

Before we can begin we need to define what it is we are going to create. What is our language going to look like? Rather than creating our syntax from scratch, I’ll take a lot from Forth. However, Forth uses a lot of single-character commands that can be confusing for a newbie. Our language will do away with things like “.” and “.s” and attempt to make the language more intuitive by replacing “.” with “clear” and “.s” with “dump” to display the stack contents. We will make other changes as well. But we will also keep those commands that make sense like the family of “dup” commands used to duplicate items on the stack.
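Just to set expectations, here is a purely hypothetical taste of what a small program in our language might look like once the interpreter exists; the exact words and their semantics get pinned down as we build things out in later parts of this series:

\ push two numbers, multiply them, and show the stack
3 4 * dump

( define a new word that squares the top of the stack )
: square dup * ;
5 square dump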

Writing a compiler in any language requires a few moving parts. At a minimum, we need a lexer, a parser, and an interpreter. Our input will be a stream of characters, so we will need to collect those characters into the keywords, commands, and symbols used by our language. You can think of this as taking the alphabet of English and using it to create meaningful words. This is the job of the lexer. Once we have a collection of words, we need to make sure that they are used in the proper order and context to have meaning in our language. For example, "The chase dog the cat." is made up of proper English words, but their use here is out of order and makes no sense. Ensuring that the words the lexer hands us form sentences that make sense is the job of the parser. Our interpreter will take these sentences and produce meaning from them by converting them into actions. In a compiler, we would convert these sentences into code in the target language.

First, we will create a class and define a few simple fields to store the input text and keep track of the character position we are currently at. It will be helpful to also include the line and column where we found the token. Place the following code in a file named lexer.py.

class Lexer:
    def __init__(self, text, line=None, col=None):
        self.text = text
        self.pos = 0
        self.line = line if line else 1
        self.col = col if col else 1
        self.current_char = self.text[0]

Here we initialize the current_char field to the first character in the input stream. This primes the pump and gets us ready to start scanning the input.

Next, we will need a few methods to help us manage the scanning and maintain the position fields. We will use the "advance()" method when we want to get the next character in the input stream. We will also use an "update_position()" method to track the line and column we are located at, and a method to tell us when we have reached the end of the input stream; I'll call this "is_eoi()" for "is end of input". Next, we will want an error method to help us raise an error when we find an unexpected character in the input stream. We will use a method called "expected()" for this. Finally, it will be helpful to have a method that allows us to look ahead one character in the input stream. The reason for this is that I want to use C-style operators like "<" for less-than and "<<" for left shift. When the lexer reaches the first "<" symbol, it has no idea which operator it is looking at without taking a peek at the next character. This technique is common for languages that have ambiguous grammars; it provides a way of disambiguating the grammar by examining the context.

OK, let’s see what this all looks like in Python. Add the following to the end of lexer.py:

    #############################################
    # Utility Methods                           #
    #-------------------------------------------#
    # These methods help manage the internal    #
    # workings of the lexer.                    #
    #############################################
    def advance(self):
        """advance():
        Advances the current position in the character stream.
        """
        if not self.is_eoi():
            self.pos += 1
            self.current_char = self.text[self.pos]
            self.update_position()
        else:
            self.current_char = ''

    def update_position(self):
        """update_position()
        Maintains the line and column positions for the current token scan.
        """
        if self.current_char == '\n':
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def is_eoi(self):
        """is_eoi():
        :return:
            True if current position is End Of Input stream.
            Returns False otherwise.
        """
        return self.pos >= (len(self.text) - 1)

    def expected(self, expected, found):
        """expected()
        Raises an error if expected and found are different.
        """
        if expected == found:
            return
        else:
            raise ValueError(f'[Error] Expected {expected} but found {found} in line: {self.line} position: {self.col}')

    def peek(self):
        """peek()
        Returns next character in input stream after the current character.
        Note: If we are at EOI (End Of Input) then a space character is returned.
        """
        if not self.is_eoi():
            return self.text[self.pos + 1]
        else:
            return ' '  # space

All of our methods are exactly what you would write given a short description of what they need to do. So far this is all pretty basic.

Next, we need a method to collect the input characters into the words and symbols used in our language. These words and symbols are called tokens, and they usually carry at least type and value information. We will also include the position in the input string where we found the token, along with the line and column values.

I said above that our tokens contain a type and value. Many of the words we find in our language will fit into the category of "keyword". A keyword is a word that has a special meaning in the language. These words are reserved and cannot be used for variable or function names. All keywords will be returned as type TT_KEYWORD (TT for Token Type). However, this is not enough information to tell what keyword the lexer encountered, so we will pass the actual keyword as the value. We will need an object to hold this data as a single unit. So create another file named tokens.py and add the following code:

# Data
TT_CHAR     = 'CHAR'    # 'c'
TT_STRING   = 'STRING'  # "string"
TT_NUMBER   = 'NUMBER'  # 1234 (integers only)
TT_ID       = 'ID'

# Arithmetic Operators
TT_ASSIGN   = 'ASSIGN'  #   '='
TT_PLUS     = 'PLUS'    #   '+'
TT_MINUS    = 'MINUS'   #   '-'
TT_MUL      = 'MUL'     #   '*'
TT_DIV      = 'DIV'     #   '/'
TT_POW      = 'POW'     #   '**'

# Bitwise Operators
TT_LSR      = 'LSR'     #   '>>'
TT_LSL      = 'LSL'     #   '<<'
TT_AND      = 'AND'     #   '&' Bitwise AND
TT_OR       = 'OR'      #   '|' Bitwise OR
TT_XOR      = 'XOR'     #   '^' Bitwise XOR
TT_INV      = 'INV'     #   '~' Bitwise NOT

# Relational Operators
TT_EQ       = 'EQ'      #   '==' Is Equal?
TT_LESS     = 'LESS'    #   '<'  Is Less?
TT_GREATER  = 'GREATER' #   '>' Is greater?

# Logical Operators
TT_LAND  = 'LAND'   #   '&&' Logical
TT_LOR   = 'LOR'    #   '||' Logical
TT_NEQ   = 'NEQ'     #   '!=' Logical

# INTERNAL USE
TT_KEYWORD  = 'KEYWORD'
TT_STARTDEF = 'STARTDEFINE'
TT_ENDDEF   = 'ENDDEFINE'
TT_START    = "START"
TT_EOF      = 'EOF'


class Token:
    def __init__(self, _type, _value, line=None, col=None):
        self.type = _type
        self.value = _value
        self.col = col
        self.line = line

    def __str__(self):
        return f'''type: {self.type}, value: {self.value}, 
                at line {self.line} in column {self.col}'''

    def __repr__(self):
        return self.__str__()

As you can see, the implementation is straightforward. Now we will continue on with the lexer. In lexer.py we will add a method named "next_token()". Add the code below to the end of your lexer.py file:

    #############################################
    # Tokenization                              #
    #-------------------------------------------#
    # The method next_token() is called by the  #
    # client to obtain the next token in the    #
    # input stream.                             #
    #############################################
    def next_token(self) -> Token:
        if not self.is_eoi():
            if self.current_char.isspace():
                while self.current_char.isspace():
                    self.advance()

            # Recognize number
            if self.current_char.isdigit():
                return Token(TT_NUMBER, self.number(), self.line, self.col)

            # Error
            else:
                raise ValueError(f'[Error] Unexpected character: {self.current_char} in input stream at position: {str(self.pos)}.')

        else:
            return Token(TT_EOF, None)

This method is the heart of our lexer. We will use it to look at each character from the input stream and decide what token to return. In C-like languages, this can be implemented as a switch statement. As you can see, I have already added two cases to the method: one for white-space characters and one for digits. The digit case calls the number() method, which we now need to implement. In lexer.py add the following just above the next_token() method:

    #############################################
    # Recognizer Methods                        #
    #-------------------------------------------#
    # These methods collect the characters that #
    # belong to a single multi-character token  #
    # into a single unit, and return that unit  #
    # to the caller.                            #
    #############################################
    def number(self):
        number = ''
        while self.current_char.isdigit():
            number += self.current_char
            self.advance()
        return number

All of our recognizer methods will go in this region of the file. We will build these up as needed. Before we move on, let's make sure what we have is working. In the lexer.py file, at the very bottom, let's create a function to test the lexer:

def main(text: str):
    lexer = Lexer(text)
    tok = lexer.next_token()
    while tok.type != TT_EOF:
        print(tok)
        tok = lexer.next_token()

if __name__ == "__main__":
    text = "12 34 87 2 5 67"
    main(text)

If you execute this file you should see the following output:

type: NUMBER, value: 12, 
                at line 1 in column 3
type: NUMBER, value: 34, 
                at line 1 in column 6
type: NUMBER, value: 87, 
                at line 1 in column 9
type: NUMBER, value: 2, 
                at line 1 in column 11
type: NUMBER, value: 5, 
                at line 1 in column 13
type: NUMBER, value: 67, 
                at line 1 in column 15

As you can see, with a little infrastructure we can now recognize numbers. Try adding a letter in the input text and see that you get an error.

Normally I write unit tests as I code. In fact, in the repository for this project, you will find unit tests. But here, we’re just going to manually test the lexer by editing the input text and running the code.
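If you prefer to follow along with tests, a minimal sketch of such a test might look like this (using Python's built-in unittest module; it only covers the number tokens we have implemented so far):

import unittest

from lexer import Lexer
from tokens import TT_NUMBER


class LexerNumberTests(unittest.TestCase):
    def test_numbers_are_tokenized(self):
        lexer = Lexer('12 34')
        tok = lexer.next_token()
        self.assertEqual(tok.type, TT_NUMBER)
        self.assertEqual(tok.value, '12')
        tok = lexer.next_token()
        self.assertEqual(tok.type, TT_NUMBER)
        self.assertEqual(tok.value, '34')


if __name__ == '__main__':
    unittest.main()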

Let's add some math operators. We will start with the binary operators (so-called because they take two arguments to operate on): +, -, *, and /. These all have their own token types in the tokens.py file. Since these are simple one-character words, we only need to test our current_char for them. Add the following code to the next_token() method between the test for digits and the # Error comment:

            elif self.current_char == '+':
                val = self.current_char
                self.advance()
                return Token(TT_PLUS, val)

            elif self.current_char == '-':
                val = self.current_char
                self.advance()
                return Token(TT_MINUS, val)
            
            elif self.current_char == '*':
                val = self.current_char
                self.advance()
                return Token(TT_MUL, val)

            elif self.current_char == '/':
                val = self.current_char
                self.advance()
                return Token(TT_DIV, val)
            
            # Error
            else:
                raise ValueError(f'[Error] Unexpected character: 
          . . .

OK, edit the text in our test driver at the bottom of lexer.py to read:

    text = "12 6 * 2 - 5 *"
    main(text)

Hopefully, you got something like this:

type: NUMBER, value: 12, 
                at line 1 in column 3
type: NUMBER, value: 6, 
                at line 1 in column 5
type: MUL, value: *, 
                at line None in column None
type: NUMBER, value: 2, 
                at line 1 in column 9
type: MINUS, value: -, 
                at line None in column None
type: NUMBER, value: 5, 
                at line 1 in column 13
type: MUL, value: *, 
                at line None in column None

You may have noticed that the line and column fields show None for the operator tokens. This is because I didn't pass those values to the tokens when we created them. So let's go back and add them now. In your next_token() method, your elif selector statements should now look like this:

            elif self.current_char == '+':
                val = self.current_char
                self.advance()
                return Token(TT_PLUS, val, self.line, self.col)

            elif self.current_char == '-':
                val = self.current_char
                self.advance()
                return Token(TT_MINUS, val, self.line, self.col)
            
            elif self.current_char == '*':
                val = self.current_char
                self.advance()
                return Token(TT_MUL, val, self.line, self.col)

            elif self.current_char == '/':
                val = self.current_char
                self.advance()
                return Token(TT_DIV, val, self.line, self.col)
            
            # Error
            else:
                raise ValueError(f'[Error] Unexpected character: 
          . . .

If we rerun our driver program we should get something like this:

type: NUMBER, value: 12, 
                at line 1 in column 3
type: NUMBER, value: 6, 
                at line 1 in column 5
type: MUL, value: *, 
                at line 1 in column 7
type: NUMBER, value: 2, 
                at line 1 in column 9
type: MINUS, value: -, 
                at line 1 in column 11
type: NUMBER, value: 5, 
                at line 1 in column 13
type: MUL, value: *, 
                at line 1 in column 14

OK, it looks like we’ve got a good start on the lexer. But if you look at the token types in the tokens.py file you’ll see we have a lot more to add. Take a look at the TT_POW token type. The comment says this is a two-character token and it starts with an asterisk “*” which is also used by the TT_MUL token. This is one of those tokens that cause ambiguity in our language. It’s also one of the reasons we wrote that peek() method!

Since the lexer sees only a single asterisk at first, we must modify the selector statement in next_token() to look ahead one character and test for a second asterisk. Replace the test for the asterisk with the following code:

            elif self.current_char == '*':
                val = self.current_char
                if self.peek() == '*':
                    val += '*'
                    self.advance()
                    self.advance()
                    return Token(TT_POW, val, self.line, self.col)
                self.advance()
                return Token(TT_MUL, val, self.line, self.col)

We enter this code when the current character in the input stream is an asterisk. Once in the elif clause, we use peek() to look ahead and see if the next character is also an asterisk. If it is, we call advance() twice, once for each asterisk we need to consume, and then return a TT_POW token. If the next character is anything other than an asterisk, we recognize the single asterisk as the TT_MUL token and advance only once to consume it.

By now you should be getting a feeling for how most tokens are handled. For more complex tokens, however, we use recognizer methods. These methods simply test the input and return True if the input contains the given token type, or False otherwise. Recognizers are usually paired with consumer methods that consume the characters of the input matching the desired token type and return the consumed characters as a string to the caller, to be used as the token value. This all sounds more complex than it is.

Let's look at comments in our language. Our language will have two types of comments. Comments beginning with a backslash "\" continue to the end of the line. Multi-line comments begin with an opening parenthesis "(" and continue through the closing parenthesis (inclusive). We will first need to add a new selector statement to our next_token() method:

elif self.current_char == '(' or self.current_char == '\\':
    self.eat_comment()

Notice we test for either the opening parenthesis or the backslash character. If we find one of them, we call the eat_comment() method. We will not send comments to our parser, as the parser would only throw them away. However, the parser does expect a token to be returned, so we will have to remedy that soon. But first, let's see what eat_comment() looks like:

    def eat_comment(self):
        if self.current_char == '(':
            while self.current_char != ')' and not self.is_eoi():
                self.advance()
            self.advance()  # skip trailing )
        elif self.current_char == '\\':
            while self.current_char != '\n' and not self.is_eoi():
                self.advance()
            self.advance()  # skip new line character

Here we enter the method with the current character being the ( or \ character that opens the comment. All we want to do is consume the comment, so we use a while loop to skip ahead until we find the end of the comment, whether that is the newline character or the closing parenthesis. Because the while loop stops on the comment's closing character, we need to consume it with an additional call to advance(). It really is that easy!

Now back to that pesky issue of the parser expecting a token back, even when we meet a comment. We can solve that simply by placing a recursive call to next_token() at the end of the selector and returning the results of that call. So change the comment selector to read as follows:

            elif self.current_char == '(' or self.current_char == '\\':
                self.eat_comment()
                return self.next_token()

Now let's look at handling identifiers. In our language, identifiers follow the C convention. That is, they must begin with an alphabetic character or an underscore and may be followed by any number of additional alphabetic characters, underscores, or digits. Spaces and special characters are not allowed in identifiers.

To recognize an identifier we will need a recognizer method, is_ident(), and a consumer method, _ident(). Both are shown below:

    def is_ident(self):
        return self.current_char.isalpha() or self.current_char == '_'

    def _ident(self):
        _id = ''
        # identifiers begin with alpha or underscores
        if self.current_char.isalpha() or self.current_char == '_':
            while self.current_char.isalnum() or self.current_char == '_':
                _id += self.current_char
                self.advance()
        return _id

Add these methods to the recognizer section above next_token().

Back in the next_token() method, we now need to add our selector statement:

elif self.is_ident():
    _id = self._ident()
    return Token(TT_ID, _id, self.line, self.col)

Now it’s time to test! We need to change our text in the driver to include a comment and an identifier. Try something like this:

if __name__ == "__main__":
    text = """ 
    ( this is a multi-line 
       comment. ) 
    \ This is a line comment.
    _myIdent """

    main(text)

If you run this you should see that the comments were skipped and only the identifier was returned:

type: ID, value: _myIdent, 
                at line 5 in column 13

This leaves us with an issue, however: if our keywords are also legal identifiers, how do we resolve the conflict?

There are two methods commonly used to filter out the keywords from other identifiers. The first is to simply place tests for the keywords higher in the selector statements in next_token() so they are matched first. This, however, means that the developer must remember to do this for each new keyword. In truth, this is a fragile method and, IMHO, shouldn't be used. A more robust method is to place all our keywords in a list and, once we know we have an identifier, see if it matches anything in that list. This is the method I use below. Edit the selector statement for identifiers in next_token() to read as follows:

            elif self.is_ident():
                _id = self._ident()
                if self.is_keyword(_id):
                    return Token(TT_KEYWORD, _id, self.line, self.col)
                else:
                    return Token(TT_ID, _id, self.line, self.col)

OK, now we need a list of keywords. So create a new file and name it keywords.py. Add the following code to the file:

KEYWORDS = [
    # Stack Operators
    'POP', 'PSH', 'SWAP', 'DUMP', 'DROP', 'ROT', 'OVER',
    'DUP', 'DUP2', 'DUP3', 'CLEAR',

    # Conditional Operators
    'IF', 'ELSE', 'THEN', 'NOT',

    # Iterative
    'DO', 'LOOP',
    'WHILE', 'WEND',

    # I/O Operators
    'TYPE',  # Outputs a string to stdout
    'EMIT',  # Outputs a character to stdout
    'KEY',   # Takes input character from stdin

    # Special Character Values
    # See ASCII Character Table for values
    'CR', 'LF', 'NULL', 'BELL', 'BS', 'TAB',
    'VT', 'FF', 'ESC',

    # Logical Operators
    'NEQ', 'AND', 'OR',
]
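With the keyword list in place, the lookup helper itself is tiny. Import KEYWORDS from keywords.py at the top of lexer.py (you can see this in the full listing below), then add the following method to the recognizer section:

    def is_keyword(self, kw):
        return kw.upper() in KEYWORDS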

OK, your turn: add selector statements in next_token() to recognize each of the following tokens: TT_LSR, TT_LSL, TT_AND, TT_OR, TT_XOR, TT_INV, TT_EQ, TT_LAND, TT_LOR, TT_NEQ (we already handled TT_POW above). Be sure to test your code using the driver or write some unit tests.
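To get you started, here is one possible shape for the '&' selector. The single-character branch matches my full listing below; the two-character '&&' branch for TT_LAND is only a sketch that reuses the peek() trick we used for '**':

            elif self.current_char == '&':
                val = self.current_char
                if self.peek() == '&':
                    val += '&'
                    self.advance()
                    self.advance()
                    return Token(TT_LAND, val, self.line, self.col)
                self.advance()
                return Token(TT_AND, val, self.line, self.col)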

OK, now that you have everything except strings and characters working, let's complete the lexer by adding them. First, strings. A string in our language is any set of characters, except the double quote itself, enclosed in double quotes. Go ahead and add a simple selector statement to next_token() that filters on the double-quote character.

We will leave the double-quote in current_char and call a recognizer function to collect the string and return it to the selector. The selector will use the string value returned to create the token which is then returned to the caller. Here’s the selector code for next_token():

            elif self.current_char == '"':
                string = self.string()
                return Token(TT_STRING, string, self.line, self.col)

Now we need the string recognizer method. It first ensures that we entered with a double quote as the current_char value; otherwise, it raises an error. If no error occurred, we advance() to consume the opening double quote. Next, we collect the characters of the string in a variable named string using a while loop. The loop terminates on finding the closing double quote. This means that when the loop exits, the closing double quote is still the current character, so we must advance once again to consume it. Lastly, we return the string we collected in the string variable. The code below is our implementation of string():

    def string(self):
        if self.current_char != '"':
            self.expected('"', self.current_char)
        self.advance()  # Consume leading '"'

        string = ''
        while self.current_char != '"':
            string += self.current_char
            self.advance()
        self.advance()  # Consume trailing '"'
        return string

Next, we need to implement the character selector. Characters in our language will be placed inside single-quotes. Try your hand at writing the code to implement the handling of characters.
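For reference, the selector statement in my version looks like the snippet below; the char() recognizer it calls appears in the full listing at the end of this post:

            elif self.current_char == "'":
                return Token(TT_CHAR, self.char(), self.line, self.col)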

This has been a long post with quite a bit of code. I'll list out my implementations below. I'll also put the code up on Bitbucket so you can download it for inspection. If you are truly interested in learning about language implementation, you'll type everything out and spend some time tracing through the code. The only way to learn is to do it! Write each method and see how they work together.

In my next post we will implement the parser. Until then, keep coding!

# stack.py

class Stack:
    def __init__(self, report_errors=False):
        self.store = []
        self.report_error = report_errors

    def push(self, item):
        self.store.append(item)

    def pop(self):
        if self.store:
            return self.store.pop()
        else:
            if self.report_error:
                raise IndexError('Attempt to pop a value from empty stack!')
            else:
                return None

    def tail(self):
        if self.store:
            return self.store.pop(0)
        else:
            return None

    def clear(self):
        self.store.clear()

    def count(self):
        return len(self.store)

    def is_empty(self):
        return not bool(self.store)

    def __str__(self):
        # Guard against an empty stack so the slice below doesn't mangle the output.
        if not self.store:
            return '[ ]'
        s = '[ '
        for v in self.store:
            s += "'" + str(v) + "', "
        s = s[:-2]
        s += ' ]'
        return s

    def __repr__(self):
        return self.__str__()
# keywords.py

KEYWORDS = [
    # Stack Operators
    'POP', 'PSH', 'SWAP', 'DUMP', 'DROP', 'ROT', 'OVER',
    'DUP', 'DUP2', 'DUP3', 'CLEAR',

    # Conditional Operators
    'IF', 'ELSE', 'THEN', 'NOT',

    # Iterative
    'DO', 'LOOP',
    'WHILE', 'WEND',

    # I/O Operators
    'TYPE',  # Outputs a string to stdout
    'EMIT',  # Outputs a character to stdout
    'KEY',   # Takes input character from stdin

    # Special Character Values
    # See ASCII Character Table for values
    'CR', 'LF', 'NULL', 'BELL', 'BS', 'TAB',
    'VT', 'FF', 'ESC',

    # Logical Operators
    'NEQ', 'AND', 'OR',

]
# tokens.py


# Data
TT_CHAR     = 'CHAR'    # 'c'
TT_STRING   = 'STRING'  # "string"
TT_NUMBER   = 'NUMBER'  # 1234 (integers only)
TT_ID       = 'ID'

# Arithmetic Operators
TT_ASSIGN   = 'ASSIGN'  #   '='
TT_PLUS     = 'PLUS'    #   '+'
TT_MINUS    = 'MINUS'   #   '-'
TT_MUL      = 'MUL'     #   '*'
TT_DIV      = 'DIV'     #   '/'
TT_POW      = 'POW'     #   '**'

# Bitwise Operators
TT_LSR      = 'LSR'     #   '>>'
TT_LSL      = 'LSL'     #   '<<'
TT_AND      = 'AND'     #   '&' Bitwise AND
TT_OR       = 'OR'      #   '|' Bitwise OR
TT_XOR      = 'XOR'     #   '^' Bitwise XOR
TT_INV      = 'INV'     #   '~' Bitwise NOT

# Relational Operators
TT_EQ       = 'EQ'      #   '==' Is Equal?
TT_LESS     = 'LESS'    #   '<'  Is Less?
TT_GREATER  = 'GREATER' #   '>' Is greater?

# Logical Operators
TT_LAND  = 'LAND'   #   AND Logical
TT_LOR   = 'LOR'    #   OR Logical
TT_NEQ   = 'NEQ'    #   NEQ Logical

# DEFINE Functions
TT_STARTDEF = 'STARTDEFINE' #   ':'
TT_ENDDEF   = 'ENDDEFINE'   #   ';'

# INTERNAL USE
TT_KEYWORD  = 'KEYWORD'
TT_START    = "START"
TT_EOF      = 'EOF'


class Token:
    def __init__(self, _type, _value, line=None, col=None):
        self.type = _type
        self.value = _value
        self.col = col
        self.line = line

    def __str__(self):
        return f'''type: {self.type}, value: {self.value}, 
                at line {self.line} in column {self.col}'''

    def __repr__(self):
        return self.__str__()

# lexer.py

from keywords import KEYWORDS
from tokens import *

class Lexer:
    def __init__(self, text, line=None, col=None):
        self.text = text
        self.pos = 0
        self.line = line if line else 1
        self.col = col if col else 1
        self.current_char = self.text[0]


    #############################################
    # Utility Methods                           #
    #-------------------------------------------#
    # These methods help manage the internal    #
    # workings of the lexer.                    #
    #############################################
    def advance(self):
        """advance():
        Advances the current position in the character stream.
        """
        if not self.is_eoi():
            self.pos += 1
            self.current_char = self.text[self.pos]
            self.update_position()
        else:
            self.current_char = ''

    def update_position(self):
        """update_position
        Maintains the line and column positions for the current token scan.
        """
        if self.current_char == '\n':
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def is_eoi(self):
        """is_eoi():
        :return:
            True if current position is End Of Input stream.
            Returns False otherwise.
        """
        return self.pos >= (len(self.text) - 1)

    def expected(self, expected, found):
        """expected
        Raises an error if expected and found are different.
        """
        if expected == found:
            return
        else:
            raise ValueError(f'[Error] Expected {expected} but found {found} in line: {self.line} position: {self.col}')

    def peek(self):
        """peek
        Returns next character in input stream after the current character.
        Note: If we are at EOI (End Of Input) then a space character is returned.
        """
        if not self.is_eoi():
            return self.text[self.pos + 1]
        else:
            return ' '  # space

    #############################################
    # Recognizer Methods                        #
    #-------------------------------------------#
    # These methods collect the characters that #
    # belong to a single multi-character token  #
    # into a single unit, and return that unit  #
    # to the caller.                            #
    #############################################
    def char(self):
        if self.current_char != "'":
            self.expected("'", self.current_char)
        self.advance()  # Consume opening single quote

        char = self.current_char
        self.advance()  # Consume character

        # Now we must have a closing single quote mark
        if self.current_char != "'":
           self.expected("'", self.current_char)
        self.advance()  # Consume closing single quote
        return char

    def string(self):
        if self.current_char != '"':
            self.expected('"', self.peek())
        self.advance()  # Consume leading '"'

        string = ''
        while self.current_char != '"':
            string += self.current_char
            self.advance()
        self.advance()  # Consume trailing '"'
        return string

    def number(self):
        number = ''
        while self.current_char.isdigit():
            number += self.current_char
            self.advance()
        return number

    def is_ident(self):
        return self.current_char.isalpha() or self.current_char == '_'

    def _ident(self):
        _id = ''
        # identifiers begin with alpha or underscores
        if self.current_char.isalpha() or self.current_char == '_':
            while self.current_char.isalnum() or self.current_char == '_':
                _id += self.current_char
                self.advance()
        return _id

    def is_keyword(self, kw):
        return kw.upper() in KEYWORDS

    def eat_comment(self):
        if self.current_char == '(':
            while self.current_char != ')' and not self.is_eoi():
                self.advance()
            self.advance()  # skip trailing )
        elif self.current_char == '\\':
            while self.current_char != '\n' and not self.is_eoi():
                self.advance()
            self.advance()  # skip new line character

    #############################################
    # Tokenization                              #
    #-------------------------------------------#
    # The method next_token() is called by the  #
    # client to obtain the next token in the    #
    # input stream.                             #
    #############################################
    def next_token(self) -> Token:
        if not self.is_eoi():
            if self.current_char.isspace():
                while self.current_char.isspace():
                    self.advance()

            if self.current_char.isdigit():
                return Token(TT_NUMBER, self.number(), self.line, self.col)

            elif self.current_char == '+':
                val = self.current_char
                self.advance()
                return Token(TT_PLUS, val, self.line, self.col)

            elif self.current_char == '-':
                val = self.current_char
                self.advance()
                return Token(TT_MINUS, val, self.line, self.col)

            elif self.current_char == '*':
                val = self.current_char
                if self.peek() == '*':
                    val += '*'
                    self.advance()
                    self.advance()
                    return Token(TT_POW, val, self.line, self.col)
                self.advance()
                return Token(TT_MUL, val, self.line, self.col)

            elif self.current_char == '/':
                val = self.current_char
                self.advance()
                return Token(TT_DIV, val, self.line, self.col)

            elif self.current_char == '&':
                val = self.current_char
                self.advance()
                return Token(TT_AND, val, self.line, self.col)

            elif self.current_char == '|':
                val = self.current_char
                self.advance()
                return Token(TT_OR, val, self.line, self.col)

            elif self.current_char == '^':
                val = self.current_char
                self.advance()
                return Token(TT_XOR, val, self.line, self.col)

            elif self.current_char == '~':
                val = self.current_char
                self.advance()
                return Token(TT_INV, val, self.line, self.col)

            elif self.is_ident():
                _id = self._ident()
                if self.is_keyword(_id):
                    return Token(TT_KEYWORD, _id, self.line, self.col)
                else:
                    return Token(TT_ID, _id, self.line, self.col)

            elif self.current_char == '<':
                val = self.current_char
                if self.peek() == '<':
                    val += '<'
                    self.advance()
                    self.advance()
                    return Token(TT_LSL, val, self.line, self.col)
                self.advance()
                return Token(TT_LESS, val, self.line, self.col)

            elif self.current_char == '>':
                val = self.current_char
                if self.peek() == '>':
                    val += '>'
                    self.advance()
                    self.advance()
                    return Token(TT_LSR, val, self.line, self.col)
                self.advance()
                return Token(TT_GREATER, val, self.line, self.col)

            elif self.current_char == '=':
                val = self.current_char
                if self.peek() == '=':
                    val += '='
                    self.advance()
                    self.advance()
                    return Token(TT_EQ, val, self.line, self.col)
                self.advance()
                return Token(TT_ASSIGN, val, self.line, self.col)

            elif self.current_char == ':':
                val = self.current_char
                self.advance()
                return Token(TT_STARTDEF, val, self.line, self.col)

            elif self.current_char == ';':
                val = self.current_char
                self.advance()
                return Token(TT_ENDDEF, val, self.line, self.col)

            elif self.current_char == '(' or self.current_char == '\\':
                self.eat_comment()
                return self.next_token()

            elif self.current_char == "'":
                return Token(TT_CHAR, self.char(), self.line, self.col)

            elif self.current_char == '"':
                string = self.string()
                return Token(TT_STRING, string, self.line, self.col)

            elif self.is_eoi():
                return Token(TT_EOF, None, self.line, self.col)

            else:
                raise ValueError(f'Unexpected character: {self.current_char} in input stream at position: {str(self.pos)}.')

        else:
            return Token(TT_EOF, None, self.line, self.col)


def main(text: str):
    lexer = Lexer(text)
    tok = lexer.next_token()
    while tok.type != TT_EOF:
        print(tok)
        tok = lexer.next_token()

if __name__ == "__main__":
    text = """ 
    ( this is a multi-line 
       comment. ) 
    \ This is a line comment.
    _myIdent """

    main(text)

Until next time, Happy Coding!

