Building Machines In Code – Part 5

This entry is part 5 of 9 in the series Building Machines in Code

Tooling

Tooling

Hardware and software developers are tool makers by trade. Just like a machinist, software developers often need to develop their own tools for the job at hand. Sometimes these tools are simple scripts to automate a boring, or complicated task, or perhaps, a tool to fill a yet unfilled niche. Whatever the reason, tool making is quite common for some software developers. Hardware developers often have to develop tools such as assemblers, compilers, and programmers for the hardware systems they develop.

Since we don’t have an assembler for the Tiny-P, we will develop a simple assembler in this installment. Having an assembler will help us develop software more easily. Our assembler will have a couple of features to help us more easily write our assembly programs. Namely, labels and origin statements. Below is an example of a Tiny-P assembly language program for our assembler.

# Sample Tiny-P Assembler Program
# Comments start with a hash symbol "#"

             ORG. 0     # Origin: starting address of program code

start:       LDA  0        # Load ACC with value stored in Mem[0]
             ADD  1        # Add value stored in Mem[1] to ACC
             STA  1        # Save result of Add in Mem[2]
             BRP  start    # If result is zero or greater return to start
             ...

Let’s dissect the assembly code above. The first is the “ORG.” command is an assembler directive. It tells the assembler where our program will start in program memory. If not provided, the assembler will default to an origin of 0. Next, we encounter the “start:” label. Labels are just named locations. Since our origin is set at 0 and LDA is the first instruction in our program start is equal to program address 0. Notice how the last instruction “BRP” (branch Positive) uses the “start” label instead of 0. Names are much easier to remember and keep track of in a longer assembly language program. Use them wisely and they will become your friend!

As you may have noticed in part-4 of this series, assembly language statements and machine code have an almost one-to-one correspondence. This makes developing a simple assembler rather easy. The approach I like to use is to define the assembly language grammar in EBNF (Extended Backus Naur Form) and yes, there is a BNF. These notations allow a developer to describe programming languages and protocols. Using the EBNF grammar I then head over to my favorite railroad-diagramming tool at: https://www.bottlecaps.de/rr/ui and generate a railroad diagram from my grammar:

PROG:

STATEMENTS

no references


STATEMENTS:

STATEMENT

referenced by:


STATEMENT:

LABEL_DECL DIRECTIVE MNEMONIC OPERAND # \n

referenced by:


DIRECTIVE:

[a-z] [A-Z] _ [a-z] [A-Z] .

::= [a-zA-Z] [_a-zA-Z]* .

referenced by:


LABEL_DECL:

LABEL :

::= LABEL ':'

referenced by:


LABEL:

_ [a-z] [A-Z] _ [a-z] [A-Z] [0-9]

LABEL ::= [_a-zA-Z]+ [_a-zA-Z0-9]*

referenced by:


OPERAND:

LABEL NUMBER

referenced by:


NUMBER:

[0-9] [0-9]

NUMBER ::= [0-9]+ [0-9]*

referenced by:



… generated by RR – Railroad Diagram Generator R R

Using this diagram as a reference it becomes quite easy to produce a parser for the assembler. Covering BNF and EBNF is beyond the scope of this article. But you can find more information at: https://matt.might.net/articles/grammars-bnf-ebnf/. Also, the W3C has some information on BNF and EBNF. You can find that page here: https://www.w3.org/community/markdown/wiki/Main_Page.

Railroad diagrams are quite intuitive. You simply follow the tracks. We can see that a NUMBER is made up of one or more 0-9 characters. An OPERAND can be either a LABEL or a NUMBER. A LABEL must begin with an underscore or alpha character in either upper or lower case, followed by any number of underscores, letters, or digits. We can also see that a LABEL_DECL (label declaration) is a label followed immediately by a colon. A DIRECTIVE is a label followed immediately by a period, AKA full stop. The most interesting part of the diagram for me is the STATEMENT portion. Using this we can see how our assembly program lines are formed. This is the information we need to write our parser.

Before our parser can parse anything, we need a lexer to break up our assembly language code into tokens. You can think of tokens as the words of a language. Our lexer will use the Python str.split() method to place each token into a list. Then we’ll call a next_token() method to pop off the first element in the list. Using Python the lexer becomes trivial. Create a file called “assembler.py” and add the following code:

class Lexer:
    def __init__(self):
        self.line = None
        self.tokens = []

    def set_text(self, line: str):
        self.line = line
        self.tokens = line.split()

    def next_token(self):
        if not self.tokens:
            return None

        tok = self.tokens.pop(0)
        return tok

Next, write the standard if __name__ == “__main__” clause at the bottom of the file. We’ll add a bit of code to open a file and pass it to the lexer.

...
if __name__ == "__main__":
    inputfile = "example.asm"
    with open(inputfile, 'r') as ifh:
        text = ifh.read()
    ifh.close()

    lexer = Lexer()
    lexer.set_text(text)

    tok = lexer.next_token()
    while tok is not None:
        print(f"Token: {tok}")
        tok = lexer.next_token()

Now we need a test assembly language file. Create a file named “example.asm” and add the following Tiny-P assembly language statements to it:

# File: example.asm
# Sample Tiny-P Assembler Program
# Comments start with a hash symbol "#"

             ORG. 0        # Origin: starting address of program code

start:       LDA  0        # Load ACC with value stored in Mem[0]
             ADD  1        # Add value stored in Mem[1] to ACC
             STA  2        # Save result of Add in Mem[2]
             BRZ  start    # If result is zero return to start

Save this file in the same directory as your “assembler.py” file.

Now run the assembler.py file. You should see every word in the file printer as its own token. What you may not notice is that spaces are not returned as tokens. This is due to how the str.split() function works. Once this is working, remove the code from the if clause. Replace it with a call to main(). We don’t have this function yet so let’s create it:

...

def main():
    inputfile = "example.asm"
    with open(inputfile, 'r') as ifh:
        program_text = ifh.read()
    ifh.close()

    assembler = Assembler(program_text)
    machine_text = assembler.parse()


if __name__ == "__main__":
    main()

With this code in place, it’s time to start work on our assembler. There are an infinite number of ways to write a parser. The approach I will take here is a simple top-down parser. However, we will only parse one line at a time. The one difficulty we have is dealing with labels. It is possible for a label to be used before we reach the label declaration. To handle this we will use a dictionary as a symbol table. When we encounter a label we will first check to see if it is in the table and if so, we will resolve it to its address. If not, we will leave the label in the machine code and then resolve it once we have completed parsing the file.

Let’s start working on the Assembler class:

...

class Assembler:

    def __init__(self, text_: str):
        self.text = text_
        self.lines = self.text.split('\n')
        self.current_address = 0 
        self.opcode = 0
        self.operand = 0
        self.lexer = Lexer()
        self.symbol_table = {}
        self.code = []

The __init__() method sets up some member variables for things we need to track such as the current address, the opcode, the operand, etc. Notice that we use the split() method again on the text. This time passing in a new line character as the character to split on. This will return a list of lines and allow us to parse a line at a time. Let’s work on the parse() method now:

    ...

    def parse(self):
        for line in self.lines:
            line = line.lower()
            self.opcode = 0
            self.operand = 0

            self.lexer.set_text(line)

            tok = self.lexer.next_token()
            while tok is not None:
                self.skip_spaces(tok)

                if tok is None or not tok:
                    break

Here, we first initialize some variables and initialize our lexer with the current line of the program text. Next, we grab a token from the lexer and enter a while loop. The loop first calls a method to any white space. This isn’t really needed as our split() method doesn’t return white space when we use the default separator. But I felt compelled to include it here for those who may be porting this code to another language. So let’s add the skip_spaces() method:

    ...
    
    def skip_spaces(self, tok: str):
        while tok.isspace():
            tok = self.lexer.next_token()

    def parse(self):
        ...

Now let’s continue on with our parse() method. We need to add checks for each of our token types (LABEL_DECL, COMMENT, DIRECTIVE, and MNEMONIC):

    ...
    def parse(self):
        for line in self.lines:
            line = line.lower()
            self.opcode = 0
            self.operand = 0

            self.lexer.set_text(line)

            tok = self.lexer.next_token()
            while tok is not None:
                self.skip_spaces(tok)

                if tok is None or not tok:
                    break

                elif tok.endswith(':'):
                    # LABEL_DECL
                    print(f'LABEL_DEL: {tok}')
                    pass
                elif tok == '#':
                    # COMMENT
                    print(f'COMMENT: {tok}')
                    pass
                elif tok.endswith('.'):
                    # DIRECTIVE
                    print(f'DIRECTIVE: {tok}')
                    pass
                elif tok in MNEMONICS:
                    # INSTRUCTION
                    print('MNEMONIC: {tok')}
                    pass

                tok = self.lexer.next_token()
             

This code simply filters through the various token types. We do need a few more things before we can run this. The MNEMONIC test looks for a list of mnemonics. So let’s add that now. Just above the lexer class add the following code:

...

MNEMONICS = ['nop', 'lda', 'sta', 'and', 'or', 'not', 'add', 'sub', 'brz', 'brp']

class Lexer:
    ...

Next, just as we did with the skip_spaces() we’ll need to add a small method to our class to skip_comments. Doing this in a method keeps the code a bit cleaner. Add the following method just below the skip_spaces() method:

    ...

    def skip_comment(self, tok: str):
        if tok == '#':
            while tok:
                tok = self.lexer.next_token()

    ...

Now add a call to skip_comments() to the parse() method as shown:

                ...
                elif tok == '#':
                    # COMMENT
                    print(f'COMMENT: {tok}')
                    self.skip_comments(tok)
                    break
               ...
                

We should now have enough code to run the parse() method. At the moment it should print out the various tokens found. For comments it should only print out the hash symbol “#” and the comment text should be consumed without being displayed.

The next task we need to tackle is handling the LABEL_DECL or label declarations. We can tell a declaration from a label’s use in that a declaration occurs at the start of a line and the label name is followed immediately by a colon. Since labels and label declarations only differ by the colon, we use this fact to distinguish between them. When we have found a label declaration we need to record the label name and the corresponding address where it occurred. We mustn’t forget to remove the trailing colon or label lookups will fail. Modify the code as shown below:

                 ...

                 elif tok.endswith(':'):
                    # LABEL_DECL
                    key = tok[:-1]
                    self.symbol_table[key] = self.current_address

                elif tok == '#':
                ...

Now that we have the label declaration handled, and looking at the railroad diagram for our grammar, it’s time to handle the directives. Like the labels directives are string identifiers. They differ in that they end with a period. We use this period to filter them out for processing. We only have one directive and that is the “ORG.” or origin directive. This command simply changes the location in the memory where our program will be stored. When used with labels it can make a program relocatable. To effect this change we only need to set the current_address variable to the value of the operand following the directive. The code follows:

                ...
                elif tok.endswith('.'):
                    # DIRECTIVE
                    if tok[:-1] == 'org':
                        operand = self.lexer.next_token()
                        if operand.isnumeric():
                            self.current_address = int(operand)
                        else:
                            raise ValueError('Illegal Origin. Expected: integer, Found {operand}')
                    break
               ...

As you can see in the code above, we use an inner “if” statement to make sure we are dealing with an origin directive. Then we make a call to next_token to get the token following the directive. This should be a numeric value. If not something has gone wrong and we raise a ValueError. Finally, we set the current_address variable to the new value, being sure to convert it to an integer and then break from the while loop. The only thing that could follow after the numeric value would be a comment that we want to skip anyway. So, breaking out here simply speeds things up.

Next, we need to handle the mnemonics. This is a bit more complicated than the other tokens and will require some additional code to map mnemonics to their corresponding opcodes, and processing of any operands or symbols that follow.

To prepare for handling the mnemonics we need to add a dictionary with the mnemonics as the keys and their values from the opcode table in the previous post as the keyed item. Place this code at near the top of the file:

...

# The Opcode table relates mnemonics
# to the corresponding opcode value.
OPCODE_TABLE = {
    'nop': 0,
    'lda': 1,
    'sta': 2,
    'and': 3,
    'or':  4,
    'not': 5,
    'add': 6,
    'sub': 7,
    'brz': 8,
    'brp': 9
}

MNEMONICS = ['nop', 'lda', 'sta', 'and', 'or', 'not', 'add', 'sub', 'brz', 'brp']

...

We will use this table to find the opcode value for the associated mnemonic.

                ...
                elif tok in MNEMONICS:
                    # INSTRUCTION
                    self.opcode = OPCODE_TABLE[tok]
                    operand = self.lexer.next_token()
                    if operand.isnumeric():
                        self.operand = operand
                    elif operand.isalnum():
                        if operand in self.symbol_table:
                            self.operand = self.symbol_table[operand]
                        else:
                            self.operand = operand
                    elif operand.startswith('#'):
                        self.operand = 0
                        self.skip_comment(operand)

                    self.code.append(f"{self.current_address} : {self.opcode}-{self.operand}")
                    self.current_address += 1
                    ...

In the above code, we check to see if our token is in the list of mnemonics. If so, we enter the body of the “if” statement and look up the opcode from our OPCODE_TABLE. Next, we call next_token to grab the operand (if any) and check to see if it is a numerical value. If it is, we can use it directly, otherwise, we check to see if it is a symbol. If the operand is a symbol, we see if we can find it already in the symbol_table, and if it is there, we assign its value to the class member variable “self.operand”. If the symbol is not yet in the table, we simply assign the symbol to the class member for later lookup in the table. Next, if the operand is not a symbol, we check to see if it is a comment. The only other token that can legally follow the mnemonic before the end of a line. If it is a comment, we set the operand to 0 and call skip_comment(), to finish parsing the line.

Once we have the line parsed, the member variables “self.opcode” and “self.operand”, are set. We now build an intermediate value and store it in “self.code”. To produce the intermediate value we use the following format: <current_address> : <opcode> – <operand | symbol>. We append the intermediate value to our list of ROM values kept in the member variable “self.code” and bump the address point (self.current_addressI) to the next address.

Now we have parsed a single line of assembly code and it’s time to do it again. So, we make a call to next_token() to get the next token and re-enter the while loop. If we are out of tokens, then the while loop ends and we enter the for loop to handle the next line of assembly text.

We now have a list of (almost) machine code lines stored in “self.code”. These lines however are contaminated with symbols and format markers. Cleaning this intermediate code up and converting it to the proper machine code values for ROM storage is the responsibility of the “fixup()” method. Add this method just above the “parse()” method:

    ...
        
    def fixup(self):
        text_ = ''
        for line in self.code:
            parts = line.split(':')
            addr = parts[0]
            sub_parts = parts[1].split('-')
            opcode = sub_parts[0]
            operand = sub_parts[1]

            if operand.isalnum() and not operand.isnumeric():
                if operand in self.symbol_table:
                    operand = self.symbol_table[operand]
                else:
                    raise ValueError(f"Undefined Symbol: {operand}")

            bin_code = (int(opcode) * 100) + int(operand)
            if bin_code > 999 or bin_code < 0:
                raise ValueError(f"Illegal Machine Code Value {bin_code}")
            code_line = f'{addr.zfill(4)} {bin_code}\n'
            text_ += code_line

        return text_

    def parse(self):
       ...

The “fixup()” method steps through each line of intermediate code. It then takes each of these lines and splits them using the colons and dashes we inserted earlier to locate the address, opcode, and operand/symbol values.

Next “fixup()” checks the operand to see if it is a numeric value. If so, it’s used directly to calculate the machine code. If it is not a numeric value it must be a symbol. Since we’ve parsed the whole file, the symbol should be in the symbol table and if not, this is an undefined symbol error.

We use the value to calculate the machine code values using the formula: (opcode * 100) + operand. This value is then stored in the bin_code variable and then used to create a text. If the “bin_code” is not between 0 and 999, then an error has occurred and is raised. The zfill() method is used to force formatting of 3-digits with leading zeros for the address. Once the fixup is complete, the machine code text is returned to the caller (parse()), which then returns it to “main()“.

We can now add a line of code to the bottom of the main to print this machine_code text to the console:

        ...
        print(machine_text)
        ...

If we run the code now, you should see the machine code in four lines of digits. On each line is the 3-digit address followed by the 3-digit machine code.

Finishing Touches

While it would be possible to use the assembler as it is, cutting and pasting its output into files. It would be far better to add a little polish to the assembler and make it accept input and output file parameters and display a usage statement when run without them.

We will use the “getopt” python module to help us implement this functionality. Let’s import the module near the top of the assembler.py file:

#!/usr/bin/python3
# Tiny-P Assembler
import sys, getopt

# The Opcode table relates mnemonics
# to the corresponding opcode value.
OPCODE_TABLE = {
    ...

You may also want to add the shebang line (shown at the very top), particularly if you are Linux/Mac. This line will allow you to execute the file directly by telling the operating system what interpreter to use.

This post is already very long, and the last bit of code has little to do with creating an assembler. You can learn more about using “getopt” at: https://docs.python.org/3/library/getopt.html so I won’t go into any detail and just leave you with the code that follows:

def main(argv):
    inputfile = ''
    outputfile = ''
    usage_message = "Usage: assembler.py -i <inputfile> -o <outputfile>"

    try:
        opts, args = getopt.getopt(argv, "hi:0:", ["help", "ifile=", "ofile="])
    except getopt.GetoptError:
        print(usage_message)
        sys.exit(2)
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print(usage_message)
            sys.exit()
        elif opt in ('-i', '--ifile'):
            inputfile = arg
        elif opt in ('-o', '--ofile'):
            outputfile = arg

    if not inputfile:
        print(usage_message)
        sys.exit(2)

    # If only input file given default output file to <inputfile>.bin
    if inputfile and not outputfile:
        outputfile = inputfile.split('.')[0]
        outputfile += '.bin'

    with open(inputfile, 'r') as ifh:
        program_text = ifh.read()
    ifh.close()

    # Assemble program
    assembler = Assembler(program_text)
    machine_text = assembler.parse()

    # Write output file
    with open(outputfile, 'w') as ofh:
        ofh.write(machine_text)
    ofh.close()

    # Exit message
    print(f"Assembled: {inputfile} and wrote machine code to {outputfile}")


if __name__ == '__main__':
    main(sys.argv[1:])

If you modify the code, as shown above, you now have a fully functional TIny-P assembler. We can now write our assembly code in a file, run the assembler on it, and get a bin file filled with machine code for the Tiny-P.

Conclusion

This has been a very long post. Indeed it took me all day to complete. But we have an assembler for our Tiny-P processor. You may be wondering how we get the machine code from the bin file into the Tiny-P’s memory. This is the job of the programmer/loader. In our next installment, we will look at developing a simple loader program.

Until next time, Happy Coding!

Resources

You can find the code for this installment here: https://github.com/Monotoba/Building-Machines-In-Code.git

Information on the getopt module can be found at the following sites:

or, https://www.programcreek.com/python/example/121/getopt.getopt,

https://docs.python.org/3/library/getopt.html

You may also find this site on writing assemblers and compilers interesting: https://www.plantation-productions.com/Webster/RollYourOwn/index.html

Series Navigation<< Building Machines In Code – Part 4Building Machines In Code – Part 6 >>

Leave a Reply

Your email address will not be published.