Mochaccino Sprint 2

Overview

Everybody should learn to program a computer... because it teaches you how to think.
Before we get into writing the tokeniser, let’s first take a moment to appreciate how all the moving parts come together to give us a working programming language. The diagram below may help to elucidate the various components of our language:
[Diagram: how the syntax highlighter, tokeniser, parser, and interpreter fit together]
In the last sprint, we set up our syntax highlighter. In this sprint and many subsequent ones, we are going to be working on the compiler — the actual beating heart of the entire language.
The compiler itself is made up of a tokeniser, parser, and interpreter.

Tokeniser

The first thing we want to do is take the user's source code and produce a stream of tokens. A token contains the following data:
Lexeme
Token type
Literal (if applicable)
And also, optionally:
Line number, column number
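As a rough sketch of the shape of this data, here is what a token might look like in Python (used purely for illustration; the field and token-type names are assumptions, not the final design):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

# Hypothetical token types; the real set depends on the language's grammar.
class TokenType(Enum):
    VAR = auto()
    IDENTIFIER = auto()
    STRING = auto()
    EQUALS = auto()
    SEMICOLON = auto()

@dataclass
class Token:
    lexeme: str                        # the raw source text, e.g. 'var'
    type: TokenType                    # what kind of token this is
    literal: Optional[object] = None   # parsed value, only for literals
    line: Optional[int] = None         # optional position info
    column: Optional[int] = None

# A keyword carries no literal value; a string literal does.
kw = Token("var", TokenType.VAR)
s = Token('"hello"', TokenType.STRING, literal="hello")
```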

The parser will then take this list of tokens and produce an Abstract Syntax Tree (AST), which can then be interpreted, transpiled, or compiled. But what's a lexeme anyway? A piece of source code is made up of smaller, distinct lexemes:
var a <str> = "hello";
In this case, var, <, str, >, =, etc. are lexemes because they are small, logical chunks of the language. What isn't a lexeme?
If we were to take just ar out of var, it doesn't make any sense on its own, so it isn't a lexeme. Neither is var a, because it's too long and can be broken down further into smaller chunks. So a token essentially identifies what a certain sequence of characters stands for, where it is located in the source code, and, if it's a literal, its actual value.
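To make the idea concrete, here is a deliberately minimal scanner in Python that splits the example line into lexemes. It is only an illustration under assumed token names; the real tokeniser will be hand-written and handle many more cases (escapes, numbers, comments, error reporting):

```python
import re

# Ordered patterns: earlier entries win when two could match.
TOKEN_SPEC = [
    ("STRING",  r'"[^"]*"'),        # string literals
    ("KEYWORD", r'\bvar\b'),        # keywords before identifiers
    ("IDENT",   r'[A-Za-z_]\w*'),   # identifiers and type names
    ("SYMBOL",  r'[<>=;]'),         # single-character symbols
    ("SKIP",    r'\s+'),            # whitespace, discarded
]
PATTERN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source: str):
    """Return (token_type, lexeme) pairs, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in PATTERN.finditer(source)
        if m.lastgroup != "SKIP"
    ]

print(scan('var a <str> = "hello";'))
# [('KEYWORD', 'var'), ('IDENT', 'a'), ('SYMBOL', '<'), ('IDENT', 'str'),
#  ('SYMBOL', '>'), ('SYMBOL', '='), ('STRING', '"hello"'), ('SYMBOL', ';')]
```

Note how var a never appears as one lexeme: the scanner always emits the smallest meaningful chunks.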

Parser

The parser then takes this linear stream of tokens and produces an Abstract Syntax Tree (AST). The AST arranges the tokens in a hierarchical manner and serves a few functions, such as giving later stages a structure to traverse during a static analysis pass.
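As a rough sketch, the hierarchy the parser might build for var a &lt;str&gt; = "hello"; could look like the following (Python again for illustration; the node shapes and field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Literal:
    value: object          # the literal's runtime value

@dataclass
class VarDecl:
    name: str              # the variable being declared
    type_annotation: str   # the declared type, e.g. 'str'
    initialiser: Literal   # the expression it is bound to

# The flat token stream becomes a tree: the declaration node owns its parts,
# so an interpreter or analysis pass can walk it top-down.
tree = VarDecl(name="a", type_annotation="str", initialiser=Literal("hello"))
```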

Interpreter
