WARNING
The code shown in this document is just a temporary, overly simplified mock-up meant to visualize the process easily. In a production environment, the code is generally more complex in order to deal with all of the scenarios and conditions that could arise and to ensure proper performance.
Features of the language
First, write out how you want the language to look and function; think about what the person writing a program in your language should experience and how it should feel. Try implementing a simple program in your language purely out of imagination and start building your idea of the language from there. The simplest program could be a Fibonacci sequence or a small text game based on if-else statements.
Lexical Analysis
Lexing
The process of converting source code into tokens is called lexical analysis and it is done by a lexer.
Every single bit of the source code is converted into its corresponding token, e.g.
let x = 5 + 5; would be converted into
[
LET,
IDENTIFIER("x"),
EQUAL_SIGN,
INTEGER(5),
PLUS_SIGN,
INTEGER(5),
SEMICOLON
]
These tokens should be concretely defined, e.g. the INTEGER() token will not contain “5” but 5, i.e. the value in its actual datatype.
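For instance, one way to encode this in Rust (a sketch only; note that the simplified Token struct defined below stores the value as a String instead) is to let the token variants carry typed payloads:
// a sketch: token variants that carry typed payloads, so INTEGER(5)
// really stores the integer 5 and not the string "5"
#[allow(non_camel_case_types)]
#[derive(Debug)]
enum TokenValue { // hypothetical name, to avoid clashing with the Token struct below
    LET,
    IDENTIFIER(String),
    EQUAL_SIGN,
    INTEGER(i64),
    PLUS_SIGN,
    SEMICOLON,
}

fn main() {
    // while tokenizing, the lexeme "5" is parsed into an actual integer
    let value: i64 = "5".parse().expect("not a valid integer");
    println!("{:?}", TokenValue::INTEGER(value)); // prints: INTEGER(5)
}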
NOTE
In order to make a more extensible lexer, one that allows easy debugging by the user, we can attach a line number, column number and filename to each token, so that when the code is faulty we can easily hint to the user where the problem is.
Tokens
We need to define a data structure for the Token; the most common one would be:
enum TokenType {
    ILLEGAL,
    EOF,
    LET,
    IDENTIFIER,
    EQUAL_SIGN,
    ...
}

struct Token {
    token_type: TokenType,
    value: String,
    line: usize,
    column: usize,
    file_path: String,
}
NOTE
As you can see, there are two new special token types in the above code:
ILLEGAL (a token/character we don’t know about) and EOF (“end of file”, which tells our parser later on that it can stop). These two are very special as they both serve a special purpose for the interpreter itself, and will be very useful later on.
The above token data type also keeps track of the line, column and file it came from in order to make it easy for the end user to debug the program.
Later, when we implement the lexer itself, we can use its flow to populate these fields properly.
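For example (assuming the Token and TokenType definitions above, with a made-up file name), the token for the x in let x = 5; on the first line of a file could be built like this:
// a sketch, assuming the Token and TokenType definitions above
let tok = Token {
    token_type: TokenType::IDENTIFIER,
    value: "x".to_string(),
    line: 1,   // `x` sits on the first line of the file
    column: 5, // ...and starts at the fifth column of `let x = 5;`
    file_path: "main.mylang".to_string(), // hypothetical file name
};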
Lexer
The main job of the lexer is to go through the characters of the source code one by one and tokenize them.
In order to properly implement line number, column number and file path tracking for each token, it is generally a good idea to initialize the lexer with an IO Reader and the path of the file to which the lexer is currently attached.
A basic idea for the lexer data type would be:
struct Lexer {
    input: String,
    pos: usize,      // current position in the input, position of curr_char
    read_pos: usize, // current reading position in the input, position after curr_char
    curr_char: char,
}
WARNING
The above data structure is a temporary mock-up and should NOT be used literally in a production environment.
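A constructor along those lines might look something like this rough sketch, which reads the whole source eagerly and assumes the Lexer struct above is extended with line, column and file_path fields for position tracking:
use std::io::Read;

impl Lexer {
    // build a lexer from any IO reader plus the path of the file it is attached to
    fn new(mut reader: impl Read, file_path: &str) -> std::io::Result<Lexer> {
        let mut input = String::new();
        reader.read_to_string(&mut input)?; // read the whole source into memory
        let mut lexer = Lexer {
            input,
            pos: 0,
            read_pos: 0,
            curr_char: '\0',
            line: 1,                          // assumed extra field
            column: 0,                        // assumed extra field
            file_path: file_path.to_string(), // assumed extra field
        };
        lexer.read_char(); // load the first character (read_char is defined just below)
        Ok(lexer)
    }
}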
NOTE
In order to pursue smooth development of the language, you should write tests for a piece of functionality before you write the functionality itself.
fn read_char(&mut self) { // self -> lexer
    if self.read_pos >= self.input.chars().count() { // check whether we have reached the end of the input
        self.curr_char = '\0'; // the "NUL" character, which signifies "we haven't read anything yet" or "end of file"
    } else {
        self.curr_char = self.input.chars().nth(self.read_pos).unwrap(); // advance curr_char from `pos` to `read_pos`
    }
    self.pos = self.read_pos; // advance the current position to match the new position of curr_char
    self.read_pos += 1; // move read_pos to the next character
}
NOTE
The above function works for Rust because Rust itself offers good support for handling Unicode characters, but in other languages, like Go, this is not built in by default. In those languages you may have to implement custom functionality to properly read Unicode characters and even emojis.
fn next_token(&mut self) -> Token { // self -> lexer
    let mut tok: Token; // will hold the token we are about to return
    match self.curr_char {
        '=' => {
            tok = Token {
                token_type: TokenType::EQUAL_SIGN,
                value: self.curr_char.to_string(),
                // line, column and file_path are omitted here for brevity
            };
        }
        ':' => {
            tok = Token {
                token_type: TokenType::COLON,
                value: self.curr_char.to_string(),
            };
        }
        ... // continue for all the tokens you want to support
    };
    self.read_char(); // advance the reader to the next position
    return tok; // return the token we just read
}
At this point, you want to make sure you have already written, or start writing, tests that use actual code in the language you are building. Instead of focusing on simpler parts, make a test for a whole chunk of a program in your language and see if your interpreter can tokenize it.
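One way to structure such a test is the sketch below; it uses Rust's built-in test framework, the hypothetical Lexer::new from earlier and a made-up file name, assumes TokenType derives Debug and PartialEq, and will only pass once identifiers, numbers and the remaining single-character tokens are handled:
#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    #[test]
    fn tokenizes_a_small_program() {
        // a chunk of source code in the language being built
        let input = "let five = 5;\nlet ten = 10;";
        let mut lexer = Lexer::new(Cursor::new(input), "test.mylang")
            .expect("lexer should initialize");

        // the token types we expect next_token to produce, in order
        let expected = [
            TokenType::LET, TokenType::IDENTIFIER, TokenType::EQUAL_SIGN,
            TokenType::INTEGER, TokenType::SEMICOLON,
            TokenType::LET, TokenType::IDENTIFIER, TokenType::EQUAL_SIGN,
            TokenType::INTEGER, TokenType::SEMICOLON,
        ];

        for want in expected {
            let tok = lexer.next_token();
            assert_eq!(tok.token_type, want); // only the types are compared here, for brevity
        }
    }
}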
In order to properly read identifiers and keywords, our interpreter needs dedicated functionality to understand them.
An easy way to implement that functionality is to alter the next_token function so that it goes into a loop when the current character is a letter, until it reaches a character that is not a letter (like whitespace, an equal sign, a colon, etc.).
While matching curr_char against the various tokens, check at the very beginning whether the current character is a letter.
If yes:
- we send it through a function that
  - tracks the beginning of the identifier/keyword
  - loops until curr_char is no longer a letter
  - advances the position using the `read_char` function
  - at the end of the loop, sets the token value to the identifier/keyword it just read
- after that, we run another function that
  - takes the token value as input
  - matches it against a predefined set of keywords in order to check whether it is a keyword or an identifier
  - sets the token type to be either identifier or keyword
If no:
- we mark the current token as `ILLEGAL` until the match statement updates it
The above logic is important as it directly decides the rules you want to implement for your language, e.g. the loop above decides whether variable names in your language may contain special characters beyond plain letters.
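A rough sketch of those two helpers, assuming the Lexer above (and allowing underscores in identifiers as an example of such a rule), might look like:
impl Lexer {
    // decides which characters may appear in identifiers/keywords; this is
    // exactly where the naming rules of your language live
    fn is_letter(ch: char) -> bool {
        ch.is_alphabetic() || ch == '_'
    }

    // tracks the start of the word, loops while curr_char is a letter,
    // and returns everything read as the token value
    fn read_identifier(&mut self) -> String {
        let start = self.pos;
        while Lexer::is_letter(self.curr_char) {
            self.read_char();
        }
        self.input.chars().skip(start).take(self.pos - start).collect()
    }

    // matches the word against a predefined set of keywords and returns the
    // matching token type, falling back to IDENTIFIER
    fn lookup_keyword(word: &str) -> TokenType {
        match word {
            "let" => TokenType::LET,
            // ... other keywords of your language go here
            _ => TokenType::IDENTIFIER,
        }
    }
}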
Now we can also specify what whitespace means in our language: if it means nothing, add functionality to skip it (a skip_whitespace helper), and if it is significant like in Python (ew), add functionality to actually take the whitespace into account.
We can also either skip over newline characters or count them in.
NOTE
Usually, when you are writing an interpreter for a programming language, you can ignore newline characters (unless you want to make a Python-like language), but when making an interpreter for a configuration language (like JSON, TOML, etc.), you should NOT ignore newline characters as they are very important to the syntax.
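A minimal skip_whitespace along those lines could look like this sketch; whether '\n' is skipped here is exactly the design decision described above:
impl Lexer {
    // advances past characters that carry no meaning in the language;
    // if newlines are significant (a Python-like or configuration language),
    // remove '\n' from this list and emit a dedicated token for it instead
    fn skip_whitespace(&mut self) {
        while self.curr_char == ' '
            || self.curr_char == '\t'
            || self.curr_char == '\n'
            || self.curr_char == '\r'
        {
            self.read_char();
        }
    }
}
next_token would typically call skip_whitespace once, right at its start, before matching curr_char.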
We can also add functionality to read various data types. Reading numbers is pretty straightforward: just read until you encounter a character that is not a digit.
fn read_number(&mut self) -> String { // self -> lexer
    let start = self.pos; // keep track of the start of the number
    while self.curr_char.is_ascii_digit() { // loop while curr_char is a digit
        self.read_char();
    }
    // return the characters read between `start` and the current position
    self.input.chars().skip(start).take(self.pos - start).collect()
}
The above code only reads integers and ignores all the other numerical notations like floats, complex numbers, hexadecimal notation, etc., but in a production environment you must support those numerical data types as well.
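As an illustration of one such extension, here is a sketch of read_number that also accepts a simple decimal point (it relies on the peek_char helper introduced in the next section; hexadecimal notation, exponents and complex numbers are still left out):
impl Lexer {
    // reads an integer or a simple float such as 3.14 and returns it as a string
    fn read_number(&mut self) -> String {
        let start = self.pos; // keep track of the start of the number
        while self.curr_char.is_ascii_digit() {
            self.read_char();
        }
        // accept one fractional part: a '.' that is followed by more digits
        if self.curr_char == '.' && self.peek_char().is_ascii_digit() {
            self.read_char(); // consume the '.'
            while self.curr_char.is_ascii_digit() {
                self.read_char();
            }
        }
        self.input.chars().skip(start).take(self.pos - start).collect()
    }
}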
Extending the lexer
Now that we have the basic structure of the lexer down, we need to extend it to recognise more tokens like single-character tokens, keywords, and double-character tokens.
Handling single-character tokens and keywords is pretty easy, we have been doing similar things above.
But adding support for double-character tokens is not as simple as adding more cases to our match statement: we cannot match a string like "==" against a character (which is the type being compared in our match statement).
What we can do is add a function called peek_char() to our lexer in order to peek at what the next character is going to be, and then decide whether to return the current token as "=" or "==".
fn peek_char(&self) -> char { // self -> lexer
    if self.read_pos >= self.input.chars().count() { // check whether we have reached the end of the input
        return '\0';
    } else {
        return self.input.chars().nth(self.read_pos).unwrap(); // return the character at read_pos
    }
}
peek_char is just read_char, except it doesn't advance the lexer: self.pos, self.read_pos and curr_char are left untouched.
Most lexers have a function that peeks at the immediate next character, but oftentimes we also need to peek a few characters ahead, or even look backwards, to make sense of the source code.
Now we can easily extend the cases in our match statement to peek at the next character whenever the current character could be the start of a double-character token.
To extend this to multiple double-character tokens, it is a good idea to abstract the process into a method called make_two_char_token() (or something like that) that peeks and advances if it finds the right token, because the match arms handling double-character tokens are all nearly identical.
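A sketch of such a helper, assuming the Lexer with the position-tracking fields from earlier (EQUAL_EQUAL is an assumed variant name for the == token):
impl Lexer {
    // if the next character is `expected`, consume it and return a token of
    // type `double`; otherwise return a single-character token of type `single`
    fn make_two_char_token(&mut self, expected: char, double: TokenType, single: TokenType) -> Token {
        let first = self.curr_char;
        if self.peek_char() == expected {
            self.read_char(); // consume the second character
            Token {
                token_type: double,
                value: format!("{}{}", first, self.curr_char),
                line: self.line,                   // assumed position-tracking fields
                column: self.column,
                file_path: self.file_path.clone(),
            }
        } else {
            Token {
                token_type: single,
                value: first.to_string(),
                line: self.line,
                column: self.column,
                file_path: self.file_path.clone(),
            }
        }
    }
}
The '=' arm of the match statement then collapses to a single line such as '=' => tok = self.make_two_char_token('=', TokenType::EQUAL_EQUAL, TokenType::EQUAL_SIGN),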
Start of REPL
REPL stands for read-eval-print loop; it is basically just a console in which we can type input in our language, and it will evaluate that input and print the evaluated output back to the console.
Currently our language cannot evaluate any code, but we can tokenize it, so for now all our REPL will do is take input and print the tokenized output.
We can create a CLI argument --repl which starts the REPL and begins accepting user input.
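A bare-bones version of such a REPL might look like this sketch (it assumes the Lexer and TokenType above, that TokenType implements PartialEq, that Token derives Debug, and that next_token returns an EOF token at the end of the input; the prompt text and the "<repl>" file name are arbitrary):
use std::io::{self, Cursor, Write};

// read a line, tokenize it, print the tokens, repeat
fn start_repl() {
    loop {
        print!(">> "); // prompt
        io::stdout().flush().unwrap();

        let mut line = String::new();
        if io::stdin().read_line(&mut line).unwrap() == 0 {
            break; // 0 bytes read means EOF (e.g. Ctrl-D), so leave the loop
        }

        // feed the line through the lexer and print every token until EOF
        let mut lexer = Lexer::new(Cursor::new(line), "<repl>").unwrap();
        loop {
            let tok = lexer.next_token();
            if tok.token_type == TokenType::EOF {
                break;
            }
            println!("{:?}", tok);
        }
    }
}

fn main() {
    // start the REPL only when the program is invoked with `--repl`
    if std::env::args().any(|arg| arg == "--repl") {
        start_repl();
    }
}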
Parsing
Parsers
A parser is a program that takes our tokenized input and builds something that closely represents the structure of the program written in our language.
The parser represents the source code in an internal data structure called the Abstract Syntax Tree (AST). The reason it is called abstract is that it hides away certain details that are visible in the source code, like semicolons, commas, parentheses, brackets, colons, etc.
Each implemented programming language has its own unique AST, because the AST is something internal to the language and there is no general implementation or representation of it.
var input = 'if (3 * 5 > 10) { return "hello"; } else { return "goodbye"; }';
var tokens = MagicLexer.parse(input);
MagicParser.parse(tokens);
{
type: "if-statement",
condition: {
type: "operator-expression",
operator: ">",
left: {
type: "operator-expression",
operator: "*",
left: { type: "integer-literal", value: 3 },
right: { type: "integer-literal", value: 5 }
},
right: { type: "integer-literal", value: 10 }
},
consequence: {
type: "return-statement",
returnValue: { type: "string-literal", value: "hello" }
},
alternative: {
type: "return-statement",
returnValue: { type: "string-literal", value: "goodbye" }
}
}
The above is a simple example of how the AST does not contain source code elements like commas, semicolons and many other characters, but still closely represents the actual logic of the source code.
The parser is very useful as it can detect faulty code and raise syntax errors when the source code does not obey the grammar of the language.
NOTE
An alternative to writing your own parser is using
parser generators, which generate a parser based on formal grammar rules that you provide. We can feed a context-free grammar (CFG) to the parser generator and it produces a parser that understands source code conforming to that grammar. The most commonly used formats for CFGs are the Backus-Naur Form (BNF) and the Extended Backus-Naur Form (EBNF).
Writing our own parser
Now this is the part where the implementation of a programming language and the implementation of a configuration language like JSON or TOML usually start to diverge.