Mochaccino Sprint 1
Share
Explore

icon picker
Grammar and Definition

There is no right answer, just different flavors of wrong.
We need a way to represent the grammar of our language, one that can be used to guide the design of the scanner and parser, and also define the rules of the language. But before we get to representing our language, we first need to understand the language itself. Let’s first find out what a grammar is, and then what type of grammar our language falls under. Then, we’ll delve into different ways to represent the grammar for our language.

What Is a Grammar?

The grammar of our language is a set of rules that produce strings valid in the language’s syntax. Each rule is known as a production. Each production has a left-hand side (nonterminal) and right-hand side (expansion). You could liken production rules to words being defined in a dictionary:

nonterminal 🠒 expansion
ˬ ˬ
transmogrify 🠒 transform in a surprising or magical manner
Each expansion can contain terminals or non-terminals.
A terminal is a word that actually appears in the language, like “func”, “struct”, “(“, etc.
A non-terminal is a word that is defined elsewhere in the grammar, like function_declaration, etc.
Let’s look at some examples of production rules:
func_decl 🠒 "func" identifier "(" arguments ")" "{" body "}"
identifier 🠒 “a” | “b” | “c”
...
The above productions define the grammar for a function declaration. It says that a function declaration occurs when the word “func” is followed by an identifier (whose rules are defined elsewhere) followed by an opening parenthesis, arguments (whose rules are also defined elsewhere), followed by a closing parenthesis, followed by– you get the idea.
If you substitute a nonterminal with text that conforms to its rule, then that text is valid in your language’s syntax. Different types of grammars require different ways of defining production rules, as we’ll learn in the next section.

Types of Grammars

According to the , these are the types of formal grammars:
Chomsky Hierarchy of formal grammars
Starting with regular languages, as we go up the hierarchy, we notice that:
production rules become increasigly complex
the grammars defined offer more flexibility
Fortunately for us, Mochaccino and many other languages we see today are context-free grammars (CFGs).

Grammar Representation Metalanguages

The most common way to represent grammars for CFGs is in the Backus-Naur Form, invented by John Backus. Over time, the cumulative success of BNF has led to many improved variants being produced. Let’s look at different ways to describe the grammar for a language in which the statements var x = 1 or var x = 2 are valid.

BNF: Backus-Naur Form

<var_decl> ::= "var" identifier "=" ("1"|"2")

EBNF: Extended Backus-Naur Form

var_decl = "var" identifier "=" ("1"|"2");

ABNF: Augmented Backus-Naur Form

var_decl = "var" identifier "=" ("1"/"2");
Essentially, each variant of BNF offers its own syntax for the same set of features:
grouping
concatenation
repetition
built-in terminals
precedence
RegEx
And, at time of writing, there isn’t really an established standard of each variant, so whenever you define a grammar using BNF, make sure to include a file that specifies the syntax of the BNF variant you are using. That said, let’s now get to work representing Mochaccino in EBNF.

EBNF Grammar Represenation

EBNF Flavour Definition

Here’s the markdown file the Infinitum Team is using to define their flavour of EBNF:
# EBNF Definition
This file clarifies the syntax of the Extended Backus-Naur Form (EBNF) metalanguage used to denote Mochaccino's grammar rules.

## Option
Symbol: `|`

Example:
```js
term = "A" | "B" // A or B
term2 = ("A" | "B") "C" // A or B, followed by C
```

## Repetition
Symbol: `term*` or `term+` or `term{6}`

Example:
```js
term = "A"* // A, repeated zero or more times
term2 = "B"+ // B, repeated at least once
term3 = ("A" "B"){6} // A followed by B, and the set is repeated exactly 6 times
```

## Grouping
Symbol: `(term1 term2)`

Example:
```js
term = ("A" "B") | "C" // A followed by B, or just C
```

## Concatenation
Juxtaposition implicitly denotes concatenation.

## Terminals
The definitions for the following terminals are as such (RegEx expressions are included in some definitions):

- `NUMBER`: whole numbers of any number of digits, decimals are strictly not included
- `NUMBER = (1|2|3|4|5|6|7|8|9|0)*;`
- `STRING`: a line of unrestricted text enclosed in double or single quotes, whitespace allowed
- `STRING = ('"' char* '"') | ("'" char* "'")`
- `char` is known to be any character, uppercase or lowercase, number, symbol, or whitespace
- `TYPE`: a type literal which follows the same rules as `TEXT`
- `TYPE = TEXT;`
- `TEXT`: an identifier-safe text which can include numbers (though not as the first character) and underscores, but not whitespaces
- `TEXT = ["_"] {a-zA-Z} (NUMBER | [a-zA-z])*;`
You could define your own flavour if you want, but try not to include RegEx in it too much. After all, this is a context-free grammar we’re building, so let’s have some class.

EBNF Representation

Generic

Let’s slowly build up our EBNF representation, starting from the highest level, conventionally known as “program”. The program nonterminal is sort of like the entry point to our grammar. Every piece of code is evaluated against this nonterminal.
program = (statement | block)*;
This means that our program can be a block (struct/module/function, etc.), or statement (include statement, variable declaration, etc.) repeated zero or more times. Now, we need to define the different types of statements, blocks, and expressions.
statement = package_decl
| include_decl
| import_decl
| assignment
| debug_flag
| struct_annotation
| kw_ok
| kw_notok
| expression
;
Note
In our case, we put keywords as nonterminals, even though we don’t have to. We prefixed keywords with kw_ so that any changes to the keywords can be reflected across multiple production rules.
block = dock_block
| module_block
| struct_block
| func_decl
| if_block
| for_loop
| while_loop
| prop_block
;

expression = binary_exp
| unary_exp
| comparative_exp
| property_access
| item_access
;
Also, our EBNF file is going to be filled with nonterminals, so it’s probably a good idea to use comments to organise different parts of the file. We’ll put all the above nonterminals under the generic section:
###################
# GENERIC
###################
Now, let’s continue working in the generic section. Remember, our type annotations can include type arguments and nested type annotations (
type_annotation = "<"
TYPE
["," type_annotation]
">"
;
Let’s also include the various operators:
arithmetic_op = "+" | "-" | "*" | "/" | "%";
logical_op = "||" | "&&" | "<=" | ">=";
comparative_op = "==" | "!=";
unary_op = "!" | "++" | "--" | "-" | "+";
And a little something:
params = (
([kw_named] identifier [type_annotation])
("," [kw_named] identifier [type_annotation])*
)*
;

Literals and Keywords

Before we start defining the other nonterminals, let’s first list out the keywords in Mochaccino as nonterminals:
kw_ok = "ok";
kw_notok = "notok";
kw_named = "named";
kw_package = "package";
kw_include = "include";
kw_from = "from";
kw_as = "as";
kw_var = "var";
kw_return = "return";
kw_async = "async";
kw_module = "module":
kw_struct = "struct";
kw_func = "func";
kw_dock = "dock";
kw_extends = "extends";
kw_implements = "implements";
kw_if = "if";
kw_elif = "elif";
kw_else = "else";
kw_for = "for";
kw_while = "while";
kw_prop = "prop";
kw_get = "get";
kw_set = "set";
While we're at it, let's also define the rules for literals (null, true/false, "string", 10, [], {}).
################################
# LITERALS #####################
################################
literal = NUMBER
| STRING
| "true"
| "false"
| "null"
| array_literal
| map_literal
;
value = literal | identifier;
array_literal = "[" (
(value)
| "," (value)
)*
"]"
;
map_literal = "{" (
(value ":" value)
| "," (value ":" value)
)*
"}"
;
kw_ok = "ok";
kw_notok = "notok";
kw_named = "named";
kw_package = "package";
Share
 
Want to print your doc?
This is not the way.
Try clicking the ⋯ next to your doc name or using a keyboard shortcut (
CtrlP
) instead.