Mochaccino Sprint 1

Syntax Highlighting

Everybody should learn to program a computer... because it teaches you how to think.

- Steve Jobs

One of the great things about VSCode is that it is so extensible, and we’re going to use that to our advantage by adding syntax highlighting support for our grammar on VSCode.

Overview

According to VSCode’s Syntax Highlight Guide, these are the steps that need to be taken:

Create tokenisation file

Provide theming

grammars Contribution Point

Let’s set up another directory in our project:

🗀 grammar
    🗎 ebnf_definition.md
    🗎 grammar.ebnf
🗀 vscode
    🗎 package.json
    🗀 syntaxes
        🗎 mochaccino.tmGrammar.json
🗎 LICENSE
🗎 README.md

The vscode directory is where we will develop all our VSCode-related tools, such as a syntax highlighter, analyser, and linter. Just a quick heads-up: VSCode extensions run as JavaScript, so they are normally written in JavaScript or TypeScript. Choose your implementation language wisely! Fortunately for us, we can still write VSCode extensions in Dart, because Dart can compile to JavaScript. Another thing: we are, against our better judgement, going to develop our syntax highlighting grammar without Yeoman, because we just love venturing into uncharted territory, don’t we?

package.json

For your package.json file, you would want to use the following boilerplate code:

{
  "name": "mochaccino",
  "version": "0.0.1",
  "engines": {
    "vscode": ">=0.9.0-pre.1"
  },
  "publisher": "Infinitum Labs Inc",
  "contributes": {
    "languages": [
      {
        "id": "mochaccino",
        "aliases": ["Mochaccino", "mocc", "Mocc"],
        "extensions": [".mocc", ".mochaccino"]
      }
    ],
    "grammars": [
      {
        "language": "mochaccino",
        "scopeName": "main.mocc",
        "path": "./syntaxes/mochaccino.tmGrammar.json"
      }
    ]
  }
}

Replace the names and extensions with those of your language, and we can move on to the next file.
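If you swap in your own names, it is easy to end up with a grammar entry that points at a language id you forgot to rename. The TypeScript sketch below is one way to check for that. It is only an illustration: checkManifest and the interfaces are hypothetical helpers (nothing here is VSCode API), and it assumes the grammar is meant for a language declared in the same manifest.

```typescript
// Minimal shapes for the parts of package.json used in this section.
interface LanguageContribution {
  id: string;
  aliases?: string[];
  extensions?: string[];
}

interface GrammarContribution {
  language: string;
  scopeName: string;
  path: string;
}

interface Manifest {
  contributes: {
    languages: LanguageContribution[];
    grammars: GrammarContribution[];
  };
}

// Hypothetical helper: returns a list of problems found in the manifest,
// or an empty list if the languages and grammars agree with each other.
function checkManifest(manifest: Manifest): string[] {
  const problems: string[] = [];
  const languageIds = manifest.contributes.languages.map((l) => l.id);

  // Assumption: every grammar should target a language declared here.
  for (const grammar of manifest.contributes.grammars) {
    if (languageIds.indexOf(grammar.language) === -1) {
      problems.push('grammar refers to undeclared language "' + grammar.language + '"');
    }
  }
  // File extensions are written with their leading dot, e.g. ".mocc".
  for (const language of manifest.contributes.languages) {
    for (const ext of language.extensions || []) {
      if (ext.charAt(0) !== ".") {
        problems.push('extension "' + ext + '" does not start with a dot');
      }
    }
  }
  return problems;
}

// The relevant part of the boilerplate above.
const mochaccinoManifest: Manifest = {
  contributes: {
    languages: [
      {
        id: "mochaccino",
        aliases: ["Mochaccino", "mocc", "Mocc"],
        extensions: [".mocc", ".mochaccino"],
      },
    ],
    grammars: [
      {
        language: "mochaccino",
        scopeName: "main.mocc",
        path: "./syntaxes/mochaccino.tmGrammar.json",
      },
    ],
  },
};

console.log(checkManifest(mochaccinoManifest).length); // prints 0
```

An empty result only means the two contribution lists agree with each other. It does not check that the grammar file exists or that VSCode accepts the manifest.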

mochaccino.tmGrammar.json

If you look at the Syntax Highlight Guide, your tmGrammar file should look something like:

{
  "scopeName": "main.mocc",
  "patterns": [{ "include": "#expression" }],
  "repository": {
    "expression": {
      "patterns": [{ "include": "#letter" }, { "include": "#paren-expression" }]
    },
    "letter": {
      "match": "a|b|c",
      "name": "keyword.letter"
    },
    "paren-expression": {
      "begin": "\\(",
      "end": "\\)",
      "beginCaptures": {
        "0": { "name": "punctuation.paren.open" }
      },
      "endCaptures": {
        "0": { "name": "punctuation.paren.close" }
      },
      "name": "expression.group",
      "patterns": [{ "include": "#expression" }]
    }
  }
}

Ok, before you freak out and Ctrl+W this tab and close your computer and toss it out the window, let me tell you that this is a structure we are very familiar with. That's right, we've seen this structure back when we were defining grammars with EBNF. So sit down, have a sip of water, take a deep breath, and let's break this monstrosity down in the next section.

Tokenisation File

Our tmGrammar.json file is a tokenisation file: it splits source code into tokens to be highlighted, and each token is assigned a scope. In a separate file, we will define the colour of each scope.

The tokenisation file can be read as a JSON representation of our EBNF grammar with one additional feature: RegEx. RegEx, short for regular expressions, is a very handy tool that every developer should know. You’ve probably tried creating a programming language with RegEx yourself, and it probably didn’t go so well, did it? We’ve tried that a few times, too (6 times, to be exact), and it wasn’t long before the RegEx rules became unreadable and it became difficult to progress to context-sensitive syntax.

Here, we are going to use RegEx for a more modest purpose. It will no longer be the base of our language, but rather a simple tool to add a splash of colour to our source code. So how do we translate our EBNF grammar into the JSON representation?

Let’s start with the first key, scopeName, which is the main scope of your entire program. You can use something like source.lang or main.lang.

Then comes the patterns key, which corresponds to the program nonterminal in our EBNF grammar. It contains the top-level rules of the program. Basically, every line of the program is evaluated against the rules in this list, in order, until a match is found.

program = (statement | block)*;

would become

"patterns": [{ "include": "#statement" }, { "include": "#block" }]

In EBNF, we referred to nonterminals simply by their name, but for the tmGrammar JSON, we need to prefix nonterminals with #, and we use them by pairing them with the include key. So what our patterns key is saying is “The program can be either a nonterminal called statement, or a nonterminal called block”.
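The translation is mechanical enough to sketch in a few lines of TypeScript. This is only an illustration of the naming convention. toPatterns and IncludeRule are hypothetical names, and the list of nonterminals is typed in by hand rather than parsed from the EBNF file.

```typescript
// One entry of a tmGrammar "patterns" array that references a repository rule.
interface IncludeRule {
  include: string;
}

// Hypothetical helper: turns the alternatives of an EBNF rule such as
//   program = (statement | block)*;
// into the equivalent "patterns" array by prefixing each name with "#".
function toPatterns(nonterminals: string[]): IncludeRule[] {
  return nonterminals.map((name) => ({ include: "#" + name }));
}

const programPatterns = toPatterns(["statement", "block"]);
console.log(JSON.stringify(programPatterns));
// prints [{"include":"#statement"},{"include":"#block"}]
```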

In our EBNF grammar, we defined the statement nonterminal right below the program nonterminal, right? However, for tmGrammar JSON, we need to define all our nonterminals inside the top-level repository key. For example:

...
"repository": {
  "expression": {
    "patterns": [{ "include": "#letter" }, { "include": "#paren-expression" }]
  },
  "letter": {
    "match": "a|b|c",
    "name": "keyword.letter"
  },
  ...
}
...

The keys directly inside repository (expression and letter above) are the nonterminals. To define the expansion (right-hand side) of a rule, use either the patterns key or the match key. If the expansion contains nonterminals, as is the case with expression, use patterns. If the rule expands to a terminal, use match, which takes a RegEx describing the terminal. Furthermore, every terminal should be accompanied by the name key, which gives the scope of the terminal.
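To get a feel for what a match rule and its name do, here is a deliberately simplified TypeScript sketch of a tokeniser for a single line. It is not how VSCode implements tokenisation: real TextMate grammars use a different RegEx engine and also support begin/end rules, which this sketch ignores. tokeniseLine, MatchRule, Token, and the digits rule are hypothetical names added for illustration. Only the letter rule comes from the grammar above.

```typescript
// A terminal rule: a RegEx plus the scope given to whatever it matches.
interface MatchRule {
  match: string;
  name: string;
}

interface Token {
  text: string;
  scope: string;
}

// Simplified, hypothetical tokeniser: at each position in the line it tries
// the rules in order and takes the first one that matches at that position.
// Characters that no rule matches are skipped and receive no scope.
function tokeniseLine(line: string, rules: MatchRule[]): Token[] {
  const tokens: Token[] = [];
  let position = 0;
  while (position < line.length) {
    let matched = false;
    for (const rule of rules) {
      // The sticky flag "y" makes the match start exactly at lastIndex.
      const regex = new RegExp(rule.match, "y");
      regex.lastIndex = position;
      const result = regex.exec(line);
      if (result !== null && result[0].length > 0) {
        tokens.push({ text: result[0], scope: rule.name });
        position += result[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      position += 1; // nothing applies here, move on by one character
    }
  }
  return tokens;
}

const sketchRules: MatchRule[] = [
  { match: "a|b|c", name: "keyword.letter" },    // the letter rule from above
  { match: "[0-9]+", name: "constant.numeric" }, // hypothetical extra terminal
];

console.log(JSON.stringify(tokeniseLine("a 12 c", sketchRules)));
// prints [{"text":"a","scope":"keyword.letter"},{"text":"12","scope":"constant.numeric"},{"text":"c","scope":"keyword.letter"}]
```

The scopes in the output are what a theme would later map to colours. The tokeniser itself knows nothing about colour.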

But what if the expansion has both terminals and nonterminals? Since the patterns key is an array of maps, you can use include to reference a nonterminal, and match to define a terminal, within the same array.

Theming

grammars Contribution Point
