[Talk] ElixirConf EU 24 — Let's write a toy language from scratch in Elixir
My talk “Let’s write a toy language from scratch in Elixir” is out on YouTube:
You can find the slides here: /talks/lets_write_a_toy_language_elixirconf-eu-24
Ovo is not in use anymore today, but I carried many of the techniques I learned building it with me and put them to use in my startup Alzo, even if I decided against having a Turing-complete language for configuration. Keep it simple ;-).
Transcript:
Welcome to this session called “Let’s Write a Toy Language from Scratch.” We’ll see how to get from zero, that is, from text input all the way to execution, plus a few explorations in programming models. This project started as a low-code data manipulation UI in TypeScript. The goal was to write little data accessors and mutators in this language called Ovo with a graphical node editor, and to write little API endpoints by composing these little programs. I stopped the TypeScript version and now work on it in Elixir.
Here are two samples of Ovo programs. You can see that the syntax is somewhat Elixir-like, but not totally. On the left, there’s a simple example where “bar” is a function that takes one argument, captures the value of “F,” and, when executed, evaluates to nine. On the right, there’s the classic recursive Fibonacci example, which shows that I don’t have infix operators, except for assignment; at least I didn’t at the start. I changed my mind later, but we’ll get to that. This evaluates to eight. We’ll see what the bang is about later too.
To get to execution, I followed five big steps. The first step was tokenizing my input, to get from program text to a list of tokens by walking the characters recursively. Then I parsed that list of tokens, with parser combinators, into an abstract syntax tree that represents the program. Then I worked on a little printer that walks that tree to go back to the text form of the program, and I explored a few rewrites and transforms: if you want to modify your program before running it, the syntax tree is a good place to do it, with pattern matching. Finally, we’ll see how to evaluate that tree until there’s a single node left, which is the output value, with a tree-walking interpreter.
So, tokenizing. You’re not forced to do it; some people like to parse text directly, but I find it more straightforward to work with tokens. It’s a step where I let go of whitespace and formatting, and if you were to build proper error handling, you would attach metadata to the tokens, like line and column numbers, to inform the user about errors.
My tokenizer is quite simple. It’s only two recursive functions, “work” and “act,” with a lot of clauses for “work.” I start in an undefined state, some tokens trigger state changes, and we recurse through “work” and “act” until there are no characters left. Here, a character is what Elixir’s String module gives us, a grapheme.
When the state changes, I emit a token. The thing to watch for is: are we inside some long token, like a string or a number, or do we have some short token, like this “error,” which emits directly? In the code, you can see that I have a lot of very simple clauses that call “act” directly, then some more complex ones with logic for being inside a string or inside a number. When the mutual recursion between “work” and “act” ends, we have successfully tokenized our input.
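To make the shape of this concrete, here is a minimal sketch of such a work/act pair. It is a hedged reconstruction, not the talk’s actual code: the states, token kinds, and clause set are all assumptions.

```elixir
defmodule Toy.Tokenizer do
  # Minimal sketch of the work/act mutual recursion described above.
  # Token kinds and clauses are illustrative, not Ovo's actual set.
  def tokenize(input), do: work(String.graphemes(input), :undefined, "", [])

  # No characters left: flush whatever is in the buffer and finish.
  defp work([], state, buf, tokens), do: Enum.reverse(act(state, buf, tokens))

  # Short tokens emit directly when we're not inside a longer one.
  defp work(["(" | rest], :undefined, _buf, tokens),
    do: work(rest, :undefined, "", [{:open_paren, "("} | tokens])

  # A quote triggers a state change into or out of a string.
  defp work(["\"" | rest], :undefined, _buf, tokens),
    do: work(rest, :in_string, "", tokens)

  defp work(["\"" | rest], :in_string, buf, tokens),
    do: work(rest, :undefined, "", act(:in_string, buf, tokens))

  # Inside a long token, accumulate characters into the buffer.
  defp work([c | rest], :in_string, buf, tokens),
    do: work(rest, :in_string, buf <> c, tokens)

  # Whitespace closes whatever token we were building.
  defp work([c | rest], state, buf, tokens) when c in [" ", "\n"],
    do: work(rest, :undefined, "", act(state, buf, tokens))

  # Anything else accumulates as a symbol-ish token.
  defp work([c | rest], _state, buf, tokens),
    do: work(rest, :symbol, buf <> c, tokens)

  # act/3 emits a token for the current state and buffer.
  defp act(_state, "", tokens), do: tokens
  defp act(:in_string, buf, tokens), do: [{:string, buf} | tokens]
  defp act(_state, buf, tokens), do: [{:symbol, buf} | tokens]
end
```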
The second step is parsing this list of tokens to answer a simple question: does the token list form a valid program? The tokenizer is not concerned with whether the input is valid or not. By answering that question, we build a syntax tree that represents the program, abstracted away from its concrete syntax. Here we see that a call to “F” without arguments is parsed into a root node whose single child is a function-call node holding a symbol of value “F.” The call node has no children of its own because the call has no arguments.
The way we parse the list of tokens is by combining very tiny parsers into larger parsers. We can see that parsing an expression is a matter of successfully parsing any of a set of shapes: a lambda, an assignment expression, an infix expression (which I finally added), a condition, a call, a parenthesized expression, or a value. We can see that a value is any of a number, a string, a symbol, or a list. We can also see that I work with anonymous functions, hence this dot-call syntax, and we’ll see why in a minute.
A successful parser emits AST nodes, the abstract syntax tree we saw earlier. We can see that if this parser for a multiple-element list matches, we emit a node with a kind of list and the child nodes as children. What I like about these parser combinators is that they read like French or English, if you want: parsing a multiple-element list is a matter of matching an opening bracket, then, multiple times, maybe an expression followed by a comma, then a last expression (since it’s a multiple-element list), then a closing bracket. If we match all of that, we emit; otherwise we fail, and the next parser in the list is tried. Failing parsers, of course, could inform the user of the position and nature of the errors, had I added that metadata at the start.
If a parser is a function that goes from tokens to AST nodes, combinators are higher-order functions on parsers that ultimately return parsers. We can see a sequencing combinator here that takes a first parser and a second parser, say an expression followed by whitespace: if the first parser succeeds, and the second parser succeeds on the remaining input of the first, we emit the concatenated results of both parsers.
Usually, you would use NimbleParsec in production code. It works with macros, so the syntax is a bit lighter, and it’s compile-time, so there’s no runtime cost, whereas I create a lot of functions on the fly. Still, it’s quite rewarding to implement parser combinators, and I encourage you to try. Here’s the “either” parser combinator, also called “choice,” which is a boolean “or” between two parsers. With these, we can parse larger expressions into larger trees, like this list containing the number one, a string, and a function that takes an argument, ignores it, and returns an integer; this reads quite easily in tree form.
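As an illustration, a minimal version of these combinators could look like the sketch below, under the assumption that a parser maps a token list to {:ok, nodes, rest} or :error; the names and shapes are mine, not the talk’s.

```elixir
defmodule Toy.Combinators do
  # A parser is a function from a token list to {:ok, nodes, rest} or :error.

  # sequence/2: succeed only if the first parser succeeds and the second
  # succeeds on the remaining input of the first, concatenating both results.
  def sequence(p1, p2) do
    fn tokens ->
      with {:ok, nodes1, rest1} <- p1.(tokens),
           {:ok, nodes2, rest2} <- p2.(rest1) do
        {:ok, nodes1 ++ nodes2, rest2}
      else
        _ -> :error
      end
    end
  end

  # either/2, a.k.a. choice: a boolean "or" between two parsers.
  def either(p1, p2) do
    fn tokens ->
      case p1.(tokens) do
        {:ok, _, _} = ok -> ok
        :error -> p2.(tokens)
      end
    end
  end

  # A tiny leaf parser: match a single token of the given kind.
  def token(kind) do
    fn
      [{^kind, value} | rest] -> {:ok, [{kind, value}], rest}
      _ -> :error
    end
  end
end
```

This is also why anonymous functions and the dot-call syntax show up everywhere: a combined parser like either(token(:number), token(:string)) is itself a function that you invoke as parser.(tokens).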
Now that we have a tree, we can print it back to text. I didn’t make a proper pretty-printer, but it was still useful for the next steps, because any printer gives you a standardized form of your input text. Maybe it adds line breaks at places a user wouldn’t, but at least it’s always coherent. You can use that in tests to debug your parser at the start, and to mutually debug your parser and your printer, since parsing some input should be strictly equivalent to parsing its printed form back. A better way would be to implement Elixir’s Inspect protocol, which is made exactly for this kind of use, but the printer was still useful.
The way I print is by walking the full AST with a function clause for every kind of node, so it’s quite simple. Printing a list is an opening bracket, then the child nodes mapped over with “print node,” separated by commas, then a closing bracket. You can see I have clauses for every kind of node, yet it stays very short.
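A sketch of what such a printer can look like, with assumed node shapes rather than Ovo’s actual ones:

```elixir
defmodule Toy.Printer do
  # One clause per node kind; the tuple shapes here are assumptions.
  def print({:root, children}), do: Enum.map_join(children, "\n", &print/1)
  def print({:list, items}), do: "[" <> Enum.map_join(items, ", ", &print/1) <> "]"
  def print({:call, name, args}), do: print(name) <> "(" <> Enum.map_join(args, ", ", &print/1) <> ")"
  def print({:symbol, name}), do: name
  def print({:number, n}), do: to_string(n)
  def print({:string, s}), do: inspect(s)
end
```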
The fourth step I took was to think about program rewrites. Elixir is very well suited for this, thanks to pattern matching. The idea is to recurse over the whole AST, just like in the printer, but instead of printing, you take the opportunity to match on specific patterns. Here, I match on an infix node with a value of “op” (the operator) and left and right child nodes, and I rewrite it to the built-in function call it is equivalent to, since I didn’t have those infix operators at the start. You just write, visually, the pattern you want to match and the shape you want to get, and you rewrite the child nodes too, so the whole AST gets rewritten.
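Here is a hedged sketch of such a pass, reusing the node shapes assumed in the printer sketch above; the “add” built-in is also an assumption:

```elixir
defmodule Toy.Rewrite do
  # An infix node becomes the equivalent built-in call, and we recurse
  # into children so the whole tree gets rewritten.
  def rewrite({:infix, "+", left, right}),
    do: {:call, {:symbol, "add"}, [rewrite(left), rewrite(right)]}

  def rewrite({:call, name, args}),
    do: {:call, name, Enum.map(args, &rewrite/1)}

  def rewrite({:list, items}), do: {:list, Enum.map(items, &rewrite/1)}

  # Leaf nodes pass through unchanged.
  def rewrite(node), do: node
end
```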
Maybe you think better in nested structures than in text; in any case, I find it a neat way to modify a program. You can do this for more complex examples too. It isn’t very efficient here, but it illustrates matching on more complex input: we match on the idea of mapping over some input, then filtering the result of that map. The mapped list is the output of the map and the input of the filter, and we can express that visually in our function clause, since the same symbol appears as the map’s result and as the filter’s argument, which is exactly what this clause matches. Then we just construct the equivalent AST of this big expression, and we have rewritten code. We can see it here, and seeing this, we can appreciate the fact that Elixir has a great macro system, so we don’t have to do this; but still, it works. By recursing over the whole tree, you can transform your programs that way.
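A sketch of that idea, again under assumed node shapes; the fused “filtermap” built-in is purely illustrative:

```elixir
defmodule Toy.Fusion do
  # Walk a list of sibling nodes, fusing "mapped = map(f, input)"
  # followed by "filter(pred, mapped)". Repeating `name` in the pattern
  # is what expresses, visually, that the filter's input is the map's
  # output: both occurrences must match the same value.
  def rewrite([
        {:assignment, {:symbol, name}, {:call, {:symbol, "map"}, [f, input]}},
        {:call, {:symbol, "filter"}, [pred, {:symbol, name}]}
        | rest
      ]) do
    [{:call, {:symbol, "filtermap"}, [f, pred, input]} | rewrite(rest)]
  end

  def rewrite([node | rest]), do: [node | rewrite(rest)]
  def rewrite([]), do: []
end
```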
Before we talk about evaluation: everything we just saw has deeply practical uses, in addition to helping you think in trees. I encourage you to watch Ricardo Binette’s talk from Code BEAM EU 23, titled “Craft Your DSL with NimbleParsec and Ecto,” where he explains in a very interesting way why parsing things with combinators like this is useful.
Evaluation: now that we have our full program tree, rewritten in some cases by my little transform passes, we can match and recurse on the AST until we have a single value left. Some nodes, like assignment, update an environment with the assigned value, and some nodes just recurse a level deeper. You can see how a condition is implemented too. A condition node has exactly three children: a predicate, a first branch, and a second branch. We recurse to get the value of the evaluated predicate: if it’s true, we evaluate the first branch; if it’s false, we evaluate the second branch; and we always return the environment along with the resulting value of the evaluation.
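A minimal sketch of such evaluator clauses, with assumed node shapes and a plain map standing in for the environment (the talk uses processes, as described next):

```elixir
defmodule Toy.Interpreter do
  # evaluate/2 always returns {env, value}.
  def evaluate({:number, _} = node, env), do: {env, node}
  def evaluate({:bool, _} = node, env), do: {env, node}

  # A symbol evaluates to whatever the environment binds it to.
  def evaluate({:symbol, name}, env), do: {env, Map.fetch!(env, name)}

  # Assignment updates the environment with the assigned value.
  def evaluate({:assignment, {:symbol, name}, expr}, env) do
    {env, value} = evaluate(expr, env)
    {Map.put(env, name, value), value}
  end

  # A condition has exactly three children: evaluate the predicate,
  # then one branch or the other, always returning {env, value}.
  def evaluate({:condition, [predicate, branch1, branch2]}, env) do
    case evaluate(predicate, env) do
      {env, {:bool, true}} -> evaluate(branch1, env)
      {env, {:bool, false}} -> evaluate(branch2, env)
    end
  end
end
```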
My environments and scopes are processes; in fact, they are Agents. I wanted to juggle with processes, and speaking of that, you shouldn’t use Agents for this; there are lots of other solutions, like ETS, but I wanted to play. There’s a super cool talk by Bryan Underwood, who is today’s host, from Code BEAM last October, titled “You Might Not Need Processes,” and I also encourage you to watch it. It was very interesting.
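For illustration only, an Agent-backed scope with lookup through a parent could be sketched like this; it is a toy, as the talk itself warns:

```elixir
defmodule Toy.Scope do
  # Each scope is an Agent holding local bindings and an optional parent.
  def start_link(parent \\ nil),
    do: Agent.start_link(fn -> %{bindings: %{}, parent: parent} end)

  def put(scope, name, value),
    do: Agent.update(scope, &put_in(&1, [:bindings, name], value))

  # Look a name up here, then recursively in the parent scope.
  def get(scope, name) do
    %{bindings: bindings, parent: parent} = Agent.get(scope, & &1)

    case Map.fetch(bindings, name) do
      {:ok, value} -> value
      :error when parent != nil -> get(parent, name)
      :error -> nil
    end
  end
end
```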
When we have function calls, we either call a built-in function or a function that the user defined in the language. This is quite crucial: if you choose your built-ins well, you can do more of the work inside the language; if they are chosen poorly, you keep adding built-ins and working in the host language. In that respect, I didn’t choose my built-ins very well. As you can see, I have quite a lot of them, when I could have implemented filter, or map, and that kind of thing inside the language by just providing construction and deconstruction operators, just as you would for a toy Lisp. Built-in functions pattern match on the evaluated arguments: the built-in concat matches two strings or two list nodes and rewrites the node into the result of the concatenation. This is how nodes get reduced to a single value at the end, and this is how we evaluate a full program like this one: map over a list of three elements, adding five to each, then filter for values less than seven, and we get a list with a single integer node of value six.
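Such a built-in could be sketched as follows, pattern matching on already-evaluated nodes (shapes assumed):

```elixir
defmodule Toy.Builtins do
  # concat matches two strings or two lists and rewrites the call node
  # into the node holding the result of the concatenation.
  def builtin("concat", [{:string, a}, {:string, b}]),
    do: {:string, a <> b}

  def builtin("concat", [{:list, a}, {:list, b}]),
    do: {:list, a ++ b}
end
```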
What do we have at this point? We have a thing called Ovo that evaluates little programs. It would need a lot of fuzzing, because every new program uncovers a few holes in the parser or the interpreter, and it would need error handling, but it works very well as a creative outlet. It doesn’t do anything special yet: I have the steps to build a language, but it doesn’t do anything interesting. While thinking about what it could do, I believe the BEAM, with its actors, state handling, concurrency, and distribution, influenced my thinking. So I went on to explore a stateful system.
The first feature of the stateful system was the ability to “shake” functions. When you prefix a function with a bang, it internally keeps a stack of its results that you can’t access directly, but that you can “shake” later to pop a value off the top of the stack. When we call this shakeable function, its stack gets pushed with 11, then 17. When we call add with two shakes of it, we pop 17 and 11, but we push 28, since it’s the result; and when we call add with 28 and the 28 that was on the stack, the stack is emptied and we get 56 as output. Of course, it’s your responsibility not to shake an empty stack.
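The hidden result stack itself is easy to sketch with an Agent; the push/shake names are assumptions, not Ovo’s internals:

```elixir
defmodule Toy.Shakeable do
  # Each !function gets its own Agent holding a list used as a stack.
  def start_link, do: Agent.start_link(fn -> [] end)

  # Every call to the shakeable function pushes its result.
  def push(stack, result),
    do: Agent.update(stack, fn results -> [result | results] end)

  # Shaking pops the most recent result off the top. Shaking an empty
  # stack crashes here, which mirrors "it's your responsibility" above.
  def shake(stack),
    do: Agent.get_and_update(stack, fn [top | rest] -> {top, rest} end)
end
```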
Building on that, I started to think about program runners. There’s a registry of programs, and programs are registered as runners: a little description, the code, and the AST. Every runner has a stack of results too, and from Ovo you can shake a runner by its hash to pop a result off its stack, so you can have interaction between multiple programs, which you can compose. The hash of this multi-runner program, for instance, is “I9JP”: the first five characters of the MD5 of its printed AST. You also get cheap program composition with “run chain,” where running the chain A, B, C on some arguments is equivalent to calling A on the arguments, then B on the result of A, then C on the result of B; a kind of composition between programs.
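Stripped of the registry and hashing, that chain is just a reduce. A hedged sketch where runners are plain functions standing in for registered programs:

```elixir
defmodule Toy.Runners do
  # run_chain([a, b, c], x) computes c.(b.(a.(x))): each runner
  # receives the previous runner's result. The real thing would look
  # runners up in the registry by hash instead.
  def run_chain(runners, args) do
    Enum.reduce(runners, args, fn runner, input -> runner.(input) end)
  end
end

# Toy.Runners.run_chain([&(&1 + 1), &(&1 * 2)], 3) #=> 8
```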
I then went on to experiment with LiveView, because I write quite a lot of tree editors for work, usually in TypeScript, so I wanted to try LiveView. By walking the program AST and emitting HTML nodes and forms, we get two-way editing of a program: like in a tree editor, you can add nodes. Of course, it’s lacking features. You would need to be able to swap nodes, change their type, copy subtrees and paste them elsewhere, but it’s the start of what I would like to do. Different node types are rendered as different input types; editing the UI changes the program text, and changing the program’s text changes the UI in turn.
With this ability to graphically define runners, I went on to build a little runner dashboard. This is a live demo of a naive Ovo implementation of the “bottle song” (99 Bottles of Beer), a famous little toy-language example. There’s a joiner runner that joins strings; there’s a register runner that just logs its argument, which in Ovo means filling its stack; and there’s a filler runner that just calls the register 100 times with a decreasing value. To run the bottle song, we prefill the register with 100 results; then the full program pops this register to evaluate each bottle number. I could just recurse, like the filler does, within the full song, but that’s not the point. We can see that this is built on “invoke,” which allows you to call a runner from another runner. If you run it, every runner gets its stack filled, and we have the whole song text here. It’s quite basic, but it’s also quite neat.
The next thing I felt was that programs didn’t need to live on the same computer or node, so I started to experiment with multi-user program composition, with a fictional conversation between Alice and Bob. Bob needs a hash function, and Alice tells him: okay, just invoke “SJ2PY” at my node with a string as the first argument. If we go to Bob’s side, we can see that he made two runners. Bob’s interface is in pink, so we recognize him. If he hashes “the quick brown fox” on a node, he gets a hash, and on Alice’s side, we can see that her runner’s stack got filled with Bob’s hash. This is not a cryptographic hash function; it’s more like a hash-map-key hash function, the Jenkins hash. It was simpler than MD5, which would take a lot of pages, and I’m not even talking about SHA or bcrypt.
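For reference, the Jenkins one-at-a-time hash is short enough to sketch in Elixir; whether the demo used exactly this variant, and how it derives its five-character identifiers, are assumptions:

```elixir
defmodule Toy.Jenkins do
  import Bitwise

  @mask 0xFFFFFFFF

  # Jenkins one-at-a-time: fold each byte into a 32-bit state,
  # then apply the final avalanche steps.
  def one_at_a_time(string) do
    h =
      for <<byte <- string>>, reduce: 0 do
        h ->
          h = (h + byte) &&& @mask
          h = (h + (h <<< 10)) &&& @mask
          bxor(h, h >>> 6)
      end

    h = (h + (h <<< 3)) &&& @mask
    h = bxor(h, h >>> 11)
    (h + (h <<< 15)) &&& @mask
  end
end
```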
With his hash, Bob is confident, and he now proceeds to hash his password with this second runner, which takes a secret argument. He runs it and gets a hash, but on Alice’s side, we can see that Alice swapped out her runner “SJ2PY”: she found another Ovo program, with a magic value here, that hashes to the same identifier, and this allowed her to own Bob and log his password.
That’s it for the demo. There’s still one thing I would like to build on top of this, something more concurrent and time-based: the ability to link runners in a kind of graph, where runners depend on each other’s inputs and outputs, but computation advances a fixed number of reductions at each beat. Maybe a program takes three beats to run, maybe a program needs ten beats to run, so runners would have to wait for others to complete. Maybe some queues of inputs would fill up, maybe some queues would be consumed too fast. I think of it as a kind of crossover between Patapon, which was a very nice rhythm game on the PSP in 2008, and Erlang’s Observer, with its graphs of processes and reduction counts.
That’s it for today. The code is on GitHub. It’s not very clean, it’s an experimental project, but feel free to fork it, and feel free to drop me a message if you want to chat about programming models or that kind of thing. Thank you for your attention.
Okay, thank you very much. First, I just wanted to check: does anybody have any questions? Maybe post them in the Q&A. Oh, we do have a question. Aron asks, and I’ll read it out for the recording: given a parsing problem, when do you use LLVM or lex/yacc, versus NimbleParsec, versus a self-implementation? Okay, I don’t have experience with yacc, but if I recall correctly, you have to feed it a BNF grammar.
I was going to ask a quick question: have you ever played the game TIS-100? It’s this parallel-computing game; it’s fun, maybe check it out. Your visual interface very much reminded me of what the game looks like.