Lucas Sifoni

Hosting a small language (Ovo2) from scratch in Elixir, pt 2
Tokenization

elixir parsing explorations


Full code hosted on GitHub

To parse our language, we will first define a rough map of the syntax.

A program is made of a list of expressions.

An expression is either:
  - an assignment (returns the assigned value)
  - a function call
  - a lambda / function declaration (since we don't have a real distinction between the two in ovo)
  - a primitive value (number, string, list)
  - a symbol

We can note that there’s no syntax for maps in this list. One of my goals is to encourage myself to use ovo2 to reduce complex data to strings, and to avoid using it to produce other data shapes. The end goal remains sandboxed data reduction for printable templates.

I did not mention infix operators in this list either. My last TypeScript prototype had infix operators, but they felt clunky when I built the prototype visual editor, and I now prefer having arithmetic as functions. String concatenation as an infix operator will also be removed, in favor of functions operating on strings, and so will the access operator.

Maps are normalized to be stringly typed.

So a sample of the complete (updated) syntax could be:

# Ovo2 program, having as input %{"name" => "Martin", "age" => 44}
foo = 5
bar = 6
age = add(access(data, `age`), bar)

say_hi = \name, age ->
  join([name, `has the age`, to_string(age)], ``)
end

say_hi(access(data, `name`), age)

fibs = \a -> 
    if greater_or_equals(a, 2) then
        add(fibs(subtract(a, 1)), fibs(subtract(a, 2)))
    else 
        1
    end
end

fibs(10)

To me, there’s something fishy (or… lispy, maybe) slowly creeping into scope.

In the previous TypeScript iteration of ovo2, we were closer to Elixir’s syntax (see below), but as I’m writing, it seems interesting to reduce the syntax footprint and go towards a better-fitting visual and text editing experience… slowly deprecating text out of ovo2.

fn = data°first_name
ln = data°last_name

join2 = \strs, spacer ->
 list~reduce(strs, \out, s ->
    spaced = out <> spacer
    spaced <> s
 end)
end

join2([fn, ln], \` \`)

That’s the beauty of the exercise of these posts: writing the post is the creation of what the post is about.

For the sake of simplicity, I’ll start by tokenizing the input, discriminating only between strings and non-strings. Non-strings will be parsed and classified in context afterwards. We can see that I’m going for ` as a string delimiter, and \ as an escape character. Whitespace is swallowed outside of strings.

test "Tokenizes input, discriminating strings and non-strings" do
  for {program, tokens} <- [
        {
          """
          foo = 5
          bar = `stringy\\` value`
          baz(a,b,c,d)

          """,
          [
            {:nonstring, "foo"},
            {:nonstring, "="},
            {:nonstring, "5"},
            {:nonstring, "bar"},
            {:nonstring, "="},
            {:string, "stringy` value"},
            {:nonstring, "baz(a,b,c,d)"}
          ]
        }
      ] do
    assert Ovo.tokenize(program) == tokens
  end
end

We’ll start by splitting the string into graphemes, and walking them:

defmodule Ovo do
  def tokenize(input) do
    input
    |> String.graphemes()
    |> walk()
  end
end
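
As a reminder of what we will be walking over, String.graphemes/1 splits a string into its perceived characters:

iex> String.graphemes("a = `b`")
["a", " ", "=", " ", "`", "b", "`"]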

Then, walking the graphemes, we move our tokenizer’s state between :undefined, :string, and :nonstring:

defmodule Ovo do
  def tokenize(input) do
    input
    |> String.graphemes()
    |> walk()
  end

  defp is_whitespace?(a), do: Regex.match?(~r/\s/, a)

  # Flushes the current buffer as a token (unless it is empty),
  # then keeps walking in next_state.
  def accumulate(rest, state, out, buf, next_state) do
    if buf != "" do
      walk(rest, next_state, [{state, buf} | out], "")
    else
      walk(rest, next_state, out, "")
    end
  end

  def walk(graphemes), do: walk(graphemes, :undefined, [], "")
  # An escaped backtick inside a string is appended to the buffer.
  def walk(["\\" | ["`" | rest]], :string, out, buf), do: walk(rest, :string, out, buf <> "`")
  # An unescaped backtick closes the current string.
  def walk(["`" | rest], :string, out, buf), do: accumulate(rest, :string, out, buf, :undefined)
  # Anything else inside a string accumulates.
  def walk([a | rest], :string, out, buf), do: walk(rest, :string, out, buf <> a)
  # A backtick outside of a string opens one.
  def walk(["`" | rest], state, out, buf), do: accumulate(rest, state, out, buf, :string)

  # Outside of strings, whitespace ends the current token;
  # any other grapheme accumulates as a nonstring.
  def walk([a | rest], state, out, buf) do
    if is_whitespace?(a) do
      accumulate(rest, state, out, buf, :undefined)
    else
      walk(rest, :nonstring, out, buf <> a)
    end
  end

  # End of input: flush a pending buffer and restore token order.
  def walk([], state, out, buf), do: (if buf != "", do: [{state, buf} | out], else: out) |> Enum.reverse()
end
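
A quick check in iex (assuming the module above is compiled; IEx displays the two-element tuples in keyword-list shorthand):

iex> Ovo.tokenize("bar = `stringy\\` value`")
[nonstring: "bar", nonstring: "=", string: "stringy` value"]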

We get a list of tokens, without whitespace, in two categories: strings, and non-strings. Escaping of the string delimiter and whitespace inside strings are handled. But a few shortcomings are present:

{:nonstring, "foo"},
{:nonstring, "="},
{:nonstring, "5"},
{:nonstring, "bar"},
{:nonstring, "="},
{:string, "stringy` value"},
{:nonstring, "baz(a,b,c,d)"}

At this point, if we change baz(a,b,c,d) to baz(a, b, c, d), we get this output:

{:nonstring, "baz(a,"}
{:nonstring, "b,"},
{:nonstring, "c,"},
{:nonstring, "d)"}

Before fixing that, I wanted to explore the use of macros to write list matching operations:

iex(1)> quote do
...(1)> ["a" | rest]
...(1)> end
[{:|, [], ["a", {:rest, [], Elixir}]}]

We can see that [a|b] is represented as the operator :| applied to a and b in the resulting AST.
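
Nesting a second cons cell shows the shape we will need to generate: the tail of a cons is itself a quoted list holding another :| node.

iex(2)> quote do
...(2)> ["a" | ["b" | rest]]
...(2)> end
[{:|, [], ["a", [{:|, [], ["b", {:rest, [], Elixir}]}]]}]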

def binary_to_pattern(bin) do
  # Build the innermost cons first: the last grapheme of the binary,
  # consed onto a free `tail` variable.
  [first | rest] = String.graphemes(bin) |> Enum.reverse()

  base = [{:|, [], [first, {:tail, [], nil}]}]

  # Then wrap each remaining grapheme around it, from right to left.
  Enum.reduce(rest, base, fn element, out ->
    [
      {:|, [],
       [
         element,
         out
       ]}
    ]
  end)
end

defmacro defpat(pattern) do
  binary_to_pattern(pattern)
end

We can then use defpat("abcd") to generate ["a" | ["b" | ["c" | ["d" | tail]]]], or in shorter form, ["a", "b", "c", "d" | tail].
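
Since binary_to_pattern/1 returns a plain quoted expression, Macro.to_string/1 lets us double-check the generated pattern (assuming binary_to_pattern/1 is public on Ovo):

iex> Ovo.binary_to_pattern("abcd") |> Macro.to_string() |> IO.puts()
["a" | ["b" | ["c" | ["d" | tail]]]]
:ok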

The pattern matching an escaped backtick inside a string can then be rewritten from:

def walk(["\\" | ["`" | rest]], :string, out, buf), do: walk(rest, :string, out, buf <> "`")

to:

def walk(defpat("\\`"), :string, out, buf), do: walk(tail, :string, out, buf <> "`")

I’m not sure that I like tail leaking from no visible spot, though. That said, at this stage, it allows me to extend the syntax quite easily.

Updating accumulate to handle a “next buffer”, as tokens are flushed at the next token change:

def accumulate(rest, state, out, buf, next_state, next_buf \\ "") do
  if buf != "" do
    walk(rest, next_state, [{state, buf} | out], next_buf)
  else
    walk(rest, next_state, out, next_buf)
  end
end

We then allow buf to be nil, since we were already testing against the empty string. Keyword tokens carry no payload, so they are flushed with a nil buffer:

def walk(defpat("->"), state, out, buf), do: accumulate(tail, state, out, buf, :arrow, nil)
def walk(defpat("if"), state, out, buf), do: accumulate(tail, state, out, buf, :if, nil)
def walk(defpat("else"), state, out, buf), do: accumulate(tail, state, out, buf, :else, nil)
def walk(defpat("then"), state, out, buf), do: accumulate(tail, state, out, buf, :then, nil)
def walk(defpat("end"), state, out, buf), do: accumulate(tail, state, out, buf, :end, nil)

Note that these clauses must come before the generic walk([a | rest], state, out, buf) clause, since Elixir tries function heads from top to bottom. We then update the tests to reflect those new tokens:

test "Tokenizes arrows and if/then/else/end" do
  program = """
  bar -> baz
  if bar then 5 else end
  """

  tokens = [
    {:nonstring, "bar"},
    {:arrow, nil},
    {:nonstring, "baz"},
    {:if, nil},
    {:nonstring, "bar"},
    {:then, nil},
    {:nonstring, "5"},
    {:else, nil},
    {:end, nil}
  ]

  assert Ovo.tokenize(program) == tokens
end

I switched my preference to then instead of do for if expressions, simply as a matter of taste.


Previous post: Hosting a small language (Ovo2) from scratch in Elixir, pt 1
Gathering requirements from a previous experiment

Next post: Hosting a small language (Ovo2) from scratch in Elixir, pt 3
Parsing