Lucas Sifoni

Hosting a small language (Ovo2) from scratch in Elixir, pt 3
Parsing

elixir parsing explorations


full code hosted on github

After the last post, I took a bit of time to extend the tokenizer. The defpat macro has been quite useful. Numbers were tokenized with a strategy similar to the one I had for strings. A digit marks the start of a number, that can then contain either a digit or a dot, but no more than one dot. A regex could handle that just fine but I came to like the character-per-character recursion, avoiding peeking and backtracking. The clause matching on digits could be optimized by not looking into a list, but again, I liked the simplicity of using a single function clause here.

def walk(["." | t], :number, out, buf) do
  if has_dot?(buf) do
    walk(["."|t], :undefined, out, buf)
  else
    walk(t, :number, out, buf <> ".")
  end
end
def walk([a | rest], :number, out, buf) when a in @digits, do: walk(rest, :number, out, buf <> a)
def walk(input, :number, out, buf), do: accumulate(input, :number, out, buf, :undefined)

After that, I’m able to have a complete stream of simple tokens for my reduced grammar.

defmodule Ovo.Token do
  @moduledoc """
  An Ovo Token, a tuple carrying the token kind and optional string representation.
  """
  @type t_string :: {:string, String.t()}
  @type t_symbol :: {:symbol, String.t()}
  @type t_arrow :: {:arrow, nil}
  @type t_if :: {:if, nil}
  @type t_else :: {:else, nil}
  @type t_then :: {:then, nil}
  @type t_end :: {:end, nil}
  @type t_number :: {:number, String.t()}
  @type t_equals :: {:equals, nil}
  @type t_comma :: {:comma, nil}
  @type t_open_paren :: {:open_paren, nil}
  @type t_open_bracket :: {:open_bracket, nil}
  @type t_close_paren :: {:close_paren, nil}
  @type t_close_bracket :: {:close_bracket, nil}
  @type t_backslash :: {:backslash, nil}

To parse that stream of tokens to an abstract syntax tree, I took a bit of time to think of what it meant to execute an ovo2 program.

The smallest ovo2 program possible is a single expression, like add(5, 6). A complex ovo2 program can have multiple expressions evaluated one after the other. Since the last expression only is returned, this means all expressions before the last one can only modify the environment. This means the execution of an ovo2 program is only a reduction of a list of expressions to an updated environment until the production of a final value.

Seen that way, a lambda in ovo2 that captures necessary variables from its outside environment can be seen as an autonomous subprogram with an altered initial environment.

So, in pseudo-code :

Ast is a root ast with no nodes
Until there are no more tokens or an error is encountered ;
  parse_expression
    if successful, add the relevant expression node to the ast 
    otherwise, error

Where parse_expression is :

parse_expression is
  parse_assignment
  or parse_call
  or parse_lambda
  or parse_list
  or parse_value
  or parse_condition

and where :

parse_assignment is :
  <symbol> and <equals> and <expression>

parse_call is :
  <symbol> and <open_paren> and many(<expression> and <comma (optional if trailing)>) until <close_paren>

parse_list is :
  <open_bracket> and many(<expression> and <comma (optional if trailing)>) until <close_bracket>

parse_value is :
  <symbol> or <number> or <string> or <lambda>

parse_lambda is :
  <backslash> and many(<symbol> and <comma (optional if trailing)>) and <arrow> and many(<expression>) until <end>

parse_condition is :
  <if> and <expression> and <then> and many(expression) and <else> and many(expression) and <end>

Since assignments done in an if/else block do not leak outside of it, we can also think of both branches as independent ovo2 programs being lists of expressions with access to a particular enclosing environment. This can be thought as slightly similar to a parameter-less lambda. I’ll keep this interpretation in mind during the next steps, to decide if it is advantageous to think of it this way, or not.

The way I wrote these pseudo-code specifications of various parsers is akin to traditional parser combinators.

In the typescript implementation of the previous iteration of Ovo, this idea was defined in this way :

export interface ParserResult {
    rest: string,
    kind: 'result',
    found: string,
    tokens: Ovo2Node[],
};

export interface ParserError {
    error: string,
    rest: string,
    found: string,
    kind: 'error',
};

export type ParserOutput = ParserError | ParserResult

export type Parser = (input: string) => ParserOutput;

A parser consumed a string, and could either :

Combinators then looked like this : a combinator is a function taking a parser, some number of parsers, or many parsers, and returning a function that adheres to the Parser type.

export const either = (parserA: Parser, parserB: Parser): Parser => {
    return (input: string) => {
        const r = parserA(input);
        if (!errored(r)) {
            return r;
        }
        const r2 = parserB(input);
        if (!errored(r2)) {
            return r2;
        }
        return err('Either failed', input);
    };
};

export const possibly = (a: Parser) => either(a, nothing);

It was then easy to combine parsers with combinators to create complex parsers from simple ones.

export const between = (startParser: Parser, middleParser: Parser, endParser: Parser): Parser =>
    all([startParser, middleParser, endParser]);

export const trim = (p: Parser) =>
    between(possibly(spaces), p, possibly(spaces));

The drawback of this approach, operating on the raw input string, was the complexity of having to think about whitespace and noise in the stream of meaningful characters. I hope to have an easier time by having already reduced the input string to a stream of tokens.

I’ll start with a draft of the first combinators to go from simple value parsers to complex expression parsers :

defmodule Ovo.Combinators do
  def either(a, b) do
    fn tokens ->
      case a.(tokens) do
        {:ok, _, _} = res ->
          res

        _ ->
          case b.(tokens) do
            {:ok, _, _} = res -> res
            _ -> {:error, nil, tokens}
          end
      end
    end
  end

  def any(parsers) do
    fn tokens ->
      Enum.reduce(parsers, {:error, nil, tokens}, fn parser, out ->
        case out do
          {:ok, _, _} ->
            out

          _ ->
            case parser.(tokens) do
              {:ok, _, _} = res -> res
              _ -> out
            end
        end
      end)
    end
  end
end

any could short-circuit traversing the list of parsers later. But that’s enough to implement a parse_value parser that can discriminate between strings and numbers.

defmodule Ovo.Parser do
  alias Ovo.Ast
  alias Ovo.Combinators, as: C

  def err(tokens), do: {:error, nil, tokens}
  def ok(result, rest), do: {:ok, result, rest}

  def parse_number([{:number, val} | rest] = tokens) do
    if String.contains?(val, ".") do
      case Float.parse(val) do
        {n, ""} -> ok(Ast.float(n), rest)
        _ -> err(tokens)
      end
    else
      case Integer.parse(val, 10) do
        {n, ""} -> ok(Ast.integer(n), rest)
        _ -> err(tokens)
      end
    end
  end
  def parse_number(a), do: err(a)

  def parse_string([{:string, val}|rest]), do: ok(Ast.string(val), rest)
  def parse_string(a), do: err(a)

  def parse_value(tokens), do: C.either([&parse_number/1, &parse_string/1]).(tokens)
end

With this small implementation, we can convert from tokens to ast nodes, consuming the flow of tokens, and “walk back” if a parser fails. Not having to think about whitespace is pleasant. But what about line delimiters ? I totally discarded them earlier. But I could use them to simplify deciding of the end of expression parsers. I’ll think about it for a while before going to the next step.

  iex> Ovo.Parser.parse_value([{:number, "5"}])
  {:ok, %Ovo.Ast{kind: :integer, nodes: [], value: 5}, []}
  iex> Ovo.Parser.parse_value([{:string, "foo"}])
  {:ok, %Ovo.Ast{kind: :string, nodes: [], value: "foo"}, []}
  iex> Ovo.Parser.parse_value([{:arrow, nil}])
  {:error, nil, [{:arrow, nil}]}

Previous post : Hosting a small language (Ovo2) from scratch in Elixir, pt 2
Tokenization

Next post : A second silvering attempt yielding a better result