RSeven development 1: a parser

Hello,

I am kicking off a new series to talk about a project of mine that will take a lot of time and probably make me discover new things. This is the first “long” project I’m starting since I made a front-end for Xi editor. Xile sadly got abandoned as I lost motivation after the main team behind the core of Xi Editor let it go to focus on making a Rust GUI library.

What is RSeven?

RSeven is first and foremost my 2021 year long project on which I try to devote time to keep learning Rust.

As my interest in text editors grew, I started to enjoy Emacs a lot for the simplicity but completeness of EmacsLisp, and lisps in general. I ate SICP afterwards, and started to look into Lisps and Scheme in general as an alternative for projects. Xile got born out of my wish to have an editor that is scriptable in scheme, but using a much faster editor core to alleviate some of Emacs’ slowness: making a Scheme front-end to Xi-core seemed like a good thing to try. As I said before, Xi-core development is a prolonged break now, and it left behind a rope crate, among other things.

A shower thought on EmacsLisp made me realize that all Emacs is, is an EmacsLisp interpreter. And EmacsLisp is “just” a lisp-2 with native types relevant to text editing. So why not try to make a similar thing ? A lisp (but a scheme, that is lisp-1 and has a clear and concise standard) that has native types to make text editing faster. RSeven (actually r7.rs) is my attempt at this. I want to host a R7RS scheme in Rust, both to expand my knowledge of Scheme (I’m still not confident on call/cc) and of Rust (memory management and looking into tail call optimization will be something interesting). Learning about language theory and compiling is the cherry on the cake; although I don’t expect to make a scheme compiler in Rust before the second half of the decade.

So yeah, the tl;dr is that RSeven aims to a R7RS interpreter in Rust.

A parser for r7rs

The first step on this journey was to obtain a parser/lexer for r7rs syntax in order to process the tokens properly.

The first crate I made for this effort is therefore rseven-parser, which takes the syntax specification given, and uses Pest to generate a parser for the grammar. As Pest website says, it is not the fastest but it definitely is good enough and allows to keep a grammar in a nice format that’s easier to understand and change

I mostly faced 3 challenges when writing the syntax

0-or-more rules

As Pest will try to match everything in order, the placement of rules that share prefixes within patterns, especially 0-or-more rules, is very important and changes significantly the result. Almost all changes from the grammar in the specification are made to handle precedence/priority for Pest parser.

Quasiquoting and recursive rules

The only rule that I wasn’t able to properly put in Pest grammar is the quasiquoting rule.

It is recursive and there is no way to properly define a recursive pattern (which can go to infinite depth) in the grammar. So currently the solution is to only parse the outer structure, and maintain in the Rust core that will interpret it, a quotation_level which will allow to know when/if the current level is one of the base cases while evaluating.

Count of rules

There are a lot of rules in the parser, and it generates a lot of enum variants. If I hadn’t had rust-analyzer to provide code actions that fill all my match cases for me it would have been really tedious to write the client code of the parser (the core of the interpreter). Thanks to LSP for making this a lot easier.

Test harness

Testing that the syntax works correctly was harder than I thought, mostly because I wasn’t sure of the best way to do it.

Testing on pest.rs

I started by testing the syntax in the editor of pest.rs. It’s a little bit cumbersome because I needed to copy-paste the whole .pest file in the editor everytime I closed my browser, but it worked and it helped me to reorder rules correctly to avoid issues explained in the 0 or more rules part.

There is no way to make this work without an internet connection, and it clearly doesn’t allow to write tests so I had to find something else.

Testing with pest-mode and pesta

Emacs being emacs, there was already a pest-mode that provides both syntax highlighting, and a command to test a grammar within Emacs, using pesta to provide real-time feedback on a string being parsed.

This is very useful to debug on lone sessions, as changing either the input string or the grammar on the fly provides a much faster feedback with this method. It also allows to see the kind of token parsed on each character in the input string.

Testing with pest_ascii_tree

The main reason I used these tools to begin with is that it was really hard to display a nice version of the rust tree returned by a parser. Comparing a tree with an “expected” version of it is thus very hard, and writing tests to check it was daunting. That’s when I discovered pest_ascii_tree, that pretty prints a parsed tree. It is still a little clunky, but now it is possible to write the expected representation of a tree in a test and directly compare the strings. The clunky part is writing the expected part of the test with string additions, but at least it allows to show the expected tree correctly:

#[test]
fn parse_bool() {
    let parsed = parse("#t").unwrap();
    let expected = String::new()
        + " program\n"
        + " └─ command_or_definition\n"
        + "    └─ expression\n"
        + "       └─ literal\n"
        + "          └─ boolean \"#t\"\n";
    assert_eq!(into_ascii_tree(parsed.into_inner()).unwrap(), expected);
}

#[test]
fn parse_nested_comment() {
    let parsed = parse("(list #|Not in the list|# stuff)").unwrap();
    let expected = String::new()
        + " program\n"
        + " └─ command_or_definition\n"
        + "    └─ expression\n"
        + "       └─ procedure_call\n"
        + "          ├─ operator\n"
        + "          │  └─ expression\n"
        + "          │     └─ identifier \"list\"\n"
        + "          └─ operand\n"
        + "             └─ expression\n"
        + "                └─ identifier \"stuff\"\n";
    assert_eq!(into_ascii_tree(parsed.into_inner()).unwrap(), expected);
}

Now, whenever I write syntax that doesn’t get parsed correctly, I can easily write a test and check the associated tree to see what it looks like before fixing it.

Conclusion

This was a short dive into building the parser for r7rs syntax in Rust that I’ll use as a basis for hosting a scheme in Rust. This was a little shorter than I expected, and now that I have a faster way to check parsing and write tests, I think I’m well covered for the rest. I lack documentation though :(

Next article will go to the core part and might try to explain how I’m trying to store the values to match the “slot” semantics of scheme.