The article looks great and I’m looking forward to reading it; this comment is not a criticism of the article.
This API is the only bad thing about Rust!
.expect("Could not read file")
It’s so unfortunate to have an API that reads .expect("thing we don’t expect")
I think we should all just forget it’s there and use .unwrap_or_else(|_| panic!("thing we don’t expect"))
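For comparison, here is a minimal sketch of the two spellings side by side. The helper names `read_or_panic` and `read_expecting` are made up for illustration. Because `fs::read_to_string` returns a `Result`, the closure passed to `unwrap_or_else` receives the error value. The `expect` variant phrases its message as the thing that was expected, which is one way to make the call read less awkwardly.

```rust
use std::fs;

// Hypothetical helper: the panic message states what went wrong
// and includes the underlying I/O error.
fn read_or_panic(path: &str) -> String {
    fs::read_to_string(path)
        .unwrap_or_else(|err| panic!("could not read {}: {}", path, err))
}

// Hypothetical helper: same panic-on-error behaviour via `expect`,
// with the message worded as the expectation that was violated.
fn read_expecting(path: &str) -> String {
    fs::read_to_string(path).expect("file should exist and be readable")
}
```

Both functions panic on a missing file; the difference is only in how the call site reads and what ends up in the message.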
Working on tokenization and parsing, there have been two "lights clicking on" moments that I think every dev working on a PL implementation should have:
- Tokens are the leaves of your syntax trees
- File locations are relative, not absolute
It's easier to build a parser that doesn't buy into these things, but it's way harder to build tooling and good error messaging if you don't.
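One possible reading of those two points, as a rough sketch rather than anyone's actual implementation: every token (whitespace included) is a leaf that stores only its own width, and absolute offsets are derived by walking the tree. The `Node` type and the `tokens_with_offsets` method are hypothetical names chosen for this example.

```rust
// Hypothetical node type: a token is a leaf that stores only its width,
// never its absolute position in the file.
enum Node {
    Token { kind: &'static str, len: usize },
    Branch(Vec<Node>),
}

impl Node {
    // Total width of this subtree in bytes.
    fn len(&self) -> usize {
        match self {
            Node::Token { len, .. } => *len,
            Node::Branch(children) => children.iter().map(Node::len).sum(),
        }
    }

    // Walk the tree and turn relative widths into absolute offsets on demand.
    fn tokens_with_offsets(&self, start: usize, out: &mut Vec<(usize, &'static str)>) {
        match self {
            Node::Token { kind, .. } => out.push((start, *kind)),
            Node::Branch(children) => {
                let mut offset = start;
                for child in children {
                    child.tokens_with_offsets(offset, out);
                    offset += child.len();
                }
            }
        }
    }
}
```

Since no node records where it sits in the file, an edit earlier in the file does not invalidate the subtrees after it, which is the property tooling tends to care about.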
Hey folks just saw this, author here. Happy to answer questions!
There's also Luster[1].
Does Rust have computed goto, which really helps interpreter speed?
It basically means you can do something like "goto opcode_table[*(++ip)];"
GCC offers it as a non-standard extension to C.
https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
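For context, stable Rust has no labels-as-values equivalent. Interpreters usually dispatch with a `match` inside a loop or with a table of function pointers. Below is a minimal sketch of the table approach for a toy stack machine. The opcodes, the `Vm` struct and `run` are all invented for this example, and an indirect call per instruction is not the same thing as a computed goto, so it may not perform the same.

```rust
// Hypothetical toy VM state.
struct Vm {
    stack: Vec<i64>,
    ip: usize,
    halted: bool,
}

type Handler = fn(&mut Vm, &[u8]);

// Opcode 0: stop the interpreter loop.
fn op_halt(vm: &mut Vm, _code: &[u8]) {
    vm.halted = true;
}

// Opcode 1: push the next byte of the program as an integer.
fn op_push(vm: &mut Vm, code: &[u8]) {
    let value = code[vm.ip] as i64;
    vm.ip += 1;
    vm.stack.push(value);
}

// Opcode 2: pop two values and push their sum.
fn op_add(vm: &mut Vm, _code: &[u8]) {
    let b = vm.stack.pop().expect("add needs two operands");
    let a = vm.stack.pop().expect("add needs two operands");
    vm.stack.push(a + b);
}

// The table is indexed directly by opcode, like `opcode_table[*ip]` in C.
const OPCODE_TABLE: [Handler; 3] = [op_halt, op_push, op_add];

fn run(code: &[u8]) -> Vec<i64> {
    let mut vm = Vm { stack: Vec::new(), ip: 0, halted: false };
    while !vm.halted {
        let op = code[vm.ip] as usize;
        vm.ip += 1;
        OPCODE_TABLE[op](&mut vm, code);
    }
    vm.stack
}
```

With this encoding, the program `[1, 2, 1, 3, 2, 0]` means push 2, push 3, add, halt.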
FORTRAN has had it since 1957. But Pascal and C purged "evil computed GOTO" and only offered non-computed goto. Then Java etc. purged non-computed goto.
Thanks for sharing! A great learning
Since tokens are the leaves of the AST, there are a lot of them and they can take up a lot of space. To save memory, it is a good idea to store only a file location instead of a full token. Whenever token information is needed, just lex again, starting at the file location, to get the full token. This works only for languages with a context-free lexical syntax, of course (I'm not entirely sure "context-free" is the right term here, but you get what I mean).
Storing row/column in file location data is wasteful - just a file offset should be enough. Whenever the row/column coordinates are needed (normally only in user messages) they can be quickly recomputed.
In effect, parsed tokens can be stored as just an offset - a 4 or 8 byte integer.
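A minimal sketch of the recomputation step, assuming byte offsets and `\n` line endings. The function names `line_starts` and `line_col` are made up for this example. The line-start table is built once per file, after which each lookup is a binary search.

```rust
// Byte offsets at which each line of `src` begins. Line 1 always starts at 0.
fn line_starts(src: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in src.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

// 1-based (line, column) for a byte offset. Columns are counted in bytes,
// so multi-byte characters would need extra handling.
fn line_col(starts: &[usize], offset: usize) -> (usize, usize) {
    let line = match starts.binary_search(&offset) {
        Ok(i) => i,      // offset is exactly the start of line i
        Err(i) => i - 1, // offset falls inside the line before insertion point i
    };
    (line + 1, offset - starts[line] + 1)
}
```

For the source `"let x = 1;\nlet y = 2;\n"`, offset 15 (the `y`) maps to line 2, column 5.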