Programming languages face a tension between the ability to extend them dynamically and the ability to reason about them statically. We take it for granted that constructs that bind variables, such as function definitions or pattern matching, are a fixed part of a language's core syntax, unless we go for Lisp-like macros, which are powerful but hard to reason about.
There is an unexplored middle ground, however, of allowing binding constructs to be defined and extended, while ensuring that all variables are lexically bound and that their scope can be determined statically. What follows is an exploration of this middle ground, while respecting a single guiding principle:
Every construct in the language can be redefined, there are no privileged language constructs. But the scope of variables remains immediately obvious, just by looking at the source code, without evaluating any functions.
The result will be a surface syntax that looks as dynamic as Lisp macros, but desugars fully to a functional language without macros.
The more a language is defined in terms of static constructs that cannot be redefined, the easier it is to reason about (both for humans and compilers), because we can be sure that certain constructs always mean the same thing and cannot be rebound. But the more dynamic a language is (by aggressively using late binding, as Self and Io do, or by using Lisp-like macros), the more it can be molded for a particular usecase that the original language designers might not have had in mind.
This tension is quite visible for binding constructs, in other words constructs that bind one or more variables in the scope of a function or block. Even heavily dynamic languages usually only allow variables to be bound by special constructs, for example by assignment operators like “=”, in function definitions, or at the beginning of lambdas/blocks/closures.
This makes it hard to define new constructs that bind variables, such as a pattern matching construct in a language that doesn't provide it, because a pattern matching construct needs to take a pattern, which might include fresh variables, and then bind these variables to their matched values in the scope of a body, but only execute the body if the match actually succeeds. There seem to be two different ways to deal with such a situation in a language: either we emulate the new construct with the means the language already offers, for example by passing explicit lambdas to a higher-order function, or we extend the language itself so that new syntax can be defined.
Lisp macros are the best example for the second option. They are definitely powerful enough to add any construct we could possibly think of, and then some! But this power comes at a price. Let's take a look at the following Lisp code:
(foo (bar a b) (baz x y))
Which functions are being called here? Well, without looking at the context of the program we wouldn't know, because foo could be a macro, in which case bar and baz might not be functions, but just symbols that might mean whatever the macro interprets them as. This is of course why macros should be used sparingly: They are extremely powerful, but might make it harder for humans to figure out what is happening, because macros can obscure the structure of code.
Extensive use of macros violates the principle of least power: Powerful language features can be enticing, but power isn't everything, or we would all be happily living in the Turing tarpit of the lambda calculus. Well-chosen restrictions and limitations help us reason more easily about code. Macros are powerful, but they come with almost no limitations.
Can we do better?
Let's start with an observation. Most languages do not visually distinguish between bound variables that are used and fresh variables that are bound:
a + b // the variable a is being used
a = b // the variable a is being bound
In the first case, “a + b”, the variable a is looked up in the current scope, whereas in the second case, “a = b”, a is a fresh variable that will be bound in the current scope. That they play such different roles is not visible just by looking at the variables, nor by looking at the operators, because + and = both look like infix symbols. That's bad news if we want to allow all constructs to be definable by the user.
Let's distinguish these two cases by making the different roles explicit. Whenever a variable is resolved in the current scope and used, we will continue to just write its name. But whenever a fresh variable is being bound, we prefix it with a single quote:
a + b // the variable a is being used
'a = b // the variable a is being bound
We haven't said anything about where and for how long a variable remains in scope. Let's make this scope explicit by writing “{ ... }”, which delimits the block where a binding is active. A variable that is bound inside the block goes out of scope at the end of the block.
So far, so normal. But let's depart from how bindings are defined in other languages by splitting the place where bindings are declared from the blocks where they are bound:
foo('x, y, { f(x) })
//   |     \______/
//   |      scope of x
//   |
//   \__ binding of x
Whenever a variable is single-quoted and appears as 'x, it will be bound as x in the nearest block that appears after it (technically using a depth-first traversal of the AST, which matches what we visually perceive as the nearest block).
Viewed from the perspective of the block { f(x) }, we can immediately tell which variables are bound inside this block and which variables are resolved in the outer scope, just by looking at the explicit bindings to the left of the block (or until we encounter another block, which would then “consume” all the explicit bindings to the left of it). In the example above, we see that x is explicitly bound in { f(x) }, whereas f will be looked up in the outer scope.
What is the value of x in the block? Well, we can't know syntactically, because it is up to the function foo, which receives 'x as an argument, to decide what value x is bound to. It could act like a “let x = y in ...” construct and bind x to the value y or do something else, the value is entirely up to the implementation of foo. But it is statically guaranteed that foo will bind x to some value and this is how the block { f(x) } gets its value for x.
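To make this concrete, here is a minimal Python sketch of the desugaring; the Binding class and the let-like behavior of foo are illustrative assumptions, not part of the proposal itself. A quoted binding 'x becomes a first-class binding value, and the block { f(x) } becomes a one-argument lambda.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    """Hypothetical runtime representation of an explicit binding like 'x."""
    name: str

def foo(binding, value, block):
    # One possible implementation: behave like "let x = value in block",
    # i.e. bind x to the second argument. The choice is entirely up to foo.
    assert isinstance(binding, Binding)
    return block(value)

# The surface syntax foo('x, y, { f(x) }) desugars to:
y = 21
f = lambda x: x * 2
result = foo(Binding("x"), y, lambda x: f(x))
# result == 42
```

Note that the block only learns which value x holds when foo calls it; the desugaring alone fixes the block's arity, not its argument's value.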
This is all a bit abstract, so let's look at how this might be used to define a pattern matching construct in a language that does not already provide such a construct. Let's write “Pair(x, y)” for a struct named Pair with two fields containing the values x and y. Matching on a value v if v is a pair where both fields are the same could look as follows:
match(v, Pair('x, 'x), { x })
The block { x } is statically guaranteed to receive the value of x from the function match; x has that value only for as long as the block is active and goes out of scope at the end of it.
The match function is responsible for deciding how x is bound, based on the arguments v and Pair('x, 'x). It could bind x only if v is a Pair and both fields have the same value, calling the block with x bound to that value and returning an error value otherwise. Or it could treat the first 'x as being shadowed by the second 'x, in which case the call would match every pair and return its second field. The point is: The value of x is up to the implementation of match, but x must be bound by the match function before the block is called.
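One possible implementation of such a match function, sketched in Python; Binding, Pair, and the equal-fields semantics are illustrative assumptions chosen from the options described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    name: str

@dataclass(frozen=True)
class Pair:
    first: object
    second: object

class Error:
    pass

def match(v, pattern, block):
    # One possible semantics: succeed only if v is a Pair matching the
    # pattern, with repeated variables required to match equal values.
    if isinstance(pattern, Pair) and isinstance(v, Pair):
        env = {}
        for pat, val in ((pattern.first, v.first), (pattern.second, v.second)):
            if isinstance(pat, Binding):
                if pat.name in env and env[pat.name] != val:
                    return Error  # repeated variable, unequal values
                env[pat.name] = val
            elif pat != val:
                return Error
        return block(*env.values())
    return Error

# match(v, Pair('x, 'x), { x }) desugars to:
same = match(Pair(3, 3), Pair(Binding("x"), Binding("x")), lambda x: x)
diff = match(Pair(3, 4), Pair(Binding("x"), Binding("x")), lambda x: x)
# same == 3, diff is Error
```

The alternative semantics (shadowing, so that the second 'x wins) would be an equally valid implementation of the same surface syntax.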
Under the hood, a block is simply desugared into an anonymous function, with the number and names of its arguments being determined statically by the explicit bindings that precede it. If a block is not preceded by any explicit bindings, it will be treated as a single argument function that simply ignores its argument.
In a language with arrays written as “[..., ...]” and user-definable infix functions like “... -> ...”, a pattern matching construct could then look as follows:
match(x, [
  Pair('x, 'x) -> { x },
  'x -> { ... }
])
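Such a clause list could be interpreted by an ordinary host-level function. A hedged Python sketch, assuming each “pattern -> { block }” entry desugars to a (pattern, block) pair:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    name: str

@dataclass(frozen=True)
class Pair:
    first: object
    second: object

NoMatch = object()  # sentinel for a failed clause (illustrative)

def try_match(v, pattern, block):
    # Single-clause matcher: binds pattern variables or fails.
    if isinstance(pattern, Binding):
        return block(v)
    if isinstance(pattern, Pair) and isinstance(v, Pair):
        env = {}
        for pat, val in ((pattern.first, v.first), (pattern.second, v.second)):
            if isinstance(pat, Binding):
                if pat.name in env and env[pat.name] != val:
                    return NoMatch
                env[pat.name] = val
            elif pat != val:
                return NoMatch
        return block(*env.values())
    return NoMatch

def match(v, clauses):
    # clauses: list of (pattern, block) pairs, i.e. one possible
    # desugaring of "Pair('x, 'x) -> { x }" entries.
    for pattern, block in clauses:
        result = try_match(v, pattern, block)
        if result is not NoMatch:
            return result
    return NoMatch

first = match(Pair(3, 3), [
    (Pair(Binding("x"), Binding("x")), lambda x: x),
    (Binding("x"), lambda x: "no equal pair"),
])
second = match(Pair(3, 4), [
    (Pair(Binding("x"), Binding("x")), lambda x: x),
    (Binding("x"), lambda x: "no equal pair"),
])
```

Clauses are tried in order, exactly like the arms of a built-in pattern match, but match remains a redefinable function.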
Let's add a bit more syntactic sugar: Right now, blocks always have to appear after the explicit bindings they are supposed to consume, but it also makes sense to allow a block to consume bindings that appear inside it:
{
  let('x, y),
  x
}
Whenever explicit bindings are declared inside a block but not consumed until the end of the block, the bindings will be consumed by the enclosing block, as if everything that follows the call containing the explicit bindings had been passed to it as a block right there. The above example is desugared to “let('x, y, { x })”. In other words, instead of passing a block as the last argument of a function call, we can just use that function inside the block. This is especially helpful for nested bindings:
// desugars to: let('a, x, { let('b, y, { Pair(a, b) }) })
{
  let('a, x),
  let('b, y),
  Pair(a, b)
}
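The desugared form can be written out directly as nested higher-order calls. A small Python rendering, where let, Binding, and Pair are illustrative stand-ins for the hypothetical host representations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    name: str

def let(binding, value, block):
    # Bind the variable to the value and run the block with it.
    return block(value)

def Pair(a, b):
    return (a, b)

x, y = 1, 2
# { let('a, x), let('b, y), Pair(a, b) } desugars to:
result = let(Binding("a"), x,
             lambda a: let(Binding("b"), y,
                           lambda b: Pair(a, b)))
# result == (1, 2)
```

Each let simply passes its value to the continuation; the sugar only flattens the nesting that the lambdas make explicit.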
Right now we assume that every element in a block will be a binding construct that knows how to handle the block argument that is implicitly passed to it. But especially in a language with side effects it would be nice to allow block elements to not bind anything at all and just “do” something, like print(a).
To allow this, let's decide that whenever a block element does not contain any explicit bindings that could be consumed by the enclosing block, the element will be evaluated (assuming call-by-value) as an argument to an anonymous function, which when called will return the rest of the block. (We write (x => y) for the function that takes an argument x and returns y, keeping in mind that this is not surface syntax, because the binding is not explicitly marked.)
// binding constructs
// desugar to: let('x, Foo, x => ...)
{ let('x, Foo), ... }
// non-binding constructs, for example side effects
// desugar to: (_ => ...)(f())
{ f(), ... }
This makes it easy to mix binding constructs and side effects:
// desugars to: let('a, x, { (_ => f(a))(print(a)) })
{
  let('a, x),
  print(a),
  f(a)
}
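The same desugaring can be mimicked in Python; under call-by-value, the argument in (_ => rest)(print(a)) is evaluated first, so the side effect runs before the rest of the block. Here, let and f are illustrative, and a recording log function stands in for print so the ordering can be checked:

```python
def let(binding, value, block):
    # Minimal let: bind and run the block (illustrative).
    return block(value)

def f(a):
    return a + 1

effects = []
def log(a):
    # Stand-in for print(a), recording the side effect.
    effects.append(a)

# { let('a, x), print(a), f(a) } desugars to:
result = let("'a", 41, lambda a: (lambda _: f(a))(log(a)))
# the side effect on a runs first, then f(a) is returned
```

The anonymous function discards the (None) result of the side effect and returns the rest of the block, exactly as the (_ => ...) desugaring prescribes.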
So far we have looked at how binding constructs can be used, but how can they actually be defined? Let's say we wanted to define our own let binding construct, called “def”, which can be called as def('x, value, { x }). How would we implement def?
We already know that the second argument, value, is just a normally evaluated function argument. We also know that blocks are just syntactic sugar and get translated into anonymous functions, with the number of arguments determined by the bindings to its left. In the case of def('x, value, { x }), the block { x } is translated to the function ('x => x). But what about the first argument, 'x? If explicit bindings are purely syntactical and get desugared into a functional language without macros, how are explicit bindings translated?
The exact representation of a binding at runtime depends on the specific host language, but any representation of a binding-as-value must meet a couple of criteria: at the very least, it must be possible to check at runtime whether a value is a binding at all, and whether it is a single binding or a compound of several, so that a construct knows with how many arguments it may safely call the block.
Returning to the example of defining a custom let binding, our def function needs to check that its first argument (the binding-as-value) is a single binding, in which case we know it's safe to call the third argument (the block) as a function with a single argument.
// assuming the following built-in functions:
// `=>`, infix binding construct, for single argument functions,
// `if`, if-then-else, executes either the then-block or the else-block,
// `is_binding`, returns true if its argument is a binding-as-value.
'var => {
  'v => {
    'f => {
      if(is_binding(var), { f(v) }, { Error })
    }
  }
}
// or in a more familiar notation without explicit bindings:
// (var, v, f) =>
// if(is_binding(var), () => f(v), () => Error)
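The same sketch can be rendered in Python (renamed to define, since def is a Python keyword; Binding, Error, and is_binding are illustrative host-level assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binding:
    name: str

class Error:
    pass

def is_binding(value):
    # An ordinary, redefinable function, not a privileged construct.
    return isinstance(value, Binding)

# 'var => { 'v => { 'f => { if(is_binding(var), { f(v) }, { Error }) } } }
define = lambda var: lambda v: lambda f: f(v) if is_binding(var) else Error

ok = define(Binding("x"))(42)(lambda x: x)       # the binding checks out
bad = define("not a binding")(42)(lambda x: x)   # first argument is no binding
```

The runtime check is what makes it safe to call the block with exactly one argument: the desugaring guarantees the block's arity matches the bindings to its left.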
The above example used several built-in functions and thus might seem to beg the question, but crucially all of these functions are normal functions that can be redefined. They are not privileged in any way.
The above implementation of a custom let binding is only a rough sketch. The code for the translation from surface syntax with explicit bindings to a functional calculus with normal lambdas is on GitHub.
The current proposal assumes that explicit bindings can only be used in a single scope, because they get consumed by the nearest block and are not available in the next block. Frequently, however, it would make sense to bind a variable both in the block that follows and in the enclosing block.
A good example is letrec, the recursive cousin of let that allows the value-to-be-bound to access the binding recursively:
// we'd like to define r = Cons(x, r) = Cons(x, Cons(x, r)) = ...
// not possible right now, r is only in scope in the inner block
{
  letrec('r, { Cons(x, r) }),
  head(r) // r is not in scope here anymore, was already consumed
}
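For comparison, here is one hedged Python sketch of how a host could implement the recursive binding itself, by passing the block a self-referential proxy; this only illustrates the recursion, not a solution to the surface-syntax scoping question, and letrec and its proxy trick are assumptions for illustration:

```python
def letrec(binding, make_value, block):
    # make_value receives a proxy that forwards calls to the final value,
    # so the value under construction can refer to itself recursively.
    # (This trick only works for function values, which defer the lookup.)
    def proxy(*args):
        return value(*args)
    value = make_value(proxy)
    return block(value)

# letrec('f, { n => ... }, { f(5) }) could desugar to:
result = letrec("'f",
                lambda f: (lambda n: 1 if n == 0 else n * f(n - 1)),
                lambda f: f(5))
# result == 120
```

An infinite structure like Cons(x, r) would additionally require laziness in the host, which is exactly why the scoping question above deserves more exploration.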
How can this be solved? There are a couple of conceivable options, such as letting a binding be consumed by more than one block or marking such bindings specially, but none of them seems ideal.
Either way, more exploration is needed. Nevertheless, explicit bindings and user-defined binding constructs seem to be an interesting direction that I haven't encountered in any programming language before, especially when combined with flexible but general syntax, such as infix calls.