What makes one programming language more powerful than another one? Is the “power” of programming languages a clear concept or are there several orthogonal notions of what we mean by power? Is it even desirable to make a programming language as powerful as possible?
I want to argue that there are several different notions of programming language power and only one of them, computational power, is usually explicitly discussed and measured. The different notions of power are sometimes confused, which makes it hard to see that each notion of power comes with its own tradeoffs and that making a language more powerful has both advantages and disadvantages. As with all tradeoffs, however, there are points in the design space that are worse than others. Clarifying the different notions of power can help identify these points.
While it might be natural to think that a more powerful language is always better, because it can do more, this overlooks the fact that a less powerful language is easier for both humans and computers to understand and manipulate.
Here's what Tim Berners-Lee had to say about this in 1998:
The choice of language is a common design choice. The low power end of the scale is typically simpler to design, implement and use, but the high power end of the scale has all the attraction of being an open-ended hook into which anything can be placed: a door to uses bounded only by the imagination of the programmer.
Computer Science in the 1960s to 80s spent a lot of effort making languages which were as powerful as possible. Nowadays we have to appreciate the reasons for picking not the most powerful solution but the least powerful. The reason for this is that the less powerful the language, the more you can do with the data stored in that language. If you write it in a simple declarative form, anyone can write a program to analyze it in many ways.
— Principle of Least Power, Tim Berners-Lee (emphasis mine)
In other words, a more powerful language can do more in the language, but if we restrict ourselves to a less powerful language we can do more with the language (if we reason about it or manipulate it as data).
According to the principle of least power, we should always choose the least powerful language that we can get away with, because it is “simpler to design, implement and use”. What we can get away with will depend on the specific situation, but the emphasis on designing, implementing and using a language is a good reminder of the disadvantages of too much power.
What kind of power does Tim Berners-Lee have in mind? Earlier in the text, he talks about declarative languages that are not Turing-complete, making it clear that he is talking specifically about computational power.
This notion of power is rigorously defined, thanks to concepts like the Chomsky hierarchy of formal languages and fields such as computability theory. But for practical programming languages, this notion of power is usually beside the point, because Turing-completeness is taken as a given for a general purpose programming language. In fact, the difference between programming languages and domain-specific languages is often that DSLs are explicitly not Turing-complete, whereas programming languages are.
Worse, the fact that a programming language is Turing-complete (and thus maximally powerful when considered on that axis, because it can in theory compute anything that any other Turing-complete language can compute) tells us very little about whether it is a language that is powerful in a practical sense: Lambda calculus is Turing-complete, but few people would use it for day-to-day programming.
While computational power is an important concept (and probably the only concept where it is easy to define a point of maximum power, namely Turing completeness), it is insufficient to explain why we say that some programming languages are more powerful than others. So let's look at some other notions of power.
We often talk about how a language becomes more “expressive” when a new language construct is added, or how one language is more expressive than another because it has a language construct that the other lacks and can only imitate in a cumbersome or verbose way. It is clear that this is not about computational power, since all Turing-complete languages can in theory express the same programs. We are instead talking about whether a language gives us enough power to express some construct in a short and natural way.
This notion of expressive power has been formalized by Matthias Felleisen in the 1991 paper On the expressive power of programming languages and similarly by John C. Mitchell in the 1993 paper On abstraction and the expressive power of programming languages. The papers have different aims, but are similar enough that I will not go into the differences.
In a nutshell, two languages are considered equally expressive if constructs that exist in one language but not in the other can be translated into the other language in such a way that the translated version is “observationally equivalent” (Mitchell) and preserves the “global program structure” (Felleisen).
Being “observationally equivalent” means that the translated version behaves the same as the version that uses more constructs: both always produce the same output and are thus indistinguishable in this regard. (Felleisen only talks about termination behavior, Mitchell about output values, but the difference does not matter here.) Preserving the “global program structure” means that the translation of constructs must happen locally without rewriting the whole program. After all, a Turing-complete language could always just implement an interpreter for another language, so a translation is always possible, but this would be a global translation and we explicitly want to distinguish expressive power from computational power, so only local translations are allowed.
Felleisen also defines a more restrictive version of expressiveness, called “macro expressiveness”, which can distinguish syntactic sugar from more complicated translations. Both normal expressiveness and macro expressiveness preserve the global program structure, but macro expressiveness “not only preserves the global program structures but also the structure of the subexpressions of phrases built from eliminable constructs”. Macro expressiveness is compositional, whereas this need not hold for expressiveness in general. This neatly explains why we perceive syntactic sugar as superficial: The new construct can be added to a language using a purely syntactic and compositional transformation.
Let's look at an example: We want to express an if-else construct in a simple functional language with first-class functions and numbers as data types. The construct is supposed to take three arguments, “if-else(v, then, else)”. If v is 1, the then-value should be evaluated and returned. If v is 0, the else-value should be evaluated and returned. If v is any other value, the construct should loop endlessly (or raise an error).
We can macro-express such a construct in a language without if-else by translating it in such a way that the then/else values each get wrapped in a lambda (so that they are not immediately evaluated) and the whole construct becomes a function that checks its first argument and then, based on the value, either calls the then-lambda or the else-lambda.
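Here is a rough sketch of that translation in Python (the text above stays language-agnostic, so the dictionary lookup is just one arbitrary way to dispatch on a number without using the host language's own if-else):

    # The source construct: if-else(v, then, else) with eagerly evaluated branches.
    # The local translation: wrap both branches in lambdas so they are not
    # evaluated up front, then dispatch on v and call only the chosen thunk.
    def if_else(v, then_thunk, else_thunk):
        # The lookup raises an error for anything other than 0 or 1, which
        # plays the role of "loop endlessly (or raise an error)".
        return {1: then_thunk, 0: else_thunk}[v]()

    print(if_else(1, lambda: "then branch", lambda: "else branch"))  # then branch
    print(if_else(0, lambda: "then branch", lambda: "else branch"))  # else branch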
Let's look at an example of a construct that is not macro-expressible: We want to express an if-else construct, but now instead of using 0 or 1 we want to use True and False as values, which do not exist in our target language without if-else. If the value is neither True nor False, the if-else construct should loop endlessly (or return an error).
Interestingly, this is not macro-expressible: if our target language does not have True and False as values, we need to map them to values that the target language already has, which contradicts the requirement that if-else should loop endlessly (or return an error) for all other values. If True is, for example, mapped to 1, then if-else(1, t, f) will return t, because there is no way to distinguish True from 1 in a language that does not have an explicit True value.
This sounds trivial, and in some ways it is, but it formalizes the notion that adding data types to a minimal language like lambda calculus makes it more expressive, because it allows us to distinguish concepts that lambda calculus conflates: In lambda calculus, because everything is a function, every piece of data needs to be encoded in some way, which means multiple different concepts might get mapped to the same function, such as the number 0 and the boolean value false. Adding data types thus increases a language's expressive power in a way that we can explain formally.
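To make the conflation concrete, here is a small sketch of the standard Church encodings written as Python lambdas; the names are merely labels we attach from the outside:

    # Church booleans: a boolean is a function that picks one of two arguments.
    true  = lambda t: lambda f: t
    false = lambda t: lambda f: f

    # Church numerals: the numeral n applies a function s to z exactly n times.
    zero = lambda s: lambda z: z
    one  = lambda s: lambda z: s(z)

    # `false` and `zero` are the same function up to renaming of variables.
    # Inside the language there is no way to tell them apart; only an added
    # boolean data type makes the distinction observable.
    print(false("then")("else"))     # else
    print(zero(lambda n: n + 1)(0))  # 0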
The idea of formalizing expressiveness as a local translation between languages that preserves the observable program behavior is a good first step towards explaining our intuition that a practical programming language is more powerful than lambda calculus and that syntactic sugar is superficial compared to other constructs. But the concept of expressive power still has some problems:
First of all, Felleisen and Mitchell are only interested in whether a local translation could exist, not in how a translation can be specified in a language in practice: A language like Lisp makes it easy to define such translations using macros, while a language without macros or a preprocessor might need to change its parser to implement even the simplest forms of syntactic sugar. The expressiveness that is unique to Lisp-like macros is not captured in the formal definitions of Felleisen and Mitchell.
More interestingly, a Lisp-like language with macros or fexprs is actually less expressive than most other languages, because, as Mitchell points out, no translation is ever observationally equivalent in the presence of Lisp fexprs or macros: Even something as simple as (+ 1 2) cannot be treated as observationally equal to the value 3, because a macro could examine the unevaluated AST and simply return the first node of the tree, which would be “+” for (+ 1 2) but “3” for the value 3. (Mitchell Wand makes a very similar observation in the 1998 paper The theory of fexprs is trivial.)
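Python has no macros or fexprs, but we can mimic the problematic behavior by handing a function the unevaluated expression as source text; the head helper below is a hypothetical stand-in for a macro that inspects syntax instead of values:

    import ast

    def head(source):
        # A crude stand-in for a fexpr/macro: look at the syntax tree of the
        # unevaluated expression instead of evaluating it.
        node = ast.parse(source, mode="eval").body
        return type(node).__name__

    print(head("1 + 2"))  # BinOp    -- the "+" node is visible
    print(head("3"))      # Constant -- a plain literal looks different

Once such a construct exists, “1 + 2” and “3” are no longer interchangeable, even though they evaluate to the same value.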
This can be explained more generally: The more a language can introspect its own programs, that is, the more reflective it is, the less it is possible to replace parts of a program with other behaviorally equivalent parts, because the difference could be observed, even if it is merely a syntactic difference. Reflection, macros and other forms of meta-programming blur the line between syntax and semantics, since syntax becomes semantically observable. This makes it harder for humans and compilers to encapsulate and abstract away implementation details: an increase in reflective power implies a decrease in abstractive power.
Mitchell doesn't distinguish between expressive and abstractive power, but let's follow John Shutt and his work on the Kernel programming language and draw that distinction here (without diving into the details of how to formalize it). It then makes sense to say that Lisp-like languages, thanks to being homoiconic and having macros, provide a lot of expressive power but little abstractive power, as a consequence of being very reflective.
(Shutt would probably disagree with this way of putting things and say that abstractive power is only lost when it comes to traditional fexprs and macros, not for Kernel-style fexprs with explicit evaluation, since they don't exhibit the sort of trivial theory that Mitchell and Wand are criticizing. This may be true, but it comes at the cost of completely blurring the line between syntactic and semantic abstraction, because in Kernel it is only clear at runtime which expressions are fexprs and which are applicative functions. This is deliberate, but means that a compiler has a hard time figuring out which expressions can be safely optimized. The point thus still stands: More reflective power comes at the cost of statically analyzable abstractive power.)
So far, all of the different notions of power have only been concerned with static programs that are written once and then never changed. This might be fine for something like lambda calculus, but one of the main characteristics that distinguishes programming languages from simpler formalisms is that languages need to deal with growth and change. As Guy Steele points out, growing a language is one of the things that makes designing a programming language hard.
There have been some attempts to formalize the notion of extensibility, for example a 1998 paper by Shriram Krishnamurthi and Matthias Felleisen called Toward a Formal Theory of Extensible Software. Their definition captures the intuitive notion that a language is extensible if it allows extension by merely adding new definitions, without copying or editing old ones. A language with more extensible power is then a language that allows such an additive extension in more situations than another language.
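As a small illustration of additive extension in Python (functools.singledispatch is just one convenient mechanism, not something the paper discusses): new cases are registered from the outside, and the original definitions are never edited:

    from functools import singledispatch

    @singledispatch
    def describe(value):
        return "something"

    @describe.register(int)
    def _(value):
        return f"the integer {value}"

    # Later, possibly in a completely separate module, a new case can be
    # added without touching any of the code above:
    @describe.register(list)
    def _(value):
        return f"a list of {len(value)} items"

    print(describe(3))       # the integer 3
    print(describe([1, 2]))  # a list of 2 items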
The tradeoffs inherent in such power are similar to those of abstractive/reflective power: The more extensible a language is, the more can be changed later “from the outside”, and the harder it becomes to reason about the code until all of these outside extensions are known.
There is also the question of when and how these extension points should be made explicit, because there is a fundamental tension, the expression problem (which is a misleading pun, as it is about how operations/cases of expressions can be extended and could just as well be called the “extension problem”). In a nutshell, OOP makes it easy to add new cases of objects but fixes the operations available on them in an interface, whereas FP makes it easy to add new operations in the form of functions but fixes the cases of data in algebraic data types.
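A compact Python sketch of the two halves of that tension (the shapes and operations are made-up examples):

    # OOP style: each case carries its own operations. Adding a new case
    # (another class) is easy; adding a new operation means editing every
    # existing class.
    class Circle:
        def __init__(self, radius):
            self.radius = radius
        def area(self):
            return 3.14159 * self.radius ** 2

    class Square:
        def __init__(self, side):
            self.side = side
        def area(self):
            return self.side ** 2

    # FP style: the cases are fixed as plain data and each operation is one
    # function over all cases. Adding a new operation (another function) is
    # easy; adding a new case means editing every existing function.
    def area(shape):
        kind, value = shape
        if kind == "circle":
            return 3.14159 * value ** 2
        if kind == "square":
            return value ** 2
        raise ValueError(f"unknown shape: {kind}")

    print(Circle(2).area())     # 12.56636
    print(area(("square", 3)))  # 9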
Extending an existing program along both dimensions, adding new cases as well as new operations, is tricky in a statically typed context. There are solutions, such as type classes, but a fundamental tension remains: Proving that no extensions conflict and generating efficient code requires some knowledge about the totality of cases or operations, which makes it hard to compile extensions truly independently.
This is not surprising; it is simply the tradeoff inherent in extensible power: The more open and extensible a programming language is, the harder it is to implement the language efficiently.
There is another notion of power that isn't adequately captured by any of the notions discussed so far: What is more powerful, a language with just two orthogonal primitives from which everything else can be derived, or a language with three, four, five?
We might associate a small number of primitives with formal elegance, but languages that require more orthogonal primitives are, in a certain sense, more powerful than languages with fewer primitives, because a language that is decomposed into a larger number of primitives is easier to reason about: it is more surveyable.
Some esoteric languages pride themselves on achieving Turing-completeness using a small number of primitives, for example using just two combinators. Even lambda calculus only needs three constructs. But these languages are esoteric for a reason: When everything looks like a combinator or a function, it is hard to see what any of the code means in terms of our everyday programming concepts: Is it code? Is it data? Is it a function that iterates? Is the iteration bounded or not? Does it call itself recursively? Is it a conditional? In other words, we are stuck in a Turing tarpit, where “everything is possible but nothing of interest is easy.”
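The two combinators in question are usually S and K; a quick Python rendering (just an illustration, not tied to any particular esoteric language) shows how little the raw terms reveal about intent:

    # The S and K combinators, written as curried Python lambdas.
    S = lambda x: lambda y: lambda z: x(z)(y(z))
    K = lambda x: lambda y: x

    # The identity function, spelled only with S and K:  S K K x  ==  x
    I = S(K)(K)
    print(I(42))  # 42

Nothing in the term S(K)(K) says “identity”; its meaning only becomes visible by stepping through the reductions, and that is the tarpit.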
The term “surveyable” is borrowed from Ludwig Wittgenstein, who wrote about the difference between Peano arithmetic and the arithmetic we learn in school: From a mathematical standpoint, the unary math of Peano arithmetic, which relies only on 0 and a successor function, is much more elegant than the arithmetic we learn in school, because we can define natural numbers, addition, multiplication and so forth in terms of Peano arithmetic. As languages, Peano arithmetic and school arithmetic have the same computational and expressive power!
But as Wittgenstein pointed out, while it is true that Peano arithmetic can serve as a mathematical foundation for our school arithmetic (in the same way that lambda calculus can serve as a foundation for practical programming languages), our school arithmetic is in some (non-mathematical, philosophical) sense the foundation for Peano arithmetic, because our school math is at the origin of our concepts of natural numbers, addition, multiplication and so forth. These concepts might be informal, but they nevertheless give Peano arithmetic a use. The only reason we are interested in Peano arithmetic at all is because it corresponds to our way of doing arithmetic that we learned in school.
Bertrand Russell and A. N. Whitehead considered informal arithmetic to be secondary and subsidiary to foundational math like Peano arithmetic. But Wittgenstein emphasized that, in contrast to our school arithmetic, Peano arithmetic only remains surveyable in very small examples: If we try to add larger numbers, such as 1000 + 1000, using Peano arithmetic, we quickly reach the limits of what we can manipulate symbolically without making mistakes. Whether or not we make mistakes or arrive at the correct result is judged in reference to our school arithmetic; in other words, Peano arithmetic is in some sense secondary and subsidiary to informal arithmetic!
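A small Python sketch of unary, Peano-style numerals makes the surveyability problem tangible (the encoding as nested tuples is an arbitrary choice for illustration):

    # Peano-style numerals: a zero and a successor, nothing else.
    ZERO = ("Z",)
    def succ(n):
        return ("S", n)

    def add(m, n):
        # add(Z, n) = n;  add(S(m'), n) = S(add(m', n))
        return n if m == ZERO else succ(add(m[1], n))

    two = succ(succ(ZERO))
    three = succ(succ(succ(ZERO)))
    print(add(two, three))  # ('S', ('S', ('S', ('S', ('S', ('Z',))))))

Already 2 + 3 is barely readable; writing out 1000 + 1000 would mean manipulating two thousand nested successor applications by hand.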
(It might seem as if this argument broke down once we consider computers, because a computer can obviously manipulate huge numbers in Peano arithmetic without making mistakes. Wittgenstein's remarks have interesting applications here as well, but they would lead us too far afield. Let's instead use “surveyable” only as a sort of guiding principle here.)
In the context of programming languages, a larger number of (orthogonal) primitives means that a language is more clearly separated into distinct concepts. Lambda calculus with its three constructs is elegant and very simple to implement, but hard for a compiler to optimize, because it is unclear which combination of functions is supposed to mean a number, a recursive function, a bounded iteration and so forth.
By separating a language into a larger number of distinct concepts, we make it easier for both humans and compilers to reason about the code and optimize it: We can store numbers as machine integers, implement iteration using jumps and, where possible, compile recursion into tail calls.
Interestingly, separating a Turing-complete language into more orthogonal primitives is often harder than using fewer orthogonal primitives, because Turing-completeness creeps in quickly and often unnoticed.
Type systems are one way to cleanly separate concepts, by restricting how some primitives can be used so that we have room to add another orthogonal primitive that does the task explicitly. One example is recursion: We can restrict a language to non-recursive types, which guarantees that functions can never call themselves recursively, and only then add an explicit recursion operator, opening up new avenues for optimization.
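A rough Python sketch of the idea (Python obviously cannot enforce the restriction through its types, so the discipline is only simulated here): function bodies never name themselves, and a hypothetical fix operator is the only way to tie the recursive knot, which makes every potential recursion site syntactically visible:

    # A hypothetical explicit recursion operator: fix hands a function a
    # reference to itself and is the *only* way to make a recursive call.
    def fix(f):
        def rec(*args):
            return f(rec)(*args)
        return rec

    # The body never refers to itself by name; self_ is supplied by fix.
    factorial = fix(lambda self_: lambda n: 1 if n == 0 else n * self_(n - 1))
    print(factorial(5))  # 120

In a language that enforces this discipline, a compiler only has to look at the uses of fix to find every place where recursion can occur.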
As mentioned before, more power isn't necessarily better. The notions above are, however, a first attempt to make the landscape of the design space itself more “surveyable”, by clarifying our concept of “power” in the context of programming languages.
The notions discussed here are neither complete nor rigorous, but why should they be? Language design involves balancing a lot of different tensions and tradeoffs; all we can do is try to make our concepts more explicit, and then start experimenting.