Parsing math Expressions is obviously a difficult Task. On the other Hand it is a very wellknown Task, and is solved, since Editors are in use to write Code.
So it may be a good idea - instead of re-inventing the wheel - to figure out some of the concepts, which may be used by professional Expression-Parsers.
And before we can come to concret code we must discuss several more or less abstract topics, namely:
- What is an Expression?
- What is a Token?
- What is Infix-Notation, and what is PostFix-Notation?
- What is an Operator?
With this background we can create code of surprisingly power, shortness and performance.
So lets step in
Expressions and Tokens
An Expression is a Text, composed of several atomic units - the Tokens.
In the here introduced point of view a token (of an infix-noted Expression) can only have two meanings: It is either an opening/closing Bracket or an Operator (the term 'Operator' will be discussed later).
Infix-Notation
On Infix-noted Expressions mainly binary Operators are set between their both Operands.
Example: (3 + 2) * (15 / 3 - 8)
That causes a significant problem: In which Order the Operators shall operate?
To solve it, there was introduced the concept "Operator-Rank" (power over multiply, multiply over addition,...). Additionally there is a concept "Bracket" to break the Ranking-Concept and define operating-orders, which differ from the order as it would ensue from the Operator-Ranks.
A real confusing mess of concepts - isn't it?
And things get worse, when unary Operators come into play, like Negation, or Factorial (eg: -3
, 5!
)
Postfix-Notation
The same Expression-Sample as above, but postfix-noted:
3 2 + 15 3 / 8 - *
Postfix has no such circumstancies like Infix: The Operators Operands are always at its left, the Order of Operating is always from left to right, and that's that!
Operating means: consume Operator and Operands, and replace them with the Operation-Result.
Theese Results (except the last one) are again Operands of further Operators.
See Picture:
You see clearly the sukzessive Reducing of Token, until only one remains: the expression-result.
This Algorithm - "PostfixProcessing" - can be implemented very smart and efficient.
But that's the last step of all - please keep patient.
Operator
In the here introduced meaning an Operator can have one Operand (unary OP), two (binary OP), or more, and even can have no Operand at all!
What is meant with: "Operator with more than 2 Operands"? Functions can do so, eg a Function SumOfThree( 3, 4, -15 )
would be considered as Operator with 3 Operands.
And how can I imagine an "Operator without Operand"? Very simple: Imagine a Number. A Number consumes no Operand, but creates a Result - namely itself.
Another such "zeroary Operator" is a named Parameter. A Parameter is also a no-operand-consuming Operator, and the Result which it creates is taken from another, external mechanism.
You can consider Numbers and Constants as Operators with "intrinsic result", while Parameters use a kind of "result-injection" - as said: none of them consumes an Operand.
Code
Tokenize
First we need to convert the expression to a List of Tokens:
1 Private Function Tokenize(expression As String) As List(Of Token)
2 Tokenize = New List(Of Token)
3 Dim rgxToken As New Regex("(?<operator>[+\-*/^()])|(?<number>[\d\.]+)|(?<param>[a-z])")
4 Dim tk = New Token("("c)
5 For Each mt As Match In rgxToken.Matches(expression)
6 If mt.Groups("number").Success Then
7 tk = New Token("0"c) With {.Number = Double.Parse(mt.Value, CultureInfo.InvariantCulture)}
8 Else
9 Dim c = mt.Value(0)
10 If c = "-"c AndAlso tk.CausesUnaryNegation Then
11
12 c = "~"c
13 End If
14 tk = New Token(c)
15 End If
16 Tokenize.Add(tk)
17 Next
18 End Function
Basically that is simple - Regex help to distinct operators, numbers, params (line #3).
The '-
'-Token causes a little problem, because its meaning is context-sensitiv: if it has a Number at its left, it is the binary Substraction-Operator, otherwise it is the unary Negation-Operator, and applies to its Operand at its right.
Lines #10-14 solve this by replacing the Negation-Token with unambigous '~
'.
But see the Token-Class itself:
1 Public Class Token
2
3 Public Number As Double, OpCode As Char, Rank, OperandCount As Integer
4
5 Private Shared _Ranks As String() = {"(", ")", "+-", "*/", "^", "~"}
6 Private Shared _OperandCounts As String() = {"~", "+-*/^"}
7
8 Public Sub New(opcode As Char)
9 Me.OpCode = opcode
10 Rank = _Ranks.FindIndex(Function(s) s.Contains(opcode))
11 If Rank < 0 Then Rank = 99
12 OperandCount = _OperandCounts.FindIndex(Function(s) s.Contains(opcode)) + 1
13 End Sub
14
15 Public Function CausesUnaryNegation() As Boolean
17 Return Rank <> 99 AndAlso OpCode <> ")"c
18 End Function
19
20 End Class
The Main-Thing is line #3, which says, what a Token needs to know: Namely an OperatorCode (OpCode
), its Rank
and its OperandCount
.
Moreover my Token knows a 'Number
', but that is only in use for Number-Tokens.
Infix2Postfix
This is the most tricky code of this article - it is known as "Shunting-Yard"-Algorithm, and was invented by Dijkstra (who else?).
Please refer to the link - its explainations are excellent, especially the detailed samples of processing an expression step by step.
Private Function Infix2Postfix(infixTokens As List(Of Token)) As List(Of Token)
Dim tokenStack As New Stack(Of Token)
Dim output = New List(Of Token)
For Each tk In infixTokens
If tk.OpCode <> "("c Then
While tokenStack.Count > 0 AndAlso tokenStack.Peek.Rank >= tk.Rank
output.Add(tokenStack.Pop())
End While
End If
If tk.OpCode = ")"c Then tokenStack.Pop() Else tokenStack.Push(tk)
Next
output.AddRange(tokenStack)
Return output
End Function
Since the code is less than 20 lines, and the comment describes exactly what is done I dispense further explainations. Because when you know, what a Stack
is, then you should understand it by closely looking at and thinking about it.
ProcessPostfix
Postfix-Processing can target several goals: eg build an expression-tree, or build a TreeNode
-tree to visualize the expression-tree.
But mostly expected is to solve the Expression.
The Algorithm was already anticipated in the "What is Postfix" - Section, but in Repetition see the Picture, how Operations sukzessive reduce Tokens until only the Expression-Result remains:
To implement that again the Stack
-Class is very helpful - we use an operandStack As Stack(Of Double)
, from which Operators can pop their Operands. And the ProcessPostfix-Algo (here named: Solve()
) pushes the Operation-Results of the Operators back to operandStack.
See Code:
1 Module ExpressionSolver
2 Private _Operators As Dictionary(Of Char, Func(Of Stack(Of Double), Double))
3
4 Sub New()
5
6
7
8 _Operators = New Dictionary(Of Char, Func(Of Stack(Of Double), Double)) From
9 {{"~"c, Function(stk) -stk.Pop},
10 {"^"c, Function(stk)
11 Dim right = stk.Pop
12 Return stk.Pop ^ right
13 End Function},
14 {"*"c, Function(stk) stk.Pop * stk.Pop},
15 {"/"c, Function(stk) 1 / stk.Pop * stk.Pop},
16 {"+"c, Function(stk) stk.Pop + stk.Pop},
17 {"-"c, Function(stk) -stk.Pop + stk.Pop}}
18
19 _Operators.Add("0"c, Nothing)
20 For i = 0 To 25
21 _Operators.Add(Convert.ToChar(97 + i), Nothing)
22 Next
23 End Sub
24
25 Public Function Solve(expression As String, ParamArray paramValues() As Double) As Double
26 Dim operandStack As New Stack(Of Double)
27 For Each tk In Infix2Postfix(Tokenize(expression))
28 Dim result As Double
29 Select Case tk.OpCode
30 Case "0"c : result = tk.Number
31
32 Case "a"c To "z"c
33 Dim i = Convert.ToInt32(tk.OpCode) - 97
34 result = paramValues(i)
35
36 Case Else
37 result = _Operators(tk.OpCode).Invoke(operandStack)
38 End Select
39 operandStack.Push(result)
40 Next
41 Return operandStack.Pop
42 End Function
Upto line#23 you see the initializing of the available Operators, and how they get stored in a Dictionary.
In this case an Operator is an anonymous Function which pops one or more Operands from a given Stack, and return a result.
Note the Number- and Parameter- Operators (#19 - #22): Since their evaluating-result-mechanism is different, they do not interact with the stack - (especially don't pop Operands from it).
Now look at the Solve()
-Function - after such effortful Initialisation it is banal: Just invoke every Tokens Operator with the operandStack
(line #37) and push the result back to operandStack
(line #39).
Exceptions are made with Numbers and Parameters: From a Number-Token (OpCode '0'
) the result is its Number (#30), and for a ParameterToken (OpCodes 'a'
To 'z'
) its result is taken from the paramValues
-Array (#34).
Another PostfixProcessing: Build Treeview
The Principle is the same: Execute eaches Tokens Operation on the operandStack
. Only two differences to the Solve()
-Code above: 1) the operands aren't numbers, but are TreeNode
s 2) no algo-exceptions are required.
1 Private Function BuildTreeNodes(postfixTokens As List(Of Token)) As TreeNode
2 Dim operandStack As New Stack(Of TreeNode)
3 For Each tk In postfixTokens
4 Dim result = New TreeNode(tk.ToString)
5 For i = 1 To tk.OperandCount
6 result.Nodes.Insert(0, operandStack.Pop)
7 Next
8 operandStack.Push(result)
9 Next
10 Return operandStack.Pop
11 End Function
there are no differentiated Operators, since every Token operates the same way:
Create a (Result-)TreeNode
(line #4), pop Operands, and for each Operand insert a further TreeNode
into the Result-TreeNode
s NodeCollection
(lines #5 - #7).
Then push the Result-TreeNode
back to Stack (#8).
See picture oft the attached Sample-App:
On Top you see the infix-noted Expression, below its Postfix-Version, below-left a Tree-Representation.
On the right you can set Values for the Expression-Params, and right-bottom you get the Expression-Result.
Further PostfixProcessings: Linq-ExpressionTree
Maybe in an update or a continuing article I will show the "Kings-Way" of Expression-parsing: There will be a PostfixProcessing-Function, which composes a Linq.Expression with classes of the Linq.Expression-Namespace, and compile it to an usual anonymous Function.
To this Function you can pass parameter-values, and its evaluation-Performance is simply not toppable, since it doesn't need to re-parse the Expression at each call, and of course since it is compiled.
Moreover i will support Constants and Functions, namely all stuff of the Math-Namespace: Sin()
, Cos()
, Abs()
, Pi
,...
And even userdefined Functions, even with more than 2 Operands, can be made recognizable as Expression-terms.
(In former times I used that technique to create a Tablefunction-Feature for Datagridviews - similar to Excel. But that would be a too big topic for an article)
Conclusion
We saw, that a little Background-Knowledge (especially about Infix-/Postfix-Notation, and of course about the faboulous Shunting-Yard-Algo, created by Mr. Dijkstra) places us in a position, where we have a clear concept of how to solve the task "infix-math-expression-parsing" in an efficient, reliable and powerful way (eg. including params and function-calls).
The general Approach is: 1) Tokenize 2) Convert2Postfix 3) ProcessPostfix
And please note, that the attached sources meet the goal "demonstrate a general Approach for parsing expressions", but nevertheless it has several drawbacks (and lazyness prevents me to fix them):
- The Datatype
Char
for Token.OpCode
was a wrong choice. When it comes to extend functionality with math. Function-Names (Sin, Cos, Abs
,...), DataType Char
can't handle that - String
obviously would have been better. - My "named Parameters" are fix predefined: They are
a, b, c, ..., z
, and in that order. It would have been better, if one could choose freely the param-Names within an expression, as useful. - some other, minor drawbacks...
It is consolatory, that it can't be goal of an article, to introduce the "perfect code". It's enough, when an article just introduces some more or less basic concepts, which enable programmers to develop their own "perfect code" - namely perfect fitting to their needs.
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.