Parsing Math Expressions: Approach towards State of Art

Mr.PoorEnglish

Rate me:

4.82/5 (6 votes)

7 Dec 2017CPOL7 min read

12K

232

Parse Expressions in three steps: 1) Tokenize, 2) Convert2Postfix 3) ProcessPostfix

Download ExpressionSolverTester.zip - 39.8 KB

Parsing math Expressions is obviously a difficult Task. On the other Hand it is a very wellknown Task, and is solved, since Editors are in use to write Code.
So it may be a good idea - instead of re-inventing the wheel - to figure out some of the concepts, which may be used by professional Expression-Parsers.

And before we can come to concret code we must discuss several more or less abstract topics, namely:

What is an Expression?
What is a Token?
What is Infix-Notation, and what is PostFix-Notation?
What is an Operator?

With this background we can create code of surprisingly power, shortness and performance.
So lets step in

Expressions and Tokens

An Expression is a Text, composed of several atomic units - the Tokens.
In the here introduced point of view a token (of an infix-noted Expression) can only have two meanings: It is either an opening/closing Bracket or an Operator (the term 'Operator' will be discussed later).

Infix-Notation

On Infix-noted Expressions mainly binary Operators are set between their both Operands.
Example: (3 + 2) * (15 / 3 - 8)
That causes a significant problem: In which Order the Operators shall operate?
To solve it, there was introduced the concept "Operator-Rank" (power over multiply, multiply over addition,...). Additionally there is a concept "Bracket" to break the Ranking-Concept and define operating-orders, which differ from the order as it would ensue from the Operator-Ranks.
A real confusing mess of concepts - isn't it?
And things get worse, when unary Operators come into play, like Negation, or Factorial (eg: -3, 5!)

Postfix-Notation

The same Expression-Sample as above, but postfix-noted:
3 2 + 15 3 / 8 - *
Postfix has no such circumstancies like Infix: The Operators Operands are always at its left, the Order of Operating is always from left to right, and that's that!
Operating means: consume Operator and Operands, and replace them with the Operation-Result.
Theese Results (except the last one) are again Operands of further Operators.
See Picture:

You see clearly the sukzessive Reducing of Token, until only one remains: the expression-result.
This Algorithm - "PostfixProcessing" - can be implemented very smart and efficient.
But that's the last step of all - please keep patient.

Operator

In the here introduced meaning an Operator can have one Operand (unary OP), two (binary OP), or more, and even can have no Operand at all!
What is meant with: "Operator with more than 2 Operands"? Functions can do so, eg a Function SumOfThree( 3, 4, -15 ) would be considered as Operator with 3 Operands.
And how can I imagine an "Operator without Operand"? Very simple: Imagine a Number. A Number consumes no Operand, but creates a Result - namely itself.
Another such "zeroary Operator" is a named Parameter. A Parameter is also a no-operand-consuming Operator, and the Result which it creates is taken from another, external mechanism.
You can consider Numbers and Constants as Operators with "intrinsic result", while Parameters use a kind of "result-injection" - as said: none of them consumes an Operand.

Code

Tokenize

First we need to convert the expression to a List of Tokens:

VB.NET

  1  Private Function Tokenize(expression As String) As List(Of Token)
  2     Tokenize = New List(Of Token)
  3     Dim rgxToken As New Regex("(?<operator>[+\-*/^()])|(?<number>[\d\.]+)|(?<param>[a-z])")
  4     Dim tk = New Token("("c) ' since the following Loop accesses the precuser-token, an initial dummy-Token is needed
  5     For Each mt As Match In rgxToken.Matches(expression)
  6        If mt.Groups("number").Success Then
  7           tk = New Token("0"c) With {.Number = Double.Parse(mt.Value, CultureInfo.InvariantCulture)}
  8        Else
  9           Dim c = mt.Value(0)
 10           If c = "-"c AndAlso tk.CausesUnaryNegation Then ' access the precuser-token to decide the meaning of "-"
 11              ' "-" can mean substraction as well as unary negation. If so replace it by "~" for easyer distinction
 12              c = "~"c
 13           End If
 14           tk = New Token(c)
 15        End If
 16        Tokenize.Add(tk)
 17     Next
 18  End Function

Basically that is simple - Regex help to distinct operators, numbers, params (line #3).
The '-'-Token causes a little problem, because its meaning is context-sensitiv: if it has a Number at its left, it is the binary Substraction-Operator, otherwise it is the unary Negation-Operator, and applies to its Operand at its right.
Lines #10-14 solve this by replacing the Negation-Token with unambigous '~'.

But see the Token-Class itself:

VB.NET

  1  Public Class Token
  2  
  3     Public Number As Double, OpCode As Char, Rank, OperandCount As Integer
  4     
  5     Private Shared _Ranks As String() = {"(", ")", "+-", "*/", "^", "~"} ' the element-index is the Token-Rank
  6     Private Shared _OperandCounts As String() = {"~", "+-*/^"} ' the element-index is the operators OperandCount-1
  7  
  8     Public Sub New(opcode As Char)
  9        Me.OpCode = opcode
 10        Rank = _Ranks.FindIndex(Function(s) s.Contains(opcode))
 11        If Rank < 0 Then Rank = 99 ' OpCodes, not listed in _Ranks, are Numbers or Parameters, and get Rank=99
 12        OperandCount = _OperandCounts.FindIndex(Function(s) s.Contains(opcode)) + 1 ' OpCodes, not listed in _OperandCounts, automatically get OperandCount=0
 13     End Sub
 14  
 15     ''' <summary> if true, a following '-'-Token means unary Negation. Otherwise means (binary) Subtraction </summary>
 16     Public Function CausesUnaryNegation() As Boolean
 17        Return Rank <> 99 AndAlso OpCode <> ")"c
 18     End Function
 19  	
 20  End Class

The Main-Thing is line #3, which says, what a Token needs to know: Namely an OperatorCode (OpCode), its Rank and its OperandCount.
Moreover my Token knows a 'Number', but that is only in use for Number-Tokens.

Infix2Postfix

This is the most tricky code of this article - it is known as "Shunting-Yard"-Algorithm, and was invented by Dijkstra (who else?).
Please refer to the link - its explainations are excellent, especially the detailed samples of processing an expression step by step.

VB.NET

Private Function Infix2Postfix(infixTokens As List(Of Token)) As List(Of Token)
   'shunting-yard-algo by Dijkstra: holds a 'tokenStack' as interim Token-storage.
   'The rules of populating and flushing it to output are as follows:
   '1) Every token, except '(', flushes all higher or equal ranked tokens from tokenStack to output
   '2) Then every token, except ')', enter the tokenStack.
   ' Exceptional ')'-Behavior: since '(' is of lowest Rank, and ')' is next, the latter has flushed all other tokens, and a '(' now is stackTop. Remove it too.
   '3) Neither '(' nor ')' move to output
   Dim tokenStack As New Stack(Of Token)
   Dim output = New List(Of Token)
   For Each tk In infixTokens
      If tk.OpCode <> "("c Then ' all Tokens - except '(' - flush higher ranked other Tokens from tokenStack to output
         While tokenStack.Count > 0 AndAlso tokenStack.Peek.Rank >= tk.Rank
            output.Add(tokenStack.Pop())
         End While
      End If
      ' ')' doesn't enter the tokenStack. Instead remove the '(', which now is on Stack-Top
      If tk.OpCode = ")"c Then tokenStack.Pop() Else tokenStack.Push(tk)
   Next
   output.AddRange(tokenStack) ' final rest-flush
   Return output
End Function

Since the code is less than 20 lines, and the comment describes exactly what is done I dispense further explainations. Because when you know, what a Stack is, then you should understand it by closely looking at and thinking about it.

ProcessPostfix

Postfix-Processing can target several goals: eg build an expression-tree, or build a TreeNode-tree to visualize the expression-tree.
But mostly expected is to solve the Expression.
The Algorithm was already anticipated in the "What is Postfix" - Section, but in Repetition see the Picture, how Operations sukzessive reduce Tokens until only the Expression-Result remains:

To implement that again the Stack-Class is very helpful - we use an operandStack As Stack(Of Double), from which Operators can pop their Operands. And the ProcessPostfix-Algo (here named: Solve()) pushes the Operation-Results of the Operators back to operandStack.
See Code:

VB.NET

  1  Module ExpressionSolver
  2     Private _Operators As Dictionary(Of Char, Func(Of Stack(Of Double), Double))
  3  
  4     Sub New()
  5        'associate Tokens with Operators: Functions, which pop their (Double-)Operands from operandStack, process them and retrieve a Double as Result
  6        '"~" is Unary-Negation-Postfix, all other Operators are as in Vb.
  7        '"^" takes some effort, to respect order of left and right operand
  8        _Operators = New Dictionary(Of Char, Func(Of Stack(Of Double), Double)) From
  9                                            {{"~"c, Function(stk) -stk.Pop},
 10                                            {"^"c, Function(stk)
 11                                                      Dim right = stk.Pop
 12                                                      Return stk.Pop ^ right
 13                                                   End Function},
 14                                            {"*"c, Function(stk) stk.Pop * stk.Pop},
 15                                            {"/"c, Function(stk) 1 / stk.Pop * stk.Pop},
 16                                            {"+"c, Function(stk) stk.Pop + stk.Pop},
 17                                            {"-"c, Function(stk) -stk.Pop + stk.Pop}}
 18        ' Number- and Parameter-Operators create Values without processing Stack-Elements as Operands. So their Operators are Nothing
 19        _Operators.Add("0"c, Nothing) ' Number-Operator
 20        For i = 0 To 25 ' Parameter-Operator: [a-z] 
 21           _Operators.Add(Convert.ToChar(97 + i), Nothing)
 22        Next
 23     End Sub
 24  
 25     Public Function Solve(expression As String, ParamArray paramValues() As Double) As Double
 26        Dim operandStack As New Stack(Of Double)
 27        For Each tk In Infix2Postfix(Tokenize(expression)) ' process postfix-ordered Tokens
 28           Dim result As Double
 29           Select Case tk.OpCode
 30              Case "0"c : result = tk.Number ' numbers: 'intrinsic result'
 31  
 32              Case "a"c To "z"c            ' parameters: result 'injected' by paramValues
 33                 Dim i = Convert.ToInt32(tk.OpCode) - 97 ' compute param-index from OpCode
 34                 result = paramValues(i)
 35  
 36              Case Else             ' other operators: result by consuming operands from stack
 37                 result = _Operators(tk.OpCode).Invoke(operandStack) 
 38           End Select
 39           operandStack.Push(result)
 40        Next
 41        Return operandStack.Pop
 42     End Function

Upto line#23 you see the initializing of the available Operators, and how they get stored in a Dictionary.
In this case an Operator is an anonymous Function which pops one or more Operands from a given Stack, and return a result.
Note the Number- and Parameter- Operators (#19 - #22): Since their evaluating-result-mechanism is different, they do not interact with the stack - (especially don't pop Operands from it).

Now look at the Solve()-Function - after such effortful Initialisation it is banal: Just invoke every Tokens Operator with the operandStack (line #37) and push the result back to operandStack (line #39).
Exceptions are made with Numbers and Parameters: From a Number-Token (OpCode '0') the result is its Number (#30), and for a ParameterToken (OpCodes 'a' To 'z') its result is taken from the paramValues-Array (#34).

Another PostfixProcessing: Build Treeview

The Principle is the same: Execute eaches Tokens Operation on the operandStack. Only two differences to the Solve()-Code above: 1) the operands aren't numbers, but are TreeNodes 2) no algo-exceptions are required.

VB.NET

  1  Private Function BuildTreeNodes(postfixTokens As List(Of Token)) As TreeNode
  2     Dim operandStack As New Stack(Of TreeNode)
  3     For Each tk In postfixTokens
  4        Dim result = New TreeNode(tk.ToString)
  5        For i = 1 To tk.OperandCount
  6           result.Nodes.Insert(0, operandStack.Pop) ' respect reversed order of stack-elements
  7        Next
  8        operandStack.Push(result)
  9     Next
 10     Return operandStack.Pop
 11  End Function

there are no differentiated Operators, since every Token operates the same way:
Create a (Result-)TreeNode (line #4), pop Operands, and for each Operand insert a further TreeNode into the Result-TreeNodes NodeCollection (lines #5 - #7).
Then push the Result-TreeNode back to Stack (#8).
See picture oft the attached Sample-App:

On Top you see the infix-noted Expression, below its Postfix-Version, below-left a Tree-Representation.
On the right you can set Values for the Expression-Params, and right-bottom you get the Expression-Result.

Further PostfixProcessings: Linq-ExpressionTree

Maybe in an update or a continuing article I will show the "Kings-Way" of Expression-parsing: There will be a PostfixProcessing-Function, which composes a Linq.Expression with classes of the Linq.Expression-Namespace, and compile it to an usual anonymous Function.
To this Function you can pass parameter-values, and its evaluation-Performance is simply not toppable, since it doesn't need to re-parse the Expression at each call, and of course since it is compiled.
Moreover i will support Constants and Functions, namely all stuff of the Math-Namespace: Sin(), Cos(), Abs(), Pi,...
And even userdefined Functions, even with more than 2 Operands, can be made recognizable as Expression-terms.
(In former times I used that technique to create a Tablefunction-Feature for Datagridviews - similar to Excel. But that would be a too big topic for an article)

Conclusion

We saw, that a little Background-Knowledge (especially about Infix-/Postfix-Notation, and of course about the faboulous Shunting-Yard-Algo, created by Mr. Dijkstra) places us in a position, where we have a clear concept of how to solve the task "infix-math-expression-parsing" in an efficient, reliable and powerful way (eg. including params and function-calls).
The general Approach is: 1) Tokenize 2) Convert2Postfix 3) ProcessPostfix

And please note, that the attached sources meet the goal "demonstrate a general Approach for parsing expressions", but nevertheless it has several drawbacks (and lazyness prevents me to fix them):

The Datatype Char for Token.OpCode was a wrong choice. When it comes to extend functionality with math. Function-Names (Sin, Cos, Abs,...), DataType Char can't handle that - String obviously would have been better.
My "named Parameters" are fix predefined: They are a, b, c, ..., z, and in that order. It would have been better, if one could choose freely the param-Names within an expression, as useful.
some other, minor drawbacks...

It is consolatory, that it can't be goal of an article, to introduce the "perfect code". It's enough, when an article just introduces some more or less basic concepts, which enable programmers to develop their own "perfect code" - namely perfect fitting to their needs.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Mr.PoorEnglish

Germany

This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Parsing Math Expressions: Approach towards State of Art

Expressions and Tokens

Infix-Notation

Postfix-Notation

Operator

Code

Tokenize

Infix2Postfix

ProcessPostfix

Another PostfixProcessing: Build Treeview

Further PostfixProcessings: Linq-ExpressionTree

Conclusion

License

Comments and Discussions