2004.12.28 09:57 PM

Excel Formula Parsing

Update 8/7/2017

Whoa. Ten years later and I'm in here making an update. Here's why. In short, I've updated the JavaScript in this post (view source) to include an MIT license. It previously had only this: "free for the taking", which seemed sufficient back in 2004, but folks need something more legit these days. Note that the C# implementation of this code, which I posted in 2007, included an MIT license.

Update 4/26/2007

Fixed a problem with literal numbers in formulas formatted with scientific notation (e.g., =3.1E-24-2.1E-24), per Tjaard's comment. More info here.

As part of a self-directed (i.e., unpaid) project I'm working on with my old pal Darnley Bynoe, it was necessary to write an Excel formula parser, the output of which is an RPN expression which our product evaluates. While investigating the nuances of Excel formulas and exploring our options, we put together a number of prototypes. Darnley wrote his prototypes using Pick/BASIC, while I used JavaScript (as it made it easy to spit out the results in formatted HTML). Since we settled on an approach a while back and have moved on to other downstream work, I figured someone might like to have one of the old parsing prototypes we produced (one that stopped short of producing tokenized RPN, which is fairly illegible, and mostly useless without our interpreter).

So, here it is. I chose to share this particular prototype not so much because of the approach it uses (more on that below), but because it provides good visualization of the formula's parts and their nesting. Anyhow, feel free to use it here or take the code with you. To use it here, just paste in a valid Excel formula. Make sure it's a valid Excel formula (in other words, Excel accepts it), as this prototype doesn't do expression validation or provide any stack overflow protection (if you try to type in a formula and forget a paren or closing quote or something you'll likely see an error, or maybe nothing, who knows). Remember, it was only a prototype.



Here are some sample formulas, if you want to try it out:

=IF("a"={"a","b";"c",#N/A;-1,TRUE}, "yes", "no") &   "  more ""test"" text"

=+ AName- (-+-+-2^6) = {"A","B"} + @SUM(R1C1) + (@ERROR.TYPE(#VALUE!) = 2)

=IF(R13C3>DATE(2002,1,6),0,IF(ISERROR(R[41]C[2]),0,IF(R13C3>=R[41]C[2],0, IF(AND(R[23]C[11]>=55,R[24]C[11]>=20),R53C3,0))))


=IF(R[39]C[11]>65,R[25]C[42],ROUND((R[11]C[11]*IF(OR(AND(R[39]C[11]>=55, R[40]C[11]>=20),AND(R[40]C[11]>=20,R11C3="YES")),R[44]C[11],R[43]C[11]))+(R[14]C[11] *IF(OR(AND(R[39]C[11]>=55,R[40]C[11]>=20),AND(R[40]C[11]>=20,R11C3="YES")), R[45]C[11],R[43]C[11])),0))

There are as many ways to parse an expression as there are expressions. For our purposes, however, there were some unique requirements which sort of framed our final approach. The most significant of these was that we had to be able to parse an expression without advance knowledge of its possible contents. For example, we had to be able to parse an expression without knowing the universe of possible functions that may appear in it, or how many (or what kind of) parameters they might take. This lack of advanced knowledge sort of eliminated the possibility of us constructing an unambiguous BNF description to feed to YACC or BISON or ANTLR or some other "compiler compiler" (well, that and the fact that BNF and parse trees are hard and this is still an unpaid project).

Instead, we figured it was possible to understand enough of the rules inherent in Excel's formula language to write a simple state- and stack-based tokenizer, with some post-parse passes to contextually adjust the resulting tokens. In the end, this was the approach we adopted, and the prototype above illustrates some of our earlier work along those lines.

I don't have time to describe all of the issues we encountered while figuring this out, but there are some points worth mentioning, in case you decide to try this on your own.

  • In some cases, Excel considers "+" to be a valid unary operator, along with "-", and will preserve it in the formula, although in the end it's a no-op.
  • For some reason Excel uses a comma as the range union operator. This means that its only unambiguous use between ranges is outside of a function call or within a subexpression (i.e., between parentheses).
  • For some other reason Excel uses a space (or multiple spaces) as the range intersection operator. While not as ambiguous as the comma, it does require some consideration.
  • If you see an open paren immediately preceded by one or more characters, it's a function.
  • Don't forget that Excel still allows functions to be preceded with an "@".
  • String constants are always surrounded by double quotes, and interior double quotes are always doubled up.
  • The same is true for single quotes surrounding linked workbook paths and the sheet names of off-sheet references.
  • R1C1-style addresses use brackets to surround relative offsets (e.g., R[-1]C[3]). Oh, and there's that unary operator again.
  • Don't forget that array constants (surrounded by braces "{}") can contain rows, which are delimited with semicolons ";". Note that in the above prototype we treated array constants as function calls: ARRAY() and ARRAYROW().

Here are some additional resources you might find interesting:

Great explanation of parse trees: http://homepage.smc.edu/kennedy_john/PARSETREES.PDF

An overview of Excel reference operators: http://chacocanyon.com/smm/readings/referenceoperators.shtml

Some Excel formula BNF: http://www.mcs.vuw.ac.nz/~db/FishBrainWiki?Excel

Here's a little piece on parsing Excel formulas the hard way with K: http://www.nsl.com/papers/excel.htm

To tokenize a formula, I use M/ (index-over = pointer-chasing) and a small state-transition table. I think it is possible to do tokenizing using ONLY a state-transition table, but this requires too many states. Consequently, I use this method to bust up a formula into alphanumeric constants, strings of operator symbols, and sequences within quotes and brackets. I then perform several post- processing steps on the resulting list of strings to find unary minus, negative constants, and multi-token symbolic operators (e.g. '<='). Although K supports several standard iterators (do and while), I've required them in only one place; namely, to resolve the ambiguity of Excel's comma. Within the scope of an aggregator, comma means Union; for example, Sum(x,y,z) means: Sum the values in the union of the ranges x, y, and z. Sum has valence 1. But, within the scope of a function, comma operates as an argument separator; for example, If(x,y,z) has valence 3, and means: if x is true, then y, else z. So far, I have been unable to find a non-looping method for determining which commas are which.

Someone get that guy a Pabst Blue Ribbon!


You sir, are one incredibly smart man. I have no idea what I could use this for, but I already booked marked it.

Genius, pure genius.

Ryan A. Rinaldi | 2004.12.29 07:30 AM

Aw, shucks. Thanks, Ryan.

ewbi.develops | 2004.12.29 10:18 AM

Check out Rob van Gelder's formula parsing Excel add-in - very cool!


ewbi.develops | 2005.04.02 06:14 AM

Definitely bent my brain tonight — thanks for sharing!

Zaine Ridling | 2005.04.02 11:27 PM

You're welcome!

ewbi.develops | 2005.04.03 11:40 PM

Hi Eric.

Version 1.00 is available for download.


Rob van Gelder | 2005.04.15 03:42 AM

actually, i prefer boddington's pub ale, in the can with the little doo-hickey.

thanks for the link (although i'm not quite sure why you think i did it "the hard way".)


sa | 2005.05.05 04:14 AM


You're welcome. I hope a lot of folks interested in this stuff make it over to your "no stinking loops" site to check out the content:


Nix the PBR, right.

And "the hard way" is a euphemism for "requiring more brains than my way". ;)

ewbi.develops | 2005.05.05 08:33 AM

Thats a great work. After searching for days I found a good parser!!!!

BeginnerVB_Net | 2005.07.01 10:43 AM

Thanks for saving me a lot of time!

steve | 2005.07.13 04:39 PM

No problem - glad I could help.

ewbi.develops | 2005.07.13 05:59 PM

i want to sorting between to sheets is it possible plz advice

shazad | 2006.02.26 07:55 PM

Hi Shazad - Not sure I can help, but to know that I'll need a little more info. First, are you referring at all to parsing Excel formulas (the main thrust of this post)? Second, what exactly are you looking to sort within Excel and how?

ewbi.develops | 2006.02.27 06:43 AM

I am very interested in the execl formula and the usuage, I will be glad if you can send me more literatures on the formula and usage through my e-mail address.


Sani Bello | 2006.07.17 03:08 AM

I don't get it.

I have a bunch of data in one cell, separated by commas. I want a formula to parse instead of always having to use the text to column function. Can you help?

j | 2006.07.26 02:12 PM

Hi j,

I might be able to help, but this post doesn't address your need. This post is about parsing excel formulas, not about Excel formulas for parsing text.

For the latter, the text to column function is your best bet for splitting and spreading the delimited/fixed-length content of a cell (or column of cells) across multiple columns. It's also possible to write a VBA macro that does the split and spread, which might prove easier to use over and over, assuming the data structure is known and doesn't change. In terms of using formulas, you can't - at least not if you want one formula to split and spread items to other cells, because formulas can only affect the cells in which they reside. However, if the number of items to split is fixed, you can use a set of formulas to retrieve data from particular positions in the text, and then if necessary repeat the formulas down subsequent rows.

Let me know if you'd like to discuss further. Good luck.

ewbi.develops | 2006.07.26 05:06 PM

Excellent. Thank you very much.

AK | 2006.10.10 03:42 PM

You're welcome. I'll be posting a C# implementation soon, too.

ewbi.develops | 2006.10.10 08:24 PM

Nice work! Here's a Python port of the algorithm. Might be of use to someone?

# Description: Tokenise an Excel formula using an implementation of
# E. W. Bachtal's algorithm, found here:
# http://ewbi.blogs.com/develops/2004/12/excel_formula_p.html
# Tested with Python v2.5 (win32)
# Author: Robin Macharg
# Copyright: Algorithm (c) E. W. Bachtal, this implementation (c) R. Macharg
# CVS Info:
# $Header: T:\\cvsarchive/Excel\040export\040&\040import\040XML/ExcelXMLTransform/EWBI_Javascript_port/jsport.py,v 1.3 2006/11/17 13:27:32 rmacharg Exp $
# Modification History
# Date Author Comment
# =======================================================================
# 2006/11 - RMM - RMM = Robin Macharg
# Created

# Token definitions
TOK_TYPE_NOOP = "noop";
TOK_TYPE_OPERAND = "operand";
TOK_TYPE_FUNCTION = "function";
TOK_TYPE_SUBEXPR = "subexpression";
TOK_TYPE_ARGUMENT = "argument";
TOK_TYPE_OP_PRE = "operator-prefix";
TOK_TYPE_OP_IN = "operator-infix";
TOK_TYPE_OP_POST = "operator-postfix";
TOK_TYPE_WSPACE = "white-space";
TOK_TYPE_UNKNOWN = "unknown"


TOK_SUBTYPE_CONCAT = "concatenate";

# Class: f_token
# Description: Encapsulate a formula token
# Methods: __init__()
# Attributes: tvalue -
# ttype - See token definitions, above, for values
# tsubtype - See token definitions, above, for values
class f_token:
def __init__(self, value, type, subtype):
self.tvalue = value
self.ttype = type
self.tsubtype = subtype

# Class: f_tokens
# Description: An ordered list of tokens

# Attributes: items - Ordered list
# index - Current position in the list
# Methods: __init__()
# f_token - add() - Add a token to the end of the list
# None - addRef() - Add a token to the end of the list
# None - reset() - reset the index to -1
# Boolean - BOF() - End of list?
# Boolean - EOF() - Beginning of list?
# Boolean - moveNext() - Move the index along one
# f_token/None - current() - Return the current token
# f_token/None - next() - Return the next token (leave the index unchanged)
# f_token/None - previous() - Return the previous token (leave the index unchanged)
class f_tokens:
def __init__(self):
self.items = []
self.index = -1

def add(self, value, type, subtype=""):
if (not subtype):
subtype = ""
token = f_token(value, type, subtype)
return token

def addRef(self, token):

def reset(self):
self.index = -1

def BOF(self):
return self.index = (len(self.items) - 1)

def moveNext(self):
if self.EOF():
return False
self.index += 1
return True

def current(self):
if self.index == -1:
return None
return self.items[self.index]

def next(self):
if self.EOF():
return None
return self.items[self.index + 1]

def previous(self):
if self.index 0) and [self.items[len(self.items) - 1]] or [None])[0]

def value(self):
return ((self.token()) and [(self.token()).tvalue] or [""])[0]

def type(self):
t = self.token()
return ((self.token()) and [(self.token()).ttype] or [""])[0]

def subtype(self):
return ((self.token()) and [(self.token()).tsubtype] or [""])[0]

# Function: getTokens()
# Description: Parse an Excel formula into constituent tokens.
# Arguments: String - formula - The formula to parse
# Returns: f_tokens - tokens - A list of tokens, parsed from the formula
def getTokens(formula):

def currentChar():
return formula[offset]

def doubleChar():
return formula[offset:offset+2]

def nextChar():
# JavaScript returns an empty string if the index is out of bounds,
# Python throws an IndexError. We mimic this behaviour here.
except IndexError:
return ""
return formula[offset+1]

def EOF():
return offset >= len(formula)

tokens = f_tokens()
tokenStack = f_tokenStack()
offset = 0
token = ""
inString = False
inPath = False
inRange = False
inError = False

while (len(formula) > 0):
if (formula[0] == " "):
formula = formula[1:]
if (formula[0] == "="):
formula = formula[1:]

# state-dependent character evaluation (order is important)
while not EOF():

# double-quoted strings
# embeds are doubled
# end marks token
if inString:
if currentChar() == "\"":
if nextChar() == "\"":
token += "\""
offset += 1
inString = False
token = ""
token += currentChar()
offset += 1

# single-quoted strings (links)
# embeds are double
# end does not mark a token
if inPath:
if currentChar() == "'":
if nextChar() == "'":
token += "'"
offset += 1
inPath = False
token += currentChar()
offset += 1;

# bracketed strings (range offset or linked workbook name)
# no embeds (changed to "()" by Excel)
# end does not mark a token
if inRange:
if currentChar() == "]":
inRange = False
token += currentChar()
offset += 1

# error values
# end marks a token, determined from absolute list of values
if inError:
token += currentChar()
offset += 1
if ",#NULL!,#DIV/0!,#VALUE!,#REF!,#NAME?,#NUM!,#N/A,".find("," + token + ",") != -1:
inError = False
token = ""

# independent character evaulation (order not important)
# establish state-dependent character evaluations
if currentChar() == "\"":
if len(token) > 0:
# not expected
tokens.add(token, TOK_TYPE_UNKNOWN)
token = ""
inString = True
offset += 1

if currentChar() == "'":
if len(token) > 0:
# not expected
tokens.add(token, TOK_TYPE_UNKNOWN)
token = ""
inPath = True
offset += 1

if (currentChar() == "["):
inRange = True
token += currentChar()
offset += 1

if (currentChar() == "#"):
if (len(token) > 0):
# not expected
tokens.add(token, TOK_TYPE_UNKNOWN)
token = ""
inError = True
token += currentChar()
offset += 1

# mark start and end of arrays and array rows
if (currentChar() == "{"):
if (len(token) > 0):
# not expected
tokens.add(token, TOK_TYPE_UNKNOWN)
token = ""
tokenStack.push(tokens.add("ARRAY", TOK_TYPE_FUNCTION, TOK_SUBTYPE_START))
offset += 1

if (currentChar() == ";"):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
tokens.add(",", TOK_TYPE_ARGUMENT)
offset += 1

if (currentChar() == "}"):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
offset += 1

# trim white-space
if (currentChar() == " "):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
tokens.add("", TOK_TYPE_WSPACE)
offset += 1
while ((currentChar() == " ") and (not EOF())):
offset += 1

# multi-character comparators
if (",>=,,".find("," + doubleChar() + ",") != -1):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
tokens.add(doubleChar(), TOK_TYPE_OP_IN, TOK_SUBTYPE_LOGICAL)
offset += 2

# standard infix operators
if ("+-*/^&=> 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
tokens.add(currentChar(), TOK_TYPE_OP_IN)
offset += 1

# standard postfix operators
if ("%".find(currentChar()) != -1):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
tokens.add(currentChar(), TOK_TYPE_OP_POST)
offset += 1

# start subexpression or function
if (currentChar() == "("):
if (len(token) > 0):
tokenStack.push(tokens.add(token, TOK_TYPE_FUNCTION, TOK_SUBTYPE_START))
token = ""
tokenStack.push(tokens.add("", TOK_TYPE_SUBEXPR, TOK_SUBTYPE_START))
offset += 1

# function, subexpression, array parameters
if (currentChar() == ","):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
if (not (tokenStack.type() == TOK_TYPE_FUNCTION)):
tokens.add(currentChar(), TOK_TYPE_OP_IN, TOK_SUBTYPE_UNION)
tokens.add(currentChar(), TOK_TYPE_ARGUMENT)
offset += 1

# stop subexpression
if (currentChar() == ")"):
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)
token = ""
offset += 1

# token accumulation
token += currentChar()
offset += 1

# dump remaining accumulation
if (len(token) > 0):
tokens.add(token, TOK_TYPE_OPERAND)

# move all tokens to a new collection, excluding all unnecessary white-space tokens
tokens2 = f_tokens()

while (tokens.moveNext()):
token = tokens.current();

if (token.ttype == TOK_TYPE_WSPACE):
if ((tokens.BOF()) or (tokens.EOF())):
elif (not(
((tokens.previous().ttype == TOK_TYPE_FUNCTION) and (tokens.previous().tsubtype == TOK_SUBTYPE_STOP)) or
((tokens.previous().ttype == TOK_TYPE_SUBEXPR) and (tokens.previous().tsubtype == TOK_SUBTYPE_STOP)) or
(tokens.previous().ttype == TOK_TYPE_OPERAND)
elif (not(
((tokens.next().ttype == TOK_TYPE_FUNCTION) and (tokens.next().tsubtype == TOK_SUBTYPE_START)) or
((tokens.next().ttype == TOK_TYPE_SUBEXPR) and (tokens.next().tsubtype == TOK_SUBTYPE_START)) or
(tokens.next().ttype == TOK_TYPE_OPERAND)
tokens2.add(token.tvalue, TOK_TYPE_OP_IN, TOK_SUBTYPE_INTERSECT)


# switch infix "-" operator to prefix when appropriate, switch infix "+" operator to noop when appropriate, identify operand
# and infix-operator subtypes, pull "@" from in front of function names
while (tokens2.moveNext()):
token = tokens2.current()
if ((token.ttype == TOK_TYPE_OP_IN) and (token.tvalue == "-")):
if (tokens2.BOF()):
token.ttype = TOK_TYPE_OP_PRE
elif (
((tokens2.previous().ttype == TOK_TYPE_FUNCTION) and (tokens2.previous().tsubtype == TOK_SUBTYPE_STOP)) or
((tokens2.previous().ttype == TOK_TYPE_SUBEXPR) and (tokens2.previous().tsubtype == TOK_SUBTYPE_STOP)) or
(tokens2.previous().ttype == TOK_TYPE_OP_POST) or
(tokens2.previous().ttype == TOK_TYPE_OPERAND)
token.tsubtype = TOK_SUBTYPE_MATH;
token.ttype = TOK_TYPE_OP_PRE

if ((token.ttype == TOK_TYPE_OP_IN) and (token.tvalue == "+")):
if (tokens2.BOF()):
token.ttype = TOK_TYPE_NOOP
elif (
((tokens2.previous().ttype == TOK_TYPE_FUNCTION) and (tokens2.previous().tsubtype == TOK_SUBTYPE_STOP)) or
((tokens2.previous().ttype == TOK_TYPE_SUBEXPR) and (tokens2.previous().tsubtype == TOK_SUBTYPE_STOP)) or
(tokens2.previous().ttype == TOK_TYPE_OP_POST) or
(tokens2.previous().ttype == TOK_TYPE_OPERAND)
token.tsubtype = TOK_SUBTYPE_MATH
token.ttype = TOK_TYPE_NOOP

if ((token.ttype == TOK_TYPE_OP_IN) and (len(token.tsubtype) == 0)):
if (("<>=").find(token.tvalue[0:1]) != -1):
token.tsubtype = TOK_SUBTYPE_LOGICAL
elif (token.tvalue == "&"):
token.tsubtype = TOK_SUBTYPE_CONCAT
token.tsubtype = TOK_SUBTYPE_MATH

if ((token.ttype == TOK_TYPE_OPERAND) and (len(token.tsubtype) == 0)):
except ValueError, e:
if ((token.tvalue == 'TRUE') or (token.tvalue == 'FALSE')):
token.tsubtype = TOK_SUBTYPE_LOGICAL
token.tsubtype = TOK_SUBTYPE_RANGE
token.tsubtype = TOK_SUBTYPE_NUMBER

if (token.ttype == TOK_TYPE_FUNCTION):
if (token.tvalue[0:1] == "@"):
token.tvalue = token.tvalue[1:]


# move all tokens to a new collection, excluding all noops
tokens = f_tokens()
while (tokens2.moveNext()):
if (tokens2.current().ttype != TOK_TYPE_NOOP):

return tokens

# Main code:
# A simple test-rig. Iterate through a list of test input strings,
# outputing a nested display of the token stream parsed from each one.
if __name__ == "__main__":

# Test inputs
inputs = [
# Simple test formulae
'=3 * 4 + 5',
'=SUM(B5:B15 A7:D7)',
'=SUM((A:A 1:1))',
'=SUM((A:A A1:B1))',
'=SUM(123 + SUM(456) + (45DATE(2002,1,6),0,IF(ISERROR(R[41]C[2]),0,IF(R13C3>=R[41]C[2],0, IF(AND(R[23]C[11]>=55,R[24]C[11]>=20),R53C3,0))))',
'=IF(R[39]C[11]>65,R[25]C[42],ROUND((R[11]C[11]*IF(OR(AND(R[39]C[11]>=55, ' +
'R[40]C[11]>=20),AND(R[40]C[11]>=20,R11C3="YES")),R[44]C[11],R[43]C[11]))+(R[14]C[11] ' +
'*IF(OR(AND(R[39]C[11]>=55,R[40]C[11]>=20),AND(R[40]C[11]>=20,R11C3="YES")), ' +

for i in inputs:
print "========================================"
print "Formula:", i
indent = 0
for t in (getTokens(i)).items:

if (t.tsubtype == TOK_SUBTYPE_STOP):
indent -= 1

print " "*indent + t.tvalue + " "

if (t.tsubtype == TOK_SUBTYPE_START):
indent += 1;

Robin Macharg | 2006.11.17 05:34 AM

Rats. The indentation's got munged which, for Python, is a BAD THING. Can this be fixed? If not contact me for a correctly formatted copy of the code.

Robin Macharg | 2006.11.17 05:39 AM

Finally got the C# code loaded, and a very belated thanks to Robin for the Python port - very cool!


ewbi.develops | 2007.03.20 01:12 PM

I've discovered a bug in this parser. Try for example =6.0E-5 ; it will be parsed as three tokens (two operands and the - as the operator) instead of a single number. Anyone an idea of how to easily fix this?

Tjaard | 2007.04.03 05:01 AM

Oops, thanks Tjaard. I rarely use scientific notation, and I've noticed Excel (2003, at least) converts up to 16 places to the literal number in my formulas anyway. So, for example, your example of =6.0E-5 becomes =0.00006 . However, after 16 places it sticks, so the parser needs to account for it. I'll have to give this some thought, I don't think there's a simple solution. Unfortunately, I've got no time to fix it now, but I'll give it a look as soon as time allows.

ewbi.develops | 2007.04.03 07:11 PM

This is a really cool tool! And it looks you can also use it to parse SQL Server Reporting Services expressions.

Thanks for such a cool tool

Armando | 2007.04.26 11:58 AM

Hey, if any of u guy's might be able to help me with a simple formula in EXCEL!! This is for a transport company! If the driver's code is entered into a cell eg. ss01, then i want the following cells to enter certain data from another excel sheet. Let's say there's 20 drivers starting from ss01 to ss20? When i type ss05, then te detail for that driver must automaticaly write to certain cell's? is this possible, and if yes, please help!! Kind regards, DARRYL!

Darryl | 2007.05.11 05:51 AM


I believe what you're asking is possible, though it's hard to say for certain without additional details. In so much as this is a post regarding formula parsing, not Excel formula implementations, why don't you send me an email at info@ewbi.com with additional details regarding your needs and we'll see if we can figure it out.

ewbi.develops | 2007.05.11 09:52 AM

Thank you genius for saving me a lot of time. I am implementing a JS excel spreadsheet and that was very useful for me !!

Andrea Pradarelli | 2007.07.04 04:30 AM

Great job. I do have a question. I'm wondering what is best way to use the parsed tokens in terms of basically finding an end point...for example:


Let's say I wanted to process the ^ because it is not valid in Javascript that I know of so I want to swap it out for Math.pow(). I need to wrap (2*(1+1)) inside the pow() function and wondering if there are any samples of basically 'GetNextEntity' or something that will get the entire 'next thing' that is valid in a formula. Maybe I have a ton of reading to do (based on all your suggestions) but thought I'd ask in case it is something commonly done already.

Terry | 2007.09.18 08:35 PM

Hi Terry,

You are correct that the parser outlined in this post (and updated in some subsequent posts) doesn't do this. The activity you describe involves interpreting the parser's resulting tokens, pushing and popping lexical scopes (similar to the test code above pushing and popping levels for tree-style presentation), and dealing with operator precedence, usually recursively. So, for instance, having walked the tokens up to the ^, you would prepare a context for what follows, and proceed (pushing and popping scope for functions and subexpressions as necessary) until you encounter (or are about to pop out of) the same scope as the ^, another operator of greater precedence, or run out of tokens; everything that you have accumulated to that point would represent the expression on the right side of the ^. Here are some additional examples that illustrate the effort required:

=10*2^(2*(1+1)) (= 160)
=10*2^(2*(1+1))+2 (= 162; ^ has greater precedence than +)
=10*2^(2*(1+1))% (=10.28114; % has greater precedence than ^)
=2+(10*2^(2*(1+1)+SUM(A2)))*3 (who knows, but you'll push and pop here multiple times)

Here are the precedences of Excel's formula operators:


Good luck.

ewbi.develops | 2007.09.19 11:59 AM

Creating foormula for the Ms Execl.

kamal | 2007.11.07 07:23 PM

Actually, it is possible to write an unambiguous LALR-1 grammar for the Excel formula language. I know, I wrote the formula engine for Hypernumbers,* which is 100% Excel-compatible. There are things that can't be completely resolved at parse-time (e.g. whether a range constructed with INDIRECT() is valid), which you have to do at runtime, but the amount of code required to deal with such cases is surprisingly small.

* http://hypernumbers.com

hv | 2008.01.17 06:30 AM

hv, thanks for the note and link. I never meant to suggest that production of an unambiguous grammer was impossible, only that it wasn't possible for us, due to our requirements and resource limitations. Good luck with hypernumbers, looks interesting - I've signed up for a beta notification.

ewbi.develops | 2008.01.17 01:06 PM

Hi, I am working on a library in C++, which will be used to create excel based reports. Initially a report template will be created in excel and saved as xml. Then the programmer will simply fill the data into respective rows/columns in the program. This library will allow the programmer to load the xml template and manipulate the related objects (elements), which includes adding sheets, defining styles, adding/deleting columns, rows and cells etc. Then the output is again saved to a xml file, which can be again opened in excel.

I am almost done with the other things. Now important thing I need to do is normalize the formule i.e. when a row/column is deleted or insrted affected formule should be changed as they are done in excel application.

I was wondering if there is any c++ class, which could read the formula and allow the changes, and then reconstruct the formula string.

Thanks & Regards,


Joga | 2008.05.02 03:43 AM

Hi Joga,

Sounds like an interesting project. You might want to check out the GNOME Office / Gnumeric open source C++ project (uses a GNU general public license). I suspect there's some code there that might prove helpful.


Good luck!

ewbi.develops | 2008.05.02 08:36 AM

I want to find the name of the cell that has the greatest number. for example there is lots of numbers, I can find Max easily, but how can I put the name of the Max cell by logical formula in excel.

shallaw | 2008.05.07 11:54 AM

Forgive me, shallaw, but I'm not sure if your question is related to the parsing of Excel formulas (the topic of this post), or if you are simply looking for a formula to produce the result you describe. Can you be more explicit?

ewbi.develops | 2008.05.07 12:25 PM

I am in the process of writing an Excel formula parser myself in VBA and your code was and is invaulable help.
Thank you for this great starting point!
Having paid the honors, there are a few issues that I don't get in your code and explanations:
1. You call the comma a union operator. While this is true in some functions, e.g. SUM, it is not true in others, e.g. IF. THat is probably the reason Excel call the comma a 'list separator'.
While It is legal to enter a formula such as "=(A1,A2)" you get a #VALUE! error upon entering the formula in a cell. Can you point me to a sensible usage of the comma in a subexpression?
2. You write that "Excel still allows functions to be preceded with an '@'". While this is true as well, my version (2003) then converts the result of the function into a hyperlink to a mail address and deletes the '@'. In version 97 the '@' is simply deleted which also happens in 2003 if you change the Autocorrect options.
So at least in these two versions you will never see a formula in a cell where a function starts with a '@'.
This also applies to user defined functions written in VBA which names must not contain a '@'.
What am I missing here?
3. In your code you comment that brackets ("[]") have "no embeds (changed to "()" by Excel)".
I know of two ways to use brackets in Excel. To my knowledge in R1C1 notation you can't have embedded brackets. Excel also uses brackets to enclose a file name . In Windows XP brackets are not allowed in file names.
So how can there be embedded brackets in the first place?
4. In the code itself you tackle the problem of nested functions and subexpressions by using the 'tokenstack'. AFAIK there are no nested array nor are there arrays nested in functions ur subexpressions or vice versa. Still you are using the tokenstack for array as well.
Again, I have the feeling I am missing something.
5. Finally I am not sure what exactly you want to achieve with your parser. Clearly one aim is to make a long formula easier to read. This is wonderfully done by ignoring noops, blanks or the various 'matchfixes' in the output.
In no instance save one you are trying to evaluate the formula (next to impossible without info on the cells referenced ;-).
But when you have embedded double quotes in a string, i.e. '="abc""def"', you delete one of them, thus evaluating the formula. The output then (and only then) resembles what you see in the cell as opposed to what you see in the formula bar. What is the reason behind this decision?

Thank you again for sharing all the work that went into this great tool.


Daniel | 2008.05.14 05:16 PM

Hi Daniel,

Glad the code was of some help to you. I'm afraid the code and this topic have gotten little attention from me in the more than three years since they were written, so I'm not sure how much help I can be.

1. I'm not sure I understand what kind of example you are looking for here; the example you provide is pointless, of course, because it attempts to return a multi-cell range from a formula that must return a single scalar value (hence the #VALUE). Unions made with commas (and multi-cell returning intersections made with spaces) only make sense when passed as parameters to functions expecting (and supporting) multi-area ranges, like SUM, MAX, COUNT, ROWS, etc. You'll have to clarify what you mean by "Can you point me to a sensible usage of the comma in a subexpression?"

And, just to be clear, I'm not the only one who calls the comma an Excel union operator, that's what Microsoft calls it too:


2. I'm not sure you're missing anything here, except that I had no reason to limit the code's support to post-Excel 97 formulas, which may or may not persist leading @'s.

3. You left out the line just above that comment, which reads: "bracked strings (range offset or linked workbook name)". The code that follows this comment is a stateful check for a closing bracket found when it's been determined we're within a range address, which means that we're terminating an offset in R1C1 notation or a linked workbook file name. "No embeds" in this case means that we don't have to worry about there being brackets embedded between brackets, certainly not for offsets, but also not for linked workbook names, because if the workbook name/path included a bracket then Excel would have already replaced it with a parenthesis.

4. As I mentioned in the post above, the code choose to represent arrays using a function-like syntax, so when it finds the beginning of an array (or an array row) it pushes a function on the stack and utilizes the same function parameter parsing logic for the array parameters. Even though the elements of a literal array are more limited then the parameters of a function (i.e., constant scalar values only), there was no point in writing entirely separate parsing logic just for this.

5. What I want to achieve with my parser? Well, I've already achieved it, years ago actually. I used parsing logic similar to this, though a bit more advanced, to pull apart formulas within workbooks in order to build range cross-reference tracing and other debugging aids, to build complex formula visualization and normalization tools, workbook calculation differencing tools, and a number of other things. Evaluation of formulas outside of Excel was just a small part of what we were aiming for; what we mostly wanted was to share the logic with others (like you) in order to see what types of things they might come up with. Here are some links to other folks who have utilized this logic (one of which is a VBA-based version for formula visualization in Excel):


Regarding your specific question about doubled-up double quotes in string tokens, I don't consider that evaluation, that's simply normalization of the token's value.

Hope that answers some of your questions. Good luck!

ewbi.develops | 2008.05.14 08:32 PM

Thank you for your quick reply.
Just a few comments:
1. Glad we agree on no sensible use of commas outside a function. It was just that you seemed to implicate this possibility that made me ask about it.
And, yes, you are right, even in my native German MS calls the comma a union operator. Only VBA calls it a list separator which, to me, makes more sense given the ambiguity of its use.
2. I guess your eliminating of leading '@'s would probably fall under 'normalization' (which is not my aim) so I will just leave formula names as they are. I was only curious on this point.
3. This is the only issue still open for me: You state that "... if the workbook name/path included a bracket then Excel would have already replaced it with a parenthesis".
I tried all Excel versions both English and German from 97 to 03 on a Win XP system with R1C1 and A1 notation. The result is always the same: Despite from Excel doing its best to prevent you from getting there it is possible to have a reference like 'C:\Bra[cke]ts\[Filename.xls]'!R1C[1] in a cell.
Opposed to your statement above the brackets are not replaced by paranthesis on my system. They aren't embedded either, of course, but then embedding would only be possible for brackets in a filename. That, though, is illegal under Windows XP.
As I want to show the formula 'as is' (not normalize it), I would be very interested in a hint on possible occurences of embedded brackets (OS?, language?, previous Excel versions?).
4. Got it, thanks. I overlooked the array rows.
5. I guess the handling of embedded double quotes is just a question of style here. That's why I was asking about your aim - not because I wanted to question your expertise or the state of this project. Sorry if I gave a wrong impresiion here, I should have written 'wantED to achieve'.

Again thanks for your invaluable help and sorry for my bad knowledge of English and Java . I am just a simple German VBA guy.


Daniel | 2008.05.15 03:22 AM

No need to apologize, Daniel, I think your English is very good. Certainly better than my German! :)

Regarding #3, I can only suggest that there is perhaps a difference in the versions of Windows/Excel we're using. I just tested this again and found that if I reference a workbook having brackets in its file name in a formula in another workbook, the end result is that the referenced file name is surrounded by brackets (as always happens with "long" file names) and the brackets in the file name are replaced with parentheses. So, for instance, given a workbook with the name "test [file].xls", when it is referenced in the formula of another workbook, the formula reads:

='[test (file).xls]Sheet1'!$B$4

So, as the JavaScript comment says, there's no worry about brackets being embedded in brackets.

Hope that helps. Good luck with your project!

ewbi.develops | 2008.05.15 08:41 AM

Can you please tell me what combination of Windows and Excel (and language settings) produced that result?

Vista (english, US) and XL 07 (English, US), I assume.

Btw I am not the first to experience this on a pre-Vista system (See http://help.lockergnome.com/office/Legal-filename-character-inconsistency-ftopict947822.html)



Daniel | 2008.05.15 04:55 PM

Hi Daniel,

This is English Windows XP SP2 and Office 2003 SP3. Thanks so much for that link. I'm so surprised to see that, as I've never heard of (nor experienced) this problem.

ewbi.develops | 2008.05.16 10:36 AM

Bug! this formula:


the parse result of range is wrong

Bosee | 2008.07.03 07:37 AM

Hi Bosee,

Thanks much for sharing that. I can't put together a scenario in Excel in which this is valid. Can you please explain the steps necessary to set this up? Once I repro it we can get it fixed. Thanks!

ewbi.develops | 2008.07.03 07:47 AM

A product of my company base on Excel, the data of this product very complex, especially it's formula. one day i found a formula writter wrote this formula, it can make the formula parser of the product failed.
I type following formula to the Excel cell one by one, they can work correctly (suppose all formula in 'Sheet-1', the space between two address Excel never remove it).

=SUM('Sheet-1'!A1 : 'Sheet-1'!A10)
=SUM('C:\[Book1.xls]Sheet-1'!A1 : 'C:\[Book1.xls]Sheet-1'!A10)

the result as same as this:


Bosee | 2008.07.05 01:16 PM

is your code available for download ?

rash | 2008.08.28 01:01 AM

Yep. For the JavaScript version, just right-click and View Source on this page and look for the "GoCalc Prototype Excel Formula Parser - free for the taking" comment. For the C# and other language versions, follow the links here:


ewbi.develops | 2008.08.28 08:32 AM

Bosee, just noticed your comment, sorry. I'll try to take a closer look in the next couple of weeks. Thanks!

ewbi.develops | 2008.08.28 08:33 AM

i need ur help for my some problem. can i get An particular Amt ,which i fill in cell A1, "X" time in Row B which i fill in cell A2. suppose i fill 1000 in cell A1 an 10 an A2 the result should be 1000 in cell B1 to B10.
i would realy obliged from your quick reply

neeraj Pathak | 2008.09.20 01:45 AM

Thanks for the information, after searching for week, I found this article.

Myea Amelia | 2009.01.12 07:39 PM

Thank you! I had been struggling with this problem, and you save me many hours by posting this.

Steve C. | 2009.09.03 08:28 PM

Many thanks - fantastic work - just what I was lookg for - pure genius!

Ian Daniels | 2010.01.17 02:53 AM

Thanks you! you give me big help for my work. Great work!

daboai | 2010.01.30 01:56 AM

This is fantastic. Is there an implementation out there that converts an Excel formula to a Javascript function (with plenty of caveats, of course!)?

JN | 2011.01.05 05:49 AM

JN, not that I've seen; however, you might want to check out ExcelEverywhere, which does a complete conversion of a sheet to an HTML/Javascript-only page.


ewbi.develops | 2011.01.05 08:05 AM

Saving this on recursive descent parsers and Excel formulas in C# to come back to some time in the future:


ewbi.develops | 2011.02.14 10:11 PM

The c++ port of your ExcelFormulaParser

lishen | 2011.03.31 02:21 AM

Excellent, lishen! That's fantastic. Folks still find this post and the related C# code and occasionally send me notes about how it has helped them out. This is a great addition. I'll update the sidebar page with a link, too:


ewbi.develops | 2011.03.31 10:39 AM


At times I have to debug an Excel spreadsheet that has been written by others. THE most difficult task I encounter is tracking down embedded constants (e.g., A1/360).

I suspect the Java Script in your article "Excel Formula Parsing" could do this. Is there any way to use this script from within an Excel VBA macro to analyze each formula in a spreadsheet, and just identify all the constants?

Your help and time is greatly appreciated.

Robert Neary

Although I am familiar with Excel and Excel Procedures, the level of programming exhibited in your JavaScript is way beyond my capabilities. So, patience and detail is respectfully requested.


Robert Neary | 2011.05.06 05:15 PM

Hi Robert,

It can indeed be used for just what you've described. In fact, I've done so for a very similar reason - to validate the relationships between range references in formulas. However, I didn't translate the logic to VBA for use within Excel, but instead reworked it into C# for use in a stand-alone .NET utility able to work from outside of Excel.

Some folks have translated the logic to VBA for use in an Excel formula analyzer add-in:


Be sure to read through the comments for a link to the latest version. They might be willing to share it with you for use in your utility.

Links to that site and the various other versions of this logic are available here:


In terms of how to wield this logic to accomplish your task, that's a little more complicated to explain. In short I'll just say it involves passing over each of the cells in the used range of a sheet (or sheets), checking to see if they have a formula, and if they do passing the formula through the parser logic and then enumerating over the token results to identify and report any constants. Of course, there are tricky bits and gotchas along the way, but nothing too hard to overcome.

Please feel free to email at the address above in the left sidebar if you'd like to discuss further.

Good luck!

ewbi.develops | 2011.05.07 02:58 PM


i port the excelformulaparser to Delphi XE.
Here is the download link:


Hartmut David | 2011.06.13 02:25 PM

Hey, that's fantastic!

I'm not a Delphi guy so I won't bother looking at the code, but I've added a link to the aggregation page here:


Thanks so much!

ewbi.develops | 2011.06.14 04:54 PM

I have excel formulas I would like to convert to PHP language. If I could just enter the formula into a function and get a PHP script out that would be great. Does anyone know if there is a tool to covert excel formulas to PHP. This is an example:

dan lippetti | 2011.10.08 10:18 PM

We have developed a Tool for Parsing Formulas in Excel in VBA.
There are 2 versions we have created for each addins: a version for Excel 2003 and an other for Excel 2007.
The heart of both is the function that uses a regular expression pattern to parse the formula.

The function searches all the results and performs a check on each of the submatches (groups). By taking the array of the matched results the function reconstructs the formula by adding carriage return to separate the elements.
An other outer function formats the text by adding tabs.

this is the function of persing:

Function Parser_Formula( _
ByVal sFormula As String, _
Optional sListSeparator As String = ",", _
Optional sRowSeparator As String = "\") As String

Dim M As Object, RE As Object, SM As Object, SB As Object
Dim v As Variant, t As String
Set RE = CreateObject("vbscript.regexp")
RE.ignorecase = True
RE.Global = True

v = Array( _
"(""[^""]*""|'[^']*')", _
"(\{[^}]+})", _
"(\" & sListSeparator & ")", _
"(" & cw & "+(?:\." & cw & "+)*\()", _
"(\))", _
"(^=|\()", _

RE.Pattern = Join(v, "|")
If RE.test(sFormula) Then
Set M = RE.Execute(sFormula)
sFormula = ""
For Each SM In M
Set SB = SM.SubMatches
If Len(SB(0) & SB(6)) Then
t = SB(0) & SB(6)
ElseIf Len(SB(1)) Then
t = Array_Const_Wrap(SB(1), sRowSeparator) & vbCr
ElseIf Len(SB(2) & SB(5)) Then
t = SB(2) & SB(5) & vbCr
ElseIf Len(SB(3)) Then
t = vbCr & SB(3) & vbCr
ElseIf Len(SB(4)) Then
t = vbCr & SB(4)
End If
sFormula = sFormula & t
End If

RE.Pattern = "\r{2,}"
sFormula = RE.Replace(sFormula, vbCr)

Parser_Formula = sFormula
End Function

Function Array_Const_Wrap( _
ByRef sArraY As String, _
Optional sRowSeparator As String = "\") As String
Dim v
If Len(sArraY) > 1 Then
v = Split(Mid(sArraY, 2, Len(sArraY) - 2), sRowSeparator)
Array_Const_Wrap = "{" & vbCr & Join(v, sRowSeparator & vbCr) & vbCr & "}"
End If
End Function

best regards

r | 2011.10.14 01:38 AM

Wow, r, thanks for sharing that. I'm sure some folks will find that helpful.

Dan, I hope someone's able to help you out. I've spent very little time putting parsed formulas back together again to target other languages. Seems fairly straightforward, but time just never permits. Good luck!

ewbi.develops | 2011.10.14 08:14 AM


Roberto Mensa and his colleague Krisztina have created another VBA parser using RegExp https://sites.google.com/site/e90e50/random-topics/tool-for-parsing-formulas-in-excel. I haven't played with it mch, but it looks good so far.

As Excel 64bit does not suppport the treeview control, hey supply treeview and textbox versions.

Bob Phillips | 2012.06.27 05:07 AM

Thanks for the link, Bob. That's good stuff. Can't wait to check it out.

ewbi.develops | 2012.06.27 08:24 AM

Not sure if you can help, this is all a little foreign to me but I'm giving it a shot (I understand very little of programing but I can recognize the language and have seen it when doing some basic html work before, etc.).

I'm trying to find an easy was to convert excel to javascipt to use as javascript in a PDF form. I see your note about right clicking to get the javascript for my converted formula (which I can do successfully). I even was able to enter it correctly into my form. My confusion is, the formula references other cells in the worksheet, what do I do to the other cells to get them to communicate with this formula? Some of the other cells are/will be for the user to type into and the info they enter needs to feed the formulas (some of it actually feeds multiple different formulas, but if I can just get the first figured out, I imagine the rest will be easy).

Help! Thanks!

Beth | 2013.03.07 05:40 AM

Hi Beth,

Pretty cool that you got the JavaScript parser working behind a PDF. I've never done PDF-based scripting, but I'm sure it was challenging. However, based on your explanation, it sounds like parsing Excel formulas into tokens is only a small part of what you want to do.

If I understand correctly, you would also like to use JavaScript to put the tokens back together (statically or interpretively) for execution in order to represent some Excel-based spreadsheet logic involving multiple cells containing constants and formulas, all within a PDF. In other words, you want to convert a spreadsheet, or some portion of one, to JavaScript and then run this within a PDF.

I suppose this is possible. As described in a number of comments above and in some other posts on this blog, there are a number of third party tools and services that do this. But I'm not sure any of them will do so in a way that would support hosting and executing the results in the JavaScript sandbox found in a PDF.

Of course, you could code it up yourself, but having done this myself I can assure you it's not a trivial exercise, depending on the complexity of the spreadsheet logic you need to convert, how dynamic/static the result needs to be, etc.

Perhaps if you explain a bit more about why you want to host/execute some spreadsheet logic inside a PDF, and also describe the complexity of the spreadsheet logic you wish to replicate, I might be able to point you in the right direction.

Good luck!

ewbi.develops | 2013.03.07 12:41 PM

Post a Comment

  (no html)