2007.03.20 01:11 PM

Excel Formula Parsing, in C#

Update 4/26/2007

Fixed a problem with literal numbers in formulas formatted with scientific notation (e.g., =3.1E-24-2.1E-24). More info here.


Hard to believe it's been over two years since I posted a JavaScript version of my Excel formula parser here. (As far as that goes, it's hard to believe it's been over three months since I posted anything here.) Since that original post, the parsing logic has been re-used in a variety of ways by me and others, including a VBA port by Rob van Gelder for an Excel add-in and a cool Python port by Robin Macharg in the comments of the original post (which, unfortunately, I never got around to responding to - thanks Robin!). Beyond that, though, it hasn't gotten much attention.

However, on Friday, a reader reminded me that last year in the comments I'd promised to post a C# version of the parser. Oops. So I dug around and found the code. Once found, I had to pull it out of its project, clean it up a little, modify the execution context some, and add a sample form to illustrate its use. The sample form uses the parser to basically do what the JavaScript parser does on the original post:

Hopefully the fact that it's written in C# for .NET 2.0 won't be a big issue for the reader who requested the code. If so, I might be able to dig up an older .NET 1.1 version.

Note that this parser still relies on the older (slower) multi-pass infix, union, and whitespace logic reflected in the JavaScript, so it's not optimal. But hey, it's free.

Anyhow, here's a VS2K5 solution with the parser and sample form projects.

Feel free to use the code any way you want (there's a single copyright condition in the parser's ExcelFormula class). Also, please take note of the disclaimer: as always, there are no warranties - use the code at your own risk. That said, if you find something wrong or make some interesting mods, please come back and share them.


Comments

Oops, reader Tjaard pointed out a problem with scientific notations on the original formula parser post. Fix forthcoming, as time allows.

ewbi.develops | 2007.04.03 07:12 PM

I've recently written my own Excel formula parser in VBA, and so it's of great interest to me to see how someone else tackled this problem.

Our solutions couldn't be more different.

You basically slay the dragon with one single procedure. No recursion, no regexes, no hashtables - just your bare hands (and a stack). To me it looks almost miraculously concise.

Whereas mine does all of the above. And the code is much longer, and I bet it takes several times longer to execute...

On the other hand, I'm pretty sure mine is more extensible - I'm particularly proud of the first of my (ahem) six passes, which is essentially a general purpose lexer rather than one specialized for Excel.

My parser returns an abstract syntax tree rather than a string of tokens in Reverse Polish style.

I also have functions to regenerate formulas from syntax trees.

At the moment I'm using it to (and there's clever-sounding technical term for this which I've very cleverly forgotten...) partially evaluate a workbook - you declare part of your spreadsheet to be 'the data', and any formula or subformula that is determined to be independent of the data is evaluated.

Neil Fitzgerald | 2007.04.22 11:19 AM

Oh, your parser is slightly broken btw.

It doesn't know how to treat the colon as the 'Range operator' e.g. =SUM(A():B()).

This is actually quite tricky.

=SUM(AB:AC) always means the sum of everything in columns AB and AC (even if there exist names AB and AC (which is quite possible)).

On the other hand...

=SUM(AB:ABCD) always means the sum of the afore-mentioned 'Range operator' applied to the two named ranges AB and ABCD.

As for SUM(AB:ABC)... as far as I know (and I could be wrong) in versions of Excel prior to 2007, this is just like ABCD, but in Excel 2007, it's like the AB:AC version (because now there are lots more columns).

Neil Fitzgerald | 2007.04.24 03:03 PM

Just one more thing:

Thanks for posting it!

I know it's only an early prototype / wasn't meant to be flawless and wash the dishes and take the trash out etc. What it does, it does very well, and proves to be a useful source of ideas.

Neil Fitzgerald | 2007.04.24 03:35 PM

Hi Neil,

Wanted to write you back sooner, but I've been away for a few days and needed some time to catch up.

You are correct about the approach used in this code. As I mentioned in the original post (linked in the post above), we considered some different approaches, including generation of an AST. Ultimately, for lack of time (and remuneration), we settled on a simpler Excel-specific evaluation. This doesn't actually represent our final cut, which eliminated one of the post-parse fix-up enumerations and improved the identification of ranges and names; it's just one of the earlier prototypes we used for testing and decided to share. I'd agree it probably is easier to read and understand, but I'd preferred to have developed a more general-purpose approach, similar to what you've done. Time wasn't on our side, though, so brute-force simplicity won out.

Regarding the parser being broken, you are correct. In fact, it's busted in a couple of ways (the prototype was only meant to handle the most common cases). I've actually got a patch for one issue, which is its failure to accommodate in-line scientific notation, which I'll hopefully get posted soon. The issue you describe (AB:AC vs AB:ABCD) is caused by the fact that this version of the parser doesn't bother to distinguish ranges from names; basically, if it's not a function or operator, its an operand, and if it's an operand and not a number or text, it's a range. That's one way in which the final version of the code is better than the prototype, and it's also not something I'll be retrofitting to this earlier code (well, not anytime soon).

Any chance you'll be sharing your work? Based on the hits these posts get each day, it seems there's a lot of interest.

Thanks for the comments.

ewbi.develops | 2007.04.24 03:54 PM

this is great, thanks for sharing this.
but wondering how would I use it to generate a C# function so that its reusable in my .NET Library. Please let me know if there is any way out!!!

Thanks

Mustafa | 2007.06.15 02:52 AM

Hi Mustafa, you'll find a link to the C# version here:

http://ewbi.blogs.com/develops/popular/excelformulaparsing.html

ewbi.develops | 2007.06.15 06:28 AM

yeah I have had a look at it, but I am looking at a conversion in this form

Excel :

=if(percentage>75,distinction,if(percentage>60,first,if(percentage>50,second,if(percentage>35,pass,fail))))

would give you an output in this form

double Function1(double percentage, double distinction, double first, double second, double pass, double fail)
{
double returnvalue;
if(percentage>75)
{
returnvalue = distinction;
}
else
{
if(percentage>60)
{
returnvalue = first;
}
else
{
if(percentage>50)
{
returnvalue = second;
}
else
{
if(percentage>35)
{
returnvalue = pass;
}
else
{
returnvalue = fail;
}

}

}

}

}

I have reached upto this level, but wondering how to incorporate all the functions like and, or, not, min, max, avg etc. Please let me know if any solution or a easier way of going about it exists using your code.

Thanks

Mustafa | 2007.06.20 09:46 PM

Oh, well now, that's the hard part. ;)

As you've probably figured out, the lexical analysis and parsing logic I've provided only gets you half way there. It provides the Excel formula as an array of normalized tokens, which certainly helps, but interpretation and conversion of those tokens into alternative representations (i.e., as in your example, into some C#/Java-like language) for compilation and/or execution is a project unto itself. I've done it (partly), as have many others. For instance, see Matt Brown's work in Rico 2.0, which I linked to here:
http://ewbi.blogs.com/develops/2007/04/excel_formula_p.html

However, I'm not likely to share much of my work in that area here for free (for now, anyway), partly because I don't have time, but more importantly because it required a significant investment, which I'm still recovering.

Good luck!

ewbi.develops | 2007.06.20 10:08 PM

I would be very greatful if you could rummage around and find a .net 1.1 version of the parser. thanks in advance.

Perry Nally | 2008.01.08 09:16 AM

Hi Perry,

I'm afraid that's probably not going to be possible now. Since this was written, the artifacts for this effort have been archived and some machines swapped out. I'll take a look at the code, though, when I get a chance and see how tied it was to .NET 2.0 - it might be just a matter of swapping out some generic lists/dictionaries for regular old hash tables.

ewbi.develops | 2008.01.09 11:24 AM

Well, Perry, it's not too bad. Mostly just eliminate the generics.

For instance, ExcelFormula needs to inherit IList instead of IList[ExcelFormulaToken], which means some interface members must change (some additions, some removals), and some must have their parameters de-genericized, such as the CopyTo() and GetEnumerator() methods. Similarly, the ExcelFormulaStack inner class needs to use a basic Stack instead of Stack[ExcelFormulaToken]. And this, of course, means you'll need to do some casting of the Object references.

I haven't any time to help, but if you run into any problems with this or have a question, please come back and let me know. Good luck!

ewbi.develops | 2008.01.09 12:23 PM

Thanks for looking. I undertook to create my own spreadsheet due to time constraints and the rendering various objects on screen. This is working very well for my needs and is very fast (the crucial part). Thanks for exploring. I will use this as reference if needed. Thank you again.

Perry Nally | 2008.02.04 05:41 AM

Its to good

Arhath | 2008.06.13 01:43 AM

Please Help for Excel Formula
Que. How to find the result PASS/FAIL according to each subject, if any one subject is less then 35 the student is FAIL otherwise PASS.

Pravin | 2008.10.08 09:08 PM

Hi,
Its nice work and very helpfull too.
but i found that this parser takes boolean values (true/false) as range too. I am going to ammand code to address this limitation. please let me know if you have any concern.
Thanks

Imran Javed | 2009.06.02 04:32 AM

Hi Imran,

You might check that the formula you're parsing uses uppercase TRUE and FALSE values, as expected (required, actually) by Excel. My testing shows that these are properly identified as "logical" values, not "range" values. However, bugs happen, and if you've got a particular formula that doesn't parse correctly I'd love to know what it is. Thanks, and good luck!

ewbi.develops | 2009.06.02 07:36 AM

A beautiful piece of engineering, saved me a whole load of drudgery. Many thanks for publishing this!

Maurice Calvert | 2011.07.19 05:56 AM

Hi,

this is very helpful, but how to get the index of the token in formula string and is this possible to rebuild the string formula from the token list?

imad | 2011.08.30 09:18 AM

imad, it seems the tokens don't carry their own index, though they are indexed in the set/list of tokens. If you need this it should be pretty easy to add. Regarding rebuilding the formula, without spending too much time analyzing it, I'd say yes it is possible.

ewbi.develops | 2011.08.31 02:15 PM

it seems to much work to do for now, but it will be very helpful if the token can carry his own string index and the index of the closing token.

imad | 2011.09.02 07:05 AM

c# Excel automation

http://csharp.net-informations.com/excel/csharp-excel-tutorial.htm

lee.

lee suko | 2012.01.23 10:51 AM

Your Excel formula parser written in C# saved me a ton of work. I'm using it to figure out which external workbooks a formula links to. Thanks!

James Nylen | 2013.01.28 07:22 AM

James, glad to hear it - thanks for sharing!

ewbi.develops | 2013.01.28 08:59 AM


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d8341f48a053ef

Listed below are links to weblogs that reference Excel Formula Parsing, in C#: