Thursday, January 19, 2012

Sebastien Lorion's Fast CSV Reader: Standing the Test of Time

A couple of years ago, when I was working on some data integration projects and didn't have the luxury of SSIS or Informatica, I had to write some custom .NET components to handle CSV sources flowing into OLEDB destinations.  Given what's available in the standard .NET libraries (2.0 at the time), such as string parsers, RegEx handlers, and StreamReaders, one would think it would be a cinch.  However, I wanted some of the niceties that ETL engines like SSIS and Informatica provide: text-qualified (quoted) fields, handling of escaped characters, custom delimiter characters, and missing field actions.
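To make those four niceties concrete, here is a minimal sketch.  It is written in Python rather than .NET purely so that it stays self-contained with nothing but the standard library, and it is not Fast CSV Reader's API.  The function name `read_rows`, the `on_missing` option, and the sample data are all made up for illustration.

```python
import csv
import io


def read_rows(text, delimiter=",", quotechar='"', escapechar="\\",
              on_missing="error"):
    """Parse delimited text whose first record is a header row.

    on_missing is a hypothetical option for this sketch:
      "error" raises ValueError when a record has too few fields,
      "empty" pads the record with empty strings instead.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,    # custom delimiter character
        quotechar=quotechar,    # text qualifier for quoted fields
        escapechar=escapechar,  # escape character inside fields
    )
    headers = next(reader)
    rows = []
    for record_number, row in enumerate(reader, start=1):
        missing = len(headers) - len(row)
        if missing > 0:
            if on_missing == "error":
                raise ValueError(
                    f"record {record_number}: expected "
                    f"{len(headers)} fields, got {len(row)}"
                )
            row = row + [""] * missing  # missing field action: pad
        rows.append(dict(zip(headers, row)))
    return rows


# Pipe-delimited sample: a quoted field containing the delimiter,
# escaped quotes inside a quoted field, and a record missing a field.
sample = (
    'id|name|note\n'
    '1|"Lorion|Sebastien"|"said \\"hi\\""\n'
    '2|Smith\n'
)

rows = read_rows(sample, delimiter="|", on_missing="empty")
```

Even this toy version has to make decisions about qualifiers, escapes, and short records, which is why rolling your own parser from string splits and regular expressions gets tedious quickly.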


Instead of trying to reinvent the wheel, I scoured CodeProject for some inspiration and a starting point.  I never expected to find a CSV parsing library so well written that I ended up using the entire project out of the box for my CSV handling needs.  That library is Sebastien Lorion's Fast CSV Reader.


Sebastien did a tremendous job of parsing CSVs the way a good integration utility (such as SSIS or Informatica) would.  You name it, this library's got it: handling missing required fields, handling malformed CSV rows, field headers, the works.  I believe the only thing it didn't do (at the time I used it) was identify exactly which field was not in the expected data format.  That may have changed over the years, however, so I'll have to bring it into a test project and see what she can do now, about 3 years later.


Oh, and one other thing.  The Fast CSV Reader is incredibly memory efficient.  I can corroborate the numbers reported on the CodeProject page for this code: she runs lean and mean.  I seem to recall running a relatively large CSV file (several hundred megabytes) through the reader; it ran quickly and I didn't have any memory issues.  For that, if I ever meet Sebastien in real life, I would definitely give him a Geek High Five.


Anyway, it appears that this project is still actively supported by Sebastien, so if you are writing custom .NET utilities to handle CSVs (or any character-delimited file), I'd highly recommend either using Fast CSV Reader or using its code as a starting point for your implementation.
