Parsing comma separated values with regular expressions

We’ve all had to write code to parse comma separated values before; it sounds simple, but it can actually be quite tricky! Sure, if our lists were always nicely defined like this:

  • “one”,”two”,”three”,”four”
  • five,six,seven,eight

Then we could simply use String.Split. But life is never that kind! When your input strings may be a bit more loosely defined, like this:

  • one,”two,three”,four,,six,”seven”
  • ,two,”three,four,five”,,

It gets a little tougher.

So can you do it using a single regular expression? Yes, you most certainly can! It’s simply a matter of breaking down the possibilities, then catering for the best case scenario (quoted values), down to the worse case scenario (zero-length values), and finally, catering for the delimiters (either a comma, or the end of the string). Lets look at them one step at a time.

Firstly, quoted values. This is by far the easiest of all the conditions – find any length of text between two quotes. We’ll use a non-greedy expression (the question mark after the star) to ensure we don’t over-extend the length of text that we match:

private const string
    Template_QuotedValues = @"""(?<content>.*?)""";

The next easiest type of match to capture are non-quoted, non-zero length values. To do this, we’ll simply look for one or more characters which are not a comma. Again, we’re using a non-greedy match:

private const string
    Template_UnquotedValues = @"(?<content>[^,]+?)";

Notice also that that for both templates, we’re creating a named group called “content” – this allows us to easily extract the contents the match, no matter what conditions were matched under.

The last type of match we need to cater for is non-quoted, zero-length matches. This is the trickiest of the three situations, since there’s “nothing” to actually match on! So instead, we look zero repetitions of any character, immediately after a delimiter. Since the first value in the list may be empty, the possible values for our delimiter are either the start of string (specified by the hat – ^), or a comma:

private const string
    Template_EmptyValues = @"(?<=(?:,|^))(?<content>.{0})",

The final piece of the puzzle is the delimiters. Since we’re matching from left-to-right, we can assume that every match will be followed either by a comma, or the end of the string. We’ll use a non-capturing group since we don’t want the delimiter to be explicitly captured in a group.

private const string
    Template_Delimiter = @"(?=(?:,|$))";

Now, to put it all together. We have our three types of matches that we’re expecting, and our delimiter, so all we need to do is create a single RegEx for it all. Here goes:

private const string
    // Any length value within quotes...
    Template_QuotedValues = @"""(?<content>.*?)""",

    // ... or values with at least 1 character, not in quotes...
    Template_UnquotedValues = @"(?<content>[^,]+?)",

    // ...or zero-length matches, not in quotes...
    Template_EmptyValues = @"(?<=(?:,|^))(?<content>.{0})",

    // ... followed either a comma, or end of string
    Template_Delimiter = @"(?=(?:,|$))";

// Now join as one Template - notice the OR condition (pipe)
// between the three match types
readonly static private string
    Template = String.Format("({0}|{1}|{2}){3}",
        Template_QuotedValues,
        Template_UnquotedValues,
        Template_EmptyValues,
        Template_Delimiter);

// Finally, our RegEx!
readonly static private Regex CsvSplitterRegex
    = new Regex(Template, RegexOptions.Compiled);

Was that so bad? ;-) Iterating through the list of values in our comma separated list is now a piece of cake.

// Assume CSV is in "record" field
foreach (Match match in CsvSplitterRegex.Matches(record))
{
    Console.WriteLine("Match value: {0}",
        match.Groups["content"].Value);
}

Simple, eh?

Posted in C#, RegEx at May 15th, 2010. 2 Comments.