Parsing comma separated values with regular expressions
We’ve all had to write code to parse comma separated values before; it sounds simple, but it can actually be quite tricky! Sure, if our lists were always nicely defined like this:
- “one”,”two”,”three”,”four”
- five,six,seven,eight
Then we could simply use String.Split. But life is never that kind! When your input strings may be a bit more loosely defined, like this:
- one,”two,three”,four,,six,”seven”
- ,two,”three,four,five”,,
It gets a little tougher.
So can you do it using a single regular expression? Yes, you most certainly can! It’s simply a matter of breaking down the possibilities, then catering for the best case scenario (quoted values), down to the worse case scenario (zero-length values), and finally, catering for the delimiters (either a comma, or the end of the string). Lets look at them one step at a time.
Firstly, quoted values. This is by far the easiest of all the conditions – find any length of text between two quotes. We’ll use a non-greedy expression (the question mark after the star) to ensure we don’t over-extend the length of text that we match:
private const string
Template_QuotedValues = @"""(?<content>.*?)""";
The next easiest type of match to capture are non-quoted, non-zero length values. To do this, we’ll simply look for one or more characters which are not a comma. Again, we’re using a non-greedy match:
private const string
Template_UnquotedValues = @"(?<content>[^,]+?)";
Notice also that that for both templates, we’re creating a named group called “content” – this allows us to easily extract the contents the match, no matter what conditions were matched under.
The last type of match we need to cater for is non-quoted, zero-length matches. This is the trickiest of the three situations, since there’s “nothing” to actually match on! So instead, we look zero repetitions of any character, immediately after a delimiter. Since the first value in the list may be empty, the possible values for our delimiter are either the start of string (specified by the hat – ^), or a comma:
private const string
Template_EmptyValues = @"(?<=(?:,|^))(?<content>.{0})",
The final piece of the puzzle is the delimiters. Since we’re matching from left-to-right, we can assume that every match will be followed either by a comma, or the end of the string. We’ll use a non-capturing group since we don’t want the delimiter to be explicitly captured in a group.
private const string
Template_Delimiter = @"(?=(?:,|$))";
Now, to put it all together. We have our three types of matches that we’re expecting, and our delimiter, so all we need to do is create a single RegEx for it all. Here goes:
private const string
// Any length value within quotes...
Template_QuotedValues = @"""(?<content>.*?)""",
// ... or values with at least 1 character, not in quotes...
Template_UnquotedValues = @"(?<content>[^,]+?)",
// ...or zero-length matches, not in quotes...
Template_EmptyValues = @"(?<=(?:,|^))(?<content>.{0})",
// ... followed either a comma, or end of string
Template_Delimiter = @"(?=(?:,|$))";
// Now join as one Template - notice the OR condition (pipe)
// between the three match types
readonly static private string
Template = String.Format("({0}|{1}|{2}){3}",
Template_QuotedValues,
Template_UnquotedValues,
Template_EmptyValues,
Template_Delimiter);
// Finally, our RegEx!
readonly static private Regex CsvSplitterRegex
= new Regex(Template, RegexOptions.Compiled);
Was that so bad? ;-) Iterating through the list of values in our comma separated list is now a piece of cake.
// Assume CSV is in "record" field
foreach (Match match in CsvSplitterRegex.Matches(record))
{
Console.WriteLine("Match value: {0}",
match.Groups["content"].Value);
}
Simple, eh?