Usually, you will want a CSV parser whenever possible, but I believe people will face this problem from time to time, so I decided to write this up. It's a sort of best-effort, half-assed attempt with flaws related to how it handles empty fields (,, or ""). If there are no empty fields, it should work reliably.
I haven't googled the problem before starting on the article - I often avoid that if I can, so I don't know how the official nerd doctrine way of doing it might go, but the method I "developed" works, and I can't think of any situations in which it will not work, off the top of my head. Empty fields are potentially the main problem here. Presumably the nerd doctrine way uses a properly built parser with tokens.I'll walk through a few situations you might come across, but will start with a quick screenshot demonstration that fast-forwards to the end. This might serve to draw your attention.
Here I pretend to have a need to be doing PowerShell's Import-Csv cmdlet's job, and parse the CSV format PowerShell uses when you Export-Csv, and also built in a check for backslash-escaped commas in the data which you're likely to run across using C#, VB.NET, Python and other languages, in various situations.
However, I start by looking at comma-delimited stuff, to support other formats than the double-quoted PowerShell format.
The regex and results will look something like this:PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?') | ` foreach { $_.Groups[1].Value } "CSV field 1"But this produces three blank lines/elements. I filter out the empty elements by checking that the line contains non-whitespace. Using a Where-Object and $_.Trim() is a clever way of doing that, since an empty string is false and will then be filtered out by Where-Object."field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote"
"field3"
PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?') | ` foreach { $_.Groups[1].Value } | Where { $_.Trim() } "CSV field 1" "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote" "field3"
In retrospect, it occurred to me that using the quantifier "+" instead of "*" should help clean up results directly, and the old method doesn't handle empty elements anyway.
PS C:\temp> $String = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field3"' PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | ` foreach { $_.Groups[1].Value } "CSV field 1" "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote" "" "field4"
Using the new regex, and checking how this method copes with empty fields, I get results that seem satisfactory for this use case; it copes with the empty double-quoted element. You can see the count of fields is 4.
PS C:\temp> $String = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"' PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | ` foreach { $_.Groups[1].Value } | ` foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } } CSV field 1 field2\, with\, backslash-escaped delimiters and a little ""surprise"" quotefield4
PS C:\temp> ([regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | ` foreach { $_.Groups[1].Value } | ` foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } }).Count 4
Finally, I tack on an example of replacing double double quotes with single double quotes.
PS C:\temp> ([regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | ` foreach { $_.Groups[1].Value } | ` foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } }) -replace '""', '"' CSV field 1 field2\, with\, backslash-escaped delimiters and a little "surprise" quotefield4
PS C:\temp>
>>> string = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "field3"' >>> import re >>> m = re.findall(r'((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?', string); m [('"CSV field 1"', ''), ('', ''), (' "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote"', ''), ('', ''), (' "field3"', ''), ('', '')] >>> fields = [field for field in fields if field.__len__() > 0]; fields ['"CSV field 1"', ' "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote"', ' "field3"']
In the end you have a list of the fields, with a leading whitepace on all but the first. Might want to handle that.
To do as I do in PowerShell, and extract the stuff between double quotes with support for doubled-up double quotes in the data, you could do something like this:>>> for field in fields: ... m = re.match(r'\s*"((?:""|[^"])*)"\s*', field) ... if m: ... print m.groups(1) ... ('CSV field 1',) ('field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote',) ('field3',)
var string = '"CSV field 1", "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"' console.log("Before parsing:\n" + string + "\n\n"); var regex = /((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?/g; //var regex = /([^,]+),?/g; console.log("Listing fields:\n\n"); while ((r = regex.exec(string))) { console.log(r[1]); }
Output:
PS C:\temp> node .\test.js Before parsing: "CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"Powershell Windows Python .NET Regex CSV JavaScript All CategoriesListing fields:
"CSV field 1" "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote" "" "field4"
Minimum cookies is the standard setting. This website uses Google Analytics and Google Ads, and these products may set cookies. By continuing to use this website, you accept this.