Get-CSVDelimiter

This is a function I got from the internet ( http://www.powertheshell.com/autodetecting-csv-delimiter/ ) and modified a little can be used to determine the delimiter used in a CSV file.

It does this in such a simple way you’ll be wondering why you didn’t think of it yourself.  It parses each line and counts the obvious ‘candidates’ for a delimiter and then after each line checks if the ‘usual suspects’ ( great film by the way and  if you haven’t seen it yet the what have you been watching, neighbors, come on this should be on everyone’s watch list)  then you can use the appropriate –delmiter switch on the import-command.

A CSV file will use different delimiters depending on the country you live in apparently.  so the C in Comma Separated Values clearly does not stand for comma 🙂

In the UK it’s not uncommon to see PIPE delimited file i.e. using | instead of , probably because when you open it with excel it doesn’t  mangle the ‘integer looking’ values in the row, e.g. a telephone number like this +44 20 7123 1234 canget converted into a numeric value including an exponential and don’t get me started on how Excel mangles date formats depending on the workstation local, but we should all blame ourselves for allowing each country to ‘choose’ a date format.  Why can’t we all agree one way or the other which parts of a date are which, i.e is this the 12th of the 1st and of which month, 12/1/2018 – you are either slightly early or exceptionally late to the party!

12/1/2018 – you are either slightly early or exceptionally late to the party! Neither is a good place to be.

Anyway,. let talk about CSV files.  This handy function should be able to work out what character is being used as a delimiter and then you can use this information in an import-csv file to ensure that you import the data correctly.

e.g. Import-Csv –Delimiter  (Get-CSVDelimiter c:\temp\mycsvfile.csv )

Although I’d put a little more error checking around it if you are planning on using this in your scripts.

Like all code I get from the internet I try to analyse how it does what it does and in doing this I saw some nice ideas around formatting the output.  This blog post on PSCustom objects will tell you all ( hopefully everything ) you need to know about PSCustom objects.  I particularly liked the elegant solutions on controlling the output attributes, something we take for granted when we output an objects contents to screen.

Haven’t you noticed that poweshell does not always show every attribute of an object – try using ‘| select * | fl’ some time to see the difference.  I wish the Quest commandlets would do this as those commandlets return a lot of attributes which is why they tend to be a lot slower than the AD commandlet equivalents.

https://kevinmarquette.github.io/2016-10-28-powershell-everything-you-wanted-to-know-about-pscustomobject/

function Get-CSVDelimiter { Version 1.00
[CmdletBinding()]
param (
  # Path name to CSV file
  # can be submitted as string or as File object
  [Parameter(ValueFromPipelineByPropertyName=$true,ValueFromPipeline=$true)]
  [Alias('FullName')]
  [String]$Path
)
begin {
  # list of ascii codes that typically cannot be a delimiter: 0-9, A-Z, a-z, and space:
  $excluded = ([Int][Char]'0'..[Int][Char]'9') + ([Int][Char]'A'..[Int][Char]'Z') + ([Int][Char]'a'..[Int][Char]'z')  + 32
  
  function Get-DelimitersFromLine {
   param($TextLine)
   $quoted = $false
   $result = @{}
   # examine line character by character
   foreach($char in $line.ToCharArray()) {
    # if a double-quote is detected, toggle quoting flag - ignores everything inside a double quoted string
    if ($char -eq '"') { $quoted = -not $quoted } elseif ($quoted -eq $false) { if ($excluded -notcontains [Int]$char) {  $result.$([Int]$char) ++  } }
   } # foreach($char in $line.ToCharArray()) {
   $result
  } # Get-DelimitersFromLine
} # Begin
 process {
  # initialize variables
  $oldcandidates = $null
  # examine each line in CSV file:
  $file = [System.IO.File]::OpenText($Path)
  $lines = 0
  $foundDelimiter = $false
  While (-not $file.EndOfStream) {
   $line = $file.ReadLine()
   $lines++
   if ( $foundDelimiter ) { continue }
   # examine current line and get possible delimiters:
   $candidates = Get-DelimitersFromLine $line
   # if this is the first line, skip analysis
   if ($oldcandidates -eq $null) {
    # if first line starts with "#", ignore
    if (-not $line.StartsWith('#')) { $oldcandidates = $candidates }
   } # else, identify ascii codes that have the same frequency in both this line and the previous line, and store in hash table $new:
   else {
    $new = @{}
    $keys = $oldcandidates.Keys
    foreach($key in $keys) {
     if ($candidates.$key -eq $oldcandidates.$key) {
      $new.$key = $candidates.$key
     }
    }
    $oldcandidates = $new
    # if only 1 possible delimiter is left, we are done
    # exit loop, no necessity to examine the remaining lines:
    if ($oldcandidates.keys.count -lt 2) {
     $foundDelimiter = $true
     Continue
    }
   }
  } # While (-not $file.EndOfStream) {
  $file.Close()
  # create return object
#  $rv = New-Object PSObject | Select-Object -Property Name, Path, Delimiter, FriendlyName, ASCII, Columns, Rows, Status
#  $rv.Rows = $lines
#  $rv.Path = $Path
#  $rv.Name = Split-Path -Path $Path -Leaf
  # no keys found:
  if ($oldcandidates.keys.count -eq 0) {
   $rvStatus = 'No delimiter found'
  } # exactly one key found, good:
  elseif ($oldcandidates.keys.count -eq 1) {
   $ascii = $oldcandidates.keys | ForEach-Object { $_ }
   $rvASCII = $ascii
   # convert ascii to real character:
   $rvDelimiter = [String][Char]$ascii
   # number of Columns is frequency of delimiter plus 1:
   $rvColumns = $oldcandidates.$ascii + 1
   # add friendly names for the most common delimiter types:
   switch ($ascii) {
      9 { $rvFriendlyName = 'TAB'       }
     43 { $rvFriendlyName = "Plus"      }
     44 { $rvFriendlyName = 'Comma'     }
     58 { $rvFriendlyName = 'colon'     }
     59 { $rvFriendlyName = 'Semicolon' }
    124 { $rvFriendlyName = 'Pipe'      } 
   }
   $rvStatus = 'Found'
  } # elseif ($oldcandidates.keys.count -eq 1) {
  # ambiguous delimiters detected, list ambiguous delimiters
  else {
   # convert delimiter ascii keys in a comma-separated list of quoted characters:
   $delimiters = (($oldcandidates.keys | ForEach-Object { ('"{0}"' -f [String][Char]$_) }) -join ',')
   $rvStatus =  "Ambiguous separator keys: $delimiters"
  }
  $rv = [PSCustomObject]@{
    'PSTypeName'   = 'My.DelimiterTypeObject'
    'FileName'      = $(Split-Path -Path $Path -Leaf)
    'Path'          = $Path
    'Columns'       = $rvColumns
    'Rows'          = $lines
    'DelimiterName' = $rvFriendlyName
    'Delimiter'     =  $rvDelimiter
    'ASCII'         = $rvASCII
    'Status'        = $rvStatus
  }
  # ETS: set default visible properties on return object (applies to PS 3.0 only, no effect in PS 2.0):
  [String[]]$properties = 'FileName','DelimiterName','Columns','Rows'
  [System.Management.Automation.PSMemberInfo[]]$PSStandardMembers = New-Object System.Management.Automation.PSPropertySet DefaultDisplayPropertySet,$properties
  $rv | Add-Member -MemberType MemberSet -Name PSStandardMembers -Value $PSStandardMembers
  # return result:
  $rv
} # process {
 
} #                     Get-CSVDelimiter Version 1.00

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.