Recently, a client had the following situation:
- A folder contains 1500 files on a file share.
- Almost all files are pictures and only the jpg files need to be processed. The pictures need to be placed in many different document sets in one document library in SharePoint.
- The filenames often contain two codes which combined is an identifier for the document set the file must be placed into.
- The codes are almost in all filenames in the wrong order.
- One code [Code 1] contains 1-3 letters. The other code [Code 2] contains 4 digits.
- The unique identifier is: [Code 1]-[Code 2]
Filename examples:
- 1111-NL.jpg
- 2222-NL-A.jpg
- 3333-A-NL.jpg
- 4444A-NL.jpg
- Picture of something.jpg
The client was thinking about manually renaming and uploading all files. It took me little effort to convince the client that a better approach was possible: PowerShell to the rescue.
I proposed the following approach:
(1) Create a csv-file (CSV-1) using an intelligent PowerShell script (Script-1).
Every row is related to one file. The script (Script-1) processes every file trying to substract the codes from the filename. The csv file (CSV-1) contains the following columns:
- Filename
- Code 1
- Code 2
- Suffix
- The extension
- A category identifier
(2) Import the csv file (CSV-1) into an Excel-file.
Because [Code 2] could be something like ‘0123’, the import process should NOT convert these to numbers.
(3) Let someone inspect the rows in the Excel-file (the files) and if needed correct them.
(4) Export the Excel-file back to a csv file (CSV-2).
(5) Rename the files with a PowerShell script (Script-2) based on csv file (CSV-2).
(6) Create a csv file (CSV-3) which maps the unique identifier to the name of a document set.
(7) Upload the files with a PowerShell script (Script-3) using csv files (CSV-2) (CSV-3).
This blog post shows at the bottom the PowerShell script (Script-1). That script took just a few seconds to process all files.
As one can imagine, this approach took much less effort to execute then manually correcting and uploading the files. I am convinced that this approach improved the data quality too.
The last column in the csv file (CSV-1) contains a category identifier (A, B or C):
- A : No manual correction is needed. There is a 100% match. There is no suffix.
- B : The processing could be correct but some manual correction could also be needed.
- C : This filename is to difficult to process. This file really needs some correction.
Because we wanted to make sure that as little as possible manual correction was needed, some intelligence was build into the PowerShell script (Script-1) for categories ‘B’ and ‘C’:
- B : The suffix contains the part of the filename without [Code 1], [Code 2] and the dashes.
- C : When there is a match for -YYY, YYY is used for [Code 1]. YYY being one, two or three letters.
- C : When there is a match for -YY, YY is used for [Code 1]. YYY being one, two or three letters.
- C : If the filename starts with four digits (which was almost always the case), these were used for [Code 2].
- C : If the filename starts with four digits followed by a dash, the suffix does not contain these.
$scriptPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
$filesPath = $scriptPath + '\Files'
$filePath = $scriptPath + '\Files.csv'
$files = Get-ChildItem -LiteralPath $filesPath -File:$true
foreach ($file in $files) {
if ($file.Extension -eq '.jpg') {
$fileNameWithoutExtension = [IO.Path]::GetFileNameWithoutExtension($file.Name)
if ($fileNameWithoutExtension -match "^\d\d\d\d-[a-z]{1,3}$") {
$array = $fileNameWithoutExtension -split '-'
$two = $array[0]
$one = $array[1]
$suffix = ''
$row = $file.Name + '|' + $one + '|' + $two + '|' + $suffix + '|' + $file.Extension + '|A'
} elseif ($fileNameWithoutExtension -match "^\d\d\d\d-[a-z]{1,3}-") {
$array = $fileNameWithoutExtension -split '-'
$two = $array[0]
$one = $array[1]
$length = $one.Length + 1 + $two.Length + 1
$suffix = $fileNameWithoutExtension.Substring($length, $fileNameWithoutExtension.Length - $length)
$row = $file.Name + '|' + $one + '|' + $two + '|' + $suffix + '|' + $file.Extension + '|B'
} else {
# One
if ($fileNameWithoutExtension -match '-[a-z][a-z][a-z]') {
$one = $Matches[0].Substring(1, 3)
} elseif ($fileNameWithoutExtension -match '-[a-z][a-z]') {
$one = $Matches[0].Substring(1, 2)
} else {
$one = ''
}
# Two
if ($fileNameWithoutExtension -match "^\d\d\d\d") {
$two = $fileNameWithoutExtension.Substring(0, 4)
} else {
$two = ''
}
# Suffix
if ($fileNameWithoutExtension -match "^\d\d\d\d-") {
$suffix = $fileNameWithoutExtension.Substring(5, ($fileNameWithoutExtension.Length - 5))
} else {
$suffix = $fileNameWithoutExtension
}
$row = $file.Name + '|' + $one + '|' + $two + '|' + $suffix + '|' + $file.Extension + '|C'
}
$row | Out-File -FilePath $filePath -Append:$true -Encoding 'UTF8'
}
}