Powershell: Basic text processing

Problem:  I’ve received two files of eMail addresses. These represent one group that were awarded a grant, and much larger group that were rejected. I’m going to call these two files winners.txt and losers.txt.

I want to end up with the following:
1. Eliminate invalid eMail addresses
2. Eliminate duplicate records in both files
3. Check to see that anyone in the winners file does not appear in the losers file.

Although I’ve received these as an Excel Spreadsheet, I’ll use Excel to write them out as comma-delimited text files. Maybe later I’ll figure out how to work with the Excel spreadsheet directly from Powershell.

After exporting the eMail addresses from Excel they are contained in two .csv files….comma delimited text with each field enclosed in quotes. There were a huge number of blank lines at the bottom of each file, which I’m assuming are a result of the Excel export.

Working with the winners.txt file:

1. Count the number of lines in the file:
Get-Content winners.txt | measure-object -Line

2. Eliminate the blank lines. These were lines that had an empty string enclosed in double-quotes followed by a carriage-return, line feed..
 (Get-Content winners.txt -Raw)`
.Replace( "`"`"`r`n" ,"" )| Set-Content winners.txt

3. Eliminate the rest of the quotes
 (Get-Content winners.txt -Raw).Replace( "`"" , "" ) `
| Set-Content winners.txt

4. Validate for the presence of an @ character in each line. (Ok, I know there are some more robust email format evaluation routines using regular expressions or even .NET objects, but I’m reserving those for another time. Right now, I’m assuming that if a line includes an at sign ‘@’ then I can work with it).

(Get-Content winners.txt) | select-string "@" `
| Set-Content winners.txt

5. Eliminate duplicate lines
$fn=winners.txt
( Get-Content $fn ) | sort | get-unique >$fn

Note that you have to sort the lines to be able to get-unique.

Having done the above four steps on the winners’ file, I repeated them with the losers’ file. Then I compared the two files to find lines that appeared in both files, (which of course meant that my winners’ eMail addresses were present in the loosers’ file.)

5. Find lines that appear in both files.
Compare-Object -DifferenceObject $winners `
-ReferenceObject $loosers `
-ExcludeDifferent 

It turns out that all of my winners eMail addresses appeared in the loosers file as well. I deleted these with a manual search and replace… (but that’s another PowerShell story to look up).

Notes:
1. The parentheses around Get-Content in a command act like they do when doing arithmetic. They instruct PS to execute those commands first. This is necessary when modifying the conents  of a file without creating an intermediary file to hold the contents.  The downside is that the file is read entirely into memory, so there may be implications with especially large files.

2. The line continuation character is a back-tick character, “`”.  It also serves as the escape character when you need to include a particular character as part of a search string. So, in the search string in step 1,Replace( “`”`”`r`n” ,”” ) means, replace any line that has “” as its contents.

3. Find the default printer
Get-WmiObject `
-Query " SELECT * FROM Win32_Printer WHERE Default= $true"

From <https://wordpress.com/post/78494701/4&gt;

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s