Excluding words from a word frequency count

    by dwwilson66 at 2013-01-16 06:38:33

    I am developing a PowerShell script to analyze potential document keywords. My plan is to count occurrences of unique words and take the top 10 or 15 words. Here's where I am so far, and I need a hand figuring out how to proceed with my next step: excluding certain words and characters from the result set and word counts.

    $srcpath = "c:\users\x46332\desktop\testpath"
    $docname = "testfile.txt"
    $document = get-content $srcpath\$docname
    $document = [string]::join(" ", $document)
    $words = $document.split(" `t",[stringsplitoptions]::RemoveEmptyEntries)
    $uniq = $words | sort -uniq
    $words | % {$wordhash=@{}} {$wordhash[$_] += 1}
    Write-Host $docname "contains" $wordhash.psbase.keys.count "unique words distributed as follows."
    $frequency = $wordhash.psbase.keys | sort {$wordhash[$_]}
    -1..-15 | %{$frequency[$_]+" "+$wordhash[$frequency[$_]]}
    $grouped = $words | group | sort count

    My output from this script is as follows, and I've marked examples of the words I want to exclude.

    testfile.txt contains 222 unique words distributed as follows.
    Agency 53
    Assistance 30
    of 22 <<<<< exclude
    to 22 <<<<< exclude
    in 20 <<<<< exclude
    Promise 17
    Enter 11
    grant 11
    Click 10 <<<<< exclude
    the 10 <<<<< exclude
    button. 9 <<<<< exclude punctuation
    amount 9
    box. 9
    and 8 <<<<< exclude
    you 8 <<<<< exclude

    There are two issues going on here:
    First, I want to eliminate punctuation, so "button" and "button." are counted as one unique word. I'm thinking that a regex will achieve that, but I'm not quite sure where to insert it or what the proper syntax is. I've tried -match [a-zA-Z], -replace (![a-zA-z]," "), and [regex] at various places in the script without any success. I keep getting an error that I can't index into a null array:

    Cannot index into a null array.
    At C:\users\x46332\desktop\testcount.ps1:10 char:25
    + -1..-25 | %{ $frequency[ <<<< $_]+" "+$wordhash[$frequency[$_]]}
    + CategoryInfo : InvalidOperation: (-25:Int32) [], RuntimeException
    + FullyQualifiedErrorId : NullArray

    How can I make this work? I think I'm lost on the correct syntax, and while it makes sense to me to do the character filter at the get-content or join step, I may be wrong.

    Second, I have an ASCII file containing about 150 "insignificant" words and phrases to be filtered from counts ("excllist.txt"). There is one phrase per line, so they're delimited by a carriage return. My thought is to match that list to the $wordhash table. I can reset the hash value of words appearing in my file to 0, so that the sort and count functions automatically sort those to the bottom. Alternatively, I could just delete those entries from the hashtable. I don't know which strategy is better, nor do I know how to make that happen. Anyone have experience with this? Thanks for your help!

    by nohandle at 2013-01-16 08:45:52

    Edit] I wrongly used the input variable here, which is a special variable used to deliver pipeline input to a function. Use another variable name instead. Thanks to Aleksandar for pointing it out.
    1] Here is a solution to the first issue (shown on a small scale, so tweak it):
    $input = 'three','three','one','three','two','two'
    $statistic = $input | foreach -Begin {$hash=@{}} -Process {$hash.$_++} -End {$hash}
    $statistic.GetEnumerator() | sort -Property value -Descending | select -First 2

    Name Value
    ---- -----
    three 3
    two 2
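
As the edit note in this post says, $input is an automatic variable, so the same counting technique is restated here with a different name. This is only a sketch: the variable names $sample and $top are illustrative and do not come from the original post.

```powershell
# Same counting technique as above, with a hypothetical variable name ($sample)
# in place of the automatic $input variable.
$sample = 'three','three','one','three','two','two'

# Build a hashtable of word -> count. A missing key reads as $null, and ++ turns it into 1.
$statistic = $sample | ForEach-Object -Begin { $hash = @{} } -Process { $hash[$_]++ } -End { $hash }

# Sort the entries by count and keep the two most frequent words.
$top = $statistic.GetEnumerator() | Sort-Object -Property Value -Descending | Select-Object -First 2
$top
```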

    by nohandle at 2013-01-16 08:52:11

    2] Here is a quick and dirty solution to the "." and other characters issue. It just removes any character outside a-z (the -replace operator is case-insensitive, so capital letters are kept). If you index English pages it should be pretty accurate. For something more elaborate, search the regex forums.
    "Hello, this is dog." -split ' ' -replace '[^a-z]'
    Hello
    this
    is
    dog
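
The a-z class above drops accented letters along with punctuation. A possible variation, assuming the input may contain non-English words, is the Unicode letter class \p{L} that .NET regular expressions support. The sample string and the variable names below are made up for illustration.

```powershell
# Hypothetical sample containing an accented letter. The letter is built from its
# code point so the example does not depend on the script file's encoding.
$sample = 'Caf' + [char]0xE9 + ', this is dog.'

# ASCII-only class from the post above: the accented letter is removed.
$asciiOnly = $sample -split ' ' -replace '[^a-z]'

# Unicode letter class: any letter is kept, punctuation is still removed.
$anyLetter = $sample -split ' ' -replace '[^\p{L}]'

$asciiOnly[0]   # Caf
$anyLetter[0]   # Café
```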

    by nohandle at 2013-01-16 09:04:57

    Edit] I wrongly used the input variable here, which is a special variable used to deliver pipeline input to a function. Use another variable name instead. Thanks to Aleksandar for pointing it out.

    And all of it combined with the list of not permitted words:
    $input = "three is just three when not multiplied by zero and thirteen when you append it to one`nand the second line marked by number two"
    $oneLine = $input -replace "`n"
    $JustAtoZ = $oneLine -replace '[^a-z ]'
    $words = $JustAtoZ -split ' '
    $statistic = $words | foreach -Begin {$hash=@{}} -Process {$hash.$_++} -End {$hash}

    $notPermittedWords = 'one','zero','by'
    $notPermittedWords | foreach {
    if ($statistic.ContainsKey($_))
    {
    $statistic.Remove($_)
    }
    }
    $statistic.GetEnumerator() |
    sort -Property value -Descending |
    select -First 2

    Name Value
    ---- -----
    three 2
    when 2
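
The original question asked whether excluded words should be zeroed out or deleted. A third option, sketched here on the assumption that the exclusion list is small, is to filter the words before counting so they never reach the tally. It swaps the hashtable for Group-Object, and the names $text, $excluded, $kept and $counts are hypothetical. Because whitespace (including the newline) is kept and then split on, "one" and "and" stay separate words in this version.

```powershell
# Hypothetical sample text and exclusion list, mirroring the example above.
$text = "three is just three when not multiplied by zero and thirteen when you append it to one`nand the second line marked by number two"
$excluded = 'one','zero','by'

# Keep letters and whitespace only, split on runs of whitespace, drop empty strings.
$words = ($text -replace '[^a-z\s]') -split '\s+' | Where-Object { $_ }

# Drop excluded words up front; -notcontains compares case-insensitively by default.
$kept = $words | Where-Object { $excluded -notcontains $_ }

# Group-Object does the counting; sort so the most frequent words come first.
$counts = $kept | Group-Object | Sort-Object -Property Count -Descending
$counts | Select-Object -First 3 -Property Name, Count
```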

    by dwwilson66 at 2013-01-16 10:19:03

    Awesome. Exactly what I'm looking for. Let me play to make sure I understand how to reproduce it as needed. If I have additional questions, I'll post. 🙂

    by dwwilson66 at 2013-01-16 12:35:22

    I've got a few questions to help me understand...this may get lengthy. 🙂

    [quote="nohandle"]
    $input = "three is just three when not multiplied by zero and thirteen when you append it to one`nand the second line marked by number two"[/quote]
    Instead of hardcoded input, I have a series of files in a directory. My strategy is...
    function Count-Words ($inputdoc) {
    $srcpath = "c:\users\x46332\desktop\testpath"
    $docname = $inputdoc
    $document = get-content $srcpath\$docname
    $document = [string]::join(" ", $document)
    $words = $document.split(" `t",[stringsplitoptions]::RemoveEmptyEntries)
    $uniq = $words | sort -uniq
    $words | % {$wordhash=@{}} {$wordhash[$_] += 1}
    Write-Host $docname "contains" $wordhash.psbase.keys.count "unique words distributed as follows."
    $frequency = $wordhash.psbase.keys | sort {$wordhash[$_]}
    -1..-25 | %{ $frequency[$_]+" "+$wordhash[$frequency[$_]]}
    $grouped = $words | group | sort count
    }
    for-each($inputdoc in $testpath) {
    Count-Words ($inputdoc)
    }

    ...which, I assume will be fine for the purposes of this example. I've not tested the for-each logic, but I do know that the script in the function runs just fine as a standalone script. Also, I still need to replace the non-alpha characters with a space, as you demonstrate here.
    [quote="nohandle"]$oneLine = $input -replace "`n"
    $JustAtoZ = $oneLine -replace '[^a-z ]'
    $words = $JustAtoZ -split ' '
    $statistic = $words | foreach -Begin {$hash=@{}} -Process {$hash.$_++} -End {$hash}
    [/quote]
    However, I can't seem to get this code to work when I integrate it with my function above. I would THINK that the -replace would be most appropriate in the [string]::join line, but it errors out no matter where I put it. That's where I'm lost.

    Method invocation failed because [System.Object[]] doesn't contain a method named 'split'.
    At C:\users\x46332\desktop\testcount3.ps1:21 char:36
    + $exclwords = $exclusionAlpha.split <<<< (",",[stringsplitoptions]::RemoveEmptyEntries)
    + CategoryInfo : InvalidOperation: (split:String) [], RuntimeException
    + FullyQualifiedErrorId : MethodNotFound

    Index operation failed; the array index evaluated to null.
    At C:\users\x46332\desktop\testcount3.ps1:23 char:44
    + $exclwords | % {$exclhash=@{}} {$exclhash[ <<<< $_] +=1}
    + CategoryInfo : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : NullArrayIndex

    My SECOND part is to take an ASCII text file of 174 words to be excluded. Originally I'd thought to create and compare two hashtables, but I like the idea of comparing the words on the fly with a nested for-each loop. I'm thinking I can create a function that will open my exclusion list (...can I just declare $notPermittedWords = gc $exclpath instead of the hardcoded word list you have noted below...)? I would call that function from within the WordCount function, passing the $words variable to compare to the open exclusion list. Does that make sense?
    [quote="nohandle"]
    $notPermittedWords = 'one','zero','by'
    $notPermittedWords | foreach {
    if ($statistic.ContainsKey($_))
    {
    $statistic.Remove($_)
    }
    }
    $statistic.GetEnumerator() |
    sort -Property value -Descending |
    select -First 2

    [/quote]

    Thanks for your help!

    by nohandle at 2013-01-16 13:03:43

    Edit] I wrongly used the input variable here, which is a special variable used to deliver pipeline input to a function. Use another variable name instead. Thanks to Aleksandar for pointing it out.

    I kept all of this in mind when I wrote the script. There is pretty much nothing you need to change to analyze files and to load the words to exclude from a file.
    Just load the files into variables and change the first -replace to replace newlines with spaces (you could use the single-line option of the regex, but this is easier):
    $input = Get-Content c:\temp\wordCount.txt
    $notPermittedWords = Get-Content c:\temp\exclude.txt

    #make the file a long single line by replacing the newlines
    $oneLine = $input -replace "`n",' '

    And of course remove the line where you list the notPermittedWords
    $notPermittedWords = 'one','zero','by'
    I just analyzed part of the Wikipedia PowerShell page and it works like a charm.
    Name Value
    ---- -----
    PowerShell 15
    Windows 10
    to 6
    cmdlets 5
    and 5
    by 5

    Looks like I have to add "to", "and" and "by" to my exclude list.

    Converting it to a function should be a piece of cake 🙂

    by nohandle at 2013-01-16 13:31:29

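    function Get-WordCount {
        [CmdletBinding()]
        param
        (
            [Parameter(Mandatory=$True,
                       ValueFromPipeline=$True)]
            # blank lines arrive as empty strings, which a mandatory parameter rejects by default
            [AllowEmptyString()]
            [string[]]$Text,

            [string[]]$Exclude
        )
        process
        {
            # gather every line that arrives from the pipeline
            $collected += $Text
        }
        end {
            # replace any embedded newlines with spaces
            $oneLine = $collected -replace "`n",' '
            # keep only letters and spaces (-replace is case-insensitive)
            $JustAtoZ = $oneLine -replace '[^a-z ]'
            # split on spaces and drop the empty strings left by blank lines or repeated spaces
            $words = $JustAtoZ -split ' ' | where {$_}
            $statistic = $words | foreach -Begin {$hash=@{}} -Process {$hash.$_++} -End {$hash}
            if ($exclude)
            {
                # remove every excluded word that was counted
                $exclude | foreach {
                    if ($statistic.ContainsKey($_))
                    {
                        $statistic.Remove($_)
                    }
                }
            }
            # most frequent words first
            $statistic.GetEnumerator() |
                sort -Property value -Descending
        }
    }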

    Get-Content c:\temp\wordCount.txt | Get-WordCount -Exclude (Get-Content c:\temp\exclude.txt) | select -First 15
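
The original question mentioned a whole directory of files rather than a single one. One way that might be handled is a small per-file helper. This is a sketch only: the helper name Get-TopWords, its parameters, and the use of Group-Object instead of the hashtable are assumptions for illustration, not code from the thread.

```powershell
# Hypothetical helper: a condensed version of the counting steps from this thread,
# taking a file path so it can be called once per file.
function Get-TopWords {
    param(
        [string]$Path,          # file to analyze
        [string[]]$Exclude,     # words to leave out of the result
        [int]$First = 15        # how many words to return
    )

    # Join the lines, keep letters and whitespace only, split on whitespace,
    # and drop any empty strings.
    $words = ((Get-Content -Path $Path) -join ' ') -replace '[^a-z\s]' -split '\s+' |
        Where-Object { $_ }

    # Filter out excluded words, count the rest, and return the most frequent ones.
    $words |
        Where-Object { $Exclude -notcontains $_ } |
        Group-Object |
        Sort-Object -Property Count -Descending |
        Select-Object -First $First -Property Name, Count
}
```

It could then be called for each file in a folder, for example (the folder and file names here are assumed):

    Get-ChildItem c:\temp\docs -Filter *.txt | foreach { Get-TopWords -Path $_.FullName -Exclude (Get-Content c:\temp\exclude.txt) }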
