parsing through a document


This topic contains 3 replies, has 3 voices, and was last updated by Participant 3 months ago.

  • #162086

    Participant

    Hello, I have a DLP-type program up and running but would like to be able to parse through it for tags. An example would be watermarking documents as "Secret" or "Private", or using white text to hide such tags in the general document text.

    Can PowerShell do this on any PC, or only on Windows Server? Or would it take too much compute overall? Yes, I am looking to avoid paying for actual DLP software.

    Thanks!

  • #162087

    Participant

    ... to parse through it for tags. An example would be water-marking documents as "secret" or "Private", or using white-text to hide such tags in the general document text.

    If you're talking about controlling the program with scripts, that depends almost entirely on the particular program. If it has an API that PowerShell can call, you could do it. If not, you will probably be out of luck. There are some options to control the GUI of a program with something like AutoIt or AutoHotkey, but that's another discussion. 😉

    • #162251

      Participant

      Hi Olaf,

      Thanks. I don't think I spelled out what I am hoping to find a command or package for. I have a working file status monitoring program and am looking for a package, library, or lines of code that I could add to detect the following in a document:

      SSNs
      Credit card numbers
      Embedded tags
      etc.

      Thank you

  • #162260

    Participant

    You'll have to be more specific about your operating situation. Get-Content only returns usable text from plaintext files, which your documents probably aren't. Handling other document formats requires more complicated methods, and the method is different for each format that you want to handle. Also, writing new information into those documents will be more complicated than getting information out of them.
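
    For the plaintext case, a minimal sketch of what the detection could look like is below. The names $patterns, Test-Luhn, and Find-SensitiveText are made up for illustration, and both regular expressions are assumptions: they are deliberately loose and will produce false positives, so treat this as a starting point rather than a real DLP check. The Luhn checksum is only there to discard digit runs that cannot be valid card numbers.

```powershell
# Hypothetical patterns: US SSN in NNN-NN-NNNN form, and 13 to 19 digits
# optionally separated by single spaces or hyphens.
$patterns = [ordered]@{
    SSN        = '\b\d{3}-\d{2}-\d{4}\b'
    CreditCard = '\b(?:\d[ -]?){12,18}\d\b'
}

# Luhn checksum: returns $true if the digits in $Number pass the check.
function Test-Luhn {
    param([string]$Number)
    $digits = $Number -replace '\D', ''
    if ($digits.Length -lt 13) { return $false }
    $sum = 0
    $double = $false
    for ($i = $digits.Length - 1; $i -ge 0; $i--) {
        $d = [int][string]$digits[$i]
        if ($double) {
            $d *= 2
            if ($d -gt 9) { $d -= 9 }
        }
        $sum += $d
        $double = -not $double
    }
    return (($sum % 10) -eq 0)
}

# Emits one object per hit; card-like matches that fail Luhn are skipped.
function Find-SensitiveText {
    param([string]$Text)
    foreach ($name in $patterns.Keys) {
        foreach ($m in [regex]::Matches($Text, $patterns[$name])) {
            if ($name -eq 'CreditCard' -and -not (Test-Luhn $m.Value)) { continue }
            [pscustomobject]@{ Type = $name; Value = $m.Value }
        }
    }
}

# Example use on a plaintext file (the path is a placeholder):
# Find-SensitiveText (Get-Content -Raw -Path .\notes.txt)
```

    Anything that is not plaintext has to be converted to a string first, and that conversion is the format-specific part.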

    For instance, this blog post from the Scripting Guy describes a method for loading a .docx file as an object and then finding specific words within the file.
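
    If Word isn't installed on the machine, a different approach (not the one from that blog post) is to treat the .docx as the zip archive it is and read the main document part directly. The sketch below assumes PowerShell 5 or later; Get-DocxText is a made-up name, and stripping tags with a regex is a shortcut rather than proper XML parsing. It only reads word/document.xml, so text stored in other parts of the package, such as headers, is not covered.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Returns the body text of a .docx file as a single string.
function Get-DocxText {
    param([Parameter(Mandatory)][string]$Path)
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try {
        $entry = $zip.GetEntry('word/document.xml')
        if (-not $entry) { throw "No word/document.xml found in $Path" }
        $reader = [System.IO.StreamReader]::new($entry.Open())
        try { $xml = $reader.ReadToEnd() } finally { $reader.Dispose() }
    }
    finally { $zip.Dispose() }

    # Turn paragraph ends into newlines, drop all remaining tags,
    # then decode entities such as &amp;.
    $text = $xml -replace '</w:p>', "`n" -replace '<[^>]+>', ''
    [System.Net.WebUtility]::HtmlDecode($text)
}

# Example use (the file name is a placeholder):
# Get-DocxText -Path .\report.docx | Select-String -Pattern 'secret|private'
```

    Because font color is only formatting, text hidden as white-on-white still comes back from this function like any other text.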

    This forum discussion is about finding text in a .pdf document, but it relies on the now-deprecated iTextSharp. You can probably apply the same method using its successor, iText 7, but that library is dual-licensed (AGPL or commercial), so depending on your usage you may not be able to use it for free. It is unclear whether it can handle .ps documents in addition to .pdf.

    If you need to handle other document types, like .odt, that's another specific solution that you'll have to find.
