parsing through a document

    • #162086
      Participant

      Hello, I have a DLP-type program up and running, and I would like it to be able to parse through documents for tags. Examples would be documents watermarked as “secret” or “Private”, or such tags hidden in the general document text as white text.

      Can PowerShell do this on any PC, or just on MS Server? Or is it too compute-intensive overall? Yes, I am looking to avoid paying for actual DLP software.

      Thanks!

    • #162087
      Participant

      … to parse through it for tags. An example would be water-marking documents as “secret” or “Private”, or using white-text to hide such tags in the general document text.

      If you’re talking about controlling the program with scripts, it depends largely on the particular program. If it exposes an API that PowerShell can call, you can do it. If not, you will probably be out of luck. There are some options for controlling a program’s GUI with something like AutoIt or AutoHotkey, but that’s another discussion. 😉

      • #162251
        Participant

        Hi Olaf,

        Thanks. I don’t think I spelled out what I am hoping to find a command or package for. I have a working file-status monitoring program, and I am looking for a package, library, or lines of code that I could add to detect the following in a document:

        • SSNs
        • Credit card numbers
        • Embedded tags
        • etc.

        Thank you

    • #162260
      Senior Moderator

      You’ll have to be more specific about your operating situation. Get-Content only returns usable text from plain-text files, which your documents probably aren’t. Handling other document formats requires more complicated methods, and the method is different for each format that you want to handle. Also, writing new information into such documents will be more complicated than getting information out of them.
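
      For the plain-text case, a minimal sketch of the kind of check being asked about might look like the following. The function names Test-Luhn and Find-SensitiveText and both regular expressions are assumptions made up for illustration, not part of any existing module. The SSN pattern only matches the dashed 123-45-6789 layout, and the card pattern only looks for 13 to 16 digits with optional spaces or dashes, so expect both false positives and misses.

```powershell
# Hypothetical helper: Luhn checksum, used to discard digit runs that
# merely look like card numbers.
function Test-Luhn {
    param([string]$Number)
    $digits = ($Number -replace '\D', '').ToCharArray()
    if ($digits.Length -lt 13) { return $false }
    $sum = 0
    $double = $false
    # Walk right to left, doubling every second digit.
    for ($i = $digits.Length - 1; $i -ge 0; $i--) {
        $d = [int][string]$digits[$i]
        if ($double) {
            $d *= 2
            if ($d -gt 9) { $d -= 9 }
        }
        $sum += $d
        $double = -not $double
    }
    return (($sum % 10) -eq 0)
}

# Hypothetical scanner: emits one object per suspected SSN or card number.
function Find-SensitiveText {
    param([string[]]$Line)
    $ssnPattern  = '\b\d{3}-\d{2}-\d{4}\b'          # assumed layout: 123-45-6789
    $cardPattern = '\b\d(?:[ -]?\d){12,15}\b'       # 13 to 16 digits, optional separators
    $lineNumber = 0
    foreach ($text in $Line) {
        $lineNumber++
        foreach ($m in [regex]::Matches($text, $ssnPattern)) {
            [pscustomobject]@{ Line = $lineNumber; Type = 'SSN'; Value = $m.Value }
        }
        foreach ($m in [regex]::Matches($text, $cardPattern)) {
            if (Test-Luhn $m.Value) {
                [pscustomobject]@{ Line = $lineNumber; Type = 'CreditCard'; Value = $m.Value }
            }
        }
    }
}

# Example call (the path is a placeholder):
# Find-SensitiveText -Line (Get-Content -Path .\notes.txt)
```

      The Luhn check is there because a bare digit-count pattern also matches order numbers, phone numbers, and similar strings. A tag such as “secret” could be searched for in the same loop with one more pattern.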

      For instance, this blog post from the Scripting Guy describes a method for importing a .docx file as an object and then finding specific words within the file.
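
      The linked post is not reproduced here, so the following is not its method. As one alternative that does not need Word installed, a .docx file is a ZIP archive whose body text is stored in word/document.xml, and that XML can be read with the .NET compression classes. This is a rough sketch only: the function name Get-DocxText is made up for illustration, and it ignores headers, footers, footnotes, and text boxes, which are stored in other parts of the archive.

```powershell
Add-Type -AssemblyName System.IO.Compression, System.IO.Compression.FileSystem

# Hypothetical helper: returns one string per paragraph of the main document body.
function Get-DocxText {
    param([Parameter(Mandatory)][string]$Path)

    # .NET does not follow PowerShell's current location, so resolve the path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try {
        $entry = $zip.GetEntry('word/document.xml')
        if (-not $entry) { throw "No word/document.xml found in $Path" }
        $reader = [System.IO.StreamReader]::new($entry.Open())
        try { $xml = [xml]$reader.ReadToEnd() } finally { $reader.Dispose() }
    }
    finally { $zip.Dispose() }

    $ns = [System.Xml.XmlNamespaceManager]::new($xml.NameTable)
    $ns.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

    # Word may split a single word across several runs, so join the
    # text nodes paragraph by paragraph before searching.
    foreach ($p in $xml.SelectNodes('//w:p', $ns)) {
        ($p.SelectNodes('.//w:t', $ns) | ForEach-Object { $_.InnerText }) -join ''
    }
}

# Example call (the path is a placeholder):
# Get-DocxText -Path .\report.docx | Select-String -Pattern 'secret|private'
```

      Joining the runs per paragraph matters because a tag typed as one word is not guaranteed to be stored as one text node.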

      This forum discussion is about finding text in a .pdf document, but it relies on the now-deprecated iTextSharp. You can probably apply the same method using its successor, iText 7, but iText 7 is dual-licensed (AGPL or commercial), so depending on your usage it may not be legal to use it for free. It is unclear whether this approach can handle .ps documents in addition to .pdf.

      If you need to handle other document types, like .odt, that’s another specific solution that you’ll have to find.
