Extract expressions between () + preceding expression in Arabic from a PDF

This topic contains 3 replies, has 3 voices, and was last updated by  Adam Dhahabi 3 years, 10 months ago.

  • #11293

    Adam Dhahabi
    Participant

    Hi guys,
    In my quest to learn PowerShell I'm trying to build a little script which does the following:

    Search a 500-page PDF for expressions between brackets. If one is found, it should be stored in an array, and so should the words preceding it, on one condition: if the preceding expression is in Arabic script, it should be stored in a second array.
    Finally, the output from both arrays should be combined into a two-column table containing a vocabulary in both Arabic and English, extracted from the 500-page PDF.

    Any help appreciated.

  • #11296

    Don Jones
    Keymaster

    So, I've no idea how to open a PDF and scan the text, which would be the starting point for you. PDFs are technically text files, but they're encoded in the PDF language, which makes it difficult to extract the human-readable text you're after. If you have a way to do that, then I think we can help. PowerShell's -match operator can scan text for specific patterns and capture what it finds in the automatic $Matches variable, which you can then collect into an array exactly as you're wanting.
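
    As a rough sketch of the pattern-matching step, assuming the page text is already available as a plain string: the sample sentence, the variable names, and the regular expression below are illustrative only and are not part of the original post. Note that -match fills $Matches with the first match only, so the sketch also shows the .NET [regex]::Matches method, which returns every match.

```powershell
# Hypothetical sample text standing in for one page of extracted PDF text.
$text = 'The word book (kitab) and the word pen (qalam) appear here.'

# Pattern: an opening parenthesis, one or more non-parenthesis characters
# (captured as group 1), and a closing parenthesis.
$pattern = '\(([^()]+)\)'

# -match returns $true or $false and fills the automatic $Matches variable
# with the FIRST match only.
if ($text -match $pattern) {
    $firstHit = $Matches[1]
}

# [regex]::Matches returns every match in the string, which is what a
# vocabulary list needs. Group 1 holds the text between the parentheses.
$allHits = [regex]::Matches($text, $pattern) |
    ForEach-Object { $_.Groups[1].Value }

$firstHit   # the first parenthesized expression
$allHits    # all parenthesized expressions, in order of appearance
```

    For a 500-page document, [regex]::Matches is probably the better fit, since a single page may contain many bracketed expressions.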

  • #11297

    Dave Wyatt
    Moderator

    As luck would have it, I just recently finished a consulting gig where someone wanted to work with PDF files from PowerShell, so it's still pretty fresh in my head. I used the free iTextSharp library. Documentation on this library is kind of a nightmare, but here's the basic code to extract and work with text:

    
    # Load the iTextSharp assembly (here assumed to sit in the current directory).
    Add-Type -Path .\iTextSharp.dll
    
    $pdfPath = "$pwd\SomeFile.pdf"
    
    $pdfReader = New-Object iTextSharp.text.pdf.PdfReader($pdfPath)
    
    try
    {
        # PDF page numbers start at 1, not 0.
        for ($i = 1; $i -le $pdfReader.NumberOfPages; $i++)
        {
            $textOnPage = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdfReader, $i)
            # Search $textOnPage for whatever you're trying to find, and either let the loop keep going to the next page, or break out here.
        }
    }
    finally
    {
        # Release the file handle even if extraction fails partway through.
        $pdfReader.Close()
    }
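
    To connect this to the original question, here is a minimal sketch of what the search inside the loop might look like. It is an illustration under assumptions, not a tested solution for any particular PDF: the sample string stands in for $textOnPage, the pattern and the $vocabulary name are hypothetical, and it assumes the extractor returns Arabic text in logical order, which may not hold for every PDF. Instead of two parallel arrays it collects one object per match with two properties, which PowerShell can print directly as a two-column table.

```powershell
# Hypothetical sample standing in for $textOnPage. The Arabic words are built
# from code points so the script does not depend on the file's encoding.
$kitab = -join ([char[]](0x0643, 0x062A, 0x0627, 0x0628))   # Arabic word for "book"
$qalam = -join ([char[]](0x0642, 0x0644, 0x0645))           # Arabic word for "pen"
$textOnPage = "Read $kitab (book) and then $qalam (pen), not this (aside)."

# One or more Arabic-script words (group 1), optional whitespace, then a
# parenthesized expression (group 2). \p{IsArabic} is the .NET named block
# covering U+0600 to U+06FF.
$pattern = '(\p{IsArabic}+(?:\s+\p{IsArabic}+)*)\s*\(([^()]+)\)'

# Build one object per match; "(aside)" is skipped because the text before it
# is not in Arabic script.
$vocabulary = foreach ($m in [regex]::Matches($textOnPage, $pattern)) {
    [pscustomobject]@{
        Arabic  = $m.Groups[1].Value
        English = $m.Groups[2].Value
    }
}

# Two-column table: Arabic expression and the bracketed English expression.
$vocabulary | Format-Table -AutoSize
```

    Inside the page loop, the same foreach could append to a list declared before the loop, so that the table covers all pages. The pattern grabs every consecutive Arabic word before the bracket; limiting it to a fixed number of words would need a tighter quantifier.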
    
    
  • #11325

    Adam Dhahabi
    Participant

    Awesome replies, thanks for that! I'm at the beginning of the PowerShell learning path, though with some Linux and PHP background knowledge in my backpack. PowerShell seems marvelously simple compared with the old-day solutions 🙂 By the way, I'm working at one of the many offshore service desks, in this case for a corporation with 5,000 users. I moved away from the EU and found this job with ease (because of my language skills). Let's see if PowerShell can lift me up to becoming a sysadmin 🙂
