An AutoIt Script to Extract Data from an MSWord Document and Send Them to an Internet Application

Mar 7, 2016

CPOL

A program that grabs information from MSWord documents and sends it to an Internet application via Ajax

Introduction

This is an example of using AutoIt[1] to access an MSWord document and extract some data from it. It is also intended to show the use of COM (Component Object Model) objects, in particular the COM objects of MSWord and the MSXML2.XMLHTTP object (XMLHttpRequest), which is the request mechanism behind Ajax.

AutoIt has a library of functions for the MSWord COM object; nevertheless, here I used the native methods and properties, because the library doesn't contain the function that I need and because this makes it easier to understand how AutoIt works with COM objects.

The article covers the following topics:

  • how to access MSWord documents as objects
  • how to extract the data
  • how to prepare and send the data

Accessing the MSWord Document

We need to instantiate an MSWord COM object; this can be done in several ways:

  • for a new document, with ObjCreate("Word.Application")
  • for an already opened document, with ObjGet("", "Word.Application")
  • for a particular document, with ObjGet("fileName", "Word.Application") or simply ObjGet("MSWordfileName")

The second mode fails if MSWord is not running, setting @error to a value different from 0. The script fragment below uses all three methods; note that the choice of creating a new document presupposes that the clipboard contains the data.

Global $oWord = ObjGet("", "Word.Application")
If @error <> 0 Or $oWord.Documents.Count = 0 Then
   ; This is the case in which MSWord either is not running (@error <> 0)
   ; or has no open documents ($oWord.Documents.Count = 0)
   $form = "File,Document,,30,,Documents (*.doc\59*.docx)|All (*.*);" _
			& "C,Enter for a New Document"
   $parms = formGen("Word File",$form,-1,"",100,100)	; create a form that asks for a file name
   If $parms.Item("fg_button") = "Cancel" Then Exit
   If $parms.Item("Document") <> "" Then
	  $oDoc = ObjGet($parms.Item("Document"))	; open the MSWord document
   Else
	  $oAppl = ObjCreate("Word.Application")
	  $oAppl.Visible = True
	  $oDoc = $oAppl.Documents.add	; creates an empty document
	  $range = $oDoc.Range 			; range covering the whole document
	  $range.Collapse($WdCollapseEnd)	; collapse the range to the end of the document
	  $range.paste					; paste the clipboard
   EndIf
   $oDoc.Application.Visible = 1
Else
   ; This is the case in which MSWord is present and has one or more documents open
   Local $nDocs = $oWord.Documents.Count
   $nDoc = 1
   If $nDocs <> 1 Then
	  Local $docs = ""
	  For $i = 1 to $nDocs
		 consoleWrite($oWord.Documents($i).Name  & @CRLF)
		 $docs &= "|" & $i & "=" & $oWord.Documents($i).Name
	  Next
	  $form = "CMB,Document,,20,," & StringMid($docs,2)
	  $parms = formGen("Word Files",$form,-1,"",100,100)	; create a form to choose which document to process
	  If $parms.Item("fg_button") = "Cancel" Then Exit
	  $nDoc = $parms.Item("Document")
   EndIf
   $oDoc = $oWord.Documents($nDoc)
EndIf

In the script, the simple forms that ask for the document name are created by my utility formGen, which you can find on my site.

The Capture of Data

The object model of MSWord offers a complex set of objects, each of which can contain collections of objects, methods and properties; these collections can refer to the entire document, to a range (a portion of the document) or to a selection[2].

In this script, we are concerned with the Paragraphs and Hyperlinks collections; in particular, the document contains a list of jobs (see below) in structured format with an internet link:

FRANCE: 1 PostDoc position in HISTORY
Ref. 46_16 - City: Orleans - Deadline: 12/02/2016 »

The Text property of the Paragraph object holds the data, which are extracted with a regular expression[3] (RE):

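; $oParag and $nParag are set elsewhere in the script; presumably the Paragraphs collection of the document and its Count
For $paragraphCount = 1 To $nParag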
   $parag = StringReplace($oParag($paragraphCount).Range.Text,chr(11),"")	; clear VT
   $aExtract = StringRegExp($parag, '^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)', $STR_REGEXPARRAYGLOBALMATCH)
   If isArray($aExtract) Then
	  $itemCount += 1
	  $aExtract[4] = StringRegExpReplace($aExtract[4], '(\d{2})/(\d{2})/(\d{4})', '$3/$2/$1')	; date from dd/mm/yyyy to yyyy/mm/dd
	  $data = ""
	  For $i = 0 To UBound($aExtract) - 1
		 $data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
	  Next
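	  $aCategory = StringRegExp($aExtract[1], '.*in (.*)', $STR_REGEXPARRAYGLOBALMATCH)	; Category: the study field after the last "in "
	  If isArray($aCategory) Then $data &= "&Category=" & encode64($aCategory[0])
	  $link = ""	; avoid reusing the link of a previous paragraph when this one has none
	  For $hLink In $oParag($paragraphCount).Range.Hyperlinks
		  $link = $hLink.Address	; if there are several hyperlinks, the last one wins
	  Next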
	  $data &= "&Link=" & encode64($link)
	  ConsoleWrite(StringMid($data,2) & @CRLF)
	  $res = ajax($url,StringMid($data,2))
	  If $res <> 1 Then ConsoleWrite("--> " & $res & @CRLF)
   EndIf
Next

You can find the documentation for the regular expression syntax in the AutoIt Help; here are only some clarifications on how the RE is used in this script. If the text matches the RE, the StringRegExp function returns an array with the pieces of text matched by each pair of parentheses (the capturing groups). The groups are numbered from left to right, and in the replacement string of StringRegExpReplace they are referred to as $1, $2, ...; this is how the script reshapes the date.
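To make the group numbering concrete, here is a minimal standalone sketch that applies the same RE to the sample job shown above, typed in as a literal string with the line break already removed. The variable names are mine and are not part of the original script.

```autoit
#include <StringConstants.au3>	; defines $STR_REGEXPARRAYGLOBALMATCH

; Sample paragraph from the article, with the VT (line break) already removed
Local $sParag = "FRANCE: 1 PostDoc position in HISTORYRef. 46_16 - City: Orleans - Deadline: 12/02/2016"
Local $sPattern = '^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)'

Local $aGroups = StringRegExp($sParag, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($aGroups) Then
	; One array element per pair of parentheses, in order of appearance
	For $i = 0 To UBound($aGroups) - 1
		ConsoleWrite("Group " & ($i + 1) & ": " & $aGroups[$i] & @CRLF)
	Next
	; In the replacement string, $1, $2 and $3 refer to the first, second and third group
	ConsoleWrite(StringRegExpReplace($aGroups[4], '(\d{2})/(\d{2})/(\d{4})', '$3/$2/$1') & @CRLF)
Else
	ConsoleWrite("No match" & @CRLF)
EndIf
```

With this input, the five groups should come out as FRANCE, 1 PostDoc position in HISTORY, Ref. 46_16, Orleans and 12/02/2016, and the last line should show the date reshaped as 2016/02/12.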

For the link contained in the paragraph, the script accesses the Hyperlinks collection of the paragraph and takes the Address property.

Send the Data

The extracted data must be prepared in the form name1=value1&name2=value2... Because the data can contain characters that have a special meaning in this format, like = and &, they must be encoded. Here I use base64, whose output consists only of letters, digits and the characters +, / and =; of these, + and / are replaced by their percent-encoded forms %2B and %2F, and the padding character = is not emitted[4].

Global $base64 = StringSplit("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/","",2)
$base64[62] = "%2B"	; +
$base64[63] = "%2F"	; /
...
	$aFields = StringSplit("Nation|Title|Notes|Town|Deadline","|",2)
...
	For $i = 0 To UBound($aExtract) - 1
		$data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
	Next
...
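Func encode64($in)
   ; Base64-encode $in without "=" padding; "+" and "/" come out percent-encoded via $base64
   Local $bin = 0, $out = ""
   Local $len = StringLen($in)
   For $i = 1 To $len
	  $bin = BitShift($bin,-8)	; shift left by 8 bits to make room for the next byte
	  $bin += Asc(StringMid($in,$i,1))
	  If Mod($i,3) = 0 Then	; 3 bytes collected: emit 4 characters
		 $out &= encode6($bin)
		 $bin = 0
	  EndIf
   Next
   If Mod($len,3) <> 0 Then $out &= encode6($bin,Mod($len,3))	; 1 or 2 leftover bytes
   Return $out
EndFunc
Func encode6($bin,$l=3)
   ; Convert the $l bytes (1 to 3) held in $bin into $l+1 base64 characters
   If $l <> 3 Then $bin = BitShift($bin,-8*(3-$l))	; left-align a partial group to 24 bits
   Local $out = "", $cod6
   For $i=3 To 3-$l Step -1
	  $cod6 = BitShift($bin,$i*6)	; take the topmost remaining 6 bits
	  $bin -= BitShift($cod6,-$i*6)	; and remove them from $bin
	  $out &= $base64[$cod6]
   Next
   Return $out
EndFunc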
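As a worked example of the bit arithmetic, the following standalone sketch packs three bytes into a 24-bit number and reads them back as four 6-bit indexes. It is an illustration and not the author's code: it uses BitAND to isolate each index instead of the subtraction used in encode6, leaves + and / unescaped in the alphabet, and its variable names are my own.

```autoit
; Alphabet as in the script, but left unmodified here so "+" and "/" stay literal
Local $aAlphabet = StringSplit("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", "", 2)

; Pack the three bytes of "FRA" into one 24-bit number
Local $sIn = "FRA"
Local $iBits = 0
For $i = 1 To 3
	$iBits = BitShift($iBits, -8) + Asc(StringMid($sIn, $i, 1))
Next

; Read the 24 bits back as four 6-bit indexes, most significant first
Local $sOut = ""
For $i = 3 To 0 Step -1
	$sOut &= $aAlphabet[BitAND(BitShift($iBits, $i * 6), 0x3F)]
Next
ConsoleWrite($sIn & " -> " & $sOut & @CRLF)	; FRA -> RlJB
```

Three input bytes always produce four output characters, which is where the size overhead of about one third comes from.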

If the receiver is a PHP script, this fragment restores the data: foreach ($_REQUEST as $key => $value) $$key = base64_decode($value);

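I used the COM object MSXML2.XMLHTTP for sending the data. Note that the third argument of $ajax.open("POST", $url, true) selects asynchronous mode, so after send the function polls readyState until the response has arrived or a timeout of 10 seconds expires: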

Global $ajax = ObjCreate("MSXML2.XMLHTTP")
...
Func ajax($url,$data)
   $ajax.open("POST", $url, true)	; true = asynchronous: send returns at once, completion is polled below
   $ajax.setRequestHeader("Content-type", "application/x-www-form-urlencoded")
   $ajax.send($data)
   Local $hTimer = TimerInit() ; Begin the timer and store in a variable.
   While TimerDiff($hTimer) < 10000
      If $ajax.readyState = 4 Then	; 4 = request completed
		 If $ajax.status = 200 Then return $ajax.responseText
	  EndIf
	  Sleep(10)
   WEnd
   return "Timeout!"
EndFunc
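For comparison, here is a minimal sketch of a blocking variant that passes False as the third argument of open, so that send returns only when the response is complete. The function ajaxSync and its error codes are hypothetical and not part of the original script.

```autoit
; Hypothetical helper, not part of the original script
Func ajaxSync($url, $data)
	Local $oHttp = ObjCreate("MSXML2.XMLHTTP")
	If Not IsObj($oHttp) Then Return SetError(1, 0, "")	; COM object not available
	$oHttp.open("POST", $url, False)	; False = synchronous: send() blocks until the response is complete
	$oHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded")
	$oHttp.send($data)
	If $oHttp.status <> 200 Then Return SetError(2, $oHttp.status, "")	; HTTP status is returned in @extended
	Return $oHttp.responseText
EndFunc
```

It would be called in the same way, for example $res = ajaxSync($url, StringMid($data,2)). The trade-off is that the script is blocked while the request is in progress and has no timeout of its own, which is a reason to prefer the polling loop; a network failure may also surface as a COM error, so a handler registered with ObjEvent("AutoIt.Error", ...) is probably advisable.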

Notes

  1. AutoIt is a freeware BASIC-like scripting language, an alternative to PowerShell, originally designed for automating the interaction with the Windows GUI. AutoIt scripts can run on Windows interpreted or compiled.
    It comes with many libraries that make it possible, among other things, to access COM objects and to create graphical interfaces.
  2. There can be multiple ranges but only one selection per document.
  3. The site https://regex101.com/ is very useful for testing regular expressions.
  4. The size overhead is about 33%. The = character, which I have not used, is for padding when the length of the data to encode is not divisible by 3; the padding is necessary if you want to concatenate two sets of encoded data.