An AutoIt Script to Extract Data from an MSWord Document and Send It to an Internet Application





A program that grabs information from MSWord documents and sends it to the Internet via Ajax
Introduction
This is an example of using AutoIt[1] to access an MSWord document and extract some data from it. It is also intended to show the use of COM (Component Object Model) objects, in particular the COM objects of MSWord and MSXML2.XMLHTTP (the XMLHTTP request object of MSXML), which is the object on which the Ajax technique is based.
AutoIt has a library of functions for the MS Word COM object; here, however, I used the native methods and properties, both because the library doesn't contain the function that I need and because this makes it easier to understand how AutoIt works with COM objects.
The article covers the following topics:
- how to access MSWord documents as objects
- how to extract the data
- how to prepare and send the data
Accessing the MSWord Document
We need to instantiate an MSWord COM object; this can be done in several ways:
- for a new document:
ObjCreate("Word.Application")
- for an already open document:
ObjGet("", "Word.Application")
- for a particular document:
ObjGet("fileName", "Word.Application")
or simply ObjGet("MSWordfileName")
The second form fails if MSWord is not running: it sets the @error macro to a nonzero value. In the script fragment below, all three forms are used; note that the branch that creates a new document assumes that the clipboard already contains the data.
#include <WordConstants.au3> ; defines $WdCollapseEnd
Global $oWord = ObjGet("", "Word.Application")
If @error <> 0 Or $oWord.Documents.Count = 0 Then
    ; MSWord either is not running (@error <> 0)
    ; or has no open documents ($oWord.Documents.Count = 0)
    $form = "File,Document,,30,,Documents (*.doc\59*.docx)|All (*.*);" _
            & "C,Enter for a New Document"
    $parms = formGen("Word File",$form,-1,"",100,100) ; create a form that asks for a file name
    If $parms.Item("fg_button") = "Cancel" Then Exit
    If $parms.Item("Document") <> "" Then
        $oDoc = ObjGet($parms.Item("Document")) ; open the MSWord document
    Else
        $oAppl = ObjCreate("Word.Application")
        $oAppl.Visible = True
        $oDoc = $oAppl.Documents.add ; create an empty document
        $range = $oDoc.Range ; range covering the whole document
        $range.Collapse($WdCollapseEnd) ; collapse the range to the end of the document
        $range.paste ; paste the clipboard
    EndIf
    $oDoc.Application.Visible = 1
Else
    ; MSWord is running and has one or more documents open
    Local $nDocs = $oWord.Documents.Count
    $nDoc = 1
    If $nDocs <> 1 Then
        Local $docs = ""
        For $i = 1 to $nDocs
            ConsoleWrite($oWord.Documents($i).Name & @CRLF)
            $docs &= "|" & $i & "=" & $oWord.Documents($i).Name
        Next
        $form = "CMB,Document,,20,," & StringMid($docs,2)
        ; create a form to choose which document to process
        $parms = formGen("Word Files",$form,-1,"",100,100)
        If $parms.Item("fg_button") = "Cancel" Then Exit
        $nDoc = $parms.Item("Document")
    EndIf
    $oDoc = $oWord.Documents($nDoc)
EndIf
In the script, the simple forms that ask for the name of the document are created by my utility formGen, which you can find on my site.
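The attach-or-create decision made at the top of the script can also be isolated in a small helper. What follows is only a minimal sketch, not part of the original script: the function name _GetWord is hypothetical, and it assumes that MSWord is installed on the machine.

```autoit
; Hypothetical helper (not in the original script): return a Word.Application
; object, reusing a running instance when there is one, starting Word otherwise.
Func _GetWord()
    ; ObjGet with an empty file name attaches to a running instance;
    ; it sets @error to a nonzero value when Word is not running.
    Local $oApp = ObjGet("", "Word.Application")
    If @error Or Not IsObj($oApp) Then
        $oApp = ObjCreate("Word.Application")
        If @error Or Not IsObj($oApp) Then Return SetError(1, 0, 0)
        $oApp.Visible = True
    EndIf
    Return $oApp
EndFunc

Local $oApp = _GetWord()
If Not IsObj($oApp) Then
    ConsoleWrite("MSWord is not available" & @CRLF)
Else
    ConsoleWrite("Open documents: " & $oApp.Documents.Count & @CRLF)
EndIf
```

The caller tests the returned value with IsObj rather than relying on @error alone, so the check works regardless of how @error was left by the calls inside the helper.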
The Capture of Data
The object model of MSWord offers a complex set of objects, each of which can contain collections of objects, methods and properties; these collections can refer to the entire document, to a range (a portion of the document) or to a selection[2].
In this script, we are concerned with the Paragraphs and Hyperlinks collections; in particular, the document contains a list of jobs (see below) in a structured format, each with an internet link:
FRANCE: 1 PostDoc position in HISTORY
Ref. 46_16 - City: Orleans - Deadline: 12/02/2016 »
The Text property of each Paragraph object's Range holds the data, which are extracted with a regular expression[3] (RE):
; $oParag and $nParag are set elsewhere in the script
; (presumably $oDoc.Paragraphs and its Count)
For $paragraphCount = 1 To $nParag
    $parag = StringReplace($oParag($paragraphCount).Range.Text,chr(11),"") ; remove vertical tabs (VT)
    $aExtract = StringRegExp($parag, '^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)', $STR_REGEXPARRAYGLOBALMATCH)
    If IsArray($aExtract) Then
        $itemCount += 1
        ; date from dd/mm/yyyy to yyyy/mm/dd
        $aExtract[4] = StringRegExpReplace($aExtract[4], '(\d{2})/(\d{2})/(\d{4})', '$3/$2/$1')
        $data = ""
        For $i = 0 To UBound($aExtract) - 1
            $data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
        Next
        ; extract the study field (Category) from the title
        $aCategory = StringRegExp($aExtract[1], '.*in (.*)', $STR_REGEXPARRAYGLOBALMATCH)
        If IsArray($aCategory) Then $data &= "&Category=" & encode64($aCategory[0])
        $link = "" ; do not reuse the link of a previous paragraph
        For $hLink In $oParag($paragraphCount).Range.Hyperlinks
            $link = $hLink.Address
        Next
        $data &= "&Link=" & encode64($link)
        ConsoleWrite(StringMid($data,2) & @CRLF)
        $res = ajax($url,StringMid($data,2))
        If $res <> 1 Then ConsoleWrite("--> " & $res & @CRLF)
    EndIf
Next
The syntax of regular expressions is documented in the AutoIt Help; here are only some clarifications on how the RE is used in this script. If the text matches the RE, the StringRegExp function returns the substrings matched by each pair of parentheses. In StringRegExpReplace, each captured group can be referenced as $1, $2, ..., and this allows the function to reshape the date.
For the link contained in the paragraph, the script accesses the Hyperlinks collection of the paragraph's range and takes the Address property.
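To see what the two regular expressions return, they can be tried on the sample job shown earlier, outside of Word. This is a minimal sketch: the variable names $sLine and $sRE are illustrative, and the sample string assumes that the vertical tab between the two lines of the paragraph has already been removed.

```autoit
#include <StringConstants.au3>

; Sample paragraph text, with the vertical tab between its two lines removed
Local $sLine = "FRANCE: 1 PostDoc position in HISTORYRef. 46_16 - City: Orleans - Deadline: 12/02/2016"
Local $sRE = '^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)'

Local $aExtract = StringRegExp($sLine, $sRE, $STR_REGEXPARRAYGLOBALMATCH)
If IsArray($aExtract) Then
    ; One array element per pair of parentheses, in order
    For $i = 0 To UBound($aExtract) - 1
        ConsoleWrite($i & ": " & $aExtract[$i] & @CRLF)
    Next
    ; $1, $2 and $3 refer to the three groups of the replace pattern
    ConsoleWrite(StringRegExpReplace($aExtract[4], '(\d{2})/(\d{2})/(\d{4})', '$3/$2/$1') & @CRLF)
EndIf
```

With this input, the five elements are FRANCE, 1 PostDoc position in HISTORY, Ref. 46_16, Orleans and 12/02/2016, and the last line printed is 2016/02/12.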
Send the Data
The extracted data must be prepared in the form name1=value1&name2=value2...
Because the data can contain characters that have a special meaning in this format, such as = and &, they must be encoded, for example in base64, whose output contains only three characters with a special meaning: =, + and /[4]. In the code below, + and / are replaced by %2B and %2F, and the = padding is not generated.
Global $base64 = StringSplit("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/","",2)
$base64[62] = "%2B" ; +
$base64[63] = "%2F" ; /
...
$aFields = StringSplit("Nation|Title|Notes|Town|Deadline","|",2)
...
For $i = 0 To UBound($aExtract) - 1
    $data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
Next
...
Func encode64($in)
    $bin = 0
    $out = ""
    For $i = 1 to StringLen($in)
        $bin = BitShift($bin,-8)
        $bin += Asc(StringMid($in,$i,1))
        If Mod($i,3) = 0 Then
            $out &= encode6($bin)
            $bin = 0
        EndIf
    Next
    If Mod(StringLen($in),3) <> 0 Then $out &= encode6($bin,Mod(StringLen($in),3))
    return $out
EndFunc
Func encode6($bin,$l=3)
    If $l <> 3 Then $bin = BitShift($bin,-8*(3-$l))
    $out = ""
    For $i=3 To 3-$l Step -1
        $cod6 = BitShift($bin,$i*6)
        $bin -= BitShift($cod6,-$i*6)
        $out &= $base64[$cod6]
    Next
    return $out
EndFunc
If the receiver is a PHP script, this fragment restores the data: foreach ($_REQUEST as $key => $value) $$key = base64_decode($value);
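I used the COM object MSXML2.XMLHTTP to send the data. The request is opened in asynchronous mode, $ajax.open("POST", $url, true), and the function then polls readyState until the response arrives or 10 seconds have elapsed: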
Global $ajax = ObjCreate("MSXML2.XMLHTTP")
...
Func ajax($url,$data)
    $ajax.open("POST", $url, true) ; third parameter True = asynchronous
    $ajax.setRequestHeader("Content-type", "application/x-www-form-urlencoded")
    $ajax.send($data)
    Local $hTimer = TimerInit() ; start the timer used for the 10-second timeout
    While TimerDiff($hTimer) < 10000
        If $ajax.readyState == 4 Then ; 4 = request completed
            If $ajax.status = 200 Then return $ajax.responseText
        EndIf
        Sleep(10)
    WEnd
    return "Timeout!"
EndFunc
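For comparison, the same request can be sent in blocking mode by passing False as the third parameter of open, so that send returns only when the response has arrived and no polling loop is needed. This is a sketch under stated assumptions, not the author's code: the function name ajaxSync and the URL in the usage line are hypothetical, and the @error check after send assumes a recent AutoIt version, in which a failed COM call sets @error instead of stopping the script.

```autoit
; Hypothetical blocking variant of ajax() (not in the original script)
Func ajaxSync($url, $data)
    Local $oHttp = ObjCreate("MSXML2.XMLHTTP")
    If Not IsObj($oHttp) Then Return SetError(1, 0, "Cannot create MSXML2.XMLHTTP")
    $oHttp.open("POST", $url, False) ; False = synchronous: send() blocks
    $oHttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded")
    $oHttp.send($data)
    ; Assumption: a failed COM call sets @error (recent AutoIt versions)
    If @error Then Return SetError(2, 0, "Request failed")
    If $oHttp.status <> 200 Then Return SetError(3, $oHttp.status, "HTTP status " & $oHttp.status)
    Return $oHttp.responseText
EndFunc

; Usage with a hypothetical receiver; RlJBTkNF is "FRANCE" in base64
Local $res = ajaxSync("http://localhost/receiver.php", "Nation=RlJBTkNF")
If @error Then ConsoleWrite("--> " & $res & @CRLF)
```

The blocking form is simpler, but the script is suspended until the server answers, so it has no equivalent of the 10-second timeout of the polling version.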
Notes
- [1] AutoIt is a freeware BASIC-like scripting language, an alternative to PowerShell, originally designed for automating interaction with the Windows GUI. AutoIt scripts run on Windows, either interpreted or compiled. It comes with many libraries that make it possible, among other things, to access COM objects and create graphical interfaces.
- [2] There can be multiple ranges but only one selection per document.
- [3] The site https://regex101.com/ is very useful for testing regular expressions.
- [4] The encoding overhead is 33%. The = character, which I have not used, is for padding when the length of the data to encode is not divisible by 3; the padding is necessary if you want to concatenate two sets of encoded data.