Posted 7 Mar 2016

An AutoIt script to extract data from an MSWord document and send it to an Internet application

A program to grab information from MSWord documents and send it to the Internet using Ajax

Introduction

This article is an example of using AutoIt[1] to access an MSWord document and extract some data from it. It is also intended to show the use of COM (Component Object Model) objects, in particular the COM objects of MSWord and MSXML2.XMLHTTP (the XMLHTTP request object), which provides the Ajax-style HTTP requests.

AutoIt has a library of functions for the MSWord COM object. Here, however, I used the native methods and properties, because the library doesn't contain the function that I need and because this makes it easier to understand how AutoIt works with COM objects.

The article covers the following topics:

  • how to access MSWord documents as objects,
  • how to extract the data,
  • how to prepare and send the data.

Accessing the MSWord document

We need to instantiate an MSWord COM object; this can be done in several ways:

  • for a new document, with ObjCreate("Word.Application"),
  • for an already open document, with ObjGet("", "Word.Application"),
  • for a particular document, with ObjGet("fileName", "Word.Application") or simply ObjGet("MSWordfileName").

The second form fails if MSWord is not running, and sets @error to a value different from 0. The script fragment below uses all three forms; note that the option of creating a new document assumes that the clipboard contains the data.

Global $oWord = ObjGet("", "Word.Application")
If @error <> 0 Or $oWord.Documents.Count = 0 Then
   ; In this case either MSWord is not running (@error <> 0)
   ; or it has no open documents ($oWord.Documents.Count = 0)
   $form = "File,Document,,30,,Documents (*.doc\59*.docx)|All (*.*);" _
			& "C,Enter for a New Document"
   $parms = formGen("Word File",$form,-1,"",100,100)	; create a form that asks for a file name
   If $parms.Item("fg_button") = "Cancel" Then Exit
   If $parms.Item("Document") <> "" Then
	  $oDoc = ObjGet($parms.Item("Document"))	; open the MSWord document
   Else
	  $oAppl = ObjCreate("Word.Application")
	  $oAppl.Visible = True
	  $oDoc = $oAppl.Documents.add	; create an empty document
	  $range = $oDoc.Range 			; range covering the whole document
	  $range.Collapse($WdCollapseEnd)	; move the range start/end to the end of the document
	  $range.paste					; paste the clipboard
   EndIf
   $oDoc.Application.Visible = 1
Else
   ; This is the case in which MSWord is present and has one or more documents open
   Local $nDocs = $oWord.Documents.Count
   $nDoc = 1
   If $nDocs <> 1 Then
	  Local $docs = ""
	  For $i = 1 to $nDocs
		 consoleWrite($oWord.Documents($i).Name  & @CRLF)
		 $docs &= "|" & $i & "=" & $oWord.Documents($i).Name
	  Next
	  $form = "CMB,Document,,20,," & StringMid($docs,2)
	  $parms = formGen("Word Files",$form,-1,"",100,100)	; create a form to choose which document to process
	  If $parms.Item("fg_button") = "Cancel" Then Exit
	  $nDoc = $parms.Item("Document")
   EndIf
   $oDoc = $oWord.Documents($nDoc)
EndIf

In the script, the simple forms that ask for the name of the document are created by my utility formGen, which you can find on my site.

Capturing the data

The object model of MSWord offers a complex set of objects, where every object can contain collections of objects, methods and properties; these collections can refer to the entire document, to a range (a portion of the document) or to a selection[2].

In this script we are concerned with the Paragraphs and Hyperlinks collections. In particular, the document contains a list of jobs (see below) in a structured format, each with an Internet link:

FRANCE: 1 PostDoc position in HISTORY
Ref. 46_16 - City: Orleans - Deadline: 12/02/2016 »

The Text property of each paragraph's Range holds the data, which is extracted with a regular expression[3] (RE):

For $paragraphCount = 1 To $nParag
   $parag = StringReplace($oParag($paragraphCount).Range.Text,chr(11),"")	; clear VT
   $aExtract = StringRegExp($parag, '^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)', $STR_REGEXPARRAYGLOBALMATCH)
   If isArray($aExtract) Then
	  $itemCount += 1
	  $aExtract[4] = StringRegExpReplace($aExtract[4], '(\d{2})/(\d{2})/(\d{4})', '$3/$2/$1')	; date from dd/mm/yyyy to yyyy/mm/dd
	  $data = ""
	  For $i = 0 To UBound($aExtract) - 1
		 $data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
	  Next
	  $aCategory = StringRegExp($aExtract[1], '.*in (.*)', $STR_REGEXPARRAYGLOBALMATCH)	; Category
	  If isArray($aCategory) Then $data &= "&Category=" & encode64($aCategory[0])	; extracts the study field
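	  $link = ""	; reset, so a paragraph without links does not reuse the previous link
	  For $hLink In $oParag($paragraphCount).Range.Hyperlinks
		  $link = $hLink.Address	; keeps the last hyperlink of the paragraph
	  Next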
	  $data &= "&Link=" & encode64($link)
	  ConsoleWrite(StringMid($data,2) & @CRLF)
	  $res = ajax($url,StringMid($data,2))
	  If $res <> 1 Then ConsoleWrite("--> " & $res & @CRLF)
   EndIf
Next

You can find the documentation for the regular expression syntax in the AutoIt Help; here are only some clarifications on how the RE is used in this script. If the text matches the RE, the StringRegExp function returns the pieces of text matched by each pair of parentheses. In a replacement string each captured group can be referenced as $1, $2, ..., and this is what allows the StringRegExpReplace function to reshape the date.
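
The extraction can also be tried without MSWord. The following is a minimal sketch in Python (not AutoIt, so that it runs anywhere), assuming a paragraph shaped like the sample job above; the names extract_job and FIELDS are hypothetical, and Python writes the group references as \3/\2/\1 where AutoIt uses $3/$2/$1.

```python
import re

# Same pattern as in the AutoIt script: five capture groups.
JOB_RE = re.compile(
    r'^(.+): (.+)(Ref\..*) - City: (.+) - Deadline: (\d\d/\d\d/\d\d\d\d)'
)
FIELDS = ["Nation", "Title", "Notes", "Town", "Deadline"]


def extract_job(paragraph):
    """Return a dict of the captured fields, or None if the text does not match."""
    text = paragraph.replace("\x0b", "")  # drop the vertical tab (chr(11)), as the script does
    match = JOB_RE.search(text)
    if match is None:
        return None
    job = dict(zip(FIELDS, match.groups()))
    # Reshape the date from dd/mm/yyyy to yyyy/mm/dd with group references.
    job["Deadline"] = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3/\2/\1', job["Deadline"])
    # The study field is whatever follows the last "in " of the title.
    category = re.search(r'.*in (.*)', job["Title"])
    if category:
        job["Category"] = category.group(1)
    return job


sample = ("FRANCE: 1 PostDoc position in HISTORY\x0b"
          "Ref. 46_16 - City: Orleans - Deadline: 12/02/2016 »")
print(extract_job(sample))
```

Because the first group is greedy, it backtracks to the only colon that is followed by text containing "Ref.", so the nation is separated from the title even though "City:" and "Deadline:" also contain a colon.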

For the link contained in the paragraph, the script accesses the Hyperlinks collection of the paragraph's range and takes the Address property.

Sending the data

The extracted data must be prepared in the form name1=value1&name2=value2... Since the data can contain characters that have a special meaning in this format, such as = and &, the values must be encoded. Here they are encoded in base64, where only =, + and / are characters with a special meaning in the protocol[4]; the script replaces + and / with %2B and %2F and omits the = padding.

Global $base64 = StringSplit("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/","",2)
$base64[62] = "%2B"	; +
$base64[63] = "%2F"	; /
...
	$aFields = StringSplit("Nation|Title|Notes|Town|Deadline","|",2)
...
	For $i = 0 To UBound($aExtract) - 1
		$data &= "&" & $aFields[$i] & "=" & encode64($aExtract[$i])
	Next
...
Func encode64($in)
   $bin = 0
   $out = ""
   For $i = 1 to StringLen($in)
	  $bin = BitShift($bin,-8)
	  $bin += Asc(StringMid($in,$i,1))
	  If Mod($i,3) = 0 Then
		 $out &= encode6($bin)
		 $bin = 0
	  EndIf
   Next
   If Mod(StringLen($in),3) <> 0 Then $out &= encode6($bin,Mod(StringLen($in),3))
   return $out
EndFunc
Func encode6($bin,$l=3)
   If $l <> 3 Then $bin =  BitShift($bin,-8*(3-$l))
   $out = ""
   For $i=3 To 3-$l Step -1
	  $cod6 = BitShift($bin,$i*6)
	  $bin -= BitShift($cod6,-$i*6)
	 $out &= $base64[$cod6]
   Next
   return $out
EndFunc

If the receiver is a PHP script, this fragment restores the data: foreach ($_REQUEST as $key => $value) $$key = base64_decode($value);
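
The encoding and its inverse can be checked with a short round trip. This is a minimal Python sketch, not a translation of the AutoIt functions: it relies on the standard base64 module, assumes one byte per character (as Asc() does in the script), and the names decode64, build_body and parse_body are hypothetical.

```python
import base64
from urllib.parse import parse_qsl


def encode64(text):
    """Base64-encode text without '=' padding, escaping '+' and '/' for a form body."""
    raw = text.encode("latin-1")  # assumption: one byte per character
    encoded = base64.b64encode(raw).decode("ascii").rstrip("=")
    return encoded.replace("+", "%2B").replace("/", "%2F")


def decode64(value):
    """Decode an unpadded base64 value (already URL-decoded by the receiver)."""
    padded = value + "=" * (-len(value) % 4)  # restore the padding that was dropped
    return base64.b64decode(padded).decode("latin-1")


def build_body(fields):
    """Join name=value pairs with '&'; only the values are encoded."""
    return "&".join(name + "=" + encode64(value) for name, value in fields.items())


def parse_body(body):
    """What a receiver does: split the pairs, undo the %-escapes, decode base64."""
    return {name: decode64(value) for name, value in parse_qsl(body)}


body = build_body({"Nation": "FRANCE", "Town": "Orleans"})
print(body)
print(parse_body(body))
```

Escaping + matters because a receiver that parses a form body reads a bare + as a space, which would corrupt the base64 text before it is decoded.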

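I used the COM object MSXML2.XMLHTTP to send the data. Note that the third argument of open is true, which requests asynchronous mode: send returns immediately, and the function then polls readyState until the request is complete (state 4) or 10 seconds have elapsed:

Global $ajax = ObjCreate("MSXML2.XMLHTTP")
...
Func ajax($url,$data)
   $ajax.open("POST", $url, true)	; true = asynchronous: send() does not wait for the response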
   $ajax.setRequestHeader("Content-type", "application/x-www-form-urlencoded")
   $ajax.send($data)
   Local $hTimer = TimerInit() ; Begin the timer and store in a variable.
   While TimerDiff($hTimer) < 10000
      If $ajax.readyState == 4 Then
		 If $ajax.status = 200 Then return $ajax.responseText
	  EndIf
	  Sleep(10)
   WEnd
   return "Timeout!"
EndFunc
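
For comparison, the same request can be sketched in Python with the standard urllib module. This is a hedged sketch, not the author's method: the call blocks and relies on the library's timeout instead of polling readyState, and the helper name build_post is hypothetical.

```python
import urllib.error
import urllib.request


def build_post(url, data):
    """Prepare a form-encoded POST request; data is the 'name=value&...' string."""
    return urllib.request.Request(
        url,
        data=data.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def ajax(url, data, timeout=10.0):
    """Send the request and return the response text, or a short error string."""
    try:
        with urllib.request.urlopen(build_post(url, data), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return "HTTP error " + str(err.code)  # the server answered, but not with success
    except OSError as err:
        return "Request failed: " + str(err)  # includes URLError and socket timeouts
```

Unlike the polling loop, this version reports an unsuccessful HTTP status as soon as the server answers, rather than waiting for the timeout to expire.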

Notes

  1. AutoIt is a freeware BASIC-like scripting language, an alternative to PowerShell, originally designed for automating interaction with the Windows GUI. AutoIt scripts can run on Windows either interpreted or compiled.
    It comes with many libraries that make it possible, among other things, to access COM objects and create graphical interfaces.
  2. There can be multiple ranges but only one selection per document.
  3. The site https://regex101.com/ is very useful for testing regular expressions.
  4. The size overhead of the encoding is 33%. The = character, which I have not used, is for padding when the length of the data to encode is not divisible by 3; the padding is necessary if you want to concatenate two sets of encoded data.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Member 4206974
Software Developer Condor Informatique
Italy Italy
Computer literacy (software) : Languages: PHP, Javascript, SQL Autoit,Basic4Android; Frameworks: JOOMLA!
Teaching/Training skills on Office, WEB site development and programming languages.
Others : WEB site development.
UNDP Missions
feb – may 2003 Congo DR Bukavu: ground IT computer course
nov 2003 Burundi Bujumbura: Oracle Data Base course
feb 2005 Burundi Bujumbura: JAVA course
mar 2005 Mali Kati: MS Office course
oct 2006 Mali Kati: MS Office course
jun 2006 Burkina Faso Bobo Dioulasso: MS Office course
jun 2007 Burkina Faso Bobo Dioulasso: MS Office course
may 2007 Argentina Olavarria hospital: Internet application for access to medical records
apr 2008 Burkina Faso Ouagadougou: MS ACCESS and dynamic Internet applications
jun 2008 Niger Niamey: analysis of the computing needs of the Niamey hospital
may 2009 Burkina Faso Ouagadougou: MS ACCESS and dynamic Internet applications
oct 2010 Niger Niamey: analysis of the computing needs of the Niamey hospital (following)
Region Piedmont project Evaluation
mar 2006 Burkina Faso, Niger
mar 2007 Benin, Burkina Faso, Niger
sep 2008 Benin, Burkina Faso, Niger
Others
feb 2010 Burundi Kiremba hospital: MS Office course
feb 2011 Congo DR Kampene hospital: MS Office course

Article Copyright 2016 by Member 4206974