Extract text within a particular tag, Python lxml

Question

I have a document which looks like:

INTRODUCTION
This is a test document for xml.
Lets see how this works.
It should extract this sentence.

Conclusion
It should hopefully..

I need to extract the text in italics, i.e. the line "It should extract this sentence." The XML of the file looks like:

'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relati><w:body><w:p w:rsidR="00454E78" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00454E78" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00095108" w:rsidRPr="00095108" w:rsidRDefault="00095108"><w:pPr><w:rPr><w:i/></w:rPr></w:pPr><w:r><w:rPr><w:i/></w:rPr><w:t>It should extract this sentence.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

So what I did was use:

w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
for p in source.findall('.//{' + w + '}p[.//{' + w + '}i]'):
    print(''.join(t.text for t in p.findall('.//{' + w + '}t')))

to extract the text contained within italics (<w:i>), but this gives an error:

File "<string>", line unknown SyntaxError: invalid predicate

So I tried using XPath:

find = etree.XPath("//w:p//t",namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})
print(find(lxml_tree))

The expression returns results up to the w:p tag (with //text() appended), but adding an extra //t outputs nothing. How should I go about this? I have been stuck for a couple of days.

Posted 24-Sep-14 19:57pm

Sword19
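
A possible way forward, offered as a sketch rather than a definitive answer. Two things seem to be going wrong. First, findall() uses the limited ElementPath syntax, which does not appear to accept a descendant path such as .//{ns}i inside a predicate, hence "invalid predicate". Second, in the XPath attempt the final step t has no w: prefix, so it looks for a t element in no namespace and matches nothing. The code below shows a full XPath query with the prefix on every step, plus a findall-free variant that tests each paragraph in Python. The SAMPLE string is a hypothetical, trimmed-down stand-in for the real document, and the function names are made up for illustration.

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, shortened stand-in for the document posted in the question.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:rPr><w:i/></w:rPr></w:pPr>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>It should extract this sentence.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W
).encode("utf-8")


def italic_paragraphs_xpath(root):
    """Full XPath: every step carries the w: prefix, including w:t."""
    paragraphs = root.xpath("//w:p[.//w:i]", namespaces=NS)
    return ["".join(p.xpath(".//w:t/text()", namespaces=NS)) for p in paragraphs]


def italic_paragraphs_iter(root):
    """No predicate at all: walk the paragraphs and test each one in Python."""
    result = []
    for p in root.iter("{%s}p" % W):
        if p.find(".//{%s}i" % W) is not None:
            result.append("".join(t.text or "" for t in p.iter("{%s}t" % W)))
    return result


root = etree.fromstring(SAMPLE)
print(italic_paragraphs_xpath(root))
print(italic_paragraphs_iter(root))
```

Both functions select whole paragraphs that contain a w:i element anywhere inside them. If only the italic runs are wanted, an expression along the lines of //w:r[w:rPr/w:i]/w:t/text() might be a closer fit.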

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
