I have a document which looks like:

INTRODUCTION
This is a test document for xml.
Lets see how this works.
It should extract this sentence.

Conclusion
It should hopefully..

I need to extract the text in italics, i.e. the line "It should extract this sentence." The XML of the file looks like:

'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relati><w:body><w:p w:rsidR="00454E78" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00454E78" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00095108" w:rsidRPr="00095108" w:rsidRDefault="00095108"><w:pPr><w:rPr><w:i/></w:rPr></w:pPr><w:r><w:rPr><w:i/></w:rPr><w:t>It should extract this sentence.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'


So basically, what I tried was:

Python
w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
for p in source.findall('.//{' + w + '}p[.//{' + w + '}i]'):
    print ''.join(t.text for t in p.findall('.//{' + w + '}t'))


to extract the text contained within italics (<w:i>), but this gives an error:

File "<string>", line unknown SyntaxError: invalid predicate


So I tried using XPath with lxml:

Python
find = etree.XPath("//w:p//t",namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})
print(find(lxml_tree))


This returns results as far as the w:p tag (and also with //text() appended), but adding an extra //t outputs nothing. How should I go about this? I have been stuck on it for a couple of days.
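
For reference, here is a minimal sketch of one way the extraction could be done with only the standard library, avoiding the path-inside-a-predicate form that the first attempt relies on. It assumes italic text is marked by a <w:i/> element inside the run's <w:rPr>, as in the XML above. The SAMPLE string is a trimmed, hypothetical stand-in for the real document.xml, and italic_paragraphs is an illustrative helper name, not part of any library. The sketch does not handle a <w:i> that carries a w:val attribute switching italics off.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}

# Hypothetical, trimmed-down stand-in for the real document.xml content.
SAMPLE = (
    '<w:document xmlns:w="' + W + '"><w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:rPr><w:i/></w:rPr></w:pPr>'
    '<w:r><w:rPr><w:i/></w:rPr>'
    '<w:t>It should extract this sentence.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def italic_paragraphs(root):
    """Return the joined italic text of each paragraph that has italic runs."""
    results = []
    for p in root.iter('{' + W + '}p'):
        # Keep only the runs whose run properties contain <w:i/>.
        italic_runs = [r for r in p.findall('w:r', NS)
                       if r.find('w:rPr/w:i', NS) is not None]
        if italic_runs:
            results.append(''.join(t.text or ''
                                   for r in italic_runs
                                   for t in r.findall('w:t', NS)))
    return results


root = ET.fromstring(SAMPLE)
print(italic_paragraphs(root))
```

On the XPath attempt, note that the final step there is written as t without the w: prefix. In XPath 1.0 an unprefixed name does not match a namespaced element, so an expression along the lines of //w:p[.//w:i]//w:t, with the same namespaces mapping, is probably closer to what is needed.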
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


