Click here to Skip to main content
14,270,872 members
Rate this:
Please Sign up or sign in to vote.
See more:
I wanted to copy the content of one docx/document to another docx/document at particular position.

Note: Content might be having text data with background color, tables, smartart and images.

What I have tried:

def add_para_data(output_doc_name, paragraph):
    """
    Write the run to the new file and then set its font, bold, alignment, color etc. data.
    """
    output_para = output_doc_name.add_paragraph()
    for run in paragraph.runs:
        # output_run = output_para.add_run(run.text)
        output_run = output_para.add_run()

        # Run's bold data
        output_run.bold = run.bold
        # Run's italic data
        output_run.italic = run.italic
        # Run's underline data
        output_run.underline = run.underline
        # Run's color data
        output_run.font.color.rgb = run.font.color.rgb
        # Run's font data
        output_run.font.name = run.font.name
        output_run.font.highlight_color = run.font.highlight_color
        output_run.font.size = run.font.size
        # output_run.style.name = run.style.name
        output_run.style = run.style



    # Paragraph's alignment data
    output_para.style = paragraph.style
    output_para.alignment = paragraph.alignment
    output_para.paragraph_format.alignment = paragraph.paragraph_format.alignment
    output_para.paragraph_format.widow_control = paragraph.paragraph_format.widow_control
Posted
Comments
Richard MacCutchan 19-Jul-19 10:52am
   
And, what is the question?
Raman_patil 13-Aug-19 2:43am
   
How to copy paragraph of docx to another docx and retain all styles
Richard MacCutchan 13-Aug-19 4:05am
   
You need to study the documentation for the package that you are using, to see how it identifies different data types.
Raman_patil 13-Aug-19 8:38am
   
I gone through the documentation of python-docx.
It reads only content and it won't identify the images and tables inside the paragraph in between text data. The python-docx will identifies all the images and tables present in whole document not particularly with paragraph.

I used some more modules, please find the observations below:
1. Python-docx:
• Able to read the text data by iterating over paragraph, however paragraph won’t detect the table and images in it
• We have to use the document object which will detect all the tables and images present in the whole document

2. Tika:
• Tika reads both text data and meta data(like images) in separate variable
• Tika reads whole content in one single string so it won’t be possible to iterate over it

3. Win32:
• Unable to read the data with formats

4. Converting docx to xml:
• Bullets were not getting copied along with the content
• Image will be present in the different directory which has to copied separately in the output file, even if we are able to get the image path we are unable to add it to specific position in the output docx file

5. Converting docx to HTML:
• Unable to copy the bullet point
• Image is stored in html page with Base64 encoding, which will not be store in the directory
• Base64 is basically used for representing data in an ASCII string format

6. BytesIO:
• Able to read and write complete data, however this data will not be iterated line by line, please find the Documentation

7. Mammoth:
• Similar to HTML parsing

8. Pydocx:
• Similar to HTML parsing
Richard MacCutchan 13-Aug-19 12:50pm
   
Looks like you will have to find some other way of reading them. If you know C# you can use the Microsoft provided interop libraries.
Raman_patil 13-Aug-19 2:44am
   
I am reading the input docx file sections/paragraphs and then copy-pasting the content in it to another docx file at a particular section. The content is having images, tables and bullet points in between the data. However, I'm getting only text not the images, tables and bullet points present in between the text.

Tika module is able to read whole content but the whole docx is coming in a single string so I'm unable to iterate over the section and also I'm unable to edit(copy-pasting the content) the output docx file.

Tried using python-docx, whereas it reads only content and it won't identify the images and tables inside the paragraph in between text data. The python-docx will identifies all the images and tables present in whole document not particularly with paragraph

Tried unzipping word to XML, but the XML is having images in a separate folder. Also, the code will not identify the bullets

I expected the output file should be having all the sections replaced with the copied input section content(images, tables, smart art and formats)

The output file just has the text data.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)




CodeProject, 503-250 Ferrand Drive Toronto Ontario, M3C 3G8 Canada +1 416-849-8900 x 100