Hi, everyone. I am writing a blog (just getting started) and I want to convert a URL to Word or text. Take http://www.yahoo.com/ as an example: I want to extract its content to Word or text so that I can add my own content to it or edit its formatting. How could I do that?
Comments
Member 12425065 18-Apr-16 7:44am    
You can try this if you are facing this type of problem:
// Convert HTML to Word (DOCX) document.
DocumentModel.Load("Document.html").Save("Document.docx");
Truld 25-Apr-17 7:13am    
To add the source of that code snippet for future reference: this example demonstrates how to convert HTML content to a Word document in C#. The example uses a Word processing library for C#.

1 solution

Basically, you can convert it to text by removing all tags from the HTML code, which is a simple string-manipulation routine. There is one more problem: you also need to convert all HTML character entities to Unicode characters. For that, an HTML parser can be very handy. It will give you the values of all text nodes, with all the character entities already decoded.
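
As a rough illustration of the tag-removal route, here is a minimal sketch. The class and method names (HtmlText.Strip) are made up for this example, and the regular expressions are deliberately naive: they do not handle HTML comments, CDATA, or a '>' inside an attribute value, so treat it as a starting point for simple pages only. Entities are decoded after the tags are removed, so that escaped text such as &lt;b&gt; is not mistaken for a tag.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: naive HTML-to-text conversion for simple markup.
    public static string Strip(string html)
    {
        // Drop <script> and <style> elements together with their content.
        string s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words stay separated.
        s = Regex.Replace(s, @"<[^>]+>", " ");

        // Convert character entities (&amp;, &lt;, &#233;, ...) to Unicode characters.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of whitespace left behind by the removed tags.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

For example, HtmlText.Strip("<p>Fish &amp; <b>chips</b></p>") returns "Fish & chips".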

The simplest way of doing it would be to use an XML parser, but it can only work if your HTML is well-formed XML. XML parsers are readily available in .NET. It is really a shame, but much of the HTML that exists in the real world is not well-formed. In that case, you would need a parser that can work with non-well-formed code, so look at one of the HTML parsers available for C#. The following projects can be found on CodeProject:
An Elementary HTML Parser
Another C# Legacy HTML Parser Using Tag Processing
AfterWork HTML Parser in C#

If this is not enough, do your own search:
http://en.lmgtfy.com/?q=HTML+parser+%22C%23%22

Sharpen your Google skills, by the way; it will greatly help you.
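
Coming back to the XML-parser option for well-formed input: a minimal sketch using LINQ to XML could look like the following. The names XhtmlText and Extract are invented for this example. Note that XML only predefines five named entities (&amp;, &lt;, &gt;, &quot;, &apos;), so HTML-only entities such as &nbsp; would need extra handling that is not shown here.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlText
{
    // Hypothetical helper: extracts text from markup that is well-formed XML.
    // XDocument.Parse throws System.Xml.XmlException on non-well-formed input.
    public static string Extract(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // DescendantNodes() walks the tree in document order;
        // XText covers both plain text and CDATA nodes.
        var parts = doc.DescendantNodes()
                       .OfType<XText>()
                       .Where(t => t.Parent != null
                                && t.Parent.Name.LocalName != "script"
                                && t.Parent.Name.LocalName != "style")
                       .Select(t => t.Value.Trim())
                       .Where(v => v.Length > 0);

        return string.Join(" ", parts);
    }
}
```

Calling XhtmlText.Extract("<html><body><p>Fish &amp; <b>chips</b></p></body></html>") returns "Fish & chips", while typical "tag soup" such as "<p>one<br>two</p>" makes the parser throw, which is exactly the limitation described above.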

Now, what kind of Word document do you want: with formatting close to the HTML source or not?

In the first case, you can just open the HTML document with Word (and save it as a .doc or .docx file); in the second case, you don't need to do anything else: plain Unicode text can be considered a special case of a Word document.

If you need to do the conversion automatically (though I don't see why, if an HTML document can be opened with Word anyway), you would need to use Office/Word interop.

To create a Word document, use the Office interop assembly. Basically, right-click your project's "References" node in the Solution Explorer, click "Add Reference", go to the "COM" tab of the "Add Reference" window, and add a reference to the Microsoft Word Object Library of the required version. Please see:
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.11%29.aspx
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.14%29.aspx

(Or a similar piece of documentation for the required version.)

See also:
http://msdn.microsoft.com/en-us/library/aa192495%28v=office.11%29.aspx
http://msdn.microsoft.com/en-us/office/hh128772.aspx
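
To give an idea of what the automated route might look like, here is a minimal sketch of opening an HTML file in Word and saving it as .docx through interop. It assumes Word 2010 or later is installed on the machine (SaveAs2 was introduced with that version), that the COM reference has been added as described above, and that C# 4 or later is used so optional COM parameters can be omitted. The class name HtmlToDocx is made up for this example, and how faithfully Word reproduces the HTML formatting will depend on the page.

```csharp
using System;
using System.IO;
using Word = Microsoft.Office.Interop.Word;

public static class HtmlToDocx
{
    // Hypothetical helper: lets Word itself do the HTML-to-DOCX conversion.
    public static void Convert(string htmlPath, string docxPath)
    {
        var app = new Word.Application();
        try
        {
            // Pass absolute paths; relative ones would be resolved
            // against Word's working directory, not the caller's.
            Word.Document doc = app.Documents.Open(Path.GetFullPath(htmlPath), ReadOnly: true);

            // wdFormatXMLDocument is the .docx format.
            doc.SaveAs2(Path.GetFullPath(docxPath), Word.WdSaveFormat.wdFormatXMLDocument);
            doc.Close(SaveChanges: false);
        }
        finally
        {
            // Always shut down the Word process started above.
            app.Quit();
        }
    }
}
```

Usage would be something like HtmlToDocx.Convert("Document.html", "Document.docx"). Keep in mind that this starts a full Word process for every call.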


—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


