I have a .NET C# application that extracts data from a Word file and saves it to a database. To preserve the Word content as it is, I first convert the whole Word file to HTML and then process the HTML paragraphs one by one.
During the conversion from Word to HTML, the equations in the Word document are converted into images.

Globals.ThisAddIn.Application.ActiveDocument.Select();
Microsoft.Office.Interop.Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument;

string result = Path.GetTempPath();

string tmpFileName = Globals.ThisAddIn.Application.ActiveDocument.FullName;
doc.SaveEncoding = Microsoft.Office.Core.MsoEncoding.msoEncodingUSASCII;
if (File.Exists(result + "temp.html"))
{
    File.Delete(result + "temp.html");
}
doc.SaveAs(result + "temp.html", WdSaveFormat.wdFormatFilteredHTML); 

doc.Close(Microsoft.Office.Interop.Word.WdSaveOptions.wdDoNotSaveChanges);

HtmlAgilityPack.HtmlDocument mangledHTML = new HtmlAgilityPack.HtmlDocument();
mangledHTML.Load(result + "temp.html");


if (File.Exists(result + "newtemp.html"))
{
    File.Delete(result + "newtemp.html");
}

mangledHTML.Save(result + "newtemp.html");
// Remove standalone CRLF

string badHTML = File.ReadAllText(result + "newtemp.html");
badHTML = badHTML.Replace("\r\n\r\n", "ackThbbtt ");
badHTML = badHTML.Replace("\r\n", " ");
badHTML = badHTML.Replace("ackThbbtt ", "\r\n");
badHTML = badHTML.Replace('�', ' ');
if (File.Exists(result + "finaltemp.html"))
{
    File.Delete(result + "finaltemp.html");
}
File.WriteAllText(result + "finaltemp.html", badHTML);

// Clean up the intermediate temp files; only finaltemp.html is kept
File.Delete(result + "temp.html");
File.Delete(result + "newtemp.html");

// Reopen the original document, which was closed after the HTML export
Microsoft.Office.Interop.Word.Document originalDoc = Globals.ThisAddIn.Application.Documents.Open(tmpFileName);
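
The three Replace calls above rely on a placeholder string ("ackThbbtt ") that could, in principle, collide with real document text. As a minimal sketch, assuming the goal is only to turn a double CRLF into a single CRLF and any remaining single CRLF into a space, the same result can be had in one regex pass. The class and method names here (HtmlLineBreaks, Normalize) are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the original add-in code.
public static class HtmlLineBreaks
{
    // Matches a double CRLF first (group 1), otherwise a single CRLF (group 2).
    private static readonly Regex CrLfRun =
        new Regex("(\r\n\r\n)|(\r\n)", RegexOptions.Compiled);

    // Collapses each double CRLF to one CRLF and replaces each
    // remaining single CRLF with a space, scanning left to right.
    public static string Normalize(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        return CrLfRun.Replace(html, m => m.Groups[1].Success ? "\r\n" : " ");
    }
}
```

With this helper, the three Replace lines could be swapped for badHTML = HtmlLineBreaks.Normalize(badHTML); the replacement of the '�' character would stay as a separate step.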


What I have tried:

Basically, I want to store each paragraph of the Word document separately in the database, along with all of its properties such as font size, font width, font name and font style, so that I can display it in my application exactly as I wrote it in the Word document.

To reproduce it exactly, I need to convert it to HTML format and then, by separating the paragraphs, store them in the database. But the problem appears when a paragraph in my Word document contains equations:

C#
Globals.ThisAddIn.Application.ActiveDocument.Select();
Microsoft.Office.Interop.Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument;

string result = Path.GetTempPath();

string tmpFileName = Globals.ThisAddIn.Application.ActiveDocument.FullName;
doc.SaveEncoding = Microsoft.Office.Core.MsoEncoding.msoEncodingUSASCII;


This code converts all the equations in my Word document into images, and once they are images I can't display the equations properly in my application.

So I tried to convert these equations to MathML, but I couldn't get it to work.
Posted
Updated 22-Apr-24 4:53am
v2

1 solution

I've not tested this, but it would appear that the MS interop library version that you are using doesn't know how to convert equations into MathML or LaTeX format. So either you need to convert them to MathML or LaTeX in the document before saving them to your database, or you need a different library that does know how to do that conversion.
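
One possible way to do the MathML conversion, as an untested sketch: Office installations normally ship an XSLT stylesheet named OMML2MML.XSL that maps Word's equation markup (OMML) to MathML, and it can be applied with System.Xml.Xsl.XslCompiledTransform. The stylesheet path below is an assumption and varies by Office version and install type. The class and method names (OmmlToMathML, Convert) are hypothetical. The input is assumed to be the XML of a single m:oMath or m:oMathPara element, which you would first have to extract yourself, for example from the WordOpenXML of each equation's range in doc.OMaths.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: not part of the original add-in code.
public static class OmmlToMathML
{
    // Assumed location; check your own Office installation folder.
    public const string AssumedStylesheetPath =
        @"C:\Program Files\Microsoft Office\root\Office16\OMML2MML.XSL";

    // Applies the given XSLT stylesheet to an OMML fragment and
    // returns the transformed markup as a string.
    public static string Convert(string ommlXml, string stylesheetPath)
    {
        if (ommlXml == null) throw new ArgumentNullException(nameof(ommlXml));
        if (stylesheetPath == null) throw new ArgumentNullException(nameof(stylesheetPath));

        var transform = new XslCompiledTransform();
        transform.Load(stylesheetPath);

        using (var input = XmlReader.Create(new StringReader(ommlXml)))
        using (var output = new StringWriter())
        {
            // No XSLT parameters are passed, hence the null argument list.
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The returned MathML string could then be stored in the database in place of the image reference that the filtered HTML export produces for each equation.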
 
 
Comments
Conduct dotnet 24-Apr-24 0:20am    
Yes, exactly. But I couldn't find a solution. How can I achieve that?
M-Badger 24-Apr-24 3:37am    
https://www.google.co.uk/search?q=convert+word+equations+to+mathml+or+latex
-> https://tex.stackexchange.com/questions/233963/convert-mathtype-and-ms-word-equations-equations-to-latex
-> https://www.youtube.com/watch?v=qVduH8PuR2E

https://www.google.co.uk/search?q=word+to+html+converter+library+latex
-> https://dpcarlisle.blogspot.com/2007/04/xhtml-and-mathml-from-office-20007.html?_sm_au_=iMVM5T6JJPSZKH7HQ0WpHK6H8sjL6
-> https://github.com/michaelfranzl/docx_converter

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


