Hi,

I am trying to develop a web-based document management system for internal use and have come across an issue to which I have failed to find an answer.

I was hoping to provide the user with a search box where they can enter a keyword to search through the contents of the documents. If a match is found, the application will display the matching files. The documents will generally be PDF documents, but the directory may on rare occasions contain Word documents.

I would appreciate any help in identifying the different options that are available to achieve this and any resources that you can point me towards to help with the development.

Thank you for taking the time to read my request.
Mo

1 solution

As this is a text search, you may want to convert the Word documents to plain (Unicode) text for search purposes.
It would be best if your documents were .docx, not the old .doc. This newer format is based on Open XML: http://en.wikipedia.org/wiki/Office_Open_XML.

So the simplest fallback solution, just for search, could be this: unpack the .docx document as a ZIP archive (this is how such documents are packaged) and run the text search over the extracted XML (you can also remove all the XML tags first).
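The unpack-and-search idea can be sketched roughly as follows. This is only a minimal illustration in Python, not a complete solution: the function names `docx_to_text` and `contains_keyword` are made up for this example, and it reads only the main body part (`word/document.xml`), so text in headers, footers, footnotes and comments is not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the main body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> elements hold the actual text runs.
    # Keeping only <w:t> content is what "removing the XML tags" amounts to.
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def contains_keyword(text, keyword):
    """Case-insensitive substring match on the extracted plain text."""
    return keyword.casefold() in text.casefold()
```

A caller would run `docx_to_text` on each .docx file in the directory and list the files for which `contains_keyword` returns true. For anything beyond a small internal site, the extracted text would normally be cached or indexed rather than re-read on every search.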

For finer-grained work, you can use the Open XML SDK, which Microsoft provides free of charge. Please see my past answer, along with the answers and other materials it references: How to add microsoft excel 15.0 object library from Add Reference in MS Visual Studio 2010.

By the way, see also Microsoft's warnings against using Office interop in server settings:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2,
http://support.microsoft.com/kb/257757/en-us.

What if you have to use old .doc files? It is better to avoid them by all means (why not convert them before storing them on the site?). Working with them is still possible, but much harder. The only source I know of is the API that comes with the open-source LibreOffice. Please see my answer referenced above, first link in it.

You can also try searching for something else:
http://bit.ly/15DSm5l,
http://bit.ly/15MYwki.

However, I would avoid Office documents altogether. Even though Open XML is now a public standard, the Office documents and applications are still proprietary and are not part of the W3C standards. Isn't it possible to rework the content into some HTML- or XML-based documentation?

—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


