|
|
Comments and Discussions
|
 |
 |
I'm using Windows Server 2012 and Adobe iFilter, and in job.cs it says "Use this class to sandbox Adobe IFilter 11 or higher when you want to use this code on Windows 2012 or higher". How would I go about using that?
|
|
|
|
 |
When reading .xls files, your code returns the text as it appears on the screen, not as it is stored in the document. An .xls file and an .xlsx file contain the same 20-digit number. The IFilter output for the .xlsx contains the full 20-digit number. The IFilter output for the .xls contains only the first few digits followed by E+19.
|
|
|
|
 |
You can't do anything about this behavior. The number is stored in Excel as 20 digits; Excel just shows it as E+19. This is how Excel formats it for display.
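A small sketch of what can and cannot be recovered afterwards, assuming the .xls filter output is plain text in scientific notation. The sample value and the ScientificNotationDemo class are made up for illustration and are not part of the library. Expanding the text gives back a plain number, but any digits the filter did not emit stay lost.

```csharp
using System;
using System.Globalization;

public static class ScientificNotationDemo
{
    // Expands text such as "1.23456789012346E+19" into a plain decimal value.
    // Digits that were dropped before the text was produced cannot be restored.
    public static decimal Expand(string scientificText)
    {
        return decimal.Parse(scientificText, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

Calling ScientificNotationDemo.Expand on the hypothetical sample above yields a 20-digit value whose trailing digits are zeros, which is why a library that reads the stored cell value directly (as mentioned in the next reply) is the safer route when the exact digits matter.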
|
|
|
|
 |
I was also unable to remedy this behavior. I ended up using CSharpJExcel library instead of this one. It gives proper output of the value no matter what is displayed.
|
|
|
|
 |
I can start the program. After I select a PDF file, I get the following error:
Error HRESULT E_FAIL has been returned from a call to a COM component.
|
|
|
|
 |
Any answer to this yet? I'm getting the same error.
|
|
|
|
|
 |
Try this code --> https://github.com/Sicos1977/IFilterTextReader/tree/master
|
|
|
|
 |
I downloaded the code and opened it in Visual Studio. How do I run the project? If I press Start, it just says "A project with an Output Type of Class Library cannot be started directly."
Please help!!! Regards
Partha Mandayam
|
|
|
|
|
 |
I modified the code per aminooe's post below, and the WinForms version in the download now reads any document. When the same DLL is called from an MVC web app, only old .doc files work, not .docx or .pdf. The filter comes back null when stepping through. Has anyone seen this? Both apps compile as Any CPU and use the DLL produced by the download for filtering.
-MickeyB
|
|
|
|
 |
Hi,
I made this project available on GitHub and refactored the code to .NET 4.0 (Visual Studio 2013).
I modernized the code to present-day standards, added comments and incorporated most of the fixes that other users posted here.
You can find the code here --> https://github.com/Sicos1977/IFilterTextReader/tree/master
Greetings,
Kees van Spelde
|
|
|
|
 |
Hi Kees.
Thank you very much, great job.
Best regards,
Eugene
|
|
|
|
 |
Hi, thank you for your code. I downloaded it and ran it, but I found that the sample cannot get text from PDF files.
My PC environment: Windows 2008 x64 Server, VS 2010, with PDF iFilter 64 11.0.01 installed.
Greetings.
Jack.
|
|
|
|
 |
Hi,
Can you be a little more specific? Do you get an error or just no text at all? And is this PDF text based or is it a scanned document?
Greetings,
Kees van Spelde
|
|
|
|
 |
Hi,
The PDF documents (with OCR) and MS Office files yield text on Windows 2003 x86, a 32-bit operating system, without installing any IFilter plug-in. After I upgraded to Windows 2008 x64, neither PDF documents nor MS Office files yielded text: the program shows no error, and the returned result is an empty string. After I installed "FilterPack64bit.exe" (from Microsoft), MS Office files yielded text. With "PDF iFilter 64 11.0.01" (from Adobe) installed, PDF documents still yielded no text. Later I installed "Foxit PDF iFilter 3.0", and then I could get the PDF text. For now I use "iTextSharp" to get the PDF text, but it is much slower.
I also tried "PdfBox" to get the PDF text, but it does not handle Chinese characters well.
Greetings,
Jack.
|
|
|
|
 |
I'm developing a C# application and I use the IFilter library to extract the content of files. The problem is that the function LoadLibrary doesn't work when compiled for x86:
[DllImport("kernel32.dll")]
public static extern IntPtr LoadLibrary(string lpFileName);
IntPtr dllHandle=Win32NativeMethods.LoadLibrary(dllName);
The above code works fine when compiling the project as Any CPU, but if I compile it as x86 it fails: Win32NativeMethods.LoadLibrary() returns zero when I try to load offfiltx.dll.
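A zero handle by itself does not say why the load failed. A minimal diagnostic sketch, assuming the cause might be a 32-bit process trying to load a 64-bit filter DLL (the post does not confirm this). NativeLoader, ProcessBitness and LoadOrThrow are hypothetical names; LoadLibrary, Marshal.GetLastWin32Error and Win32 error 193 (ERROR_BAD_EXE_FORMAT) are the real Windows pieces.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

public static class NativeLoader
{
    // SetLastError = true lets Marshal.GetLastWin32Error() report why the load failed.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    // Win32 error 193 (ERROR_BAD_EXE_FORMAT): the file is not a valid image for this process.
    private const int ErrorBadExeFormat = 193;

    // Pointer size is 8 bytes in a 64-bit process and 4 bytes in a 32-bit one.
    public static string ProcessBitness()
    {
        return IntPtr.Size == 8 ? "64-bit" : "32-bit";
    }

    // Loads the DLL or throws an exception that carries the Win32 error code.
    public static IntPtr LoadOrThrow(string dllName)
    {
        IntPtr handle = LoadLibrary(dllName);
        if (handle != IntPtr.Zero)
            return handle;

        int error = Marshal.GetLastWin32Error();
        if (error == ErrorBadExeFormat)
            throw new BadImageFormatException(
                "The DLL does not match this " + ProcessBitness() + " process.", dllName);
        throw new Win32Exception(error);
    }
}
```

If the exception reports error 193, the DLL and the process differ in bitness, and a filter built for the same bitness as the process would be needed.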
|
|
|
|
 |
Hello,
I would like to use the IFilter library in a C# application which runs as x86. Is that possible?
Thanks a lot in advance.
|
|
|
|
 |
I am trying to use it in my website, which is built on .NET Framework 4.0, but it always throws a "no filter defined" exception. To check this I created a basic Windows Forms application and got the same error, unless the project was first downgraded to 2.0 and then upgraded again. After that it works correctly. The issue is that I can't downgrade my website, as it uses other APIs that won't work with an older version. Kindly help!
|
|
|
|
 |
No other methods seem to work except ReadToEnd()?
I'm trying to read a file one line at a time, so I have:
while ((line = reader.ReadLine()) != null)
{
...
}
The file is OK and yet no lines are read. I've checked the other methods and the same thing happens.
Am I doing something wrong here?
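A possible workaround, sketched under the assumption that the reader is a TextReader and that ReadToEnd() is the only call that behaves: read everything once, then split the text into lines in memory. LineSplitter is a hypothetical helper, not part of the library, and it trades memory for simplicity, so it may not suit very large files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineSplitter
{
    // Reads everything with ReadToEnd() (the call reported to work) and
    // then yields the text one line at a time from memory.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        string allText = reader.ReadToEnd();
        using (var lines = new StringReader(allText))
        {
            string line;
            while ((line = lines.ReadLine()) != null)
                yield return line;
        }
    }
}
```

The loop from the post then becomes foreach (string line in LineSplitter.ReadLines(reader)) { ... }.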
|
|
|
|
 |
Hi everyone,
I keep getting this error (Access Violation unable to read or write...) when I try to read a PDF document.
I'm working on an ASP.NET application installed on an x64 machine, the DLL is compiled as x86 and I have installed PDF IFilter 6.0.
The framework I'm working with is .NET 3.5.
For Office documents the filter can't be found, because the machine doesn't allow installing the x86 version of the filters.
Any help would be appreciated. Thanks.
|
|
|
|
 |
I've fixed the problem with the Adobe iFilter 64bit v11 (http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542)
First create the IPersist and IPersistStream interfaces:
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown),
Guid("0000010c-0000-0000-C000-000000000046")]
public interface IPersist
{
void GetClassID( out Guid pClassID);
};
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown),
Guid("00000109-0000-0000-C000-000000000046")]
public interface IPersistStream : IPersist
{
new void GetClassID(out Guid pClassID);
[PreserveSig]
int IsDirty();
void Load([In] IStream pStm);
void Save([In] IStream pStm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
void GetSizeMax(out long pcbSize);
};
Then create a StreamWrapper so you can use a .NET stream:
public class StreamWrapper : IStream
{
public StreamWrapper(Stream stream)
{
if (stream == null)
throw new ArgumentNullException("stream", "Can't wrap null stream.");
this.stream = stream;
}
private Stream stream;
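public void Read(byte[] pv, int cb, System.IntPtr pcbRead)
{
// pcbRead is optional for COM callers, so guard against a null pointer.
int read = stream.Read(pv, 0, cb);
if (pcbRead != IntPtr.Zero)
Marshal.WriteInt32(pcbRead, read);
}
public void Write(byte[] pv, int cb, IntPtr pcbWritten)
{
// Write the cb bytes supplied by the caller, then report the count back.
stream.Write(pv, 0, cb);
if (pcbWritten != IntPtr.Zero)
Marshal.WriteInt32(pcbWritten, cb);
}
public void Seek(long dlibMove, int dwOrigin, System.IntPtr plibNewPosition)
{
// Report the new 64-bit position back to the caller when a pointer is supplied.
long position = stream.Seek(dlibMove, (SeekOrigin)(dwOrigin));
if (plibNewPosition != IntPtr.Zero)
Marshal.WriteInt64(plibNewPosition, position);
}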
public void Clone(out IStream ppstm)
{
throw new NotImplementedException();
}
public void Commit(int grfCommitFlags)
{
throw new NotImplementedException();
}
public void CopyTo(IStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten)
{
throw new NotImplementedException();
}
public void LockRegion(long libOffset, long cb, int dwLockType)
{
throw new NotImplementedException();
}
public void Revert()
{
throw new NotImplementedException();
}
public void SetSize(long libNewSize)
{
throw new NotImplementedException();
}
public void Stat(out System.Runtime.InteropServices.ComTypes.STATSTG pstatstg, int grfStatFlag)
{
var tempSTATSTG = new System.Runtime.InteropServices.ComTypes.STATSTG();
tempSTATSTG.cbSize = stream.Length;
pstatstg = tempSTATSTG;
}
public void UnlockRegion(long libOffset, long cb, int dwLockType)
{
throw new NotImplementedException();
}
}
public class IStreamWrapper : Stream
{
IStream stream;
public IStreamWrapper(IStream stream)
{
if (stream == null)
throw new ArgumentNullException("stream");
this.stream = stream;
}
~IStreamWrapper()
{
Close();
}
public override int Read(byte[] buffer, int offset, int count)
{
if (offset != 0)
throw new NotSupportedException("only 0 offset is supported");
if (buffer.Length < count)
throw new NotSupportedException("buffer is not large enough");
IntPtr bytesRead = Marshal.AllocCoTaskMem(Marshal.SizeOf(typeof(int)));
try
{
stream.Read(buffer, count, bytesRead);
return Marshal.ReadInt32(bytesRead);
}
finally
{
Marshal.FreeCoTaskMem(bytesRead);
}
}
public override void Write(byte[] buffer, int offset, int count)
{
if (offset != 0)
throw new NotSupportedException("only 0 offset is supported");
stream.Write(buffer, count, IntPtr.Zero);
}
public override long Seek(long offset, SeekOrigin origin)
{
// The new position is a 64-bit value, so allocate and read 8 bytes.
IntPtr address = Marshal.AllocCoTaskMem(Marshal.SizeOf(typeof(long)));
try
{
stream.Seek(offset, (int)origin, address);
return Marshal.ReadInt64(address);
}
finally
{
Marshal.FreeCoTaskMem(address);
}
}
public override long Length
{
get
{
System.Runtime.InteropServices.ComTypes.STATSTG statstg;
stream.Stat(out statstg, 1 );
return statstg.cbSize;
}
}
public override long Position
{
get { return Seek(0, SeekOrigin.Current); }
set { Seek(value, SeekOrigin.Begin); }
}
public override void SetLength(long value)
{
stream.SetSize(value);
}
public override void Close()
{
stream.Commit(0);
stream = null;
GC.SuppressFinalize(this);
}
public override void Flush()
{
stream.Commit(0);
}
public override bool CanRead
{
get { return true; }
}
public override bool CanWrite
{
get { return true; }
}
public override bool CanSeek
{
get { return true; }
}
}
Now we can use it. Add this to LoadAndInitIFilter(string fileName, string extension) (FilterLoader.cs, line 76):
var iPersistStream = filter as IPersistStream;
if (iPersistStream != null)
{
Stream fileStream = new System.IO.FileStream(fileName, FileMode.Open);
StreamWrapper wrapper = new StreamWrapper(fileStream);
iPersistStream.Load(wrapper);
if (filter.Init(iflags, 0, IntPtr.Zero, out flags) == IFilterReturnCode.S_OK)
return filter;
}
I tested it with multiple files (pdf, docx, xlsx, txt) and it just works.
I used some sources:
http://hl7connect.blogspot.nl/2010/04/c-implementation-of-istream.html
A generic and typed way to transfer .NET objects to COM+ queued components
Thanks.
|
|
|
|
 |
Hi Aminooe,
I am new to using COM+ objects.
Thank you very much, your code works for me.
I would like to ask whether it is necessary to close the FileStream after
"Stream fileStream = new System.IO.FileStream(fileName, FileMode.Open);"?
And where should I close the stream if it is needed?
Thanks.
|
|
|
|
 |
Yep, it's necessary to refactor the code and dispose of the stream at the end, when file parsing is finished.
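One way this refactoring could look, as a rough sketch only. FilterStreamScope is a hypothetical holder class, not something from the article or the library; it owns the FileStream so the stream stays open while the filter reads from it and is closed once the caller is done.

```csharp
using System;
using System.IO;

// Hypothetical holder: keeps the FileStream alive while the filter reads from it
// and closes it when the caller disposes the scope.
public sealed class FilterStreamScope : IDisposable
{
    private Stream _stream;

    public FilterStreamScope(string fileName)
    {
        // Read-only access with shared reads avoids locking out other readers.
        _stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read);
    }

    public bool IsOpen
    {
        get { return _stream != null; }
    }

    // The stream to hand to the StreamWrapper from the post above.
    public Stream BaseStream
    {
        get
        {
            if (_stream == null)
                throw new ObjectDisposedException("FilterStreamScope");
            return _stream;
        }
    }

    public void Dispose()
    {
        if (_stream != null)
        {
            _stream.Dispose();
            _stream = null;
        }
    }
}
```

The idea is that whatever object owns the filter also owns one of these scopes and disposes it in its own Dispose method, after the last chunk has been read, rather than closing the stream right after iPersistStream.Load.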
|
|
|
|
 |
Where did you make this edit?
-MickeyB
|
|
|
|
 |
Hi author,
My English is not good; I hope you can understand me and give me some help.
When I use this code, I catch a System.ArgumentException. I create a new Word document and leave it empty. When I use this code to read it, I catch the System.ArgumentException. The program stops at the last statement shown here. If I add any text to the document, there is no error. I use Office 2010.
Looking forward to your help.
IFilter filter=LoadIFilter(extension);
if (filter==null)
return null;
IPersistFile persistFile=(filter as IPersistFile);
if (persistFile!=null)
{
persistFile.Load(fileName, 0);
|
|
|
|
 |
I'm having the same issue, any help for this error?
adi
|
|
|
|
 |
Hi there,
I am using IFilter on a 64-bit Windows Server 2008 with the 64-bit Adobe IFilter 9.
I can extract the text with no problem in a batch script, but if I try to use the component in ASP I get the following error: "Interface not registered (Exception from HRESULT: 0x80040155)". This only happens the second time I execute the IFilter.
Does anyone know how to fix this?
Regards,
Carlos
|
|
|
|
 |
Hi Carlos,
I am facing the same problem. Were you able to find a solution?
Regards,
Bruno
|
|
|
|
|
 |
Hi Eyal, thanks for sharing your code and experiences - I've found your wrapper excellent and the best around by far.
You mentioned in an earlier post that you wouldn't mind if the code is used in open source applications. That's the case with us: we'd like to use it under the GPL, but it says Ms-PL at the bottom of the description. Would it be alright to use it under the GPL? We would also credit your effort in the About section of our app and on the wiki documentation page.
Thanks,
Bogdan
|
|
|
|
|
 |
Recently I hit a bug where the code hangs if a user processes a file larger than 25 MB.
The reason is described in this Microsoft support article:
http://support.microsoft.com/kb/318747
If the user adds the registry entry described at that link, the error goes away.
In any case, I suggest adding the following code to FilterReader.cs.
Below:
IFilterReturnCode res=_filter.GetChunk(out _currentChunk);
add:
if (res != IFilterReturnCode.S_OK && res != IFilterReturnCode.FILTER_E_END_OF_CHUNKS) throw new Exception(string.Format("Error: 0x{0:x}", res));
With this change the code throws an exception on an unexpected return code instead of running into an infinite loop.
|
|
|
|
 |
I found a problem with the code I put in.
Right now the existing code waits for IFilter.GetChunk() to return FILTER_E_END_OF_CHUNKS twice before exiting (i.e. the exit condition is endOfChunksCount>1, where endOfChunksCount is initialized to 0). For some reason this works fine for most formats, but not for the MS Office 2010 formats (docx, xlsx, etc.). Those files return 0x8000FFFF right after the first FILTER_E_END_OF_CHUNKS, and then GetChunk() goes back to normal. Therefore, with the code above in place, an exception for 0x8000FFFF is thrown before the normal exit.
If I change the code to exit when the first FILTER_E_END_OF_CHUNKS is received, everything seems to work properly. So I was wondering: is there any reason why we need FILTER_E_END_OF_CHUNKS twice before setting _done = true?
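The loop described in these two posts can be sketched as follows. This is only an illustration of the proposed behavior (stop at the first FILTER_E_END_OF_CHUNKS, fail on anything unexpected), not the code in FilterReader.cs. ChunkLoop and CountChunks are hypothetical names, and the getChunk delegate stands in for IFilter.GetChunk so the logic can be followed without a real filter. The two HRESULT constants are the documented Windows values.

```csharp
using System;

public static class ChunkLoop
{
    public const uint S_OK = 0x00000000;
    public const uint FILTER_E_END_OF_CHUNKS = 0x80041700;

    // getChunk stands in for IFilter.GetChunk and returns its HRESULT.
    // Returns the number of chunks read before the end was signalled.
    public static int CountChunks(Func<uint> getChunk)
    {
        int chunks = 0;
        while (true)
        {
            uint hr = getChunk();
            if (hr == FILTER_E_END_OF_CHUNKS)
                return chunks; // stop at the first end-of-chunks signal
            if (hr != S_OK)
                throw new InvalidOperationException(
                    string.Format("GetChunk failed: 0x{0:X8}", hr));
            chunks++;
        }
    }
}
```

Because the loop returns at the first end-of-chunks signal, the 0x8000FFFF that the Office 2010 filters reportedly return afterwards is never seen. Real filters can also return codes such as FILTER_E_EMBEDDING_UNAVAILABLE or FILTER_E_LINK_UNAVAILABLE for individual chunks, which a caller may prefer to skip rather than treat as fatal.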
|
|
|
|
 |
I installed the latest FoxIT IFilters on a Windows 7 workstation. The actual DLLs are detected via their handler and persist class GUIDs. However, when trying to load the filter from the DLL, the code fails on:
ComHelper.GetClassFactory ().
This is specifically occurring at:
if (dllGetClassObject(ref filterPersistGUID, ref IClassFactoryGUID, out unk)!=0)
return null;
The dllGetClassObject is returning 0 and resulting in null being returned from GetClassFactoryFromDll() instead of a valid IClassFactory object.
Has anyone else experienced this with FoxIT IFilters and come up with an explanation/reason?
Thanks - Marc Mueller
UPDATE (3/26/12) - I found the issue: a license key is required to call FoxIT from code, even when evaluating, and it was missing.
Marc
modified 26-Mar-12 16:56pm.
|
|
|
|
 |
Hi,
I've tried your solution and it works great. However, when I tested it with the latest Adobe Reader 10.1.1 on Windows 7, it doesn't work. It fails at persistFile.Load. I suspect that the IFilter that came with that version doesn't implement the Load method of IPersistFile.
However, the strange thing is that this IFilter works fine for Windows Search. I also used the filtdump tool from MS and it could extract the text too (it says it's using IPersistStream, though). However, IFilter Explorer shows only IPersistFile being registered.
What could filtdump and Windows Search be doing differently from your code to extract text via the IFilter?
|
|
|
|
 |
I have run into this same issue. Adobe changed their "iFilter" so that it no longer implements IPersistFile or IPersistStream. They did this in 10.0 and haven't changed it as of 10.1.3, which is the latest version as of this writing.
Our only solution so far has been to make customers install Reader 9.5 as the latest compatible version. I've tried to get an answer from Adobe on the subject but have gotten no response so far. If anyone knows a workaround, that would be great...
|
|
|
|
|
 |
Hi,
I have a 64-bit application running on a 64-bit machine. When I try to use the IFilters, I get an exception, since the application is 64-bit and the IFilter DLLs are 32-bit.
Is there a way around this problem?
Thanks for your help in advance.
Ruchi
|
|
|
|
 |
I have PDF files stored in a database table that I would like to index. Has anyone managed to do this?
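One approach that might work, sketched under the assumption that the filter is loaded from a file path (as in the persistFile.Load(fileName, 0) call quoted elsewhere in this thread): write the bytes from the table to a temporary file with the right extension, extract the text, and delete the file. BlobTextExtractor is a hypothetical helper, and extractFromFile stands in for whatever reads text from a path, for example a small wrapper that creates a FilterReader and calls ReadToEnd().

```csharp
using System;
using System.IO;

public static class BlobTextExtractor
{
    // content: the bytes read from the database column.
    // extension: e.g. ".pdf"; filters are looked up by file extension.
    // extractFromFile: stand-in for the code that extracts text from a file path.
    public static string Extract(byte[] content, string extension, Func<string, string> extractFromFile)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        File.WriteAllBytes(path, content);
        try
        {
            return extractFromFile(path);
        }
        finally
        {
            // Remove the temporary copy even if extraction throws.
            File.Delete(path);
        }
    }
}
```

The returned text can then be stored in an index column next to the original row. The stream-based loading posted earlier in this thread would avoid the temporary file, at the cost of the extra COM plumbing.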
|
|
|
|
 |
I am trying to extract text from a number of PDF files for indexing. The test project included in the download works just fine, but when I try to instantiate a FilterReader from an (empty) ASP.NET application I get this: "Error HRESULT E_FAIL has been returned from a call to a COM component." Is this a known issue? Is there a workaround to make the library do its magic in a web application? Thanks in advance!
|
|
|
|
 |
This is really a great article. I was looking for a way to read .doc files without using Interop and found this article useful.
I want to read the bullets from a Word document and replace them with some character.
Is that possible using IFilter?
Regards,
Sushil
never say die
|
|
|
|
 |
With the code provided, I am able to read document content by using the IFilters. How can I read document properties such as Title, Author, etc., for Office 2007 documents and spreadsheets by using the IFilters?
|
|
|
|
 |
This doesn't work for the .ppsx and .ppsm file formats.
Can anyone help me?
|
|
|
|
 |
I have noticed that the IFilter gets stuck in a loop when trying to read .docx files containing images. Has anyone else experienced this?
|
|
|
|
 |
Download the latest Microsoft Office 2010 IFilters. I used to get that when a 2010 docx was processed using an older Office IFilter.
|
|
|
|
 |
I converted PDF document to text using your IFilter implementation and output was:
----------------------------------
Architecture..........................................8Installation........................................................12Load
----------------------------------
New lines are missing and strings are glued together "...8Installation".
Regards,
TomazZ
|
|
|
|
 |
Like many of you, I was most impressed by the author's code and was up and running locally in a matter of minutes (with my Windows XP laptop). However, when publishing to a Windows 2003 x64 Server I encountered a few issues with PDF documents. After poring through reams of posts and Googling until my fingers were sore, I managed to get the application working, so I decided to post my findings on how I eventually did it:
1. Install IFilter50.exe, not IFilter60.exe
You will need to install the Adobe IFilter DLLs on your server so it can filter PDF documents. Installing Adobe Acrobat Reader 8.0 or 9.0 on an x64 server does NOT install the IFilter DLLs (as it does on x86 machines), so you will need to obtain these separately from Adobe.com.
After installing IFilter60.exe I had limited success, but after a few runs I got the dreaded "Memory is Corrupt..." message, which nobody (not even Microsoft) was able to explain. After reading the final thread of this code project post (many thanks by the way), I uninstalled IFilter60 and installed IFilter50. No more dreaded messages!
2. Set the target platform to x86
If you still get the "no filter defined for..." message, explicitly set the environment to x86. From the Properties window of the IFilter source code, select the Build tab, and change the Platform target field from Any CPU to x86. Build the application to create a new EPocalipse.IFilter.dll and copy this to the Bin folder of your application.
As mentioned at the start, my environment is Windows 2003 Server x64 and after performing these two steps, the application was - and still is - running fine. Hope this is of some help to someone.
|
|
|
|
|
|
 |
|
|
| First Posted | 12 Mar 2006 |
| Views | 641,009 |
| Bookmarked | 292 times |
|
|