Hi,

My application deals with thousands of text files in a folder, and I am working on a filter to remove blank files.

At present I check the byte length and treat a file as blank if it is below a threshold (to allow for files that contain only spaces). This worked as expected, but I have now come across a few files that contain only line breaks and no printable characters.

My users want those files filtered out as well, along with the blank files.

Please suggest the best way to check whether a file contains any printable character. Alternatively, is there a way to find the byte size of a text file excluding whitespace?

Please note that performance is my primary concern, as I am dealing with thousands of text files.
Updated 11-Jun-13 3:06am
Comments
Sergey Alexandrovich Kryukov 11-Jun-13 9:15am    
This problem is not well-defined, because there are no clear criteria for "printable". Besides #13, #10, ' ' and tab, there are many other characters that do not actually print, and others are printed or shown depending on the fonts in use or the encoding used to read the text. And, after all, the files may not be text at all, in which case "printability" has no predefined meaning, because the bytes in the file are not meant to be interpreted as characters.
—SA

1 solution

I see no better way than scanning every file character by character (byte by byte), stopping at either the first printing character or the end of the file.

To achieve good performance, I'd recommend using a low-level API, such as the Read method of a FileStream object, with a large buffer for block transfers. (Benchmark and tune the buffer size.)

To check for printing characters, you can rely on .NET utility functions like Char::IsControl (combined with Char::IsWhiteSpace, since a space is not a control character), or implement your own lookup table that covers all 256 byte values.
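
The block-read scan with a 256-entry lookup table might look roughly like the sketch below. It is written in portable standard C++ rather than against the .NET FileStream API, and the names makePrintingTable and hasPrintingByte, the 64 KB default buffer, and the rule for what counts as "printing" (bytes 0x21 to 0x7E and 0x80 to 0xFF) are all assumptions made for illustration. It also assumes a single-byte encoding: a UTF-8 byte-order mark would count as printing, and UTF-16 files would need decoding first.

```cpp
#include <array>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Build a 256-entry table: true for byte values treated as "printing".
// Assumption: control bytes (0x00-0x1F), space (0x20) and DEL (0x7F)
// are non-printing; every other byte value counts as printing.
std::array<bool, 256> makePrintingTable() {
    std::array<bool, 256> table{};  // all entries start as false
    for (int b = 0x21; b <= 0x7E; ++b) table[b] = true;
    for (int b = 0x80; b <= 0xFF; ++b) table[b] = true;
    return table;
}

// Returns true as soon as one printing byte is found, false if the end of
// the file is reached without finding any.
// Assumed policy: a file that cannot be opened is reported as false.
bool hasPrintingByte(const std::string& path,
                     std::size_t bufferSize = 64 * 1024) {
    static const std::array<bool, 256> table = makePrintingTable();

    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    std::vector<char> buffer(bufferSize);
    while (true) {
        // Read one block; gcount() reports how many bytes actually arrived.
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;  // end of file (or read error)

        for (std::streamsize i = 0; i < got; ++i) {
            if (table[static_cast<unsigned char>(buffer[i])]) return true;
        }
    }
    return false;
}
```

Because the scan returns at the first printing byte, files with real content are usually rejected after a single block, and only the genuinely blank files are read to the end. The buffer size is a parameter so that it can be benchmarked and tuned as suggested above.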
 
Comments
[no name] 14-Jun-13 2:10am    
Thanks much

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


