
Press Releases

Welcome to the new Press Releases forum! Our old press release system has been retired but we've moved everything and everyone to a new, shinier home. Enjoy!

A press release must be written for the purpose of announcing something newsworthy. Advertisements, promotions, or anything smelling even vaguely of spam will be deleted. All press releases must be relevant to the development community.

 
News: Create/Update Group Shapes on Slides & Identify Password Protect Presentation
sherazam, 21-Jul-14 23:11
News: University of Limerick, Ireland Partners with MadCap Software to Enrich Students’ Skills with Advanced Technical Writing Tools and Support Successful Careers in Technical Communication and E-Learning
andrew.kinetic, 21-Jul-14 10:31
News: GridWeb Component for Java Based Web Apps & Improved Spreadsheets to HTML Export
sherazam, 20-Jul-14 19:40
General: dbForge Studio for Oracle v3.6 is Released
Devart, 17-Jul-14 23:48
News: WSO2 Presents Summer School Class on Applying API Management to the Internet of Things
Member 9795673, 17-Jul-14 8:43
News: New WSO2 White Paper for IT Architects Examines the Seven Steps to Achieving a Connected Business
Member 9795673, 15-Jul-14 12:03
News: Check & Get Master by its Name from Visio Drawing & Gluing Group Shapes
sherazam, 15-Jul-14 6:06
General: Document Filters, Search Engines & The Anatomy Of A Binary Format
dtSearch, 15-Jul-14 2:11
Document Filters, Search Engines & The Anatomy Of A Binary Format

When you view a document in Microsoft Word, you expect the text to be crystal clear. The same applies when you display a database in Access, a presentation file in PowerPoint, a spreadsheet in Excel, a PDF in Adobe Reader, or an email in Outlook/Exchange or Thunderbird. Further, these applications make it easy not only to view the text but also to locate specific words for basic navigation within the file. But what if you need to search across millions or billions of files? Pulling up each file individually in its associated application would take far too much time. Opening an untrusted document in its native application also creates a risk of virus infection. Instead, you would want a separate search engine to automatically search through all the data at once.

Binary Formats

Just as it is inefficient for you to open a large number of files one by one in their associated applications, so that process is inefficient for a search engine. Instead, a search engine needs to review data in binary format, bypassing the need to pull up each file in a separate program. The problem is that file text that looks crystal clear inside its associated application typically appears as gibberish in binary format. The image at the top right of this page shows a product description as it appears in Word. The bottom image shows a sample from this document as it appears in binary format. Returning this binary format to the readable text that appeared in Word requires a lot of parsing. The industry name for the components that parse binary formats is document filters. Document filters and search engines parse binary formats to different levels of depth. The parsing process this article describes reflects the dtSearch® product line. While this article’s anatomy of binary formats is a general one, the stages it describes to unravel these formats may not precisely reflect product lines other than dtSearch.

Binary Format Identification
Before parsing a binary format, the document filters need to identify what type of document or other object the binary format represents. In fact, identifying the right data specification is all-important, as the file specification for Word is nothing like the specification for Outlook or PDF. Further, the document filters need to figure out the data type of a binary format preferably without reference to any document name or extension. For example, suppose a user saves a Word file with an extension of .PDF instead of the Word extension .DOCX. Only by using the binary format itself to identify the data type instead of the extension can document filters effectively recognize and parse this file.
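As a rough illustration of identifying a format from its bytes rather than its extension, the sketch below checks a buffer against a few widely documented leading signatures. It is not how dtSearch does it; the function name sniff_format and the returned labels are hypothetical, and a real filter would inspect far more than the first few bytes.

```python
# Well-known leading byte signatures ("magic numbers").
# The label strings on the right are made up for this sketch.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # plain ZIP, and also DOCX/XLSX/PPTX
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-compound"),  # legacy .DOC/.XLS/.PPT
    (b"Rar!\x1a\x07", "rar"),
]


def sniff_format(data: bytes) -> str:
    """Guess a container type from the first bytes, ignoring any file name."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# A Word file saved with a .PDF extension still starts with the ZIP signature.
misnamed = ("report.pdf", b"PK\x03\x04" + b"\x00" * 16)
print(misnamed[0], "->", sniff_format(misnamed[1]))
```

Note that a ZIP signature only narrows the candidates: telling a .DOCX from an ordinary ZIP archive requires looking at the members inside the container.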

Evolving Specifications & Unicode
After figuring out the data type, the document filters can begin to apply the correct specification to decode the data. File specification data can be enormous. For example, Microsoft’s documentation of the .DOC Word file format alone is more than 600 pages. The document filters must also take into account the fact that all major data formats continue to evolve. If Microsoft makes a change to the .DOCX Word specification, the document filters have to apply this update for all new Word documents. And the document filters have to do so without interfering with the parsing of existing Word documents.

The next task for the document filters is to identify the relevant text encoding. Some documents, such as newer versions of Word, store data in Unicode. Other document formats can store text in language-specific encodings, which the document filters must identify and translate into Unicode.
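A minimal sketch of the encoding step, assuming the filter has already isolated a run of raw text bytes: it looks for a Unicode byte-order mark, then tries UTF-8, and finally falls back to a language-specific code page. The function name to_unicode and the cp1252 default are assumptions for illustration; real formats usually declare their code page somewhere in the file.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32-LE mark begins with
# the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_unicode(raw: bytes, declared: str = "cp1252") -> str:
    """Decode raw text bytes to a Unicode string."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the code page the document declares (assumed here).
        return raw.decode(declared, errors="replace")
```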

Metadata & Recursively Embedded Objects
In addition to parsing the main body of the text, the document filters have to identify and correctly handle other elements of a document, including headers and footers, fields such as subject and author, and even potentially hidden metadata. Then there is the issue of nested objects. A Word document can embed an Access database, which can itself embed an Excel spreadsheet, which can further embed a PowerPoint. The document filters need to recognize and drill through all of the different levels of nested document objects to fully parse the text.
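The drilling-down described above is naturally recursive. The sketch below models a parsed object as a hypothetical DocObject with its own text and a list of embedded objects, and walks every level. The class, the field names, and the depth limit are all assumptions for illustration, not part of any real document-filter API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DocObject:
    """Hypothetical result of parsing one object: its text plus embedded objects."""
    kind: str
    text: str
    embedded: List["DocObject"] = field(default_factory=list)


def walk_text(obj: DocObject, path: str = "", max_depth: int = 16) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for an object and everything nested inside it."""
    here = f"{path}/{obj.kind}"
    yield here, obj.text
    if max_depth == 0:
        return  # guard against pathological or malicious nesting
    for child in obj.embedded:
        yield from walk_text(child, here, max_depth - 1)


# The nesting from the paragraph above: Word > Access > Excel > PowerPoint.
nested = DocObject("word", "report", [
    DocObject("access", "table", [
        DocObject("excel", "cells", [DocObject("powerpoint", "slides")]),
    ]),
])
for location, text in walk_text(nested):
    print(location, "->", text)
```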

Database & Online Data
It is not only documents that can embed other documents as nested objects. An SQL database can store documents inside BLOB data within the database. An email can attach documents directly or as part of a ZIP or RAR archive. Documents, including standard Office files such as Word documents or emails, can appear online in the context of Web-based static (HTML, XSL/XML, PDF, etc.) data. Or they can appear within Web-based dynamic data (MS SharePoint, ASP.NET, CMS, PHP, etc.). The document filters need to handle all of these different data types just to ensure proper handling of documents. And that is before counting the surrounding SQL, email, compression, static, and online data itself, which the search engine needs to handle for comprehensive full-text searching.
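For the archive case, a minimal sketch using Python's standard zipfile module shows how a filter might reach documents inside a ZIP attachment, including a ZIP nested inside another ZIP. The function name iter_archive_members and the "!/" path separator are made up for this example; RAR, BLOB, and email handling are not covered.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_archive_members(data: bytes, prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (name, bytes) for each file in a ZIP, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            name = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested archive (this also matches ZIP-based formats such as DOCX).
                yield from iter_archive_members(payload, name + "!/")
            else:
                yield name, payload
```

Each yielded payload would then go back through format identification, since a member's name says no more about its real type than a file extension does.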

Document Filters In Context
Parsing data is just the initial step for a search engine like dtSearch. After parsing the data, the search engine needs to create a search index. The search index itself is simply a programmatic device to enable very fast searching of a wide range of data. A single search index can hold a large variety of data, including documents, emails and attachments, databases, and other Web-based static and dynamic data. In doing so, the index can enable concurrent or multithreaded federated searching across all of these different data types at once. After processing a search request from its index, the search engine will return a list of matching files or other data. The search engine then returns to the document filters to display the complete text of retrieved data. dtSearch products display the complete text by converting data types that are not already Web-ready to HTML for browser-based display. The final step is to retrieve “hit offsets” from the index. The hit offsets tell the search engine and its document filters where to highlight hits in the browser-based data display.
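To make the idea of an index with hit offsets concrete, here is a toy inverted index that records the character offset of every word, plus a helper that uses those offsets to highlight hits in HTML. It is a sketch under simple assumptions (single-word queries, plain-text input), not a description of the dtSearch index format; TinyIndex and highlight are hypothetical names.

```python
import html
import re
from collections import defaultdict
from typing import Dict, List


class TinyIndex:
    """Toy inverted index: word -> document id -> character offsets."""

    def __init__(self) -> None:
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id: str, text: str) -> None:
        for match in re.finditer(r"\w+", text):
            self.postings[match.group().lower()][doc_id].append(match.start())

    def search(self, word: str) -> Dict[str, List[int]]:
        hits = self.postings.get(word.lower(), {})
        return {doc_id: list(offsets) for doc_id, offsets in hits.items()}


def highlight(text: str, word: str, offsets: List[int]) -> str:
    """Return HTML for text with the hit at each offset wrapped in <b> tags."""
    parts, pos = [], 0
    for start in sorted(offsets):
        end = start + len(word)
        parts.append(html.escape(text[pos:start]))
        parts.append("<b>" + html.escape(text[start:end]) + "</b>")
        pos = end
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


index = TinyIndex()
body = "Search engines index text. Search fast."
index.add("a.doc", body)
for doc_id, offsets in index.search("search").items():
    print(doc_id, highlight(body, "search", offsets))
```

Storing offsets at indexing time means the engine does not need to rescan the document for the query term when it renders the highlighted view.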

For more information please visit http://www.dtsearch.com
News: BigLever, MadCap Partner to Deliver Integrated Documentation Management and Product Line Engineering Solution
andrew.kinetic, 14-Jul-14 12:21
News: Automatic Change Tracking in Word Document & Hyperlink Creation for Footnote
sherazam, 14-Jul-14 6:33
General: Add Custom Recognition Blocks & Set Automatic Spelling Correction in .NET Apps
sherazam, 10-Jul-14 1:09
News: WSO2 to Present Workshops Analyzing Big Data Streams From Internet of Things Devices in London and Houston
Member 9795673, 8-Jul-14 11:14
News: WSO2 Presents Summer School Class on Addressing Internet of Things Security Challenges
Member 9795673, 8-Jul-14 8:24
News: Export MS Project Data to Separate PNG/JPEG & Reading Timescale Data from XML/MPP
sherazam, 8-Jul-14 8:01
General: E-XD++ Power systems, wiring diagrams, distribution maps, geographic wiring diagram, the power system configuration and simulation, power dispatch, automatic control, C / C ++ and DELPHI and .NET, and web application examples and , 100% VC++ Source C
kellyonlyone, 7-Jul-14 16:59
News: Skyvia Cloud Data Integration Service is Released
Devart, 7-Jul-14 1:16
General: Aspose.Newsletter July 2014: New REST API for Managing MS Project Files & More
sherazam, 4-Jul-14 10:44
General: Save OneNote Documents to Stream & Specify Save Format Explicitly using .NET
sherazam, 1-Jul-14 19:10
News: $100,000 USD up for grabs at PayPal and Braintree Sydney Hackathon
Member 10917781, 1-Jul-14 18:16
News: New WSO2 White Paper Discusses How to Extend Benefits of Java EE by Using WSO2 Application Server
Member 9795673, 30-Jun-14 9:03
News: Add Delete or Read Voting Options from New or Existing Messages in Java Apps
sherazam, 30-Jun-14 1:02
General: CIOs Outsource Mainframe Application Development and Testing to Overcome Compliance Challenges
Priyank Prakash, 24-Jun-14 21:53
News: ThinkGeo Releases Map Suite 8.0 with Centralized Product Center for Easy Product Access, Native Support of Popular Data Formats, and Many Other New Features
ThinkGeo - Code Project, 24-Jun-14 6:14
General: dbForge Query Builder for SQL Server v3.8 Now Supports SQL Server 2014
Devart, 24-Jun-14 3:50
General: E-XD++ Electronic Form design, Form printing, Form filling, data dissemination, C / C++ visualization, VB/.NET, source code component library, 100% VC++ Source Code is Shipped 2014
kellyonlyone, 22-Jun-14 21:00
