Click here to Skip to main content
15,883,705 members
Please Sign up or sign in to vote.
2.00/5 (1 vote)
See more:
recently~ i met a question about the english string segmentation in my project.
at first,there is a substring database ,and how to segmentation a input string in at all most through match the substring database.

ie:
C#
inputstring="womenshic++programmerandorcodeproject";
substringdatebase[]={"wo","men","wome",
                     "c","c++","shi",
                    "program","programmer","and",
                    "or","code","project"
                     ..........
                    };

as follow affter segmentation:
C#
outstring={"wo","mem","shi".....};
Posted
Updated 6-Jan-12 19:48pm
v2
Comments
Sergey Alexandrovich Kryukov 6-Jan-12 23:04pm    
Not clear what do you mean by a segment. An example is not a definition. Is it a syllable? Anyway, this is a complex linguistic problem. You probably need a whole knowledge base: known syllables and all flexes, their usage, etc.
--SA
hengfeng 7-Jan-12 21:56pm    
thank you~

1 solution

For efficiency of the search, you can represent your substrings using a trie data structure (http://en.wikipedia.org/wiki/Trie[^]).

Then repeatedly perform a search for the longest string that matches some substring. Anyway, this is not a bulletproof solution, as long substring matches could supersede shorter substring matches and lead to a dead end.

Example: looking for { "ab", "abc", "ce" } in "abce" would detect "abc" and then fail on "e", whereas there is a solution with "ab" followed by "ce".
 
Share this answer
 
v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900