I extract the vocabulary from a document with the code below.
C#
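// document is a string variable; Vocabulary is an existing List<string>
// (declaring it in the same statement that reads it would not compile)
Vocabulary = Vocabulary.Union(Regex.Replace(document, "\\p{P}", " ")
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)).ToList();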

How can I replace every word with its index in Vocabulary, in both Vocabulary and document, at the same time as the extraction in the code above?
For example:
document=>"the book book is are is" change value to "0 1 1 2 3 2"
or these values could be stored in a List<int> variable
Vocabulary[0]=>"the" change value to 0
Vocabulary[1]=>"book" change value to 1
Vocabulary[2]=>"is" change value to 2
Vocabulary[3]=>"are" change value to 3
Updated 2-Jul-15 8:09am

1 solution

Interesting problem...

This looks like a word-based document "compression" process.

But I'm a little confused as to why you would replace all of the info in Vocabulary from the word strings to the corresponding number. That is the only information of the mapping between the words and the numbers. Without it, you will have no way to reverse the mapping and recreate the original string. So, essentially, the output can be almost arbitrary since there's no way to reconstruct anything useful!

If it is really required to replace the vocabulary values then:
C#
Vocabulary = Enumerable.Range(0, Vocabulary.Count).Select(n => n.ToString()).ToList();

For doing the replacements in document, I'd probably use a Dictionary<string, int> to hold the word-to-number mapping instead of needing to scan the Vocabulary List at every word.
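The dictionary idea could be sketched roughly as follows. This is a minimal illustration, not the poster's code: the class name VocabularyEncoder and the method Encode are hypothetical, and it assumes the same punctuation-stripping split as the question. It keeps the word strings in Vocabulary (so the mapping stays reversible) and returns the encoded document as a List<int>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class VocabularyEncoder
{
    // Returns the document as a list of word indices and extends
    // `vocabulary` so that vocabulary[i] is the word encoded as i.
    public static List<int> Encode(string document, List<string> vocabulary)
    {
        // Word -> index lookup, seeded from words already in the vocabulary.
        var index = new Dictionary<string, int>();
        for (int i = 0; i < vocabulary.Count; i++)
            index[vocabulary[i]] = i;

        // Same tokenization as the question: punctuation becomes spaces,
        // then split on spaces (empty entries are dropped).
        string[] words = Regex.Replace(document, @"\p{P}", " ")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var encoded = new List<int>();
        foreach (string word in words)
        {
            int id;
            if (!index.TryGetValue(word, out id))
            {
                // First occurrence: the next free index becomes its code.
                id = vocabulary.Count;
                vocabulary.Add(word);
                index.Add(word, id);
            }
            encoded.Add(id);
        }
        return encoded;
    }
}
```

Called with an empty vocabulary and the document "the book book is are is", this yields the indices 0 1 1 2 3 2 and the vocabulary the, book, is, are, which matches the example in the question. Each word costs one dictionary lookup, so the whole pass is linear in the number of words.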

Another option would be to iterate through the Vocabulary list and, for each word, apply a Regex.Replace across the whole document with the corresponding number. This will almost certainly misbehave if document can contain numbers that are the same as any of the word replacement values. Also, it rescans the document once per vocabulary word, which is O(N²) in the worst case, where N is the length of the document.
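For comparison, the Regex.Replace variant might look like the sketch below. The class name RegexEncoder is hypothetical, and the use of \b word boundaries and Regex.Escape is an assumption about how one would match whole words; it is shown mainly to make the failure mode concrete.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class RegexEncoder
{
    // Replaces each vocabulary word in the document with its index,
    // one full pass over the document per vocabulary word.
    public static string Encode(string document, IList<string> vocabulary)
    {
        string result = document;
        for (int i = 0; i < vocabulary.Count; i++)
        {
            // \b limits the match to whole words; Escape guards against
            // regex metacharacters inside the word itself.
            string pattern = @"\b" + Regex.Escape(vocabulary[i]) + @"\b";
            result = Regex.Replace(result, pattern, i.ToString());
        }
        return result;
    }
}
```

With the vocabulary the, book, is, are, the document "the book book is are is" becomes "0 1 1 2 3 2" as expected. With the vocabulary a, 0 and the document "a 0", however, the first pass produces "0 0" and the second pass then rewrites both zeros, giving "1 1" instead of "0 1". That is the collision between document numbers and replacement values described above.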
 
 
Comments
kave2011 2-Jul-15 13:51pm    
I implemented the naive Bayes text classification algorithm in parallel on CUDA.
I want to optimize it, but my problem is a memory shortage, especially of shared memory on the GPU, so I want to decrease the size of the data.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


