Tags: C++, C, Java, Python, database
Hello,
 
I am supposed to take data from a Wikipedia dump, a Freebase dump, or DBpedia.
I am then supposed to write code that outputs what every datum in that database is, e.g. the name of a person or a business, an address, and so on. It does not matter what language I write the code in, but I'm only familiar with C, C++, Java and Python. Java is my preferred language.
 
Those databases have all types of data: title, person name, address, social security number, phone number...
 
I have three questions:
 
1) Since I have used machine learning a lot, I have decided to use a machine learning approach.
I have started looking into WEKA, a Java machine learning toolbox. However, it is only available under a GPL license. Is there another toolbox that I can use in a commercial product?
 
2) The problem I am facing with a machine learning approach is that I don't know what features to use. All I can think of right now is: the length of the datum, the number of alphabetic characters it contains, and the number of digit characters it contains.
That is very little given all the types of data those databases have. Regular expressions do not seem to be a solution for this type of project.
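
To make the feature question concrete, here is a minimal Python sketch of the three features listed above plus a few more that might help separate names, addresses and phone numbers. The function names `shape_of` and `extract_features` and all feature names are illustrative assumptions, not part of WEKA or any other library, and whether these features are sufficient for this data is untested.

```python
def shape_of(text: str) -> str:
    """Collapse a string to a coarse pattern: A = upper, a = lower, 9 = digit."""
    mapped = []
    for c in text:
        if c.isdigit():
            m = "9"
        elif c.isalpha():
            m = "A" if c.isupper() else "a"
        else:
            m = c  # keep punctuation and spaces as they are
        # Collapse runs of the same class, so "555" becomes "9".
        if not mapped or mapped[-1] != m:
            mapped.append(m)
    return "".join(mapped)


def extract_features(datum: str) -> dict:
    """Turn one raw field value into simple features (hypothetical feature set)."""
    text = datum.strip()
    length = len(text)
    letters = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    tokens = text.split()
    return {
        "length": length,
        "letter_count": letters,
        "digit_count": digits,
        "digit_ratio": digits / length if length else 0.0,
        "punct_count": sum(not c.isalnum() and not c.isspace() for c in text),
        "token_count": len(tokens),
        "capitalized_tokens": sum(t[0].isupper() for t in tokens),
        "shape": shape_of(text),
    }


if __name__ == "__main__":
    for value in ["John Smith", "555-123-4567", "12 Main St."]:
        print(value, extract_features(value))
```

The `shape` string is categorical, so a learner would need it encoded as a nominal attribute; the other values can be used as numeric attributes directly.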
 
3) Is there another approach I can use? I mean, is machine learning the only approach?
 
Thank you for your help.
 
Regards,
 
Herve
Posted 25-May-11 11:02am
hervebags 1.1K

1 solution


Solution 1

This stuff's way beyond me, but to get a discussion started, here's how I'd approach it...
 
* Build up a dictionary of basic words (you can pull a list from Project Gutenberg to get you started). Classify these as verbs, nouns, adjectives, etc.
 
* Read up on the syntax of sentences (e.g. sentence diagramming).
 
* Use this knowledge along with your dictionary to create a classification routine which takes a sentence and guesses the class (verb, noun, etc.) of each word based on its position within the sentence.
 
* The nouns are the bits you're interested in (i.e. names, addresses, etc.). Have another routine to which you pass the sentence if it contains an unknown noun and a keyword (named, called, he, she, lives at, etc.). This routine can then add the noun to your list of likely candidates if the position of the keyword relative to the noun suggests that the noun is a name or an address.
 
* Break the data from your source down into sentences, pass them to the routine, and pull back the results.
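
As a very rough sketch of the last two steps, something like the following Python could be a starting point. It makes several loud simplifications: it skips the part-of-speech step entirely and treats any word missing from the dictionary as an "unknown noun", the tiny `KNOWN_WORDS` set and the `CUES` table are made-up seed data, the sentence splitter is naive (it will break on abbreviations such as "St."), and the tokenizer only handles ASCII letters and digits. All function names are hypothetical.

```python
import re

# Hypothetical seed dictionary; a real one would come from a large word list.
KNOWN_WORDS = {"the", "man", "is", "named", "called", "he", "she", "lives", "at", "in"}

# Keywords that hint at what the following unknown words are (illustrative only).
CUES = {
    "name": ["named", "called"],
    "address": ["lives at"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_candidates(sentence, known_words, cues):
    """Return (label, phrase) pairs for runs of unknown words right after a cue."""
    words = re.findall(r"[A-Za-z0-9']+", sentence)
    lowered = [w.lower() for w in words]
    results = []
    for label, cue_list in cues.items():
        for cue in cue_list:
            cue_tokens = cue.split()
            k = len(cue_tokens)
            for i in range(len(words) - k + 1):
                if lowered[i:i + k] != cue_tokens:
                    continue
                # Collect words after the cue until a dictionary word appears.
                j = i + k
                phrase = []
                while j < len(words) and lowered[j] not in known_words:
                    phrase.append(words[j])
                    j += 1
                if phrase:
                    results.append((label, " ".join(phrase)))
    return results


def extract_candidates(text, known_words=KNOWN_WORDS, cues=CUES):
    """Split text into sentences and gather candidates from each one."""
    found = []
    for sentence in split_sentences(text):
        found.extend(find_candidates(sentence, known_words, cues))
    return found


if __name__ == "__main__":
    sample = "The man is named Ada Lovelace. She lives at 12 Baker Street in London."
    print(extract_candidates(sample))
```

Only words directly after a cue are considered here; cues like "he" or "she", where the name comes earlier in the text, would need a different rule.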
 
This will still be a very rough approach, but with a bit of tweaking I reckon it'll be OK for starters.
 
Alternatively, check the web for videos and docs about the Wanderlust Natural Language project - I think they attempted something similar, but more advanced.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 8 Jun 2011