I want to apply the Apriori algorithm to XML documents. To prepare the input, I need to convert the XML data into the transaction/matrix form that the algorithm accepts (it is implemented in both C# and Java). So far I have tried converting the XML to a relational format and even to Excel, but neither solved the problem. What is the best way to do this? Any suggestions?

Update: Sample record from data set

XML
<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>
Updated 17-Jan-15 6:42am
Comments
Zoltán Zörgő 17-Jan-15 6:18am    
Do you really think this information is enough for anybody to help you? Tell us the logic of the desired conversion, and add at least one short example of the input XML and the desired output for that specific input.
It could be an interesting task, but for now it is far too incomplete, and you might end up with no help at all.
Eilia98 17-Jan-15 12:43pm    
Sorry for the insufficient details. The Apriori algorithm takes its input (transactions) in the form of a matrix:
TID A B C D E
T1 1 1 1 0 0
T2 1 1 1 1 1
T3 1 0 1 1 0
T4 1 0 1 1 1
T5 1 1 1 1 0

My XML data, however, looks like the sample shown in the question.
I don't know how to map such data into a form that the algorithm accepts.
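
For illustration, the 0/1 layout above can be produced from plain item sets with a few lines of Python. This is only a minimal sketch, not the input format of any particular Apriori implementation: `to_binary_matrix` is a hypothetical helper name, and the item sets below simply restate rows T1 to T5.

```python
def to_binary_matrix(transactions):
    """Return (items, rows): sorted item labels and one 0/1 row per transaction."""
    # Collect every distinct item; each one becomes a column.
    items = sorted({item for t in transactions for item in t})
    # A cell is 1 when the transaction contains that column's item.
    rows = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, rows


# Item sets restating rows T1..T5 of the matrix above.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D", "E"},
    {"A", "C", "D"},
    {"A", "C", "D", "E"},
    {"A", "B", "C", "D"},
]

items, rows = to_binary_matrix(transactions)
print("TID", *items)
for n, row in enumerate(rows, start=1):
    print(f"T{n}", *row)
```

With this in place, the open question reduces to deciding what counts as an "item" in an XML record.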
[no name] 17-Jan-15 7:52am    
"...., but the problem remained unsolved":

What problem?
Eilia98 17-Jan-15 12:45pm    
The problem is mapping the XML into a format that the algorithm accepts.
Peter Leow 17-Jan-15 9:07am    
It looks like you are doing some text mining to group frequently associated words in XML documents. You have to transform the documents into a structure and format that the program accepts. That is data preparation, and it is manual work.

1 solution

You did not state your objective, but it appears to involve text mining. Technically, text mining is the task of transforming unstructured text data into structured numerical data so that machine learning algorithms can be applied to large document databases. Converting text to numbers requires techniques for handling text at the individual word or character level, depending on the objective of the mining task.
Briefly, the process to prepare textual data for analysis involves:
1. Tokenization
2. Stemming
3. Stop-word removal
4. Indexing: represent the documents as a term-document matrix (TDM) using the "bag-of-words" approach.
It is not possible to explain the details here, so read up on Term Frequency and Inverse Document Frequency.
Finally, the whole document corpus will be turned into a TDM, to which the usual data mining techniques can then be applied to meet the mining objective.
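
As a rough illustration of the steps listed above, the following Python sketch tokenizes a few titles, drops stop words, and builds a binary bag-of-words matrix with one row per document (the same orientation as a transaction matrix). It rests on several assumptions: stemming is skipped because it would need an external library, the stop-word list is a tiny hand-made one, the second title is invented for the example, and `tokenize` and `term_document_matrix` are hypothetical helper names.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "for", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_document_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is 1 if terms[j] occurs in docs[i]."""
    # One set of non-stop-word tokens per document (bag-of-words, presence only).
    token_sets = [set(tokenize(d)) - STOP_WORDS for d in docs]
    # The vocabulary is the sorted union of all tokens.
    terms = sorted(set().union(*token_sets))
    matrix = [[1 if t in s else 0 for t in terms] for s in token_sets]
    return terms, matrix


docs = [
    "An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.",
    "Object-Oriented DBMS evaluation",  # hypothetical second title
]

terms, matrix = term_document_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```

Replacing the 0/1 cells with TF-IDF weights would give a weighted matrix, but Apriori-style algorithms typically only need presence or absence.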
 
Comments
Eilia98 24-Jan-15 2:51am    
@PeterLeow, thanks for your solution. Over the last week I searched for other solutions, but in the end I found yours practical. I have a more critical follow-up question: I want to use RapidMiner for the text mining, but I'm wondering how I can incorporate XML tags and attributes into the text mining process.

In fact, when XML is converted to, for example, CSV, there is no longer any difference between attributes, tags, and the textual content of the documents. I want to mine the data with regard to the structure of the document, not as a bunch of text. Is there any way to do so? I welcome and appreciate any suggestions.
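
One possible way to keep the structure, sketched here in Python and independent of RapidMiner, is to encode where each value came from inside the item label itself, so that an attribute and an element with the same value remain different items. The label scheme (`tag/@name=value` for attributes, `tag=text` for element content) and the helper name `record_to_items` are assumptions made for this example, not an established convention.

```python
import xml.etree.ElementTree as ET


def record_to_items(xml_text):
    """Turn one XML record into a set of structure-aware items."""
    root = ET.fromstring(xml_text)
    items = set()
    for elem in root.iter():  # the root element and all of its descendants
        # Attributes become "tag/@name=value" items.
        for name, value in elem.attrib.items():
            items.add(f"{elem.tag}/@{name}={value}")
        # Non-empty element text becomes a "tag=text" item.
        text = (elem.text or "").strip()
        if text:
            items.add(f"{elem.tag}={text}")
    return items


# The sample record from the question.
record = """<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>"""

items = record_to_items(record)
for item in sorted(items):
    print(item)
```

In the sample record, the `key` attribute and the `volume` element carry the same value but end up as two distinct items. Each record's item set can then be treated as one transaction and turned into a 0/1 row.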

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


