Preparing XML data for Apriori algorithm

Question

I want to apply the Apriori algorithm to XML documents. To prepare the input, I need to convert the XML data into the transaction/matrix form that the algorithm accepts (the algorithm is written in both C# and Java). So far I have tried converting the XML to a relational format, and even to Excel, but the problem remains unsolved. What is the best way to do this? Any suggestions?

Update: Sample record from data set

<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>

Posted 17-Jan-15 0:12am

Eilia98

Updated 17-Jan-15 6:42am

v2

Comments

Zoltán Zörgő 17-Jan-15 6:18am

Do you really think that this information will be enough for anybody to help you? Tell us the logic of the desired conversion, and add at least one short example for the input XML and the desired output for that specific input.
Can be an interesting task, but for now, it is far too incomplete... and you might end up with no help at all...

Eilia98 17-Jan-15 12:43pm

Sorry for the insufficient details. The Apriori algorithm takes its input (transactions) in the form of a binary matrix, with one row per transaction and one column per item:
TID A B C D E
T1 1 1 1 0 0
T2 1 1 1 1 1
T3 1 0 1 1 0
T4 1 0 1 1 1
T5 1 1 1 1 0

My XML data, however, is as shown in the question.
I don't know how to map such data into a form that the algorithm accepts.
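
One possible mapping, offered only as a minimal sketch: treat each <article> as a transaction and each "tag=value" pair as an item, then one-hot encode the items into the 0/1 matrix shown above. Python is used here for brevity; the same steps carry over to C# or Java. The <records> wrapper element, the second record, the choice of fields, and the names to_transactions and to_matrix are assumptions made for illustration and are not part of the original data set.

```python
import xml.etree.ElementTree as ET

# The first record is the sample from the question; the second is invented
# purely for illustration. The <records> wrapper is also an assumption.
SAMPLE_XML = """<records>
<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>
<article key="example/hypothetical-1">
<author>Frank Manola</author>
<title>Hypothetical Second Title</title>
<journal>Example Journal</journal>
<year>1995</year>
</article>
</records>"""


def to_transactions(xml_text, fields=("author", "journal", "year")):
    """Return {record key: set of 'tag=value' items}, one transaction per <article>."""
    root = ET.fromstring(xml_text)
    transactions = {}
    for article in root.iter("article"):
        items = set()
        for child in article:
            # Keeping the tag name in the item preserves some structure:
            # "year=1994" and "volume=1994" would be different items.
            if child.tag in fields and child.text:
                items.add(f"{child.tag}={child.text.strip()}")
        transactions[article.get("key")] = items
    return transactions


def to_matrix(transactions):
    """One-hot encode the transactions: one column per distinct item."""
    columns = sorted(set().union(*transactions.values()))
    rows = {
        tid: [1 if col in items else 0 for col in columns]
        for tid, items in transactions.items()
    }
    return columns, rows


transactions = to_transactions(SAMPLE_XML)
columns, rows = to_matrix(transactions)
print("TID", *columns, sep="\t")
for tid, row in rows.items():
    print(tid, *row, sep="\t")
```

Which fields become items is a modelling decision. Near-unique fields such as the title or volume rarely produce frequent itemsets unless they are first broken into words.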

[no name] 17-Jan-15 7:52am

"...., but the problem remained unsolved":

What problem?

Eilia98 17-Jan-15 12:45pm

The problem is mapping the XML into a format that the algorithm accepts.

Peter Leow 17-Jan-15 9:07am

Looks like you are doing some text mining to group frequently associated words in XML documents. You have to transform the documents into a structure and format that the program accepts. That is data preparation, and it is manual work.

Eilia98 17-Jan-15 12:48pm

Thanks for the answer. That is another workable solution; however, I want to mine the XML documents with respect to their structure as well as their content.

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Peter Leow · Accepted Answer · 2015-01-17T16:10:00

You did not state your objective, but it involves text mining. Technically, text mining is the task of transforming unstructured text data into structured numerical data so that machine learning algorithms can be applied to large document databases. Converting text to numbers requires techniques for handling text at the level of individual words or characters, depending on the objective of the mining task.
Briefly, the process to prepare textual data for analysis involves:
1. Tokenization
2. Stemming
3. Stop-word removal
4. Indexing: represent the documents as a term-document matrix (TDM) using the "bag-of-words" approach.
It is not possible to explain this in detail here, so read up on Term Frequency and Inverse Document Frequency (TF-IDF).
Finally, the whole document corpus is turned into a TDM, to which the usual data mining techniques can then be applied to meet the mining objective.
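
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a complete text-mining pipeline. The stop-word list is a tiny made-up sample, stemming is left as a pluggable function that defaults to doing nothing (a real stemmer such as Porter's would go there), and the matrix holds binary presence flags rather than TF-IDF weights because Apriori expects 0/1 input. The first title comes from the question; the second title and the names tokenize, terms_of, and build_tdm are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def tokenize(text):
    """Step 1: lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms_of(text, stem=lambda token: token):
    """Steps 2 and 3: drop stop words, then apply the stemmer.

    The default stemmer is the identity function, so no stemming happens
    unless the caller supplies one.
    """
    return {stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS}


def build_tdm(docs, stem=lambda token: token):
    """Step 4: build a binary bag-of-words matrix, one row per document."""
    doc_terms = {doc_id: terms_of(text, stem) for doc_id, text in docs.items()}
    vocabulary = sorted(set().union(*doc_terms.values()))
    matrix = {
        doc_id: [1 if term in terms else 0 for term in vocabulary]
        for doc_id, terms in doc_terms.items()
    }
    return vocabulary, matrix


docs = {
    # Title taken from the sample record in the question.
    "T1": "An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.",
    # Invented title, for illustration only.
    "T2": "Object-Oriented Databases: A Hypothetical Title",
}

vocabulary, matrix = build_tdm(docs)
print("TID", *vocabulary)
for doc_id, row in matrix.items():
    print(doc_id, *row)
```

Each row of the result is one transaction and each column is one term, which is the shape an Apriori implementation expects.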