Click here to Skip to main content
Click here to Skip to main content
Go to top

ETL using RapidMiner Community Edition

, 9 Oct 2012
Rate this:
Please Sign up or sign in to vote.
ETL using RapidMiner Community edition.

Sample Image

Introduction

RapidMiner is one of the leading data mining software suites. With more than 400 data mining modules or operators, it is one of the most comprehensive and most flexible data mining tools available. With over 10,000 downloads from SourceForge.net each month and more than 300,000 downloads in total, it is also one of the most widespread-used data mining tools. According to polls by the popular data mining web portal KDnuggets.com among several hundred data mining experts, RapidMiner was the most widely used open source data mining tool and among the top three data mining tools overall in 2007 and 2008.

RapidMiner supports all steps of the data mining process from data loading, pre-processing, visualization, interactive data mining process design and inspection, automated modeling, automated parameter and process optimization, automated feature construction and feature selection, evaluation, and deployment. RapidMiner can be used as stand-alone program on the desktop with its graphical user interface (GUI), on a server via its command line version, or as data mining engine for your own products and Java library for developers.

Background

For my recent BI project I need a tool to transform a SQL Server 2008 database to PostgreSQL database with all data. After working with several tools I chose RapidMiner for the task and it shows its full color for this purpose.

ETL

ETL stands for Extract, Transform, Load. For example, you receive files or other data from vendors or other third parties which you need to manipulate in some way and then insert into your own database.

What you Need

Microsoft® SQL Server® 2008 Express

SQL Server 2008 Express is a free edition of SQL Server that is an ideal data platform for learning and building desktop and small server applications, and for redistribution by ISVs. You can download from http://www.microsoft.com/en-us/download/details.aspx?id=1695[^]

I assume you have installed Microsoft® SQL Server® 2008 Express. For further information you can visit http://msdn.microsoft.com/en-us/library/dd981045(v=sql.100).aspx.

AdventureWorks database

You can download from http://msftdbprodsamples.codeplex.com/downloads/get/478218[^] and attach:

Sample Image

RapidMiner

You can download from http://sourceforge.net/projects/rapidminer/files/1.%20RapidMiner/5.2/rapidminer-5.2.008x32-install.exe/download[^].

I assume you have installed RapidMiner. For further information you can visit http://rapid-i.com/wiki/index.php?title=RapidMiner_Installation_Guide.

ETL Process

Step 1:

Create a new process from File->New.

Step 2:

Create a new process from File->New.

Step 3:

Click Operators->Import->Data->Read Database.

Sample Image

Step 4:

Now we have to create a SQL Server connection for reading.

Sample Image

After creating the connection, assign the connection to read Database link.

Sample Image

Step 5:

Click Build Query and assign your Table.

Sample Image

Step 6:

Click Operators->Export->Data->Write CSV. Select Write CSV icon and assign value.

CSV file: Location of the file.

Column separator: Column separator character.

Connect out from "Read Database" to inp from "Write CSV".

Sample Image

Behind the scenes

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="370" width="614">
      <operator activated="true" class="read_database" compatibility="5.2.008" 
              expanded="true" height="60" name="Read Database" width="90" x="179" y="75">
        <parameter key="connection" value="MSSQLServer-AdventureWorks2008"/>
        <parameter key="query" value="SELECT *
FROM "HumanResources"."Employee""/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.2.008" 
             expanded="true" height="76" name="Write CSV" width="90" x="380" y="75">
        <parameter key="csv_file" value="C:\Users\Masud\Desktop\exportcsv.csv"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Write CSV" to_port="input"/>
      <connect from_op="Write CSV" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Step 7:

Now press Run or F11 you will see that all data will be exported as a CSV file in your destination. And output will be like:

Sample Image

Conclusion

RapidMiner Community Edition is an excellent tool for ETL.

History

None so far.

License

This article, along with any associated source code and files, is licensed under The BSD License

Share

About the Author

mparvez
Software Developer (Senior) icddr,b
Bangladesh Bangladesh
More than 8 years experience on Programming and Project implementation, I was primarily involved with projects for private organization,Govt.(Bangladesh Army,DG Health,RJSC), NGO (SEDF,WFP). Presently I am working at ICDDR,B and enhancing Hospital Management System developed by Microsoft Dynamic NAV and Windows Mobile Application 5.0
 
An active supporter of Open Source technology, my interested areas are ERP, IT Audit, Data warehouse, BI etc.
 
Playing Guitar for 15 years, my interested music style is Blues Rock,Neo Classical.
 
Certification
 
70-540:Microsoft® Windows Mobile® 5.0 - Application Development
MB7-514:Microsoft Dynamics™ NAV 5.0 C/SIDE Introduction
MB7-516:Microsoft Dynamics™ NAV 5.0 Solution Development
MB7-517:Microsoft Dynamics™ NAV 5.0 Installation and Configuration
MB7-515:Microsoft Dynamics™ NAV 5.0 Financials
70-432:Microsoft SQL Server 2008 - Implementation and Maintenance
70-450:PRO: Designing, Optimizing and Maintaining a Database Administrative Solution Using Microsoft SQL Server 2008
70-448:Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
312-50:Certified Ethical Hacker
 
Web :http://masudparvezshabuz.appspot.com
Blog :http://masudparvezshabuz.wordpress.com
linkedin :http://www.linkedin.com/in/masudparvez
Follow on   Google+

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web04 | 2.8.140926.1 | Last Updated 9 Oct 2012
Article Copyright 2012 by mparvez
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid