OpenRefine vs RapidMiner


Overview

RapidMiner: 36 stacks, 65 followers, 0 votes, 0 GitHub stars, 0 forks
OpenRefine: 33 stacks, 68 followers, 0 votes, 11.6K GitHub stars, 2.1K forks


Advice on RapidMiner, OpenRefine

Sarah, Jun 25, 2020

Needs advice on OpenRefine

I'm looking for an open-source/free/cheap tool to clean messy data coming from various travel APIs. We use many different APIs and save the info in our DB. However, many duplicates cannot be easily recognized as such.

We would either write an algorithm or use smart technology/tools with ML to help with product management.

While there are many things to be considered, this is one feature that it should have:

"To avoid confusion, we need to merge the suppliers & products accordingly. Products and suppliers must be able to be merged and assigned separately.

Reason: It may happen that one supplier offers different products. E.g., one tour operator offers 3 products via one API, but only 1 product with 3 (or a different number of) variations via a different API. Also, the commission may differ between products, which we need to consider. Very often, products are live (bookable in real time) via one API but not via the other. E.g., supplier products 1 and 2 of API1 are live, but product 3 is not. For the same supplier, API2 provides live availability for products 1, 2, and 3.

Summing up, when merging the suppliers (tour operators) we need to consider:

  • Are the products the same for all APIs?
  • Which booking system API gives a better commission? Note: Some APIs charge us 1-5% depending on monthly sales, which needs to be considered.
  • Which booking system provides live availability?
  • Is it the same supplier, or is the name only similar?
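One way to make the commission and live-availability questions above concrete is a small ranking function. This is only a sketch: the `Listing` fields, the idea of subtracting the API fee from the commission, and the rule that a live listing always beats a non-live one are assumptions for illustration, not something the post specifies.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Listing:
    """One product as exposed by one booking API (hypothetical fields)."""
    api: str               # label such as "API1"
    live: bool             # bookable in real time via this API
    commission_pct: float  # commission earned on a sale
    api_fee_pct: float     # fee the API charges (1-5% in the post)

    @property
    def net_pct(self) -> float:
        # Assumed definition: what is left of the commission after the fee.
        return self.commission_pct - self.api_fee_pct


def preferred_listing(listings: Sequence[Listing]) -> Listing:
    """Pick the listing to keep active: live first, then highest net commission."""
    return max(listings, key=lambda item: (item.live, item.net_pct))
```

Under these assumptions, a live listing with a lower net commission still wins over a non-live one; swap the order of the tuple in `key` if commission should dominate instead.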

Most of the time, the supplier names differ even when the supplier is the same (e.g., API1 often names them XX Pty Ltd, while API2 leaves "Pty Ltd" out). Additionally, the product title, description, etc. differ.
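The "Pty Ltd" case can be handled by normalizing names before comparing them. The sketch below uses only Python's standard library; the `LEGAL_SUFFIXES` set and both function names are made up for illustration, and `difflib.SequenceMatcher` is just one possible similarity measure.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to ignore; extend it for real data.
LEGAL_SUFFIXES = {"pty", "ltd", "limited", "inc", "llc", "gmbh"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop legal-form tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized supplier names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

With this normalization, "XX Pty Ltd" and "XX" compare as identical. Name similarity alone cannot tell whether two similarly named suppliers are really the same, so it would likely need to be combined with other fields (address, product titles, and so on).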

We need to write logic and create an algorithm to find the duplicates and to merge, assign, or (de)activate the respective supplier or product. My previous developer started a module to merge the suppliers, which does not seem to work correctly. Also, it is far too time-consuming given the large number of products that we have.

I would recommend merging, assigning, etc. products and suppliers only if our algorithm says it's a 90-100% match for the supplier/product. Otherwise, admins need to be able to check and modify this. E.g., everything with a lower probability of matching will be matched automatically, but can be undone or modified.
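One reading of the 90-100% rule is a three-way triage on a match score between 0 and 1. The names below are hypothetical, and the lower cut-off `REVIEW_FLOOR` is an added assumption (the post gives no lower bound) so that clearly unrelated records are not merged at all.

```python
from enum import Enum

AUTO_THRESHOLD = 0.90  # the 90-100% band from the post
REVIEW_FLOOR = 0.60    # assumed lower bound; not stated in the post


class Decision(Enum):
    CONFIRMED = "confirmed"      # merged without review
    PROVISIONAL = "provisional"  # merged, but listed for admins to undo or edit
    NO_MATCH = "no_match"        # left as separate records


def triage(score: float) -> Decision:
    """Map a similarity score in [0, 1] to a merge decision."""
    if score >= AUTO_THRESHOLD:
        return Decision.CONFIRMED
    if score >= REVIEW_FLOOR:
        return Decision.PROVISIONAL
    return Decision.NO_MATCH
```

Both thresholds would need tuning against pairs the product management team has already judged by hand.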

The next time the cron job runs, this needs to be considered to avoid recreating duplicates & creating a mess."
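To keep the cron job from recreating duplicates, past decisions have to be stored and checked before any pair is scored again. The following is a minimal in-memory sketch; `DecisionLog`, `pairs_to_score`, and the ID format are hypothetical, and a real setup would presumably keep this in a database table.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


class DecisionLog:
    """In-memory stand-in for a table of past merge decisions."""

    def __init__(self) -> None:
        self._decisions: Dict[FrozenSet[str], bool] = {}

    def record(self, id_a: str, id_b: str, same: bool) -> None:
        """Store whether two records were judged to be the same entity."""
        # A frozenset key makes the pair order-independent.
        self._decisions[frozenset((id_a, id_b))] = same

    def lookup(self, id_a: str, id_b: str) -> Optional[bool]:
        """Return the stored decision, or None if the pair was never decided."""
        return self._decisions.get(frozenset((id_a, id_b)))


def pairs_to_score(
    candidates: Iterable[Tuple[str, str]], log: DecisionLog
) -> List[Tuple[str, str]]:
    """Keep only candidate pairs that have no recorded decision yet."""
    return [(a, b) for a, b in candidates if log.lookup(a, b) is None]
```

Each run would then score only the undecided pairs, so an admin's "not the same supplier" correction survives the next import.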

I am not sure in what way OpenRefine can help to achieve this and what ML tool can be connected to learn from the decisions the product management team makes. Maybe you have an idea of how other travel portals deal with messy data, duplicates, etc.?

I'm looking for the cheapest solution for a start-up, but it should do the work properly.


Detailed Comparison

RapidMiner

It is a software platform for data science teams that unites data prep, machine learning, and predictive model deployment.

Features: Graphical user interface; Analysis processes design; Multiple data management methods; Data from file, database, web, and cloud services; In-memory, in-database and in-Hadoop analytics; Application templates; -D graphs, scatter matrices, self-organizing map; GUI or batch processing

OpenRefine

It is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

Features: Faceting; Clustering; Editing cells; Reconciling; Extending web services
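Of the OpenRefine features listed, Clustering is the one closest to duplicate detection. The sketch below is loosely modeled on the key-collision "fingerprint" idea described in OpenRefine's documentation (normalize each value to a key, then group values that share a key). It is not OpenRefine's code, and the function names are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def fingerprint(value: str) -> str:
    """Build an order-insensitive, punctuation-free, ASCII key for a string."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^\w\s]", "", text)                    # drop punctuation
    return " ".join(sorted(set(text.split())))             # sort unique tokens


def cluster(values: Iterable[str]) -> List[List[str]]:
    """Group values whose fingerprints collide; return groups of two or more."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

Key collision only catches values that differ in case, punctuation, accents, or word order; names that differ by extra or missing words would need a fuzzier comparison.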
Integrations

  • Java
  • MATLAB
  • Python
  • MongoDB
  • Groovy
  • Zapier
  • R Language
  • HTML5
  • Python
  • Dask
  • Ludwig
  • Vertica

What are some alternatives to RapidMiner, OpenRefine?

JavaScript

JavaScript is best known as the scripting language for Web pages, but it is also used in many non-browser environments such as Node.js and Apache CouchDB. It is a prototype-based, multi-paradigm scripting language that is dynamic and supports object-oriented, imperative, and functional programming styles.

Python

Python is a general-purpose programming language created by Guido van Rossum. It is most praised for its elegant syntax and readable code, which makes it a good fit if you are just beginning your programming career.

PHP

Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world.

Ruby

Ruby is a language of careful balance. Its creator, Yukihiro “Matz” Matsumoto, blended parts of his favorite languages (Perl, Smalltalk, Eiffel, Ada, and Lisp) to form a new language that balanced functional programming with imperative programming.

Java

Java is a programming language and computing platform first released by Sun Microsystems in 1995. There are lots of applications and websites that will not work unless you have Java installed, and more are created every day. Java is fast, secure, and reliable. From laptops to datacenters, game consoles to scientific supercomputers, cell phones to the Internet, Java is everywhere!

Golang

Go is expressive, concise, clean, and efficient. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It's a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language.

HTML5

HTML5 is a core technology markup language of the Internet used for structuring and presenting content for the World Wide Web. As of October 2014 this is the final and complete fifth revision of the HTML standard of the World Wide Web Consortium (W3C). The previous version, HTML 4, was standardised in 1997.

C#

C# (pronounced "See Sharp") is a simple, modern, object-oriented, and type-safe programming language. C# has its roots in the C family of languages and will be immediately familiar to C, C++, Java, and JavaScript programmers.

Scala

Scala is short for “Scalable Language”. This means that Scala grows with you. You can play with it by typing one-line expressions and observing the results. But you can also rely on it for large mission-critical systems, as many companies, including Twitter, LinkedIn, and Intel, do. To some, Scala feels like a scripting language. Its syntax is concise and low ceremony; its types get out of the way because the compiler can infer them.

Elixir

Elixir leverages the Erlang VM, known for running low-latency, distributed and fault-tolerant systems, while also being successfully used in web development and the embedded software domain.
