Hadoop

Associate Data Engineer at Virtuosoft
Needs advice on Apache Hive and OpenRefine

I've been going over the documentation and couldn't find answers to different questions like:

Apache Hive is built on top of Hadoop, which means that if I wanted to scale it up I could scale either horizontally or vertically. But if I want to scale OpenRefine to handle more data, how can this be achieved? The only thing I could find was to allocate more memory, like 2 or 4GB, but with this approach we would eventually run out of memory to allot. Any thoughts on this?

Secondly, Hadoop has MapReduce, meaning a task is split across many mappers running in parallel, which in turn increases the processing speed. Is there a similar mechanism in OpenRefine, or does it only have a single processing unit (as it is running locally)? Thoughts?
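
To make the mapper idea above concrete, here is a minimal sketch of the map/reduce pattern in plain Python. It is not Hadoop's API: `map_chunk` and `word_count` are hypothetical names, word counting is just an illustrative task, and a thread pool stands in for a cluster of mappers, so it shows only the structure (split, map in parallel, merge), not a real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    return Counter(word for line in lines for word in line.lower().split())


def word_count(lines, workers=4):
    """Split the input into chunks, map them concurrently, then reduce."""
    # Split: give each worker every `workers`-th line.
    chunks = [lines[i::workers] for i in range(workers)]
    # Map: each chunk is counted independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce: merge the partial counts into one result.
    return reduce(lambda left, right: left + right, partials, Counter())


counts = word_count(["hive hadoop hive", "openrefine hadoop"])
```

Because each chunk is processed independently, the map step is the part that Hadoop can spread over many machines.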

5 upvotes·19K views
Replies (2)
Developer Advocate at Superface

From my point of view, OpenRefine and Apache Hive serve completely different purposes. OpenRefine is intended for interactive cleaning of messy data locally. You could work with its libraries to use some OpenRefine features as part of your data pipeline (there are pointers in the FAQ), but OpenRefine in general is intended for single-user local operation.

I can't recommend a particular alternative without a better understanding of your use case. But if you are looking for an interactive tool to work with big data at scale, take a look at notebook environments like Jupyter, Databricks, or Deepnote. If you are building a data processing pipeline, also consider Apache Spark.

Edit: Fixed references from Hadoop to Hive, which is actually closer to Spark.

5 upvotes·331K views
Director - NGO "Informational Culture" / Ambassador - OKFN Russia at Infoculture

I don't think that OpenRefine and Apache Hive are comparable for such tasks. If you need to clean up and process huge amounts of data (big data), I would recommend using ClickHouse instead and doing the data processing with SQL queries, not manually.

OpenRefine is a great tool with great limitations. It doesn't handle big datasets, it doesn't scale, and it doesn't handle JSON documents with sub-documents.

2 upvotes·4.6K views
Needs advice on Hadoop, MarkLogic and Snowflake

For a property and casualty insurance company, we currently use MarkLogic and Hadoop for our raw data lake. We are trying to figure out how Snowflake fits into the picture. Does anybody have good suggestions or best practices for when to use each and what data to store in MarkLogic versus Snowflake versus Hadoop? Or are all three of these platforms redundant with one another?

4 upvotes·121.3K views
Needs advice on Airflow and Apache NiFi

I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?

4 upvotes·643K views
Replies (2)
Recommends Airflow

I have been using Airflow for more than 2 years now and haven't thought about moving to any other platform. Coming back to your requirements, Airflow fits pretty well.

  1. It has an excellent way to manage dependent tasks using a DAG (Directed Acyclic Graph). You can create a DAG with tasks and define which task depends on which, and Airflow takes care of running a task, or not running it if the parent task fails.
  2. Integrations: the Airflow community has implemented various integrations with different cloud services, Hadoop, and Spark, as well as Jira. Though it doesn't have a built-in integration for Informatica, you can run your own service in Airflow as a task (which can handle all Informatica-related operations).
  3. It's very easy to find, monitor, and manage jobs/pipelines, as Airflow provides a great consolidated UI.
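
The dependency handling described above (run tasks in order, and don't run a task whose parent failed) can be sketched in plain Python. This is only an illustration of the idea, not Airflow's API: `run_pipeline`, the task names, and the status labels are all hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for Airflow's scheduler.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions):
    """Run tasks in dependency order; do not run a task whose parent did not succeed.

    dependencies maps each task name to the set of tasks it depends on.
    actions maps each task name to a zero-argument callable.
    """
    status = {}
    # static_order() yields every task after all of its parents.
    for task in TopologicalSorter(dependencies).static_order():
        parents = dependencies.get(task, ())
        if any(status[parent] != "success" for parent in parents):
            status[task] = "not_run"  # label chosen for this sketch
            continue
        try:
            actions[task]()
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status


def simulated_failure():
    raise RuntimeError("simulated failure")


# Hypothetical pipeline modelled on the question: Informatica job, then SQL, then Jira.
dependencies = {
    "informatica_job": set(),
    "sql_task": {"informatica_job"},
    "jira_task": {"sql_task"},
}

all_ok = run_pipeline(dependencies, {name: (lambda: None) for name in dependencies})
sql_broken = run_pipeline(
    dependencies,
    {"informatica_job": lambda: None, "sql_task": simulated_failure, "jira_task": lambda: None},
)
```

In the second run the Jira task is never executed because its parent failed, which is the behaviour you would rely on for a regression-testing pipeline.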
5 upvotes·20.5K views
Sales Executive at Astronomer
Recommends Airflow

Hey Sathya! With Airflow, you are able to create custom hooks and operators to trigger various types of jobs. There may be ones that already exist for Informatica, but I am unsure. I would be happy to connect to discuss further if you are interested: josh@astronomer.io

20.4K views