Pandas

My process is like this: I receive data once a month, either from Google BigQuery or as parquet files from Azure Blob Storage. A script does some cleaning and then stores the result as partitioned parquet files, because the next process cannot load all of the data into memory.

The next process performs a heavy computation in parallel (per partition) and stores three intermediate versions as parquet files: two are used for statistics, and the third is filtered to create the final files.

I build a report from the two statistics files in a Jupyter notebook and convert it to HTML.

Everything is done with vanilla Python and Pandas.
Sometimes I may get the data in a different format.
The cloud service is Microsoft Azure.
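Since the incoming format can change between deliveries, one option is to normalise column names before any cleaning runs. The sketch below is only an illustration: the alias table, the required columns, and the name `normalise_columns` are all hypothetical.

```python
import pandas as pd

# Hypothetical mapping from column names seen in deliveries to canonical names.
COLUMN_ALIASES = {
    "CustomerID": "customer_id",
    "cust_id": "customer_id",
    "EventDate": "event_date",
    "date": "event_date",
}
# Hypothetical set of columns the rest of the pipeline depends on.
REQUIRED_COLUMNS = ["customer_id", "event_date"]


def normalise_columns(df, aliases=COLUMN_ALIASES, required=REQUIRED_COLUMNS):
    """Rename known aliases and fail early if a required column is missing."""
    renamed = df.rename(columns=aliases)
    missing = [col for col in required if col not in renamed.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed
```

Failing early with an explicit error keeps an unexpected format from reaching the expensive computation step.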

What I'm considering is the following:

Get the data with Kafka or with native Python, do the first processing step, and store the data in Apache Druid. The second processing step would be done with Apache Spark, reading the data from Druid.

The intermediate states could be stored in Druid too, and visualization would be done with Apache Superset.

Pandas Discussions


Simone Sadak

Jan 21, 2023

Needs advice on

Google BigQuery

Azure Blob Storage

Jupyter


Guillaume Simler

Sep 9, 2019

Needs advice on

Jupyter

Anaconda

Pandas

Jupyter Anaconda Pandas IPython

A great way to prototype your data analytics modules. The packages are simple and user-friendly, and migrating from IPython to plain Python is fairly simple: it takes a lot of cleanup, but nothing more.

The downside appears when you want to streamline your production system or run CI with your Anaconda environment: