This repository contains three projects covering stream-style data processing, interactive visualization, and automated exploratory data analysis (EDA). They use Apache Beam for simulated real-time processing, D3.js and PyCaret for visual insights and predictive modeling, and Sweetviz for automated EDA reports. Each project includes documentation, sample code, and links to a Colab notebook and a YouTube tutorial.
- Project Descriptions
- Technologies Used
- How to Use
- Repository Structure
- Colab Notebooks and YouTube Tutorials
Objective: Demonstrate the use of Apache Beam for processing movie data with features like filtering, transforming, and windowing operations in a simulated real-time streaming environment.
- Composite Transforms: Encapsulates multiple processing steps into reusable components.
- ParDo: Applies custom logic to individual elements.
- Windowing and Triggers: Groups data into 60-second intervals for processing.
- Simulated Streaming: Generates dynamic streaming data for real-time applications.
- Dynamic Outputs: Writes processed results to Google Drive for easy access.
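As a rough illustration of what fixed 60-second windowing computes, here is a plain-Python sketch. It does not use the Beam API; the function names and the sample events are hypothetical and exist only to show how a timestamp maps to its window.

```python
from collections import defaultdict

WINDOW_SIZE_SECONDS = 60  # matches the 60-second interval described above


def window_start(timestamp, size=WINDOW_SIZE_SECONDS):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_window(events, size=WINDOW_SIZE_SECONDS):
    """Group (timestamp, value) pairs by the fixed window they fall into."""
    windows = defaultdict(list)
    for timestamp, value in events:
        windows[window_start(timestamp, size)].append(value)
    return dict(windows)


# Hypothetical events: (seconds since stream start, movie title)
events = [(5, "Title_1"), (42, "Title_2"), (61, "Title_3"), (130, "Title_4")]
print(group_by_window(events))
```

Elements at 5 s and 42 s share the window starting at 0, while 61 s and 130 s fall into the windows starting at 60 and 120. Beam's fixed windows apply the same bucketing to element timestamps.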
- Simulate streaming movie data.
- Parse raw CSV data into structured dictionaries.
- Apply windowing to group data by fixed intervals.
- Use composite transforms to filter and format movie data.
- Save results as a text file in Google Drive.
A processed output file (movies_output.txt) containing movie titles and their respective countries:
Title_1: Country_1
Title_2: Country_2
...
Objective: Analyze the 2019 World Happiness Report dataset to uncover insights through interactive visualizations and predictive modeling.
- Interactive Visualizations with D3.js:
- Scatter Plot: Happiness Score vs GDP per Capita.
- Bar Chart: Top 10 happiest countries.
- Heatmap: Correlation between socio-economic factors.
- Predictive Modeling with PyCaret:
- Automates regression model comparison, tuning, and evaluation.
- Identifies the best predictors for happiness scores.
The dataset includes socio-economic indicators like:
- GDP per Capita
- Social Support
- Freedom
- Life Expectancy
- Generosity
- Corruption
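A correlation heatmap over these indicators needs pairwise correlation values in a shape the browser can consume. The sketch below computes Pearson correlations in plain Python and flattens them into a list of records, which is a common input shape for D3 heatmaps. The record keys (`row`, `col`, `value`) and the numbers are assumptions; the toy values are not taken from the 2019 report.

```python
import json
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_cells(columns):
    """Flatten pairwise correlations into [{row, col, value}, ...] records."""
    names = list(columns)
    return [
        {"row": a, "col": b, "value": round(pearson(columns[a], columns[b]), 3)}
        for a in names
        for b in names
    ]


# Toy values for illustration only; they are NOT taken from the 2019 report.
toy_columns = {
    "GDP per Capita": [1.0, 2.0, 3.0, 4.0],
    "Social Support": [2.0, 4.0, 6.0, 8.0],
    "Generosity": [4.0, 3.0, 2.0, 1.0],
}
print(json.dumps(correlation_cells(toy_columns), indent=2))
```

Each record maps to one heatmap cell, so the D3 side only needs to bind the list to rectangles and color them by `value`.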
- Perform EDA with D3.js for interactive visual insights.
- Use PyCaret to build and evaluate regression models predicting happiness scores.
Objective: Automate the EDA of the Netflix dataset to uncover trends and patterns using Sweetviz, a Python library for visually appealing reports.
- Generate interactive reports summarizing data distributions and correlations.
- Identify trends in content type, ratings, and release years.
- Content Type: Movies dominate Netflix’s catalog.
- Popular Ratings: TV-MA and TV-14 are most common.
- Release Year Trends: A significant rise in content production post-2010.
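Findings like these can be cross-checked by hand with simple counts. The sketch below uses only the standard library and does not use Sweetviz; the column names (`type`, `rating`, `release_year`) are assumptions about the dataset's schema, and the sample rows are toy data, not real Netflix records.

```python
import csv
import io
from collections import Counter

# Toy rows for illustration only; column names are assumed, not confirmed.
SAMPLE_CSV = """type,rating,release_year
Movie,TV-MA,2018
TV Show,TV-14,2016
Movie,TV-14,2009
Movie,TV-MA,2019
"""


def summarize(csv_text):
    """Count rows by content type, by rating, and by release era."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "type": Counter(row["type"] for row in rows),
        "rating": Counter(row["rating"] for row in rows),
        "era": Counter(
            "2010 or later" if int(row["release_year"]) >= 2010 else "before 2010"
            for row in rows
        ),
    }


summary = summarize(SAMPLE_CSV)
print(summary["type"].most_common())
```

To run the same counts on the real dataset, pass the file's text to `summarize` instead of `SAMPLE_CSV`, adjusting the column names if they differ.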
An HTML report (Netflix_Analysis.html) with an in-depth summary of the dataset.
- Apache Beam: Distributed data processing framework.
- D3.js: JavaScript library for interactive, browser-based visualizations.
- Sweetviz: Python library for automated EDA.
- PyCaret: Low-code machine learning library for regression modeling.
- Google Colab: Cloud-based environment for running Python and JavaScript code.
- Pandas, Matplotlib, Seaborn: Data manipulation and static visualizations.
- Clone the Repository:
  git clone https://github.com/subhashpolisetti/apache-beam-and-eda-projects.git
- Set Up Dependencies:
  - Install Python libraries:
    pip install apache-beam pandas matplotlib seaborn sweetviz pycaret
- Run Projects:
  - Open the Colab notebooks for each project and follow the instructions provided.
apache-beam-and-eda-projects/
├── apache-beam-processing/
│ ├── Composite_Transforms_and_Real_Time_Data_with_Apache_Beam.ipynb
│ ├── README.md
│ └── movies_output.txt-00000-of-00001
│
├── eda-d3js-visualizations/
│ ├── World_Happiness_Report_2019_Advanced_EDA_with_D3_js_Visualizations.ipynb
│ ├── bar_chart_html.html
│ ├── heatmap_html.html
│ ├── scatter_plot.html
│ └── README.md
│
├── auto-eda-analysis/
│ ├── Netflix_EDA_with_Sweetviz.ipynb
│ ├── Netflix_Analysis.html
│ └── README.md
│
└── README.md
Project | Colab Notebook | YouTube Tutorial |
---|---|---|
Apache Beam: Real-Time Movie Data Processing | Open in Colab | Watch Tutorial |
World Happiness Report 2019: Advanced EDA with D3.js | Open in Colab | Watch Tutorial |
Netflix Dataset Analysis Using Sweetviz | Open in Colab | Watch Tutorial |