This project implements a data pipeline that streams Reddit comments from the 'Politics' subreddit, processes the data using Kafka and Apache Spark, and stores the results in MongoDB. The architecture is designed for real-time data collection and processing.
- Reddit Stream: Streams comments from the 'Politics' subreddit using PRAW (the Python Reddit API Wrapper).
- Kafka: Acts as the message broker for decoupling the Reddit data producer and consumer.
- Spark: Reads messages from Kafka, processes them, and writes them to MongoDB.
- MongoDB: Stores the processed data in the `reddit_db` database and `comments` collection.
- Python 3.x
- Apache Kafka (Local or remote setup)
- Apache Spark 3.x with MongoDB Spark Connector
- MongoDB (MongoDB Atlas or local instance)
The following Python libraries are required:
- `praw`: Python Reddit API Wrapper
- `confluent_kafka`: Kafka producer and consumer client
- `pyspark`: Apache Spark for streaming and data processing
- `pymongo`: MongoDB driver for Python
- `json`: For working with JSON data (part of the Python standard library, so it does not need to be installed)
To install the dependencies, run:
```bash
pip install praw confluent_kafka pyspark pymongo
```
- Set up a Kafka cluster locally or use a managed Kafka service.
- Make sure Kafka is running and accessible. The default broker address in the code is `kafka:9092`. (An optional topic-creation sketch follows below.)
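If your broker does not auto-create topics, you can create `REDDIT_TOPIC` up front. The sketch below uses `confluent_kafka`'s admin client with the project's default broker address and topic name; adjust both if your setup differs.

```python
# Optional: create the REDDIT_TOPIC topic ahead of time
# (skip this if your broker auto-creates topics).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'kafka:9092'})
futures = admin.create_topics([NewTopic('REDDIT_TOPIC', num_partitions=1, replication_factor=1)])

for topic, future in futures.items():
    try:
        future.result()  # raises if the topic could not be created
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Topic {topic} not created: {exc}")
```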
- Use MongoDB Atlas or a local MongoDB instance.
- Set up a database (`reddit_db`) and a collection (`comments`).
- Update the MongoDB URI in the Spark configuration to connect to your MongoDB instance (a quick connection-check sketch follows).
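A quick way to confirm the URI and database are reachable is a `pymongo` ping. The connection string below is a placeholder; note that MongoDB creates the collection lazily on first insert.

```python
# Quick connectivity check (replace the URI with your own).
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<username>:<password>@cluster0.mongodb.net/reddit_db")
client.admin.command("ping")  # raises if the server is unreachable

db = client["reddit_db"]
print(db.list_collection_names())  # 'comments' appears once data has been written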
Create a file named `constant.py` in the project directory and define the following constants:
```python
# constant.py
client_id = 'your_reddit_client_id'
client_secret = 'your_reddit_client_secret'
user_agent = 'your_user_agent'
```
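For reference, the producer builds its Reddit client from these constants roughly as follows (a sketch, not the exact contents of `reddit_producer.py`):

```python
# Sketch: passing the constants to PRAW to get a read-only Reddit client.
import praw
from constant import client_id, client_secret, user_agent

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
)
```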
The Kafka producer configuration is set in the code to connect to `kafka:9092` by default. Update the configuration if using a different Kafka broker:
```python
# Producer configuration
producer_config = {
    'bootstrap.servers': 'kafka:9092',
}
```
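To verify the broker connection independently of Reddit, you can create a producer from this config and send a test message. The topic name below assumes the project's default `REDDIT_TOPIC`.

```python
# Sketch: create a producer from the config above and send a test message.
import json
from confluent_kafka import Producer

producer_config = {'bootstrap.servers': 'kafka:9092'}
producer = Producer(producer_config)

producer.produce('REDDIT_TOPIC', value=json.dumps({'body': 'test'}).encode('utf-8'))
producer.flush()  # block until delivery completes
```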
Update the MongoDB URI in the Spark session configuration:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FileToMongoDB") \
    .config("spark.mongodb.write.connection.uri", "mongodb+srv://<username>:<password>@cluster0.mongodb.net/reddit_db") \
    .getOrCreate()
```
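The MongoDB Spark Connector also has to be on Spark's classpath. One way to do that (assuming a Spark 3.x / Scala 2.12 build; the connector version shown is an example, so match it to your environment) is via `spark.jars.packages`:

```python
# Sketch: pull in the MongoDB Spark Connector at session startup.
# The connector version is illustrative -- match it to your Spark/Scala build.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("FileToMongoDB")
    .config("spark.jars.packages", "org.mongodb.spark:mongodb-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.write.connection.uri",
            "mongodb+srv://<username>:<password>@cluster0.mongodb.net/reddit_db")
    .getOrCreate()
)
```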
The producer script uses PRAW to stream comments from the `Politics` subreddit:
- Continuously fetches new comments from the subreddit.
- Each comment is serialized to JSON format and sent to the Kafka `REDDIT_TOPIC` topic, as sketched below.
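A minimal sketch of that loop, using the defaults described in this README (the actual `reddit_producer.py` may structure things differently, and the comment fields shown are illustrative):

```python
# Sketch of reddit_producer.py: stream new comments and publish them to Kafka.
import json
import praw
from confluent_kafka import Producer
from constant import client_id, client_secret, user_agent

reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
producer = Producer({'bootstrap.servers': 'kafka:9092'})

# skip_existing=True yields only comments posted after the stream starts.
for comment in reddit.subreddit('politics').stream.comments(skip_existing=True):
    payload = {
        'id': comment.id,
        'author': str(comment.author),
        'body': comment.body,
        'created_utc': comment.created_utc,
    }
    producer.produce('REDDIT_TOPIC', value=json.dumps(payload).encode('utf-8'))
    producer.poll(0)  # serve delivery callbacks without blocking
```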
The consumer reads the comments from the Kafka topic:
- Polls messages from the `REDDIT_TOPIC` topic.
- Each message is deserialized, and the comment data is stored in a file (`output.jsonl`), as sketched below.
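A minimal sketch of such a consumer, writing to `dataset/output.jsonl` as in the project layout further down (the group id and offset settings are illustrative):

```python
# Sketch of kafka_consumer.py: poll REDDIT_TOPIC and append each comment to a JSONL file.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'reddit-consumer',   # illustrative group id
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['REDDIT_TOPIC'])

try:
    with open('dataset/output.jsonl', 'a') as out:
        while True:
            msg = consumer.poll(1.0)  # wait up to 1 second for a message
            if msg is None:
                continue
            if msg.error():
                print(f"Consumer error: {msg.error()}")
                continue
            comment = json.loads(msg.value().decode('utf-8'))
            out.write(json.dumps(comment) + '\n')
            out.flush()               # make the line visible to downstream readers promptly
finally:
    consumer.close()
```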
Spark reads the `output.jsonl` file in real time:
- Processes the data according to the schema defined in the code.
- Writes the processed data to MongoDB using the MongoDB Spark Connector.
The comments are stored in the `comments` collection in the `reddit_db` database.
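A minimal sketch of such a job is below. The schema fields and checkpoint path are assumptions, so match them to what your producer actually sends. Note that Spark's file streaming source picks up newly added files in a watched directory, so the sketch watches `dataset/` rather than a single file.

```python
# Sketch of spark_streaming.py: stream JSON lines from dataset/ and write each
# micro-batch to MongoDB via the MongoDB Spark Connector (10.x).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (
    SparkSession.builder
    .appName("FileToMongoDB")
    .config("spark.mongodb.write.connection.uri",
            "mongodb+srv://<username>:<password>@cluster0.mongodb.net/reddit_db")
    .getOrCreate()
)

# Illustrative schema -- keep it in sync with the producer's payload.
schema = StructType([
    StructField("id", StringType()),
    StructField("author", StringType()),
    StructField("body", StringType()),
    StructField("created_utc", DoubleType()),
])

comments = spark.readStream.schema(schema).json("dataset/")

def write_to_mongo(batch_df, batch_id):
    # Each micro-batch is written as a normal batch job through the connector.
    (batch_df.write
        .format("mongodb")
        .mode("append")
        .option("database", "reddit_db")
        .option("collection", "comments")
        .save())

query = (
    comments.writeStream
    .foreachBatch(write_to_mongo)
    .option("checkpointLocation", "checkpoint/")  # illustrative checkpoint path
    .start()
)
query.awaitTermination()
```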
Run the script that streams comments from Reddit and sends them to Kafka:

```bash
python reddit_producer.py
```

Run the consumer to read from Kafka and save the comments to a local file:

```bash
python kafka_consumer.py
```

Run the Spark streaming job to process the data and store it in MongoDB:

```bash
python spark_streaming.py
```
```
.
├── constant.py          # Configuration for Reddit API
├── kafka_consumer.py    # Kafka consumer to read messages
├── reddit_producer.py   # Reddit producer to fetch comments
├── spark_streaming.py   # Spark streaming job to process and save data
└── dataset/             # Folder where the consumer saves data (output.jsonl)
```
Once the pipeline is running, comments from the 'Politics' subreddit will be continuously streamed, processed, and stored in MongoDB. You can verify the data by querying the `comments` collection in the `reddit_db` database:

```
db.comments.find().pretty()
```
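Or from Python with `pymongo` (same placeholder URI as above):

```python
# Sketch: verify the stored comments from Python instead of the mongo shell.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<username>:<password>@cluster0.mongodb.net/reddit_db")
collection = client["reddit_db"]["comments"]

print(collection.count_documents({}))  # number of comments stored so far
print(collection.find_one())           # inspect one stored document
```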
- Kafka Issues: Make sure Kafka is running and accessible. Ensure that the `REDDIT_TOPIC` topic exists or is created automatically.
- MongoDB Issues: Verify your MongoDB URI and ensure MongoDB is up and running.
- Spark Issues: Check the Spark configuration and ensure the MongoDB Spark Connector is correctly installed.