Running Kafka jobs using Docker
HDFS, or Hadoop Distributed File System, is a distributed file system designed to store and process large datasets using commodity hardware. It is part of the Apache Hadoop ecosystem and is widely used in big data processing. HDFS uses a master-slave architecture with one NameNode and multiple DataNodes. The NameNode manages the file system metadata, while the DataNodes store the actual data. This allows for scalable and fault-tolerant data storage and processing. HDFS is optimized for batch processing and sequential reads, making it well-suited for applications like log analysis, data warehousing, and machine learning. However, it is not well suited for random writes and low-latency data access. HDFS is a critical component of the Hadoop ecosystem and is used by many big data applications. Its scalable and fault-tolerant design makes it a reliable choice for storing and processing large datasets. Overall, HDFS plays a crucial role in the world of big data and is an essential tool for data engineers and analysts.
Kafka is an open-source distributed event streaming platform designed for handling real-time data feeds. Originally developed by LinkedIn and now part of the Apache Software Foundation, Kafka is designed for high throughput, low latency, and fault tolerance. At a high level, Kafka allows producers to write streams of records to a set of topics, which are partitioned and distributed across a cluster of nodes. Consumers can then read from one or more topics and process the records in real time. Kafka is horizontally scalable, meaning that it can handle large volumes of data by adding more nodes to the cluster. Kafka's key features include distributed architecture, high throughput, low-latency, durability, and scalability. It is used in a wide variety of use cases, including stream processing, website activity tracking, log aggregation, real-time analytics, and more. Kafka's popularity has grown rapidly over the years and has become a standard tool for building real-time data pipelines in many industries. Overall, Kafka is an essential component of many big data architectures and plays a crucial role in the world of real-time data processing.
File | Purpose |
---|---|
docker-compose.yml | Docker compose with the infrastructure required to run the Hadoop cluster. |
requirements.txt | Python requirements file. |
app/test_hdfs.py | Python script that tests writing data into HDFS. |
app/test_producer.py | Python script that tests writing data into Kafka. |
app/test_consumer.py | Python script that tests consuming data from Kafka. |
- Docker Hadoop
- HDFS Simple Docker Installation Guide for Data Science Workflow
- Set Up Containerize and Test a Single Hadoop Cluster using Docker and Docker compose=
- Spark Docker
- Hadoop Namenode
- Apache ZooKeeper
- Kakfa Stack Docker Compose
- How to Use Kafka Connect
- Configuration example for writing data to HDFS
- Docker Kafka Connect image for the Confluent Open Source Platform using Oracle JDK
- Getting Started with Apache Kafka and Apache Flume (Import data to HDFS
- Hadoop-spark-kafka-zookeeper docker compose
docker rm -f $(docker ps -a -q)
docker volume rm $(docker volume ls -q)
docker-compose up
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f87a832960b bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 54 seconds 0.0.0.0:8088->8088/tcp yarn
51da2508f5b8 bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 55 seconds (healthy) 0.0.0.0:8188->8188/tcp historyserver
ec544695c49a bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 56 seconds (healthy) 0.0.0.0:8042->8042/tcp nodemanager
810f87434b2f bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 56 seconds (healthy) 0.0.0.0:9864->9864/tcp datenode1
ca5186635150 bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 56 seconds (healthy) 0.0.0.0:9000->9000/tcp, 0.0.0.0:9870->9870/tcp namenode
beed8502828c bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 "/entrypoint.sh /r..." 12 hours ago Up 55 seconds (healthy) 0.0.0.0:9865->9864/tcp datenode2
[...]
The -L
flag allows redirections. By default, the namenode redirects the request to any of the datanodes.
docker exec -it namenode /bin/bash
curl -L -i -X PUT "http://127.0.0.1:9870/webhdfs/v1/data/martin/lorem-ipsum.txt?op=CREATE" -d 'testing'
HTTP/1.1 307 Temporary Redirect
Date: Thu, 30 Mar 2023 00:40:44 GMT
Cache-Control: no-cache
Expires: Thu, 30 Mar 2023 00:40:44 GMT
Date: Thu, 30 Mar 2023 00:40:44 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Location: http://datanode2.martincastroalvarez.com:9864/webhdfs/v1/data/martin/lorem-ipsum.txt?op=CREATE&namenoderpcaddress=namenode:9000&createflag=&createparent=true&overwrite=false
Content-Type: application/octet-stream
Content-Length: 0
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Location: hdfs://namenode:9000/data/martin/lorem-ipsum.txt
Content-Length: 0
Access-Control-Allow-Origin: *
Connection: close
docker exec -it namenode /bin/bash
hdfs dfs -ls /
Found 1 items
drwxr-xr-x - root supergroup 0 2023-03-03 14:15 /rmstate
docker exec -it namenode /bin/bash
hdfs dfs -mkdir -p /user/root
hdfs dfs -ls /
Found 2 items
drwxr-xr-x - root supergroup 0 2023-03-03 14:15 /rmstate
drwxr-xr-x - root supergroup 0 2023-03-03 14:17 /user
docker exec -it namenode /bin/bash
echo "lorem" > /tmp/hadoop.txt
hdfs dfs -put ./input/* input
hdfs dfs -ls /user/
Found 2 items
-rw-r--r-- 3 root supergroup 6 2023-03-03 14:20 /user/hadoop.txt
drwxr-xr-x - root supergroup 0 2023-03-03 14:17 /user/root
docker exec -it namenode /bin/bash
hdfs dfs -cat /user/hadoop.txt
lorem
Checking the status of the NameNode at http://127.0.0.1:9870/dfshealth.html
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
python3 app/test_hdfs.py
WARNING: Replacing datanode2 with localhost.
201 Created
WARNING: Replacing datanode2 with localhost.
WARNING: Replacing datanode2 with localhost.
a.txt
data
hello.txt
test2.json
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
python3 app/test_producer.py
Producer: <cimpl.Producer object at 0x10fd311a0>
Sent!
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
python3 app/test_consumer.py
Consumer: <cimpl.Consumer object at 0x10f5d91a0>
Polling
Polling
Polling
Polling
Received: <cimpl.Message object at 0x10f3cff40>
Received message: {"name": "John", "age": 30, "city": "New York"}