Running Sparkmagic, Livy and Spark

This appliance consists of a Jupyter notebook container with Sparkmagic, a Livy endpoint and a Spark cluster. The Spark cluster runs in standalone mode, with one master and a number of workers. Hence, the appliance comprises four kinds of containers: a Sparkmagic Jupyter notebook, a Livy endpoint, a Spark master and an arbitrary number of Spark workers.

The Livy endpoint uses the heliumdatacommons/livy image and exposes port 8998 on the container for HTTP requests from users. In addition, it needs to reach the Spark master in order to run jobs on the Spark cluster. The heliumdatacommons/livy image takes a single argument: the URL of the Spark master. The Livy endpoint container is specified as below:

{
    "id": "livy",
    "type": "service",
    "image": "heliumdatacommons/livy",
    "resources": {
        "cpus": 1,
        "mem": 4096
    },
    "network_mode": "container",
    "ports": [
        { "container_port": 8998 }
    ],
    "args": [
        "spark://@spark-master:7077"
    ],
    "dependencies": [ "spark-master" ]
}

Note: Livy requires at least 2 GB of RAM to function normally and support multiple active sessions. With less memory, sessions quickly run out of memory and exit unexpectedly without leaving a traceable log. In testing, 6 active sessions could be created with 4 GB of RAM allocated.
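Once the appliance is running, the Livy REST API can also be exercised directly. For example, the following request creates a new PySpark session, using the host and port that PIVOT reports for the livy container (shown later in this document):

curl -X POST -H "Content-Type: application/json" -d '{"kind": "pyspark"}' http://<livy-host>:<livy-port>/sessions

The Sparkmagic notebook described next issues requests like this on the user's behalf.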

The Sparkmagic Jupyter notebook runs on top of the Livy endpoint and interacts with the Spark cluster underneath, as specified below:

{
    "id": "sparkmagic",
    "type": "service",
    "network_mode": "bridge",
    "image": "heliumdatacommons/sparkmagic",
    "resources": {
        "cpus": 1,
        "mem": 2048
    },
    "is_privileged": true,
    "ports": [
      {"container_port": 8888 }
    ],
    "args": [
      "--NotebookApp.token=\"\""
    ],
    "env": {
        "LIVY_HOST": "@livy"
    },
    "dependencies": [ "livy" ]
}

It uses the heliumdatacommons/sparkmagic image, which reads the optional environment variable LIVY_HOST and uses it to point the Sparkmagic configuration at the Livy endpoint.
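In effect, the notebook container ends up with a Sparkmagic configuration whose Livy URL resolves to LIVY_HOST. A minimal sketch of the relevant part of such a configuration (typically ~/.sparkmagic/config.json; the exact file generated by the image may differ) looks like this:

{
    "kernel_python_credentials": {
        "username": "",
        "password": "",
        "url": "http://<livy-host>:8998",
        "auth": "None"
    }
}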

The Spark master manages all the resources on the Spark workers for running jobs. It exposes two ports - port 7077 for cluster communication and port 8080 for HTTP requests. The status of the cluster can be checked in a browser on port 8080. The container uses the heliumdatacommons/spark-master image. An example of a Spark master is shown below:

{
    "id": "spark-master",
    "type": "service",
    "image": "heliumdatacommons/spark-master",
    "resources": {
        "cpus": 1,
        "mem": 1024
    },
    "network_mode": "container",
    "ports": [
        { "container_port": 8080 },
        { "container_port": 7077 }
    ]
}
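As noted above, the cluster status is served on port 8080. Once PIVOT reports the master's endpoint (see the query example further below), the status page can be opened in a browser; the standalone master also serves the same status as JSON, so it can be checked from the command line, for example:

curl http://<spark-master-host>:<spark-master-port>/json/

Here <spark-master-host>:<spark-master-port> stands for the host and mapped port that PIVOT reports for container port 8080.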

The Spark workers provide the resources for running jobs. Each worker uses the heliumdatacommons/helium-spark-r-worker image and needs to connect to the Spark master. The Spark workers scale horizontally by increasing or decreasing the number of instances. An example of the Spark workers is shown below:

{
    "id": "spark-worker",
    "type": "service",
    "image": "heliumdatacommons/helium-spark-r-worker",
    "instances": 1,
    "resources": {
        "cpus": 4,
        "mem": 4096
    },
    "network_mode": "container",
    "args": [
        "@spark-master:7077"
    ],
    "env": {
        "PYSPARK_PYTHON": "/usr/bin/python"
    }
}
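For example, running three workers instead of one only requires changing the instances field in the spec above:

    "instances": 3,

The remaining fields stay the same; the appliance then runs three identical worker containers.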

The containers depend on one another - for instance, the Livy endpoint should only start after the Spark master is running, and the Sparkmagic notebook only after Livy. To ensure the containers are provisioned in order, these dependencies are specified explicitly in the dependencies field of each container. Putting them together, the full appliance specification looks like this:

{
    "id": "spark-livy",
    "containers": [
        {
            "id": "livy",
            "type": "service",
            "image": "heliumdatacommons/livy",
            "resources": {
                "cpus": 1,
                "mem": 4096
            },
            "network_mode": "container",
            "ports": [
                { "container_port": 8998 }
            ],
            "args": [
                "spark://@spark-master:7077"
            ],
            "dependencies": [ "spark-master" ]
        },
        {
            "id": "sparkmagic",
            "type": "service",
            "network_mode": "bridge",
            "image": "heliumdatacommons/sparkmagic",
            "resources": {
                "cpus": 1,
                "mem": 2048
            },
            "is_privileged": true,
            "ports": [
                { "container_port": 8888 }
            ],
            "args": [
                "--NotebookApp.token=\"\""
            ],
            "env": {
                "LIVY_HOST": "@livy"
            },
            "dependencies": [ "livy" ]
        },
        {
            "id": "spark-master",
            "type": "service",
            "image": "heliumdatacommons/spark-master",
            "resources": {
                "cpus": 1,
                "mem": 1024
            },
            "network_mode": "container",
            "ports": [
                { "container_port": 8080 },
                { "container_port": 7077 }
            ]
        },
        {
            "id": "spark-worker",
            "type": "service",
            "image": "heliumdatacommons/helium-spark-r-worker",
            "instances": 1,
            "resources": {
                "cpus": 4,
                "mem": 4096
            },
            "network_mode": "container",
            "args": [
                "@spark-master:7077"
            ],
            "env": {
                "PYSPARK_PYTHON": "/usr/bin/python"
            }
        }
    ]
}

To run the appliance, submit the specification to PIVOT with a POST request (here the specification is saved as appliance.json):

curl -X POST -d @appliance.json http://<pivot-url>:<pivot-port>/appliance

To get the state of the appliance, query PIVOT with a GET request:

curl -X GET http://<pivot-url>:<pivot-port>/appliance/spark-livy

The response contains information about the endpoints where users can interact with the containers. For instance, according to the endpoint information of the livy container below, the Livy endpoint can be reached at http://54.227.40.165:22463.

{
    "id": "spark-livy",
    "containers": [
        {
            "id": "sparkmagic",
            "endpoints": [
                {
                    "host": "18.220.249.88",
                    "host_port": 12094,
                    "container_port": 8888,
                    "protocol": "tcp"
                }
            ],
            "state": "running",
            ...
        },
        {
            "id": "livy",
            "endpoints": [
                {
                    "host": "54.227.40.165",
                    "host_port": 22463,
                    "container_port": 8998,
                    "protocol": "tcp"
                }
            ],
            "state": "running",
            ...
        },
        {
            "id": "spark-master",
            "endpoints": [
                {
                    "host": "18.220.249.88",
                    "host_port": 12179,
                    "container_port": 8080,
                    "protocol": "tcp"
                },
                {
                    "host": "18.220.249.88",
                    "host_port": 12180,
                    "container_port": 7077,
                    "protocol": "tcp"
                }
            ],
            "state": "running",
            ...
        },
        {
            "id": "spark-worker",
            "state": "running",
            ...
        }
    ]
}

To delete the appliance:

curl -X DELETE http://<pivot-url>:<pivot-port>/appliance/spark-livy

A visualization of the appliance is available at http://<pivot-url>:<pivot-port>/appliance/spark-livy/ui, which looks like the following:

spark-livy-app

Visit the Sparkmagic endpoint in a browser to open the notebook and create Spark sessions.

sparkmagic
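Within the notebook, selecting one of the Sparkmagic wrapper kernels (e.g. PySpark) routes each cell through the Livy endpoint to the Spark cluster, and running the first cell implicitly starts a Livy session. A minimal sketch of such a cell, assuming the sc SparkContext variable that Livy provides in PySpark sessions:

# Executed on the Spark cluster via Livy; `sc` is the SparkContext
# injected into the PySpark session.
rdd = sc.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())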