Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.
In this Code Pattern we will use Apache SystemML running on IBM Watson Studio to perform a Machine Learning exercise. Watson Studio is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data. Apache SystemML is a flexible machine learning platform that is optimized to scale with large data sets.
When you have completed this Code Pattern, you will understand how to:
- Use Jupyter Notebooks to load, visualize, and analyze data
- Run Notebooks in IBM Watson Studio
- Leverage Apache SystemML as a machine learning library
The intended audience for this Code Pattern is application developers and other stakeholders who want to apply data science quickly and effectively to machine learning problems using Apache SystemML. Although Apache SystemML provides various out-of-the-box algorithms to experiment with, this Code Pattern uses a Linear Regression example to demonstrate the ease and power of Apache SystemML. Users can also develop their own algorithms using Apache SystemML's Declarative Machine Learning (DML) language, which has an R- or Python-like syntax, or customize any algorithm provided in the package. For more information about additional functionality, documentation, and the roadmap, please visit Apache SystemML.
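To make concrete what a linear regression fit computes, here is a minimal plain-Python sketch of ordinary least squares for a single feature. It does not use SystemML and is not the notebook's code; the function names `fit_line` and `predict` and the sample points are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of the x/y cross deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
print(predict(intercept, slope, 4.0))
```

The same model, expressed through a library such as SystemML, is what lets the fit scale beyond data that fits in one machine's memory.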
- Load the provided notebook onto the IBM Watson Studio platform.
- The notebook interacts with an Apache Spark instance.
- A sample big data dataset is loaded into the Jupyter Notebook.
- To perform machine learning, Apache SystemML is used atop Apache Spark.
- IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
- IBM Analytics for Apache Spark: An open source cluster computing framework optimized for extremely fast and large scale data processing.
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
- Apache SystemML: An open source machine learning library. It allows data scientists to express machine learning algorithms in a declarative language (DML) with R- or Python-like syntax.
Typically, a data scientist writes an algorithm against a subset of the dataset that fits in the disk and memory of a workstation (laptop). Once satisfied with the results on the workstation, the data scientist asks a systems engineer to implement the same algorithm in a distributed environment against a much bigger dataset. It may take weeks, if not months, of back and forth between the data scientist and the systems engineer to get an equivalent algorithm implemented in the distributed environment. Because the algorithm is reimplemented by hand, bugs can be introduced along the way. When the final algorithm is ready, it cannot be determined whether it is equivalent to the algorithm that was implemented to run on the workstation. It is also hard to determine whether any issues found are due to the distributed implementation or to the original algorithm itself.
This is where SystemML comes in. With SystemML, a data scientist has to write an algorithm only once. SystemML's built-in optimizer gives any algorithm a dynamic runtime plan based on data characteristics and the runtime environment, such as a single machine or a cluster with multiple nodes. The data scientist saves a lot of time and avoids the errors that can be introduced when an algorithm written for a single machine is converted to run in a distributed environment.
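The idea of choosing a runtime plan from data characteristics can be illustrated with a deliberately simplified sketch. This is not SystemML's optimizer or cost model; the functions `estimate_dense_bytes` and `choose_backend`, the 8-bytes-per-value assumption, and the memory budget are all hypothetical and only show the shape of the decision.

```python
def estimate_dense_bytes(rows, cols, bytes_per_value=8):
    """Rough size of a dense matrix of doubles (an assumption for this sketch)."""
    return rows * cols * bytes_per_value


def choose_backend(rows, cols, memory_budget_bytes):
    """Pick where to run based only on whether the data fits the budget."""
    needed = estimate_dense_bytes(rows, cols)
    return "single-node" if needed <= memory_budget_bytes else "distributed"


# Hypothetical 2 GiB budget for in-memory execution on one machine.
budget = 2 * 1024 ** 3

print(choose_backend(10_000, 100, budget))       # small matrix, about 8 MB
print(choose_backend(100_000_000, 100, budget))  # large matrix, about 80 GB
```

The point of the sketch is that the algorithm itself is written once; only the plan chosen for running it changes with the size of the data and the environment.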
Follow these steps to set up and run this Code Pattern. These steps are described in detail below.
Sign up for IBM's Watson Studio. When you create a project in Watson Studio, a free-tier Object Storage service is created in your IBM Cloud account. Take note of your service names, as you will need to select them in the following steps.
Note: When creating your Object Storage service, select the Free storage type in order to avoid having to pay an upgrade fee.
To create these services:
- Login or create your IBM Cloud account.
- Create your Spark service by selecting the service type Apache Spark. If the name has not already been used, name your service Apache Spark so that you can keep track of it.
- Create your Object Storage service by selecting the service type Cloud Object Storage. If the name has not already been used, name your service Watson Studio-ObjectStorage so that you can keep track of it.
Note: When creating your Object Storage service, select the Swift storage type in order to avoid having to pay an upgrade fee.
Take note of your service names as you will need to select them in the following steps.
Create the Notebook:
- In Watson Studio, click on Create notebook to create a notebook.
- Create a project if necessary, provisioning an object storage service if required.
- In the Assets tab, select the Create notebook option.
- Select the From URL tab.
- Enter a name for the notebook.
- Optionally, enter a description for the notebook.
- For Notebook URL enter: https://github.com/IBM/SystemML_Usage/blob/master/notebooks/Machine-Learning-Using-Apache-SystemML.ipynb
- Select the free Anaconda runtime.
- Click Create Notebook.
When a notebook is executed, each code cell in the notebook is executed in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
- A blank, which indicates that the cell has never been executed.
- A number, which represents the relative order in which this code step was executed.
- A *, which indicates that the cell is currently executing.
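The blank and numbered states of the tag are also recorded in the notebook file itself. As a rough sketch, assuming the standard Jupyter notebook JSON layout (a `cells` list whose code cells carry an `execution_count` that is either null or a number), the tags can be reconstructed as follows. The helper name `cell_tags` and the sample notebook are made up for illustration; the `*` state exists only while a cell is running, so this sketch does not cover it.

```python
import json


def cell_tags(notebook):
    """Return the In [x]: tag for each code cell, in notebook order."""
    tags = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown cells have no execution tag
        count = cell.get("execution_count")
        tags.append("In [ ]:" if count is None else f"In [{count}]:")
    return tags


# A tiny hypothetical notebook, in the same JSON form as an .ipynb file.
sample = json.loads("""
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 2,
     "outputs": [], "source": ["x = 1"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "outputs": [], "source": ["print(x)"]}
  ]
}
""")

for tag in cell_tags(sample):
    print(tag)
```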
There are several ways to execute the code cells in your notebook:
- One cell at a time.
  - Select the cell, and then press the Play button in the toolbar.
- Batch mode, in sequential order.
  - From the Cell menu bar, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
- At a scheduled time.
  - Press the Schedule button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.
Under the File menu, there are several ways to save your notebook:
- Save will simply save the current state of your notebook, without any version information.
- Save Version will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the Revert To Version menu item.
You can share your notebook by selecting the Share button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a “read-only” version of your notebook. You have several options to specify exactly what you want shared from your notebook:
- Only text and output: will remove all code cells from the notebook view.
- All content excluding sensitive code cells: will remove any code cells that contain a sensitive tag. For example, # @hidden_cell is used to protect your dashDB credentials from being shared.
- All content, including code: displays the notebook as is.
- A variety of download as options are also available in the menu.
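The effect of excluding sensitive code cells can be approximated locally. The sketch below assumes the standard Jupyter notebook JSON layout and drops any code cell whose source contains the `# @hidden_cell` marker. It illustrates the idea only and is not how Watson Studio implements sharing; the helper names `cell_source` and `strip_hidden_cells` and the sample notebook are hypothetical.

```python
import copy

HIDDEN_MARKER = "# @hidden_cell"


def cell_source(cell):
    """Notebook cells store source as a string or a list of lines."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def strip_hidden_cells(notebook):
    """Return a copy of the notebook without marked code cells."""
    shared = copy.deepcopy(notebook)  # leave the original untouched
    shared["cells"] = [
        cell for cell in shared.get("cells", [])
        if not (cell.get("cell_type") == "code"
                and HIDDEN_MARKER in cell_source(cell))
    ]
    return shared


# Hypothetical notebook with one cell holding placeholder credentials.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data"]},
        {"cell_type": "code", "execution_count": 1, "outputs": [],
         "source": ["# @hidden_cell\n",
                    "credentials = {'password': 'not-a-real-secret'}"]},
        {"cell_type": "code", "execution_count": 2, "outputs": [],
         "source": ["print('hello')"]},
    ]
}

shared = strip_hidden_cells(sample)
print(len(sample["cells"]), len(shared["cells"]))
```

Filtering a copy rather than the notebook itself keeps the working notebook, including its credentials cell, intact.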
- Demo on YouTube: Watch the video.
- What is SystemML: A video of a Chicago Spark meetup that outlines Apache SystemML basics.
- SystemML introduction and demo: A detailed video introduction to Apache SystemML.
- Machine Learning Framework survey results: An examination of current major deep learning frameworks, including a comparison of native language of framework, multi-GPU support, and aspects of usability.
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns.
- AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos.
- Watson Studio: Master the art of data science with IBM's Watson Studio.
- Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service.
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.