Project 6 is an opportunity to create your own custom exploratory data analysis (EDA) project using GitHub, Git, Jupyter, pandas, Seaborn and other popular data analytics tools.
- GitHub Repository: datafun-06-eda
- Documentation: README.md
- Notebook: yourname_eda.ipynb
Follow this common workflow to start a new project.
- In your browser, create a GitHub project repository with a default README.md. Name the repo as specified above.
- Clone your new repository down to your machine into your Documents folder.
- Open your new project repository folder (in your Documents folder) in VS Code, if you haven't already.
- In VS Code, add a useful .gitignore file with a line for .vscode/ and .venv/ and whatever else doesn't need to be tracked in the repository.
- In VS Code, edit your README.md to record your commands, process, and notes so far.
- In VS Code, open a terminal - PowerShell if Windows, zsh or bash if Mac/Linux.
- Use the terminal to git add your files and folders to source control, git commit your changes with a useful message (e.g. "initial commit"), and git push the changes up to GitHub.
- Verify your GitHub repository.
This project requires at least the following external modules, so a local project virtual environment is recommended (a sample requirements.txt follows the list).
- jupyterlab: Enables Jupyter notebooks.
- pandas: Handles data manipulation and analysis, focusing on structured (tabular/panel) data.
- pyarrow: Required by pandas; facilitates interaction between pandas and the Arrow format.
- matplotlib: Basic tools for plotting and visualizing data.
- seaborn: Simplifies complex visualizations and statistical plots, built on matplotlib.
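For reference only, a minimal requirements.txt listing these modules might look like the sketch below; follow PROJECT_VIRTUAL_ENV.md for the recommended setup and installation commands.
Example requirements.txt:
jupyterlab
pandas
pyarrow
matplotlib
seaborn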
Perform and publish a custom EDA project to demonstrate skills with Jupyter, pandas, Seaborn and popular tools for data analytics. The notebook should tell a data story and visually present findings in a clear and engaging manner.
Choose a dataset for analysis.
You will want a known, clean dataset. Cleaning data can consume 60-80% of a project (or more); you don't need to demonstrate cleaning skills for this project.
The recommended approach is to select one of the datasets that comes pre-installed with Seaborn.
You can view a list of the Seaborn datasets in the first link below.
The additional links offer a range of options.
- List of Seaborn Datasets Installed
- UCI Machine Learning Repository
- Kaggle Datasets
- Data.gov
- Google Dataset Search
You may use your own data if you have permission and there is no confidential information included. Be careful with your data selection and ensure you have rights to use the content.
This project uses external packages, which are not included in the Python Standard Library - we must install them. To keep our project separate from all other Python projects, we will create and manage a local project virtual environment. We'll install our packages into the local project virtual environment. For the recommended process with detailed steps and commands, see PROJECT_VIRTUAL_ENV.md.
Make sure Jupyter is installed and working in your project virtual environment. Document the process and commands you used in your README.md.
Then create, open, and start a new notebook in your root project repository folder:
- Create the Notebook: In the VS Code Explorer, create a new file, i.e., yourname_eda.ipynb (replace yourname with your own name). Ensure it has a .ipynb extension.
- Verify your new notebook is open for editing. If needed, view the project files in VS Code Explorer and double-click the notebook file to open it for editing.
- Add a Markdown cell at the top of your notebook with the introduction (include the title, author, date, and purpose of the project). An example follows this list.
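For example, the introduction cell might look like the following; the title, name, and date are placeholders to replace with your own.
Jupyter Notebook / Markdown cell example:
# Custom EDA Project
Author: Your Name
Date: YYYY-MM-DD
Purpose: Briefly describe your dataset and the questions this notebook explores.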
Add a Python cell next with the import statements for the libraries you will use in the project. Follow conventional package import organization and aliases. Import each package just once, near the top of the file. Be sure you have INSTALLED any external packages (outside the Python Standard Library) into your active project virtual environment first.
Jupyter Notebook / Python cell example:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Execute the cell to ensure everything works. If you get errors on one of the statements above, the most common issue is that the package has not been installed into the active project virtual environment. When you find you need a new package, first install it into the active project virtual environment and then import it near the top of your Python or Notebook file.
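If you are unsure which environment the notebook kernel is using, one quick check (a sketch, assuming your virtual environment folder is named .venv) is to print the interpreter path from inside the notebook; it should point into your project's .venv folder.
Jupyter Notebook / Python cell example:
# Confirm which Python interpreter the notebook kernel is using
import sys
print(sys.executable)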
Perform a unique exploratory data analysis project using the tools and skills covered previously.
Load the data into a pandas DataFrame. Use the pandas read functions, such as pd.read_csv() or pd.read_excel(), as appropriate. To load a Seaborn sample dataset, we'll use the sns.load_dataset() function and pass in 'iris' (the dataset name, without .csv) to populate our DataFrame.
Jupyter Notebook / Python cell example:
# Load the dataset into a pandas DataFrame - adjust this process for your custom data
df = sns.load_dataset('iris')
# Inspect first rows of the DataFrame
print(df.head())
Display the first 10 rows of the DataFrame, check the shape, and display the data types of each column using df.head(10), df.shape, and df.dtypes.
Jupyter Notebook / Python cell example:
print(df.head(10))
print(df.shape)
print(df.dtypes)
Use the DataFrame describe() method to display summary statistics for each column.
Jupyter Notebook / Python cell example:
print(df.describe())
Choose a numerical column and use df['column_name'].hist() to plot a histogram for that specific column. To show all the histograms for all numerical columns, use df.hist().
Jupyter Notebook / Python cell example:
# Inspect histogram by numerical column
df['sepal_length'].hist()
# Inspect histograms for all numerical columns
df.hist()
# Show all plots
plt.show()
Afterwards, use a Markdown cell to document your observations.
Choose a categorical column and use df['column_name'].value_counts() to display the count of each category. Use a loop to show the value counts for all categorical columns.
Jupyter Notebook / Python cell example:
# Inspect value counts by categorical column
print(df['species'].value_counts())
# Inspect value counts for all categorical columns
for col in df.select_dtypes(include=['object', 'category']).columns:
    # Display the value counts for this column
    print(df[col].value_counts())
    # Display a count plot for this column
    sns.countplot(x=col, data=df)
    plt.title(f'Distribution of {col}')
    plt.show()
Afterwards, use a Markdown cell to document your observations.
Use pandas and other tools to perform transformations. Transformations may include renaming columns, adding new columns, or transforming existing data for more in-depth analysis. A short example sketch follows the requirements below.
For this project, you must:
- Rename at least one column.
- Add at least one column.
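For illustration only, here is a minimal sketch using the iris DataFrame from the earlier examples; the renamed column and the new ratio column are hypothetical choices, so adapt them to your own data and analysis goals.
Jupyter Notebook / Python cell example:
# Rename at least one column (hypothetical: capitalize 'species' for presentation)
df = df.rename(columns={'species': 'Species'})
# Add at least one column (hypothetical: ratio of sepal length to sepal width)
df['sepal_ratio'] = df['sepal_length'] / df['sepal_width']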
Create a variety of chart types using seaborn and matplotlib to showcase different aspects of your data. For each chart, state the goal (what you want to learn or explore), name the chart type you chose, display the chart, and tell your data story. Use Markdown cells and Python cells. Create at least 3 subsections; each subsection should have the following parts (a sample sketch follows the list):
- Goal: The question you are exploring.
- Chart Type: Tell us what kind of chart you choose to illustrate this goal.
- Chart: Display the chart.
- Story: Use Markdown cell(s) to document your observations and insights.
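As an illustration only, one subsection's chart cell might look like the sketch below; the scatterplot and its columns are hypothetical choices (hue='Species' assumes the rename from the transformation sketch above), so substitute the chart type and columns that fit your own question.
Jupyter Notebook / Python cell example:
# Goal (hypothetical): how does sepal length relate to petal length across species?
sns.scatterplot(data=df, x='sepal_length', y='petal_length', hue='Species')
plt.title('Sepal Length vs. Petal Length by Species')
plt.show()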
Present your notebook with an opening that introduces yourself and your topic. Use Markdown section headings to introduce each step. Interpret the visualizations and statistics to narrate a clear and compelling data story. Present your findings in a logical and engaging manner.
- Begin your notebook with a project summary including the title, author, date, and project's purpose. This provides an immediate understanding of the notebook's objective.
- Ensure your code and presentation are neat, well-organized, and follow good coding practices. This includes proper variable naming, consistent code style, and logical organization of code cells.
- Use Markdown features effectively for formatting, such as section headings, bullet points, and emphasis (bold/italic), to enhance readability.
Once the notebook runs without errors, focus on how the notebook content is structured and documented. Organize your notebook into well-defined sections, each with a clear purpose and header. Use Markdown cells to provide context, explain your analysis, and share findings. This makes your notebook informative and engaging. Comment your code cells to explain the purpose and functionality of the code. This is especially important for complex or non-obvious code segments.
Run your notebook entirely to ensure it executes without errors. This includes checking all code cells and ensuring all data visualizations render as expected. Confirm that your notebook renders correctly on GitHub after pushing, as this ensures your work is viewable by others.
- Functionality: The project should be functional and meet all requirements.
- Documentation: The project should be well-written and well-documented.
- Presentation: The project should be presented in a clear and organized manner.
- Professionalism: The project should be submitted on-time and reflect an original, creative effort.
See rubric for additional information.
- See datafun-04-spec for a guided EDA.
- See JUPYTER.md for Jupyter Notebook keyboard shortcuts and recommendations.
- See MARKDOWN.md for Markdown syntax and recommendations.
- See Plotting graph For IRIS Dataset Using Seaborn And Matplotlib
- See Seaborn Tutorial