Ultimate Skill Checklist for Data Analysts

Contents

Programming

  • Python programming language
    • numpy
    • pandas
    • matplotlib
    • scipy
    • scikit-learn
  • R programming language
    • ggplot2
    • dplyr
    • ggally
    • reshape2
  • Optional
    • ipython
    • ipython notebook
    • anaconda
    • ggplot
    • seaborn
    • Spreadsheet tools (like Excel)
  • Additional Skills
    • JavaScript and HTML for D3.js
      • D3.js
      • AJAX implementation
      • jQuery
    • C/C++ or Java

Statistics

  • Descriptive and Inferential statistics
    • Mean, median, mode
    • Data distributions
      • Standard normal
      • Exponential/Poisson
      • Binomial
      • Chi-square
    • Standard deviation and variance
    • Hypothesis testing
      • P-values
    • Test for significance
      • Z-test, t-test, Mann-Whitney U
      • Chi-squared and ANOVA testing
  • Experimental design
    • A/B Testing
    • Controlling variables and choosing good control and testing groups
    • Sample size and statistical power
    • Hypothesis testing: formulate a hypothesis, then test it
    • Confidence level
    • SMART experiments: Specific, Measurable, Actionable, Realistic, Timely
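
The A/B testing, p-value, and sample-size items above can be sketched with the Python standard library alone. This is a minimal illustration, not a full experimental workflow: it assumes a two-sided two-proportion z-test and the usual normal-approximation sample-size formula. The function names and all counts are made up for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect p_base -> p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2)


# Hypothetical experiment: control (A) vs. variant (B), illustrative counts only.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
needed = sample_size_per_group(p_base=0.10, p_target=0.125)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, needed per group = {needed}")
```

Computing the sample size before the experiment starts, rather than after, is what keeps the significance level and power meaningful.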

Mathematics

  • Translate numbers and concepts into a mathematical expression: 4 times the square-root of one-third of a gallon of water (expressed as g): 4 √(1/3 g)
  • Solve for missing values in algebraic equations: 14 = 2x + 29
  • Interpret graphs: how does a value such as 1/2 in an equation change the shape of its graph?
  • Linear algebra and Calculus
  • Matrix manipulations. Dot product is crucial to understand.
  • Eigenvalues and eigenvectors -- Understand the significance of these two concepts
  • Multivariable derivatives and integration in Calculus
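
As a rough self-check for the items above, the sketch below works through the two examples from the list (the expression 4 √(1/3 g) and the equation 14 = 2x + 29), then a dot product and the eigenvalues of a small matrix. It uses only the standard library. The helper names are hypothetical, and the eigenvalue formula shown is the closed form for symmetric 2×2 matrices only, not a general method.

```python
from fractions import Fraction
from math import sqrt


def four_root_third(g):
    """4 times the square root of one-third of g gallons: 4 * sqrt(g / 3)."""
    return 4 * sqrt(g / 3)


# Solve 14 = 2x + 29: subtract 29 from both sides, then divide by 2.
x = Fraction(14 - 29, 2)


def dot(u, v):
    """Dot product: sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] from its characteristic polynomial."""
    mean = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


print(four_root_third(3))
print(x)
print(dot([1, 2, 3], [4, 5, 6]))

# For A = [[2, 1], [1, 2]], multiplying A by an eigenvector only rescales it.
A = [[2, 1], [1, 2]]
low, high = eigenvalues_2x2_symmetric(2, 1, 2)
v = [1, 1]                        # eigenvector paired with the larger eigenvalue
Av = [dot(row, v) for row in A]   # matrix-vector product built from dot products
print(low, high, Av)
```

The last lines show why the dot product matters: a matrix-vector product is one dot product per row, and an eigenvector is a vector that this product scales without changing its direction.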

Machine Learning

  • Supervised Learning
    • Decision trees
    • Naive Bayes classification
    • Ordinary Least Squares regression
    • Logistic regression
    • Neural networks
    • Support vector machines
    • Ensemble methods
  • Unsupervised Learning
    • Clustering Algorithms
    • Principal Component Analysis (PCA)
    • Singular Value Decomposition (SVD)
    • Independent Component Analysis (ICA)
  • Reinforcement Learning
    • Q-learning
    • TD-learning (temporal-difference learning)
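
To make one of the supervised methods above concrete, here is a minimal sketch of Ordinary Least Squares regression with a single feature, written in plain Python so the closed-form solution stays visible. The data points and the function names `fit_ols` and `predict` are made up for the example. In practice a library such as scikit-learn would normally be used instead.

```python
from statistics import mean


def fit_ols(xs, ys):
    """Fit y = slope * x + intercept by minimizing squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Illustrative data lying roughly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_ols(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"prediction at x = 6: {predict(6, slope, intercept):.2f}")
```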

Data Wrangling

  • Python
    • Python's built-in string methods for string manipulation
    • Parsing common file formats such as CSV and XML
    • Regular Expressions
    • Mathematical transformations
      • Make right-skewed data closer to a normal distribution with a log-10 transformation
  • Database systems (SQL-based and NoSQL-based): databases act as a central hub for storing information
  • Relational databases such as PostgreSQL, MySQL, Netezza, Oracle, etc.
  • Optional: Hadoop, Spark, MongoDB
  • SQL
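
The wrangling steps listed above can be chained into one small pipeline: parse a CSV, clean strings with string methods and a regular expression, apply a log-10 transformation, then load the result into a SQL database and query it. The sketch below is illustrative only. The data, column names, and table name are invented, and it uses the standard library's in-memory SQLite as a stand-in for any relational database.

```python
import csv
import io
import math
import re
import sqlite3

# Hypothetical messy input: stray whitespace, mixed case, currency formatting.
RAW_CSV = """name,income
 Alice ,"$1,000"
BOB,"$10,000"
carol,"$100,000"
"""


def clean_row(row):
    """Normalize one CSV record into (name, income, log10 of income)."""
    name = row["name"].strip().title()              # string methods
    digits = re.sub(r"[^\d.]", "", row["income"])   # regex: keep digits and dot
    income = float(digits)
    return name, income, math.log10(income)         # log-10 transformation


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW_CSV))]

# Load the cleaned rows into an in-memory SQLite database and query with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, income REAL, log_income REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
top = conn.execute(
    "SELECT name, income FROM people WHERE income >= 10000 ORDER BY income DESC"
).fetchall()
conn.close()

print(rows)
print(top)
```

The `?` placeholders pass values to SQLite as parameters, which avoids building SQL statements by string concatenation.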

Communication and Data Visualization

  • Understand visual encoding and how to communicate what you want the audience to take away from your visualizations
  • Programming
    • matplotlib
    • ggplot
    • d3.js
  • Presenting data and convincing people with your data
    • Know the business context surrounding your data
    • Think five steps ahead: predict your audience's questions and where they will challenge your assumptions and conclusions
    • Give out pre-reads to your presentations and have pre-alignment meetings with interested parties before the actual meeting