Duration:
5 Days
Audience:
Employees of federal, state, and local governments, and of businesses working with the government. The course is intended specifically for Data Scientists, Software Engineers, and Data Engineers.
Focus:
Using Python in a Data Science context, Machine Learning
Skill-level:
Intermediate
Hands-On Format:
This hands-on class is approximately 50% lab and 50% lecture, combining engaging lectures, demos, group activities, and discussions with comprehensive machine-based practical programming labs and project work.
Course Description:
Intermediate Python in Data Science covers the essentials of using Python as a tool for data scientists to perform exploratory data analysis, complex visualizations, and large-scale distributed processing on “Big Data”. The course covers essential mathematical and statistical libraries such as NumPy, Pandas, SciPy, and SciKit-Learn; frameworks such as TensorFlow and Spark; and visualization and imaging tools such as matplotlib, PIL, and Seaborn.
Prerequisites:
Students are required to have basic Python development skills.
Outline:
Session: Python for Data Science
Lesson: Python Review
· Python Language
· Essential Syntax
· Lists, Sets, Dictionaries, and Comprehensions
· Functions
· Classes, Modules, and imports
· Exceptions
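As a rough illustration of the review topics above (not taken from the course labs), the following minimal sketch combines an import, a small class, a function, comprehensions, and exception handling. The names `EmptyInputError` and `word_lengths` are hypothetical and exist only for this example.

```python
from collections import Counter


class EmptyInputError(ValueError):
    """Raised when there is nothing to process (hypothetical, for illustration)."""


def word_lengths(words):
    """Map each distinct word to its length using a dict comprehension."""
    if not words:
        raise EmptyInputError("no words supplied")
    # set() removes duplicates before the comprehension builds the dict.
    return {w: len(w) for w in set(words)}


words = ["spark", "numpy", "pandas", "numpy"]
lengths = word_lengths(words)

# Generator expression feeding sorted(): words longer than five characters.
long_words = sorted(w for w, n in lengths.items() if n > 5)

# Counter (imported from the standard library) tallies repeated items.
counts = Counter(words)

# Exceptions: catch the custom error raised for empty input.
try:
    word_lengths([])
except EmptyInputError as exc:
    error_message = str(exc)
```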
Lesson: IPython
· IPython basics
· Terminal and GUI shells
· Creating and using notebooks
· Saving and loading notebooks
· Ad hoc data visualization
· Web Notebooks (Jupyter)
Lesson: numpy
· numpy basics
· Creating arrays
· Indexing and slicing
· Large number sets
· Transforming data
· Advanced tricks
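The array topics above might look something like the following in practice. This is only a sketch with made-up values, assuming NumPy is installed; it shows array creation, slicing, boolean indexing, and a vectorized transform.

```python
import numpy as np

# Create a 3x4 array holding the integers 0..11.
a = np.arange(12).reshape(3, 4)

# Slicing: every row, columns 1 and 2.
middle = a[:, 1:3]

# Boolean indexing: keep only the even values (returns a flat array).
evens = a[a % 2 == 0]

# Vectorized transform: subtract each column's mean from that column.
# Broadcasting stretches the length-4 mean vector across all three rows.
centered = a - a.mean(axis=0)
```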
Lesson: scipy
· What can scipy do?
· Most useful functions
· Curve fitting
· Modeling
· Data visualization
· Statistics
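To give a sense of the curve-fitting topic, here is a minimal sketch using `scipy.optimize.curve_fit` on synthetic data. The straight-line model, the noise level, and the "true" parameters (slope 2, intercept 1) are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, slope, intercept):
    """Model function to fit: a straight line."""
    return slope * x + intercept


# Synthetic data: a known line plus a little Gaussian noise (seeded for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = line(x, 2.0, 1.0) + rng.normal(scale=0.1, size=x.size)

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(line, x, y)
slope, intercept = params
```

The fitted `slope` and `intercept` should land close to the values used to generate the data; the diagonal of `covariance` gives a rough estimate of the variance of each parameter.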
Lesson: A tour of scipy subpackages
· Clustering
· Physical and mathematical Constants
· FFTs
· Integral and differential solvers
· Interpolation and smoothing
· Input and Output
· Linear Algebra
· Image Processing
· Orthogonal Distance Regression
· Root-finding
· Signal Processing
· Sparse Matrices
· Spatial data and algorithms
· Statistical distributions and functions
· C/C++ Integration
Lesson: pandas
· pandas overview
· Dataframes
· Reading and writing data
· Data alignment and reshaping
· Fancy indexing and slicing
· Merging and joining data sets
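As one possible illustration of DataFrames and merging, the sketch below joins two small, made-up tables and aggregates the result. The column names and values are hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd

# Two small, made-up tables sharing a "dept_id" key.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [10, 20, 10]})
departments = pd.DataFrame({"dept_id": [10, 20], "dept": ["Data", "Ops"]})

# Left join: keep every employee and attach the matching department name.
merged = employees.merge(departments, on="dept_id", how="left")

# Group by department and count the employees in each.
headcount = merged.groupby("dept")["emp_id"].count()
```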
Lesson: matplotlib
· Creating a basic plot
· Commonly used plots
· Ad hoc data visualization
· Advanced usage
· Exporting images
Lesson: The Python Imaging Library (PIL)
· PIL overview
· Core image library
· Image processing
· Displaying images
Lesson: seaborn
· Seaborn overview
· Bivariate and univariate plots
· Visualizing Linear Regressions
· Visualizing Data Matrices
· Working with Time Series data
Lesson: SciKit-Learn Machine Learning Essentials
· SciKit overview
· SciKit-Learn overview
· Algorithms Overview
· Classification, Regression, Clustering, and Dimensionality Reduction
· SciKit Demo
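A classification workflow of the kind listed above could be sketched as follows. This is not the course demo; it assumes scikit-learn is installed and uses its bundled iris dataset with a logistic regression model, chosen here only as a familiar example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the bundled iris dataset as feature matrix X and label vector y.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation (seeded for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier on the training split; max_iter is raised to help convergence.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
```

The same fit/score pattern applies to most scikit-learn estimators, which is why swapping in a different algorithm usually changes only the line that constructs `model`.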
Lesson: TensorFlow Overview
· TensorFlow overview
· Keras
· Getting Started with TensorFlow
Session: Python on Spark
Lesson: PySpark Overview
· Python and Spark
· SciKit-Learn vs. Spark MLlib
· Python at Scale
· PySpark Demo
Lesson: RDDs and DataFrames
· DataFrames and Resilient Distributed Datasets (RDDs)
· Partitions
· Adding variables to a DataFrame
· DataFrame Types
· DataFrame Operations
· Dependent vs. Independent variables
· Map/Reduce with DataFrames
Lesson: Spark SQL
· Spark SQL Overview
· Data stores: HDFS, Cassandra, HBase, Hive, and S3
· Table Definitions
· Queries
Lesson: Spark MLlib
· MLlib overview
· MLlib Algorithms Overview
· Classification Algorithms
· Regression Algorithms
· Decision Trees and forests
· Recommendation with ALS
· Clustering Algorithms
· Machine Learning Pipelines
· Linear Algebra (SVD, PCA)
· Statistics in MLlib
Lesson: Spark Streaming
· Streaming overview
· Integrating Spark SQL, MLlib, and Streaming