$149.00
Certification

This industry-recognized certification lets you add the credential to your resume once you have completed all courses.

Instructor
Dr. Connie Brett, Instructor - Data Wrangling in R

Dr. Connie Brett is a successful Data Scientist, Entrepreneur, and Educator who has spent the past 15+ years implementing and coaching analytics teams across the entire SDLC. She brings a unique perspective to the problems faced in every phase of planning, developing, and using online products and solutions, which helps her teach you how to use analytics tools in the most effective way. With an M.S. and Ph.D. in Computational Chemistry from The Ohio State University, she worked in the quagmire of data problems, preparation, and analysis long before the terms "Data Science" and "Big Data" were coined. She has been published in peer-reviewed journals and recently filed for a US Patent on a Data Visualization Framework.

Real-world data preparation for further analysis using R

  • Learn from start to finish how to get your data into R efficiently and polish it up so that it is as good as it can be.
  • The instructor is the founder of the Analytics Incubation Center at Cisco and has 15+ years of analytics development experience.
  • Capstone project reviewed by the instructor.

Duration: 2h 15m

Course Description

R is an extraordinarily powerful language with a vast community of great resources, but where should you start when all you want to do is get your data into a usable format? How do you know when your data is ready? What pitfalls should you watch for so that you don't perform an analysis on bad data? This course teaches you from start to finish how to get your data into R efficiently and polish it up so that it is as good as it can be. After this step, you or your team can focus on statistical modeling, visualization, reporting, sharing, or any other post-processing task you wish to perform. Confidence, reliability, and reproducibility in your data acquisition and preparation are the linchpins of maximizing your data's value.

This course uses a variety of real-world data sets that contain real-world data quality, formatting, and other issues. It ensures that you understand not just the R syntax to perform a task, but also the sources of quality issues, how to recognize hidden data problems, and the benefits and adverse effects of the most common data manipulations. You will gain real experience in the art and science of data preparation that you can carry forward with confidence to your next real project.

The capstone project uses open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not given - the course will have provided the methods and insights needed to prepare this data for future statistical analysis. The instructor reviews the capstone project and provides individual feedback to each student in the course, along with a full project solution.

What am I going to get from this course?

  • Understand the R syntax to perform a task
  • Identify sources of quality issues
  • Recognize hidden data problems
  • Understand benefits/detriments of the most common data manipulations
  • Prepare a real-world dataset for future statistical analysis and use the capstone project as a portfolio piece.

Prerequisites and Target Audience

What will students need to know or do before starting this course?

  • RStudio installed (optional, but strongly suggested)
  • R installed
  • Basic R programming knowledge

Who should take this course? Who should not?

  • Students do not need to be R experts to take this course, but should have a basic knowledge of how to use R.
  • This course is for people who work with data in R and want to better understand how to prepare data for analysis correctly and efficiently.

Curriculum

Module 1: Introduction

05:30
Lecture 1 Introduction to the Course
05:30

Course Objectives, Audience and Instructor Information.

Lecture 2 Course Slides

Download the entire set of course slides as a PDF for taking notes as you work through this course.

Module 2: Data Sources

18:08
Lecture 3 Importance of Metadata
02:11

An overview and understanding of why metadata is important.

Lecture 4 Collection Bias
02:05

Understanding Collection Bias and why it is critical to keep in mind during data collection and analysis.

Lecture 5 Public Data Sources
03:24

Using public data sources including best practices.

Lecture 6 Private Data
10:28

Defining and understanding private data.

Module 3: Obtaining Data

16:33
Lecture 7 Database Connections
02:40

Connecting to and querying data directly from databases in R

Lecture 8 Files
07:00

Obtaining data from various file types and formats

Lecture 9 Hadoop
03:51

Interacting with Hadoop data stores in R

Lecture 10 Mini-Project 1

In this project we are going to obtain the data used in the mini-projects. Complete this project before the quiz!

Quiz 1 Mini-Project 1

Questions related to Mini-Project 1.

Module 4: Cleaning Data

31:05
Lecture 11 HTML
04:31

Dealing with HTML encoding in fields

Lecture 12 JSON
03:43

Dealing with JSON formatted data

Lecture 13 Excel
06:19

Excel-specific data cleaning issues and tips.

Lecture 14 Whitespace/Languages
02:05

Handling whitespace and multi-language issues in R

Lecture 15 Units and Conversions
01:48

Handling unit conversions

Lecture 16 Data Type Issues
01:55

Common data type issues

Lecture 17 Categorical Creep
03:47

Recognizing and solving categorical "creep" or spread

Lecture 18 Minor Corrections
01:36

Best Practices for minor corrections

Lecture 19 Completeness
03:50

Overview of detecting and handling completeness issues during data cleaning

Lecture 20 Accuracy
01:31

Notes on accuracy considerations while cleaning data

Module 5: Shaping Data

20:14
Lecture 21 Long vs. Wide Formats
03:09

Understanding and converting between these commonly referenced data shapes

Lecture 22 Combined Data
02:22

Separating combined data in a single field

Lecture 23 Column & Row Names
03:31

Capturing data contained in column and row names

Lecture 24 Internally Structured Data
03:53

Flattening data with embedded structured data

Lecture 25 Internal Lists
04:22

Handling lists inside fields

Lecture 26 Naming Columns
00:58

Quick best-practices and considerations when naming columns

Lecture 27 OLAP Cubes
01:59

Using OLAP cube data in R

Lecture 28 Mini-Project 2

In this project we are going to prepare the data from Mini-Project 1 for analysis. Complete this project before the quiz!

Quiz 2 Mini-Project 2

Questions related to Mini-Project 2

Module 6: Features/Variables

27:45
Lecture 29 Introduction
01:14

Introducing Feature/Variable Selection

Lecture 30 Elimination - Variance
05:06

Eliminating features with zero or near-zero variance

Lecture 31 Elimination - Correlation
05:39

Eliminating features using correlation

Lecture 32 Feature Creation
03:35

Finding and creating features

Lecture 33 Examining Distributions
02:26

Examining variable distributions - continuous data

Lecture 34 Finding Rare Events
02:32

Finding rare events in data that may signal an issue

Lecture 35 Normalization
03:52

Normalizing and rescaling data

Lecture 36 Advanced Preprocessing
02:28

Handling less-common data preprocessing scenarios such as baseline removal.

Lecture 37 Wrap-Up
00:53

Comments on selecting features/variables

Lecture 38 Mini-Project 3

In this project we are going to refine the dataset by feature manipulation. Complete this project before the quiz!

Quiz 3 Mini-Project 3

Questions related to Mini-Project 3

Module 7: Exporting & Saving

05:07
Lecture 39 Exporting & Saving Prepared Data
05:07

Tips, tricks and notes about exporting and saving your prepared data

Module 8: Data Pipeline

06:25
Lecture 40 Working with R in a Data Pipeline
06:25

Considerations when Data Wrangling as part of a data pipeline.

Module 9: Conclusion & Capstone

03:52
Lecture 41 Next Steps and Additional Resources
03:52

Course Wrap-up

Lecture 42 Capstone Project

Instructions for the Capstone Project. The capstone project uses open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not given - you will be able to use the methods you learned in the class to prepare this data for the project's future statistical analysis.

Reviews

6 Reviews

Xiao X

December, 2016

Weldon C

July, 2017

Very comprehensive course. Learned a tremendous amount about R.

Chris B

May, 2017

It is a great experience to take this course from the founder of the Analytics Incubation Center at Cisco. There cannot be a better way to learn than from such an instructor. He lectured very well on analysis with R and how to get data into R efficiently. The confusion about how and where to start getting data into a usable format using R is well explained in the course. It is equally important to know about the readiness of your data and the pitfalls to watch for so that you do not perform analysis on bad data. The instructor was open to ideas and encouraged contribution and collaboration. The course material is very informative.

Jason C

May, 2017

This course expertly teaches how to get data into R efficiently. Learning this way, I could concentrate on the statistical modeling, reporting, visualization and sharing. The course built my confidence in data acquisition and preparation, which are the main tasks in maximizing data value.

Jason S

May, 2017

As I am new to the R programming language, this course is of immense help to me in learning the basics of how to use R and in better understanding how to prepare data for analysis correctly and efficiently.

Victor G

July, 2017

Great experience and overall interesting course. The instructor was remarkably distinct in presentation and you can clearly see the effort that went into producing this course. This is one of the strongest courses on this subject I have taken.