Data Mining and Predictive Analysis

Data mining is the process of discovering useful patterns and trends in large data sets. Predictive analytics is the process of extracting information from large data sets to make predictions and estimates about future outcomes. Data mining is becoming more widespread because it lets companies uncover profitable patterns and trends in their existing databases. uCertify’s Data Mining and Predictive Analysis course gives you hands-on experience in data mining and teaches you which types of analysis uncover the most profitable nuggets of knowledge in the data, while avoiding pitfalls that may cost your company millions of dollars.

Features
63+ LiveLab | 63+ Video tutorials | 02:02+ Hours

Outline

Lesson 1:
Preface

  • What is Data Mining? What is Predictive Analytics?
  • Why is this Course Needed?
  • Who Will Benefit from this Course?
  • Danger! Data Mining is Easy to do Badly
  • “White-Box” Approach
  • Algorithm Walk-Throughs
  • Exciting New Topics
  • The R Zone
  • Appendix: Data Summarization and Visualization
  • The Case Study: Bringing it all Together
  • How the Course is Structured

Lesson 2:
An Introduction to Data Mining and Predictive Analytics

  • What is Data Mining? What Is Predictive Analytics?
  • Wanted: Data Miners
  • The Need For Human Direction of Data Mining
  • The Cross-Industry Standard Process for Data Mining: CRISP-DM
  • Fallacies of Data Mining
  • What Tasks Can Data Mining Accomplish?
  • The R Zone
  • R References
  • Exercises

Lesson 3:
Data Preprocessing

  • Why do We Need to Preprocess the Data?
  • Data Cleaning
  • Handling Missing Data
  • Identifying Misclassifications
  • Graphical Methods for Identifying Outliers
  • Measures of Center and Spread
  • Data Transformation
  • Min–Max Normalization
  • Z-Score Standardization
  • Decimal Scaling
  • Transformations to Achieve Normality
  • Numerical Methods for Identifying Outliers
  • Flag Variables
  • Transforming Categorical Variables into Numerical Variables
  • Binning Numerical Variables
  • Reclassifying Categorical Variables
  • Adding an Index Field
  • Removing Variables that are not Useful
  • Variables that Should Probably not be Removed
  • Removal of Duplicate Records
  • A Word About ID Fields
  • The R Zone
  • R Reference
  • Exercises

Lesson 4:
Exploratory Data Analysis

  • Hypothesis Testing Versus Exploratory Data Analysis
  • Getting to Know The Data Set
  • Exploring Categorical Variables
  • Exploring Numeric Variables
  • Exploring Multivariate Relationships
  • Selecting Interesting Subsets of the Data for Further Investigation
  • Using EDA to Uncover Anomalous Fields
  • Binning Based on Predictive Value
  • Deriving New Variables: Flag Variables
  • Deriving New Variables: Numerical Variables
  • Using EDA to Investigate Correlated Predictor Variables
  • Summary of Our EDA
  • The R Zone
  • R References
  • Exercises

Lesson 5:
Dimension-Reduction Methods

  • Need for Dimension-Reduction in Data Mining
  • Principal Components Analysis
  • Applying PCA to the Houses Data Set
  • How Many Components Should We Extract?
  • Profiling the Principal Components
  • Communalities
  • Validation of the Principal Components
  • Factor Analysis
  • Applying Factor Analysis to the Adult Data Set
  • Factor Rotation
  • User-Defined Composites
  • An Example of a User-Defined Composite
  • The R Zone
  • R References
  • Exercises

Lesson 6:
Univariate Statistical Analysis

  • Data Mining Tasks in Discovering Knowledge in Data
  • Statistical Approaches to Estimation and Prediction
  • Statistical Inference
  • How Confident are We in Our Estimates?
  • Confidence Interval Estimation of the Mean
  • How to Reduce the Margin of Error
  • Confidence Interval Estimation of the Proportion
  • Hypothesis Testing for the Mean
  • Assessing The Strength of Evidence Against The Null Hypothesis
  • Using Confidence Intervals to Perform Hypothesis Tests
  • Hypothesis Testing for The Proportion
  • Reference
  • The R Zone
  • R Reference
  • Exercises

Lesson 7:
Multivariate Statistics

  • Two-Sample t-Test for Difference in Means
  • Two-Sample Z-Test for Difference in Proportions
  • Test for the Homogeneity of Proportions
  • Chi-Square Test for Goodness of Fit of Multinomial Data
  • Analysis of Variance
  • Reference
  • The R Zone
  • R Reference
  • Exercises

Lesson 8:
Preparing to Model the Data

  • Supervised Versus Unsupervised Methods
  • Statistical Methodology and Data Mining Methodology
  • Cross-Validation
  • Overfitting
  • Bias–Variance Trade-Off
  • Balancing The Training Data Set
  • Establishing Baseline Performance
  • The R Zone
  • R Reference
  • Exercises

Lesson 9:
Simple Linear Regression

  • An Example of Simple Linear Regression
  • Dangers of Extrapolation
  • How Useful is the Regression? The Coefficient of Determination, r²
  • Standard Error of the Estimate, s
  • Correlation Coefficient r
  • ANOVA Table for Simple Linear Regression
  • Outliers, High Leverage Points, and Influential Observations
  • Population Regression Equation
  • Verifying The Regression Assumptions
  • Inference in Regression
  • t-Test for the Relationship Between x and y
  • Confidence Interval for the Slope of the Regression Line
  • Confidence Interval for the Correlation Coefficient ρ
  • Confidence Interval for the Mean Value of y Given x
  • Prediction Interval for a Randomly Chosen Value of y Given x
  • Transformations to Achieve Linearity
  • Box–Cox Transformations
  • The R Zone
  • R References
  • Exercises

Lesson 10:
Multiple Regression and Model Building

  • An Example of Multiple Regression
  • The Population Multiple Regression Equation
  • Inference in Multiple Regression
  • Regression With Categorical Predictors, Using Indicator Variables
  • Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
  • Sequential Sums of Squares
  • Multicollinearity
  • Variable Selection Methods
  • Gas Mileage Data Set
  • An Application of Variable Selection Methods
  • Using the Principal Components as Predictors in Multiple Regression
  • The R Zone
  • R References
  • Exercises

Lesson 11:
k-Nearest Neighbor Algorithm

  • Classification Task
  • k-Nearest Neighbor Algorithm
  • Distance Function
  • Combination Function
  • Quantifying Attribute Relevance: Stretching the Axes
  • Database Considerations
  • k-Nearest Neighbor Algorithm for Estimation and Prediction
  • Choosing k
  • Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
  • The R Zone
  • R References
  • Exercises

Lesson 12:
Decision Trees

  • What is a Decision Tree?
  • Requirements for Using Decision Trees
  • Classification and Regression Trees
  • C4.5 Algorithm
  • Decision Rules
  • Comparison of the C5.0 and CART Algorithms Applied to Real Data
  • The R Zone
  • R References
  • Exercises

Lesson 13:
Neural Networks

  • Input and Output Encoding
  • Neural Networks for Estimation and Prediction
  • Simple Example of a Neural Network
  • Sigmoid Activation Function
  • Back-Propagation
  • Gradient-Descent Method
  • Back-Propagation Rules
  • Example of Back-Propagation
  • Termination Criteria
  • Learning Rate
  • Momentum Term
  • Sensitivity Analysis
  • Application of Neural Network Modeling
  • The R Zone
  • R References
  • Exercises

Lesson 14:
Logistic Regression

  • Simple Example of Logistic Regression
  • Maximum Likelihood Estimation
  • Interpreting Logistic Regression Output
  • Inference: Are the Predictors Significant?
  • Odds Ratio and Relative Risk
  • Interpreting Logistic Regression for a Dichotomous Predictor
  • Interpreting Logistic Regression for a Polychotomous Predictor
  • Interpreting Logistic Regression for a Continuous Predictor
  • Assumption of Linearity
  • Zero-Cell Problem
  • Multiple Logistic Regression
  • Introducing Higher Order Terms to Handle Nonlinearity
  • Validating the Logistic Regression Model
  • WEKA: Hands-On Analysis Using Logistic Regression
  • The R Zone
  • R References
  • Exercises

Lesson 15:
Naïve Bayes and Bayesian Networks

  • Bayesian Approach
  • Maximum A Posteriori (MAP) Classification
  • Posterior Odds Ratio
  • Balancing The Data
  • Naïve Bayes Classification
  • Interpreting The Log Posterior Odds Ratio
  • Zero-Cell Problem
  • Numeric Predictors for Naïve Bayes Classification
  • WEKA: Hands-on Analysis Using Naïve Bayes
  • Bayesian Belief Networks
  • Clothing Purchase Example
  • Using The Bayesian Network to Find Probabilities
  • The R Zone
  • R References
  • Exercises

Lesson 16:
Model Evaluation Techniques

  • Model Evaluation Techniques for the Description Task
  • Model Evaluation Techniques for the Estimation and Prediction Tasks
  • Model Evaluation Measures for the Classification Task
  • Accuracy and Overall Error Rate
  • Sensitivity and Specificity
  • False-Positive Rate and False-Negative Rate
  • Proportions of True Positives, True Negatives, False Positives, and False Negatives
  • Misclassification Cost Adjustment to Reflect Real-World Concerns
  • Decision Cost/Benefit Analysis
  • Lift Charts and Gains Charts
  • Interweaving Model Evaluation with Model Building
  • Confluence of Results: Applying a Suite of Models
© 2024 TOPTALENT LEARNING.