Tools for Discovering Patterns in Data: Extracting Value from Tables, Text, and Links
John Elder, Ph.D.
Find the useful information hidden in your data! This course surveys computer-intensive methods for inductive classification and estimation, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of leading algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We'll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into powerful modern methods. The course emphasizes practical advice and focuses on the essential techniques of Resampling, Visualization, and Ensembles. Actual scientific and business examples will illustrate proven techniques employed by expert analysts. Along the way, relative strengths and distinctive properties of the leading commercial software products for Data Mining will be discussed.
John F. Elder IV, Ph.D. heads the US’s top data mining consulting team, with offices in Charlottesville, Virginia; Washington, DC; Baltimore, MD; and Raleigh, NC. Founded in 1995, Elder Research, Inc. (ERI) focuses on commercial, investment, and security applications of advanced analytics, including stock selection, text mining, social networks, image recognition, biometrics, process optimization, drug efficacy, credit scoring, and fraud detection. John holds Engineering degrees from Rice University and the University of Virginia, where he’s an Adjunct Professor teaching Optimization or Data Mining. Prior to founding ERI, he spent a decade in aerospace consulting, investment management, and academia.
Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and chairs international analytics conferences. He was honored to serve five years on a panel appointed by President Bush to guide technology for national security. He has co-authored award-winning books on practical data mining, ensembles, and text mining. John is grateful to be a follower of Christ and the father of 5.
This course is intended for those from industry and academia who work with data and wish to understand recent developments in data science and machine learning. At the conclusion of this course, one should be able to discern the basic strengths of competing methods and select the appropriate tools for one's applications. Participants should have prior working experience with computers and an interest in applied statistical techniques. (It helps, as well, to have a motivating application you wish to solve.)
I. Pattern Discovery: An Overview
- Inducing Models from Data: Benefits and Dangers
- Example Projects from Science and Business
- Characteristics of successful projects
- Leading Software Tools and Vendors
II. Classical Statistical Techniques (brief review)
- Principal Components
- Nearest Neighbors
III. Modern Methods
- Neural Networks
- Decision Trees
IV. Key General Tools
- Scientific Visualization: Grand Tour, Projection Pursuit, limitations
- Bootstrapping/Resampling: Essential!
- Optimization: local and global
- Target Shuffling: learning true significance
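As a rough illustration of the Target Shuffling item above, here is a minimal sketch that treats it as a permutation test: the target values are shuffled many times, and the fraction of shuffles that produce an association at least as strong as the real one estimates how often such a result arises by chance. The function names, the use of Pearson correlation as the measure of association, and the default of 1000 shuffles are assumptions made for this example, not material from the course.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def target_shuffle_p_value(xs, ys, n_shuffles=1000, seed=0):
    """Estimate how often shuffled targets match or beat the observed result.

    Hypothetical helper: shuffling ys breaks any real link to xs, so the
    shuffled correlations show what chance alone can produce.
    """
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(correlation(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    xs = list(range(20))
    ys = [2 * x + 1 for x in xs]  # a deliberately strong relationship
    print(target_shuffle_p_value(xs, ys))
```

A small returned value suggests the observed relationship is unlikely to be a chance artifact; a large one suggests that shuffled, meaningless targets do about as well.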
V. Data Trouble-Shooting
- Case Diagnostics (Outlying, Influential, Leverage, & Missing points)
- Feature Creation and Selection
VI. Text Mining
- Stemming, Collocation, Feature Engineering
- Statistical vs. Language-dependent methods
- “Bag of Words” & Vector Space
- Active Learning
VII. Social Network Analysis
- The power of the "network effect"
- Visualization, modeling tools, and examples
VIII. Comparing and Combining Algorithms
- Adaptive model structure
- Matching an algorithm to your application
- Experimental test results
- Combining models to improve accuracy
- Bagging & Boosting
- Why Ensembles work
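To make the Bagging item above concrete, here is a minimal sketch of bootstrap aggregating for regression: each base model is fit to a bootstrap resample (drawn with replacement) of the training data, and the ensemble prediction is the average of the base predictions. The one-split "stump" base learner, the function names, and the default of 25 models are assumptions chosen to keep the example short; they are not taken from the course materials.

```python
import random
import statistics


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs; return a predictor."""
    points = sorted(points)
    overall = statistics.fmean(y for _, y in points)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (points[i - 1][0] + points[i][0]) / 2, lm, rm)
    _, threshold, left_mean, right_mean = best
    if threshold is None:
        return lambda x: overall  # no valid split: predict the mean
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_predictor(points, n_models=25, seed=0):
    """Average stumps fit to bootstrap resamples of the training points."""
    rng = random.Random(seed)
    models = [
        fit_stump(rng.choices(points, k=len(points)))  # sample with replacement
        for _ in range(n_models)
    ]
    return lambda x: statistics.fmean(m(x) for m in models)


if __name__ == "__main__":
    data = [(x, 0.0 if x < 10 else 10.0) for x in range(20)]
    predict = bagged_predictor(data)
    print(predict(0), predict(19))
```

Because each stump sees a slightly different resample, the split points vary from model to model, and averaging them smooths the prediction near the boundary.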
IX. Top 10 Data Mining Mistakes
- Lack data
- Focus on Training
- Rely on 1 technique
- Ask the wrong question
- Listen (only) to the data
- Leaks from the Future
- Discount pesky cases
- Answer every inquiry
- Sample without care
- Believe the best model
A note about the course scope
Each of the major topics discussed could comprise a semester-long course if presented in full detail! What this (intensive) short course provides is a broad overview of the highlights, drawing connections between major developments in the diverse fields that contribute to Predictive Analytics, including cutting-edge ways to mine text and graphical networks. Previous participants have found this "big picture" to be very useful for identifying techniques to use immediately, as well as approaches worthy of further exploration, for research or practical problem-solving.
Comments from previous attendees
- "[Dr. Elder] provided examples shedding light on complex concepts. He gave the big picture all along the way."
- "Gave real practical insights from a practitioner's point of view."
- "Finally someone told me how things are done, not just how great Data Mining is."
- "Most valuable, were the insights into the essence of various methods, their relative strengths and weaknesses, and the important open research areas."
- "Very interesting, knowledgeable, and entertaining approach."