Python Data
- Value chain
- Collection: getting the data
- Engineering: storage and computational resources
- Governance: overall management of data
- Wrangling: data preprocessing, cleaning
- Analysis: discovery (learning, visualisation, etc.)
- Presentation: arguing that results are significant and useful
- Operationalisation: putting the results to work
Data
- Mean, Median and Mode
- Mode: the most frequently occurring value
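The three measures above can be computed directly with Python's standard-library statistics module; the sample data here is made up for illustration.

```python
# Mean, median, and mode with the standard-library statistics module.
import statistics

data = [1, 2, 2, 3, 4, 7, 9]  # made-up sample

print(statistics.mean(data))    # arithmetic average
print(statistics.median(data))  # middle value of the sorted data
print(statistics.mode(data))    # most frequently occurring value
```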
- Data pre-processing
- Data preparation
- Data cleansing
- Data transformation
- Correlation amongst variables: df.corr()
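A minimal sketch of df.corr() in pandas; the DataFrame is made up, and corr() returns the pairwise Pearson correlation matrix by default.

```python
# Pairwise correlation between numeric columns with pandas.
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # perfectly correlated with x
    "z": [5, 3, 4, 1, 2],   # negatively correlated with x
})

print(df.corr())  # Pearson correlation matrix by default
```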
- The V’s
- The first characterisations by someone with a penchant for alliteration
- Volume, Velocity and Variety
- Metadata
- Data about data is critical to understanding
- Dimensions of data
- Infographics on data dimensions (how big is “big”)
- Growth laws
- Understanding the exponential growth
- in-memory: in RAM, i.e., not going to disk
- parallel processing: performing tasks in parallel
- distributed computing: across multiple machines
- scalability: to handle a growing amount of work; to be enlarged to accommodate growth (not just “big”)
- data parallel: processing can be done independently on separate chunks of data
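The data-parallel idea can be sketched in a few lines: the input is split into independent elements and each is mapped by a separate worker. A ThreadPoolExecutor is used here for portability; a process pool (or multiple machines) would give true CPU parallelism for heavy workloads.

```python
# Data-parallel sketch: each element is processed independently,
# so the work can be farmed out to a pool of workers.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, data))  # chunks handled independently

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```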
Learn
- Supervised learning: all data is labelled and the algorithms learn to predict the output from the input data.
- The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (Y) for that data.
- Polynomial regression uses the same linear regression infrastructure to fit a higher order polynomial.
- Small polynomial: cannot fit the data well; said to have high bias
- Large polynomial: can fit the data too well (overfit); said to have low bias
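The bias trade-off above can be seen by fitting polynomials of different degrees to the same data with numpy.polyfit (the data here is a made-up noiseless quadratic, so the degree-1 fit underfits while degree 2 matches exactly).

```python
# Fit polynomials of increasing degree and compare training residuals:
# too low a degree underfits (high bias); a high degree can chase noise.
import numpy as np

x = np.linspace(0, 1, 8)
y = x ** 2  # underlying quadratic relationship (noiseless, for clarity)

for degree in (1, 2, 7):
    coeffs = np.polyfit(x, y, degree)
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, residual)  # residual shrinks as degree grows
```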
- Naive Bayesian classification performs well for text classification with smaller data sets
- Linear Support Vector Machines perform well for text classification
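Both text classifiers can be sketched with scikit-learn on the same tiny, made-up corpus; the corpus, labels, and bag-of-words setup are illustrative assumptions, not from the notes.

```python
# Text classification sketch: Multinomial Naive Bayes and a linear SVM
# trained on the same bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["good movie", "great film", "awful movie", "terrible film"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (made-up data)

X = CountVectorizer().fit_transform(texts)  # word-count features

nb = MultinomialNB().fit(X, labels)
svm = LinearSVC().fit(X, labels)

print(nb.predict(X))
print(svm.predict(X))
```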
Clustering
- K Means
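A minimal K-means sketch with scikit-learn on made-up 2-D points; the two obvious groups should each get their own cluster label.

```python
# K-means clustering on two well-separated groups of 2-D points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 0],        # group near (1, 1)
                   [10, 10], [10.5, 11], [9, 10]])  # group near (10, 10)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the two learned centroids
```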
BigML
- Decision Trees
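BigML trains decision trees as a hosted service; as a local stand-in for the same idea, this sketch uses scikit-learn's DecisionTreeClassifier on a made-up dataset (this is not BigML's API).

```python
# Decision-tree sketch: learn a threshold that separates the classes.
from sklearn.tree import DecisionTreeClassifier

# Toy data: feature = [hours studied], label = pass (1) / fail (0)
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[2.5], [7.5]]))  # → [0 1]
```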
Written on November 30, 2020