Python Data

Value

  • Collection: getting the data
  • Engineering: storage and computational resources
  • Governance: overall management of data
  • Wrangling: data preprocessing and cleaning
  • Analysis: discovery (learning, visualisation, etc.)
  • Presentation: arguing that results are significant and useful
  • Operationalisation: putting the results to work

Data

  • Mean, Median and Mode
    • Mode: the most frequently occurring value
  • Data pre-processing
  • Data preparation
  • Data cleansing
  • Data transformation
  • correlation amongst variables: df.corr()
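The summary statistics and the `df.corr()` call above can be sketched as follows (a minimal example assuming pandas is installed; the data values are illustrative):

```python
# Summary statistics and pairwise correlation on a small DataFrame.
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 2, 3, 4],
    "y": [2, 4, 4, 6, 8],   # y = 2x, so perfectly correlated with x
})

mean_x = df["x"].mean()      # arithmetic mean
median_x = df["x"].median()  # middle value when sorted
mode_x = df["x"].mode()[0]   # most frequently occurring value

corr = df.corr()             # Pearson correlation amongst variables
```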

  • The V’s
    • The first characterisations by someone with a penchant for alliteration
    • Volume, Velocity and Variety
  • Metadata
    • Data about data is critical to understanding
  • Dimensions of data
    • Infographics on data dimensions (how big is “big”)
  • Growth laws
    • Understanding the exponential growth
  • in-memory: in RAM, i.e., not going to disk
  • parallel processing: performing tasks in parallel
  • distributed computing: across multiple machines
  • scalability: to handle a growing amount of work; to be enlarged to accommodate growth (not just “big”)
  • data parallel: processing can be done independently on separate chunks of data
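The data-parallel pattern above can be sketched with the standard library: split the data into independent chunks, process each chunk separately, then combine. `ThreadPoolExecutor` stands in here for workers that could equally be separate processes or separate machines (distributed computing); the chunk size and workload are illustrative assumptions.

```python
# Data-parallel sketch: chunk -> parallel map -> combine.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Any per-chunk computation that needs no other chunk's data.
    return sum(x * x for x in chunk)

data = list(range(10))
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]  # split

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))      # parallel map

total = sum(partials)                                     # combine
```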

Learn

  • Supervised learning: all data is labelled and the algorithms learn to predict the output from the input data.
  • The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (Y) for that data.
  • Polynomial regression uses the same linear regression infrastructure to fit a higher order polynomial.
  • Small polynomial: cannot fit the data well; said to have high bias
  • Large polynomial: can fit the data too well (overfitting); said to have low bias
  • Naive Bayesian classification performs well for text classification with smaller data sets
  • Linear Support Vector Machines perform well for text classification
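The polynomial-regression and bias points above can be sketched with `numpy.polyfit`, which reuses least-squares linear regression on polynomial features (a minimal example assuming NumPy; the quadratic data is illustrative):

```python
# Fit polynomials of increasing degree to data from a quadratic.
# A degree-1 fit underfits the curve (high bias); matching the true
# degree drives the training error to (near) zero.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2                          # data generated by a quadratic

def training_error(degree):
    coeffs = np.polyfit(x, y, degree)      # least-squares fit
    pred = np.polyval(coeffs, x)
    return float(np.sum((y - pred) ** 2))  # sum of squared residuals

err_linear = training_error(1)   # degree 1: high bias, larger error
err_quadratic = training_error(2)  # degree 2: matches the true curve
```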

Clustering

  • K Means
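K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A tiny standard-library sketch on 1-D points (real work would typically use `sklearn.cluster.KMeans`; the points and starting centroids are illustrative):

```python
# Minimal 1-D K-Means: alternate assignment and update steps.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: nearest centroid by absolute distance.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids = kmeans_1d(points, centroids=[0.0, 5.0])
```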

BigML

  • Decision Trees
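BigML exposes decision trees as a hosted service; as a local, hedged stand-in, here is a one-rule decision stump (a depth-1 decision tree) that picks the threshold minimising misclassifications on labelled 1-D data. The data and helper name `fit_stump` are illustrative, not BigML's API:

```python
# A decision stump: the simplest decision tree, one split on one feature.
def fit_stump(xs, ys):
    # Candidate thresholds are midpoints between consecutive sorted values;
    # keep the one with the fewest misclassifications.
    best_threshold, best_errors = None, float("inf")
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        errors = sum((x > t) != y for x, y in pairs)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

xs = [1, 2, 3, 10, 11, 12]
ys = [False, False, False, True, True, True]
threshold = fit_stump(xs, ys)
predict = lambda x: x > threshold
```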
Written on November 30, 2020