Python Data

Value

  • Collection: getting the data
  • Engineering: storage and computational resources
  • Governance: overall management of data
  • Wrangling: data preprocessing and cleaning
  • Analysis: discovery (learning, visualisation, etc.)
  • Presentation: arguing that results are significant and useful
  • Operationalisation: putting the results to work

Data

  • Mean, Median and Mode
    • Mode: the most frequently occurring value
  • Data pre-processing
  • Data preparation
  • Data cleansing
  • Data transformation
  • correlation amongst variables: df.corr()
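The summary statistics and the `df.corr()` call above can be sketched as follows (a minimal example assuming pandas is installed; the data values are illustrative):

```python
# Summary statistics and pairwise correlation on a small DataFrame.
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 2, 3, 4],
    "y": [2, 4, 4, 6, 8],   # y = 2x, so perfectly correlated with x
})

mean_x = df["x"].mean()      # arithmetic mean
median_x = df["x"].median()  # middle value when sorted
mode_x = df["x"].mode()[0]   # most frequently occurring value

corr = df.corr()             # Pearson correlation amongst variables
```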

  • The V’s
    • The first characterisations by someone with a penchant for alliteration
    • Volume, Velocity and Variety
  • Metadata
    • Data about data is critical to understanding
  • Dimensions of data
    • Infographics on data dimensions (how big is “big”)
  • Growth laws
    • Understanding the exponential growth
  • in-memory: in RAM, i.e., not going to disk
  • parallel processing: performing tasks in parallel
  • distributed computing: across multiple machines
  • scalability: to handle a growing amount of work; to be enlarged to accommodate growth (not just “big”)
  • data parallel: processing can be done independently on separate chunks of data
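The data-parallel pattern above can be sketched with the standard library: split the data into independent chunks, process each chunk separately, then combine. `ThreadPoolExecutor` stands in here for workers that could equally be separate processes or separate machines (distributed computing); the chunk size and workload are illustrative assumptions.

```python
# Data-parallel sketch: chunk -> parallel map -> combine.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Any per-chunk computation that needs no other chunk's data.
    return sum(x * x for x in chunk)

data = list(range(10))
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]  # split

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))      # parallel map

total = sum(partials)                                     # combine
```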

Learn

  • Supervised learning: all data is labelled and the algorithms learn to predict the output from the input data.
  • The goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (Y) for that data.
  • Polynomial regression uses the same linear regression infrastructure to fit a higher order polynomial.
  • Small polynomial: cannot fit the data well; said to have high bias
  • Large polynomial: can fit the data too well (overfitting); said to have low bias
  • Naive Bayesian classification performs well for text classification with smaller data sets
  • Linear Support Vector Machines perform well for text classification
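The polynomial-regression and bias points above can be sketched with `numpy.polyfit`, which reuses least-squares linear regression on polynomial features (a minimal example assuming NumPy; the quadratic data is illustrative):

```python
# Fit polynomials of increasing degree to data from a quadratic.
# A degree-1 fit underfits the curve (high bias); matching the true
# degree drives the training error to (near) zero.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2                          # data generated by a quadratic

def training_error(degree):
    coeffs = np.polyfit(x, y, degree)      # least-squares fit
    pred = np.polyval(coeffs, x)
    return float(np.sum((y - pred) ** 2))  # sum of squared residuals

err_linear = training_error(1)   # degree 1: high bias, larger error
err_quadratic = training_error(2)  # degree 2: matches the true curve
```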

Clustering

  • K Means
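K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A tiny standard-library sketch on 1-D points (real work would typically use `sklearn.cluster.KMeans`; the points and starting centroids are illustrative):

```python
# Minimal 1-D K-Means: alternate assignment and update steps.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: nearest centroid by absolute distance.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids = kmeans_1d(points, centroids=[0.0, 5.0])
```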

BigML

  • Decision Trees
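BigML exposes decision trees as a hosted service; as a local, hedged stand-in, here is a one-rule decision stump (a depth-1 decision tree) that picks the threshold minimising misclassifications on labelled 1-D data. The data and helper name `fit_stump` are illustrative, not BigML's API:

```python
# A decision stump: the simplest decision tree, one split on one feature.
def fit_stump(xs, ys):
    # Candidate thresholds are midpoints between consecutive sorted values;
    # keep the one with the fewest misclassifications.
    best_threshold, best_errors = None, float("inf")
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        errors = sum((x > t) != y for x, y in pairs)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

xs = [1, 2, 3, 10, 11, 12]
ys = [False, False, False, True, True, True]
threshold = fit_stump(xs, ys)
predict = lambda x: x > threshold
```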
Written on November 30, 2020