QCon 2017 — Data, Visualisation and Machine Learning

Brett Uglow
5 min read · Dec 11, 2017


How do we know if our customers are happy?

90% of the world’s data has been created in the past 2 years, but most of that isn’t being used. Being a data-centric company is about using data to drive your business and make your customers happy.

Data & Visualisation

Cathy Polinsky (CTO of StitchFix and formerly of Salesforce and Amazon) gave a keynote presentation called “Data as DNA: Building a company on Data”. These are my notes from that presentation (if you’d like to see this video, get in touch with me via LinkedIn)…

Salesforce needed to rewrite a system because updates to it were single-threaded, which meant defining a new architecture. Which architecture should they use? They started by looking at the data they already had.

  • If the data is not visible, it’s meaningless. Data must be seen and visualised.
  • Make data open
  • Make it interpretable
  • Define your questions and metrics
  • What problems are you solving?
  • How will you know you’ve succeeded?
  • Make it visual
  • Opening up the data does not mean you should violate privacy laws — mask sensitive data

Data Science — Using data strategically

  • Effective Data Science is creating solutions and algorithms that are testable and iterable.
  • Testing via experiments (e.g. A/B testing)
  • When changing navigation though, A/B tests showed that the original Amazon tab design was best! Why? Because it takes time for people to understand new navigation systems. So Amazon had to make a call and change the nav with a view to it being better in the long term.
  • Is there a way to learn without an A/B test?
  • Don’t skip the hypothesis
  • Pick a metric that is related to your test (see the sketch after this list)
  • Don’t peek at the results early
  • Test big things (time is limited!)
  • True North vs Magnetic North (the closer you are to your goal, the more important it is to re-evaluate your goal metrics)
  • Goldilocks of Data (what is meaningful to the organisation, not focussing on how much data is enough)
  • Not all big data is interesting data
  • Small data (Stanford technique)
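
On the testing points above (pick a related metric, don’t peek early), here is a minimal sketch of how an A/B result might be checked for significance. This is my own illustration, not from the talk: the conversion counts are invented and the two-proportion z-test is just one common choice of test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

# Hypothetical counts: variant B (new navigation) vs control A.
p_a, p_b, z, p = two_proportion_z_test(conv_a=420, n_a=10_000, conv_b=465, n_b=10_000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z={z:.2f}  p={p:.3f}")
```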

Data Enables Personalisation

This is an old idea! Shops and store-keepers have been doing it since the beginning of time.

StitchFix is a personalisation company.

  • You complete a survey which asks about yourself: your size, your style, …. A stylist/curator assesses you (with machine learning making recommendations) and sends you clothes. This is all driven by personalisation.
  • This is about “data that matters”: price, cut, colour, length, how good an item is for most people, age, where they work, size, past purchases, …

Lesson 1: Feedback loops unlock personalisation

  • Style profile
  • When returning/accepting clothes, they provide feedback on the clothes (too big, too small, wrong colour)
  • Inventory feedback — which items are selling, and where

Lesson 2: Data incentives matter

  • Personalisation depends on getting good data. But most companies ask for data without giving consumers a reason to provide it. Physical clothing stores never get feedback from customers when they try on clothes.
  • So create “compelling self interest”.
  • First order benefit: your experience will get better if you give feedback
  • Second order benefit: your feedback helps our company be better (not as compelling)
  • Make data collection fun!
  • No customer will share data with you if they don’t trust you

Lesson 3: Humans + Machines = Better

  • Machines are good at some things. Humans are good at some things
  • “The Second Machine Age”. Driving Cars, Alexa, Deep Blue

Human judgement:

  • Helps leverage unstructured data
  • Provides empathy & creativity (“no more skinny jeans!”)
  • Frees the algorithm developer from dealing with edge cases

Summary

Being a data-centric company is about using data to drive your business and make your customers happy.

  • Give access to the data to all employees and make it easy for them to interpret
  • Invest in data science to find the problems and guide decision making
  • Create highly personalised experiences by blending humans and machines

Machine Learning (ML)

The goal of ML is not to make a perfect guess, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

I attended a couple of talks and a workshop which looked at Machine Learning. For developers who are used to writing code after soliciting requirements, prepare to be disappointed.

ML is not about writing code. It is about teaching a computer to learn an algorithm that is too complex to program. It involves the following steps:

  1. Collecting relevant data
  2. Analysing the data
  3. Creating features from data
  4. Selecting the best model for the problem
  5. Selecting the best training algorithm
  6. Training the model (← you might need some code here)
  7. Evaluating the accuracy of the model
  8. Deploying the model

2. Analysing data

  • Cleaning
  • Normalizing (e.g. rescaling values to between 0 and 1; see the sketch below)
  • Statistical tests (how the data itself is distributed)
  • Visualization (to help with understanding what the data is)

Goal: Determine possible ways to mathematically represent the data.
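
As an illustration of the normalising step mentioned above, here is a minimal sketch of min-max scaling. It is my own example, not from the workshop; the price values are invented.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # avoid dividing by zero for constant columns
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

house_prices = [310_000, 455_000, 720_000, 1_250_000]   # made-up raw values
print(min_max_scale(house_prices))    # -> values between 0.0 and 1.0
```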

3. Creating features

  • a compact representation of original data
  • cleaned and normalized
  • redundant data removed
  • correlated data removed (when two pieces of data say essentially the same thing, e.g. Nationality and Residency; see the sketch below)

Feature generation is both an art and a science.
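
For the correlated-data point above, here is a small sketch of dropping one column of a highly correlated pair. This is my own illustration using pandas (the workshop didn’t prescribe a library), with an invented table and an arbitrary 0.95 threshold.

```python
import pandas as pd

# Toy frame: in this made-up data, 'residency_code' is essentially a copy of
# 'nationality_code', so the pair is redundant and one column can be dropped.
df = pd.DataFrame({
    "nationality_code": [1, 2, 2, 3, 1, 3],
    "residency_code":   [1, 2, 2, 3, 1, 3],
    "age":              [23, 35, 41, 29, 52, 38],
})

corr = df.corr().abs()                       # pairwise absolute correlations
cols = corr.columns
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.95:           # redundancy threshold (a judgement call)
            to_drop.add(cols[j])             # keep the first column of the pair

print("Dropping:", to_drop)
features = df.drop(columns=list(to_drop))
```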

4. Selecting a model

  • Mathematical representation of data (hypothesis)
  • Independent of tool sets… portable between tools
  • Not all data can be represented by all models

5. Selecting an algorithm

Goal: How to learn the model parameters from the data

Linear (Regression) Model — a simple model, but doesn’t fit all data

  • Model represented as f(x) = ax + c (a line in two dimensions, a plane in higher dimensions)
  • Could appear when analysing housing prices, college scores
  • There are lots of algorithms for finding the linear regression model (see the sketch after this list)
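
As a concrete illustration (mine, not from the talk), here is a sketch of fitting f(x) = ax + c to a handful of invented house prices using ordinary least squares via numpy.polyfit.

```python
import numpy as np

# Made-up data: floor area (square metres) vs sale price.
area  = np.array([50, 70, 90, 120, 150], dtype=float)
price = np.array([310_000, 400_000, 495_000, 620_000, 760_000], dtype=float)

# Ordinary least squares fit of a degree-1 polynomial, i.e. f(x) = a*x + c.
a, c = np.polyfit(area, price, deg=1)
print(f"price is roughly {a:.0f} * area + {c:.0f}")
print("prediction for 100 square metres:", a * 100 + c)
```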

6. Training Set

  • the training set is a subset of the collected data
  • it must be statistically representative of your data (it’s not about the amount of data, but how representative it is)
  • it is used by the algorithm to learn the model
  • the test set is independent of the training set, e.g. hold back 10% of the whole data which NEVER gets used for training, and use the remaining 90% for training (see the sketch below)
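
A minimal sketch (my own illustration) of that 90/10 hold-out split, shuffling first so the test set stays representative:

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=42):
    """Shuffle the rows, then hold back a fraction that is never used for training."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # fixed seed so the split is reproducible
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]        # (training set, test set)

data = list(range(100))                        # stand-in for 100 collected samples
train, test = train_test_split(data)
print(len(train), "training rows,", len(test), "test rows")
```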

The main types of ML:

Supervised Learning

  • Classification
    — “Pick One of a Set”
    — Spam detection
    — Manufacturing defect detection
    — Handwriting analysis
    — How:
    — Decision Trees/Forests (example: Titanic survivors; see the sketch after this list)
    — Naïve Bayes
  • Regression
    — “Score or Rank”
    — Recommendations
    — Likelihood of Purchase
    — How: Fitting to some kind of curve
    — Linear Regression
    — Logistic Regression
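
To make the classification idea concrete, here is a sketch of training a small decision tree with scikit-learn on an invented Titanic-style table. This is my own example; the talk only named the technique.

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: [passenger class, is_female (1/0), age] -> survived (1/0).
X = [
    [1, 1, 29], [1, 0, 45], [2, 1, 34], [2, 0, 40],
    [3, 1, 22], [3, 0, 31], [3, 0, 18], [1, 1, 50],
]
y = [1, 0, 1, 0, 1, 0, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict for a hypothetical 3rd-class, 28-year-old female passenger.
print(model.predict([[3, 1, 28]]))   # class label depends on the learned tree
```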

Unsupervised Learning

  • Clustering
    — Group Similar
    — Find similar items
    — Customer segmentation
    — cohort detection
    — How:
    — K-Means (see the sketch after this list)
    — Hierarchical Clustering
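
And a matching sketch for clustering: K-Means on a few invented customer records, again using scikit-learn as one possible toolkit (my own example).

```python
from sklearn.cluster import KMeans

# Invented customer features: [orders per year, average order value].
customers = [
    [2, 40], [3, 35], [1, 50],        # occasional, low-spend shoppers
    [25, 60], [30, 55], [28, 70],     # frequent shoppers
    [5, 400], [4, 380],               # rare but big-ticket buyers
]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                          # cluster id assigned to each customer
print(kmeans.cluster_centers_)         # the learned segment centroids
```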

Practically speaking…

There are a bunch of skills needed to become a data scientist:
- Data Analysis Skills
- Data Visualisation skills
- Programming skills (Python, R, Scala, Java)
- Statistics Knowledge (applied stats)
- Distributed system skills

Toolkits
- TensorFlow
- Apache Spark

Further Resources
- [Coursera](http://www.coursera.org/learn/machine-learning)
- Andrew Ng’s http://cs229.stanford.edu/ course
- Book: Pattern Recognition and Machine Learning (Bishop)
- Book: Pattern Classification (Duda, Hart, Stork)
