Thetazero Pubs


Tony Ting
14th Aug 2015

Data Science Specialization - Capstone Project

Build and evaluate a predictive text model

Model description

  • The goal of the capstone project is to produce a predictive text algorithm in R that predicts the next word from a text phrase entered by the user;
  • A simple n-gram model was constructed to predict the next word from the previous 1, 2, or 3 words;
  • Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram, backing off to progressively shorter histories when a longer one has too little data;
  • The model was built and tuned for size and runtime so that it runs with a relatively small memory footprint, and a Shiny app was created to demonstrate it;
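
To illustrate the back-off idea described above, here is a minimal base R sketch that tries the longest available history first and falls back to shorter histories when nothing matches. It is a simplified stand-in, not the project's actual code: candidates are ranked by raw count, and the discounting and back-off weights of a full Katz model are left out. The toy corpus and the names build_ngrams, ngram_counts, and predict_next are hypothetical.

```r
# Toy corpus, assumed for illustration only.
corpus <- c("i want to eat", "i want to sleep", "i want to eat", "you want to go")

# Build all n-grams of size n from one vector of tokens.
build_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all sentences; returns a named integer vector.
ngram_counts <- function(sentences, n) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  grams <- as.character(unlist(lapply(tokens, build_ngrams, n = n)))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# counts[[n]] holds the n-gram counts for n = 1..4.
counts <- lapply(1:4, function(n) ngram_counts(corpus, n))

# Predict the next word from up to max_history previous words.
# Simplification: picks the most frequent continuation by raw count,
# with no discounting or back-off weights.
predict_next <- function(phrase, counts, max_history = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  h <- min(max_history, length(words))
  while (h >= 1) {
    prefix <- paste0(paste(tail(words, h), collapse = " "), " ")
    tab <- counts[[h + 1]]
    hits <- tab[startsWith(as.character(names(tab)), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))
    }
    h <- h - 1  # back off to a shorter history
  }
  # No history matched: fall back to the most frequent single word.
  unigrams <- counts[[1]]
  names(unigrams)[which.max(unigrams)]
}

predict_next("i want to", counts)     # matched on the 3-word history
predict_next("they want to", counts)  # backs off to the 2-word history
```

In the second call the 3-word history "they want to" never occurs in the toy corpus, so the lookup drops the oldest word and retries with "want to".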

Data Processing Pipeline

  • The sample dataset provided by SwiftKey was downloaded and loaded through readLines();
  • Random sampling was performed on the dataset to reduce the training set and to provide a test set;
  • The sampled data was then cleaned (removing extra spaces, punctuation, numbers, and profanity);
  • The cleaned data was then fed into an n-gram tokenizer to create the relevant n-gram TermDocumentMatrix;
  • The n-gram data was then fed through a Naive Bayes model;
  • A Shiny app was created that uses the model data above to predict the next word for user input text;
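
The sampling, cleaning, and n-gram counting steps above can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the project's code: the in-memory lines vector stands in for text loaded with readLines(), the 60/40 split and the one-word profanity list are arbitrary placeholders, and a plain frequency table stands in for the TermDocumentMatrix. The names clean_text, make_ngrams, and ngram_table are hypothetical.

```r
set.seed(2015)  # arbitrary seed so the sample is reproducible

# Stand-in for text loaded with readLines(); the real corpus is far larger.
lines <- c("I can't WAIT for the 2 games tonight!!!",
           "The weather is   great today, isn't it?",
           "See you at 10am darn it.",
           "Thanks for the follow :)",
           "We are going to the game tonight.")

# 1. Random sampling: keep part of the lines for training, the rest for testing.
n_train   <- round(0.6 * length(lines))  # assumed split ratio
train_idx <- sample(seq_along(lines), size = n_train)
train <- lines[train_idx]
test  <- lines[-train_idx]

# 2. Cleaning: lower-case, drop digits and punctuation, squeeze spaces,
#    and remove banned words. Returns a list of token vectors.
#    Note: the pattern keeps only a-z, so non-ASCII letters are dropped too.
profanity <- c("darn")  # hypothetical placeholder list
clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)
  x <- gsub("'", "", x, fixed = TRUE)  # keep contractions as one token
  x <- gsub("[^a-z ]+", " ", x)        # digits, punctuation, symbols
  tokens <- strsplit(trimws(x), " +")
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% banned)])
}

# 3. N-gram tokenizing and counting.
make_ngrams <- function(tok, n) {
  if (length(tok) < n) return(character(0))
  vapply(seq_len(length(tok) - n + 1),
         function(i) paste(tok[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngram_table <- function(token_list, n) {
  grams <- as.character(unlist(lapply(token_list, make_ngrams, n = n)))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = as.character(names(freq)),
             count = as.integer(freq),
             stringsAsFactors = FALSE)
}

train_tokens <- clean_text(train, banned = profanity)
bigrams <- ngram_table(train_tokens, 2)
head(bigrams)
```

Keeping the counts in a two-column data frame makes it easy to drop rare n-grams before building the model, which is one way to reduce the memory footprint.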

Shiny Applications

  • The application can be found here
  • Type your desired text phrase into the text input box;
  • The Shiny app will return the predicted next word given the phrase entered;
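
The interaction described above (a text input box and a predicted next word) could be wired up roughly as in the sketch below. It is not the published app: predict_next_word is a hypothetical placeholder with a tiny lookup table where the trained model would go, and the input and output ids are made up for illustration. The Shiny part is skipped if the shiny package is not installed.

```r
# Placeholder predictor (hypothetical); the real app would query the model data.
predict_next_word <- function(phrase) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return("")
  lookup <- c(to = "the", of = "the", thank = "you")  # toy table
  guess <- lookup[tail(words, 1)]
  if (is.na(guess)) "the" else unname(guess)
}

if (requireNamespace("shiny", quietly = TRUE)) {
  # UI: one text box for the phrase, one text output for the prediction.
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next word prediction"),
    shiny::textInput("phrase", "Enter a phrase:"),
    shiny::textOutput("prediction")
  )

  # Server: re-run the predictor whenever the input text changes.
  server <- function(input, output) {
    output$prediction <- shiny::renderText(predict_next_word(input$phrase))
  }

  app <- shiny::shinyApp(ui = ui, server = server)
  # shiny::runApp(app)  # uncomment to launch the app locally
}
```

Because renderText is reactive, the prediction updates as the user types, with no submit button needed.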

Project Resources
