Capstone Presentation

Stephane Santos
Jan 20, 2016

Slide 1 - Description of the app

This app was developed for the Capstone project leading to the data science certification provided by Johns Hopkins University through Coursera. Its purpose is to:

  • Allow the user to enter a sentence.
  • Process the sentence into words or ngrams.
  • Compare the user ngrams to a corpus of ngrams extracted from tweets, blogs and news.
  • Provide the user with the most likely next word and its associated probability.

Please allow a few seconds for the first prediction: the datasets are being loaded and cached.

Slide 2 - Corpus of Data

To build the training dataset we:

  • Extracted 3000 lines at random from each of the following 3 sources: tweets, blogs and news.
  • Cleaned each line by removing punctuation, extra white space, numbers and stopwords.
  • Built 4 datasets, one for each type of ngram:

    • unigrams: single words
    • bigrams: sets of 2 words
    • trigrams: sets of 3 words
    • quadrigrams: sets of 4 words

  • For each of these 4 datasets, computed the frequency of occurrence of each ngram.
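
The cleaning and counting steps above can be sketched roughly as follows. This is an illustration only, not the code used for the app: the stopword list, the function names `clean_line` and `count_ngrams`, and the two sample lines are all assumptions, and the random sampling of 3000 lines per source is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the list used for the app is not given.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def clean_line(line):
    """Lowercase, drop punctuation and numbers, collapse white space, remove stopwords."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)  # punctuation and digits become spaces
    return [w for w in line.split() if w not in STOPWORDS]

def count_ngrams(lines, n):
    """Return a Counter mapping each tuple of n consecutive words to its frequency."""
    counts = Counter()
    for line in lines:
        words = clean_line(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up sample lines standing in for the tweets, blogs and news extracts.
sample_lines = [
    "The fish and the birds, 2 of them!",
    "Fish and birds are wonderful.",
]

# One frequency table per ngram size, from 1 (unigrams) to 4 (quadrigrams).
tables = {n: count_ngrams(sample_lines, n) for n in range(1, 5)}
```

Keying each table by a tuple of words keeps the later lookup simple: an ngram+1 "starts with" the user's ngram when its first n elements are equal to it.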

Slide 3 - Description of the algorithm (MLE)

The first algorithm the app goes through is the Maximum Likelihood Estimation algorithm.

The MLE algorithm:

  • Evaluates the sentence entered by the user to determine which type of ngram we are dealing with.

  • Takes the ngram+1 dataset and looks for the ngrams+1 that start with the ngram entered by the user.

  • Provides the next word and its associated probability, calculated as the frequency of occurrence of the ngram+1 ending with that word divided by the total number of occurrences of the matching ngrams+1.

  • Prompts a message to the user in case of failure.
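
As a minimal sketch of the lookup described above, assuming the ngram+1 dataset is held as a mapping from word tuples to counts: the function name `mle_next_word` and the toy trigram counts are hypothetical, and the probability is taken to be the count of the best continuation divided by the total count of all ngrams+1 starting with the user's ngram.

```python
from collections import Counter

# Hypothetical toy trigram counts, standing in for the real ngram+1 dataset.
trigram_counts = Counter({
    ("fish", "birds", "fly"): 3,
    ("fish", "birds", "swim"): 1,
    ("birds", "fly", "south"): 2,
})

def mle_next_word(prefix, counts):
    """Return (word, probability) for the most frequent continuation of
    `prefix`, or None when no ngram+1 starts with it."""
    prefix = tuple(prefix)
    # Keep only the ngrams+1 that start with the user's ngram.
    matches = {ng[-1]: c for ng, c in counts.items() if ng[:-1] == prefix}
    if not matches:
        return None  # the caller can then show a failure message
    total = sum(matches.values())
    word = max(matches, key=matches.get)
    return word, matches[word] / total

result = mle_next_word(["fish", "birds"], trigram_counts)
print(result)  # ('fly', 0.75)
```

With these toy counts, "fish birds" is followed by "fly" 3 times out of 4, hence the probability of 0.75. A scan over the whole table is fine for a sketch; a real dataset would more likely be indexed by prefix.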

Slide 4 - Description of the algorithm (Back-Off algorithm)

The second algorithm the app goes through is a Back-off algorithm.

The Back-off algorithm:

  • Evaluates the sentence entered by the user and extracts the last bigram.

  • Looks for that bigram in the trigram dataset.

  • In case of failure to find the next word:

    • it first looks for the last unigram in the bigram dataset
    • as a last resort, it provides the most frequent unigram, which is 'the'
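
The fallback order above could look something like this. It is a sketch under assumptions: the three count tables are made-up toy data, and the names `best_continuation` and `backoff_next_word` are hypothetical rather than taken from the app.

```python
from collections import Counter

# Hypothetical toy counts standing in for the real unigram, bigram and trigram datasets.
unigrams = Counter({("the",): 10, ("fish",): 4, ("birds",): 3})
bigrams = Counter({("birds", "fly"): 2, ("birds", "sing"): 1})
trigrams = Counter({("fish", "birds", "swim"): 1})

def best_continuation(prefix, counts):
    """Most frequent last word among the ngrams starting with `prefix`, or None."""
    prefix = tuple(prefix)
    matches = {ng[-1]: c for ng, c in counts.items() if ng[:-1] == prefix}
    return max(matches, key=matches.get) if matches else None

def backoff_next_word(words):
    """Try the last bigram, then the last unigram, then the top unigram."""
    # Step 1: look for the last bigram of the sentence in the trigram dataset.
    if len(words) >= 2:
        word = best_continuation(words[-2:], trigrams)
        if word is not None:
            return word
    # Step 2: look for the last unigram in the bigram dataset.
    if words:
        word = best_continuation(words[-1:], bigrams)
        if word is not None:
            return word
    # Step 3: last resort, the most frequent unigram overall.
    return unigrams.most_common(1)[0][0][0]

print(backoff_next_word(["fish", "birds"]))   # swim
print(backoff_next_word(["happy", "birds"]))  # fly
print(backoff_next_word(["chicago"]))         # the
```

Each step only runs when the previous one found nothing, so the prediction always comes from the longest context available in the data.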

Slide 5 - Trying the app & possible enhancements

The following phrases have been tested successfully:

  • "Type the sentence" for which $ you wish to see the next
  • chicago
  • I knew illinois is a wonderful
  • fish and birds
  • I was learning that statistical area

Possible enhancements we would foresee given more time would be:

  • Evaluating the performance of the Good-Turing and Kneser-Ney smoothing algorithms
  • Optimizing the dataset size

Both of these actions would help address the accuracy vs. performance dilemma this app poses.