Click here to Skip to main content
13,146,014 members (42,687 online)
Click here to Skip to main content
Add your own
alternative version


6 bookmarked
Posted 12 Sep 2017

[tut 2] Lina ChatBot - Question Answering using parsing CKY & web scrapping

, 12 Sep 2017
Rate this:
Please Sign up or sign in to vote.
question answer chatbot using natural language parsing and web scrapping

Series Introduction

tut1 retrival based chatbot 

This is part 2 of a series of multiple tuts about natural language and machine learning , we would manually create a chatbot able

  • to chat with users
  • answer their questions
  • extract data about users
  • perform some actions (recommending movie, telling time ....)



this series's goal is not just to create a chatbot , but it is actually about viewing one practical project that can be achieved using NLP & ML , as my primary goal would be to present both ML & NLP in an intersting way which is to create a chatbot

In this series we would cover some techniques , both simple and advanced to reach our final goal , they are

  • Document retrieval using TF-IDF
  • Document retrieval using Word2Vect 
  • cky Parsing 
  • web scrapping

I hope that through these series you would com to love NLP & ML trough this fun project 

We would like to call our Chatbot Lina :D 


this is part 2 of the series describing how to build a chatbot , today we would discus how to build a system for

  1. identifying questions from normal sentences
  2. looking for answer online

to see last tut describing how to create a retrival based chatbot check this Link

we would use primarily 2 techniques for this 

  1. natural language parsing using an algorithm named CKY
  2. web scrapping from multiple websites using beautiful soap library

in this tut we would give a quick brief on parsing as general , and a quick brief of how the cky natural language parsing algorithm works , and then we would start the fun part of working on our ChatBot to make use out of parsing and web scrapping


attached are 2 files 

  1. tut2 only
  2. tut1 + tut2 

Technical Background Agenda (a)

this section is optional if you need to know more about the subject , if you need to get started with the code check Implementation 

  1. why parsing & not classification & not simple regex
  2. why parsing
  3. what is PCFG
  4. what is CKY parsing algorithim 

Implementation Agenda (b)

if you would like to skip the technical background and start on with the implemenation , see these links

  1. cky algorithim from stat_parser
  2. parsing getting tree output then traverse to get string 
  3. regex through already known parses
  4. calling duck duck go api
  5. web scrapping from yahoo answers
  6. call the functions in a while 1 loop
  7. connect to prev tut 



  1. CKY implemetaion by  
  2. Dan Jurafsky & Chris NLP Playlist



  1. NLTK
  2. CKY implemetaion by  
  3. Beautiful Soap



Technical Background

1a - why parsing & not classification & not simple regex ?

we would need to identify questions , so for doing so , if we used classification techniques (any classification algorithm like naive bayes for example ) , we would require to train our classifier on 

question structure (like questions of how many , what is color of , who is .....)

but it doesn't stop at this , as your classifier must understand that

  • how much in the same class as  how many and
  • how old in the same class as how big
  • what is in the same class as who is

this would end up having many an enormous number of train sentences , or if you tried regex , you would end up having enormous number of rules that actually are the same 

so both methods of classification & simple regular expressions , would won't achieve being highly efficient 


2a - Why Parsing ?

as we have seen in the previous examples 

  • how much &  how many are actually the same , as much & many are both adjectives
  • how old &  how big are actually the same , as old & big are both adjectives  , actually that are exactly the same as previous point
  • what is & who is are actually the same , as much & many are both adjectives

so we can like replace the actual work of (big , old ....) and replace them by another word (called terminal)  that describes that this word is an adjective

this is precisely what the parsing does , it tries to know what this belongs to , is it a noun , verb , adjective , adverb

this task is called Natural Language Parsing , as there are many uses for parsing , as any high level language has its own parsing system to convert it to assembly code , but for our task here we would use parsing but on Natural Language

there is a well known algorithm for this called CKY , but it works with PCFG , but what are PCFG ? 


3a - What is PCFG ?

PCFG stands for  probability context free grammar ,

  1. grammar stands for they are the grammar of words inside a language , how the words are arranged inside a language , like the rules of a certain language
  2. context free  stands for that grammar is built in a non intersecting way 
  3. probability  stands for how probable a certain grammar rule over the other


all rules here form what we call grammar

it simply describes that , there are 2 ways to create a subject

  1. when NP (noun phrase) then a VP (verb Phramse) comes after it , this would create a sunbject , and this rule counts foe a probabiluty of 0.9
  2. just by having a verb phrase , this would create a subject with prob of 0.1

same goes for all other non terminals 



4a - what is CKY parsing algorithim ?

CKY parsing algorithim stands for the method of parsing natural language sentences , it is developed using dynamic programming , it uses PCFG that we have previously discussed.

We would go brefly on the very concpet of how it is built , as building the cky algorithim itself is not our goal , our goal is to put it in a good use in our chatbot , but it is extremly important to know how it is built , so lets begin !!


There are 2 main points for CKY to work proparly

  1. CKY only uses PCFG that are Binary (that only contains 2 terms on right hand side ) so if wou have a rule with more than 2 terms on right hand side , you must break them to multiple binary rules
  2. Assume we have a dictonary which contains each word and the prob that it is a noun , verb , adjective .....

Assume we need to parse we need to parse a sentence "people fish in the sea" to make it simpler we would only parse as if the sentance was "people fish"

assume give grammar (PCFG grammar) is

CKY works with something called a parse triangle (chart) , it is just a way of visualizing how cky parse the sentance

here we assume that 

  1. People is (from the predifned dictionary of words with their probability )
    • Noun Phrase with prob of 0.35
    • verb with prob of 0.1
    • Noun with prob of 0.5 ==> this makees sense as people is after all a noun
  2. Fish is (from the predifned dictionary of words with their probability )
    • Verb Phrase with prob of 0.06
    • Noun Phrase with prob of 0.14
    • verb with prob of 0.6 ==> this makees sense as people is after all a verb
    • Noun with prob of 0.2

but when choosing our pair of terminals (VP , NP ,V , N) we don't just choose the max out of all prob , but we test on all Possible grammar rules pairs to get max

so in the previous example we chose to begin testing with    NP → NP NP  ,   since both people and fish can be NP (that is with differnet prob of course) 

so we would go to the next higher level to get the prob that such a rule could be genereated , so to do so we multiply

  1. prob of people being NP
  2. prob of fish being NP
  3. prob of the rule NP → NP NP  itself

this would result in   0.35 x  0.14 x 0.1 = 0.0049

But we won't just stop here , we must continue , till we loop through all Possible Rules that "people fish" can make

so lets choose  VP → V NP  this would output 0.007 (we simply write NP as our previous guess)

then continue on with choosing NP & VP to create S     S → NP VP  , this would create a prob of 0.0189 , (we also write the previous guesses)



BUT in

CKY we also take unary PCFG into consideration, so in each step (after the first non-terminal
step , you see if the output of this rule can be generated by a unary rule or not) ,

  S → VP  of prob 0.0007

so we have 2 methods for creating S 

  1.  Either using both NP and VP with prob 0.0189
  2. Or by only using the already generated VP with prob 0.0007

so we only use the one with the highest prob and discard the other

so now after looping through all possible combinations we have 3 terminals as our output  ,  3 possible guesses for how to parse "people fish" 

  1. noun phrase with prob  0.0049 
  2. verb phrase with prob 0.007
  3. complete sentance with prob 0.0189

here CKY would choose to parse "people fish"  as complete sentance composed of noun phrase "people" and a verb phrase "fish"


We then use this concept through the whole parse triangle until we reach its top , as each step we go next upper  level till reaching the top of the triangle ,  in this way we would have achieved a way of parsing sentences in square of time not in an exponential one



1b - cky algorithim from stat_parser

There is a trully smart implentaion develped by  found on github for cky parsing ,  it is trully unique as it doesn't contain pre defined pcfg , but it learns it from QuestionBank and Penn treebanks. 

This makes the learnt PCFG trully realistic , not simple arbitary probablites for each rule , but true learnt values , this reflects perfectly on the results that this code outputs



2b - parsing to get tree output then traverse to get string 

first we would import the code for cky & nltk library as NLTK would help in storing of the tree

# _____for parsing_____
from stat_parser.parser import Parser, display_tree
import nltk


lets define a function that takes a sentance and displays the output of parse as a tree

def parse(sentence):

    # first we create a parser obj
    parser = Parser()
    # put parsing in a try and catch 
        # actually parse the sentance
        tree = parser.parse(sentence)
        return False, ""

    # display the tree as string
    print tree

    # display the tree in a gui form


this would be the tree gui when parsing "can you tell me who is Obama"


to actually work on the output it would be much easier if we could have the output just 

- MD - PRP - VB - PRP - WP - VBZ - NNP -

So we would need to traverse through the tree

so we use a recursive function for traversing 

def parse(sentence):

   parser = Parser()

     tree = parser.parse(sentence) 
     return False, ""
   print tree

   #just call the function through passing tree[i]

    for i in range(len(tree)):
        traverse(tree[i], 0)

now lets see the function itslef

# this function uses a global variable array to store the leaves
tree_output = []

# and a dictionary to save nouns
imp_list_array = {'Noun': []}

def traverse(parent, x):
    # all our code is written in a try catch clauses
        # we loop through all children within the passed node
        for node in parent:

            # if this node is a tree , then loop till you reach the leaves
            if type(node) is nltk.Tree:

                # if the node is the root  ==> do nothing , as we will have reached our final goal
                if node.label() == 'ROOT':
                    # "======== Sentence ========="
                    # print "Sentence:", " ".join(node.leaves()) , " +  type " , node.label()
                    a = 6

                # else , call the function again on this node ==> till reaching leaves
                    element_type = node.label()
                    element_value = node.leaves()[0]
                    element_sentence = node.leaves()

                    # just a condition i have written t extract nouns
                    if str(element_type) == 'NN' or str(element_type) == 'NNS' or str(element_type) == 'NNP' or str(
                            element_type) == 'NNPS':

                    # call the function again , recursivly
                    traverse(node, x)
                # actually add the leaves to the array 

    # if the above code failed , simply do nothing


since we are using global variables itt would be better if we clear these variables at each parse

def parse(sentence):
    while len(tree_output) > 0:

    parser = Parser()
        tree = parser.parse(sentence)
        #print tree
        return False, ""

    # display_tree(tree)
    #print("parse succeeded")

    for i in range(len(tree)):
        traverse(tree[i], 0)


3b - regex through already known parses

def parse(sentence):
    while len(tree_output) > 0:

    parser = Parser()
        tree = parser.parse(sentence)
        return False, ""

    for i in range(len(tree)):
        traverse(tree[i], 0)

    tree_output_str = ""

    # first we would convert the array to a string 
    for a in tree_output:
        tree_output_str += " - " + a

    print  tree_output_str

    # here we would save the parses that we are interested in
    special_parses = [
        "WRB - JJ - NNS",  # how many Leopards
        "WRB - JJ - JJ",  # how many leopards
        "WRB - JJ - VBP - DT - NN",  # how big are the pyramids
        "WRB - JJ - VBZ - JJ",  # how old is obama
        "WRB - JJ - VBZ - NNP",  # how old is Obama
        "WRB - JJ - NN - VBP - NNP - VBP",  # how much money do Bill have
        "WRB - VBP - DT - NN",  # where are the pyramids
        "WP - VBD - DT - NN",  # who won the champions last week        #when was the tv first invented
        "WP - VBP - DT - NN",  # what are the pyramids
        "WP - VBZ - DT - NN - IN - NN",  # what is the capital of egypt

        "WDT - NNS",  # which companies are the biggest ,
        "WRB - VBZ - NN",  # where is egypt
        "WRB - VBZ - DT - NN" ,    # where is the usa

        "WP - VBZ - NNP",  # what is Egypt
        "WP - VBZ - JJ",  # what is egypt
        "WRB - VBD - NNP",  # when did Bayern
        "WP - VBZ - NN" ,  # what is indonesian      

         "WP - VBZ - DT - JJ - NN - IN - NN"  #who is the current president of usa


        # add all elements in special_parses as a regex expression
        # putting "|" between them
        regex = reduce(lambda x, y: x + "|" + y, special_parses)
        # do the actual regex
        # and get the position where the question is within the sentance
        # as a user can ask "could you tell me who is Obama"
        # we would only need to get "who is Obama"
        pos_tree_output = tree_output_str.index(, tree_output_str).group(0))
        pos_var = len(tree_output_str.replace('-', '').split()) - len(
            tree_output_str[pos_tree_output:].replace('-', '').split())
        # do the extraction itself from the index known from previous lines
        fact_question = ' '.join(sentence.split()[pos_var:])

        print("it is a fact question")
        # ----scrap solution from interent-------#
        # ----we would scrap from 2 sources------#
        # duck duck go
        answr_text = look_in_duck_go(fact_question)
        #if not found in duck duck go look inside yahoo answers
        if answr_text=="":
            # yahoo answers
            answr_text = look_in_yahoo_answers(fact_question)
        return True, answr_text

        print("not fact question")
        return False, "not fact question"


4b - calling duck duck go api

duck duck go provides a simple api (without even having api key) to get answers to questions , but we first must import some libraries

# _____for scrapping_____
import re
from bs4 import BeautifulSoup
import requests

Duck Duck Go returns answer either in xml or json , we would choose xml , and we would use BeautifulSoup to extract data out of it 

the xml response would look like

we would extract the abstract tag , and image tag

but if abstract tag is empty we would try to look for text tag

now lets write the function itself

def look_in_duck_go(fact_question):
          # we would call the api , attach the question , ask for xml format
          base_address = "" + fact_question + "&format=xml"
          # get the webpage
          super_page = requests.get(base_address)

          # pass it to Beautiful soap
          soup_super_page = BeautifulSoup(super_page.content, "xml")

          # extract all Abstract tag , and use the first one
          answer = soup_super_page.findAll('Abstract')[0].text

          # extract the image
          Image = soup_super_page.findAll('Image')[0].text

          # if the abstract tag is empty , look for the text tag
          if (answer == ""):
              answer = soup_super_page.findAll('Text')[0].text
          print("found in duck duck go")
      return  answer



5b - web scrapping from yahoo answers

Duck Duck Go only answers what & who questions , so there would be many other questions left unanswered , so we wold use yahoo answers for that , given its huge community , most of the questions can be found 

this is how you can use yahoo answers

this would return 

we would use the answer found inside the first link 

this is the link in html

 <a class="<code> lh-17</code> fz-m"


referrerpolicy="unsafe-url" target="_blank" id="yui_3_10_0_1_1505224373621_408">

<b id="yui_3_10_0_1_1505224373621_431">many</b> 
<b id="yui_3_10_0_1_1505224373621_407">Egypt</b>


what is really important is the class name which is lh-17

we extract the link (href)

so our function would be 

def look_in_yahoo_answers(fact_question):

      # to replace empty spaces with %20
      fact_question = fact_question.replace(' ' , '%20')

      # write the website path
      base_address =""+fact_question
      #access it
      super_page = requests.get(base_address) 

      # load it into BeautifulSoup
      soup_super_page = BeautifulSoup(super_page.content, "lxml")

      # get the first link ==> <code>lh-17
     </code># and extract the href attribute
      answr_link = soup_super_page.findAll('a', {'class' : 'lh-17'})[0].attrs["href"]

      #--------------- we still need to get real answer--------#

      return answr_text


then after getting the link of the answer , this is how it would look like 

we would need to get be the best answer

which look is in html

<span itemprop="text" class="<code>ya-q-full-text</code>" id="yui_3_17_2_5_1505225015420_1884">
There are more than one hundred pyramids. <br>
The reason the exact number is not fixed is that
 there are unfinished or not in good shape pyramids. 
So it is debatable for some of them if they would be counted as pyramids or not. <br>
The perfect ones are the three pyramids of Giza. 


we are interested in the class name which is ya-q-full-text

so we extract its text

def look_in_yahoo_answers(fact_question):
      fact_question = fact_question.replace(' ' , '%20')

      base_address =""+fact_question
      super_page = requests.get(base_address)#, headers=headers)

      soup_super_page = BeautifulSoup(super_page.content, "lxml")

      answr_link = soup_super_page.findAll('a', {'class' : 'lh-17'})[0].attrs["href"]


      # get the page corresponding to the link we have just extracted
      answer_page = requests.get(answr_link)

      # load page into BeautifulSoup
      soup_answer_page = BeautifulSoup(answer_page.content, "lxml")

      # extract the answer
      answr_text = soup_answer_page.findAll('span', {'class' : '<code>ya-q-full-text</code>'})[0].text

      print("found in yahoo answers")

      # return the answer
      return answr_text



6b - call the functions in a while 1 loop

lets put all the above code in a user friendly interface

while 1:

    # empty the global variables
    tree_output_str =""

    fact_question = raw_input("talk to Lina : ")

    # just call the parse function
    answr_text = parse(fact_question)

    print answr_text



here we would create a simple python script that would use tut1 & tut2 , to create a more complete chatbot , this chatbot now would be able to 

  1. chat with users using retrival modal using TF-IDF (see prev tut for more info)
  2. qnswer user's questions using parsing & web scrapping

so lets start with a python script called tut_1_2

import tf_idf # tut1
import parsing_scrapping  # tut2

we would comment all while 1 in both tf_idf & parsing_scrapping (disable while 1 in inner code)

and create a new while 1 in tut_1_2

we will start with parsing module before using TF-IDF , as if the question is a fact question this module would easily identify it as fact question while TF-IDF would try its best to respond as if it is a norml chat , we don't need this , so we would start with the parsing module


while 1:
    sentence = raw_input("talk to Lina : ")

    # start with parsing & scrapping
    # empty the global variables
    parsing_scrapping.tree_output_str =""
    # call the parsing module
    # it would return 2objs
    # answr_text[0] bool either fact question or not
    # answr_text[1] the answer 
    answr_text = parsing_scrapping.parse(sentence) 

    # if it is a fact question
    if answr_text[0] :
         print answr_text[1]

    # if normal chat
    else :
        response_primary, line_id_primary = tf_idf.talk_to_lina_primary(sentence)
        print response_primary







Please tell me what do you think about this tut , tell me your review and vote this tut , I also would suggest seeing last tut , and wait for coming tuts isA




This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

The Zakies
Software Developer (Junior)
Egypt Egypt
No Biography provided

You may also be interested in...

Comments and Discussions

-- There are no messages in this forum --
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.170915.1 | Last Updated 12 Sep 2017
Article Copyright 2017 by The Zakies
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid