The modern internet contains vast amounts of information. By adding meta information to pages we can make content easier to find for search engines and, ultimately, for our users.
Most blogs give posts categories and tags, and clicking a tag takes us to related posts. Usually this is a manual step where the author fills in the tags most relevant to the post; but what if we could automate the process? With a basic understanding of natural language processing and some simple methods, we can write a script that automatically suggests tags for any article.
What Words are Key?
Imagine explaining what a fire engine is to someone who has never seen one before. How would you start? It would be fairly easy to use common descriptive words such as large, red, vehicle or sirens; using this information a person could get an idea of what a fire engine is without ever having seen one. This process is similar to how an automatic categorisation system might work. The key difference is that a computer doesn't know what a fire engine is any more than it knows what a banana is, but with some clever libraries it can extract relevant information from a block of text and use that. In machine learning this is known as a statistical classification problem.
Older classification systems relied on hand-made sets of keywords (a corpus) that defined each classification. For example, an algorithm might find words related to fish and, based on the corpus, know to classify the document as being about fish.
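To make that idea concrete, here is a minimal sketch of such a keyword-lookup classifier; the topics and keyword sets here are invented purely for illustration –

# A hand-made 'corpus' of keywords for each classification (illustrative values only)
corpus = {
    'fish': {'fish', 'fin', 'gill', 'trout', 'salmon'},
    'cat': {'cat', 'feline', 'kitten', 'housecat'},
}

def classify(words):
    # Score each topic by how many of its keywords appear in the document
    scores = {topic: len(keywords & set(words)) for topic, keywords in corpus.items()}
    return max(scores, key=scores.get)

print(classify(['the', 'trout', 'swam', 'past', 'the', 'salmon']))  # fish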
The method presented in this article uses those older systems as a foundation and is a form of extractive summarisation. More advanced methods exist that generate a human-readable summary of an article, but they are outside the scope of this article.
Our Document
Let's create a simple document to parse and classify later on. We will take the first couple of sentences from a Wikipedia article –
The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests.
Data Representation
We have a document, but it's not much use to us as it is; we need to process it into a more useful data set. The first steps are to convert the document to lower case, split it into words, remove punctuation and remove common words such as 'is', 'the' and 'and' (stop words) –
document = "The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests."
stop = ['the', 'is', 'a', 'are', 'at', 'or', 'no', 'by', 'for', 'and', 'to', 'an', 'as']

# Strip punctuation so 'mammal.' and 'mammal' are treated as the same word
document = document.replace(',', '')
document = document.replace('.', '')

# Lower-case the text, split it into words and drop the stop words
print([x for x in document.lower().split(" ") if x not in stop])
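If everything is in place, this should print a list along these lines –

['domestic', 'cat', 'small', 'usually', 'furry', 'domesticated', 'carnivorous', 'mammal', 'they', 'often', 'called', 'housecat', 'when', 'kept', 'indoor', 'pet', 'simply', 'cat', 'when', 'there', 'need', 'distinguish', 'them', 'from', 'other', 'felids', 'felines', 'cats', 'often', 'valued', 'humans', 'companionship', 'their', 'ability', 'hunt', 'pests']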
This gives us a pretty useful list of words that we can play with.
Preprocessing
As we can see, our input document contains both cat and cats. We know they are the same word, but a computer won't; in order to reduce cats to cat we need to use a technique called stemming. Stemming is the process of reducing a word to its base form or stem, for example reducing plurals to singulars. To accomplish this we will use a stemming library.
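As a quick illustration of what the stemmer does to individual words (using the same stemming package as the full script below) note that a stem is not always a dictionary word –

from stemming.porter2 import stem

# Porter2 cuts suffixes back to a common stem
print(stem('cats'))     # cat
print(stem('felines'))  # felin
print(stem('usually'))  # usual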
Updating our script to stem and count the filtered words –

#!/usr/bin/python
from stemming.porter2 import stem
from collections import Counter

document = "The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests."
stop = ['the', 'is', 'a', 'are', 'at', 'or', 'no', 'by', 'for', 'and', 'to', 'an', 'as']

# Strip punctuation as before
document = document.replace(',', '')
document = document.replace('.', '')

# Stem each remaining word so that 'cats' and 'cat' count as one word
filtered = [stem(x) for x in document.lower().split(" ") if x not in stop]

# Tally how often each stem appears
counted = Counter(filtered)
print(counted)
The output of this is –
Counter({'cat': 3, 'often': 2, 'when': 2, 'domest': 2, 'simpli': 1, 'indoor': 1, 'human': 1, 'felid': 1, 'need': 1, 'abil': 1, 'felin': 1, 'housecat': 1, 'from': 1, 'companionship': 1, 'pet': 1, 'there': 1, 'their': 1, 'other': 1, 'call': 1, 'furri': 1, 'them': 1, 'they': 1, 'distinguish': 1, 'valu': 1, 'kept': 1, 'hunt': 1, 'carnivor': 1, 'pest': 1, 'small': 1, 'mammal': 1, 'usual': 1})
Our most common word is cat, which is a fantastic start for very little processing. We could even take the most common word and leave it at that, as clearly the document in question is about cats.
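Since counted is a Counter, pulling out that top word is a one-liner –

# The most common stem and its count, from the model above
print(counted.most_common(1))  # [('cat', 3)]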
Part of Speech Tagging
While we have a basic model that gives us a topic, we can't always rely on simple word frequencies to extract the theme of a document. A major part of natural language processing is part of speech (POS) tagging, which takes in a sentence and tags each word with a type such as noun, verb or adjective; for example –
- heat – verb (noun)
- water – noun (verb)
- in – prep (noun, adv)
- a – det (noun)
- large – adj (noun)
- vessel – noun
To tag our words we will use a tagger from the Natural Language Toolkit (NLTK), an NLP library for Python.
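As a quick sketch of the tagger in isolation, we can run it over the phrase from the list above; the tagger model may need downloading first, and the tags it assigns will not necessarily match the idealised ones listed –

import nltk
from nltk.tag import map_tag

# nltk.download('averaged_perceptron_tagger')  # uncomment if the tagger model is missing
tagged = nltk.pos_tag("heat water in a large vessel".split())

# Map the detailed Penn Treebank tags onto the simpler universal set (NOUN, VERB, ADJ, ...)
print([(word, map_tag('en-ptb', 'universal', tag)) for word, tag in tagged])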
Let's add the tagging to our code so that we have two models of our document: the first being the word count and the second being the part-of-speech-tagged model.
#!/usr/bin/python
from collections import Counter
import nltk
from nltk.tag import pos_tag, map_tag

# The lemmatizer needs the WordNet data; run nltk.download('wordnet') if it is missing
lemma = nltk.wordnet.WordNetLemmatizer()

document = "The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests."
stop = ['the', 'is', 'a', 'are', 'at', 'or', 'no', 'by', 'for', 'and', 'to', 'an', 'as']

document = document.replace(',', '')
document = document.replace('.', '')

# Lemmatize rather than stem so we keep whole, readable words ('cats' -> 'cat')
filtered = [lemma.lemmatize(x) for x in document.lower().split(" ") if x not in stop]
counted = Counter(filtered)

# Tag each word, then map the detailed Penn Treebank tags onto the simpler universal tag set
posTagged = nltk.tag.pos_tag(filtered)
simplifiedTags = {(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged}
print(simplifiedTags)
This will run the part of speech tagging on our list of filtered words and give us a simplified description of each word. As the aim is to identify nouns, we do not worry so much about parsing the document as a whole. The resulting list is as follows –
set([('need', u'VERB'), ('hunt', u'NOUN'), ('carnivorous', u'ADJ'), ('furry', u'ADJ'), ('cat', u'ADJ'), ('other', u'ADJ'), ('called', u'VERB'), (u'felid', u'ADJ'), ('distinguish', u'ADJ'), ('companionship', u'NOUN'), ('cat', u'NOUN'), ('usually', u'ADV'), ('them', u'PRON'), (u'feline', u'NOUN'), ('from', u'ADP'), ('when', u'ADV'), (u'pest', u'NOUN'), ('housecat', u'ADJ'), ('domesticated', u'VERB'), ('ability', u'NOUN'), ('kept', u'ADJ'), ('indoor', u'NOUN'), ('small', u'ADJ'), ('they', u'PRON'), ('simply', u'ADV'), ('valued', u'VERB'), ('often', u'ADV'), (u'human', u'ADJ'), ('domestic', u'ADJ'), ('there', u'DET'), ('their', u'PRON'), ('mammal', u'ADJ'), ('pet', u'NOUN')])
Keyword Extraction
Now we will combine our two data models. First we take the most common words and then look up their tag type; if a word is a noun, we can say it is likely to be a topic of our document.
Putting all of our code together it looks like this –
#!/usr/bin/python
from collections import Counter
import nltk
from nltk.tag import pos_tag, map_tag

# Requires the WordNet data for the lemmatizer and a model for the tagger;
# nltk.download() can fetch anything that is missing
lemma = nltk.wordnet.WordNetLemmatizer()

document = "The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests."
stop = ['the', 'is', 'a', 'are', 'at', 'or', 'no', 'by', 'for', 'and', 'to', 'an', 'as']

document = document.replace(',', '')
document = document.replace('.', '')

# Build the two models: word counts and part of speech tags
filtered = [lemma.lemmatize(x) for x in document.lower().split(" ") if x not in stop]
counted = Counter(filtered)
posTagged = nltk.tag.pos_tag(filtered)
simplifiedTags = {(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged}

# Take the five most common words and print those that were tagged as nouns
for word, count in counted.most_common(5):
    for tag in simplifiedTags:
        if word == tag[0] and tag[1] == "NOUN":
            print(word)
And the resulting words from our test document that have been identified as possible keywords –
cat
feline
indoor
Improvements
This is a very basic parser and could benefit from improvements in the following areas –
- Improve the stop word list; NLTK comes with several large stop word lists for various languages (one of these is used in the sketch after this list).
- Try to identify key phrases rather than single key words. For example, the document contains the phrase "carnivorous mammal", which may be very relevant depending on the keywords/phrases we are trying to extract. This could be better solved by using tagging systems to identify the parts of a sentence rather than single words (also shown in the sketch below).
- Relying on word count is useful but basic. This keyword extraction system is very dependent on the key topic words being repeated, which may not always be the case, particularly for more abstract documents. Wikipedia articles are a good example of documents that will likely repeat the topic several times.
- Just checking for nouns is quite limiting. In our example document cats are described as mammals, which could be a very useful tag to extract to help with semantic meta data, but because it is tagged as an adjective it is ignored.
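As a starting point for the first two improvements, here is a minimal sketch using NLTK's built-in English stop word list and a simple adjective-plus-noun chunk grammar to pull out phrases such as "carnivorous mammal"; the chunk pattern is an illustrative guess rather than a tuned grammar, and the stop word corpus may need downloading first –

import nltk
from nltk.corpus import stopwords
from nltk.tag import map_tag

# nltk.download('stopwords')  # uncomment if the stop word corpus is missing
document = "The domestic cat is a small, usually furry, domesticated, and carnivorous mammal."

# NLTK's English list is much larger than our hand-made one
stop = set(stopwords.words('english'))
words = [x for x in document.lower().replace(',', '').replace('.', '').split(" ") if x not in stop]

# Tag the remaining words and chunk runs of adjectives followed by nouns
tagged = [(w, map_tag('en-ptb', 'universal', t)) for w, t in nltk.pos_tag(words)]
chunker = nltk.RegexpParser("KEY: {<ADJ>*<NOUN>+}")

# Print each chunk, e.g. 'carnivorous mammal' if the tagger marks the pair ADJ NOUN
for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == 'KEY'):
    print(" ".join(word for word, tag in subtree.leaves()))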