The book that I’ve been working on for some time is nearly finished. It is aimed at teaching young adults programming in a more formal setting; each chapter is a single lesson that can be taught over 1-2 hours. Some draft chapters are already available on the learning to program page, and the completed chapters will be added here eventually.

The real aim is to produce a set of materials that can be used by students and an additional set of materials aimed at the teachers, including further reading and explanation of what happens in each session as well as model answers and suggested homework.

It will initially be released as an eBook, with print copies available separately.

Automatic Tagging of Articles

The modern internet contains vast amounts of information. We can make content easier to find, for search engines and ultimately for our users, by adding meta information to pages.

Most blogs give posts categories and tags, and clicking a tag takes us to related posts. Usually this is a manual step where the author fills out the tags most relevant to the post, but what if we could automate this process? With a basic understanding of natural language processing and some simple methods, we can make a script that automatically suggests tags for any article.

What Words are Key?

Imagine explaining what a fire engine is to someone who has never seen one before. How would you start? It would be fairly easy to use common descriptive words such as large, red, vehicle or sirens; using this information, a person could get an idea of what a fire engine is without ever having seen one. This process is similar to how an automatic categorisation system might work. The key difference is that a computer doesn’t know what a fire engine is any more than it knows what a banana is, but through some clever libraries it can extract relevant information from a block of text and use that. In machine learning this is known as a statistical classification problem.

Older classification systems relied on hand-made sets of keywords (a corpus) that defined a classification. For example, an algorithm may find words related to fish and, based on the corpus, know to classify the document as a fish document.
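A keyword-set classifier of this older kind can be sketched in a few lines. This is only an illustration: the categories, the keyword sets and the function name `classify` are all made up for the example.

```python
import re

# Hypothetical hand-made keyword sets, one per category (illustrative only).
CATEGORY_KEYWORDS = {
    "fish": {"fish", "cod", "salmon", "trout", "gills", "fins"},
    "cat": {"cat", "cats", "kitten", "feline", "whiskers"},
}


def classify(text):
    """Return the category whose keywords occur most often in the text, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(1 for word in words if word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, refuse to guess a category.
    return best if scores[best] > 0 else None


print(classify("Salmon and trout are fish that breathe through gills."))  # fish
```

The weakness is easy to see: the classifier only knows the categories and words that somebody typed in by hand.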

The method presented in this article will use the older systems as a foundation and will be a form of extractive summarization. More advanced methods exist where a human-readable summary of an article is created, but that is outside the scope of this article.

Our Document

Let’s create a simple document that will be parsed and classified later on. We will take the first few sentences from a Wikipedia article –

The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests.

Data Representation

We have a document but it’s not much use to us as it is; we need to process it into a more useful data set. The first step is to convert the document to lower case, split it into words, remove punctuation and remove common words such as ‘is’, ‘the’ and ‘and’ (stop words).
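A minimal sketch of this step, using only the Python standard library. The stop word list here is an assumption, kept deliberately small for illustration, and the names `tokenize` and `STOP_WORDS` are hypothetical.

```python
import string

DOCUMENT = (
    "The domestic cat is a small, usually furry, domesticated, and carnivorous "
    "mammal. They are often called a housecat when kept as an indoor pet or "
    "simply a cat when there is no need to distinguish them from other felids "
    "and felines. Cats are often valued by humans for companionship and their "
    "ability to hunt pests."
)

# Assumed stop word list, kept deliberately small for illustration.
STOP_WORDS = {"the", "is", "a", "an", "and", "are", "as", "or", "no", "to", "by", "for"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]


words = tokenize(DOCUMENT)
print(words[:6])
# ['domestic', 'cat', 'small', 'usually', 'furry', 'domesticated']
```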

This gives us a pretty useful list of words that we can play with.


As we can see, our input document contains both cat and cats. We know they are the same word but a computer won’t, so in order to reduce cats to cat we need a technique called stemming. Stemming is the process of reducing a word to its base form or stem, for example reducing plurals to singulars. To accomplish this we will use a stemming library.

The output of this is –

Counter({'cat': 3, 'often': 2, 'when': 2, 'domest': 2, 'simpli': 1, 'indoor': 1, 'human': 1, 'felid': 1, 'need': 1, 'abil': 1, 'felin': 1, 'housecat': 1, 'from': 1, 'companionship': 1, 'pet': 1, 'there': 1, 'their': 1, 'other': 1, 'call': 1, 'furri': 1, 'them': 1, 'they': 1, 'distinguish': 1, 'valu': 1, 'kept': 1, 'hunt': 1, 'carnivor': 1, 'pest': 1, 'small': 1, 'mammal': 1, 'usual': 1})

Our most common word is cat which is a fantastic start for very little processing. We could even take the most common word and leave it at that as clearly the document in question is about cats.
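The stem-then-count idea can be sketched without any library. The stemmer below is a toy stand-in that only strips a plural ‘s’; it is not the stemming library used for the output above, and the name `naive_stem` is hypothetical. A full stemmer goes much further, which is how domestic and domesticated both became domest in that output.

```python
from collections import Counter


def naive_stem(word):
    """Toy stemmer: strips a plural 's' only. Not a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# A hand-picked sample of the filtered words from the cat document.
words = ["domestic", "cat", "housecat", "cat", "felids", "felines", "cats", "humans", "pests"]

# Count the stems rather than the raw words, so 'cat' and 'cats' are merged.
counts = Counter(naive_stem(word) for word in words)
print(counts.most_common(1))  # [('cat', 3)]
```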

Part of Speech Tagging

While we have a basic model that gives us a topic, we can’t always rely on simple word frequencies to extract the theme of a document. A major part of natural language processing is part of speech (POS) tagging, which takes in a sentence and tags each word with a type such as noun, verb or adjective; for example –

  • heat – verb (noun)
  • water – noun (verb)
  • in – prep (noun, adv)
  • a – det (noun)
  • large – adj (noun)
  • vessel – noun

To tag our words we will use a tagger from the Natural Language Toolkit (NLTK), which is an NLP library available for Python.

Let’s add the tagging to our code so that we have two models of our document: the first is the word count and the second is the part of speech tagged model.

This will run the part of speech tagging on our list of filtered words and give us a simplified description of each word. As the aim is to identify nouns, we do not worry so much about parsing the document as a whole. The resulting list is as follows –

set([('need', u'VERB'), ('hunt', u'NOUN'), ('carnivorous', u'ADJ'), ('furry', u'ADJ'), ('cat', u'ADJ'), ('other', u'ADJ'), ('called', u'VERB'), (u'felid', u'ADJ'), ('distinguish', u'ADJ'), ('companionship', u'NOUN'), ('cat', u'NOUN'), ('usually', u'ADV'), ('them', u'PRON'), (u'feline', u'NOUN'), ('from', u'ADP'), ('when', u'ADV'), (u'pest', u'NOUN'), ('housecat', u'ADJ'), ('domesticated', u'VERB'), ('ability', u'NOUN'), ('kept', u'ADJ'), ('indoor', u'NOUN'), ('small', u'ADJ'), ('they', u'PRON'), ('simply', u'ADV'), ('valued', u'VERB'), ('often', u'ADV'), (u'human', u'ADJ'), ('domestic', u'ADJ'), ('there', u'DET'), ('their', u'PRON'), ('mammal', u'ADJ'), ('pet', u'NOUN')])

Keyword Extraction

Now we will combine our two data models. First we take the most common words and then look up their tag type; if a word is a noun we can say that it is likely to be a topic of our document.
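The combination step might look something like the sketch below. The counts and tags are abridged by hand from the two outputs shown earlier, and the function name `extract_keywords` and its `limit` parameter are hypothetical choices for this example.

```python
from collections import Counter

# Word counts and (word, tag) pairs, abridged from the outputs shown earlier.
word_counts = Counter({"cat": 3, "often": 2, "when": 2, "indoor": 1, "mammal": 1, "small": 1})
tagged_words = {
    ("cat", "ADJ"), ("cat", "NOUN"), ("often", "ADV"), ("when", "ADV"),
    ("indoor", "NOUN"), ("mammal", "ADJ"), ("small", "ADJ"),
}


def extract_keywords(counts, tagged, limit=3):
    """Return up to `limit` of the most frequent words that were tagged as nouns."""
    # A word counts as a noun if at least one of its tags is NOUN.
    nouns = {word for word, tag in tagged if tag == "NOUN"}
    return [word for word, _ in counts.most_common() if word in nouns][:limit]


print(extract_keywords(word_counts, tagged_words))  # ['cat', 'indoor']
```

Note that mammal is dropped here because its only tag is ADJ, even though a human reader would happily accept it as a keyword.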

Putting all of our code together it looks like this –

And the resulting words from our test document that have been identified as possible key words –

cat feline indoor


This is a very basic parser and could benefit from improvements in the following areas –

  • Improve the stop word list; NLTK comes with several large stop word lists for various languages.
  • Try to identify key phrases rather than key words. For example, the document contains the phrase “carnivorous mammal”, which may be very relevant depending on the key words or phrases we are trying to extract. This could be better solved by using tagging systems to identify the parts of a sentence rather than single words.
  • Relying on word count is useful but basic. This keyword extraction system depends heavily on the key topic words being repeated, which may not always be the case, particularly for more abstract documents. Wikipedia articles are a good example of documents that are likely to mention the topic several times.
  • Just checking for nouns is quite limiting. In our example document cats are described as mammals, which could be a very useful tag to extract to help with semantic metadata, but because it is tagged as an adjective it is ignored.


Language Detection Algorithm

My thesis required dealing with large quantities of Twitter data, and to make it easier for myself I decided to use only English language tweets. Due to the ‘interesting’ grammar and spelling frequently found online, a different algorithm from a standard dictionary test was required. This one checks for common English n-grams (letter groupings) and returns a value for how sure it is that the input is English.

Here are some examples:

Good morning everyone! : 66%
Im So Sick. Really in a bad Position : 50%
Thats the fear of unicorns : 100%
Gnt too com uma saudade daminha namorada vcs nao tem ideia :(((( : 0%
Ils testent les hologrammes pour faire des concerts par des morts. Bientôt même les pas morts feront des hologrammes et vous irez les voir. : 12%
il7ain ilnass ykhl9un men habat twitter o yntqlun to istagram=))!!! : 40%
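The general idea of an n-gram check can be sketched roughly as follows. This is not the algorithm used for the thesis: the n-gram list is a tiny hand-picked assumption, scoring the fraction of words that contain a listed n-gram is just one plausible design, and the name `english_score` is hypothetical. It is not expected to reproduce the percentages above.

```python
import re

# Assumed sample of letter groupings that are common in English words.
COMMON_NGRAMS = {"th", "he", "in", "er", "an", "the", "and", "ing"}


def english_score(text):
    """Return the fraction of words containing at least one common n-gram."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if any(gram in word for gram in COMMON_NGRAMS))
    return hits / len(words)


print(english_score("the thing in the other corner"))  # 1.0
print(english_score("xyz qqq"))  # 0.0
```

A list this short will also match plenty of words in other languages, so a usable version would need a much larger n-gram table built from real English text.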


Too Busy To Tweet

For my thesis I am investigating methods for finding happiness on Twitter. I’ll release a lot more of the information and data after it is all completed, but for now here are some small and interesting finds.

Number of tweets separated by day of the week.

What the above image shows is my full data set of tweets (around 400,000) and the day on which each tweet was sent. Interesting to note is that weekends have around half as many posts as the weekdays. I know I definitely use Twitter more when I’m procrastinating.

As well as having the least traffic, the weekends also suffer from an increased number of negative tweets. The graph below shows my take on finding the happiest day of the week.

Happiness across days of the week (repeated for ease of viewing)

There have been several studies aiming to find the happiest days of the week, from psychological studies into the idea that Monday is the unhappiest day to studies that also use Twitter as their statistical source.

What this appears to show is that the weekend suffers from a large-scale drop in happiness from a weekday high on Thursday. There does seem to be some correlation between the number of tweets sent on a day and the overall happiness of that day. Is it possible that when we are happy we want to share that with the world?

More information and graphs will follow as the study continues.

© 2024 Acodemics
