
Superset Gitlab OAuth Integration

Finally cracked getting Superset to use Gitlab as an OAuth provider.

We had to make a custom Superset SecurityManager class –
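
Something along these lines (a minimal sketch; the exact import path and the user-info fields vary between Superset versions and the original code isn't reproduced in this archive) –

    from superset.security import SupersetSecurityManager

    class GitlabSecurityManager(SupersetSecurityManager):
        def oauth_user_info(self, provider, response=None):
            # Fetch the logged-in user's details from GitLab's API
            if provider == 'gitlab':
                me = self.appbuilder.sm.oauth_remotes[provider].get('user')
                return {
                    'username': me.data.get('username'),
                    'email': me.data.get('email'),
                    'first_name': me.data.get('name', ''),
                    'last_name': '',
                }
            return {}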

And then this is the superset_config.py, note that we load a custom security manager
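
A sketch of the relevant parts of superset_config.py; the GitLab host, the secrets and the module holding the security manager are placeholders –

    from flask_appbuilder.security.manager import AUTH_OAUTH
    # 'gitlab_security' is a hypothetical module name holding the class above
    from gitlab_security import GitlabSecurityManager

    AUTH_TYPE = AUTH_OAUTH
    AUTH_USER_REGISTRATION = True
    AUTH_USER_REGISTRATION_ROLE = 'Gamma'

    OAUTH_PROVIDERS = [{
        'name': 'gitlab',
        'icon': 'fa-gitlab',
        'token_key': 'access_token',
        'remote_app': {
            'consumer_key': '<application id>',
            'consumer_secret': '<secret>',
            'base_url': 'https://gitlab.example.com/api/v4/',
            'request_token_url': None,
            'access_token_url': 'https://gitlab.example.com/oauth/token',
            'authorize_url': 'https://gitlab.example.com/oauth/authorize',
        },
    }]

    # Tell Superset to load the custom security manager
    CUSTOM_SECURITY_MANAGER = GitlabSecurityManager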

Additionally I had to disable SSL certificate validation, as we are using a self-signed SSL certificate for our internal Gitlab installation.

Superset SSL Verify Failed

While installing Apache Superset at work for internal use, I was attempting to link our Superset OAuth to an internal OAuth provider. This provider uses self-signed SSL certificates, which caused issues with Flask-OAuthlib. In my Superset logs I would see the following error –

As this is an internal system, it's possible to monkey-patch the SSL handler to skip verification of our certificates by adding the following to superset_config.py –
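
One common way to do this (a sketch; the original snippet isn't reproduced here) is to replace the default HTTPS context with an unverified one –

    import ssl

    # Self-signed certificate on the internal GitLab server - skip certificate
    # verification for HTTPS requests made by this process (acceptable only
    # because everything involved is internal)
    ssl._create_default_https_context = ssl._create_unverified_context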

This means that we can reuse the same superset_config.py for our Superset Docker image and bypass the exception. OAuth, here we come!

2 Sum Algorithms

I’ve been doing the Algorithms: Design and Analysis Coursera course and during the final week there was a cool algorithm that needed to be coded called the 2 Sum algorithm.

I initially coded the naive implementation that would complete in O(n) time –
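
The original code isn't included in this archive; a minimal sketch of a hash-set based 2-Sum check for a single target value looks like this –

    def two_sum(numbers, target):
        # Single pass with a hash set: for each x, check whether target - x has
        # already been seen, giving O(n) time for one target value
        seen = set()
        for x in numbers:
            if target - x in seen:
                return True
            seen.add(x)
        return False

    print(two_sum([1, 3, 5, 7], 8))   # True: 1 + 7 (or 3 + 5)
    print(two_sum([1, 3, 5, 7], 11))  # False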

This worked but we can do better!

First off we sort the input array; this allows us to modify the algorithm to take advantage of the ordering of the data –
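
One common way to exploit the sorted order is a two-pointer scan; this is a sketch, not necessarily the exact modification used in the original post –

    def two_sum_sorted(numbers, target):
        # Sort once, then walk inwards from both ends: move the low pointer up
        # when the sum is too small and the high pointer down when it is too big
        numbers = sorted(numbers)
        low, high = 0, len(numbers) - 1
        while low < high:
            pair_sum = numbers[low] + numbers[high]
            if pair_sum == target:
                return True
            elif pair_sum < target:
                low += 1
            else:
                high -= 1
        return False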

This modification still gives O(n) time in the worst case, but on the given test data it decreased the running time by nearly 50%.

Automatic Tagging of Articles

The modern internet contains vast amounts of information. We can make content easier to find for search engines, and ultimately for our users, by adding meta information to pages.

Most blogs give posts categories and tags, and clicking these tags takes us to related posts. Usually this is a manual step where the author fills out the tags most relevant to the post, but what if we could automate this process? With a basic understanding of natural language processing and some simple methods we can make a script that automatically suggests tags for any article.

What Words are Key?

Imagine explaining what a fire engine is to someone who has never seen one before. How would you start? It would be fairly easy to use common descriptive words such as large, red, vehicle or sirens; using this information a person could get an idea of what a fire engine is without ever having seen one. This process is similar to how an automatic categorisation system might work. The key difference is that a computer doesn't know what a fire engine is any more than it knows what a banana is, but through some clever libraries it can extract relevant information from a block of text and use that. This is known as a statistical classification problem in machine learning.

Older classification systems relied on hand-made sets of keywords (a corpus) that defined a classification. For example, an algorithm may find words related to fish and, based on the corpus, would know to classify the document as a fish document.

The method presented in this article will use the older systems as a foundation and will be a form of extractive summarization. More advanced methods exist where a human-readable summary of an article is created, but that is outside the scope of this article.

Our Document

Let’s create a simple document that will be parsed and classified later on; we will take the first couple of sentences from a Wikipedia article –

The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests.

Data Representation

We have a document but it’s not much use to us as it is; we need to process it into a more useful data set. The first step is to convert the document to lower case, split it into words, remove punctuation and remove common words such as ‘is’, ‘the’ and ‘and’ (stop words) –
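
The original snippet isn't reproduced here; a minimal sketch follows, where the stop word list is an assumption, just enough to drop the most common filler words –

    import re

    document = ("The domestic cat is a small, usually furry, domesticated, and "
                "carnivorous mammal. They are often called a housecat when kept as an "
                "indoor pet or simply a cat when there is no need to distinguish them "
                "from other felids and felines. Cats are often valued by humans for "
                "companionship and their ability to hunt pests.")

    # Small hand-rolled stop word list - the original list isn't shown
    stop_words = {'is', 'the', 'and', 'a', 'an', 'or', 'of', 'for', 'to', 'by', 'as', 'are', 'no'}

    # Lower case, strip punctuation, split into words and drop stop words
    words = re.sub(r'[^\w\s]', '', document.lower()).split()
    filtered_words = [word for word in words if word not in stop_words]
    print(filtered_words)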

This gives us a pretty useful list of words that we can play with.

Preprocessing

As we can see in our input document we have both cat and cats. We know they are the same word but a computer won’t, so in order to reduce cats to cat we need to use a technique called stemming. Stemming is the process of reducing a word to its base form or stem, for example reducing plurals to singulars. To accomplish this we will use a stemming library.
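
The stems in the output below look like Porter stems, so this sketch uses NLTK's PorterStemmer on the filtered_words list from the previous step –

    from collections import Counter
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]

    # Count how often each stem appears in the document
    word_counts = Counter(stemmed_words)
    print(word_counts)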

The output of this is –

Counter({'cat': 3, 'often': 2, 'when': 2, 'domest': 2, 'simpli': 1, 'indoor': 1, 'human': 1, 'felid': 1, 'need': 1, 'abil': 1, 'felin': 1, 'housecat': 1, 'from': 1, 'companionship': 1, 'pet': 1, 'there': 1, 'their': 1, 'other': 1, 'call': 1, 'furri': 1, 'them': 1, 'they': 1, 'distinguish': 1, 'valu': 1, 'kept': 1, 'hunt': 1, 'carnivor': 1, 'pest': 1, 'small': 1, 'mammal': 1, 'usual': 1})

Our most common word is cat which is a fantastic start for very little processing. We could even take the most common word and leave it at that as clearly the document in question is about cats.

Part of Speech Tagging

While we have a basic model that gives us a topic, we can’t always rely on simple word frequencies to extract the theme of a document. A major part of natural language processing is part-of-speech (POS) tagging. This takes in a sentence and tags each word with a type such as noun, verb or adjective; for example –

  • heat – verb (noun)
  • water – noun (verb)
  • in – prep (noun, adv)
  • a – det (noun)
  • large – adj (noun)
  • vessel – noun

To tag our words we will use a tagger from the Natural Language Toolkit (NLTK), which is an NLP library available for Python.

Let’s add the tagging to our code so that we have two models of our document, the first being the word count and the second being the part of speech tagged model.
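
The tagging step isn't reproduced in full here; a sketch using NLTK's pos_tag with the simplified 'universal' tagset (which produces tags like NOUN and VERB, matching the output below) –

    import nltk

    # Tag the filtered (but unstemmed) words with a simplified part of speech.
    # The 'universal' tagset collapses the detailed tags down to NOUN, VERB, ADJ, etc.
    tagged_words = set(nltk.pos_tag(filtered_words, tagset='universal'))
    print(tagged_words)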

This will run the part of speech tagging on our list of filtered words and give us a simplified description of each word. As the aim is to identify nouns we do not worry so much about parsing the document as a whole. The resulting set is as follows –

set([('need', u'VERB'), ('hunt', u'NOUN'), ('carnivorous', u'ADJ'), ('furry', u'ADJ'), ('cat', u'ADJ'), ('other', u'ADJ'), ('called', u'VERB'), (u'felid', u'ADJ'), ('distinguish', u'ADJ'), ('companionship', u'NOUN'), ('cat', u'NOUN'), ('usually', u'ADV'), ('them', u'PRON'), (u'feline', u'NOUN'), ('from', u'ADP'), ('when', u'ADV'), (u'pest', u'NOUN'), ('housecat', u'ADJ'), ('domesticated', u'VERB'), ('ability', u'NOUN'), ('kept', u'ADJ'), ('indoor', u'NOUN'), ('small', u'ADJ'), ('they', u'PRON'), ('simply', u'ADV'), ('valued', u'VERB'), ('often', u'ADV'), (u'human', u'ADJ'), ('domestic', u'ADJ'), ('there', u'DET'), ('their', u'PRON'), ('mammal', u'ADJ'), ('pet', u'NOUN')])

Keyword Extraction

Now we will combine our two data models. First we will take the most common words and then look up their tag type; if a word is a noun we can say that it is likely to be a topic of our document.

Putting all of our code together it looks like this –
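
The combined script isn't reproduced in full; the sketch below joins the two models by re-stemming each tagged word, which is an assumption about how the original code matched stems to tags –

    # Take the most frequent stems and keep those whose original word was
    # tagged as a noun
    keywords = []
    for stem, count in word_counts.most_common(10):
        for word, tag in tagged_words:
            if tag == 'NOUN' and stemmer.stem(word) == stem:
                keywords.append(word)
                break

    print(' '.join(keywords))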

And the resulting words from our test document that have been identified as possible key words –

cat feline indoor

Improvements

This is a very basic parser and could benefit from improvements in the following areas –

  • Improve the stop word list; NLTK comes with several large stop word lists for various languages.
  • Try to identify key phrases rather than single key words. For example, the document contains the phrase “carnivorous mammal”, which may be very relevant depending on the key words/phrases we are trying to extract. This could be better solved by using tagging systems to identify the parts of a sentence rather than single words.
  • Relying on word count is useful but basic – this keyword extraction system is very dependent on the key topic words being repeated, which may not always be the case, particularly for more abstract documents. Wikipedia articles are a good example of documents that will likely mention the topic several times.
  • Just checking for nouns is quite limiting. In our example document cats are described as mammals, which could be a very useful tag to extract for semantic meta data, but because it is tagged as an adjective it is ignored.


Python Importing Global Variables – Reference or Value?

I’ve been refactoring a large Python program into separate modules for readability and ease of unit testing, and part of that involves moving global variables into a separate globals module. I was interested to see whether importing this globals module into other modules and then updating a variable would mean that the other modules see the change. Here’s what I found –

from globals import *

My first instinct was to use from globals import * in my sub modules, let’s look at an example –

globals.py
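
The original snippets aren’t included in this archive; the three files below are minimal reconstructions consistent with the output printed further down –

    # globals.py - reconstruction
    VAR = 1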

main.py
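
    # main.py - reconstruction consistent with the output below
    from globals import *
    from test import printVar

    print("First: %s" % VAR)
    VAR = 2
    print("Second: %s" % VAR)
    printVar()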

test.py
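
    # test.py - reconstruction consistent with the output below
    from globals import *

    def printVar():
        print("printVar(): %s" % VAR)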

Running main.py we get this output –

First: 1
Second: 2
printVar(): 1

Obviously updating the global variable VAR in main.py is not reflected in test.py. This is because from globals import * binds the names from globals into the importing module’s own namespace, so rebinding VAR in main.py only changes main.py’s copy of the name and test.py still sees the original value.

import globals

In order to get a single instance of the global variable VAR we need to modify our program to use import globals and then reference the VAR variable directly inside globals.

Let’s modify the scripts –

main.py
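
Again, reconstructions consistent with the output below –

    # main.py - reconstruction using 'import globals'
    import globals
    from test import printVar

    print("First: %s" % globals.VAR)
    globals.VAR = 2
    print("Second: %s" % globals.VAR)
    printVar()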

test.py
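
    # test.py - reconstruction using 'import globals'
    import globals

    def printVar():
        print("printVar(): %s" % globals.VAR)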

Running this will give us the following output –

First: 1
Second: 2
printVar(): 2

Much better! The solution to this problem is to update the VAR variable directly inside globals, allowing updates to be seen from all modules that import globals.

Bag of Words Implementation

Bag of words is a method for reducing natural text into a representative model for use with machine learning and natural language processing.

I used it to train a network on log messages and then assign scores to known log messages; based on a bag of words representation the neural network can give a score of how well the test data matches.
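
The training code itself isn’t included here; a minimal sketch of building a bag of words representation, using made-up log messages, looks like this –

    from collections import Counter

    def build_vocabulary(documents):
        # Fixed ordering of every distinct word across the corpus
        return sorted({word for doc in documents for word in doc.lower().split()})

    def bag_of_words(document, vocab):
        # Represent a document as a vector of word counts over the vocabulary
        counts = Counter(document.lower().split())
        return [counts[word] for word in vocab]

    logs = ["disk failure on node 3", "disk rebuild complete", "node 3 back online"]
    vocab = build_vocabulary(logs)
    print(bag_of_words("disk failure on node 7", vocab))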


Maze Solving with A* In Python

There was a new challenge at work to create a program that can solve 2D ASCII mazes. For this challenge I implemented the A* search algorithm, a very fast algorithm that uses a heuristic to decide which paths are most promising to explore. It is very useful for something such as path finding in computer games, where there may be different routes available but some routes are preferable to others.

The mazes look like this –

Here is a larger one –

Save the mazes into a file and pipe them into the program.
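
The maze examples and the original source aren’t reproduced in this archive; a minimal sketch of A* over a grid of maze rows, assuming ‘#’ marks walls and that start and goal are (row, column) tuples –

    import heapq

    def solve(maze, start, goal, wall='#'):
        # A* search: rank candidate cells by f = g (steps so far) + h (Manhattan
        # distance to the goal), so the most promising routes are explored first
        def heuristic(pos):
            return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

        open_set = [(heuristic(start), 0, start, [start])]
        visited = set()
        while open_set:
            f, g, pos, path = heapq.heappop(open_set)
            if pos == goal:
                return path
            if pos in visited:
                continue
            visited.add(pos)
            row, col = pos
            for r, c in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
                if 0 <= r < len(maze) and 0 <= c < len(maze[r]) and maze[r][c] != wall:
                    heapq.heappush(open_set, (g + 1 + heuristic((r, c)), g + 1, (r, c), path + [(r, c)]))
        return None  # no route between start and goal

Here maze would be the list of lines read from the piped-in file.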

Python Port Tunneler

I needed to be able to remote debug a process that was hidden behind a middle server so I made a Python script that can create a ‘middle man’ port to allow for this.

Basically, if you have server-B, which is only accessible through server-A, the script opens a port on server-A that links to the desired remote debug port on server-B and forwards all traffic through it.

It’s probably possible to do this all with some cryptic SSH command but this works better for me, plus I got to do some more Python. It uses the gevent library, an extremely fast and efficient coroutine-based networking library that can be installed from any good package repository.

The usage is quite simple –
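
The original command line isn’t shown; a hypothetical invocation (script name and argument order are assumptions) would look like –

    python port_tunneler.py <listen_port> <remote_host> <remote_port>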

The code –
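
The original script isn’t reproduced here; a minimal sketch of the same idea using gevent’s StreamServer, matching the hypothetical usage above –

    import sys
    import gevent
    from gevent.server import StreamServer
    from gevent.socket import create_connection

    def forward(source, dest):
        # Copy bytes from one socket to the other until either side closes
        try:
            while True:
                data = source.recv(4096)
                if not data:
                    break
                dest.sendall(data)
        finally:
            source.close()
            dest.close()

    def make_handler(remote_host, remote_port):
        def handle(client_socket, address):
            # For every client, open a connection to the real target and pump
            # traffic in both directions concurrently
            remote_socket = create_connection((remote_host, remote_port))
            gevent.joinall([
                gevent.spawn(forward, client_socket, remote_socket),
                gevent.spawn(forward, remote_socket, client_socket),
            ])
        return handle

    if __name__ == '__main__':
        listen_port, remote_host, remote_port = int(sys.argv[1]), sys.argv[2], int(sys.argv[3])
        StreamServer(('0.0.0.0', listen_port), make_handler(remote_host, remote_port)).serve_forever()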


Image Recognition Tool

As I’ve recently got quite into machine learning tools, I’ve written a small GUI-based tool that uses the PyBrain library to ‘learn’ a common theme from a folder of training data. To test new images, drag and drop them onto the tool to get a percentage similarity to the training data.

The tool is written using Python, PyBrain, Glade2 and PyGTK, so you need those libraries available.

The tool can be downloaded here – image_recognition_tool

Image Recognition with PyBrain

After recently completing the Machine Learning course from Stanford University on Coursera I’ve been preparing to give a small introduction to machine learning at work. Part of that is showing some demos of machine learning tools.

I made a character recognition neural network using the PyBrain Python library. It’s a great library and very fast, but the documentation is very poor and examples are hard to come across. With enough digging I managed to put together something very simple and short.

In this example it reads in small PNG files of letters, extracts all of the pixel values and creates a 1D array of the values; this is used to train the neural network through back-propagation. I test the network on one of the inputs. Each input is classified with a number in the addSample function, which takes the flattened array and a number (unfortunately it does not take a string as a classification). If you run the application you will see that, for example when using b.png as a test, it will return a value close to 2.
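
The code isn’t included in this archive; a minimal sketch of the approach described above, where the image size and file names are assumptions, looks something like this –

    from PIL import Image
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer
    from pybrain.tools.shortcuts import buildNetwork

    IMAGE_PIXELS = 10 * 10  # assumed dimensions of the training PNGs
    training_files = {'a.png': 1, 'b.png': 2, 'c.png': 3}  # image -> numeric class

    def load_pixels(path):
        # Flatten a greyscale image into a 1D list of normalised pixel values
        image = Image.open(path).convert('L')
        return [value / 255.0 for value in image.getdata()]

    # One sample per training image: flattened pixels in, numeric class out
    dataset = SupervisedDataSet(IMAGE_PIXELS, 1)
    for filename, label in training_files.items():
        dataset.addSample(load_pixels(filename), (label,))

    network = buildNetwork(IMAGE_PIXELS, 20, 1)
    trainer = BackpropTrainer(network, dataset)
    for _ in range(100):
        trainer.train()

    # Testing with b.png should give a value close to 2
    print(network.activate(load_pixels('b.png')))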

You can download the images I used here – Machine Learning Training Characters.


Caesar Cipher Cracker

Another programming challenge from work: solve Caesar-ciphered sentences and return the correct shift value. Automatically solving the cipher is easy enough, but the hard part is automatically detecting whether the resulting shifted sentence is English. I have posted before about detecting English using ngrams and used a similar process here.

Here is the code which contains the four test cases.
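
The original code and its four test cases aren’t reproduced in this archive; the sketch below brute-forces every shift and, instead of the ngram check mentioned above, uses a simple common-word score to decide which candidate looks like English –

    COMMON_WORDS = {'the', 'and', 'is', 'of', 'to', 'in', 'it', 'that', 'a'}

    def shift_text(text, shift):
        # Shift each letter back by 'shift' places, wrapping around the alphabet
        result = []
        for ch in text:
            if ch.isalpha():
                base = ord('a') if ch.islower() else ord('A')
                result.append(chr((ord(ch) - base - shift) % 26 + base))
            else:
                result.append(ch)
        return ''.join(result)

    def crack(ciphertext):
        # Try every shift and keep the candidate containing the most common English words
        best_shift, best_score = 0, -1
        for shift in range(26):
            candidate = shift_text(ciphertext, shift)
            score = sum(word in COMMON_WORDS for word in candidate.lower().split())
            if score > best_score:
                best_shift, best_score = shift, score
        return best_shift, shift_text(ciphertext, best_shift)

    print(crack('Wkh txlfn eurzq ira mxpsv ryhu wkh odcb grj'))  # expects a shift of 3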

Reblog: Python Minidom and Whitespace

A good tutorial providing fixes for printing of XML using Python and Minidom. Specifically fixing added whitespace and the conversion of characters into HTML entities.
