Superset Gitlab OAuth Integration

Finally cracked getting Superset to use Gitlab as an OAuth provider.

We had to make a custom Superset SecurityManager class –
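A minimal sketch, assuming Flask-AppBuilder's oauth_user_info hook and GitLab's /user API endpoint (the module, class and host names are placeholders):

```python
# gitlab_security_manager.py (name is a placeholder)
from superset.security import SupersetSecurityManager


class GitlabSecurityManager(SupersetSecurityManager):
    def oauth_user_info(self, provider, response=None):
        """Map the GitLab /user response onto the fields that
        Flask-AppBuilder expects when registering a user."""
        if provider == 'gitlab':
            me = self.appbuilder.sm.oauth_remotes[provider].get('user')
            return {
                'username': me.data.get('username', ''),
                'email': me.data.get('email', ''),
                'first_name': me.data.get('name', ''),
                'last_name': '',
            }
        return {}
```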

And then this is the superset_config.py; note that we load the custom security manager –
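A minimal sketch, with hostnames and credentials as placeholders:

```python
# superset_config.py
from flask_appbuilder.security.manager import AUTH_OAUTH

from gitlab_security_manager import GitlabSecurityManager

AUTH_TYPE = AUTH_OAUTH
AUTH_USER_REGISTRATION = True          # auto-create users on first login
AUTH_USER_REGISTRATION_ROLE = 'Gamma'

OAUTH_PROVIDERS = [{
    'name': 'gitlab',
    'icon': 'fa-gitlab',
    'token_key': 'access_token',
    'remote_app': {
        'consumer_key': '<gitlab application id>',
        'consumer_secret': '<gitlab application secret>',
        'base_url': 'https://gitlab.example.com/api/v4/',
        'request_token_url': None,
        'access_token_url': 'https://gitlab.example.com/oauth/token',
        'access_token_method': 'POST',
        'authorize_url': 'https://gitlab.example.com/oauth/authorize',
    },
}]

# Load the custom security manager from above.
CUSTOM_SECURITY_MANAGER = GitlabSecurityManager
```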

Additionally I had to disable SSL certificate validation, as we are using a self-signed SSL certificate for our internal Gitlab installation.

Superset SSL Verify Failed

Installing Apache Superset at work for internal use, I was attempting to link our Superset OAuth to an internal OAuth provider. This provider uses self-signed SSL certificates, which caused issues with Flask-OAuthlib; in my Superset logs I would see an SSL certificate verification failure.

As this is an internal system, it's possible to monkey patch the SSL handler to skip verification of our certificates by adding the following to the superset_config.py –
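A minimal sketch, using Python's built-in ssl module:

```python
import ssl

# WARNING: this disables certificate verification for the whole process.
# Only acceptable here because the GitLab instance is internal and uses a
# self-signed certificate.
ssl._create_default_https_context = ssl._create_unverified_context
```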

This means that we can reuse the same superset_config.py for our Superset Docker image and bypass the exception. OAuth here we come!

Android Bluetooth Barcode Scanner

As part of an Android contract role I am working on, I needed to integrate a Bluetooth barcode scanner with my app; this scanner would be used to trigger events in the app. The first thing to work out was how to get data from the scanner into the app. I initially suspected I would need to deal with the Bluetooth stack directly, but luckily this wasn't the case.

All recent (from 2011 onwards, really) Android devices should support the HID standard as part of the Android Open Accessory Protocol 2.0 update. HID basically allows connected devices (USB and Bluetooth) to self-describe the contents and types of their inputs; in general this means that connected devices can represent themselves as keyboards or as specific hardware buttons.

In the case of Bluetooth scanners, ensure that the device supports HID connections and also has the ability to send a line feed after each barcode is sent.

Coming back to our Android app: knowing what we do now, we can capture all key input sent to our activity in order to receive the barcodes sent by the scanner –
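A minimal sketch in Java (onBarcodeScanned is a hypothetical handler for the completed barcode):

```java
import android.view.KeyCharacterMap;
import android.view.KeyEvent;

// Inside your Activity subclass:
private final StringBuilder barcodeBuffer = new StringBuilder();

@Override
public boolean dispatchKeyEvent(KeyEvent event) {
    if (event.getAction() == KeyEvent.ACTION_DOWN) {
        switch (event.getKeyCode()) {
            case KeyEvent.KEYCODE_ENTER:
                // The line feed sent by the scanner marks the end of a barcode.
                onBarcodeScanned(barcodeBuffer.toString());
                barcodeBuffer.setLength(0);
                break;
            case KeyEvent.KEYCODE_BACK:
                // We consume every key event, so basic behaviour such as
                // the back button has to be reimplemented here.
                onBackPressed();
                break;
            default:
                // Interpret the keycode as a character and buffer it.
                KeyCharacterMap map = KeyCharacterMap.load(event.getDeviceId());
                int unicodeChar = map.get(event.getKeyCode(), event.getMetaState());
                if (unicodeChar != 0) {
                    barcodeBuffer.append((char) unicodeChar);
                }
                break;
        }
    }
    return true;
}
```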

Override an activity's default dispatchKeyEvent method with the above function. It takes all action-down events, interprets the keycode using the KeyCharacterMap class and adds the resulting character to an internal buffer; once an enter key is detected, this signals that our scanner has finished sending its information and we can trigger our handler.

As we capture all key press events sent to our activity, we need to reimplement some of the basic activity behaviour, such as handling the back key; this is done in the switch statement.

Base64 Image Encoding With Angular2 and TypeScript

As part of the upcoming Android application I've been working on a web app that can manage content in my Firebase database using Angular. I was recommended Angular2 as it will soon be the 'hot new thing'; unfortunately that also means a lack of documentation and Stack Overflow examples.

One thing that I needed was to store images as base64 encoded strings in Firebase (due to no support for Firebase Storage in AngularFire, but that's another story). I started by looking for some sort of plugin but couldn't find anything that worked with Angular2, so I ended up writing the code myself; it's actually fantastically easy –
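A minimal sketch of such a component (all names except base64Encoded are placeholders):

```typescript
import { Component } from '@angular/core';

@Component({
  selector: 'image-encoder',
  template: `<input type="file" (change)="encodeImage($event)">`
})
export class ImageEncoderComponent {
  // Populated with the base64-encoded data URL once encoding finishes.
  base64Encoded: string;

  encodeImage(event: any): void {
    const file = event.target.files[0];
    if (!file) {
      return;
    }
    const reader = new FileReader();
    // readAsDataURL produces a base64-encoded data URL once loading completes.
    reader.onload = () => {
      this.base64Encoded = reader.result as string;
    };
    reader.readAsDataURL(file);
  }
}
```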

FileReader requires no extra imports, as it is part of the standard browser API. Once the image has been processed it will be placed in the member base64Encoded. You may wish to use some sort of event emitter so that you can subscribe to be notified when the encoding is complete.

The book that I've been working on for some time is nearly finished. It is aimed at teaching young adults programming in a more formal setting; each chapter is a single lesson that can be taught over 1-2 hours. There are some draft chapters already available on the learning to program page, and the completed chapters will be added there eventually.

The real aim is to produce a set of materials that can be used by students, plus an additional set aimed at teachers, including further reading, an explanation of what happens in each session, model answers and suggested homework.

It will initially be released as an eBook, with print copies available separately.

SVN To Git – Advantages and Perks

I recently gave a talk to a group of people looking to move from SVN to Git. I made a presentation discussing the various differences between the two systems and the possible advantages of Git over SVN.

It's always hard to change something a developer is comfortable with, but I think that by presenting some possible new workflows and features, teams can decide for themselves that they wish to change.

The PDF version of the slideshow can be found at this link: git SCM Training.

The main takeaway from the presentation is that adopting branching as part of the workflow opens up totally new ways of collaborating on code within teams. For example, by using feature branches a subset of developers can code review a feature before it is merged into master; these things are not available, or not recommended, with more traditional centralised version control systems.

2 Sum Algorithms

I've been doing the Algorithms: Design and Analysis Coursera course, and during the final week there was a cool algorithm to code: the 2-Sum algorithm.

I initially coded the naive implementation that would complete in O(n) time –
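A minimal sketch, assuming a hash-set lookup (which matches the O(n) claim; the function name is my own):

```python
def two_sum(numbers, target):
    """Return True if two distinct entries of numbers sum to target.

    A single pass with a hash set: O(n) time, O(n) extra space.
    """
    seen = set()
    for x in numbers:
        if target - x in seen:
            return True
        seen.add(x)
    return False
```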

This worked but we can do better!

First off we sort the input array; this allows us to modify the algorithm to take advantage of the ordering –
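A minimal sketch, assuming a two-pointer scan over the sorted array:

```python
def two_sum_sorted(numbers, target):
    """Two-pointer scan over sorted input: O(n log n) for the sort,
    then a single O(n) pass from both ends."""
    numbers = sorted(numbers)
    lo, hi = 0, len(numbers) - 1
    while lo < hi:
        pair_sum = numbers[lo] + numbers[hi]
        if pair_sum == target:
            return True
        if pair_sum < target:
            lo += 1   # sum too small: advance the low end
        else:
            hi -= 1   # sum too large: pull in the high end
    return False
```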

This modification still gives O(n) time for the scan in the worst case (after the O(n log n) sort), but on the given test data it decreased the running time by nearly 50%.

Automatic Tagging of Articles

The modern internet contains vast amounts of information; we can make content easier to find for search engines, and ultimately for our users, by adding meta information to pages.

Most blogs give posts categories and tags, and clicking these tags takes us to related posts. Usually this is a manual step where the author fills out the tags most relevant to the post; but what if we could automate this process? With a basic understanding of natural language processing and some simple methods we can make a script that automatically suggests tags for any article.

What Words are Key?

Imagine explaining what a fire engine is to someone who has never seen one before; how would you start? It would be fairly easy to use common descriptive words such as large, red, vehicle or sirens, and using this information a person could get an idea of what a fire engine is without ever having seen one. This process is similar to how an automatic categorisation system might work. The key difference is that a computer doesn't know what a fire engine is any more than it knows what a banana is, but through some clever libraries it can extract relevant information from a block of text and use that; this is known as a statistical classification problem in machine learning.

Older classification systems relied on hand-made sets of keywords (a corpus) that defined a classification; for example, an algorithm might find words related to fish and, based on the corpus, know to classify the document as a fish document.

The method presented in this article uses these older systems as a foundation and is a form of extractive summarisation. More advanced methods exist that create a human-readable summary of an article, but those are outside the scope of this article.

Our Document

Let's create a simple document to parse and classify later on; we will take the first couple of sentences from a Wikipedia article –

The domestic cat is a small, usually furry, domesticated, and carnivorous mammal. They are often called a housecat when kept as an indoor pet or simply a cat when there is no need to distinguish them from other felids and felines. Cats are often valued by humans for companionship and their ability to hunt pests.

Data Representation

We have a document, but it's not much use to us as it is; we need to process it into a more useful data set. The first steps are to convert the document to lower case, split it into words, remove punctuation and remove common words such as 'is', 'the' and 'and' (stop words) –
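A minimal sketch, assuming a small hand-rolled stop word list (filter_words and STOP_WORDS are names of my choosing):

```python
import re

# A small hand-rolled stop word list; NLTK ships much larger ones.
STOP_WORDS = {'is', 'the', 'a', 'and', 'are', 'an', 'or', 'as', 'by',
              'for', 'to', 'of', 'no'}

DOCUMENT = (
    "The domestic cat is a small, usually furry, domesticated, and "
    "carnivorous mammal. They are often called a housecat when kept as an "
    "indoor pet or simply a cat when there is no need to distinguish them "
    "from other felids and felines. Cats are often valued by humans for "
    "companionship and their ability to hunt pests."
)


def filter_words(document):
    """Lower-case the text, strip punctuation, split into words and
    drop the stop words."""
    words = re.sub(r"[^a-z\s]", "", document.lower()).split()
    return [word for word in words if word not in STOP_WORDS]
```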

This gives us a pretty useful list of words that we can play with.

Preprocessing

As we can see, our input document contains both cat and cats; we know they are the same word, but a computer won't. In order to reduce cats to cat we need to use a technique called stemming. Stemming is the process of reducing a word to its base form or stem, so for example reducing plurals to singulars and so on. To accomplish this we will use a stemming library.
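The library isn't named here, so this is a minimal sketch using NLTK's PorterStemmer (exact stems may differ slightly between stemmers), building on filter_words from above:

```python
from collections import Counter

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Stem each filtered word and count how often every stem occurs.
word_counts = Counter(stemmer.stem(word) for word in filter_words(DOCUMENT))
print(word_counts)
```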

The output of this is –

Counter({'cat': 3, 'often': 2, 'when': 2, 'domest': 2, 'simpli': 1, 'indoor': 1, 'human': 1, 'felid': 1, 'need': 1, 'abil': 1, 'felin': 1, 'housecat': 1, 'from': 1, 'companionship': 1, 'pet': 1, 'there': 1, 'their': 1, 'other': 1, 'call': 1, 'furri': 1, 'them': 1, 'they': 1, 'distinguish': 1, 'valu': 1, 'kept': 1, 'hunt': 1, 'carnivor': 1, 'pest': 1, 'small': 1, 'mammal': 1, 'usual': 1})

Our most common word is cat, which is a fantastic start for very little processing. We could even take the most common word and leave it at that, as the document in question is clearly about cats.

Part of Speech Tagging

While we have a basic model that gives us a topic, we can't always rely on simple word frequencies to extract the theme of a document. A major part of natural language processing is part of speech (POS) tagging: this takes in a sentence and tags each word with a type such as noun, verb or adjective. For example –

  • heat – verb (noun)
  • water – noun (verb)
  • in – prep (noun, adv)
  • a – det (noun)
  • large – adj (noun)
  • vessel – noun

To tag our words we will use a tagger from the Natural Language Toolkit (NLTK), an NLP library for Python.

Let's add the tagging to our code so that we have two models of our document: the first being the word count and the second being the part of speech tagged model –
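A minimal sketch, using nltk.pos_tag with the simplified universal tagset:

```python
import nltk  # needs the 'averaged_perceptron_tagger' and 'universal_tagset' data

filtered = filter_words(DOCUMENT)
# Tag the filtered (un-stemmed) words; the universal tagset collapses the
# detailed Penn Treebank tags down to NOUN, VERB, ADJ and so on.
tagged_words = set(nltk.pos_tag(filtered, tagset='universal'))
print(tagged_words)
```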

This will run the part of speech tagging on our list of filtered words and give us a simplified description of each word; as the aim is to identify nouns, we do not worry too much about parsing the document as a whole. The resulting list is as follows –

set([('need', u'VERB'), ('hunt', u'NOUN'), ('carnivorous', u'ADJ'), ('furry', u'ADJ'), ('cat', u'ADJ'), ('other', u'ADJ'), ('called', u'VERB'), (u'felid', u'ADJ'), ('distinguish', u'ADJ'), ('companionship', u'NOUN'), ('cat', u'NOUN'), ('usually', u'ADV'), ('them', u'PRON'), (u'feline', u'NOUN'), ('from', u'ADP'), ('when', u'ADV'), (u'pest', u'NOUN'), ('housecat', u'ADJ'), ('domesticated', u'VERB'), ('ability', u'NOUN'), ('kept', u'ADJ'), ('indoor', u'NOUN'), ('small', u'ADJ'), ('they', u'PRON'), ('simply', u'ADV'), ('valued', u'VERB'), ('often', u'ADV'), (u'human', u'ADJ'), ('domestic', u'ADJ'), ('there', u'DET'), ('their', u'PRON'), ('mammal', u'ADJ'), ('pet', u'NOUN')])

Keyword Extraction

Now we will combine our two data models: first we take the most common words and then look up their tag type; if a word is a noun, we can say it is likely to be a topic of our document.

Putting all of our code together it looks like this –
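A consolidated sketch; the selection step here is one plausible reading (keep the most frequent stems whose surface form was tagged as a noun):

```python
from collections import Counter
import re

import nltk
from nltk.stem.porter import PorterStemmer

STOP_WORDS = {'is', 'the', 'a', 'and', 'are', 'an', 'or', 'as', 'by',
              'for', 'to', 'of', 'no'}

DOCUMENT = (
    "The domestic cat is a small, usually furry, domesticated, and "
    "carnivorous mammal. They are often called a housecat when kept as an "
    "indoor pet or simply a cat when there is no need to distinguish them "
    "from other felids and felines. Cats are often valued by humans for "
    "companionship and their ability to hunt pests."
)


def filter_words(document):
    """Lower-case the text, strip punctuation and drop stop words."""
    words = re.sub(r"[^a-z\s]", "", document.lower()).split()
    return [word for word in words if word not in STOP_WORDS]


def suggest_tags(document, limit=3):
    """Suggest the most frequent stems whose surface form is a noun."""
    stemmer = PorterStemmer()
    filtered = filter_words(document)
    stem_counts = Counter(stemmer.stem(word) for word in filtered)
    tagged = nltk.pos_tag(filtered, tagset='universal')
    # Map each noun's stem back to a surface form so the suggested tags
    # stay readable ('feline' rather than 'felin'). Ties at a count of
    # one come out in arbitrary order.
    nouns = {stemmer.stem(word): word for word, tag in tagged if tag == 'NOUN'}
    ranked = (stem for stem, _ in stem_counts.most_common())
    return [nouns[stem] for stem in ranked if stem in nouns][:limit]


if __name__ == '__main__':
    print(' '.join(suggest_tags(DOCUMENT)))
```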

And the resulting words from our test document that have been identified as possible key words –

cat feline indoor

Improvements

This is a very basic parser and could benefit from improvements in the following areas –

  • Improve the stop word list; NLTK comes with several large stop word lists for various languages.
  • Try to identify key phrases rather than key words. For example, the document contains the phrase "carnivorous mammal", which may be very relevant depending on the key words/phrases we are trying to extract. This could be better solved by using tagging systems to identify the parts of a sentence rather than single words.
  • Relying on word count is useful but basic; this keyword extraction system depends on the key topic words being repeated, which may not always be the case, particularly for more abstract documents. Wikipedia articles are a good example of documents that will likely repeat the topic several times.
  • Checking only for nouns is quite limiting. In our example document cats are described as mammals, which could be a very useful tag for semantic metadata, but because "mammal" is tagged as an adjective it is ignored.


Python Importing Global Variables – Reference or Value?

I've been refactoring a large Python program into separate modules for readability and ease of unit testing, and part of that involves moving global variables into a separate globals module. I was interested to see whether importing this globals module into other modules and then updating a variable would mean that the other modules see the change; here's what I found –

from globals import *

My first instinct was to use from globals import * in my sub-modules; let's look at an example –

globals.py
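These listings are minimal reconstructions consistent with the output shown below:

```python
# globals.py: the shared state lives here.
VAR = 1
```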

main.py
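```python
# main.py (reconstruction)
from globals import *  # binds a local name VAR in this module
import test

print('First: %s' % VAR)
VAR = 2  # rebinds main.py's own VAR, not globals.VAR
print('Second: %s' % VAR)
test.printVar()
```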

test.py
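```python
# test.py (reconstruction)
from globals import *  # test.py gets its own VAR binding too


def printVar():
    print('printVar(): %s' % VAR)
```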

Running main.py we get this output –

First: 1
Second: 2
printVar(): 1

Obviously, updating the global variable VAR in main.py is not reflected in test.py. This is because from globals import * simply binds the name VAR in main.py's own namespace; reassigning it rebinds main.py's local name and leaves globals.VAR, and hence the VAR that test.py sees, untouched.

import globals

In order to share a single instance of the global variable VAR, we need to modify our program to use import globals and then reference the VAR attribute directly on the globals module.

Let’s modify the scripts –

main.py
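Again a minimal reconstruction –

```python
# main.py (reconstruction)
import globals  # reference the module attribute directly
import test

print('First: %s' % globals.VAR)
globals.VAR = 2  # updates the single shared binding
print('Second: %s' % globals.VAR)
test.printVar()
```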

test.py
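```python
# test.py (reconstruction)
import globals


def printVar():
    print('printVar(): %s' % globals.VAR)
```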

Running this will give us the following output –

First: 1
Second: 2
printVar(): 2

Much better! The solution to this problem is to update the VAR attribute directly on the globals module, allowing updates to be seen from all modules that import globals.

Moore’s Law in Action


Bag of Words Implementation

Bag of words is a method for reducing natural language text into a representative model for use with machine learning and natural language processing.

I used it to train a network on log messages and then assign scores to known log messages; based on a bag of words representation, the neural network can give a score of how well the test data matches.
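A minimal sketch of the representation itself (the log lines and function names are illustrative; the scoring network is out of scope here):

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign a stable index to every distinct word across the corpus."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def bag_of_words(document, vocabulary):
    """Represent a document as a fixed-length vector of word counts."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are dropped
            vector[vocabulary[word]] = count
    return vector


logs = ['disk full on volume', 'user login ok', 'disk error']
vocabulary = build_vocabulary(logs)
print(bag_of_words('disk error error', vocabulary))
# -> [1, 0, 0, 0, 0, 0, 0, 2]
```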


Merry Christmas!

Merry Christmas from me at Acodemics!

Here's a little bit of code that will make snowfall in your terminal; it's awkward on purpose 🙂
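A minimal stand-in sketch (not the deliberately awkward original):

```python
import random
import time

WIDTH = 80  # terminal columns

# Print a snowflake at a random column, forever; Ctrl-C to stop the blizzard.
while True:
    print(' ' * random.randint(0, WIDTH - 1) + '*')
    time.sleep(0.1)
```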

