My thesis required dealing with large quantities of Twitter data, and to make things easier for myself I decided to use only English-language tweets. Because of the ‘interesting’ grammar and spelling frequently found online, something other than a standard dictionary lookup was required. This algorithm checks the input for common English n-grams (letter groupings) and returns a score indicating how confident it is that the input is English.
Here are some examples:
Good morning everyone! : 66%
Im So Sick. Really in a bad Position : 50%
Thats the fear of unicorns : 100%
Gnt too com uma saudade daminha namorada vcs nao tem ideia :(((( : 0%
Ils testent les hologrammes pour faire des concerts par des morts. Bientôt même les pas morts feront des hologrammes et vous irez les voir. : 12%
il7ain ilnass ykhl9un men habat twitter o yntqlun to istagram=))!!! : 40%
from collections import Counter


class trigram:
    # Common English trigrams; entries with leading or trailing
    # spaces catch word boundaries ('he ' matches 'the cat' etc.).
    ngrams = ['the', 'and', 'tha', 'ent', 'ing', 'ion', 'tio', 'for',
              'nde', 'has', 'nce', 'edt', 'tis', 'oft', 'sth', 'men',
              ' th', 'he ', 'ed ', ' of', 'in ', 'to ', 'er ', 'as ',
              'her', 'ng ', 'of ', ' an', ' in']

    def __init__(self):
        self.probability = 0.0

    def parse_text(self, text):
        count = 0.0
        # Collect the distinct three-character windows in the text.
        s = set()
        length = len(text.strip().split())  # number of words
        for i in range(len(text)):
            s.add(text[i:i + 3])
        c = Counter(s)
        # Counter returns 0 for missing keys, so no exception
        # handling is needed here.
        for ngram in self.ngrams:
            count += c[ngram]
        # Score: common trigrams found, scaled by word count.
        if length > 0:
            self.probability = (count / length) * 100.0
        else:
            self.probability = 0.0
        return self.probability

    def get_probability(self):
        return self.probability

    def is_english(self):
        return self.probability > 40
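The scoring idea can also be condensed into a single function. This is just a sketch of the same approach; `english_score` and `NGRAMS` are names I'm introducing here, not part of the original class:

```python
# Same common-English trigram list as the class above.
NGRAMS = ['the', 'and', 'tha', 'ent', 'ing', 'ion', 'tio', 'for',
          'nde', 'has', 'nce', 'edt', 'tis', 'oft', 'sth', 'men',
          ' th', 'he ', 'ed ', ' of', 'in ', 'to ', 'er ', 'as ',
          'her', 'ng ', 'of ', ' an', ' in']

def english_score(text):
    """Return a rough 0-100 score of how English-like `text` is."""
    # Distinct three-character windows in the text.
    seen = {text[i:i + 3] for i in range(len(text))}
    words = len(text.split())
    if words == 0:
        return 0.0
    # One point per common trigram present, scaled by word count.
    return 100.0 * sum(1 for g in NGRAMS if g in seen) / words

print(english_score('Thats the fear of unicorns'))  # 100.0
```

Because the trigrams are deduplicated before counting, a tweet that repeats 'the' ten times doesn't score higher than one that uses it once.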
Questions? Comments?