My thesis required dealing with large quantities of Twitter data, and to make things easier for myself I decided to use only English-language tweets. Because of the ‘interesting’ grammar and spelling frequently found online, a standard dictionary test would not work, so a different algorithm was required. This one checks for common English n-grams (letter groupings) and returns a value indicating how confident it is that the input is English.

Here are some examples:

Good morning everyone! : 66%
Im So Sick. Really in a bad Position : 50%
Thats the fear of unicorns : 100%
Gnt too com uma saudade daminha namorada vcs nao tem ideia :(((( : 0%
Ils testent les hologrammes pour faire des concerts par des morts. Bientôt même les pas morts feront des hologrammes et vous irez les voir. : 12%
il7ain ilnass ykhl9un men habat twitter o yntqlun to istagram=))!!! : 40%
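
To give a feel for the n-gram idea, here is a minimal sketch in Python. It is not the implementation that produced the scores above: the trigram table, the per-word scoring rule, and the name `english_score` are all assumptions made for illustration. It treats a word as "English-looking" if it contains at least one trigram from a small hand-picked list, and returns the fraction of such words.

```python
import re

# Assumption: a tiny hand-picked set of frequent English trigrams.
# A real detector would derive a much larger table from a corpus.
COMMON_TRIGRAMS = {
    "the", "and", "ing", "ion", "ent", "her", "for", "tha",
    "ere", "ate", "ver", "you", "thi", "hat", "all", "oul",
}


def english_score(text: str) -> float:
    """Return the fraction of words (0.0 to 1.0) containing a common trigram."""
    # Lowercase and keep only runs of ASCII letters, so punctuation,
    # digits and emoticons are ignored.
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(
        1
        for word in words
        # Slide a 3-letter window over the word; words shorter than
        # three letters have no trigrams and never count as a hit.
        if any(word[i:i + 3] in COMMON_TRIGRAMS for i in range(len(word) - 2))
    )
    return hits / len(words)


if __name__ == "__main__":
    for tweet in ["Good morning everyone!", "Thats the fear of unicorns"]:
        print(f"{tweet} : {english_score(tweet):.0%}")
```

With a table this small, the numbers will not match the ones listed above, and other languages that share letter groupings with English (French words ending in "ent", for instance) will pick up false hits. A larger table built from real English text, possibly weighted by frequency, would be the obvious next step.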