At Foursquare, we attempt to personalize as much of the product as we can. In order to understand the more than 70 million tips and 1.3 billion shouts our users have left at venues, each of those pieces of text must be run through our natural language processing pipeline. The very foundation of this pipeline is our ability to identify the language of a given piece of text.
Traditional language detectors are typically implemented using a character-trigram model or a dictionary-based n-gram model. The accuracy of these approaches improves with the length of the text being classified. For short pieces of text like tips in Foursquare or shouts in Swarm (see examples below), however, the efficacy of these solutions begins to break down. For example, if a user writes only a single word like “taco!” or an equally ambiguous statement like “mmmm strudel,” a generic character- or word-based solution cannot make a strong language classification on such short strings. Unfortunately, given the nature of the Foursquare products, these sorts of short strings are very commonplace, and we needed a better way to accurately classify the languages in which they are written.
To this end, we decided to rethink generic language identification algorithms and build our own identification system, making use of some of the more unique aspects of Foursquare data: the location where the text was created and the ability to aggregate all texts by their writer. While there are many multilingual users on our platform, the average Foursquare user only ever writes tips or shouts in a single language. Given that fact, it seemed inefficient to apply a generic language classification model against all of the text that a single user creates. If we have 49 data points that strongly point to a user writing in English, and that user's 50th data point is an ambiguous text that a generic language model thinks could be German or English (with 40% and 38% confidence respectively), chances are that the string should correctly be tagged as English and not German, even if the text contains German loanwords. Our solution to this problem was to build a custom language model for every one of our users who leaves tips or shouts, and then to let those user language models influence the result of the generic language detection algorithm.
The first step in this process is to run generic language detection on every tip and shout in the database. Each tip and shout is associated with a venue that has an explicit lat/long. We then reverse geocode that lat/long to the country in which the venue is located, which tells us the country the user was in when they wrote the text. Next, we couple the generic language detection results with this country data to create a language model for every country. While this per-country language distribution may not exactly mirror the real-life language distribution of a given country, it does model the language behavior of the users who share text via Foursquare and Swarm in those countries.
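To make the aggregation step concrete, here is a minimal sketch of how a per-country language distribution could be assembled from generic detection results. The record shape and the `build_country_models` name are illustrative assumptions, not our production code:

```python
from collections import Counter, defaultdict

def build_country_models(records):
    """Aggregate generic language-detection results into a per-country
    language distribution. Each record is (country_code, lang_probs),
    where lang_probs maps ISO language codes to the generic detector's
    confidence for one string. (Hypothetical record shape.)
    """
    totals = defaultdict(Counter)
    counts = Counter()
    for country, lang_probs in records:
        for lang, p in lang_probs.items():
            totals[country][lang] += p
        counts[country] += 1
    # Normalize each country's accumulated scores into a distribution.
    return {
        country: {lang: score / counts[country]
                  for lang, score in langs.items()}
        for country, langs in totals.items()
    }

# Toy input: three strings with their generic detector outputs.
records = [
    ("US", {"en": 0.9, "de": 0.1}),
    ("US", {"en": 0.7, "es": 0.3}),
    ("TH", {"th": 0.8, "en": 0.2}),
]
models = build_country_models(records)
```

Averaging the detector's soft scores (rather than counting only each string's top language) lets ambiguous strings contribute fractional weight to every language they might be.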
Example of top 5 languages and weights calculated in the country language models:
US - United States of America
    Top Tip Langs        Top Shout Langs
    en - 0.80096         en - 0.5092
    es - 0.00850         de - 0.0139
    it - 0.00804         es - 0.0102
    de - 0.00559         it - 0.0096
    fr - 0.00459         nl - 0.0088

RU - Russian Federation
    Top Tip Langs        Top Shout Langs
    ru - 0.77396         ru - 0.40054
    bg - 0.02990         uk - 0.04615
    uk - 0.02049         bg - 0.04446
    sr - 0.01458         sr - 0.03221
    en - 0.01450         be - 0.02420

TH - Thailand
    Top Tip Langs        Top Shout Langs
    th - 0.67228         th - 0.60632
    en - 0.17507         en - 0.10340
    ru - 0.01969         zh - 0.00488
    it - 0.00327         ja - 0.00478
    de - 0.00298         de - 0.00467
With country models in hand, we then do a separate grouping of strings by user and calculate a language distribution on a per-user basis. However, one problem with this approach is that not every user has enough data to create a reliable user model. A new user who is multilingual will cause classification problems early on due to the lack of data to produce a reliable model. To solve this particular problem, we use the language model of the dominant country for that user as a baseline. When a user has little to no data for their user language model, we merge the country model into the low-information user model. As more data becomes available for a given user, we gradually shift weight from the dominant country model to the user model, until the user model becomes the dominant of the two.
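One simple way to realize this back-off is a count-based interpolation between the two models. The sketch below assumes both models are dictionaries of language probabilities; the `pivot` constant is an illustrative choice, not the actual weighting we use:

```python
def blend_models(user_model, country_model, n_user_texts, pivot=10):
    """Back off from the country model to the user model as the user
    accumulates more texts. `pivot` is the (hypothetical) text count
    at which the two models are weighted equally.
    """
    w = n_user_texts / (n_user_texts + pivot)  # grows toward 1.0 with data
    langs = set(user_model) | set(country_model)
    return {lang: w * user_model.get(lang, 0.0)
                  + (1 - w) * country_model.get(lang, 0.0)
            for lang in langs}
```

With zero user texts this returns the country model unchanged; with hundreds of texts the country prior contributes almost nothing, matching the behavior described above.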
Finally, we create per-country orthographic feature models using the strings grouped by country. For this model, we have a set of 13 orthographic features; when a string triggers one of them, its generic language identification results are aggregated with those of every other string in the same country that triggered that feature. This allows a feature like “containsHanScript” to have a completely different language distribution in China than the one calculated for Japan, even though both Chinese and Japanese contain characters from the Han script. Other examples of this are Arabic vs. Farsi with the “containsArabicScript” feature, Russian vs. Ukrainian vs. Bulgarian with the “containsCyrillicScript” feature, and all Romance languages with the “containsLatinScript” feature.
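A rough sketch of how feature triggering might work, using Unicode character names as a script test. This covers only four of the features named above; the full feature set and the production detection logic are not shown here:

```python
import unicodedata

def triggered_features(text):
    """Return the set of orthographic features a string triggers.
    Uses Unicode character names as a crude script classifier;
    an illustrative sketch, not the production implementation.
    """
    features = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore punctuation, digits, emoji, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED"):
            features.add("containsHanScript")
        elif "CYRILLIC" in name:
            features.add("containsCyrillicScript")
        elif "ARABIC" in name:
            features.add("containsArabicScript")
        elif "LATIN" in name:
            features.add("containsLatinScript")
    return features
```

Because the features only describe the script, the language distribution attached to each (feature, country) pair is what actually disambiguates, say, Russian from Bulgarian text in Moscow versus Sofia.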
With the user models and the orthographic feature models in place, we rerun language identification on all of our tips and shouts. For each string, we apply the appropriate user's language model and any orthographic feature model the string triggers, then merge those two results with the generic language detector's results for that string, leaving us with a higher quality language classification. In preliminary analysis, we were able to correctly tag an additional ~3M tips and ~250M shouts using this method.
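Conceptually, the final step combines three probability distributions and picks the highest-scoring language. The weighted sum and the weights below are illustrative assumptions; the post does not specify how the three signals are balanced in production:

```python
def merge_results(generic, user, feature, weights=(0.4, 0.4, 0.2)):
    """Combine the generic detector's distribution with the user model
    and the triggered orthographic-feature model via a weighted sum
    (hypothetical weights). Returns (best_language, merged_scores).
    """
    wg, wu, wf = weights
    langs = set(generic) | set(user) | set(feature)
    scores = {lang: wg * generic.get(lang, 0.0)
                    + wu * user.get(lang, 0.0)
                    + wf * feature.get(lang, 0.0)
              for lang in langs}
    return max(scores, key=scores.get), scores
```

This reproduces the scenario from earlier in the post: a string the generic detector weakly calls German (40% vs. 38% for English) flips to English once a strongly English user model is merged in.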
Examples of corrected language identification:
"Place has good tacos tortas and licuados yum" Spanish -> English US user writing a tip in Chicago
"Хороше фірмове пиво!!!" Serbian -> Ukrainian Ukrainian user writing a tip in Ivano-Frankivsk
"Zastavte se na točenou Kofolu!" Slovene -> Czech Czech user writing a tip in Prague
If these kinds of language problems interest you, why not check out our current openings!