Last August we launched Tastes to help our users customize their local search experience. Taste tags like “trendy place”, “pork buns”, or “romantic restaurant” not only help users find the kinds of places they like when out and about, but also allow us to answer, for the first time, the question “What is this area known for?”
Taste data is a two-way street. Not only are our users making use of tastes to personalize their experiences within the app, but every venue for which we have external and user-generated content has its own unique taste profile as well. Drawing on many input sources, we are able to reliably attach tastes to the venues in the Foursquare venue database and summarize how strongly each applied taste is affiliated with a given venue in a single affinity score. Applying our NLP stack to the user tips left at a venue, we distill that data into several metrics and scores (e.g. a sentiment score, a quality score, and a spam-like measure) that feed directly into the affinity score. Additionally, explicit data from users, in the form of ‘Rate Places’ votes that signal which tastes our users liked at a venue, is also incorporated into that final score.
Once tastes and their affinity scores are applied to our venues we can dig into our data science tool chest and use Old Faithful, TF-IDF, to find the tastes that are most unique in a given geographic region. TF-IDF is typically used to measure the importance of a term within a particular document that belongs to a larger corpus of documents. However, for our geographic taste measurement scores we have to modify the traditional understanding of what terms, documents, and corpora mean. Given the task of trying to identify the most important tastes of a sub-region in comparison to the region as a whole, we treat each venue as a single document, the tastes that are attached to the venues as the terms, and the affinity for a specified taste as the term frequency. Finally, we aggregate the venues, v, by the sub-region R we wish to measure and apply the following customized formula to find the taste uniqueness of taste t in R:
This formula is applied to every taste for a specified sub-region, producing a ranking of how unique every taste is to that sub-region. We then take the top 50 tastes ranked by uniqueness and re-sort them by their affinity to the sub-region in order to find the most frequently seen tastes among the most unique. Every week, as part of our Hadoop data processing pipeline, we calculate these scores on various pairings of region and sub-region (US vs. US state, US vs. US city, city vs. neighborhood) and use the final rankings to produce the “top” tastes in each sub-region.
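As a rough illustration of the steps above, here is a minimal Python sketch. It assumes a plain TF-IDF-style score (the summed affinity of a taste in the sub-region, multiplied by the log of the inverse venue frequency across the whole region), which is a stand-in and not necessarily the customized formula used in the pipeline. The function names, the input layout, and the toy venues are all hypothetical.

```python
import math
from collections import defaultdict


def taste_uniqueness(region_venues, sub_region_ids):
    """Score each taste in a sub-region with a TF-IDF-style measure.

    region_venues: dict of venue id -> {taste: affinity} for the whole region.
    sub_region_ids: ids of the venues that fall inside the sub-region R.
    Returns (uniqueness, affinity_in_r), both keyed by taste.
    """
    n_venues = len(region_venues)

    # "Document frequency": how many venues in the whole region carry the taste.
    venue_freq = defaultdict(int)
    for tastes in region_venues.values():
        for taste in tastes:
            venue_freq[taste] += 1

    # "Term frequency": affinity for the taste, summed over the venues in R.
    affinity_in_r = defaultdict(float)
    for venue_id in sub_region_ids:
        for taste, affinity in region_venues[venue_id].items():
            affinity_in_r[taste] += affinity

    # Assumed scoring rule: summed affinity times log inverse venue frequency.
    uniqueness = {
        taste: tf * math.log(n_venues / venue_freq[taste])
        for taste, tf in affinity_in_r.items()
    }
    return uniqueness, dict(affinity_in_r)


def top_tastes(uniqueness, affinity_in_r, shortlist_size=50):
    """Keep the most unique tastes, then re-sort them by affinity to R."""
    shortlist = sorted(uniqueness, key=uniqueness.get, reverse=True)
    shortlist = shortlist[:shortlist_size]
    return sorted(shortlist, key=affinity_in_r.get, reverse=True)


# Toy data: four venues in the region, two of them in the sub-region.
region_venues = {
    "v1": {"pork buns": 0.9, "trendy place": 0.4},
    "v2": {"pork buns": 0.8, "dim sum": 0.7},
    "v3": {"trendy place": 0.6},
    "v4": {"trendy place": 0.5, "romantic restaurant": 0.9},
}
sub_region_ids = {"v1", "v2"}

uniqueness, affinity_in_r = taste_uniqueness(region_venues, sub_region_ids)
ranked = top_tastes(uniqueness, affinity_in_r, shortlist_size=2)
print(ranked)
```

In this toy example “trendy place” appears in the sub-region but is common across the whole region, so it scores low on uniqueness and drops out of the shortlist before the final re-sort by affinity.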
The tables below represent a sampling of the results from this work.
When we first generated this data, we immediately knew it would make a great feature in the Foursquare app. With a few changes to our search pipeline, we were able to surface these top tastes as quick links for users visiting these neighborhoods:
Chinatown, NY, NY
Mission District, San Francisco, CA
We've just scratched the surface of digging into this data. If tackling these kinds of data analysis problems and working with an amazing dataset (and incredible co-workers) interests you, come join us!