Foursquare has a deep collection of more than 65 million venues. One of the signals we use to help users discover new places they'll love is similar venues. Similar venues not only powers the features shown below, but is also an underlying scoring component for our search and recommendation algorithms.
While we've had a similar venues signal for quite a while, we've recently revisited the problem from the ground up. This post will provide a high level overview of how we approached the latest update to venue similarity.
The original venue similarity job was composed of two hadoop jobs using our Scala wrapper around raw Mapreduce with separate Luigi configurations. After the rewrite, the whole similar venue calculation was contained in one Scalding job with a single Luigi configuration, increasing code legibility. We also optimized the resulting HFile to be smaller than the previous one by re-organizing the Thrift structure. Thanks to this, the online fetching of similar venues for user-facing services results in lower loads for the servers.
The premise of using covisitation to calculate venue similarity is that if a lot of people tend to frequent two venues, they may be similar to each other. For example, in San Francisco, it turns out that many users that visit Smitten Ice Cream in Hayes Valley also visit Humphry Slocombe in the Mission, both similarly hip ice cream shops with seasonal flavors. Covisitation can be a very valuable signal, especially when we have over 8 billion check-ins to work with.
Computing covisitation at scale has its challenges. At a high level, what we want to do is compute the cosine similarity between venues. We model each venue as a vector of users: if we have n users, each venue is a vector of length n. The vector's value at index i is 1 if the user at index i has been to that venue and 0 otherwise. We can naively compute cosine similarity by mapping over each user's check-in data and emitting all pairs of visited venues. This yields O(n²) blow ups for users with deep histories that then lead to unacceptable runtimes and sadness on our Hadoop cluster. Here are a few important considerations for optimizing the calculation:
- Sampling: The folks at Twitter came up with the DISCO algorithm outlined here, which describes a method for intelligently sampling data without greatly altering the final similarity measures.
- Reflexivity: Computing the covisitation of venue A -> venue B means you don't need to compute it the other way around. If we order the data properly, we can greatly reduce the number of computations needed.
Covisitation by itself is not without its flaws. One problem is there are super-hub venues like airports and train stations that people pretty much have to visit that can introduce noise despite efforts to normalize them out. There are also problems of locality, a corner deli could be next door to a diner, and share many common visitors, but not really be very similar.
When you think of a restaurant you commonly go to, you might categorize it in only one category. For example, my favorite BBQ restaurant can simply be categorized as a “BBQ restaurant". However, a venue can have multiple categories. A BBQ restaurant can be considered just a generic “restaurant", or an American restaurant. In Foursquare, each venue we has a primary category but can have other sub-categories as well. Furthermore, categories are modeled in a tree hierarchy, so each category could have a parent or children categories. Previously, we only computed similarity between categories and their parents. A Dim Sum restaurant will have an exact category match with another Dim Sum restaurant, but it will have a mismatch in the deepest category node with a Hong Kong restaurant.
While this is good in a naive examination, it doesn't prove to be the best way to determine category similarity. For example, a Hot Pot restaurant and a Chinese restaurant are fairly similar, but because of how they are modeled it was difficult to reason about the exact similarity between them. We want a Hot Pot restaurant to be more similar to a Chinese restaurant instead of a Filipino restaurant.
To alleviate this issue we used maximum likelihood estimates to answer the following question: Given a venue labeled with category x and at least one other category, what is the probability that it's also labeled with category y? Looking at co-occurrences between pairs of categories is much more fine-grained compared to looking at exact matches between category lists. Using this metric gives us the desired outcome of a Hot Pot restaurant being more similar to a Chinese restaurant (calculated similarity of 27%) compared to a Filipino restaurant (calculated similarity of 0.0%!). At this point, we've already made a big improvement.
Looking at covisitation between venues with a high category similarity is exactly the situation we've described above: if a user frequents Ice Cream shop X and Ice Cream shop Y, X and Y are very likely to be similar. This metric works well for major cities with lots of foot traffic and rich venue information. However, what if we are looking at a city or a neighborhood that is less populated?
Tastes is another valuable feature in determining venue similarity only available on Foursquare. Each venue on Foursquare has tastes associated with it, so if we have venues without a lot of check-ins we can rely on tastes for coverage. At a high level, taste similarity is calculated by looking at the tf-idf for matching tastes in two venues (to learn more about another fun way we use tf-idfs, check this out: Geographic Taste Uniqueness). With this metric we can group venues with similar primary tastes even if we don't have rich covisitation data. Taste similarity can also help find venues that seem unrelated on the surface but provide similar food/experiences. For example, if you're a fan of Poutine you will love both Citizen's Band, an American Restaurant, and Jamber Wine Pub, a Wine Bar and Gastropub.
All these parts work together to produce a list of venues similar to your favorite venue. As a result of this effort, we've greatly improved the number of similar venues served, and the new method has resulted in a 7% increase in CTR for similar venues for the web in the United States and 3% increase globally. You can check out similar venues on the web or in the app when you're looking at a venue page. If you're interested in working with big data to make meaningful user experiences like these, come join us!
— Billy Seol (@pabpbapapb)