When building quantitative systems that drive commercial value, pragmatism and innovation are not in conflict with one another.
Foursquare is a place where quantitative research needs to yield clear business wins quickly and iteratively. As a lean, growing start-up with challenging research problems and data-focused customers, Foursquare needs an approach to technology that is rigorous, interpretable, defensible and business-aligned. A hybrid of product-oriented pragmatism and scientific rigor therefore became part of the company culture, born from the growing pains of becoming the data-driven enterprise company we are today.
Now that Foursquare has acquired Placed, a company that went through a similar discovery process, that balance remains key to our success. Here are two examples of how balancing practicality and scientific fundamentals has kept data science at Foursquare focused and impactful.
Practical modeling methods to avoid bias and increase business relevance (without being too fancy)
Being pragmatic as a data scientist is an effective way to stay connected to the business and have an impact on it. One way to do this is to measure what matters to that business and train models using the most relevant available data. You need to be able to draw a coherent line connecting business case to data set to model to business outcome.
On the flip side, applying unnecessary constraints, an easy mistake to make when designing new models, breaks that line and fails to generate impact. In academic settings, we often apply constraints for the sake of perceived elegance of methodology. It is important not to apply such constraints too liberally in the private sector.
These self-imposed constraints can surface all sorts of biases, some as obvious as the ones highlighted in this public case of MIT’s facial detection analysis. There, facial recognition models trained on presumably Caucasian training sets failed to recognize faces as famous as those of Serena Williams, Michelle Obama and Oprah Winfrey.
In addition to the obvious social bias problem at play, there's an avoidable and common issue of training models in the absence of context.
From a machine learning standpoint, it's unrealistic and unnecessary to expect a generated model to intuit concepts, e.g., that skin tone and hair style are variables in human faces. A machine-learned model is like a baby, assuming that whatever it sees in the room is really the whole world. We have experiential knowledge we can leverage when teaching these models. We create data sets intending to highlight the important variables from our own awareness. We tell our models the things we already know so that they can learn the nuances we don't yet know.
In this face recognition case, a simple solution is to add a diverse and representative set of people to the training data so that the model can learn to recognize a range of facial features. We should highlight key variables as features up front: skin tone, hair style, eyes, nose, ears, glasses and so forth. Is it “cheating” or “inelegant” to put Michelle Obama directly in the training set? Not at all. Put Jennifer Lopez in your test set too. You want train and test sets to be independent and balanced, but each should include key product-relevant examples. You want your data to tell the story of why your model is applicable and interpretable. Using data sets that lack curation only generates unwanted bias in your models.
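To make the idea of independent but balanced train and test sets concrete, here is a minimal Python sketch of a stratified split. It is an illustration under stated assumptions, not a description of any production pipeline: the record layout, the "group" key and the function name stratified_split are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(examples, group_key, test_fraction=0.25, seed=0):
    """Split examples so that every group appears in both train and test.

    `examples` is a list of dicts; `group_key` names the field holding the
    group label (a hypothetical curated attribute, e.g. a demographic bucket).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for example in examples:
        by_group[example[group_key]].append(example)

    train, test = [], []
    for group in sorted(by_group):
        members = by_group[group]
        if len(members) < 2:
            raise ValueError(f"group {group!r} needs at least 2 examples")
        rng.shuffle(members)
        # At least one example per group goes to test,
        # and at least one stays behind for train.
        n_test = min(max(1, round(len(members) * test_fraction)), len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical toy data: 12 records spread evenly over three groups.
people = [{"id": i, "group": "ABC"[i % 3]} for i in range(12)]
train_set, test_set = stratified_split(people, group_key="group")
```

Each record lands in exactly one of the two sets, so they stay independent, while the per-group split keeps a small group from vanishing from either side.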
This same attitude applies to any real-world measured data and comes up all the time. We know facts about the world that don't need to be inferred from scratch by our models. As an example with POI (point of interest) data at Foursquare, shopping malls tend to be densely filled with commercial places of interest, whereas stadiums tend to have a lot of commercially vacant space in their centers (e.g., a field or a court).
Understanding this concept doesn't require a significant research project. It can be added through training set diversity. Similarly, we know that places with profane names or with no user reviews are less likely to be real venues. We can expose features like profanity, stadium-ness, venue category and venue "tastes" directly for our models because we know about them from real-world experience.
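As a rough illustration of exposing what we already know as explicit features, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for the example: the Venue fields, the placeholder word list and the category set are not Foursquare's actual schema or feature set.

```python
from dataclasses import dataclass

# Placeholder word list; a real system would use a curated lexicon.
PROFANE_WORDS = {"darn", "heck"}
# Hypothetical categories whose footprint has a commercially vacant center.
OPEN_CENTER_CATEGORIES = {"stadium", "arena"}


@dataclass
class Venue:
    name: str
    category: str
    review_count: int


def venue_features(venue):
    """Turn prior knowledge about venues into explicit model features."""
    tokens = {t.strip(".,!?'\"").lower() for t in venue.name.split()}
    return {
        "has_profane_name": float(bool(tokens & PROFANE_WORDS)),
        "has_no_reviews": float(venue.review_count == 0),
        "is_stadium_like": float(venue.category.lower() in OPEN_CENTER_CATEGORIES),
    }


features = venue_features(Venue("Corner Cafe", "cafe", 12))
```

The model still has to learn how much weight each signal deserves, but it no longer has to discover from raw data that these signals exist.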
Practicing agile and iterative data modeling with scientific rigor
Another common issue we’ve observed arises in data science applications that lack long-established best practices or strong academic attention. In these areas, there is a risk of long research cycles that don’t follow a straight path. We recommend mitigating that risk by borrowing concepts from iterative software development and “Agile” processes. A typical data science project can follow these steps:
Define a success metric that is directly related to a business objective.
Define a simple model that can be scored against the success metric.
Iterate over a set of alternate approaches (only incrementally more complex) to improve the success metric.
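The steps above can be sketched in a few lines of Python. This is a toy illustration, not our actual tooling: the accuracy metric, the candidate rules, the data and the min_gain stopping threshold are all assumptions made for the example.

```python
def accuracy(predict, rows):
    """Success metric: fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def iterate_models(candidates, rows, metric, min_gain=0.01):
    """Score candidates in order of increasing complexity.

    Stops at the first candidate that fails to beat the best score so far
    by at least `min_gain` (an assumed stopping rule).
    Returns (best_name, history), where history lists (name, score) pairs.
    """
    history = []
    best_name, best_score = None, float("-inf")
    for name, predict in candidates:
        score = metric(predict, rows)
        history.append((name, score))
        if score - best_score < min_gain:
            break
        best_name, best_score = name, score
    return best_name, history


# Hypothetical held-out data: (review_count, is_real_venue).
rows = [(0, False), (1, False), (2, True), (3, True),
        (0, False), (5, True), (1, True), (0, False)]

# Candidates ordered from simplest to incrementally more complex.
candidates = [
    ("always_real", lambda reviews: True),
    ("has_reviews", lambda reviews: reviews > 0),
    ("two_plus_reviews", lambda reviews: reviews >= 2),
]

best, history = iterate_models(candidates, rows, accuracy)
```

Keeping the full history of scores, rather than only the winner, is what lets a team see which kinds of added complexity actually paid off.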
We aim for the ability to test (or “round trip”) one or more approaches per week. This allows us to see within a few weeks whether we can achieve the business objective and whether we are hitting a point of diminishing returns. Over time, we have seen concrete benefits when applying agile-style data science, including:
Resulting models are sufficiently complex to meet the business objectives but not more complex than is required.
The cost to implement and maintain a necessary model is kept manageable, including for new employees.
This iterative process has benefits beyond productivity and consistency. Seeing which increases in model complexity produce corresponding increases in performance helps data scientists understand the nature of the underlying data in their sector. Insights accumulate about which model classes are better suited to particular problems, which can speed up the search for the most appropriate algorithm. This is a greedy approach at heart, so it can occasionally miss a better solution. However, in our experience, such misses are the exception rather than the common case when working with the noisy data sets encountered in everyday practice. The same benefits garnered from process-driven engineering have analogs in rapid-cycle innovation and modeling.
For any data-driven organization, it’s critical to innovate on a variety of quantitative problems, ranging across topics that can include user behavior, trust modeling and other developing areas. At Foursquare we're continuing to innovate on a wide range of these quantitative areas. We aim to be both pragmatic and creative in our initiatives so that we can continue to drive business value at a rapid rate.
If you’re interested in learning more about how to invent the future with data scientists at Foursquare, check out our job openings at www.foursquare.com/careers.
A version of this post originally appeared on Towards Data Science.