Feature Flighting and QoS
The fast continuous integration and deployment cycles adopted by most software organizations mean greater code churn: new features are released and old ones are deprecated on a regular basis. Although engineers usually have several testing environments in which to run and test their code, those environments often cannot match the scale of production (number of users, request volume, and so on). As a result, the true test of new software occurs only when it runs in the production environment.
We therefore need a mechanism to test new code and features in the production environment without degrading the user experience. We can first segment users by some criterion, such as geolocation or user type. We can then choose one or more of those segments to use the new app and exercise all functionality of the new feature or app in production. This mechanism is called feature flighting. A more complete discussion on feature flighting can be found here.
One possible implementation is as follows. When flighting an API, we add logic that checks the context (the location id in this example) and then routes the request to the legacy or the new function according to that context.
Feature Flighting Example
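from fastapi import FastAPI

app = FastAPI()

# Location ids enrolled in the flight (strings, to match the path parameter type).
location_whitelist = ['x', 'y']

@app.get("/items/{location_id}")
def read_item(location_id: str):
    # Whitelisted locations get the new code path; all others stay on legacy.
    if location_id in location_whitelist:
        return new_read_item()
    else:
        return legacy_read_item()

def legacy_read_item():
    pass  # placeholder for the existing implementation

def new_read_item():
    pass  # placeholder for the new implementation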
The following feature tools are popular choices for flighting features in the production environment.
Best Practices to Avoid Incidents
Your feature flighting logic should use a whitelisting approach: route the request to the new feature only if the context or the user is in the whitelist.
The default/legacy code path should continue to work if the flight check fails.
Always create tasks to remove the flighting logic once all users have been migrated to the new feature and testing is complete.
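The first two practices can be sketched in a few lines of framework-free Python. This is a minimal illustration under stated assumptions, not a definitive implementation: `is_flighted`, `LOCATION_WHITELIST`, and the two handler stubs are hypothetical names, and the whitelist is hard-coded, whereas a real system would likely load it from configuration.

```python
import logging

logger = logging.getLogger("flighting")

# Hypothetical whitelist of location ids enrolled in the flight.
LOCATION_WHITELIST = frozenset({"x", "y"})


def is_flighted(location_id, whitelist=LOCATION_WHITELIST):
    """Return True only when the location is explicitly whitelisted."""
    try:
        return location_id in whitelist
    except Exception:
        # If the check itself fails (e.g. an unhashable id), fall back to legacy.
        logger.exception("flight check failed; using legacy path")
        return False


def legacy_read_item(location_id):
    # Stand-in for the existing implementation.
    return {"handler": "legacy", "location_id": location_id}


def new_read_item(location_id):
    # Stand-in for the new implementation.
    return {"handler": "new", "location_id": location_id}


def read_item(location_id):
    """Route whitelisted locations to the new handler, everyone else to legacy."""
    if is_flighted(location_id):
        return new_read_item(location_id)
    return legacy_read_item(location_id)
```

Because the check returns `False` on any error, a broken or unavailable whitelist sends traffic to the legacy path rather than failing the request.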
Use Case: Food Delivery App Migration Using Geospatial Feature Flighting
In 2022, I was leading the engineering team of a tech startup in Bangladesh. One of the startup's services was app-based food delivery for roughly 10M customers in the city of Dhaka. Customers ordered food through the app; a delivery driver received the order, picked up the food from the restaurant, and delivered it to the address listed on the order for a fee. After the legacy food app was rewritten, the management team wanted to migrate users from the old app to the new one while maintaining a high QoS and avoiding service interruptions. The problem was that the refactored architecture, code, and storage systems could not be tested against a user base of that scale, so the migration carried a risk of service disruptions and outages.
Problem
How can we choose a small set of users and delivery personnel to use the new version of the food delivery app, so that the impact of any bugs remains isolated to that set without affecting the rest of the users? As the newly designed service stabilizes, we would like to gradually onboard all users to the new app.
Solution: Strangler Pattern and Geographical Zones
The solution was to divide all users in Dhaka into smaller subgroups based on their geographical location (latitude, longitude): Uttara, Mirpur, and so on. We could then choose one or more of those zones and onboard them to the new app one zone at a time. This gradual migration would allow us to test and stabilize the new app while avoiding major service outages.
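To make the zone idea concrete, here is a minimal sketch of mapping a coordinate to a zone. It assumes each zone can be approximated by a rectangular bounding box. The `Zone` type, the `zone_for` function, and the coordinates are illustrative assumptions, not the boundaries or the service used in the actual migration.

```python
from typing import NamedTuple, Optional


class Zone(NamedTuple):
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Illustrative bounding boxes only; these are not surveyed zone boundaries.
ZONES = [
    Zone("Uttara", 23.85, 23.90, 90.37, 90.42),
    Zone("Mirpur", 23.79, 23.84, 90.34, 90.39),
]


def zone_for(lat: float, lon: float) -> Optional[str]:
    """Return the name of the first zone containing the point, or None."""
    for zone in ZONES:
        in_lat = zone.min_lat <= lat <= zone.max_lat
        in_lon = zone.min_lon <= lon <= zone.max_lon
        if in_lat and in_lon:
            return zone.name
    return None
```

Returning `None` for points outside every box lets the caller treat unknown locations as "not migrated" and keep them on the legacy app.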
The solution steps were as follows:
Continue running the legacy and the new app side by side in the production environment
This required onboarding the support and delivery personnel so that they could use both the legacy and the new backend dashboards
Design a geolocation web service to map a user’s geolocation to one of the geographical zones
Use feature flighting to route all network traffic from the user app (order, payment, etc.) to either the legacy app or the new app based on the user’s geographical zone
Use the strangler pattern to gradually transition the geographical zones from the legacy system to the new system
When all users’ zones have been migrated to the new services, sunset the legacy services
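The routing and gradual-transition steps above can be sketched as a small in-memory router. This is an illustration under assumptions rather than the system that was deployed: `ZoneRouter` and its methods are hypothetical names, the zone is assumed to have been resolved already by the geolocation service, and a real deployment would persist the set of migrated zones instead of holding it in memory.

```python
class ZoneRouter:
    """Tracks which zones are migrated and picks a backend per request."""

    def __init__(self, all_zones):
        self.all_zones = set(all_zones)
        self.migrated = set()  # zones currently served by the new system

    def migrate(self, zone):
        """Move one zone to the new system (one strangler step)."""
        if zone not in self.all_zones:
            raise ValueError(f"unknown zone: {zone}")
        self.migrated.add(zone)

    def rollback(self, zone):
        """Return a zone to the legacy system if the new one misbehaves."""
        self.migrated.discard(zone)

    def backend_for(self, zone):
        """Unknown or unmigrated zones always go to the legacy backend."""
        return "new" if zone in self.migrated else "legacy"

    def can_sunset_legacy(self):
        """True once every known zone is served by the new system."""
        return self.migrated == self.all_zones


router = ZoneRouter(["Uttara", "Mirpur"])
router.migrate("Uttara")              # onboard the first zone
print(router.backend_for("Uttara"))   # new
print(router.backend_for("Mirpur"))   # legacy
print(router.can_sunset_legacy())     # False
```

Keeping a `rollback` operation alongside `migrate` means a zone can be returned to the legacy system quickly if the new app shows problems there.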
End Result
Using geographical feature flighting, our engineering team migrated the users one zone at a time and stabilized the new app without any service outages.