With in-person events happening more and more frequently, from Pokémon GO Fest to Safari Zones, the Ultra Beast launches, and now the new City Safari events, one of the biggest worries for players is whether the game will run smoothly. With more people in one location than the game can typically handle, Niantic have to come up with new ways to keep the game running well, without glitches. Add in the spoofer problem that inevitably comes with certain types of in-person events, and the potential for disaster is greater than ever.
So how do Niantic deal with that? They recently shared a blog post titled ‘Optimizing Pokémon GO: How a Centralized Redis Cluster Improved Performance and Reliability During Popular Raid Events’ by Da Xing and Michael Mei discussing this problem, and how it specifically relates to raiding at in-person events. They note that while lobbies for raids max out at 20 participants, each gym can host hundreds of lobbies simultaneously.
Tackling Technical Hurdles During Raids
“From a technical standpoint, the raid feature in Pokémon GO is engineered and deployed as a strictly in-memory feature. As a result, all players who participate in the same gym are hosted on the same server. During special raid events or in heavily frequented raid locations, the server infrastructure faces significant technical hurdles due to the sheer number of players present.”
“One of the challenges of raid events is the sudden large influx of traffic (aka spiky QPS). The game operates in a multi-server environment, with players usually being evenly distributed across all servers. However, during raids, players in the same gym need to be on the same server in order to access the shared game data stored only in the memory of the corresponding server, such as player profiles and raid metadata. This can lead to unbalanced server loads, as popular raid areas attract more players, resulting in increased traffic to the servers hosting those gyms.”
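To make the imbalance described above concrete, here is a minimal sketch of our own. It is not Niantic's code: the fleet size, the player numbers, and the modulo-based routing functions are all assumptions chosen purely for illustration. It compares pinning every player to the server that hosts their gym with spreading players across servers regardless of gym.

```python
from collections import Counter

NUM_SERVERS = 4  # hypothetical fleet size, for illustration only


def route_by_gym(gym_id: int) -> int:
    """Gym-pinned routing: everyone raiding at a gym lands on that gym's server."""
    return gym_id % NUM_SERVERS


def route_any_server(player_id: int) -> int:
    """Gym-agnostic routing: spread players evenly, ignoring which gym they raid."""
    return player_id % NUM_SERVERS


# Hypothetical event: 1,000 players, 900 of them at one popular gym (id 7),
# and the remaining 100 spread across gyms 0 to 3.
players = [(pid, 7 if pid < 900 else pid % 4) for pid in range(1000)]

pinned_load = Counter(route_by_gym(gym) for _, gym in players)
spread_load = Counter(route_any_server(pid) for pid, _ in players)

print("gym-pinned:", dict(sorted(pinned_load.items())))
print("any-server:", dict(sorted(spread_load.items())))
```

In this toy setup, the server hosting the popular gym ends up with 925 of the 1,000 players under gym-pinned routing, while the gym-agnostic version puts 250 on each server.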
Spiky QPS and Delays for Players
“During particularly popular raid events, the servers can become overwhelmed by high spiky queries, as thousands of players may be raiding under one gym over a short period of time. This can cause significant delays for players in the same raid, as well as for those who are not in the raid but are on the same server, eventually rendering the game unplayable for all affected players. To address this issue, Niantic site reliability engineers slowly drain out the affected server, temporarily redirect players to other servers and restart the busy server.
In addition to QPS-related challenges, the stateful nature of the system also makes scaling and restarting difficult. The server stores in-game player attributes in memory, which restricts players to connect and remain on a particular server. Niantic has developed an effective but complicated process to ensure that players are not affected during scaling. However, during major raid events when servers are clogged by spiky QPS, this process may take longer to drain out players on hot servers, which means that game clients may not be responsive for several minutes until the hot server is restarted.”
Simplifying the Technical System
“One major change we made was to store the raid-related shared data, previously stored in the memory of the servers, in the centralized Redis cluster. This enables all Pokémon GO servers to access the raid-related shared data, eliminating the need for players to connect to the specific server where the gym is hosted to join raid groups. This simplifies the technical system significantly.”
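As a rough illustration of the idea, the sketch below shows two game servers that hold no raid data themselves and instead read and write lobby membership through one shared store. Everything here is an assumption on our part: `SharedRaidStore`, `GameServer`, and the `raid:<gym>:<lobby>` key format are hypothetical names, and a plain in-memory class stands in for the Redis cluster so the example runs without any external service. Only the 20-player lobby cap comes from Niantic's post.

```python
class SharedRaidStore:
    """In-memory stand-in for a centralized store (hypothetical, not Niantic's API)."""

    def __init__(self) -> None:
        self._lobbies: dict[str, set[str]] = {}

    def join_lobby(self, lobby_key: str, player_id: str, max_size: int = 20) -> bool:
        """Add a player to a lobby; return False if the lobby is already full."""
        members = self._lobbies.setdefault(lobby_key, set())
        if player_id in members:
            return True
        if len(members) >= max_size:
            return False
        members.add(player_id)
        return True

    def lobby_members(self, lobby_key: str) -> set[str]:
        """Return a copy of the lobby's current members."""
        return set(self._lobbies.get(lobby_key, set()))


class GameServer:
    """A server that keeps no raid state of its own and defers to the shared store."""

    def __init__(self, name: str, store: SharedRaidStore) -> None:
        self.name = name
        self.store = store

    def handle_join(self, gym_id: str, lobby_id: int, player_id: str) -> bool:
        # Hypothetical key layout: one entry per gym and lobby.
        return self.store.join_lobby(f"raid:{gym_id}:{lobby_id}", player_id)


store = SharedRaidStore()
server_a = GameServer("server-a", store)
server_b = GameServer("server-b", store)

# Two players reach the same lobby through different servers.
server_a.handle_join("gym-42", 1, "ash")
server_b.handle_join("gym-42", 1, "misty")
print(sorted(store.lobby_members("raid:gym-42:1")))

# Fill the lobby to its 20-player cap, then try one more player.
for i in range(18):
    server_a.handle_join("gym-42", 1, f"trainer-{i}")
overflow_accepted = server_b.handle_join("gym-42", 1, "late-arrival")
print("21st player accepted:", overflow_accepted)
```

Because neither server owns the lobby, it no longer matters which one a player connects to, which is the property the quoted passage describes.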
“With the raid-related shared data stored in the Redis cluster, load is now more evenly distributed. Players can connect to any server, regardless of where the gym is hosted, eliminating the unbalanced load caused by popular raid gyms. This change has removed the bottleneck and allowed the servers to sustain higher QPS during popular raid events.
The provided diagram presents a heatmap visualizing the load distribution across all servers. The x-axis corresponds to time, while the y-axis represents the number of players on each server. Each cell within the heatmap is color-coded to indicate the magnitude of server load. Specifically, a red cell indicates that a significant number of servers exhibit similar player counts, while a green cell signifies that only a few servers are accommodating a specific player count.”
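For readers wondering how a heatmap like the one described could be built, here is a small sketch under our own assumptions. The sample data, the 500-player bucket width, and the time slots are invented for illustration and are not Niantic's figures. Each cell is keyed by a time slot and a player-count bucket, and its value is the number of servers that fall into that bucket at that time.

```python
from collections import Counter

BUCKET_SIZE = 500  # hypothetical bucket width, in players

# Hypothetical samples: (time slot, players on one server at that time).
samples = [
    ("11:00", 300), ("11:00", 400), ("11:00", 900), ("11:00", 4200),
    ("12:00", 1600), ("12:00", 1900), ("12:00", 2100), ("12:00", 2400),
]


def bucket(player_count: int) -> int:
    """Round a player count down to the start of its bucket."""
    return (player_count // BUCKET_SIZE) * BUCKET_SIZE


# Each heatmap cell counts how many servers share a bucket in a time slot.
cells = Counter((slot, bucket(count)) for slot, count in samples)

for (slot, start), servers in sorted(cells.items()):
    print(f"{slot}  {start}-{start + BUCKET_SIZE - 1} players: {servers} server(s)")
```

In this made-up data, the 11:00 slot has one server alone in the 4,000+ bucket (a hotspot), while at 12:00 every server sits in the 1,500 to 2,499 range.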
“Since we’ve slowly rolled out this Redis solution starting at approximately 11:30 am on that particular day, a noticeable change in the server landscape occurred. The occurrence of high player count servers, commonly referred to as “hotspots,” reduced significantly. Instead, the majority of servers are now hosting a relatively consistent player count ranging between 1.5k to 2.5k.”
Launching at a Global Scale
“The introduction of the project on a global scale, at approximately 4:00 pm, resulted in a significant reduction in latency. Latency represents the duration it takes for a server to respond to a player’s request, typically occurring when a player interacts with the game client. Notably, the maximum recorded latency has decreased from over 1 second to approximately 250 milliseconds (75% latency drop). This improvement is visually represented in the chart provided below.”
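The 75% figure can be checked with a quick calculation. The helper below is our own, with `percent_drop` being a hypothetical name, and it treats "over 1 second" as exactly 1,000 ms, so the real drop would be at least this large.

```python
def percent_drop(before_ms: float, after_ms: float) -> float:
    """Percentage reduction going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100


# Figures from the quoted post: roughly 1 second before, about 250 ms after.
drop = percent_drop(1000, 250)
print(f"{drop:.0f}% latency drop")
```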
“Moreover, the server is now more reliable. Long delays and server hiccups during popular raid events have been greatly reduced since the project was launched into production and fine-tuned through a few iterations. This provides a more stable raiding experience during major events and saves operational and maintenance costs that can be invested in other areas to improve the overall gaming experience.
We are constantly working to improve the Pokémon GO player experience, and have already started developing an even better solution to further enhance the performance and reliability during popular raid events. We will be sharing more details on this project soon, so stay tuned for more. Happy Raiding!”
Niantic don’t often share technical information and details like this with us, and it is a really interesting read! Handling large numbers of players at in-person events, especially when all those players are trying to get into the same raids on specific gyms at once, can be a huge issue, and it is clear Niantic are putting a lot of effort into making sure this won’t be a problem at future events.
Spiky QPS is an issue that plagues many aspects of the game at in-person events, and raids in particular. There are often reports from the earliest timezones of problems with raids due to spoofers, as happened when the first Elite Raids occurred. Anything that can make the game smoother for those timezones, which are frequently plagued with glitches and issues, is a win for the community.
What do you think Niantic can do to improve the smooth running of the game for in-person raids, events, and the early timezones with spoofer issues?