Big Tech Digest #18 💥

Featuring articles from Netflix, Expedia, Airbnb, Flipkart, and many more!

Aug 01, 2024

Happy Thursday 👋!

Welcome to another Big Tech Digest issue. This time around, we have so many great articles that I had a really hard time building the featured list.

… and all these were published since the last Big Tech Digest issue 2 weeks ago!

There’s just one thing you could do to help me grow Big Tech Digest: go ahead and mention it to your friends and/or teammates. Thank you! 🙏

Without further ado, let’s get started!

// 🏆 Must reads

1. "Java 21 Virtual Threads - Dude, Where’s My Lock?"

Netflix ⸱ 10 min read ⸱ 29 Jul

Discusses the adoption of virtual threads as part of the migration to Java 21
Describes the performance issue encountered with virtual threads in Java 21
Explores the symptoms of the issue, including a growing number of sockets stuck in the CLOSE_WAIT state
Shares the analysis of thread dumps and the identification of the problem with locking

2. "How to create fake back-end using IndexedDB"

by Andrey Ozornin ⸱ Miro ⸱ 10 min read ⸱ 22 Jul

Discusses the use of IndexedDB as a back-end substitute for a demo version of an app
Explores the limitations of local storage for persisting data in the browser
Introduces IndexedDB as a better alternative for storing data between sessions
Covers the step-by-step process of creating a fake back-end using IndexedDB

3. "The noisy JIT Compiler"

by Tomasz Richert ⸱ Allegro ⸱ 6 min read ⸱ 26 Jul

Describes how the JIT compiler caused performance issues in a Java-based application
Explores the impact of the JIT compiler on CPU spikes and response times at application start
Shares the diagnostic tools used to identify the issue, including thread dumps and flame graphs

// 📬 Optional reads

a.k.a. The Best of the Rest!

"Node.js and the tale of worker threads"

by Jeremy Colin ⸱ Zalando ⸱ 1 min read ⸱ 25 Jul

"Parsing: the merit of strictly typed JSON"

by Max Duval ⸱ The Guardian ⸱ 5 min read ⸱ 26 Jul

Describes how TypeScript is used to provide strong structural type analysis for JavaScript code
Explains the use of JSON for REST APIs and the potential issues with TypeScript static analysis
Discusses the challenges of ensuring TypeScript prevents runtime errors and handling unexpected object shapes
Introduces the use of parsing libraries to validate unknown objects against custom schemas
Covers the implementation of a custom 'Result' type to handle failures when parsing data from REST APIs
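
To make the "parse into a Result" idea concrete, here is a minimal sketch. The article works in TypeScript with a parsing library; this is only a Python analogue, and the `Article` shape, the `parse_article` helper, and the `Ok`/`Err` names are hypothetical, not taken from The Guardian's code.

```python
import json
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    """Successful parse carrying the validated value."""
    value: T


@dataclass(frozen=True)
class Err:
    """Failed parse carrying a human-readable reason."""
    error: str


Result = Union[Ok[T], Err]


@dataclass(frozen=True)
class Article:
    """Hypothetical shape we expect from a REST API."""
    id: str
    title: str


def parse_article(raw: str) -> Result[Article]:
    """Validate untrusted JSON and return Ok or Err instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return Err(f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return Err("expected a JSON object")
    article_id, title = data.get("id"), data.get("title")
    if not isinstance(article_id, str):
        return Err("field 'id' must be a string")
    if not isinstance(title, str):
        return Err("field 'title' must be a string")
    return Ok(Article(id=article_id, title=title))


if __name__ == "__main__":
    for payload in ('{"id": "a1", "title": "Hello"}', '{"id": 7}', "not json"):
        print(parse_article(payload))
```

The caller has to check which variant came back before touching the data, so an unexpected object shape surfaces at the API boundary rather than deep inside the application.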

"How Airbnb Smoothly Upgrades React"

by Andre Wiggins ⸱ Airbnb ⸱ 9 min read ⸱ 23 Jul

Explores the design of the React Upgrade System to avoid a long-running upgrade branch
Covers the testing of the upgrade using visual regression testing, integration testing, and unit testing
Introduces the progressive rollout approach to control the upgrade across traffic and product surfaces
Shares the success of completely rolling out React 18 to all web surfaces at Airbnb using the React Upgrade System

"Channel-Smart Property Search: How Expedia Tailors Rankings for You"

by Anne Morvan ⸱ Expedia Group ⸱ 5 min read ⸱ 23 Jul

Explores how Expedia adapts lodging rankings through machine learning
Describes the differences between destination and property searches
Gives an overview of the lodging utility ranking algorithm and its training dataset
Covers the model architecture and its adaptation to the property search lodging ranking algorithm

"Apache Flink® on Kubernetes"

by Ran Zhang ⸱ Airbnb ⸱ 8 min read ⸱ 31 Jul

Describes the evolution of Apache Flink at Airbnb, transitioning from Hadoop YARN to Kubernetes
Explores the challenges faced with the previous architecture and the improvements made in the current state
Gives an overview of the three phases of the architecture evolution, highlighting the benefits and challenges of each
Covers the impact of Flink on Kubernetes, including improved developer velocity, availability, and cost savings
Shares future work and improvements, such as job autoscaling and the Flink Kubernetes Operator

"Using Predictive and Gen AI to Improve Product Categorization at Walmart"

by Adnan Hassan ⸱ Walmart ⸱ 7 min read ⸱ 19 Jul

Describes the development of Ghotok, an AI technique for product categorization at Walmart.com
Covers the use of predictive and generative AI models in Ghotok to understand product hierarchies
Introduces the approach of ensemble models to combine outputs of different ML models for more precise predictions
Explains the integration of Ghotok into the backend system, including the use of a two-tier caching system

"OpenTelemetry for JavaScript Observability at Zalando"

by Mohit Karekar ⸱ Zalando ⸱ 1 min read ⸱ 29 Jul

"End-to-end test probes with Playwright"

by Jeremy Colin ⸱ Zalando ⸱ 1 min read ⸱ 19 Jul

"How to delete old data from DynamoDB without spending thousands"

by Raphael Montaud ⸱ Medium ⸱ 6 min read ⸱ 24 Jul

Describes the problem of high storage costs in DynamoDB due to old data
Explores the solution of migrating necessary data to a new table and deleting old data
Presents the tools used to estimate the costs of different cleanup scenarios
Goes through the process of handling incoming data and cleaning up existing items in the table
Shares three possible paths forward with cost estimations for each option

"Hermes: A Text-to-SQL solution at Swiggy"

by Amaresh M ⸱ Swiggy ⸱ 8 min read ⸱ 26 Jul

Hermes is a generative AI-based workflow developed by Swiggy to generate SQL queries and receive results within Slack.
Hermes V1 was a straightforward implementation using an LLM, but V2 improved data flow by incorporating a middleware and a Gen AI model.
The knowledge base consists of metadata that provides essential context about the data, helping the model generate accurate SQL queries.
Hundreds of users at Swiggy are leveraging Hermes for tasks such as obtaining sizing numbers, conducting deep dives, and answering specific questions during analysis.
The V2 iteration of Hermes performed significantly better for charters with well-defined metadata, confirming the necessity of handling each charter separately.

"Live Data Transfer: RDS to DynamoDB Made Easy"

by Ayushi Garg ⸱ Swiggy ⸱ 4 min read ⸱ 22 Jul

Describes the process of migrating data to a new DynamoDB database
Discusses the challenges of maintaining eventual consistency during the migration
Covers the strategy for handling potential ID conflicts between the old and new systems
Explores the cost and execution time considerations for the migration

"Solving the Mystery of the Slow Hash Table"

BlackRock ⸱ 9 min read ⸱ 29 Jul

Describes a case study of a slow hash table in a Java application
Uncovers the issue of hash collisions causing quadratic time complexity
Shares the troubleshooting process and benchmarking results
Introduces the fix that eliminated the hash collisions and improved runtime
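
For a feel of why colliding hashes turn a hash table quadratic, here is a small illustrative sketch. It is written in Python rather than the article's Java and is not BlackRock's code; the `Key` class and `count_eq_calls` helper are hypothetical names made up for this example. It counts equality checks instead of measuring time, so the result does not depend on the machine.

```python
class Key:
    """Dictionary key that counts how often equality is checked."""

    eq_calls = 0

    def __init__(self, value, collide):
        self.value = value
        self.collide = collide

    def __hash__(self):
        # A constant hash puts every key on the same probe chain.
        return 42 if self.collide else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def count_eq_calls(n, collide):
    """Insert n distinct keys into a dict and return the number of __eq__ calls."""
    Key.eq_calls = 0
    table = {}
    for i in range(n):
        table[Key(i, collide)] = i
    return Key.eq_calls


if __name__ == "__main__":
    # Columns: n, comparisons with distinct hashes, comparisons with colliding hashes
    for n in (200, 400, 800):
        print(n, count_eq_calls(n, collide=False), count_eq_calls(n, collide=True))
```

When every key collides, each insert has to be compared against the keys already stored, so doubling `n` roughly quadruples the number of comparisons.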

"The Engineering Behind Booking.com’s Ranking Platform | A System Overview"

Booking ⸱ 6 min read ⸱ 26 Jul

Explores the architecture of the Ranking platform, its role in personalized search results, and its position within the broader ecosystem
Gives an overview of the model creation and deployment process, including the use of static and dynamic features
Shares an expanded view of the Ranking ecosystem, highlighting the interaction between the Availability Search Engine and the Ranking platform
Describes the technical challenges of operating the ranking system at scale, including performance optimization and model inference
Covers the strategies used to address these challenges, such as fallback to static scores, multi-stage ranking, and model inference optimization

"Delivering Faster Analytics at Pinterest"

Pinterest ⸱ 5 min read ⸱ 31 Jul

Discusses Pinterest's decision to migrate from Druid to StarRocks for real-time insights
Covers why increased scale and new requirements called for a new system
Explains the requirements for the new system, including cost efficiency, SQL support, and a simpler ingestion pipeline
Introduces StarRocks as the chosen solution, highlighting its SQL support, ingestion capabilities, and performance improvements

"Flaky Tests? Check your FactoryBot IDs"

by Brooke Noonan ⸱ Gusto ⸱ 4 min read ⸱ 29 Jul

Describes how FactoryBot IDs can result in non-deterministic, flaky tests
Goes through the management of database, UUID, and foreign key IDs in FactoryBot
Explains how sequences can be used to define IDs with no constraints
Covers the potential issues with sequencing in tests and how to address them

"Maestro: Netflix’s Workflow Orchestrator"

Netflix ⸱ 17 min read ⸱ 22 Jul

Introduces Maestro, a horizontally scalable workflow orchestrator designed for managing large-scale Data/ML workflows
Discusses the journey with Maestro, including its seamless transition and ability to handle ever-growing workloads
Describes Maestro's scalability and versatility, supporting various workflow use cases and managing a large number of workflows and jobs
Explores Maestro's features, including Workflow Run Strategy, Parameters and Expression Language Support, Workflow Execution Patterns, Step Runtime and Step Parameter, Step Dependencies and Signals, Retry Policies, Aggregated View, Rollup, and Maestro Event Publishing

"Scaling Horizons: Effective Strategies for Wix’s Scaling Challenges"

by Natan Silnitsky ⸱ Wix ⸱ 8 min read ⸱ 28 Jul

Explores the challenges of scaling services and databases with vertical and horizontal scaling options
Discusses the limitations of vertical scaling and the shift towards horizontal scaling
Gives an overview of fixed and dynamic routing and sharding strategies for efficient data management
Describes Wix's scaling examples and solutions for Kafka infra services, web traffic management, Reactions app, and data locality for enterprise customers

"Defending Against LLM Attacks: Securing Integration and Mitigating Risks with 5 Essential…"

by Zeev Kalyuzhner ⸱ Wix ⸱ 3 min read ⸱ 24 Jul

Discusses strategies for securing LLM integration and mitigating risks
Acknowledges the accessibility of LLM APIs and emphasizes the need for stringent access controls
Advises caution when handling sensitive data and recommends implementing robust sanitization techniques
Highlights the limitations of relying solely on prompting to block attacks and suggests supplementing with other security measures
Encourages a multi-layered security approach, including continuous monitoring and auditing, network security measures, and data encryption to fortify defenses against various attack vectors

"Addressing Fragmented UI(s) with Micro Frontends"

by Monika Maheshwari ⸱ Flipkart ⸱ 9 min read ⸱ 24 Jul

Discusses the challenges of maintaining a cohesive UI across different applications
Describes the adoption of micro frontends to address issues of complexity, maintenance, deployment, and team autonomy
Presents the architecture and page rendering flow of Podium's micro frontend implementation
Explains the styling challenges and solutions in the micro frontend landscape
Shares the benefits of moving to micro frontends, such as reduced risk of breaking changes, decoupled codebases, independent deployments, and improved team autonomy and collaboration

"GenAI Experiments: Monitoring and Debugging Kubernetes Cluster Health"

by Lili Wan ⸱ Intuit ⸱ 5 min read ⸱ 31 Jul

Discusses challenges in observing and debugging Intuit's 325+ Kubernetes clusters
Presents the introduction of Cluster Golden Signals for improved detection using Prometheus
Describes the use of k8sgpt for deeper debugging by pulling live cluster info and analyzing with AI
Introduces the integration of GenAI for remediation using public and private models

"Meet Chrono, our scalable, consistent, metadata caching solution"

by Lihao He, Ganesh Rapolu, Yu-Wun Wang ⸱ Dropbox ⸱ 9 min read ⸱ 25 Jul

The article discusses the challenges of consistent caching and the need for a scalable, consistent caching solution.
It describes the development of Chrono, a scalable, consistent caching system built on top of the key-value storage system, Panda.
Chrono provides APIs for the write path and the read path, ensuring the consistency of cached data.
The article explains how Chrono works and provides a concrete example of its implementation.
It highlights the use of TLA+ and self-checking workloads for verifying the correctness of the caching protocol.

Thanks for reading Big Tech Digest. If you enjoyed this issue, 🔗 share it with your friends or teammates.

See you in two weeks 👋!