Optimizing for Speed: The Pursuit of Instantaneous AI Recommendations at CarWiz

Exploring Azure's capabilities to deliver real-time car recommendations, CarWiz is charting a course toward near-zero latency in the world of AI-driven advice.

Posted by Alfred Prah on January 02, 2024 · 5 mins read
In an era where seconds can dictate the success of a user's digital experience, the pursuit of minimal latency is more than just a technical challenge—it's a critical business imperative. At CarWiz, as the Co-founder & CTO, I am exploring a multitude of strategies to ensure that our car recommendation platform — currently fully hosted on Microsoft Azure — delivers real-time responses to user queries. Here's an insight into our roadmap for latency optimization.

The Importance of Instantaneity

Users come to CarWiz seeking prompt guidance in navigating the sea of automotive options. They input their preferences and expect instant Wiz-like recommendations that resonate with their needs. We are aware that the efficacy of our AI-driven advice is measured not only by its relevance but also by the swiftness of its delivery.

Latency Optimization Techniques Under Consideration

Model Pruning: Model pruning is a technique I am exploring to streamline our machine learning models on Azure ML. My approach so far has been inspired by the paper Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (Han, Mao, & Dally, 2015), which describes simplifying machine learning models by removing the parts that contribute little to their output (such as weights close to zero in neural networks). This can result in faster inference times and smaller models, which is beneficial for deployment in resource-constrained environments like mobile devices or edge computing scenarios.
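To make this concrete, here is a minimal sketch of magnitude-based pruning using PyTorch's torch.nn.utils.prune utilities. The toy two-layer model and the 30% pruning amount are illustrative assumptions for this post, not our production configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a recommendation scorer (illustrative only).
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask in permanently

# Sanity check: roughly 30% of the weight entries should now be exactly zero.
weights = [p for p in model.parameters() if p.dim() > 1]
total = sum(p.numel() for p in weights)
zeros = sum((p == 0).sum().item() for p in weights)
print(f"sparsity: {zeros / total:.2%}")
```

One caveat worth noting: zeroed weights alone don't speed up a dense matrix multiply; the latency wins come from sparse-aware runtimes or from structured pruning that actually shrinks layer dimensions.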

Quantization: Quantization involves converting a model from floating-point numbers to lower-precision representations, such as 8-bit integers. This can greatly reduce the computational resources required to run the model, as integer operations are faster than floating-point ones. This strategy has also been inspired by Han, Mao & Dally's paper referenced above, as well as the paper Compressing Deep Convolutional Networks using Vector Quantization (Gong et al., 2014). From preliminary results and the papers' analyses, I expect a drop in accuracy from this strategy, but I don't foresee it being noticeable given the current size of our user base and how early we are in our ML & AI journey.
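As a sketch of what this could look like, PyTorch's dynamic quantization converts the weights of selected layer types to int8 with no retraining. The toy model below is an assumption for illustration, mirroring the one in the pruning example.

```python
import torch
import torch.nn as nn

# Toy stand-in for a recommendation scorer (illustrative only).
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)
model.eval()

# Dynamic quantization: Linear weights are stored as int8, and activations
# are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
print(quantized(x))  # same interface, lower-precision arithmetic inside
```

Dynamic quantization is the gentlest entry point; static quantization and quantization-aware training can recover more speed and accuracy, but they require calibration data or retraining.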

Data Representation: In the technological architecture of CarWiz, we've made a strategic decision to employ Azure SQL for structured data and tables, while utilizing Azure Blob Storage for our image data. This approach is grounded in optimizing performance, cost, and scalability by aligning the type of storage to the nature of the data.
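To illustrate how the two stores work together, here is a hedged sketch using the azure-storage-blob and pyodbc libraries. The container name, table schema, and connection strings are placeholders, not our actual setup.

```python
import pyodbc
from azure.storage.blob import BlobServiceClient

# Placeholder secrets; real values would come from app configuration or Key Vault.
BLOB_CONN = "<azure-storage-connection-string>"
SQL_CONN = "<azure-sql-odbc-connection-string>"

def save_listing(car_id: str, specs: dict, image_path: str) -> None:
    # 1) Put the image bytes in Blob Storage, keyed by the car's ID.
    blob_service = BlobServiceClient.from_connection_string(BLOB_CONN)
    blob = blob_service.get_blob_client(container="car-images", blob=f"{car_id}.jpg")
    with open(image_path, "rb") as f:
        blob.upload_blob(f, overwrite=True)

    # 2) Keep the structured attributes, plus a pointer to the blob, in Azure SQL.
    with pyodbc.connect(SQL_CONN) as conn:
        conn.execute(
            "INSERT INTO Cars (CarId, Make, Model, Price, ImageUrl) "
            "VALUES (?, ?, ?, ?, ?)",
            car_id, specs["make"], specs["model"], specs["price"], blob.url,
        )
        conn.commit()
```

The design choice is simply to let each store do what it's good at: the database holds small, queryable rows, while large binary objects live in blob storage and are referenced by URL.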

Caching Strategies: Intelligent caching mechanisms, such as Azure Cache for Redis, are being designed to provide near-instant responses for common queries. In practice, this means storing frequently accessed data in memory for quick retrieval. It is particularly useful for a recommendation system like ours, where many users submit similar queries, and I foresee it ultimately sparing the model from computing the same recommendation multiple times. However, given the current stage of our AI & ML journey, more time is needed to determine which queries are most popular among our users.
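Here is a minimal cache-aside sketch against Azure Cache for Redis using the standard redis-py client. The host name, key scheme, one-hour TTL, and the compute_recommendations helper are all illustrative assumptions.

```python
import json
import redis

# Azure Cache for Redis speaks the standard Redis protocol over TLS (port 6380).
cache = redis.Redis(
    host="<your-cache>.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

def recommend(user_prefs: dict) -> list:
    # Derive a stable cache key from the normalized query.
    key = "rec:" + json.dumps(user_prefs, sort_keys=True)

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip model inference entirely

    # Cache miss: run the model, then keep the result warm for an hour.
    results = compute_recommendations(user_prefs)  # hypothetical model call
    cache.setex(key, 3600, json.dumps(results))
    return results
```

The TTL matters: too short and the cache rarely helps; too long and recommendations can go stale as inventory changes.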

Load Balancing: Even though we haven't experienced excessive demand on our servers yet, load balancing is a consideration I have on my radar for the coming months. Under our current subscription plans and server choices, most of our resources should scale as demand increases. However, depending on how costly those choices become, we might need to look into distributing workloads across multiple computing resources, so that if we ever downsize, no single resource becomes a bottleneck in a user's end-to-end experience. In the context of Azure, this can be managed automatically by services like Azure Load Balancer, which distributes incoming traffic across multiple targets to ensure high availability and reliability.
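Azure Load Balancer works at the network layer, so there is no application code to write, but the underlying idea is easy to illustrate. The sketch below is a hypothetical client-side round-robin over backend addresses, not how Azure Load Balancer is actually configured.

```python
import itertools

# Hypothetical backend instances serving the recommendation API.
BACKENDS = [
    "http://10.0.0.4:8000",
    "http://10.0.0.5:8000",
    "http://10.0.0.6:8000",
]

# Round-robin: each request goes to the next backend in turn,
# so no single instance absorbs all the traffic.
_rotation = itertools.cycle(BACKENDS)

def pick_backend() -> str:
    return next(_rotation)

for _ in range(5):
    print(pick_backend())  # cycles through the three backends and wraps around
```

A managed load balancer adds the pieces this sketch omits: health probes that drop unhealthy backends, and distribution rules that don't depend on well-behaved clients.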

Measuring the Impact

The strategies under consideration have the potential not only to enhance the user experience but also to foster trust in our platform's ability to deliver timely, relevant advice. To know whether they work, we plan to track end-to-end response times for the recommendation path before and after each change, alongside engagement metrics. By reducing latency, we aim to increase user engagement and ensure that CarWiz becomes the go-to platform for personalized car recommendations.
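One lightweight way to start is to time the recommendation path itself. The decorator below is a generic illustration, not instrumentation we already have in place.

```python
import functools
import time

def timed(fn):
    """Print the wall-clock latency of each call, in milliseconds."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{fn.__name__} took {elapsed_ms:.1f} ms")
    return wrapper

@timed
def recommend(user_prefs):
    time.sleep(0.05)  # stand-in for model inference
    return ["reliable sedan", "fuel-efficient hatchback"]

recommend({"budget": 25000})
```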

Continuous Evolution

The roadmap for latency reduction at CarWiz is an evolving document, shaped by the latest in technological advancements and user feedback. We remain vigilant and ready to integrate new methods that promise to bring speed gains to our platform's recommendations.

In conclusion, CarWiz is committed to delivering a real-time, data-driven car recommendation experience that caters to the modern user's expectation of speed. Stay tuned as we continue to innovate at the crossroads of AI and user experience.