A dual approach to AI product development: Agile and periodic enhancement cycles
A collaboration between Tensility Venture Partners and Actuate.ai.
Wayne Boulais and Armando Pauker, Managing Directors at Tensility Venture Partners
Brian Leary, VP of Product & Operations at Actuate.ai.
For over 20 years, software development teams have embraced Agile methodologies for continuous product enhancement, delivering working software rapidly and ensuring constant improvement. Hardware development teams instead embrace periodic product enhancement, with longer development cycles focused on introducing major new features and reducing hardware costs. AI teams need both approaches. This post concludes with a case study at Actuate AI, a NYC computer vision AI company focused on physical security.
AI Development Requires New Processes
AI product development teams must think differently and holistically about their software development and DevOps processes. On the one hand, AI teams need to maintain the benefits of Agile development noted above when improving models or introducing new capabilities. On the other hand, these teams are also tasked with reducing the cost of cloud compute and storage, and may periodically switch models in production as better ones become available. This requires two different development methods running concurrently: Agile methods for continuous monitoring and enhancement of models in production, and a periodic product enhancement cycle for sustainable AI products. Both methods are required whether a given model is maintained and trained in-house or accessed through APIs (like many new Large Language Models) and fine-tuned locally.
Agile Methods for AI Product Development
Agile development involves a series of scrum sprints, typically two weeks in length, to deploy code continuously. This continuous product enhancement process is designed to reduce the complexity, cost, and disruption of creating and introducing new software into production. Agile methods enable AI development teams to closely monitor and enhance models in production. AI models are constantly tuned and retrained to correct false positives and false negatives, to extend the features in the model, to correct issues of bias and drift, and to respond to new data sets or trends. The Agile approach ensures that AI products remain up-to-date, accurate, and relevant in rapidly changing environments and reduces the risk of negatively impacting production processes.
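As a rough illustration of the monitoring side of a sprint, the sketch below turns counts of false positives and false negatives into a simple retraining signal. The `DetectionCounts` type, the `needs_retraining` helper, and the 0.9 thresholds are hypothetical and chosen only for illustration; a real team would pick metrics and thresholds that fit its own product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Outcome counts for one model over a review window (hypothetical type)."""
    true_pos: int
    false_pos: int
    false_neg: int


def precision(c: DetectionCounts) -> float:
    # Share of raised alerts that were real; 0.0 if nothing was raised.
    total = c.true_pos + c.false_pos
    return c.true_pos / total if total else 0.0


def recall(c: DetectionCounts) -> float:
    # Share of real events that were caught; 0.0 if there were no real events.
    total = c.true_pos + c.false_neg
    return c.true_pos / total if total else 0.0


def needs_retraining(c: DetectionCounts,
                     min_precision: float = 0.9,
                     min_recall: float = 0.9) -> bool:
    # Thresholds are illustrative assumptions, not values from the case study.
    return precision(c) < min_precision or recall(c) < min_recall
```

A check like this could run at the end of each sprint, with any model that falls below its thresholds queued for retraining in the next one.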
The Need for a Periodic Product Enhancement Cycle
In addition to the Agile methodology, AI products require a periodic product enhancement cycle. This cycle consists of periodic systemic changes due to adjustments in product usage, new performance needs, model upgrades, and the need to reduce operating costs.
For example, these longer redesign cycles allow AI teams to evaluate potential infrastructure changes that can lower operating costs while maintaining or improving AI engine performance. Operating costs can be lowered in numerous ways, such as changing cloud configurations or vendors, or making internal model tradeoffs to reduce retraining compute costs. These improvements can require more analysis and testing than fits in a normal two-week sprint.
This approach is similar to how hardware engineering teams develop upgrades. The AI architecture — consisting of storage, data, training, and algorithms — is embedded in the cloud, and the enhancement process requires assessing various attributes, including: storage/IO options, model version, algorithm performance, API calls, data ingest and cloud computation costs before making a significant, systemic change.
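To make the assessment concrete, here is a minimal sketch of how a team might compare the current architecture against a candidate before committing to a systemic change. The `ArchitectureOption` type, the `worth_switching` helper, and the default 10 percent savings threshold are all hypothetical; the attributes above (storage/IO, API calls, data ingest, compute) are assumed to be rolled up into a single monthly cost figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureOption:
    """One candidate configuration (hypothetical type for illustration)."""
    name: str
    monthly_cost: float  # storage + compute + API calls + ingest, rolled up
    accuracy: float      # model quality measured on a fixed evaluation set


def worth_switching(current: ArchitectureOption,
                    candidate: ArchitectureOption,
                    min_saving: float = 0.10,
                    max_accuracy_drop: float = 0.0) -> bool:
    # Relative cost saving of the candidate over the current setup.
    saving = (current.monthly_cost - candidate.monthly_cost) / current.monthly_cost
    # Positive when the candidate performs worse than the current setup.
    accuracy_drop = current.accuracy - candidate.accuracy
    # Switch only if the saving is large enough and quality does not regress
    # beyond the allowed tolerance (both thresholds are assumptions).
    return saving >= min_saving and accuracy_drop <= max_accuracy_drop
```

Requiring a minimum saving reflects the point that a systemic change carries its own analysis and testing cost, so marginal gains may not justify it.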
The product enhancement cycle is an architecture upgrade that takes months, much like introducing a new hardware revision into production, rather than the weeks of a software sprint. An example of a complex system that follows a similar upgrade cycle is the iPhone, which typically has a production upgrade cycle of about 12 months.
Experience from Actuate’s AI product development team
Actuate offers real-time AI video analytics for surveillance systems to detect threats to safety and security across a broad range of use cases and commercially available hardware. The product does not require any hardware installation. All of the compute resources are in the cloud, from video ingestion and processing to AI model training and deployment.
The focus of Actuate’s AI product development team is to provide a stable, scalable, and cost-efficient architecture where the models can be constantly retrained and deployed with minimal intervention. Software engineers and data scientists work closely together during both the continuous cycles of AI model improvement and the major architectural upgrades. Actuate operates 7–10 centralized AI models that run independently and are retrained continuously on data received from customers.
The team quickly learned that each customer’s usage is not stable throughout a day, a week, or even a month. More efficient resource scaling was needed to accommodate the variability while improving margins. Moving to a containerized system through Amazon ECS was a major overhaul that allowed horizontally scalable resources to be spun up and scaled on demand. Horizontal scalability guaranteed resource allocation for each customer, but it did not provide cost-effective resources at scale, especially given the daily changes in our customers’ throughput. Following the transition to Amazon ECS, the team identified the benefits of Kubernetes, the container orchestration tool, which allows for even better resource utilization (vertical scalability with larger resource clusters) and a higher level of automation for the retraining pipeline. The goal during each upgrade was to keep our customers running with no interruption or indication that the backend change was happening. We followed the process below, combining product enhancement cycles and Agile methodologies:
1. Stabilize current architecture and baseline cost and performance
2. Design structure to address major cost and performance drivers
3. Build and test
4. Transition meaningful percentage of production stack to new architecture
5. Baseline cost and performance
6. Improve
7. Repeat steps 4–6 until 100 percent
Steps 1–4 require a longer product enhancement cycle, similar to a hardware build. Steps 4–6 are then repeated following an Agile methodology, since each pass through them can be completed in 1–2 weeks, and the repetition ends when everything is successfully transitioned.
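The repeated part of the process (steps 4–6) can be sketched as a simple loop. This is only an illustration under stated assumptions: `shift_traffic`, `is_healthy`, and `improve` are hypothetical callables standing in for whatever tooling moves production load, compares cost and performance against the baseline, and applies fixes. The 25 percent increment and the rollback on repeated failure are also assumptions, not details from Actuate's process.

```python
from typing import Callable, List


def run_transition(shift_traffic: Callable[[int], None],
                   is_healthy: Callable[[int], bool],
                   improve: Callable[[int], None],
                   step: int = 25,
                   max_attempts: int = 3) -> List[int]:
    """Move production to the new architecture in increments.

    Returns the list of percentages that were successfully reached.
    """
    percent = 0
    reached: List[int] = []
    while percent < 100:
        target = min(100, percent + step)
        shift_traffic(target)               # step 4: transition a share of the stack
        attempts = 0
        while not is_healthy(target):       # step 5: baseline cost and performance
            if attempts == max_attempts:
                shift_traffic(percent)      # assumed safeguard: return to last good state
                raise RuntimeError(f"could not stabilize at {target} percent")
            improve(target)                 # step 6: improve, then measure again
            attempts += 1
        percent = target                    # step 7: repeat until 100 percent
        reached.append(percent)
    return reached
```

Each pass through the outer loop corresponds to one short iteration, which is why this part of the process fits an Agile cadence while the design and build that precede it do not.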
The ECS migration addressed the high-cost stream ingestion layer, but the model development process was still in need of an enhancement. Before any automation pipeline was built, lengthy manual retraining was bogging down our team. SageMaker provided the building blocks required to automate most of the training process, with the added benefit of automated deployment mechanisms that can be built to link the operating and development environments. Much like the architectural upgrades above, the SageMaker implementation followed a similar, months-long process, culminating in full integration after the transition to Kubernetes. The result is short model improvement sprints (1 week or less) in which multiple trained models are evaluated and tested simultaneously, with the most applicable one being released immediately.
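The final selection step in such a sprint might look like the sketch below. It assumes each candidate model has been reduced to a single validation score and that "most applicable" means the highest score that also beats the model currently in production; both are simplifying assumptions, and `pick_release` is a hypothetical helper rather than part of SageMaker or any other library.

```python
from typing import Dict, Optional


def pick_release(candidates: Dict[str, float],
                 production_score: float,
                 min_gain: float = 0.0) -> Optional[str]:
    """Return the name of the candidate to release, or None to keep production.

    `candidates` maps a model name to its validation score (higher is better).
    """
    if not candidates:
        return None
    # Best-scoring candidate from this sprint's training runs.
    best = max(candidates, key=lambda name: candidates[name])
    # Release only if it beats production by more than the required margin.
    if candidates[best] > production_score + min_gain:
        return best
    return None
```

Returning `None` when no candidate beats production keeps a weak sprint from degrading the deployed model.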
AI product development teams can benefit from a dual approach that combines Agile methods for continuous model monitoring and enhancements with a periodic product upgrade cycle for sustainable AI products. This combination ensures that AI products remain up-to-date, accurate, and cost-effective while delivering the best possible performance to users. As the AI landscape continues to evolve, embracing these approaches will be crucial for creating AI products that thrive in the market.