Technical

Test Driven Development of LLM Integrations

Test-Driven Development (TDD) is a software development practice that emphasizes writing tests before writing code. In traditional SWE (Software Engineering), TDD involves writing test cases that cover various aspects of the system being developed, such as input/output validation, edge cases, and functionality. The goal is for the code to pass these tests, which in turn gives confidence that the software works as intended.

When it comes to deploying Large Language Models (LLMs) in production, several challenges arise. One of the main issues is dependence on third-party APIs for the models themselves, which change outside the developer's control and can be difficult to keep integrated and maintained against. Additionally, the output structure of LLMs can be non-deterministic, making it challenging to predict the exact outcome of an API call. This non-determinism can lead to unexpected behavior in the software system, which can be difficult to debug and fix. There are also complex workflows that use LLMs as a key component, such as RAG-based systems, which have multiple points of failure; testing them systematically is therefore critical to identify and debug failing components.

To overcome these challenges, TDD can be a valuable tool for developing software with LLMs. By first writing basic tests for LLM API calls, output structure, or the atomic components of complicated workflows, developers can ensure that their code is robust and reliable.

Testing Strategies for Third-Party APIs and LLM Outputs

When testing third-party LLM APIs, keep it simple. The tests should focus on basic sanity: can the code reach the API, does it respond fast enough, and does it return data in the right format? These tests check that the integration with the API works, not the accuracy of the LLM's answers. One also has to ensure the integration handles API errors well and responds gracefully to unexpected and malformed inputs.
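
To make this concrete, here is a minimal pytest sketch of such a sanity check, assuming the OpenAI Python SDK (v1+) with an API key in the environment; the model name, latency budget, and test marker are illustrative choices rather than prescriptions.

```python
# test_llm_api_sanity.py -- basic reachability / latency / shape checks.
import time

import pytest
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@pytest.mark.integration  # mark these paid calls so they can be excluded from every run
def test_api_responds_in_time_and_shape():
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Reply with the single word: pong"}],
        max_tokens=5,  # keep the call cheap
    )
    elapsed = time.time() - start

    # Sanity, not accuracy: the call returned, fast enough, in the expected format.
    assert elapsed < 10
    assert response.choices, "expected at least one completion choice"
    assert isinstance(response.choices[0].message.content, str)
```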

When integrating language models into software, it is equally important to consider what happens after the model responds. Often, the raw response needs cleaning up or enriching before being shown to users. This "post-processing" can get complicated. To make sure it works well, it's best to test the post-processing separately from actually calling the language model. Developers can do this with "mocking".

Mocking in testing refers to the process of creating artificial or placeholder objects or services that mimic the behavior of real objects or services in a system under test. Mocking in this context would mean creating fake language model outputs that act like the real thing. This will allow testing the post-processing code and its handling of different types of responses, without relying on the real model. The model's responses might change over time as it's updated. Mocking ensures the post-processing code works no matter what responses it gets. It's a good way for developers to thoroughly check their code independently of how the language model acts. This helps ensure the whole system stays reliable.
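
Here is a minimal sketch of the idea using Python's built-in unittest.mock; the module name `app` and the functions `call_llm` and `generate_summary` are hypothetical stand-ins for your own code.

```python
# test_postprocessing.py -- exercise post-processing without calling the real model.
from unittest.mock import patch

import app  # hypothetical module containing call_llm() and generate_summary()


def test_summary_postprocessing_handles_extra_whitespace():
    fake_llm_output = "  - point one\n\n  - point two  \n"
    with patch.object(app, "call_llm", return_value=fake_llm_output):
        result = app.generate_summary("some user input")
    # The post-processor should trim whitespace and return clean bullet points,
    # regardless of how the real model formats its answer.
    assert result == ["point one", "point two"]


def test_summary_postprocessing_handles_empty_response():
    with patch.object(app, "call_llm", return_value=""):
        result = app.generate_summary("some user input")
    assert result == []  # graceful handling of a degenerate response
```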

In-Depth Testing of Complex LLM Workflows: RAG case-study

Testing complex LLM workflows, such as those involving RAG systems, requires a meticulous and systematic approach due to the multiple interacting components. Retrieval-Augmented Generation (RAG) is a technique that enhances the accuracy and reliability of generative AI models by incorporating external knowledge sources. It improves the quality of Large Language Model (LLM)-generated responses by grounding the model on external information such as relational databases, unstructured document repositories, internet data streams, media newsfeeds, audio transcripts, and transaction logs. This external knowledge is appended to the user's prompt and passed to the language model, allowing the model to synthesize more accurate and contextually relevant responses. This poses unique testing challenges. A blueprint for applying TDD principles to such a system is a good exercise to walk through next.

1. Testing Data Retrieval: Begin by focusing on the retrieval component of the RAG system. This step involves ensuring that the system accurately queries the correct dataset or knowledge base and retrieves relevant information. Tests should verify not only the success of API calls but also the relevance and accuracy of the retrieved data. This can be done by comparing the retrieved information against a set of predefined criteria or benchmarks to ensure its appropriateness for the given prompt. Classic well-established ranking problems can serve as valuable reference points in this regard.

2. Mocking for Post-Processing Tests: Once retrieval is verified, shift focus to the generation component. Here, mocking becomes crucial. Simulate, i.e., mock, the retrieval output and feed it into the generation component to test how it processes this input to produce coherent and contextually appropriate responses. This phase tests the model's ability to synthesize and augment the retrieved information into a meaningful response, considering factors like coherence, relevance, and alignment with the input prompt. This can be trickier to test, since the generated artifact is free-flowing natural language text. Metrics such as similarity measures can be turned into assertions that must stay above a chosen threshold for the test to pass (see the sketch after this list).

3. Scenario-Based Testing: Implement scenario-based testing to cover a wide range of inputs and contexts. This includes testing with valid, invalid, and edge-case prompts, as well as prompts requiring complex retrieval queries. Observe how the system handles each scenario, focusing on both the accuracy of the retrieval and the relevance and quality of the generated response.
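
Below is a minimal sketch of how steps 2 and 3 might look in practice, assuming a hypothetical `rag_app` module with `retrieve()` and `generate_answer()` functions; the lexical similarity used here is a crude stand-in for an embedding-based metric, and the 0.5 threshold is purely illustrative.

```python
# test_rag_generation.py -- mock retrieval, then assert the generation is "close enough".
from difflib import SequenceMatcher
from unittest.mock import patch

import rag_app  # hypothetical module wrapping the RAG pipeline


def similarity(a: str, b: str) -> float:
    # Crude lexical stand-in; an embedding cosine similarity is more robust in practice.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def test_generation_grounded_in_mocked_retrieval():
    passages = ["The warranty period for the X100 model is 24 months."]
    expected = "The X100 comes with a 24 month warranty."

    # Mock retrieval so only the generation + post-processing path is exercised.
    with patch.object(rag_app, "retrieve", return_value=passages):
        answer = rag_app.generate_answer("How long is the X100 warranty?")

    # Free-form text cannot be compared exactly; assert it is close to the expectation.
    assert similarity(answer, expected) > 0.5  # threshold is an illustrative choice


def test_generation_handles_empty_retrieval():
    # Edge-case scenario: nothing relevant was retrieved.
    with patch.object(rag_app, "retrieve", return_value=[]):
        answer = rag_app.generate_answer("How long is the X100 warranty?")
    assert "don't know" in answer.lower() or "no information" in answer.lower()
```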

Cost-Effective Testing and Multi-Provider Compatibility

It is also important to consider the costs of testing LLM-integrated workflows, given the expense of each API call. Every test run costs money, so it's critical to be strategic about spending without sacrificing thoroughness. One practical approach is to design tests that pass with simple prompts and short expected outputs, reducing the cost per API call while still ensuring core functions work. It is easy to forget that every call adds to the bill and that both input and output tokens are charged!
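
As a rough illustration, a back-of-the-envelope pre-flight estimate of a test run's cost might look like the sketch below; the per-token prices are placeholders (check your provider's current pricing), and tiktoken's cl100k_base encoding is assumed only for illustration.

```python
# estimate_test_cost.py -- rough pre-flight estimate of what a test suite run will cost.
import tiktoken

PRICE_PER_1K_INPUT = 0.0005   # placeholder USD price per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.0015  # placeholder USD price per 1K output tokens

enc = tiktoken.get_encoding("cl100k_base")


def estimated_cost(prompt: str, max_output_tokens: int) -> float:
    input_tokens = len(enc.encode(prompt))
    # Both directions are billed: prompt tokens in, completion tokens out.
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (max_output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )


test_prompts = ["Reply with the single word: pong", "Summarise: the cat sat on the mat."]
total = sum(estimated_cost(p, max_output_tokens=20) for p in test_prompts)
print(f"Estimated cost of one test run: ${total:.6f}")
```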

Furthermore, the testing framework needs to be adaptable to each LLM provider's unique characteristics. For example, Anthropic's Claude works better when it uses XML tags in its responses, requiring different tests on model outputs and post-processing logic than an OpenAI model that thrives on JSON-based output. Recognizing these differences is important for precise and useful testing. Developing tests that can validate each provider's expected response structure increases the system's flexibility and robustness. Relying on a single LLM provider is risky due to potential outages or performance issues, so it's smart to design test plans that involve multiple providers. This not only ensures reliability but also shows the system can switch between providers when needed.
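
One lightweight way to express this is a single parametrized test that validates each provider's expected output shape; the canned responses and parsers below are illustrative, and real responses would come from mocked (or explicitly marked, paid) calls to each provider.

```python
# test_multi_provider.py -- one parametrized test per provider's output structure.
import json
import re

import pytest


def parse_json_answer(raw: str) -> str:
    return json.loads(raw)["answer"]  # e.g. an OpenAI-style JSON payload


def parse_xml_answer(raw: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL)
    assert match, "expected an <answer> tag in the response"
    return match.group(1).strip()  # e.g. a Claude-style XML-tagged payload


@pytest.mark.parametrize(
    "provider, raw_response, parser",
    [
        ("openai", '{"answer": "42"}', parse_json_answer),
        ("anthropic", "<answer>42</answer>", parse_xml_answer),
    ],
)
def test_provider_specific_output_structure(provider, raw_response, parser):
    # Each provider gets its own parser, but the downstream contract is identical.
    assert parser(raw_response) == "42"
```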

Wrapping up, applying Test-Driven Development to Large Language Models presents unique challenges and opportunities. TDD takes on new dimensions when applied to LLMs' unpredictable nature, and testing becomes crucial for ensuring reliability and functionality. Cost considerations are paramount given each API call's monetary impact; smart testing prioritizes simple, cost-effective yet comprehensive prompts. This approach ensures functionality while mitigating financial impact, making development efficient and economical. Tests must also be flexible enough to work across different LLM providers like Anthropic's Claude and OpenAI's GPT models. This adaptability is critical operationally and strategically in a competitive market. As LLMs evolve, specific TDD practices will need to evolve with them, ensuring software remains reliable, functional, and ahead of the curve. The intersection of TDD principles with LLMs' capabilities opens new horizons, promising a future where AI and human ingenuity work together to create robust, dynamic, and intelligent systems.


Structured Intelligence: Enforcing Typed Discipline to Integrate Foundation Models in Modern Codebases

In the realm of software engineering, the evolution of typed languages has been a testament to the industry's quest for reliability and efficiency. Historically, statically typed languages like Java and C# have been lauded for catching errors at compile time, leading to robust and maintainable code. On the other hand, JavaScript, the ubiquitous language of the web, and Python, the de-facto language for machine learning, deep learning, and data science, have long been dynamically typed, making them accessible but prone to runtime errors. The success of TypeScript, a statically typed superset of JavaScript that lets developers write safer code, reflects the industry's broader recognition that the benefits of static typing, namely early error detection, better tooling, and more predictable code, are crucial for scaling and maintaining large and complex codebases in the modern development landscape.

Meanwhile, as software complexity escalates, the integration of Large Language Models (LLMs) like OpenAI's GPT series into codebases is transforming the landscape of intelligent decision-making within applications. The motivation for embedding LLMs into software logic stems from a desire to automate and enhance capabilities across several domains. Integrating machine learning into a product's core logic is no longer as difficult as it used to be, with general-purpose API calls to LLMs replacing in-house models. Features that once necessitated entire ML teams and infrastructure can now be prototyped and even deployed with API calls to these language models. For instance, to recommend items to users, an organization would have needed to hire an ML team, set up infrastructure, obtain user data, train a model, and then put it up for inference. Now, they can get a first version working with a little prompt engineering and a single API call.

However, the challenge in integrating these intelligent systems lies in their reliability and the structure of their outputs. For a business relying on LLMs for customer interactions, erroneous or unpredictable responses could damage customer relations or lead to misinformation. Similarly, in content creation or code generation, the outputs must not only be syntactically correct but also contextually accurate and logically sound. In the dummy recommendation example above, if the output list is not formatted correctly or contains elements that are not of the expected type, the downstream code serving these recommendations would break.

To harness the full potential of LLMs, their integration must therefore be approached with mechanisms that enforce structured and reliable output. This not only ensures consistency in automated responses but also eases the parsing and handling of these outputs by other components of the software. The push towards such reliability is driving innovations in prompt engineering, fine-tuning of models on specific domains, and post-processing of LLM outputs to fit them into the stringent requirements of a structured software system.

Most notably, this is now being achieved by combining two worlds: libraries that help enforce type checking in Python code, and OpenAI's function-calling ability, which allows the language model to call available functions in code as and when it deems fit. To understand this marriage better, here is some context about one such type-enforcing library in Python - Pydantic.

As referenced before, Python is inherently a dynamically typed language, where the type of a variable is determined at runtime and can change as the program executes. This flexibility allows for rapid development and prototyping but can sometimes lead to type-related bugs that are hard to detect without thorough testing. To bridge the gap between the dynamic nature of Python and the benefits of static type checking, tools like Pydantic have emerged. Pydantic is a data validation and settings management library that uses Python type annotations to validate that the given input data matches the expected types, bringing a new level of safety and reliability to Python code. By defining data models with type annotations, developers can enforce type constraints at runtime, ensuring that data structures conform to a specified schema before the application acts upon them. This enforcement of type constraints not only helps in catching errors early but also improves code clarity, maintainability, and can even aid in defining communication interfaces for services, such as in the creation of APIs, where clear and predictable data structures are paramount.
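
As a small illustration, here is what such runtime validation might look like with Pydantic v2; the `Recommendation` model and the raw output string are invented for the example.

```python
# Validate an LLM's raw output against a strict schema before the rest of the
# system touches it.
from pydantic import BaseModel, Field, ValidationError


class Recommendation(BaseModel):
    item_id: int
    title: str
    score: float = Field(ge=0.0, le=1.0)  # confidence must look like a probability


raw_llm_output = '{"item_id": 17, "title": "Wireless Mouse", "score": 0.83}'

try:
    rec = Recommendation.model_validate_json(raw_llm_output)
    print(rec.item_id, rec.score)
except ValidationError as err:
    # Malformed or mistyped output is caught here instead of breaking downstream code.
    print("LLM output rejected:", err)
```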

This fits in seamlessly with one of the more powerful features of OpenAI's suite of ever-growing tools - its ability to perform function calls within the language model's responses. This enables the model to execute predefined functions during a conversation or text generation task, effectively allowing users to integrate more complex logic and structured data handling directly into the language model's output. Function calling in the context of LLMs is akin to giving the model a toolbox. The model can reach for the right tool — in this case, a function — to accomplish a task with precision and structure. This capability opens the door to outputs that are not just text but also actions and structured data formats.

The use of function calls allows for a standardized approach to formatting and structuring outputs. Functions can range from simple formatting tasks, like turning a blob of text into a JSON object, to more complex data manipulation, such as extracting and organizing information from a paragraph into a spreadsheet format. Or in this context, calling a function that returns objects of Pydantic classes with strictly enforced attributes and their types.
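
Here is a sketch of how the two pieces could fit together, assuming the OpenAI Python SDK (v1+) and Pydantic v2; the model name, function name, and schema are illustrative rather than a prescribed recipe.

```python
# Ask the model to "call" a function whose parameters are a Pydantic schema,
# then validate the arguments it produced.
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()


class Recommendation(BaseModel):  # same illustrative schema as the previous sketch
    item_id: int
    title: str
    score: float = Field(ge=0.0, le=1.0)


tools = [{
    "type": "function",
    "function": {
        "name": "record_recommendation",
        "description": "Return one product recommendation for the user.",
        "parameters": Recommendation.model_json_schema(),  # Pydantic emits JSON Schema
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Recommend an accessory for a laptop buyer."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_recommendation"}},
)

call = response.choices[0].message.tool_calls[0]
# The arguments arrive as a JSON string shaped by our schema; Pydantic enforces the types.
rec = Recommendation.model_validate_json(call.function.arguments)
print(rec)
```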

Feature Flagging for Model Deployments

What are Feature Flags? 🏴🏴

Feature flags are a powerful tool in software engineering that allow developers to manage and control the release of new features or functionality within an application without altering the underlying code. This approach enables faster development cycles, easier testing, and a better user experience by allowing developers to work on new features independently of the main codebase and test them in isolation before integration. By separating features from the core application functionality, developers can enable or disable features without affecting the stability and functionality of the entire application, making it easier to debug and deploy new features. This also enables A/B testing, allowing developers to offer different features to different users based on their preferences or account status, resulting in a more tailored user experience.

Flag your LLM to tinker, reduce cost & have peace of mind🦾

From an ML engineering perspective, developers and product teams can use feature flagging to deploy large language models (LLMs) more effectively. By using flags, they can test out different language models or vendors without affecting the main codebase, allowing them to compare performance and user feedback across different populations. This gradual rollout approach also enables teams to continue prompt-tuning their LLMs as they gather more data, improving accuracy over time. Additionally, feature flags can help reduce costs by limiting the number of calls made to external language models, allowing teams to test multiple prompts without incurring unnecessary costs. To take advantage of these benefits, developers can hook up monitoring tools like LangSmith to track performance and make data-driven decisions about which language models to deploy. By leveraging feature flagging in this way, teams can ensure that their LLMs are performing optimally for their users while also having the choice to switch them off at the flip of a button without any code changes (if OpenAI suddenly starts routing calls to some other low-cost GPT 🙂).
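
A provider-agnostic sketch of the idea is shown below; the flag store is a plain dictionary standing in for a real flag provider such as OpenFeature or LaunchDarkly, and the provider wrappers are stubs.

```python
# Route LLM traffic through feature flags so a model (or the whole feature) can be
# switched without a deploy.
FLAGS = {
    "llm_enabled": True,         # kill switch for the whole LLM feature
    "llm_provider": "openai",    # "openai" | "anthropic" | ...
    "llm_model": "gpt-4o-mini",  # rolled out gradually per user segment
}


def get_flag(name: str, default=None):
    # In production this would be a call to the flag provider's SDK, evaluated per user.
    return FLAGS.get(name, default)


def call_openai(model: str, question: str) -> str:
    return f"[openai:{model}] answer to {question!r}"  # stub; wrap the real SDK here


def call_anthropic(model: str, question: str) -> str:
    return f"[anthropic:{model}] answer to {question!r}"  # stub; wrap the real SDK here


def answer(question: str) -> str:
    if not get_flag("llm_enabled", False):
        return "This feature is temporarily unavailable."  # graceful kill switch

    provider = get_flag("llm_provider", "openai")
    model = get_flag("llm_model", "gpt-4o-mini")
    if provider == "openai":
        return call_openai(model, question)
    elif provider == "anthropic":
        return call_anthropic(model, question)
    raise ValueError(f"Unknown provider flag: {provider}")


print(answer("What is feature flagging?"))
```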

Red Flags to Consider 🚩🚩

As a cautionary note, while feature flags can offer many benefits, such as easier testing, faster development cycles, and greater flexibility, it's essential to use them judiciously to avoid introducing new challenges into the development process. One potential pitfall is the temptation to use feature flags for everything, leading to a tangled web of interdependent flags that can make maintenance and debugging difficult. Flags are akin to global variables: convenient and simple, but easy to abuse into unreadable spaghetti code. To avoid this, it's crucial to maintain a clear separation of concerns between features, using flags sparingly and consistently throughout the application. This includes establishing a standardized naming convention, documenting flag usage, and ensuring that flags are properly version-controlled.


Resources

Open Feature - Open-source tool that allows you to implement feature flagging using any provider of your choice.

Launch Darkly - A feature flag management platform that you can tie in with Open Feature to get full-blown, end-to-end feature flagging capabilities.

Loosely, Open Feature : Git :: Launch Darkly/Feat-bit : GitHub.


Navigating the Maze: Choosing the Right Large Language Model for Your Team/Organization

In today's rapidly evolving landscape of Generative AI, the selection of the perfect Large Language Model (LLM) for your organization is far from a trivial task. With a multitude of robust LLMs available within the AI community, the decision-making process can become an intricate maze. However, this challenge can be tackled systematically by considering a range of crucial factors. Through this exercise, we want to explore the key considerations organizations should take into account when evaluating and selecting the ideal LLM for their specific needs.


1. Performance on the Task at Hand


One of the primary factors to assess when choosing an LLM is its performance on the specific task you intend to use it for. Not all LLMs are created equal, and their strengths and weaknesses may vary depending on the task at hand. For instance, a language model that excels in question-answering tasks may not perform as effectively in simple classification or summarization tasks. Therefore, it's essential to evaluate an LLM's performance in the context of your intended application.


2. Cost Considerations


Cost is a significant consideration when selecting an LLM. Different models come with varying price points, often determined by factors such as model size, capabilities, and provider. For example, comparing the costs of GPT-3 and GPT-4, or evaluating the cost-effectiveness of models like PaLM and Claude, should be part of your decision-making process. It's essential to weigh the cost against the expected performance gains to determine if the investment is justified.


3. Context Length Capabilities


The context length an LLM can handle is another crucial aspect to consider. Some applications, such as information retrieval using models like RAG (Retrieval-Augmented Generation), require the ability to handle a large context window. It's important to assess whether the LLM you're considering provides the necessary context length capabilities for your specific use case. Additionally, consider whether your application truly requires an extensive context window or if a smaller one would suffice.
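
As a quick illustration, a pre-flight check like the sketch below can tell you whether a prompt plus retrieved context would fit a candidate model's window; the tokenizer choice and the 8,000-token limit are placeholders for whatever model you are actually evaluating.

```python
# Check whether a prompt plus retrieved context fits a candidate model's context window.
import tiktoken

CONTEXT_LIMIT = 8_000  # placeholder; use the published limit of the model under evaluation
enc = tiktoken.get_encoding("cl100k_base")  # illustrative tokenizer choice


def fits_in_context(prompt: str, retrieved_chunks: list[str], max_output_tokens: int) -> bool:
    used = len(enc.encode(prompt)) + sum(len(enc.encode(c)) for c in retrieved_chunks)
    # Leave room for the model's own answer as well.
    return used + max_output_tokens <= CONTEXT_LIMIT


print(fits_in_context("Summarise the following documents:", ["doc one ...", "doc two ..."], 512))
```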


4. Integration with Existing Tools and Libraries


Efficient integration with existing open-source libraries and tools is vital for seamless implementation of an LLM. Ensure that the LLM you choose is compatible with commonly used libraries like langchain or other tools essential to your organization's workflow. A smooth integration process can save valuable time and resources.


5. Ease of Fine-Tuning


The ability to fine-tune an LLM to adapt it to your organization's specific needs is a valuable asset. Evaluate whether the LLM offers support for easily ingesting your data and training on it. The ease of fine-tuning can significantly impact the model's performance in your unique context, if that is something your use-case requires.


6. Privacy Concerns


Privacy is a paramount consideration when dealing with LLMs. Determine whether the LLM provider uses your data for training purposes and if they adhere to strict privacy standards. If your organization intends to fine-tune the model with proprietary data, ensure that the process maintains the confidentiality of sensitive information.


In conclusion, there is no one-size-fits-all solution when it comes to selecting the ideal Large Language Model for your organization. With the diverse array of LLMs available, having concrete use cases and employing well-defined criteria for decision-making is more critical than ever. By carefully assessing performance, cost, context length capabilities, integration, ease of fine-tuning, and privacy considerations, organizations can navigate the complex landscape of LLMs and make informed choices that align with their unique requirements. In the new era of Generative Models, strategic and decisive decision-making seems to be the key to unlocking the full potential of these powerful tools.

Adapting the Six-Stage ML System Design for LLMs (Part 1) 

Lately, I've been engrossed in Chip Huyen's insightful book, Designing Machine Learning Systems. It breaks down an end-to-end ML system into six key phases: Project Scoping, Data Engineering, ML Model Development, Deployment, Monitoring, and Business Analysis. This really got me thinking about how these stages are evolving, now with LLMs having started to take center stage. As I make my way through the book, I've been drawing connections between the established stages and the unique considerations that arise when utilizing LLMs as the core model type. These mappings shed light on how LLMs reshape our perspective and introduce a fresh set of challenges. I'd like to share my insights so far, focusing on the first three stages in this newer paradigm.

First, we have Project Scoping which entails defining the goals and objectives of the machine learning project. This involves understanding the problem domain, identifying the business requirements, and specifying the metrics for success. I feel that it's important to balance the excitement surrounding LLMs with a healthy dose of realism. It's easy to become overly optimistic about their capabilities, given the hype surrounding them. Setting realistic goals and expectations seems even more critical now than it used to be in a general ML/DL setting.

Moving on to Data Engineering, which involves collecting, cleaning, and preparing the data for training and evaluation. This phase seems to have become more complex and tricky in the LLM paradigm due to the extremely black-box nature of the downstream algorithm. While we no longer use the data directly for training, ensuring its cleanliness and proper formatting, and avoiding leakage from prompts into test sets, present unique challenges. Moreover, without explicit features, it's harder to assess the impact of data manipulation on the model's performance. Some findings even suggest that aggressively cleaning up your text can lead to worse performance!

The ML Model Development stage used to be the fun, expensive, tinker-heavy stage where the actual model architecture and training process were designed. This is where LLMs truly shine. Gone are the days of laborious feature engineering and time-consuming model training. LLMs allow us to leverage few-shot learning or retrieval-augmented generation, eliminating the need for extensive training on large datasets. While more involved techniques like partial fine-tuning or complete re-training are still applicable, LLMs can produce useful zero-shot inference, or few-shot inference with just a handful of examples embedded in the prompt, enabling us to get started quickly.
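
For instance, a few-shot prompt along these lines can stand in for what previously required a trained classifier; this sketch assumes the OpenAI Python SDK (v1+), and the model name and examples are illustrative.

```python
# A few labelled examples embedded in the prompt replace a purpose-built sentiment model.
from openai import OpenAI

client = OpenAI()

few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day, love it." -> positive
Review: "Stopped working after a week." -> negative
Review: "Setup was painless and the screen is gorgeous." ->"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": few_shot_prompt}],
    max_tokens=3,
)
print(response.choices[0].message.content.strip())  # expected: "positive"
```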

By exploring these first three stages, I am intrigued by how the shift towards LLMs transforms the traditional ML/DL system while still retaining the core goals and approaches. This also speaks to the fact that the book itself is so well written and focuses on generalized principles in a way that has already stood the test of time. In my next update, I'll dive into the Deployment, Monitoring, and Business Analysis phases, sharing my observations and further discussing the evolving impact of LLMs in these areas. And of course, I highly recommend the book!

The subtle art of ChatGPT failure

Is it an AGI? Is it SkyNet? No, it is a large language model. ChatGPT is a language model, optimised for dialogue and (for the first time?) equipped with a minimal, beautiful and responsive interface for interaction. There are warnings as soon as you start using it: it is in its early stages, it can make mistakes, and it is NOT a lookup on the web (and no, it will not replace Google, at least not right now).

I put it to a very specific task: to help me write the k-nearest neighbours algorithm in PyTorch. I chose this because I had been thinking of attempting the task myself for a while; one of my friends had it as an interview question! So, here I was, asking a language model probably trained on plenty of PyTorch code to give me PyTorch code for a simple KNN algorithm. It did not disappoint. It gave me a very neat-looking code snippet, with very clear comments and the entire end-to-end algorithm, right from creating dummy synthetic data to computing the distances cleanly and then getting the predictions. I was very impressed!


I started going through it in detail, and there it lay. Very subtle. Extremely minute, yet a very significant error. During inference, in order to find the top k nearest neighbours from the computed distances, the model used PyTorch's frequently employed topk function. However, it did not set the argument to select the k smallest values; it went with the default parameters, which fetch the k largest distances! And voila, the final predictions were wrong.
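
A tiny snippet illustrates the pitfall; the distances here are made up, but the default behaviour of torch.topk is exactly what tripped up the generated code.

```python
# torch.topk returns the LARGEST values by default, but nearest neighbours need
# the SMALLEST distances.
import torch

distances = torch.tensor([0.1, 2.3, 0.7, 5.0, 0.2])  # distances from a query point
k = 3

wrong_vals, wrong_idx = torch.topk(distances, k)                 # defaults: largest=True
right_vals, right_idx = torch.topk(distances, k, largest=False)  # the k nearest points

print(wrong_idx.tolist())  # [3, 1, 2] -- the three FARTHEST neighbours
print(right_idx.tolist())  # [0, 4, 2] -- the three nearest neighbours
```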


There is no way one would see this unless they were really looking for it. And that's the problem: trust. You cannot trust the outputs from ChatGPT to be the absolute truth. If someone had uploaded this answer to Stack Overflow, they would have had 10 downvotes, 3 comments, and a large case of imposter syndrome within the hour, and that is why humans probably still have the edge over trillions of weight parameters. TL;DR: tread lightly with ChatGPT if the output is even remotely significant to you! Or else, just play around and have fun!

Music as a Stabilizer.

There is so much distraction all around in terms of what to do, how to do it, for how long to do it, and so on. It becomes very difficult to just sit down and zone in even for a short duration. Yes, one may not be able to get substantial work done even after zoning in, but it at least feels good, and more often than not, if you invest focussed time, something does budge. Music enables this. Spotify facilitates this. It numbs out the "other stuff" and allows you to look at one thing. Sometimes you might just end up enjoying songs and not making any breakthrough, but even that is better than feeling distracted and scattered during that time. I often forget that in times of distraction, music serves me well. This note is a reminder for such an event.

My Summer Internship Experience at Appfolio Inc.

This summer, I got to work as a Software Developer Intern at Appfolio Inc. This was my first proper SDE experience, having worked as an ML Engineer/Research Engineer in my earlier roles. I was unsure of how I would find it, and how good I would be at it. But the people at Appfolio and the culture of the engineering team took that fear away in a matter of a few weeks.
They had a very well-designed onboarding and learning project set up, which was a great mix of mini tasks as part of a broader project that had both real-world meaning and was neatly tied in with the day-to-day work that would follow. My primary work was on a Rails backend, and involved working with the team to groom, prioritize, and complete stories/issues. There was a wide variety of tasks to tackle on a day-to-day basis, all the way from reducing tech debt to fixing priority bugs that would help customers within a matter of hours.
I was also able to observe how people think about designing systems and how they end up optimizing them. My team had been able to optimize a critical job to bring its runtime down from hours to minutes. It was a joy to witness that and learn from it, even as a fly on the wall. The best part was that everyone on my team, and by my limited experience, company-wide, was very helpful, caring, and empathetic. I found people to be extremely curious, and they were always trying to improve existing processes or bring into existence new ones. I am grateful for this wholesome experience, both from a technical and a personal viewpoint!

Learning to do & Appreciate Polished work

I worked with a PhD student and a couple of professors on submitting a full paper to a premier ML conference. I also worked with a couple of professors and a post-doc on a 90-minute tutorial. I spent hours finding the best way to make an image, and days creating the perfect slide. I realised how important it is to not just get the work done, but to get the work done in the best possible way. As far back as I can remember, I have always been a hacky person: I do just enough to get things working and have them be understandable. However, these two experiences have transformed me for good. I cannot go back to creating slides that are not perfect, or to putting work in papers that is not up to the mark. It has now become an instinct that kicks in and prevents me from stopping at merely good enough. There are at least 10 more iterations after getting to good enough. I appreciate quality work more now, and also hold myself to higher standards. It feels good to acknowledge this change.

From a Joint to understanding Sentences (NLP)

Language models are meant to tell you the probability of a sentence. Given that we have seen a number of sentences from a training text corpus, a trained language model can take a new sentence and give you a probability value between 0 and 1 signifying how probable it is that it is a "real", well-formed sentence. Now, one could employ any machinery, simple or complex, to design such a system. The beautiful part, though, is that there is a link between how sentences are formed in the English language and the way to exactly compute a joint probability distribution for any set of random variables.

Any joint probability distribution P(x1, x2, ..., xn) can be factorized exactly as P(x1, x2, ..., xn) = P(x1) * P(x2|x1) * P(x3|x1, x2) * ... Note that this is not an approximation and there are no independence assumptions; it is the exact value of the joint! Now take a sentence in the English language: "This is a blog post". If we were to find the probability of this sentence, where each word is a random variable, we get P(this, is, a, blog, post). The way English sentences work is that they are sequential, left to right, and a word (almost always) depends only on the words seen before it. This implies, for instance, that the probability of "blog" being the fourth word would intuitively be thought of as P(blog | this, is, a), and that is exactly what the decomposition of the joint uses as well. It is fascinating how these two very different concepts converge to allow us ways to finish sentences automatically and generate new text.
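
As a toy illustration, here is the chain rule applied to that sentence in a few lines of Python; the conditional probabilities are made-up numbers that a trained language model would normally supply.

```python
# The chain rule in action on the example sentence.
sentence = ["this", "is", "a", "blog", "post"]

# P(word | all previous words) -- toy values standing in for a language model's outputs.
conditionals = [0.05, 0.30, 0.25, 0.01, 0.40]

p_sentence = 1.0
for p in conditionals:
    p_sentence *= p  # P(this) * P(is|this) * P(a|this,is) * ...

print(p_sentence)    # joint probability of the whole sentence (~1.5e-05)
```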

 Probabilities, calculus and linear algebra never cease to amaze.

Learning Journey: Causality

This Medium blog series: https://medium.data4sci.com/

Reading scattered papers. 

Going through "The Book of Why".

Attending meetups, asking questions there to people.

Learning Journey: Ruby

Udemy

Great guide: [Getting Started with Rails — Ruby on Rails Guides](https://guides.rubyonrails.org/getting_started.html)

