May 28, 2024

On LLM testing strategies

AI mistakes are becoming more common as these tools enter production, from Google's Gemini launch to a car dealership selling a truck for $1. Current software testing methods are struggling to keep up with the rapid pace of AI development, but new approaches like human review, ML metrics, and LLM UIs each have their own challenges.

Now that AI tools are making their way into production, we're starting to see some hilarious mistakes as well -- from Google's Gemini launch to a car dealership selling a truck for $1. The old software testing paradigms are broken -- but it's not clear what will take their place.

There are a bunch of attempts at new testing paradigms, but each one comes with its own downsides. Let's break down a few of them.

1. Human review

This is what the BigCos in the tech world are doing -- armies of human labelers are grading LLM outputs by hand. One of the blue chip companies in this space is running hundreds of manual labelers on every single new build their software team puts out.

Their process goes something like this: an engineer pushes a change, that new build gets sent to human labelers, and those labelers spend hours to days manually grading the new build's answers against a bank of several hundred test questions. The engineer gets updated performance results a few days later, and only then can start working again.

There are two main challenges here.

1. Accuracy

  - Humans just aren't very good at labeling. Depending on the content, human labelers might only reach 70-90% accuracy on labels. I've personally talked with several engineers at BigCos who end up reviewing samples by hand because they're so frustrated with label quality.

2. Speed

  - Labeling queues take anywhere from a few hours to a week or more to return results, which is an eternity in developer time. In a world where feedback is instant (e.g. web dev) or unit tests take minutes in CI/CD, waiting multiple days on even the most basic change isn't acceptable.

2. MLOps-style metrics

You know you're in this world if you see things like cosine similarity, clustering, F-scores, or generally any quantitative measurement of AI output. If you're considering leaning on any of these types of metrics, the main question you need to consider is: _what are you actually measuring?_

That is, do the numbers you're looking at actually translate to the usage of your product?
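As a small illustration of how a metric can score something other than what you care about, here is a minimal sketch using bag-of-words cosine similarity. The two sentences are made-up examples, and real pipelines usually compare embeddings rather than word counts, but the same question applies: a high score means the texts look alike, not that the answer is right.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical expected answer and model answer with opposite meanings.
expected = "the refund is approved"
answer = "the refund is not approved"

score = bow_cosine(expected, answer)
print(round(score, 2))  # 0.89
```

The two answers disagree on the only fact that matters to the customer, yet the similarity score is close to 1.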

GPT-4's latest scores on the GAIA benchmark don't reflect how well it'll serve your financial analysts trading FX futures, and some bag of "correctness" metrics doesn't mean that your sales people are actually getting correct answers. If you're trying to use a number to reflect the performance of your product, you'd better be sure that the set of cases behind that number (its denominator) matches the usage you're actually getting.
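To make the denominator point concrete, here is a rough sketch with invented numbers. The topics, pass counts, and traffic shares are all hypothetical; the point is only that the same test results give very different headline numbers depending on whether the test set mirrors production usage.

```python
# Hypothetical per-topic test results: topic -> (passed, total)
test_results = {
    "pricing": (45, 50),
    "returns": (40, 50),
    "fx_futures": (2, 10),
}

# Hypothetical share of real production traffic per topic (sums to 1.0)
usage_share = {"pricing": 0.1, "returns": 0.1, "fx_futures": 0.8}

def raw_accuracy(results):
    """Accuracy over the test set as written, ignoring how the product is used."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed / total

def usage_weighted_accuracy(results, share):
    """Per-topic accuracy, weighted by each topic's share of real traffic."""
    return sum(share[topic] * p / t for topic, (p, t) in results.items())

raw = raw_accuracy(test_results)
weighted = usage_weighted_accuracy(test_results, usage_share)
print(f"raw: {raw:.2f}, usage-weighted: {weighted:.2f}")
# raw: 0.79, usage-weighted: 0.33
```

In this made-up case the test set is dominated by easy topics, so the raw score looks healthy while most real users are asking the questions the system fails.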

3. LLM UIs

We've seen a lot of usage of LLM playgrounds, where people write prompts into web UIs and drag and drop arrows to build flow charts.

These can be useful in their own right, but often don't focus on the core of the testing problem -- what test cases are you running against your product, and where are these test cases coming from?

We've seen many companies spin their wheels on exactly this problem, from health tech companies paying doctors to write good test cases for them to software companies paying for SWEs to grade code samples.

This is exactly where we focus at Talc.

Why Talc?

In AI, one of the most important problems is selecting a good test set to validate on.

We help you generate a good dataset, and then validate your AI against it.

Dataset Generation

We use your knowledge base and existing samples to generate a rich, accurate dataset of everything you need to test.

If you have 20 samples of customers asking about products, we'll come back to you with 200 samples of questions and answers about products, company policies, and other potential topics of conversation. If you have 30 samples of labeled medical transcripts, we'll help you generate 3000 similar types of transcripts so that you don't have to use doctors to do that work.

Instead of using humans to gather this interaction data yourself, we'll help you create a golden set that reflects your product's real usage.


Now that we have a good golden set to rely on, Talc integrates with your CI/CD (or via command line) to regularly run through the tests programmatically.

Every time you make a change to a prompt, model, or underlying code, we run your system end to end (E2E) against the golden set and let you know how the change affected the cases you need to work. As a result, you get reliable feedback in minutes, not days.
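For readers who want a feel for what a golden-set check in CI looks like, here is a generic sketch. It is not Talc's API: `GOLDEN_SET`, `fake_system`, and `run_golden_set` are hypothetical names, the test cases are invented, and the substring check is only a placeholder for whatever grading you actually trust.

```python
from typing import Callable

# Hypothetical golden set: each case pairs a question with a phrase
# the answer must contain. Real grading is usually richer than this.
GOLDEN_SET = [
    {"question": "What is the return window?", "must_contain": "30 days"},
    {"question": "Do you ship internationally?", "must_contain": "yes"},
]

def fake_system(question: str) -> str:
    """Stand-in for the real end-to-end call into your product."""
    canned = {
        "What is the return window?": "You can return items within 30 days.",
        "Do you ship internationally?": "Yes, to most countries.",
    }
    return canned.get(question, "")

def run_golden_set(system: Callable[[str], str], cases: list[dict]) -> list[dict]:
    """Run every case through the system and return the ones that failed."""
    failures = []
    for case in cases:
        answer = system(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append({**case, "answer": answer})
    return failures

def main() -> int:
    failures = run_golden_set(fake_system, GOLDEN_SET)
    for f in failures:
        print(f"FAIL: {f['question']!r} -> {f['answer']!r}")
    print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} golden cases passed")
    return 1 if failures else 0

# In CI, pass this value to sys.exit() so a regression fails the build.
exit_code = main()
```

Swapping `fake_system` for a function that calls your real prompt, model, and code path turns this into an end-to-end check that runs on every change.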
