Engineering
Oct 8, 2024

Auto-generating evaluation data

TL;DR: We automate generating evaluation data in technical domains, primarily in the format of prompt/response pairs. Our API is in closed beta, but you can play with the stripped-down demo here!

A team we work with was using financial analysts to write prompt/answer pairs to improve their LLM agents. They'd write questions like “How does Dodd-Frank legislation affect X company action?” or “What filing would you need to make if an entity did Y?” This process, however, took hours and was a drag for employees.

But what if we didn’t actually need humans to create labeled data?

Sometimes we’re not looking for net new knowledge. For example, if we were trying to encode an HR policy into an AI model, all the knowledge we need is already written up. In this case, generating a dataset is not a problem of finding new knowledge, but of transforming existing knowledge.

The challenge then becomes, “How do we transform unstructured knowledge into rows of data?”

Parts of this approach have been adopted by teams like Microsoft with AdaptLLM, where they turn sentences into reading comprehension tasks and then use the resulting tasks as training data for a general-purpose LLM. We expand on these types of approaches by parsing unstructured text to form knowledge graphs.

At a high level, we:

  1. Parse existing documents for the interesting facts and lessons we want a model to know.
  2. Reverse engineer questions or situations that require those facts to answer properly.
  3. Transform each question/answer pair to match production data as closely as we can in tone, style, and format.
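
To make the three steps concrete, here is a minimal Python sketch. It is a toy stand-in rather than our actual pipeline: the `Fact` and `QAPair` types, the function names, the "subject: statement" document format, and the made-up HR policy lines are all assumptions for illustration. In practice each stage would likely be backed by an LLM call instead of string handling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    subject: str
    statement: str
    source: str  # name of the document the fact was parsed from


@dataclass(frozen=True)
class QAPair:
    prompt: str
    response: str
    fact: Fact  # the seeded fact that makes the response checkable


def parse_facts(documents: Dict[str, str]) -> List[Fact]:
    """Step 1 (toy version): treat every 'subject: statement' line as a fact."""
    facts = []
    for name, text in documents.items():
        for line in text.splitlines():
            if ":" in line:
                subject, statement = line.split(":", 1)
                facts.append(Fact(subject.strip(), statement.strip(), name))
    return facts


def reverse_engineer(fact: Fact) -> QAPair:
    """Step 2: write a question whose correct answer is the seeded fact."""
    prompt = f"What does the policy say about {fact.subject}?"
    return QAPair(prompt, fact.statement, fact)


def restyle(pair: QAPair, template: str) -> QAPair:
    """Step 3: reshape the prompt so it looks more like production traffic."""
    return QAPair(template.format(prompt=pair.prompt), pair.response, pair.fact)


def build_dataset(documents: Dict[str, str], template: str) -> List[QAPair]:
    """Run all three steps and return one row per parsed fact."""
    return [restyle(reverse_engineer(f), template) for f in parse_facts(documents)]


# Hypothetical HR policy text, invented for this example.
docs = {
    "hr_policy.txt": (
        "parental leave: Employees receive 16 weeks of paid leave.\n"
        "expenses: Receipts are required above $25."
    )
}
rows = build_dataset(docs, "hi team, {prompt}")
for row in rows:
    print(row.prompt, "->", row.response)
```

Each row keeps a reference to the fact it came from, so every generated answer can be traced back to a source document.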

For example, here are a few snippets of banking regulation [1]:

  1. “G-SIBs, or Global Systemically Important Banks, is a list of institutions identified annually by the Financial Stability Board (FSB)... that includes JP Morgan, Wells Fargo…..”
  2. “G-SIBs are required to hold a minimum of 4.5% CET1 holdings”
  3. “Common Equity Tier 1 (CET1) is the sum of common shares and stock surplus”

From these snippets, we could parse this fact:

       JP Morgan must hold at least 4.5% of its assets in common stock and cash

When we combine that regulation with knowledge of which banks are considered G-SIBs, we can generate multiple scenarios that demand advanced financial knowledge:

  1. JP Morgan currently holds 5% of its risk-weighted assets in cash and common stock; is this ratio compliant with federal regulations?
    1. Since JPM is a G-SIB, it is required to hold at least 4.5% of its assets in cash and stock, so this holding is compliant with regulations.
  2. Silicon Valley Bank currently holds 3% of its risk-weighted assets in cash and common stock; is this ratio compliant with federal regulations?
    1. Since SVB is not a G-SIB, Basel III regulations don’t dictate how much cash and stock it must hold, so based on the information given, it’s not violating financial regulations in this area.
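
As a rough illustration of how one seeded rule fans out into many labeled scenarios, here is a small Python sketch. The 4.5% threshold and the two bank names come from the simplified snippets above; the `Scenario` type, the `make_scenario` function, and the partial `GSIBS` set are hypothetical names introduced for this example, not part of our API.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified rule from the snippets above: G-SIBs must hold at least 4.5% CET1.
CET1_MINIMUM_PCT = 4.5
# Partial, illustrative membership list; snippet 1 names these two banks.
GSIBS = {"JP Morgan", "Wells Fargo"}


@dataclass
class Scenario:
    prompt: str
    response: str
    compliant: Optional[bool]  # None means the rule does not apply


def make_scenario(bank: str, holding_pct: float) -> Scenario:
    """Build a prompt/response pair whose answer follows from the seeded rule."""
    prompt = (
        f"{bank} currently holds {holding_pct:g}% of its risk-weighted assets "
        "in cash and common stock; is this ratio compliant with federal regulations?"
    )
    if bank not in GSIBS:
        response = (
            f"{bank} is not a G-SIB, so this rule does not set a minimum for it; "
            "based on the information given, it is not in violation."
        )
        return Scenario(prompt, response, None)
    compliant = holding_pct >= CET1_MINIMUM_PCT
    verdict = "compliant" if compliant else "not compliant"
    response = (
        f"{bank} is a G-SIB and must hold at least {CET1_MINIMUM_PCT:g}%, "
        f"so a {holding_pct:g}% holding is {verdict}."
    )
    return Scenario(prompt, response, compliant)


# Varying the bank and the holding gives many rows from a single fact.
scenarios = [
    make_scenario(bank, pct)
    for bank in ("JP Morgan", "Wells Fargo", "Silicon Valley Bank")
    for pct in (3, 4.5, 5)
]
for s in scenarios:
    print(s.prompt)
    print("  ", s.response)
```

Because the label is computed from the rule rather than written by hand, the answer stays correct however many variations are generated.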

There are two important features of generating data this way:

  1. It’s highly accurate, because we're seeding the correct answer.
  2. At scale, it approximates expertise.
    1. Answering the two questions above doesn’t mean your AI is an expert, but what about answering 1,000 of these types of questions? What if it answered nearly perfectly on a well-distributed, complex set of 10,000 of these questions?
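
Measuring "nearly perfectly" comes down to scoring a model against the seeded answers. Here is a minimal sketch, assuming each row has been reduced to a yes/no label; real responses are free text and would need a fuzzier comparison or a grader. The `accuracy` function and the `always_yes` stand-in model are hypothetical names for illustration.

```python
from typing import Callable, Iterable, Tuple

# Each row is (prompt, expected label). The label comes from the seeded fact,
# so this toy yes/no format needs no human grading.
Row = Tuple[str, bool]


def accuracy(model: Callable[[str], bool], rows: Iterable[Row]) -> float:
    """Return the fraction of rows where the model matches the seeded label."""
    rows = list(rows)
    if not rows:
        raise ValueError("dataset is empty")
    correct = sum(model(prompt) == expected for prompt, expected in rows)
    return correct / len(rows)


def always_yes(prompt: str) -> bool:
    """Stand-in model that always answers 'compliant'; a real one would call an LLM."""
    return True


dataset = [
    ("JP Morgan holds 5% in CET1; is this compliant?", True),
    ("JP Morgan holds 4% in CET1; is this compliant?", False),
    ("Wells Fargo holds 4.5% in CET1; is this compliant?", True),
    ("Wells Fargo holds 6% in CET1; is this compliant?", True),
]
score = accuracy(always_yes, dataset)
print(f"accuracy: {score:.2f}")
```

A trivial model already gets most of this tiny set right, which is why the distribution and difficulty of the generated questions matter as much as their count.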

And that’s it! Early customer teams have reported that data generated this way is as realistic and accurate as samples they could write themselves.

Currently this requires a bit of setup, as every knowledge base is different, but we threw together a basic, stripped-down MVP that you can play with. Check it out here!

Note that there are some serious limitations to this web demo. For example, we haven’t exposed the underlying config file – most of our customers see massive quality jumps after just a few config iterations, and this usually fixes the first few rounds of feedback. This felt too complicated to throw up as a toy demo, but I’m happy to revisit if there’s interest.

If you have any questions, find bugs, or have examples of mistakes we made, feel free to reach out directly at matt at talc dot ai!

[1] This is not actually how it’s phrased in Dodd-Frank / Basel III, but none of us want to read that today.
