How we built Origin Intelligence

At Origin we have spent the last 5 years streamlining bond issuance by automating documentation. When users draft documents on the Origin platform, we capture the structured data about these new securities and enable a whole suite of automations and integrations. Tasks like requesting an ISIN, or setting up the security on listing exchanges, paying agents or Bloomberg, are all now accessible with the click of a button. The operational workload for each new issuance has been cut from hours to seconds, and the market finally has the infrastructure to enable shorter settlement windows.

Over this time, we have helped issue over 1,800 bonds, ranging from vanilla private placements to syndicated benchmarks and structured notes. Despite this success, one of the biggest challenges we face is users’ reluctance to switch from MS Word to a web app for drafting documents. While some users have happily made the switch because of the clear benefits, for many others, the habits built up over many years (even decades) are hard to unlearn. This got us thinking: what if we could provide the value of structured data without changing the way people work?

In other words, could we allow users to keep creating their documents in MS Word but still extract the data we needed to support our integrations and automations?

Origin Extract: the first step

We launched Origin Extract 2.5 years ago, allowing users to upload a bond termsheet created offline into Origin. Legal users could now quickly draft the final legal documents using a termsheet created outside of Origin and continue to access our suite of integrations.

The approach behind the data extraction relied on two steps. First, we used machine learning to convert a Word or PDF document into a machine-readable representation that preserved any tables found. The table layout chosen in a document conveys a lot of meaning, but extracting it accurately from various document formats is surprisingly difficult. The second step applied a list of rules for each field being extracted. These rules could be very simple (e.g. “find the row that says ‘Issuer Name’ and tell me the value”) or extremely complex, with many steps. Most of the rule types relied on techniques like fuzzy matching and regular expressions to look for certain patterns in the text.
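To make the second step more concrete, here is a minimal sketch of what two such rules could look like in Python. It is an illustration only, not the actual Origin Extract ruleset: the function names, the 0.8 similarity threshold, the sample rows and the sample ISIN are all hypothetical, and fuzzy matching is approximated with the standard library's difflib.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple

# Hypothetical representation of a termsheet table: (label, value) rows.
Row = Tuple[str, str]


def find_row_value(rows: Sequence[Row], label: str, threshold: float = 0.8) -> Optional[str]:
    """Fuzzy rule: return the value of the row whose label best matches `label`.

    The threshold is an assumed value; rows scoring below it are ignored.
    """
    best_value, best_score = None, threshold
    for row_label, value in rows:
        score = SequenceMatcher(None, row_label.strip().lower(), label.lower()).ratio()
        if score >= best_score:
            best_value, best_score = value, score
    return best_value


# Regex rule: an ISIN is two letters, nine alphanumeric characters and a check digit.
ISIN_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b")


def find_isin(text: str) -> Optional[str]:
    """Return the first ISIN-shaped token in the text, if any."""
    match = ISIN_PATTERN.search(text)
    return match.group(0) if match else None


# Made-up sample data for illustration.
rows = [("Issuer name:", "Example Bank plc"), ("Maturity Date", "15 June 2030")]
issuer = find_row_value(rows, "Issuer Name")
isin = find_isin("ISIN: XS0123456789 (temporary)")
```

The fuzzy rule tolerates small label variations such as a trailing colon or different capitalisation, but both rules still depend on the document using roughly the expected wording and layout.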

After years of adding and tweaking the ruleset, Origin Extract now extracts 74 bond fields from termsheets with 88% accuracy. While this is impressively high for a rules-based approach, the approach has its drawbacks. It requires input documents to have standardised structure and language. This means that a ruleset is not transferable to a new document type, like final terms, or a new language, like Swedish, even if the fields are exactly the same. It also creates a natural ceiling on how accurate the extraction can be, standing in the way of true automation, which needs near-perfect extraction (accuracy of 95% or higher). And we finally hit that ceiling with Origin Extract.

Origin Intelligence: the new approach

We released Origin Extract in July 2022, just 4 months before the first version of ChatGPT. In the 2 years since, we have seen rapid improvements in the state of the art in AI, with the natural language capabilities of foundation models reaching a level we could never have imagined. This year, we knew it was time to bring these benefits to the bond market. As we began experimenting with the new technology, we quickly realised that we could rebuild Origin Extract from the ground up, likely making it more accurate and flexible.

After several experiments, we settled on a new architecture that uses an LLM for its natural language processing, supported by a simpler rules-based system. This allowed us to leverage our existing insights from Origin Extract. But as with any new technology, we needed a robust way to measure the performance of different approaches. We settled on the confusion matrix, a performance measure used for classification models. As the name suggests, this measure highlights where the model is “confused” (albeit mathematically), giving you a clear idea of which areas need improvement. Using the confusion matrix, you can produce useful metrics for this type of decision making:

  • Precision: How likely an extracted value is the correct value in the document.

  • Recall: How likely a value is extracted when it is present in the document, regardless of whether the extracted value is correct.

→ You can improve precision at the expense of recall, and vice versa, which is why it’s important to look at both.

  • Accuracy: How likely a field is correctly extracted when it is present in the document and correctly ignored when it is not present. An accurate model tends to have good precision and recall.
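As a rough sketch of how these three metrics could be computed, the Python below scores a list of (expected, extracted) pairs, one per field per document. The outcome names, the function names and the sample values are hypothetical, and the mapping from the definitions above to counts is our own reading rather than a description of the production evaluation code. Note that recall here follows the definition above, so an extracted but wrong value still counts towards it.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# One pair per field per document: (expected value, extracted value).
# None means "not present in the document" or "nothing extracted".
Pair = Tuple[Optional[str], Optional[str]]


def classify(expected: Optional[str], extracted: Optional[str]) -> str:
    """Assign one (hypothetically named) outcome to a single field."""
    if expected is None:
        return "ignored" if extracted is None else "spurious"
    if extracted is None:
        return "missed"
    return "correct" if extracted == expected else "wrong"


def score(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Compute precision, recall and accuracy as defined in the list above."""
    counts = Counter(classify(expected, extracted) for expected, extracted in pairs)
    extracted_total = counts["correct"] + counts["wrong"] + counts["spurious"]
    present_total = counts["correct"] + counts["wrong"] + counts["missed"]
    total = sum(counts.values())
    return {
        # Share of extracted values that are the correct value.
        "precision": counts["correct"] / extracted_total if extracted_total else 0.0,
        # Share of present values that were extracted at all, right or wrong.
        "recall": (counts["correct"] + counts["wrong"]) / present_total if present_total else 0.0,
        # Share of fields correctly extracted or correctly ignored.
        "accuracy": (counts["correct"] + counts["ignored"]) / total if total else 0.0,
    }


# Made-up sample: one field for each possible outcome.
sample = [("EUR", "EUR"), ("2030-06-15", "2030-06-16"), ("5.25%", None), (None, "Annual"), (None, None)]
result = score(sample)
```

Running the same scoring over all fields, over a single field across documents, or over a single document gives the model-level, field-level and document-level views of where the extractor is getting confused.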

We used this approach with a reference set of correctly extracted documents to measure the performance of each iteration as we were experimenting with LLMs. It allowed us to understand the impact of each change at a model level, field level and document level.

Eventually, Origin Intelligence was born. This new bond data extractor is able to extract 79 fields with an accuracy of over 96%. It is also incredibly flexible since it is not limited to a specific document type, structure or even language. Today, the Origin platform allows users to extract from termsheets and final terms, but there is no reason we need to stop there.

What’s next

Thanks to the outstanding data extraction results for both termsheets and final terms, we have now integrated Origin Intelligence throughout the Origin platform. 

Front office users can drag and drop termsheets to benefit from our integrations, allowing them to instantly generate “XS” ISIN codes and set up new securities on Bloomberg.

Legal users can drag and drop termsheets in order to use the platform to automatically generate final terms. They can also drag and drop final terms that were prepared offline in order to benefit from our connections to listing exchanges and paying agents.

While we have already reached a very high accuracy, this is only the beginning. As LLM technology continues to evolve and improve, so too will our ability to extract bond data from ever more complex documents. And this is just one of the ways that we can use this exciting technology. Stay tuned for what more we can do with Origin Intelligence… we're just getting started!
