ChatGPT and the LLM revolution: A private markets perspective


April 26, 2023


Introduction

OpenAI’s release of ChatGPT in 2022 sent shockwaves around the world. This chatbot revealed to the general public just how powerful large language models (LLMs) have become. Able to perform a wide range of complex tasks at or beyond human level, ChatGPT and similar tools have clear potential to disrupt. An upgrade to ChatGPT’s underlying model in 2023 significantly improved its performance, with OpenAI showcasing examples of ChatGPT writing and debugging code, understanding complex tax law and interpreting images.

Just how far and how fast any such disruption will spread, and which industries will be affected first, remains to be seen. The goal of this article is to assess the scale of the impact on Private Markets, focusing on the market transparency, speed of data exchange and quality of insight generation that allow Private Markets professionals to make informed investment decisions. We also provide a primer on the underlying technology behind ChatGPT and discuss some of its current limitations.

 

How ChatGPT works

Large language models

ChatGPT is built on foundational large language models. As the name suggests, a large language model is a machine learning model with a very large number of parameters, typically numbering in the tens or hundreds of billions. These models are trained to understand language at a fundamental level by feeding them enormous amounts of data scraped from the internet. Raw data is converted into a learning task by removing small pieces of each sample and asking the model to predict the missing item. For example, the model can be given the start of a sentence and asked to predict the next word.
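
To make the training task concrete, here is a minimal, illustrative Python sketch of how a sentence is turned into next-word prediction examples. Real LLMs split text into subword tokens with a learned tokenizer; plain word splitting is used here purely for illustration.

```python
# Minimal sketch: turning raw text into next-word prediction examples.
# Real LLMs operate on subword tokens, not whole words.

def make_training_examples(text: str) -> list[tuple[list[str], str]]:
    """Turn a sentence into (context, target) pairs for next-word prediction."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in make_training_examples("The fund returned ten percent"):
    print(f"{' '.join(context):30} -> {target}")
```

Each pair gives the model everything it has seen so far as input and the next word as the answer, which is all the supervision the pre-training stage needs.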

Despite the seeming simplicity of this task, the large size of the model and the sheer volume of data it sees allows it to learn a surprising range of skills. Training data includes text in different languages, so LLMs can effectively learn to understand and translate between them. Computer code scraped from public repositories such as GitHub and wikis like Stack Overflow teaches the model how to write software.

More recently, LLMs are being trained not just on text data, but also on images. Such a model is known as a visual language model. GPT-4, the most powerful underlying model behind OpenAI’s ChatGPT, is a visual language model, although its visual capabilities are not yet publicly available.

Reinforcement learning

LLMs alone are not the full story behind the power of tools such as ChatGPT. Being able to predict the next word in a sentence is by itself not a particularly useful skill. After initially being trained on vast amounts of data scraped from the internet, foundational LLMs are then fine-tuned to generate not just a single word, but a natural language response to an input (known as a prompt) from the user. In this way, LLMs become chatbots that respond to inputs using natural language.

However, fine-tuning a model to produce high-quality responses to user inputs is not trivial. It is difficult to measure the quality of a natural language response mathematically. A technique known as reinforcement learning from human feedback (RLHF) is therefore used to tackle this problem. Reinforcement learning deals with agents interacting with their environment to maximize some score, known as the reward.

When fine-tuning generative language models, model responses are quality-graded by human annotators. A second large language model is introduced and trained on the quality grades to produce what is known as a reward model. This reward model is then used to further train the original large language model to produce higher quality responses.
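
The loop below is a schematic Python sketch of this process. Every component is a toy stand-in of our own invention: a production system would use a pretrained LLM, a reward model learned from the human grades, and a reinforcement learning algorithm such as PPO to update the model’s parameters.

```python
import random

def generate_response(prompt: str) -> str:
    """Toy stand-in for the LLM sampling a candidate response."""
    return random.choice(["Paid.", "The capital call of $1m was paid on 1 May."])

def reward_model(prompt: str, response: str) -> float:
    """Toy stand-in for the reward model trained on human quality grades.
    Here it simply rewards more informative (longer) responses."""
    return float(len(response))

def rlhf_step(prompt: str, n_samples: int = 4) -> str:
    """One conceptual fine-tuning step: sample several responses and score
    them with the reward model. A real system would use the scores to
    compute a policy-gradient update rather than just picking a winner."""
    samples = [generate_response(prompt) for _ in range(n_samples)]
    return max(samples, key=lambda r: reward_model(prompt, r))

print(rlhf_step("Summarise the status of the capital call."))
```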

ChatGPT

The above recipe is widely known and has been used to train many chatbots, including ChatGPT. So, what makes ChatGPT so special? OpenAI has not revealed all of its secrets, but it is known to have invested heavily in the infrastructure required to train very large models on very large datasets. In addition, OpenAI has developed techniques to ensure that model training is well-behaved at different scales. This allows it to iterate rapidly using smaller models while still being able to predict how a larger model will perform. It is also believed that OpenAI has built by far the largest dataset of graded responses used to train its reward model.

 

Applications in Document AI

Document AI is the process of turning unstructured data from documents into structured data. It is typically handled with a pipeline of algorithms and models executing separate tasks such as document rendering, classification, entity recognition, entity linking and business rule application. Accelex’s own data science stack follows this battle-tested approach.
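
As an illustration only, the sketch below wires a few of these stages together in Python. The stage names follow the list above, but the implementations are trivial placeholders of our own devising, not a description of Accelex’s actual stack; entity linking is omitted for brevity.

```python
import re

def render(raw_bytes: bytes) -> str:
    """Document rendering: convert the input file into analysable text.
    Real pipelines handle PDFs, scans and OCR; plain text is assumed here."""
    return raw_bytes.decode("utf-8", errors="ignore")

def classify(text: str) -> str:
    """Document classification, e.g. capital call vs. anything else."""
    return "capital_call" if "capital call" in text.lower() else "other"

def extract_entities(text: str) -> dict:
    """Entity recognition: pull out monetary amounts with a toy regex."""
    return {"amounts": re.findall(r"\$[\d,]+(?:\.\d+)?", text)}

def apply_business_rules(doc_type: str, entities: dict) -> dict:
    """Business rule: a capital call must contain at least one amount."""
    entities["valid"] = doc_type != "capital_call" or bool(entities["amounts"])
    return {"type": doc_type, **entities}

def run_pipeline(raw_bytes: bytes) -> dict:
    text = render(raw_bytes)
    return apply_business_rules(classify(text), extract_entities(text))

print(run_pipeline(b"Notice of capital call: please remit $1,000,000 by 30 June."))
```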

How can tools such as LLMs or ChatGPT fit into such a pipeline? With the exception of document rendering (the task of converting an input document into a suitable format for downstream tasks), certain parts of the document AI pipeline could potentially be enhanced by LLMs. Furthermore, LLMs may allow for a reduction in the complexity of such pipelines.

At Accelex we are very excited about the potential for this technology to drive automation rates higher for clients and expand the capabilities of our offering to new use cases, such as automatic validation and alerting, real-time document QA and more.

 

Challenges and pitfalls

Despite the remarkable abilities of generative LLMs, they are not without their problems.

We identify four clusters of challenges: hallucinations and factual accuracy, speed, context length and, finally, privacy, which is critical to alternative investors.

Hallucinations and factual accuracy

Hallucinations are a problem in many generative models. A hallucination is a plausible falsehood, confidently included in the output of a chatbot or LLM. Hallucinations tend to occur because the underlying models are trained to predict the most likely completion for a given input and lack a world model to realize when an input is incorrect or misleading.

In our experience to date, when it comes to complex numeric reasoning such as the interpretation of investor positions, accounting and reconciliations, even state-of-the-art models like GPT-4 are not able to give consistent and accurate logical inferences or apply repeatable, logical judgments to complex and often ambiguous datasets.

We expect that the financial literacy of LLMs will improve over time, and specialized models such as BloombergGPT, trained on financial datasets, have already been announced. Notwithstanding the training inherent in such models, logical consistency may never be guaranteed with LLMs, as it is not an inherent feature of natural language in the first place. Regardless of the underlying extraction accuracy, we still believe the final review and sign-off of critical financial data should remain with industry professionals.

Speed and scale

As machine learning models grow larger, their speed of response typically decreases. Depending on the task at hand and the complexity of the input and output, a single response may take from a few seconds to over a minute. This can present a significant challenge when the technology is deployed at scale.

For example, if we attempt to use an LLM to extract information from a document, then the time taken to extract that data should be directly proportional to the number of questions asked. In Private Markets, investors routinely deal with documents containing hundreds or even thousands of data points that need to be obtained in real time, as efficiently as possible. Depending on the size and variance of these documents, the fetching of this information may be compressed by using a small number of prompts (perhaps a dozen, or even fewer). It remains to be seen whether automatically deploying these queries would give users an acceptable experience.
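
One way to picture this batching is sketched below: rather than one model call per data point, many fields are requested in a single prompt and returned as a JSON object. The field names and prompt wording are hypothetical, and any LLM client could sit behind the final call.

```python
# Hypothetical fields a Private Markets document might contain.
FIELDS = [
    "fund name",
    "reporting date",
    "net asset value",
    "capital called to date",
    "distributions to date",
]

def build_batched_prompt(document_text: str, fields: list[str]) -> str:
    """Pack many extraction questions into a single prompt, asking the
    model to answer all of them at once as one JSON object."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the document below. Reply with "
        "a single JSON object keyed by field name, using null for any "
        "field that is not present.\n\n"
        f"Fields:\n{field_list}\n\n"
        f"Document:\n{document_text}"
    )

# One call now covers five data points instead of five separate calls.
print(build_batched_prompt("Quarterly report for Fund IV ...", FIELDS))
```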

However, a flurry of new services giving developers and data scientists access to powerful LLMs has been announced in recent weeks. Some of these, such as Amazon’s Bedrock, allow models to be fine-tuned on different tasks with new data. This will facilitate easier integration of LLMs into end-user systems.

Context length

Large language models have a limited input size, known as the context length. In the case of GPT-4, the limit is currently 8,000 tokens, which is in the region of 5,000 words. Although large, this limit prevents LLMs from fully ingesting lengthy documents such as financial statements. Various solutions could be considered at the pre-processing stage, such as splitting such documents into constituent parts or stripping out less relevant sections before feeding them into LLMs. However, this re-introduces the complexity of a pipeline of models and negates some of the benefits of using LLMs. Short-form documents therefore represent a more immediate opportunity for LLM-based processing.
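
The splitting idea can be sketched as a simple pre-processing step, as below. We approximate the token limit with a word count purely for illustration; a real system would count tokens with the model’s own tokenizer.

```python
# Rough word-count proxy for an 8,000-token context limit (assumption only).
CONTEXT_LIMIT_WORDS = 5_000

def split_into_chunks(text: str, limit: int = CONTEXT_LIMIT_WORDS) -> list[str]:
    """Greedily pack paragraphs into chunks that stay under the limit.
    (A single paragraph longer than the limit becomes its own oversized
    chunk; a real splitter would subdivide it further.)"""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for paragraph in text.split("\n\n"):
        words = len(paragraph.split())
        if current and count + words > limit:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be queried separately and the answers merged, which is precisely the re-introduced pipeline complexity noted above.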

Privacy

Generative models pose particular challenges for data privacy. Generative models can reproduce verbatim the data on which they have been trained. Although models are typically trained on public data, the danger is that non-public data given to the models during their use is then used to further train the next iteration of the model. Non-public information could then leak to another user during their interaction with the re-trained model. This contrasts with a pipeline of single-task models, which are trained to learn more abstract concepts, such as table, figure, or date.

LLMs and Accelex

At Accelex we are excited by the potential for this next generation of LLMs for our industry and product. We believe these advances will be very relevant to our clients and that the Accelex platform, workflow and data model make us well-positioned to leverage them.

Accelex has always sought to build the best product for our clients using the best available technology. The rapid rise of LLMs as a powerful new tool is no exception, and we are working hard to deliver this technology to our clients. We see immense potential for this technology not just in data extraction, but also as a tool to search documents, improve explainability and flag exceptions as part of our data workflow.

 

Regardless of how data is extracted from documents, it is vitally important to have a workflow and data model tailored to the domain. Data and documents need to be mapped to an appropriate taxonomy; reference data should be managed efficiently and integrated seamlessly with upstream and downstream systems; and user validation should be effortless and intuitive. Uniquely, the Accelex platform covers the whole data flow for alternative markets, from acquisition to analytics and everything in between.

LLMs like ChatGPT have the potential to revolutionize the private markets industry by improving automation rates, expanding capabilities, and streamlining workflows. Accelex is committed to harnessing this technology to deliver innovative solutions to our clients while addressing the challenges of hallucinations, speed, context length, and privacy. By embracing LLMs, we aim to enhance our platform, offering new possibilities for data extraction, document search, explainability, and exception handling. As LLMs continue to evolve, we will stay at the forefront of this technological revolution to provide the best solutions for our clients and the private markets industry.

 


About Accelex

Accelex provides data acquisition, analytics and reporting solutions for investors and asset servicers enabling firms to access the full potential of their investment performance and transaction data. Powered by proprietary artificial intelligence and machine learning techniques, Accelex automates processes for the extraction, analysis and sharing of difficult-to-access unstructured data. Founded by senior alternative investment executives, former BCG partners and successful fintech entrepreneurs, Accelex is headquartered in London with offices in Paris, Luxembourg, New York and Toronto. For more information, please visit accelextech.com
