BERT Classifier Overview

This guide describes how to train a PyTorch classifier built on the BERT transformer, using documents labeled as contracts. The trained model can distinguish contract documents from non-contract documents with a high degree of accuracy.


Code Deep Dive

We have two code examples:

  1. One that uses PyTorch.
  2. A second that uses PyTorch with the simpler Trainer API.

BERT Overview

BERT is an NLP (natural language processing) model from Google. You can think of it as Google's counterpart to OpenAI's GPT family (which later powered ChatGPT). Both are transformer-based neural networks, with advances that made things like predicting a missing word in a sentence and generating oddly human-like text possible, and that help the models converge faster.

Here we use BERT for text classification. In particular we train it to look at a document and determine whether it is a contract or not.

The way this works is that BERT is pre-trained on giant text corpora. To say that it is pre-trained means it was loaded with unlabelled text from Wikipedia and an enormous collection of self-published books. Google did that, then made the resulting model available to download from the internet, so you can use it without having to train it yourself, which you could not do without an enormous amount of computers and compute time.

This pre-training basically taught BERT all of English (and other languages) by looking at the words in that massive text and mapping out how they are arranged across billions of sentences, in almost endless permutations.

BERT represents much of human language in stacks of encoded sentences. It splits sentences into tokens and reads them both forwards and backwards. Thus it knows the difference between, for example, the word top as a verb, as in top off the fuel tank, and top as a noun, as in place the bottle on top. It does not really understand the difference between the two senses; it has just learned how each is used in thousands of examples. So it has effectively learned which is a noun and which is a verb by noting that one takes an object while the other is the object of a preposition.

BERT's vast neural network tokenizes all of this immense data and represents it as a huge math model, which is what a neural network is. If you know about neural networks, you know that they are like a polynomial, except in n-dimensional space, with coefficients that are fed into functions called perceptrons. So it is just algebra and calculus, albeit on an enormous scale, one far too large for any human to grasp. That is why it is called a black box. Neural networks are trained (solved) in the same way as many other math problems: by finding a minimum value. In this case, the minimum is found for the error (the loss). To put that in simple terms, it is like linear regression, except with many times more variables than any regression model could handle.
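
The "find the minimum of the error" idea can be sketched in a few lines. Below is a toy one-variable gradient descent on a squared-error loss; the data points and learning rate are made up for illustration, and stand in for the millions of coefficients BERT actually tunes:

```python
# Toy illustration: fit y = w * x to data by minimizing squared error,
# the same "minimize the loss" idea that trains BERT's coefficients
# (here there is just one coefficient, w).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; true w is 2

w = 0.0    # start with an arbitrary coefficient
lr = 0.01  # learning rate

for _ in range(500):
    # gradient of sum((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data)
    w -= lr * grad  # step downhill toward the minimum error

print(round(w, 3))  # converges toward 2.0
```

Each step moves the coefficient in the direction that reduces the error; BERT's training does the same thing, just over billions of coefficients at once.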

When you use BERT with data downloaded from your Private Repository in Law Insider, you draw on that vast knowledge of English to build a new model that you can use to classify text. You do this by giving it feature-label data, which basically adds your own definitions on top of BERT's knowledge of English and other languages. Then you can feed it unlabelled data, meaning ordinary text, and it will predict what label to assign to it. This is essentially what Law Insider does with its repositories; it is how it knows which documents are contracts, agreements, etc.

To put this in a simple example, for the code examples below, we feed it label-feature data like this:

{"contract": "contract text",
 "not a contract": "some other text"}

And train it based upon the BERT model to build something like your_trained_model.

Then you feed your_trained_model something like "here is some new text". The model will assign a label to it based upon (1) the huge pre-trained BERT model and (2) the labeled documents downloaded from LawInsider.

Then you can send it other text and it will be labelled contract or not a contract. So it is a way to use the very best NLP technology to classify mountains of text, which facilitates the use cases shown below.
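
Before fine-tuning, the label-feature mapping above has to be flattened into the (text, label id) pairs a classifier dataset consumes. A minimal sketch, using the toy labels and texts from the example above:

```python
# Flatten the {label: text} example above into (text, label_id) pairs.
# Label names are mapped to integer ids, which is what the model's
# output layer actually predicts.
raw = {
    "contract": "contract text",
    "not a contract": "some other text",
}

label2id = {label: i for i, label in enumerate(sorted(raw))}
pairs = [(text, label2id[label]) for label, text in raw.items()]

print(label2id)  # {'contract': 0, 'not a contract': 1}
print(pairs)     # [('contract text', 0), ('some other text', 1)]
```

A real dataset would hold many documents per label, but the shape of the data, text paired with an integer label, is the same.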

Transformer Encoder

BERT (in its base configuration) has:

  • 12 transformer encoder layers
  • 12 attention heads
  • a hidden size of 768
  • 110M parameters

Attention here means self-attention, a mechanism that lets each token in a sequence weigh every other token, and a technology that helps neural networks converge faster. Converge means find the coefficients that minimize the loss function.
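
Self-attention can be sketched numerically: each token's query is scored against every token's key, the scores are softmaxed into weights, and the output is a weighted sum of the values. A toy version with made-up 2-dimensional vectors (real BERT uses 768-dimensional vectors produced by learned projection matrices):

```python
import math

# Toy self-attention over 3 tokens with made-up 2-d query/key/value
# vectors. Real BERT learns the projections that produce Q, K, V.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = 2  # vector dimension

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

outputs = []
for q in Q:
    # scaled dot-product scores of this query against every key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    weights = softmax(scores)  # attention weights sum to 1
    # each token's output mixes in every token's value vector
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]
    outputs.append(out)

print(outputs[0])  # token 0's value, blended with the other tokens'
```

The softmax weights are what "attending" means: tokens with higher query-key similarity contribute more to each output.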

It learns by reading input through the encoder stack in both directions, left-to-right and right-to-left; this bidirectionality is the B in BERT.

Each sentence fed into it is converted to a sequence of tokens. A token can be any of:

  • words in the sentence
  • the [CLS] classification token. Its representation is updated as the input passes through the layers; again, this is a neural network concept, the model going through the variables and fine-tuning the coefficients over multiple passes.
  • [SEP] token separator
  • optional [PAD] tokens used to pad the sequence to a specific length

The model is pre-trained on the words found in millions of documents. As a programmer, you feed it your label-feature dataset. Note that the pre-trained BERT data is unlabelled, as it just models how words and sentences go together in billions of examples.

You feed data into the encoder. This splits sentences into words and adds the elements we mentioned above.
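
The token assembly described above can be sketched with a toy tokenizer. Real BERT uses WordPiece subword splitting rather than whitespace, and the maximum length of 8 here is made up for illustration:

```python
# Toy version of BERT-style token assembly: [CLS] + words + [SEP],
# padded with [PAD] to a fixed length. Real BERT uses WordPiece
# subword tokenization instead of whitespace splitting.
MAX_LEN = 8  # made-up fixed sequence length

def encode(sentence, max_len=MAX_LEN):
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    tokens += ["[PAD]"] * (max_len - len(tokens))  # pad short inputs
    return tokens[:max_len]                        # truncate long ones

print(encode("This is a contract"))
# ['[CLS]', 'this', 'is', 'a', 'contract', '[SEP]', '[PAD]', '[PAD]']
```

Every sequence fed to the encoder has this shape: a leading [CLS], the sentence tokens, a [SEP], and padding out to a fixed length.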

Use Cases

Besides contracts, there are several document-level use cases where this guide can be applied, such as:

  1. Legal documents—this approach can be used to classify legal documents such as court judgments, case laws, or patent applications based on their legal topics. It can help legal firms to automate document categorization, discovery, and management, thereby saving time and improving the efficiency of the legal process.
  2. Medical documents—the BERT-based model can also be trained to classify medical documents such as clinical trial reports, electronic health records, or medical research papers based on their disease areas or treatment approaches. This can help healthcare providers to manage their medical documents, identify patterns, and optimize patient care.
  3. Financial documents—financial documents such as annual reports, financial statements, or loan agreements can also be classified using this approach. This can help financial institutions to automate document processing, reduce manual errors, and improve the speed and accuracy of their financial decision-making process.

How Avro Facilitates this Classification and Analysis

Avro is a data serialization format that is optimized for storing and processing large datasets, and is commonly used in big data environments. It supports schema evolution, which allows the schema of the data to evolve over time without breaking compatibility with existing code or data. Avro also supports efficient compression and supports a variety of programming languages.

GCS (Google Cloud Storage) is a highly scalable and durable object storage system that allows you to store and retrieve data in a variety of formats, including Avro. GCS is designed to handle large volumes of data and supports features such as versioning, access control, and data encryption. This makes it a good choice for storing and managing big data, including document collections, and to take advantage of GCP's advanced analytics and machine learning tools.

In the context of document big data and NLP analysis, Avro on GCS can be used to store and analyze large volumes of unstructured text data, such as web pages, news articles, and legal documents. You can use NLP techniques to extract insights from the text data, such as sentiment analysis, named entity recognition, and topic modeling. By storing the data in Avro format on GCS, you can take advantage of Avro's efficient compression and schema evolution capabilities, and easily process the data using tools such as Apache Spark, which has built-in support for Avro.
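
As an illustration, a hypothetical Avro schema for labeled contract documents might look like the following. The record and field names here are assumptions for the sketch, not Law Insider's actual schema:

```json
{
  "type": "record",
  "name": "LabeledDocument",
  "fields": [
    {"name": "doc_id", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "label", "type": {"type": "enum",
                               "name": "Label",
                               "symbols": ["CONTRACT", "NOT_A_CONTRACT"]}}
  ]
}
```

Because the schema travels with the data, fields can be added later (schema evolution) without breaking readers of the older records.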

Code and BERT Explained

Transformer Models

Transformer models are pre-trained models that you can download to save training costs. They work with the most popular machine learning APIs: TensorFlow, PyTorch, and JAX. In other words, they instantiate models of these types with the first steps of training, meaning the first layers (in neural network terms), already done. Of course, they are not trained on your dataset; they just save time by doing the preliminary steps, to which you add additional layers through training.

BERT is a model designed for natural language processing. For LawInsider we use it for classification; in the example on this page, text is classified as contract or not a contract.
