Working with Avro

You can download the Law Insider data to different file formats. Here we show how to work with an Avro file.

The Avro files may be placed on Amazon S3 or Azure Blob Storage, so they are available to a customer regardless of which public cloud the customer uses.
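
For example, if a shard has been delivered to S3, you could pull it down with boto3 before reading it locally. This is a minimal sketch; the bucket and key names are placeholders.

import boto3

s3 = boto3.client('s3')

# Placeholder bucket and key: substitute the location your shards were delivered to.
s3.download_file(
    'law-insider-exports',
    'avro/auto-contractfinder-20210512..20211129-00000-of-00128',
    'auto-contractfinder-20210512..20211129-00000-of-00128',
)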

We convert a single record to JSON so that we can see the full list of attributes.

The goal is to show a simple example of how to work with Avro files.

Avro is just one file format you can use. Google Cloud BigQuery can export Law Insider data to several formats, including CSV, newline-delimited JSON, Parquet, and Avro.
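
For reference, here is a minimal sketch of exporting a BigQuery table to Avro files on Cloud Storage with the google-cloud-bigquery client; the project, dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: substitute your own project, dataset, table, and bucket.
source_table = "my-project.law_insider.contracts"
destination_uri = "gs://my-bucket/contracts-*.avro"

# Export the table as Avro files; the * in the URI lets BigQuery write multiple shards.
job_config = bigquery.ExtractJobConfig(destination_format=bigquery.DestinationFormat.AVRO)
extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export job to finish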

The document we look at is the Dedoco Partner Agreement.

Attributes Explained

Here are the attributes in the exported data. They are, of course, largely the same as the original BigQuery attributes, but the Avro export adds a few that are unique to its data structure.

hash, language, namespace, metadata, title, category, intro, content_type, body, clauses, definitions, text, labels, entities, models

Attribute: Definition
clauses: An array of clauses.
body: Whole document in HTML format.
title: For this document it is Dedoco Partner Agreement.
text: Whole document in regular text format.
namespace: Empty.
entities: Empty.
category: Examples: {'forbearance agreement', 'amendment number', 'tips vendor agreement', 'preferential trade agreements', 'limited assignment distribution agreement', 'global note', 'customer contract', 'binding arbitration agreement', 'note', 'letter of intent', 'student data privacy agreement', 'financial agreement', 'facility agreements', 'maintenance contract', 'the agreement', 'credit agreement', 'deed of trust', ...
labels: contract type, country, jurisdiction, industry, company, law firm, SEC Filing ID, SEC Exhibit ID.
definitions: Empty for agreements.
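
To make the attributes concrete, here is a minimal sketch of reading the first record from the shard used later in this article and pulling out a few of them; it assumes fastavro, which is installed further down.

from fastavro import reader

with open('auto-contractfinder-20210512..20211129-00000-of-00128', 'rb') as f:
    record = next(reader(f))  # first document in the shard

print(record['title']['text'])    # e.g. "Dedoco Partner Agreement"
print(record['category'])         # e.g. "dedoco partner agreement"
print(record['metadata']['url'])  # URL the document was crawled from
print(len(record['clauses']))     # clauses is an array of clauses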

JSON Output

Here is a JSON representation of a single agreement. We removed the body and text attributes, as those are very large; they contain the complete document in HTML and plain text.

{
  "hash": "75b8P6w7697",
  "language": "en",
  "namespace": null,
  "metadata": {
    "url": "https://uploads-ssl.webflow.com/606d53c3050238e56d1690ef/60bae332207098594ae2a759_Dedoco%20Partner%20Agreement%20(final%20draft)_w%20logo%20(1)%20(DNN%2027%20May%2021)%20cln.pdf",
    "id": null,
    "source": null,
    "keyword": "subreseller agreement (pdf OR docx) -sec.gov -lawinsider.com (\"\" OR amendment OR indenture OR guarantee) (\"\" OR \"shall mean\" OR \"shall terminate\" OR \"shall be construed\") (\"\" OR \"governing law\" OR miscellaneous OR severability) (\"\" OR exhibit OR witnesses)",
    "query_id": "user-searches",
    "search_engine": null,
    "filing_id": null,
    "company_cik": null,
    "company_name": null,
    "company_sic": null,
    "filename": null,
    "doc_type": null,
    "filing_date": null,
    "filing_type": null,
    "creation_date": "20210603",
    "crawl_date": "20210818"
  },
  "title": {
    "text": "Dedoco Partner Agreement",
    "location": {
      "csspaths": [
        "p:nth-child(2)"
      ],
      "offsets": [
        [
          468,
          492
        ]
      ]
    }
  },
  "category": "dedoco partner agreement",
  "intro": {
    "text": "You can bec......",

Python Avro APIs

There are two different Python APIs: fastavro and the official Apache Avro library.


Compression Codec

Avro data files can be written with different compression codecs. Note that you might have to install Snappy (for fastavro, the python-snappy package) or another codec on your machine to read files compressed with it.
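
As a minimal sketch, you can check which codec a shard was written with before processing it. This uses the fastavro reader installed in the next section and the shard file from the examples below; if the codec turns out to be snappy, install the python-snappy package first.

from fastavro import reader

with open('auto-contractfinder-20210512..20211129-00000-of-00128', 'rb') as f:
    avro_reader = reader(f)
    # 'avro.codec' is part of the standard Avro file metadata; 'null' means no compression.
    print(avro_reader.metadata.get('avro.codec', 'null'))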

Fast Avro

To use Fast Avro install:

pip install fastavro

Here is an example. We remove the body and text attributes so the output is not so large.

This program opens the Avro file and loops through the first 5 records. For each record it adds an entry to the dictionary j, using the hash attribute as the key and the record as the value; the hash is the unique key in the Avro file. It then prints the last record read as formatted JSON.

from fastavro import reader
import json

j = {}

# Open the Avro shard and read the first 5 records.
with open('auto-contractfinder-20210512..20211129-00000-of-00128', 'rb') as f:
    for i, doc in enumerate(reader(f)):
        # Drop the large body and text attributes to keep the output small.
        doc.pop('body', None)
        doc.pop('text', None)
        # The hash attribute is the unique key in the Avro file.
        j[doc['hash']] = doc
        if i == 4:
            break

# Print the last record read as formatted JSON.
print(json.dumps(doc, indent=2))
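
Once the dictionary is built, you can look up a record directly by its hash; for example, j['75b8P6w7697'] would return the Dedoco Partner Agreement shown in the JSON output above, assuming that record is among the first five in the shard.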

Apache Avro

To use Apache Avro install:

python3 -m pip install avro

Here we use the Avro DataFileReader to read an Avro data file. Since the Avro shard can be very large, we just print one record. Notice also that we print the schema.

We:

  1. Open the file, then use the __next__() method to read a record.
  2. Print the schema using reader.meta['avro.schema'].decode('utf-8').

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Open the Avro shard with DataFileReader.
reader = DataFileReader(
    open("auto-contractfinder-20210512..20211129-00000-of-00128", "rb"), DatumReader()
)

# The shard can be very large, so print just one record.
print(reader.__next__())

# The writer schema is stored in the file metadata.
print("schema " + reader.meta['avro.schema'].decode('utf-8'))

reader.close()