Working with Avro
You can download the Law Insider data in different file formats. Here we show how to work with an Avro file.
The Avro files may be placed on S3 in AWS or Azure Blob Storage so they are available to a customer regardless of the public cloud that the customer uses.
We convert a single record to JSON so that we can see the full list of attributes.
The goal is to show a simple example of how to work with Avro files.
Avro is just one file format you can use. You could export Law Insider data from Google Cloud BigQuery to other formats as well.
The document we look at is this agreement: Dedoco Partner Agreement.
Attributes Explained
Here are the attributes in the exported data. They are largely the same as the original BigQuery attributes, but the Avro export adds some that are specific to the Avro data structure.
hash, language, namespace, metadata, title, category, intro, content_type, body, clauses, definitions, text, labels, entities, models
Attribute | Definition |
---|---|
clauses | An array of clauses. |
body | Whole document in HTML format. |
title | For this document it is Dedoco Partner Agreement |
text | Whole document in regular text format. |
namespace | empty |
entities | empty |
category | examples: {'forbearance agreement', 'amendment number', 'tips vendor agreement', 'preferential trade agreements', 'limited assignment distribution agreement', 'global note', 'customer contract', 'binding arbitration agreement', 'note', 'letter of intent', 'student data privacy agreement', 'financial agreement', 'facility agreements', 'maintenance contract', 'the agreement', 'credit agreement', 'deed of trust', ... |
labels | contract type, country, jurisdiction, industry, company, law firm, SEC Filing ID, SEC Exhibit ID |
definitions | Empty for agreements. |
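To check how a given record lines up with the attribute list above, a small comparison like the following may help. This is only a sketch: `compare_attributes` is a hypothetical helper name, and the sample record is a made-up fragment, not real exported data.

```python
# Top-level attribute names, copied from the list above.
EXPECTED_ATTRIBUTES = [
    "hash", "language", "namespace", "metadata", "title", "category",
    "intro", "content_type", "body", "clauses", "definitions", "text",
    "labels", "entities", "models",
]


def compare_attributes(doc, expected=EXPECTED_ATTRIBUTES):
    """Return (missing, unexpected) top-level attribute names for one record."""
    present = set(doc)
    wanted = set(expected)
    return sorted(wanted - present), sorted(present - wanted)


# Made-up partial record, for illustration only.
sample = {"hash": "75b8P6w7697", "language": "en", "namespace": None, "extra_field": 1}
missing, unexpected = compare_attributes(sample)
print("missing:", missing)
print("unexpected:", unexpected)
```

Running the same comparison on a real record would show which attributes, if any, appear only in the Avro export.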
JSON Output
Here is a JSON representation of a single agreement. We removed the HTML and text attributes as those are very large. They contain the complete document.
{
  "hash": "75b8P6w7697",
  "language": "en",
  "namespace": null,
  "metadata": {
    "url": "https://uploads-ssl.webflow.com/606d53c3050238e56d1690ef/60bae332207098594ae2a759_Dedoco%20Partner%20Agreement%20(final%20draft)_w%20logo%20(1)%20(DNN%2027%20May%2021)%20cln.pdf",
    "id": null,
    "source": null,
    "keyword": "subreseller agreement (pdf OR docx) -sec.gov -lawinsider.com (\"\" OR amendment OR indenture OR guarantee) (\"\" OR \"shall mean\" OR \"shall terminate\" OR \"shall be construed\") (\"\" OR \"governing law\" OR miscellaneous OR severability) (\"\" OR exhibit OR witnesses)",
    "query_id": "user-searches",
    "search_engine": null,
    "filing_id": null,
    "company_cik": null,
    "company_name": null,
    "company_sic": null,
    "filename": null,
    "doc_type": null,
    "filing_date": null,
    "filing_type": null,
    "creation_date": "20210603",
    "crawl_date": "20210818"
  },
  "title": {
    "text": "Dedoco Partner Agreement",
    "location": {
      "csspaths": [
        "p:nth-child(2)"
      ],
      "offsets": [
        [
          468,
          492
        ]
      ]
    }
  },
  "category": "dedoco partner agreement",
  "intro": {
    "text": "You can bec......",
Python Avro APIs
There are two different Python APIs: Fast Avro (the fastavro package) and Apache Avro (the avro package).
Compression Codec
There are different compression formats for Avro data files. Note that you might have to install Snappy, or whichever other codec the file was written with, on your machine before you can read it.
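To find out which codec a file uses before installing anything, you can read the file header directly. The sketch below follows the Avro object container layout (four magic bytes, then a metadata map of string keys to byte values) and uses only the standard library. The function names are hypothetical, and the sample header is hand-built for illustration rather than taken from a Law Insider shard.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_header_metadata(stream):
    """Return the header metadata map (str -> bytes) of an Avro data file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            read_long(stream)
        for _ in range(count):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    return meta


def codec_of(stream):
    """Codec name from the header; a missing entry means the null codec."""
    return read_header_metadata(stream).get("avro.codec", b"null").decode("utf-8")


# Hand-built header for illustration: two metadata entries, then a 16-byte sync marker.
sample_header = (
    MAGIC
    + b"\x04"
    + b"\x14avro.codec" + b"\x0csnappy"
    + b"\x16avro.schema" + b"\x10\"string\""
    + b"\x00"
    + bytes(16)
)
print(codec_of(io.BytesIO(sample_header)))
```

With a real shard, passing a file opened in 'rb' mode to codec_of should report the codec without needing any decompression library.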
Fast Avro
To use Fast Avro install:
pip install fastavro
Here is an example. We removed the body and text attributes so the output would not be so large.
This program opens the Avro file, then loops through the first 5 records. For each record it adds an entry to the dictionary j, using the hash attribute as the key and the document as the value. The hash is the unique key in the Avro file. It also prints each record as JSON.
import json

from fastavro import reader

# Map each record's hash (the unique key) to the full record.
j = {}
with open('auto-contractfinder-20210512..20211129-00000-of-00128', 'rb') as f:
    for i, doc in enumerate(reader(f)):
        # Stop after the first 5 records.
        if i == 5:
            break
        j[doc['hash']] = doc
        print(json.dumps(doc, indent=2))
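The listing above prints each record in full. One way to leave out the large body and text attributes before printing is sketched below. `without_large_fields` is a hypothetical helper, and the sample dictionary is a made-up stand-in for a record yielded by the reader.

```python
import json


def without_large_fields(doc, large_fields=("body", "text")):
    """Return a shallow copy of a record without the named top-level attributes."""
    return {key: value for key, value in doc.items() if key not in large_fields}


# Made-up stand-in for one record; real records carry many more attributes.
sample = {
    "hash": "75b8P6w7697",
    "title": {"text": "Dedoco Partner Agreement"},
    "body": "<p>placeholder for the HTML document</p>",
    "text": "placeholder for the plain-text document",
}
trimmed = without_large_fields(sample)
print(json.dumps(trimmed, indent=2))
```

In the loop above, this would mean printing json.dumps(without_large_fields(doc), indent=2) instead of the whole record. The original record is left unchanged.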
Apache Avro
To use Apache Avro install:
python3 -m pip install avro
Here we use the Avro DataFileReader to read an Avro data file. Since the Avro shard can be very large, we print just one record. Notice also that we print the schema.
We:
- Open the file, then use the __next__() method to read a record from it.
- Print the schema using reader.meta['avro.schema'].decode('utf-8').
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Open the shard; DatumReader decodes records using the schema stored in the file.
reader = DataFileReader(open("auto-contractfinder-20210512..20211129-00000-of-00128", "rb"), DatumReader())
# Read only the first record; next() calls the reader's __next__() method.
for i in range(1):
    print(next(reader))
# The schema is stored as bytes in the file metadata.
print("schema " + reader.meta['avro.schema'].decode('utf-8'))
reader.close()
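The schema printed above is a JSON string, so you can list its top-level field names with the standard json module. The sketch below assumes the top-level schema is a record type with a fields array. The schema shown is an abbreviated, hypothetical stand-in rather than the real Law Insider schema, and `field_names` is an illustrative helper name.

```python
import json

# Hypothetical, abbreviated schema. With a real file, the bytes would come
# from reader.meta['avro.schema'].
schema_bytes = b"""
{
  "type": "record",
  "name": "Document",
  "fields": [
    {"name": "hash", "type": "string"},
    {"name": "language", "type": ["null", "string"]},
    {"name": "category", "type": ["null", "string"]}
  ]
}
"""


def field_names(raw_schema):
    """Return the top-level field names of a record schema given as bytes."""
    schema = json.loads(raw_schema.decode("utf-8"))
    return [field["name"] for field in schema["fields"]]


print(field_names(schema_bytes))
```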