Avro Data Schema

Below is the schema for the Avro data file, followed by a sample data record. Both are in JSON format. The schema gives each attribute's name, type, and, where available, description. We also show how to extract the schema from a data file using code.

Avro Schema

{
	"name": "ParsedDocument",
	"type": "record",
	"doc": "A representation of a document body.",
	"fields": [{
		"name": "hash",
		"doc": "Document hash calculated from original body.",
		"type": "string"
	}, {
		"name": "language",
		"doc": "Document language code.",
		"type": "string"
	}, {
		"name": "namespace",
		"doc": "Document namespace (id of the document's owner).",
		"type": ["null", "string"]
	}, {
		"name": "metadata",
		"type": {
			"name": "Metadata",
			"type": "record",
			"doc": "Normalized known document metadata, if any.",
			"fields": [{
				"name": "url",
				"type": ["null", "string"]
			}, {
				"name": "id",
				"type": ["null", "string"],
				"doc": "Unique ID within this source."
			}, {
				"name": "source",
				"type": ["null", {
					"name": "Source",
					"type": "enum",
					"symbols": ["sec.gov"]
				}]
			}, {
				"name": "keyword",
				"type": ["null", "string"]
			}, {
				"name": "query_id",
				"type": ["null", "string"]
			}, {
				"name": "search_engine",
				"type": ["null", "string"]
			}, {
				"name": "filing_id",
				"type": ["null", "string"]
			}, {
				"name": "company_cik",
				"type": ["null", "int"]
			}, {
				"name": "company_name",
				"type": ["null", "string"]
			}, {
				"name": "company_sic",
				"type": ["null", "int"]
			}, {
				"name": "filename",
				"type": ["null", "string"]
			}, {
				"name": "doc_type",
				"type": ["null", "string"]
			}, {
				"name": "filing_date",
				"type": ["null", "string"]
			}, {
				"name": "filing_type",
				"type": ["null", "string"]
			}, {
				"name": "creation_date",
				"doc": "Document creation date based on file metadata.",
				"type": ["null", "string"]
			}, {
				"name": "crawl_date",
				"doc": "Date when this document has been downloaded from given URL.",
				"type": ["null", "string"]
			}]
		}
	}, {
		"name": "title",
		"doc": "Document title.",
		"type": ["null", {
			"name": "DocumentText",
			"type": "record",
			"fields": [{
				"name": "text",
				"doc": "Text string extracted from the body, often from multiple nodes.",
				"type": "string"
			}, {
				"name": "location",
				"type": {
					"name": "TextLocation",
					"type": "record",
					"fields": [{
						"name": "csspaths",
						"doc": "CSS selectors that can be used to get each node of this text.",
						"type": {
							"type": "array",
							"items": {
								"type": "string"
							}
						}
					}, {
						"name": "offsets",
						"doc": "List of integer spans with offsets of each text node in the body.",
						"type": {
							"type": "array",
							"items": {
								"type": "array",
								"items": {
									"type": "int"
								}
							}
						}
					}]
				}
			}]
		}]
	}, {
		"name": "category",
		"doc": "Type of contract extracted from title, sometimes matches the title exactly.",
		"type": ["null", "string"]
	}, {
		"name": "intro",
		"doc": "Introductory paragraph or empty string. In contract body appears before first clause.",
		"type": ["null", "DocumentText"]
	}, {
		"name": "content_type",
		"doc": "Content type of original document, e.g. application/pdf.",
		"type": "string"
	}, {
		"name": "body",
		"doc": "Sanitized HTML body.",
		"type": "string"
	}, {
		"name": "clauses",
		"type": {
			"type": "array",
			"items": {
				"name": "Clause",
				"type": "record",
				"fields": [{
					"name": "id",
					"type": "int",
					"doc": "Clause ID assigned by parser, can be used to match children with parents."
				}, {
					"name": "key",
					"type": "string",
					"doc": "Clause slug built from title combined with all clause parents and delimited with slash. Can be empty."
				}, {
					"name": "level",
					"type": "int",
					"doc": "Level at which the clause is located in the document. Zero for root, one for sub-clause of a root, etc."
				}, {
					"name": "title",
					"type": ["null", "DocumentText"],
					"doc": "Clause title as it appears in contract, with all special characters and original case preserved. Can be empty."
				}, {
					"name": "snippet",
					"type": ["null", "DocumentText"],
					"doc": "Snippet content as it appears in contract. Everything after the title. Can be empty."
				}, {
					"name": "prefix",
					"type": "string",
					"doc": "Special prefix from the source attribute. Can be empty."
				}, {
					"name": "source",
					"type": "string",
					"doc": "Hierarchical value of the clause which can be represented as an arabic number, roman number, alphanumeric, etc. Appears just before the title. Can be empty."
				}, {
					"name": "type",
					"type": "int",
					"doc": "Byte mask with types of clause source: num=1, roman=2, alpha=4, definition=8, title=16, subtitle=32, part=64, chapter=128, subchapter=256, section=512, article=1024"
				}, {
					"name": "integer",
					"type": "int",
					"doc": "Numeric representation of clause source."
				}, {
					"name": "parent_id",
					"type": "int",
					"doc": "ID of parent clause, in case this clause is a sub clause of another one. Zero for root."
				}]
			}
		}
	}, {
		"name": "definitions",
		"type": {
			"type": "array",
			"items": {
				"name": "Definition",
				"type": "record",
				"fields": [{
					"name": "id",
					"type": "int"
				}, {
					"name": "title",
					"type": ["null", "DocumentText"]
				}, {
					"name": "snippet",
					"type": ["null", "DocumentText"]
				}]
			}
		}
	}, {
		"name": "text",
		"doc": "Plain text which represents the document body without HTML.",
		"type": "string"
	}, {
		"name": "labels",
		"doc": "Labels that apply to entire document. Map keys are generally label types, e.g. 'TAG/jurisdiction' or 'contract'.",
		"default": {},
		"type": {
			"type": "map",
			"values": {
				"name": "LabelEntity",
				"type": "record",
				"fields": [{
					"name": "probability",
					"type": ["null", "float"]
				}, {
					"name": "value",
					"doc": "Value of the label, e.g. 'Virginia/US'.",
					"type": ["null", "string"]
				}, {
					"name": "model_id",
					"doc": "Id of the model used to predict the label.",
					"type": "string"
				}]
			}
		}
	}, {
		"name": "entities",
		"doc": "Named entity spans found in this document.",
		"default": [],
		"type": {
			"type": "array",
			"items": {
				"name": "NamedEntity",
				"type": "record",
				"fields": [{
					"name": "label",
					"doc": "Label of the entity, e.g. 'DLP/PERSON_NAME'.",
					"type": "string"
				}, {
					"name": "location",
					"doc": "Character offsets of this entity in the text.",
					"type": {
						"name": "Range",
						"type": "record",
						"fields": [{
							"name": "start",
							"type": "int"
						}, {
							"name": "end",
							"type": "int"
						}]
					}
				}, {
					"name": "probability",
					"type": ["null", "float"]
				}, {
					"name": "model_id",
					"doc": "Id of the model used to predict the entity.",
					"type": "string"
				}]
			}
		}
	}, {
		"name": "models",
		"doc": "Map of ids to versions of the models used to predict labels and/or entities.",
		"default": {},
		"type": {
			"type": "map",
			"values": "string"
		}
	}]
}
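
The "type" field of a Clause packs several flags into one integer. The following is a minimal sketch of decoding it, assuming the values listed in that field's doc string are independent bit flags that can be combined. The helper name decode_clause_type is hypothetical and not part of any library.

```python
# Flag values copied from the "type" field's doc string in the schema above.
CLAUSE_TYPE_FLAGS = {
    1: "num",
    2: "roman",
    4: "alpha",
    8: "definition",
    16: "title",
    32: "subtitle",
    64: "part",
    128: "chapter",
    256: "subchapter",
    512: "section",
    1024: "article",
}


def decode_clause_type(mask):
    """Return the names of all flags set in a clause's "type" value.

    Hypothetical helper: assumes the listed values are combinable bit flags.
    """
    return [name for bit, name in CLAUSE_TYPE_FLAGS.items() if mask & bit]


if __name__ == "__main__":
    # 1 is the value used by the numbered clauses in the sample record.
    print(decode_clause_type(1))
    # 1025 is a made-up value combining the "num" and "article" flags.
    print(decode_clause_type(1025))
```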

Avro Sample Data Record

{
	"hash": "7sHebI7fKeI",
	"language": "en",
	"namespace": null,
	"metadata": {
		"url": "https://lawyersalliance.org/userFiles/uploads/legal_alerts/Rollback_Employee_Benefits_updated_March_2020.pdf",
		"id": null,
		"source": null,
		"keyword": "Alliance-Related Employee means any employee of agreement (pdf OR docx) -sec.gov -lawinsider.com (\"\" OR indenture OR guarantee) (\"\" OR \"shall be construed\" OR \"shall for all purposes\") (\"\" OR \"governing law\") (\"\" OR exhibit)",
		"query_id": "user-copies",
		"search_engine": null,
		"filing_id": null,
		"company_cik": null,
		"company_name": null,
		"company_sic": null,
		"filename": null,
		"doc_type": null,
		"filing_date": null,
		"filing_type": null,
		"creation_date": "20200401",
		"crawl_date": "20210911"
	},
	"title": {
		"text": "Rollback Employee Benefits",
		"location": {
			"csspaths": ["p:nth-child(3)"],
			"offsets": [
				[694, 720]
			]
		}
	},
	"category": null,
	"intro": {
		"text": "Staff may represent the most significant asset of a nonprofit corporation and its most significant expense. As a Board of Directors considers expense reductions it may be forced to consider a reduction in force. There are, however, alternatives to reductions in force that may enable Boards to reduce expenses while preserving staff.",
		"location": {
			"csspaths": ["p:nth-child(5)"],
			"offsets": [
				[1170, 1503]
			]
		}
	},
	"content_type": "application/pdf",
	"body": "(HTML is here)",
	"clauses": [{
		"id": 18,
		"key": "",
		"level": 0,
		"title": null,
		"snippet": {
			"text": "(full text is here)",
			"location": {
				"csspaths": ["p:nth-child(10)"],
				"offsets": [
					[3127, 3229]
				]
			}
		},
		"prefix": "",
		"source": "1",
		"type": 1,
		"integer": 1,
		"parent_id": 0
	}, {
		"id": 20,
		"key": "",
		"level": 0,
		"title": null,
		"snippet": {
			"text": "Is it permissible for the organization to move employees from full to part-time or reduce hours?",
			"location": {
				"csspaths": ["p:nth-child(11)"],
				"offsets": [
					[3450, 3549]
				]
			}
		},
		"prefix": "",
		"source": "2",
		"type": 1,
		"integer": 2,
		"parent_id": 0
	}, {
		"id": 22,
		"key": "",
		"level": 0,
		"title": null,
		"snippet": {
			"text": "Has the organization considered the New York State Department of Labor Shared Work program?",
			"location": {
				"csspaths": ["p:nth-child(12)"],
				"offsets": [
					[3770, 3864]
				]
			}
		},
		"prefix": "",
		"source": "3",
		"type": 1,
		"integer": 3,
		"parent_id": 0
	}, {
		"id": 24,
		"key": "",
		"level": 0,
		"title": null,
		"snippet": {
			"text": "Is it possible for the organization to decrease healthcare or pension costs? Answers:1. Can we reassign employees to new positions even if it means a change in job title and compensation?It may be possible to preserve jobs by reassigning employees to open positions within the organization. New York State is an \"employment at will\" state which means that employees do not have a guarantee of continued employment or a guarantee that job description stays same. This general rule, however, will not apply when an employee has a contract, is covered by a bargaining agreement, or the organization\u2019s handbook states otherwise. Depending upon the terms of the job reassignment, an employee who does not accept the new position may be eligible for unemployment benefits. Also, disability law may require a reasonable accommodation to enable a disabled individual to perform essential functions of new position to which he or she is transferred.",
			"location": {
				"csspaths": ["p:nth-child(17)", "p:nth-child(13)", "h1:nth-child(15)", "h1:nth-child(16)", "p:nth-child(17)", "h1:nth-child(19)", "p:nth-child(20)", "p:nth-child(23)>a", "h1:nth-child(25)", "p:nth-child(26)>a", "h1:nth-child(28)", "p:nth-child(29)", "p:nth-child(31)", "p:nth-child(33)", "p:nth-child(35)>a", "p:nth-child(36)>a", "p:nth-child(38)", "p:nth-child(41)>a"],
				"offsets": [
					[5122, 5885],
					[4083, 4162],
					[4609, 4617],
					[4817, 4919],
					[5122, 5885],
					[6316, 6415],
					[6618, 6909],
					[7947, 8044],
					[8639, 8733],
					[9197, 9413],
					[10546, 10625],
					[10828, 11219],
					[11653, 11921],
					[12355, 12422],
					[13284, 13471],
					[14932, 14989],
					[16408, 17190],
					[17920, 17922]
				]
			}
		},
		"prefix": "",
		"source": "4",
		"type": 1,
		"integer": 4,
		"parent_id": 0
	}],
	"definitions": [],
	"text": "(full text is here)",
	"labels": {
		"not_contract": {
			"probability": 4.260831832885742,
			"value": null,
			"model_id": "is_contract"
		}
	},
	"entities": [],
	"models": {
		"is_contract": "v2"
	}
}
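
The "id" and "parent_id" fields of each clause can be used to rebuild the clause hierarchy. The following is a minimal sketch, assuming a record has already been loaded as a Python dict, that a "parent_id" of 0 marks a root clause (as the schema states), and that clause ids are non-zero. The function names are hypothetical, and clause 25 is an invented sub-clause added for illustration, since the sample record contains root clauses only.

```python
from collections import defaultdict


def build_clause_tree(clauses):
    """Group clause ids by parent id.

    Returns a dict mapping parent_id -> list of child clause ids in document
    order. Key 0 holds the root clauses (parent_id is zero for roots).
    """
    children = defaultdict(list)
    for clause in clauses:
        children[clause["parent_id"]].append(clause["id"])
    return dict(children)


def print_tree(children, parent_id=0, depth=0):
    """Print clause ids indented by nesting depth (assumes non-zero ids)."""
    for clause_id in children.get(parent_id, []):
        print("  " * depth + str(clause_id))
        print_tree(children, clause_id, depth + 1)


# Trimmed-down clauses from the sample record, plus a hypothetical child (25).
clauses = [
    {"id": 18, "level": 0, "parent_id": 0},
    {"id": 20, "level": 0, "parent_id": 0},
    {"id": 22, "level": 0, "parent_id": 0},
    {"id": 24, "level": 0, "parent_id": 0},
    {"id": 25, "level": 1, "parent_id": 24},  # hypothetical, not in the sample
]

tree = build_clause_tree(clauses)
print_tree(tree)
```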

Extract Schema from Data

Avro data files include the schema they were written with, so you can extract the schema directly from the data. Here is an example of how to do that in Python:

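from avro.datafile import DataFileReader
from avro.io import DatumReader

# Path to the Avro data file; replace the placeholder with your own file.
avro_path = "<avrofile>"

# The schema is stored in the file header metadata under "avro.schema".
reader = DataFileReader(open(avro_path, "rb"), DatumReader())
schema_json = reader.meta["avro.schema"].decode("utf-8")
reader.close()

print("schema " + schema_json)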
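
If the avro package is not installed, the schema can also be read straight from the file header. The following is a minimal sketch using only the Python standard library, based on the object container file layout in the Avro specification: a 4-byte magic value ("Obj" followed by the byte 1), then a metadata map whose "avro.schema" entry holds the schema JSON. The function names are hypothetical and error handling is kept minimal.

```python
import io


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Decode one length-prefixed Avro bytes/string value."""
    return stream.read(read_long(stream))


def read_header_schema(stream):
    """Return the schema JSON text stored in an Avro container file header."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # a negative count is followed by the block size in bytes
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta["avro.schema"].decode("utf-8")


# Demonstration on a tiny hand-built header whose schema is just "string":
# magic, one map entry (key length 11, value length 8), end of map, sync marker.
demo_header = (
    b"Obj\x01" + b"\x02" + b"\x16avro.schema" + b"\x10" + b'"string"'
    + b"\x00" + bytes(16)
)
print(read_header_schema(io.BytesIO(demo_header)))
```

For a real data file, open it in binary mode and pass the file object to read_header_schema in place of the in-memory demo header.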