The Esheria LexChat Kenyan Legal Corpus represents a groundbreaking advancement in the field of legal artificial intelligence (AI) within the African context. As the first and only comprehensive corpus of Kenyan legislative and judicial documents, it stands as a testament to the growing intersection of law and technology in the region. The corpus, which comprises over 350,000 texts amounting to more than 80 million lines and 1.2 billion tokens, is a critical asset for training and fine-tuning natural language processing (NLP) models tailored to the Kenyan legal domain. This article provides an in-depth, technical exploration of the corpus's creation, structure, and potential applications, drawing parallels with similar efforts in other jurisdictions.
The creation of the Esheria LexChat Kenyan Legal Corpus was a complex and highly technical endeavour, necessitating a multifaceted approach to data collection, processing, and annotation. The corpus includes every statute and regulation currently in force in Kenya, as well as thousands of bills and hundreds of thousands of court and tribunal decisions. Key legal sources such as the Kenya Law Reports from the National Council for Law Reporting, and other essential legal repositories within the country were meticulously harvested to build this extensive dataset.
The technical challenges of assembling a corpus of this magnitude were significant. Legal documents are typically lengthy, with information dispersed throughout the text. To make automatic processing feasible, documents were divided into topically coherent segments, referred to as Rhetorical Roles (RRs), a methodology inspired by similar efforts in other jurisdictions, such as India.
The initial phase of creating the corpus involved identifying and sourcing relevant legal documents from various repositories. The documents were sourced from PDF files (90.30%), HTML files (7.27%), Word documents (1.42%), and RTFs (0.99%). The tools used for text extraction were chosen based on the format of the source documents:
PDF: For PDF documents, we utilised Tesseract and tesserocr, leveraging optical character recognition (OCR) to extract text from scanned images and PDFs with embedded text.
HTML: Text extraction from HTML files was performed using Inscriptis, a powerful tool for converting HTML content into plain text while preserving the document structure.
Word (DOCX): DOCX files were converted to HTML using Mammoth, after which Inscriptis was again employed to extract the text.
RTF: The striprtf library was used to handle RTF files, converting them to plain text.
This extraction process was meticulously designed to ensure the integrity and completeness of the data, with each tool chosen for its effectiveness in handling specific document formats.
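The per-format tool selection above can be pictured as a simple dispatch table keyed on file extension. The sketch below is illustrative only: the function and table names are hypothetical, and the actual extractor wrappers around Tesseract/tesserocr, Inscriptis, Mammoth, and striprtf are named only in comments.

```python
from pathlib import Path

# Hypothetical dispatch table mirroring the tool choices described above.
# In the real pipeline each entry would be a callable wrapping the named
# library (tesserocr for OCR, inscriptis.get_text for HTML, a Mammoth
# DOCX-to-HTML pass followed by Inscriptis, striprtf for RTF).
EXTRACTORS = {
    ".pdf": "tesseract/tesserocr (OCR)",
    ".html": "inscriptis",
    ".docx": "mammoth -> inscriptis",
    ".rtf": "striprtf",
}

def pick_extractor(path: str) -> str:
    """Return the extraction tool chain for a source document."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported document format: {suffix}")
```

Dispatching on extension keeps the pipeline extensible: supporting a new source format means adding one entry, not touching the extraction logic.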
The Kenyan legal documents were then subjected to a process of segmentation into Rhetorical Roles (RRs), a method adapted from a previous study on Indian legal documents. RRs refer to distinct sections within legal texts that serve different functions, such as summarising the facts of the case, presenting arguments, or delivering the court’s ruling. Inspired by the work of Bhattacharya et al. (2019) and Malik et al. (2021), we defined a set of 12 RRs plus a NONE label for sentences that do not fit into any other category.
The 12 RRs cover the main functional parts of a judgement, including roles such as PREAMBLE, FACTS, ISSUE, ARGUMENTS, ANALYSIS, and RPC (the ruling of the present court), together with the NONE label for sentences that fit none of the categories.
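Segmentation into topically coherent units can be pictured as collapsing consecutive sentences that share an RR label into one segment. A minimal sketch, with illustrative data (not the corpus's actual schema):

```python
from itertools import groupby

def to_segments(labeled):
    """Collapse consecutive (sentence, rr_label) pairs that share a label
    into one topical segment, preserving document order."""
    segments = []
    for label, group in groupby(labeled, key=lambda pair: pair[1]):
        segments.append((label, " ".join(s for s, _ in group)))
    return segments
```

For example, three sentences labelled PREAMBLE, FACTS, FACTS would yield two segments, one per rhetorical role.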
The annotation process was a critical step in ensuring the corpus’s utility for machine learning applications. Given the complexity of legal texts, annotation required a deep understanding of the law and the legal process. To achieve high-quality annotations, we involved legal practitioners, law students, and legal experts in a crowdsourcing effort.
Student Selection and Training
We initiated a call for law students to volunteer for the annotation task, selecting a group of motivated individuals based on their interest and performance in a preliminary screening. These students were onboarded and trained through a custom-designed Massive Open Online Course (MOOC) that covered the basics of AI, the importance of legal AI, and the specific requirements of the annotation task.
Calibration and Annotation
To ensure consistency in annotations, students underwent a calibration process where they annotated a set of documents that had already been annotated by experts. This iterative process helped align the students’ understanding of the RRs with the gold standard. Once calibration was complete, the annotation of the entire corpus commenced.
Each document was annotated by multiple students to ensure reliability, with the final RR label for each sentence determined through a majority voting scheme. In cases where annotators assigned different labels, documents were sent for adjudication by experts, ensuring the highest possible quality of annotations.
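The majority-voting scheme with expert fallback can be sketched in a few lines; the function name is hypothetical, and here a tie simply flags the sentence for adjudication rather than resolving it automatically.

```python
from collections import Counter

def resolve_label(votes):
    """Majority vote over annotator labels for one sentence.

    Returns the winning RR label, or None when the vote is tied,
    signalling that the sentence should go to expert adjudication.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority -> expert adjudication
    return counts[0][0]
```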
Quality Assessment
The quality of annotations was assessed using Fleiss' kappa, a statistical measure of inter-annotator agreement. The overall Fleiss' kappa score for the corpus was 0.59, indicating moderate agreement. The PREAMBLE, RPC, NONE, and ISSUE roles showed high agreement, while ANALYSIS, FACTS, and ARGUMENTS were more challenging to annotate consistently. These insights guided further refinement of the annotation process.
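Fleiss' kappa can be computed directly from the annotation counts. A minimal sketch, assuming every sentence was rated by the same number of annotators (the input is a matrix where `ratings[i][j]` counts the annotators who assigned item `i` to category `j`):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of annotators assigning
    item i to category j; assumes a constant rater count per item."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # raters per item
    k = len(ratings[0])              # number of categories
    # Marginal proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Observed agreement for each item, then averaged.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Expected agreement under chance.
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields 1.0; values near 0.59, as observed for the corpus, fall in the conventional "moderate agreement" band.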
The annotated corpus serves as a foundation for training various machine learning models aimed at automating the understanding and processing of legal documents. We experimented with several baseline models for Rhetorical Role prediction, including transformer-based models like BERT and SciBERT-HSLN.
The task of RR prediction involves automatically assigning an RR to each sentence in a legal document. We framed this as a multi-class sequence prediction problem and experimented with several architectures, from standard transformer classifiers such as BERT to the hierarchical SciBERT-HSLN model.
The confusion matrix for SciBERT-HSLN revealed challenges in correctly classifying the ARGUMENTS role, which was often confused with FACTS and ANALYSIS. PREAMBLE, RPC, NONE, and ISSUE were classified with high accuracy.
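A confusion matrix of this kind is, at its core, just a count of (gold, predicted) label pairs; a minimal sketch of how such an error analysis is tallied:

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs, exposing systematic
    confusions such as ARGUMENTS sentences mislabelled as FACTS."""
    return Counter(zip(gold, pred))
```

Off-diagonal entries with large counts (e.g. `("ARGUMENTS", "FACTS")`) identify exactly the role pairs the model struggles to separate.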
The annotated corpus enabled several downstream tasks, such as legal document summarization and judgement prediction.
Summarization
We explored both extractive and abstractive summarization of court judgements using RRs:
Extractive Summarization: We fine-tuned BERTSUM on the LawBriefs dataset, incorporating RRs to improve summary sentence selection. The BERTSUM RR model, which included RR information, outperformed the baseline BERTSUM model in ROUGE scores, indicating the usefulness of RRs in legal summarization.
Abstractive Summarization: Using the pre-trained Legal Pegasus model, we generated summaries for different RR segments of a document. The Legal Pegasus RR model showed improved performance over the baseline model, demonstrating the value of segmenting legal documents by RRs for abstractive summarization.
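ROUGE scores measure n-gram overlap between a generated summary and a reference. The evaluation above would have used a standard ROUGE toolkit; the sketch below shows only the core ROUGE-1 recall idea, with a simple whitespace tokeniser as a stated simplification.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the
    candidate summary, with clipped counts and case folding."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / max(sum(ref.values()), 1)
```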
Judgement Prediction
We also applied RRs to the task of predicting court judgement outcomes. By filtering training data based on the ANALYSIS role, we improved the prediction performance of an XLNet-based model, highlighting the importance of focusing on specific RRs in legal judgement prediction.
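The RR-based filtering step amounts to selecting only ANALYSIS-labelled sentences before they reach the prediction model; a minimal sketch with a hypothetical function name:

```python
def analysis_only(labeled_sentences):
    """Keep only sentences tagged ANALYSIS as model input, mirroring the
    RR-based filtering used for judgement prediction."""
    return " ".join(s for s, rr in labeled_sentences if rr == "ANALYSIS")
```

Feeding only the court's reasoning to the classifier removes boilerplate (preamble, procedural history) that carries little signal about the outcome.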
The Esheria LexChat Kenyan Legal Corpus is a pioneering effort in the African legal AI landscape. The comprehensive and well-annotated dataset not only advances the field of legal AI in Kenya but also sets the stage for similar initiatives across the continent. While the corpus is currently closed-source and proprietary, we are exploring opportunities to expand this initiative to other African countries and potentially open up the datasets to the public.
Future work will focus on refining the annotation process, improving model performance, and exploring additional applications of the corpus in legal AI. By fostering collaboration and innovation in this emerging field, we aim to enhance access to justice and legal information across Africa.