The Esheria LexChat Kenyan Legal Corpus represents a groundbreaking advancement in the field of legal artificial intelligence (AI) within the African context. As the first and only comprehensive corpus of Kenyan legislative and judicial documents, it stands as a testament to the growing intersection of law and technology in the region. The corpus, which comprises over 350,000 texts amounting to more than 80 million lines and 1.2 billion tokens, is a critical asset for training and fine-tuning natural language processing (NLP) models tailored to the Kenyan legal domain. This article provides an in-depth, technical exploration of the corpus's creation, structure, and potential applications, drawing parallels with similar efforts in other jurisdictions.
The creation of the Esheria LexChat Kenyan Legal Corpus was a complex and highly technical endeavour, necessitating a multifaceted approach to data collection, processing, and annotation. The corpus includes every statute and regulation currently in force in Kenya, as well as thousands of bills and hundreds of thousands of court and tribunal decisions. Key legal sources such as the Kenya Law Reports from the National Council for Law Reporting, and other essential legal repositories within the country were meticulously harvested to build this extensive dataset.
The technical challenges of assembling a corpus of this magnitude were significant. Legal documents are typically lengthy, with information dispersed throughout the text. To make automatic processing feasible, documents were divided into topically coherent segments, referred to as Rhetorical Roles (RRs), a methodology inspired by similar efforts in other jurisdictions, such as India.
The initial phase of creating the corpus involved identifying and sourcing relevant legal documents from various repositories. The documents were sourced from PDF files (90.30%), HTML files (7.27%), Word documents (1.42%), and RTFs (0.99%). The tools used for text extraction were chosen based on the format of the source documents:
PDF: For PDF documents, we utilised Tesseract and tesserocr, leveraging optical character recognition (OCR) to extract text from scanned images and PDFs with embedded text.
HTML: Text extraction from HTML files was performed using Inscriptis, a powerful tool for converting HTML content into plain text while preserving the document structure.
Word (DOCX): DOCX files were converted to HTML using Mammoth, after which Inscriptis was again employed to extract the text.
RTF: The striprtf library was used to handle RTF files, converting them to plain text.
This extraction process was meticulously designed to ensure the integrity and completeness of the data, with each tool chosen for its effectiveness in handling specific document formats.
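The per-format tool selection above can be pictured as a simple dispatch table keyed on file extension. The sketch below is illustrative only: the function and table names are hypothetical, and the actual extractor wrappers around Tesseract/tesserocr, Inscriptis, Mammoth, and striprtf are named only in comments.

```python
from pathlib import Path

# Hypothetical dispatch table mirroring the tool choices described above.
# In the real pipeline each entry would be a callable wrapping the named
# library (tesserocr for OCR, inscriptis.get_text for HTML, a Mammoth
# DOCX-to-HTML pass followed by Inscriptis, striprtf for RTF).
EXTRACTORS = {
    ".pdf": "tesseract/tesserocr (OCR)",
    ".html": "inscriptis",
    ".docx": "mammoth -> inscriptis",
    ".rtf": "striprtf",
}

def pick_extractor(path: str) -> str:
    """Return the extraction tool chain for a source document."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported document format: {suffix}")
```

Dispatching on extension keeps the pipeline extensible: supporting a new source format means adding one entry, not touching the extraction logic.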
The Kenyan legal documents were then subjected to a process of segmentation into Rhetorical Roles (RRs), a method adapted from a previous study on Indian legal documents. RRs refer to distinct sections within legal texts that serve different functions, such as summarising the facts of the case, presenting arguments, or delivering the court’s ruling. Inspired by the work of Bhattacharya et al. (2019) and Malik et al. (2021), we defined a set of 12 RRs plus a NONE label for sentences that do not fit into any other category.
The 12 RRs cover the main functional parts of a judgement, including roles such as PREAMBLE, FACTS, ISSUE, ARGUMENTS, ANALYSIS, and RPC (the ruling of the present court), together with the NONE label for sentences that fit none of the categories.
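Segmentation into topically coherent units can be pictured as collapsing consecutive sentences that share an RR label into one segment. A minimal sketch, with illustrative data (not the corpus's actual schema):

```python
from itertools import groupby

def to_segments(labeled):
    """Collapse consecutive (sentence, rr_label) pairs that share a label
    into one topical segment, preserving document order."""
    segments = []
    for label, group in groupby(labeled, key=lambda pair: pair[1]):
        segments.append((label, " ".join(s for s, _ in group)))
    return segments
```

For example, three sentences labelled PREAMBLE, FACTS, FACTS would yield two segments, one per rhetorical role.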
The annotation process was a critical step in ensuring the corpus’s utility for machine learning applications. Given the complexity of legal texts, annotation required a deep understanding of the law and the legal process. To achieve high-quality annotations, we involved legal practitioners, law students, and legal experts in a crowdsourcing effort.
Student Selection and Training
We initiated a call for law students to volunteer for the annotation task, selecting a group of motivated individuals based on their interest and performance in a preliminary screening. These students were onboarded and trained through a custom-designed Massive Open Online Course (MOOC) that covered the basics of AI, the importance of legal AI, and the specific requirements of the annotation task.
Calibration and Annotation
To ensure consistency in annotations, students underwent a calibration process where they annotated a set of documents that had already been annotated by experts. This iterative process helped align the students’ understanding of the RRs with the gold standard. Once calibration was complete, the annotation of the entire corpus commenced.
Each document was annotated by multiple students to ensure reliability, with the final RR label for each sentence determined through a majority voting scheme. In cases where annotators assigned different labels, documents were sent for adjudication by experts, ensuring the highest possible quality of annotations.
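The majority-voting scheme with expert fallback can be sketched in a few lines; the function name is hypothetical, and here a tie simply flags the sentence for adjudication rather than resolving it automatically.

```python
from collections import Counter

def resolve_label(votes):
    """Majority vote over annotator labels for one sentence.

    Returns the winning RR label, or None when the vote is tied,
    signalling that the sentence should go to expert adjudication.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority -> expert adjudication
    return counts[0][0]
```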
Quality Assessment
The quality of annotations was assessed using Fleiss' kappa, a statistical measure of inter-annotator agreement. The overall Fleiss' kappa score for the corpus was 0.59, indicating moderate agreement. The PREAMBLE, RPC, NONE, and ISSUE roles showed high agreement, while ANALYSIS, FACTS, and ARGUMENTS were more challenging to annotate consistently. These insights guided further refinement of the annotation process.
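Fleiss' kappa can be computed directly from the annotation counts. A minimal sketch, assuming every sentence was rated by the same number of annotators (the input is a matrix where `ratings[i][j]` counts the annotators who assigned item `i` to category `j`):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of annotators assigning
    item i to category j; assumes a constant rater count per item."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # raters per item
    k = len(ratings[0])              # number of categories
    # Marginal proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Observed agreement for each item, then averaged.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Expected agreement under chance.
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields 1.0; values near 0.59, as observed for the corpus, fall in the conventional "moderate agreement" band.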
The annotated corpus serves as a foundation for training various machine learning models aimed at automating the understanding and processing of legal documents. We experimented with several baseline models for Rhetorical Role prediction, including transformer-based models like BERT and SciBERT-HSLN.
The task of RR prediction involves automatically assigning an RR to each sentence in a legal document. We framed this as a multi-class sequence prediction problem and experimented with several architectures, from standard transformer classifiers such as BERT to the hierarchical SciBERT-HSLN model.
The confusion matrix for SciBERT-HSLN revealed challenges in correctly classifying the ARGUMENTS role, which was often confused with FACTS and ANALYSIS. PREAMBLE, RPC, NONE, and ISSUE were classified with high accuracy.
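A confusion matrix of this kind is, at its core, just a count of (gold, predicted) label pairs; a minimal sketch of how such an error analysis is tallied:

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs, exposing systematic
    confusions such as ARGUMENTS sentences mislabelled as FACTS."""
    return Counter(zip(gold, pred))
```

Off-diagonal entries with large counts (e.g. `("ARGUMENTS", "FACTS")`) identify exactly the role pairs the model struggles to separate.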
The annotated corpus enabled several downstream tasks, such as legal document summarization and judgement prediction.
Summarization
We explored both extractive and abstractive summarization of court judgements using RRs:
Extractive Summarization: We fine-tuned BERTSUM on the LawBriefs dataset, incorporating RRs to improve summary sentence selection. The BERTSUM RR model, which included RR information, outperformed the baseline BERTSUM model in ROUGE scores, indicating the usefulness of RRs in legal summarization.
Abstractive Summarization: Using the pre-trained Legal Pegasus model, we generated summaries for different RR segments of a document. The Legal Pegasus RR model showed improved performance over the baseline model, demonstrating the value of segmenting legal documents by RRs for abstractive summarization.
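ROUGE scores measure n-gram overlap between a generated summary and a reference. The evaluation above would have used a standard ROUGE toolkit; the sketch below shows only the core ROUGE-1 recall idea, with a simple whitespace tokeniser as a stated simplification.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the
    candidate summary, with clipped counts and case folding."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / max(sum(ref.values()), 1)
```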
Judgement Prediction
We also applied RRs to the task of predicting court judgement outcomes. By filtering training data based on the ANALYSIS role, we improved the prediction performance of an XLNet-based model, highlighting the importance of focusing on specific RRs in legal judgement prediction.
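The RR-based filtering step amounts to selecting only ANALYSIS-labelled sentences before they reach the prediction model; a minimal sketch with a hypothetical function name:

```python
def analysis_only(labeled_sentences):
    """Keep only sentences tagged ANALYSIS as model input, mirroring the
    RR-based filtering used for judgement prediction."""
    return " ".join(s for s, rr in labeled_sentences if rr == "ANALYSIS")
```

Feeding only the court's reasoning to the classifier removes boilerplate (preamble, procedural history) that carries little signal about the outcome.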
The Esheria LexChat Kenyan Legal Corpus is a pioneering effort in the African legal AI landscape. The comprehensive and well-annotated dataset not only advances the field of legal AI in Kenya but also sets the stage for similar initiatives across the continent. While the corpus is currently closed-source and proprietary, we are exploring opportunities to expand this initiative to other African countries and potentially open up the datasets to the public.
Future work will focus on refining the annotation process, improving model performance, and exploring additional applications of the corpus in legal AI. By fostering collaboration and innovation in this emerging field, we aim to enhance access to justice and legal information across Africa.