OCDS Index 0.2.0


This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 8.x.

To install:

pip install ocdsindex

If you are viewing this on GitHub or PyPI, open the full documentation for additional details.

How it works

1. Build

The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:

├── en
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
├── es
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
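
As a rough illustration of this layout, the sketch below groups built HTML files by their top-level language directory. It is not part of ocdsindex: the function name `html_files_by_language` and the `build_dir` argument are hypothetical, and it assumes that every language has its own directory directly under the build directory.

```python
from collections import defaultdict
from pathlib import Path


def html_files_by_language(build_dir):
    """Group built HTML files by their top-level language directory."""
    root = Path(build_dir)
    grouped = defaultdict(list)
    for path in sorted(root.rglob("*.html")):
        relative = path.relative_to(root)
        # Skip files that sit directly in the build directory,
        # outside any language directory.
        if len(relative.parts) < 2:
            continue
        language = relative.parts[0]
        grouped[language].append(relative.as_posix())
    return dict(grouped)
```

For the tree above, the result would have one key per language (`en`, `es`), each mapping to paths such as `en/governance/index.html`.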

A build can be triggered locally or, more commonly, as part of continuous integration: for example, in a GitHub Actions workflow.

The HTML files are uploaded to a web server, and served as a static website like the OCDS documentation, which includes a search box.

2. Crawl

Once the HTML files are built, the sphinx or extension-explorer command crawls the files and extracts the documents to index, producing a JSON file for the next step.

An HTML file can contain one or more documents. Heading elements, like <h1>, typically mark the start of a new document. A document follows this format:


url
The remote URL of the document, which might include a fragment identifier. The command is given the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL of:


yields remote URLs like:


The title of the document, which might be the page title and the heading text.


The plain text content of the document.
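
The crawl step described above can be sketched with the standard library alone. This is a simplified illustration, not the package's actual crawler: the class `HeadingSplitter`, the function `extract_documents`, the dictionary keys `title` and `text`, and the base URL `https://example.org/docs/` are all assumptions made for the example. It starts a new document at every heading element and uses the heading's `id` attribute, if present, as the fragment identifier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSplitter(HTMLParser):
    """Start a new document at each heading element (a simplification)."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            # The heading's id, if any, becomes the fragment identifier.
            self.documents.append(
                {"id": dict(attrs).get("id"), "title": "", "text": ""}
            )

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if not self.documents:
            return  # ignore text that precedes the first heading
        key = "title" if self.in_heading else "text"
        self.documents[-1][key] += data


def extract_documents(html, base_url, path):
    """Return one dictionary per heading-delimited document in the HTML."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    documents = []
    for doc in parser.documents:
        # Combine the website's base URL with the page's relative path.
        url = urljoin(base_url, path)
        if doc["id"]:
            url += "#" + doc["id"]
        documents.append(
            {
                "url": url,
                # Collapse runs of whitespace into single spaces.
                "title": " ".join(doc["title"].split()),
                "text": " ".join(doc["text"].split()),
            }
        )
    return documents
```

Calling `extract_documents(html, "https://example.org/docs/", "en/governance/")` on a page with two headings would return two documents, each with its own fragment in the URL.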

3. Index

The index command then adds the extracted documents to Elasticsearch indices.

The command creates a single index for all documents in a given language: for example, ocdsindex_es. As such, an interface can search across all websites in a given language.
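
To make the one-index-per-language idea concrete, here is a small sketch that groups crawled documents by index name. The naming pattern is inferred from the single `ocdsindex_es` example above and might not hold for every language; the helper names and the shape of the `crawls` input are hypothetical.

```python
from collections import defaultdict


def index_name(language):
    # Naming pattern assumed from the "ocdsindex_es" example.
    return f"ocdsindex_{language}"


def group_by_index(crawls):
    """Map each index name to the documents of every crawl in that language."""
    indices = defaultdict(list)
    for crawl in crawls:
        indices[index_name(crawl["language"])].extend(crawl["documents"])
    return dict(indices)
```

Documents crawled from two different websites in Spanish would both end up under `ocdsindex_es`, which is what lets an interface search across websites.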

It adds three fields to each indexed document:


A field with the same value as url.


base_url
The base URL of the website whose files were crawled. An interface can filter on the base_url field to limit results to specific websites.


created_at
The timestamp at which the files were crawled. The expire command filters on the created_at field to delete documents that are no longer needed.
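
The role of the base_url and created_at fields can be sketched in plain Python, without Elasticsearch. The function names are hypothetical, the timestamp is assumed to be an integer Unix time, and the expiry rule (drop anything crawled before a cutoff) is an assumption, since the exact criterion used by the expire command is not described here. The sketch covers only these two fields.

```python
import time


def annotate(documents, base_url, created_at=None):
    """Return copies of the documents with base_url and created_at added."""
    if created_at is None:
        created_at = int(time.time())  # assumed format: Unix timestamp
    return [
        {**doc, "base_url": base_url, "created_at": created_at}
        for doc in documents
    ]


def filter_by_base_url(documents, base_url):
    """Keep only the documents crawled from one website."""
    return [doc for doc in documents if doc["base_url"] == base_url]


def drop_expired(documents, cutoff):
    """Keep only the documents crawled at or after the cutoff (assumed rule)."""
    return [doc for doc in documents if doc["created_at"] >= cutoff]
```

In a real deployment these filters would be expressed as Elasticsearch queries; the list comprehensions only show which field each operation depends on.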

That’s it! Feel free to browse the documentation below.

Copyright (c) 2020 Open Contracting Partnership, released under the BSD license