OCDS Index 0.2.0
This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 8.x.
To install:
pip install ocdsindex
If you are viewing this on GitHub or PyPI, open the full documentation for additional details.
How it works
1. Build
The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:
build/
├── en
│ ├── governance
│ │ ├── deprecation
│ │ │ └── index.html
│ │ └── index.html
… …
├── es
│ ├── governance
│ │ ├── deprecation
│ │ │ └── index.html
│ │ └── index.html
… …
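Collecting the per-language HTML files from a layout like the one above can be sketched as follows. This is an illustrative sketch, not the package's actual code; the function name is hypothetical.

```python
import os

def html_files_by_language(build_dir):
    """Map each language directory under build/ to its list of HTML files."""
    files = {}
    for language in sorted(os.listdir(build_dir)):
        language_dir = os.path.join(build_dir, language)
        if not os.path.isdir(language_dir):
            continue
        files[language] = []
        # Walk nested directories like governance/deprecation/.
        for root, _, names in os.walk(language_dir):
            for name in names:
                if name.endswith(".html"):
                    files[language].append(os.path.join(root, name))
    return files
```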
A build is triggered locally or, more commonly, as part of continuous integration: for example, in a GitHub Actions workflow.
The HTML files are uploaded to a web server, and served as a static website like the OCDS documentation, which includes a search box.
2. Crawl
Once the HTML files are built, the sphinx or extension-explorer command crawls the files and extracts the documents to index, producing a JSON file for the next step.
An HTML file can contain one or more documents. Heading elements, like <h1>, typically mark the start of a new document. A document follows this format:
- url
The remote URL of the document, which might include a fragment identifier. The command is provided the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL of:
https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/
yields remote URLs like:
https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/es/overview/#data
- title
The title of the document, which is typically the page title combined with the heading text.
- text
The plain text content of the document.
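The splitting of a page into documents at heading elements can be sketched with the standard-library HTML parser. This is an illustrative sketch, not the package's actual crawler; the example URL in the usage note is hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class DocumentSplitter(HTMLParser):
    """Start a new document at each heading, collecting url, title and text."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.documents = []
        self._current = None
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            # Use the heading's id attribute as the fragment identifier.
            fragment = dict(attrs).get("id")
            url = self.page_url + ("#" + fragment if fragment else "")
            self._current = {"url": url, "title": "", "text": ""}
            self.documents.append(self._current)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._current is None:
            return
        if self._in_heading:
            self._current["title"] += data
        else:
            self._current["text"] += data

def split_page(page_url, html):
    parser = DocumentSplitter(page_url)
    parser.feed(html)
    for doc in parser.documents:
        doc["title"] = doc["title"].strip()
        doc["text"] = " ".join(doc["text"].split())
    return parser.documents
```

For example, feeding `split_page("https://example.org/es/overview/", html)` a page with an `<h2 id="data">` heading yields a document whose url ends in `#data`.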
3. Index
The index command then adds the extracted documents to Elasticsearch indices.
The command creates a single index for all documents in a given language: for example, ocdsindex_es. As such, an interface can search across all websites in a given language.
It adds three fields to each indexed document:
- _id
The same as url.
- base_url
The base URL of the website whose files were crawled. An interface can filter on the base_url field to limit results to specific websites.
- created_at
The timestamp at which the files were crawled. The expire command filters on the created_at field to delete documents that are no longer needed.
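Assembling per-language indices and the three metadata fields before sending documents to Elasticsearch might look like the following sketch. The helper name is hypothetical, and the index names follow the ocdsindex_<language> pattern described above; actually submitting the actions would use the Elasticsearch bulk API, which is not shown here.

```python
import time

def bulk_actions(documents_by_language, base_url, created_at=None):
    """Yield (index_name, action) pairs for each extracted document."""
    if created_at is None:
        created_at = int(time.time())
    for language, documents in documents_by_language.items():
        # One index per language, e.g. ocdsindex_es.
        index = f"ocdsindex_{language}"
        for document in documents:
            yield index, {
                "_id": document["url"],   # same as url
                "base_url": base_url,     # lets an interface filter by website
                "created_at": created_at, # lets the expire command delete old documents
                **document,
            }
```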
That’s it! Feel free to browse the documentation below.
Copyright (c) 2020 Open Contracting Partnership, released under the BSD license