OCDS Index 0.1.1


This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 7.x.

To install:

pip install ocdsindex

If you are viewing this on GitHub or PyPI, open the full documentation for additional details.

How it works

1. Build

The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:

├── en
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
├── es
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
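
For illustration, a layout like the one above can be walked to group the built HTML files by language; a minimal sketch (the function name and return shape are hypothetical, not part of ocdsindex):

```python
from pathlib import Path

def html_files_by_language(build_dir):
    """Group built HTML files by their top-level language directory.

    Returns a dict like {"en": [Path(".../en/governance/index.html"), ...]}.
    """
    build_dir = Path(build_dir)
    files = {}
    for path in sorted(build_dir.rglob("*.html")):
        relative = path.relative_to(build_dir)
        if len(relative.parts) < 2:
            continue  # skip HTML files that sit outside a language directory
        files.setdefault(relative.parts[0], []).append(path)
    return files
```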

A build is triggered locally or, more commonly, as part of continuous integration: for example, in a GitHub Actions workflow.

The HTML files are uploaded to a web server and served as a static website, like the OCDS documentation, which includes a search box.

2. Crawl

Once the HTML files are built, the sphinx or extension-explorer command crawls the files and extracts the documents to index, producing a JSON file for the next step.

An HTML file can contain one or more documents. Heading elements, like <h1>, typically mark the start of a new document. A document follows this format:


url
The remote URL of the document, which might include a fragment identifier. The command is provided the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL like:

https://example.com/

yields remote URLs like:

https://example.com/en/governance/#deprecation
title
The title of the document, which might combine the page title and the heading text.


text
The plain text content of the document.
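
As a rough illustration of the crawl step, the sketch below splits one HTML page into documents at heading boundaries, using only the standard library. The class and function names are illustrative, not part of ocdsindex, and the real crawler is more thorough:

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class DocumentExtractor(HTMLParser):
    """Start a new document at each heading element; collect heading
    text as the title and all following text as the content."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.documents = []
        self.current = None
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            fragment = dict(attrs).get("id", "")
            url = self.base_url + ("#" + fragment if fragment else "")
            self.current = {"url": url, "title": "", "text": ""}
            self.documents.append(self.current)
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.current is None:
            return  # ignore text before the first heading
        key = "title" if self.in_heading else "text"
        self.current[key] += data

def extract_documents(html, base_url):
    parser = DocumentExtractor(base_url)
    parser.feed(html)
    return parser.documents
```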

3. Index

The index command then adds the extracted documents to Elasticsearch indices.

The command creates a single index for all documents in a given language: for example, ocdsindex_es. As such, an interface can search across all websites in a given language.
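
Under these conventions, indexing can be sketched as building Elasticsearch bulk-API actions, one index per language, with each document identified by its url. The helper names and action shapes below are assumptions for illustration, not ocdsindex's actual code:

```python
def index_name(language):
    """One index per language: for example, "ocdsindex_es"."""
    return f"ocdsindex_{language}"

def bulk_actions(documents_by_language):
    """Yield alternating action/document lines for the bulk API,
    routing each document to its language's index and using its
    url as the identifier."""
    for language, documents in documents_by_language.items():
        for document in documents:
            yield {"index": {"_index": index_name(language), "_id": document["url"]}}
            yield document
```

The resulting sequence could be passed to the Elasticsearch client's bulk endpoint.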

It adds three fields to each indexed document:


_id
Same as url.


base_url
The base URL of the website whose files were crawled. An interface can filter on the base_url field to limit results to specific websites.
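
For example, a search interface might combine a full-text query with a base_url filter. A hypothetical query-DSL builder, not ocdsindex's actual search code:

```python
def search_query(phrase, base_url=None):
    """Build a query body that matches the phrase against title and
    text and, optionally, restricts results to one website."""
    query = {
        "bool": {
            "must": [{"multi_match": {"query": phrase, "fields": ["title", "text"]}}],
        }
    }
    if base_url:
        # Filter context: exact match on base_url, no effect on scoring.
        query["bool"]["filter"] = [{"term": {"base_url": base_url}}]
    return {"query": query}
```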


created_at
The timestamp at which the files were crawled. The expire command filters on the created_at field to delete documents that are no longer needed.
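
A delete-by-query body for such cleanup might look like the following sketch; the exact criteria the expire command applies may differ:

```python
def expire_query(base_url, threshold):
    """Build a delete-by-query body that removes documents for one
    website whose created_at timestamp is older than the threshold."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"base_url": base_url}},
                    {"range": {"created_at": {"lt": threshold}}},
                ]
            }
        }
    }
```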

That’s it! Feel free to browse the documentation below.

Copyright (c) 2020 Open Contracting Partnership, released under the BSD license