OCDS Index 0.2.0
This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 8.x.
pip install ocdsindex
If you are viewing this on GitHub or PyPI, open the full documentation for additional details.
How it works
The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:
build/
├── en
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
├── es
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
…   …
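As a rough illustration of the layout above, the following sketch groups the HTML files of a build directory by their top-level language directory. The function name `html_files_by_language` is hypothetical and is not part of ocdsindex; it only assumes that each language has its own directory directly under the build directory.

```python
from collections import defaultdict
from pathlib import Path


def html_files_by_language(build_dir):
    """Group HTML files under build_dir by top-level language directory.

    Hypothetical helper for illustration; not part of ocdsindex.
    """
    build_dir = Path(build_dir)
    files = defaultdict(list)
    for path in sorted(build_dir.rglob("*.html")):
        relative = path.relative_to(build_dir)
        # Skip HTML files that are not inside a language directory.
        if len(relative.parts) < 2:
            continue
        language = relative.parts[0]
        files[language].append(relative.as_posix())
    return dict(files)
```

Calling `html_files_by_language("build")` on a tree like the one above would return a dictionary with one key per language directory, such as `en` and `es`.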
A build can be triggered locally, but more commonly runs as part of continuous integration: for example, as part of a GitHub Actions workflow.
The HTML files are uploaded to a web server, and served as a static website like the OCDS documentation, which includes a search box.
Once the HTML files are built, the sphinx or extension-explorer command crawls the files and extracts the documents to index, producing a JSON file for the next step.
An HTML file can contain one or more documents. Heading elements, like <h1>, typically mark the start of a new document. A document follows this format:
The remote URL of the document, which might include a fragment identifier. The command is given the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL of:
yields remote URLs like:
The title of the document, which might combine the page title and the heading text.
The plain text content of the document.
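The extraction described above can be sketched with the standard library alone. This is a simplified illustration, not the crawler that ocdsindex ships: the class and function names are hypothetical, the keys `url`, `title` and `text` are assumed names for the three parts of a document, the fragment identifier is assumed to come from an `id` attribute on the heading element itself, and `https://example.org/docs/` is a placeholder base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class DocumentExtractor(HTMLParser):
    """Split a page into documents, starting a new one at each heading."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.documents = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            # Assumption: the heading's id attribute is the fragment identifier.
            anchor = dict(attrs).get("id")
            url = f"{self.page_url}#{anchor}" if anchor else self.page_url
            self.documents.append({"url": url, "title": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if not self.documents:
            return  # Ignore text that appears before the first heading.
        key = "title" if self.in_heading else "text"
        self.documents[-1][key] += data


def extract_documents(html, base_url, relative_path):
    """Return the documents found in one HTML file, with remote URLs."""
    parser = DocumentExtractor(urljoin(base_url, relative_path))
    parser.feed(html)
    parser.close()
    for document in parser.documents:
        # Collapse runs of whitespace left over from the markup.
        document["title"] = " ".join(document["title"].split())
        document["text"] = " ".join(document["text"].split())
    return parser.documents
```

For instance, `extract_documents(html, "https://example.org/docs/", "en/governance/")` builds each remote URL by joining the placeholder base URL with the file's path relative to the build directory, then appending the heading's fragment identifier when one is present.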
The index command then adds the extracted documents to Elasticsearch indices.
The command creates a single index for all documents in a given language: for example, ocdsindex_es. As such, an interface can search across all websites in a given language.
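To make the one-index-per-language idea concrete, here is a minimal sketch that pools documents from several crawled websites into per-language groups. The function name `group_by_index` and the shape of its input are assumptions made for illustration; only the `ocdsindex_es` naming pattern comes from the text above.

```python
from collections import defaultdict


def group_by_index(crawls):
    """Pool documents from several websites into one group per language.

    crawls is an iterable of (base_url, language, documents) tuples.
    Hypothetical helper for illustration; not part of ocdsindex.
    """
    indices = defaultdict(list)
    for _base_url, language, documents in crawls:
        # All websites share the same index for a given language.
        indices[f"ocdsindex_{language}"].extend(documents)
    return dict(indices)
```

Because documents from every website land in the same per-language group, a single search over that group covers all websites in the language.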
The index command adds two fields to each indexed document:
- base_url: The base URL of the website whose files were crawled. An interface can filter on the base_url field to limit results to specific websites.
- created_at: The timestamp at which the files were crawled. The expire command filters on the created_at field to delete documents that are no longer needed.
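The role of the two added fields can be sketched in plain Python, without Elasticsearch. The function names below are hypothetical, and the representation of `created_at` as a Unix timestamp in seconds is an assumption; the actual commands perform the equivalent filtering through Elasticsearch queries.

```python
import time


def add_fields(documents, base_url, created_at=None):
    """Return copies of the documents with base_url and created_at added."""
    # Assumption: the crawl time is stored as a Unix timestamp in seconds.
    if created_at is None:
        created_at = int(time.time())
    return [
        {**document, "base_url": base_url, "created_at": created_at}
        for document in documents
    ]


def filter_by_website(documents, base_urls):
    """Limit results to documents from specific websites."""
    return [d for d in documents if d["base_url"] in base_urls]


def expire(documents, threshold):
    """Drop documents that were crawled before the threshold timestamp."""
    return [d for d in documents if d["created_at"] >= threshold]
```

Re-crawling a website produces documents with a newer `created_at` value, so expiring everything older than a chosen threshold removes documents left over from earlier crawls.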
That’s it! Feel free to browse the documentation below.
- Command-Line Interface
- API Reference
Copyright (c) 2020 Open Contracting Partnership, released under the BSD license