OCDS Index 0.2.0
This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 8.x.
To install:
pip install ocdsindex
If you are viewing this on GitHub or PyPI, open the full documentation for additional details.
How it works
1. Build
The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:
build/
├── en
│ ├── governance
│ │ ├── deprecation
│ │ │ └── index.html
│ │ └── index.html
… …
├── es
│ ├── governance
│ │ ├── deprecation
│ │ │ └── index.html
│ │ └── index.html
… …
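Collecting the per-language HTML files from a layout like the one above can be sketched as follows. This is an illustrative sketch, not the package's actual code; the function name is hypothetical.

```python
import os

def html_files_by_language(build_dir):
    """Map each language directory under build/ to its list of HTML files."""
    files = {}
    for language in sorted(os.listdir(build_dir)):
        language_dir = os.path.join(build_dir, language)
        if not os.path.isdir(language_dir):
            continue
        files[language] = []
        # Walk nested directories like governance/deprecation/.
        for root, _, names in os.walk(language_dir):
            for name in names:
                if name.endswith(".html"):
                    files[language].append(os.path.join(root, name))
    return files
```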
A build is triggered locally or, more commonly, as part of continuous integration: for example, in a GitHub Actions workflow.
The HTML files are uploaded to a web server, and served as a static website like the OCDS documentation, which includes a search box.
2. Crawl
Once the HTML files are built, the sphinx or extension-explorer command crawls the files and extracts the documents to index, producing a JSON file for the next step.
An HTML file can contain one or more documents. Heading elements, like <h1>, typically mark the start of a new document. A document follows this format:
- url
The remote URL of the document, which might include a fragment identifier. The command is provided the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL of:
https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/
yields remote URLs like:
https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/es/overview/#data
- title
The title of the document, which is typically the page title combined with the heading text.
- text
The plain text content of the document.
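The splitting of a page into documents at heading elements can be sketched with the standard-library HTML parser. This is an illustrative sketch, not the package's actual crawler; the example URL in the usage note is hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class DocumentSplitter(HTMLParser):
    """Start a new document at each heading, collecting url, title and text."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.documents = []
        self._current = None
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            # Use the heading's id attribute as the fragment identifier.
            fragment = dict(attrs).get("id")
            url = self.page_url + ("#" + fragment if fragment else "")
            self._current = {"url": url, "title": "", "text": ""}
            self.documents.append(self._current)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._current is None:
            return
        if self._in_heading:
            self._current["title"] += data
        else:
            self._current["text"] += data

def split_page(page_url, html):
    parser = DocumentSplitter(page_url)
    parser.feed(html)
    for doc in parser.documents:
        doc["title"] = doc["title"].strip()
        doc["text"] = " ".join(doc["text"].split())
    return parser.documents
```

For example, feeding `split_page("https://example.org/es/overview/", html)` a page with an `<h2 id="data">` heading yields a document whose url ends in `#data`.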
3. Index
The index command then adds the extracted documents to Elasticsearch indices.
The command creates a single index for all documents in a given language: for example, ocdsindex_es. As such, an interface can search across all websites in a given language.
It adds three fields to each indexed document:
- _id
The same as url.
- base_url
The base URL of the website whose files were crawled. An interface can filter on the base_url field to limit results to specific websites.
- created_at
The timestamp at which the files were crawled. The expire command filters on the created_at field to delete documents that are no longer needed.
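Assembling per-language indices and the three metadata fields before sending documents to Elasticsearch might look like the following sketch. The helper name is hypothetical, and the index names follow the ocdsindex_<language> pattern described above; actually submitting the actions would use the Elasticsearch bulk API, which is not shown here.

```python
import time

def bulk_actions(documents_by_language, base_url, created_at=None):
    """Yield (index_name, action) pairs for each extracted document."""
    if created_at is None:
        created_at = int(time.time())
    for language, documents in documents_by_language.items():
        # One index per language, e.g. ocdsindex_es.
        index = f"ocdsindex_{language}"
        for document in documents:
            yield index, {
                "_id": document["url"],   # same as url
                "base_url": base_url,     # lets an interface filter by website
                "created_at": created_at, # lets the expire command delete old documents
                **document,
            }
```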
That’s it! Feel free to browse the documentation below.
Copyright (c) 2020 Open Contracting Partnership, released under the BSD license