# OCDS Index 0.2.0
This Python package provides a command-line tool and library to index OCDS documentation in Elasticsearch 8.x.
To install:

```shell
pip install ocdsindex
```
If you are viewing this on GitHub or PyPI, open the full documentation for additional details.
## How it works
### 1. Build
The repositories for standard documentation, profiles and the Extension Explorer contain scripts to build HTML files under language directories, like:
```
build/
├── en
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
│   …
├── es
│   ├── governance
│   │   ├── deprecation
│   │   │   └── index.html
│   │   └── index.html
│   …
```
A build is triggered locally or, more commonly, as part of continuous integration: for example, in a GitHub Actions workflow.
The HTML files are uploaded to a web server and served as a static website, like the OCDS documentation, which includes a search box.
### 2. Crawl
Once the HTML files are built, the `sphinx` or `extension-explorer` command crawls the files and extracts the documents to index, producing a JSON file for the next step.
An HTML file can contain one or more documents. Heading elements, like `<h1>`, typically mark the start of a new document. A document follows this format:
- `url`: The remote URL of the document, which might include a fragment identifier. The command is provided the base URL of the website whose files are crawled, so that it can construct the remote URL of each document. For example, a base URL of `https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/` yields remote URLs like `https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/es/overview/#data`.
- `title`: The title of the document, which might be the page title and the heading text.
- `text`: The plain-text content of the document.
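To make the crawl step concrete, here is a minimal sketch of how an HTML page can be split into documents at `<h1>` headings, each yielding a `url`, `title`, and `text` record. This is an illustration using only the standard library, not ocdsindex's actual implementation; the `extract` function and its parameters are hypothetical.

```python
# Illustrative sketch (NOT ocdsindex's implementation): split an HTML page
# into documents at <h1> headings, producing url/title/text records.
from html.parser import HTMLParser
from urllib.parse import urljoin


class DocumentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.documents = []  # finished document records
        self.current = None  # the document being built
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # A heading starts a new document. Use the id attribute,
            # if any, as the fragment identifier.
            self.in_heading = True
            self.current = {"fragment": dict(attrs).get("id", ""), "title": "", "text": []}
            self.documents.append(self.current)

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.current is None:
            return
        if self.in_heading:
            self.current["title"] += data
        else:
            self.current["text"].append(data)


def extract(html, base_url, path):
    """Return url/title/text documents for one HTML file at `path`."""
    parser = DocumentExtractor()
    parser.feed(html)
    results = []
    for doc in parser.documents:
        url = urljoin(base_url, path)
        if doc["fragment"]:
            url += "#" + doc["fragment"]
        results.append({
            "url": url,
            "title": doc["title"].strip(),
            "text": " ".join(t.strip() for t in doc["text"]).strip(),
        })
    return results


html = '<h1 id="data">Data</h1><p>Award data is merged.</p>'
docs = extract(html, "https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/", "es/overview/")
print(docs[0]["url"])
# https://standard.open-contracting.org/staging/profiles/ppp/1.0-dev/es/overview/#data
```

A real crawler must also handle nested headings, navigation chrome, and pages without any heading, but the shape of the output is the same.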
### 3. Index
The `index` command then adds the extracted documents to Elasticsearch indices.
The command creates a single index for all documents in a given language: for example, `ocdsindex_es`. As such, an interface can search across all websites in a given language.
It adds three fields to each indexed document:
- `_id`: Same as `url`.
- `base_url`: The base URL of the website whose files were crawled. An interface can filter on the `base_url` field to limit results to specific websites.
- `created_at`: The timestamp at which the files were crawled. The `expire` command filters on the `created_at` field to delete documents that are no longer needed.
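The per-language indices and the three added fields can be sketched as follows. This is an assumption-laden illustration, not the package's code: the `bulk_actions` helper, its parameters, and the sample data are hypothetical, and the actual call to an Elasticsearch client is omitted.

```python
# Illustrative sketch (NOT ocdsindex's implementation): group crawled
# documents by language into per-language indices, attaching the _id,
# base_url, and created_at fields described above.
import time


def bulk_actions(documents_by_language, base_url, created_at=None):
    """Yield (index_name, document) pairs ready for a bulk indexing helper."""
    if created_at is None:
        created_at = int(time.time() * 1000)  # millisecond timestamp
    for language, documents in documents_by_language.items():
        index = f"ocdsindex_{language}"  # one index per language
        for document in documents:
            yield index, {
                "_id": document["url"],  # same as url
                "base_url": base_url,
                "created_at": created_at,
                **document,
            }


actions = list(bulk_actions(
    {"es": [{"url": "https://example.com/es/#data", "title": "Data", "text": "Award data"}]},
    "https://example.com/",
    created_at=1577836800000,
))
index, doc = actions[0]
print(index, doc["_id"])
```

An interface would then query one `ocdsindex_{language}` index, optionally filtering on `base_url`; an expiry job would delete documents whose `created_at` is older than the latest crawl.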
That’s it! Feel free to browse the documentation below.
## Contents
Copyright (c) 2020 Open Contracting Partnership, released under the BSD license