Extract

The extract_ methods return the documents to index as a list of dicts. Each dict sets these keys:

url

The remote URL of the document, which might include a fragment identifier

title

The title of the document, which might be the page title and the heading text

text

The plain text content of the document
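
As an illustration, a single document might look like the dict below. Only the three key names come from the description above. The URL, title and text values are made up, the " - " separator in the title is an assumption, and has_required_keys is a hypothetical helper that is not part of ocdsindex.

```python
# Hypothetical example values; only the key names are taken from the documentation.
document = {
    "url": "https://example.org/docs/guide/#introduction",
    "title": "Guide - Introduction",
    "text": "This section introduces the guide.",
}

REQUIRED_KEYS = {"url", "title", "text"}


def has_required_keys(doc):
    """Return True if the dict sets every key listed above."""
    return REQUIRED_KEYS.issubset(doc)


print(has_required_keys(document))
```

A check like this can be handy when writing a new extract_ method, since a missing key would otherwise only surface at indexing time.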

ocdsindex.extract.extract_sphinx(url, tree)

Extracts one document per section of the page.

Parameters:
  • url (str) – the file’s remote URL

  • tree – the file’s root HTML element

Returns:

a list of dicts representing the documents to index

Return type:

list
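
To make "one document per section" concrete, here is a minimal sketch of an extractor with the same (url, tree) signature. It is not the implementation of extract_sphinx. It assumes that each section is a non-nested section element with an id attribute and an h1 heading, that the title joins the page title and heading with " - ", and that the tree was parsed with the standard library's xml.etree.ElementTree. The function name extract_sections_sketch and the sample HTML are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_sections_sketch(url, tree):
    """Hypothetical stand-in: return one document per <section id="..."> element."""
    page_title = tree.findtext(".//title", default="")
    documents = []
    for section in tree.iter("section"):
        section_id = section.get("id")
        if section_id is None:
            continue  # without an id there is no fragment identifier to link to
        heading = section.findtext("h1", default="")
        # Collapse all text inside the section into a single whitespace-normalized string.
        text = " ".join(" ".join(section.itertext()).split())
        documents.append(
            {
                "url": f"{url}#{section_id}",
                "title": f"{page_title} - {heading}",
                "text": text,
            }
        )
    return documents


# Made-up, well-formed sample page.
HTML = """<html><head><title>Guide</title></head><body>
<section id="intro"><h1>Introduction</h1><p>Welcome.</p></section>
<section id="usage"><h1>Usage</h1><p>Run it.</p></section>
</body></html>"""

documents = extract_sections_sketch("https://example.org/guide/", ET.fromstring(HTML))
for doc in documents:
    print(doc["url"], "|", doc["title"])
```

Each returned dict sets the url, title and text keys, with the section id used as the URL's fragment identifier.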

ocdsindex.extract.extract_extension_explorer(url, tree)