Crawler

class ocdsindex.crawler.Crawler(directory, base_url, extract, *, allow=<function true>)

Crawls a directory for documents to index.

__init__(directory, base_url, extract, *, allow=<function true>)
Parameters:
  • directory (str) – the directory to crawl

  • base_url (str) – the remote URL at which the files will be available

  • extract – a function that accepts a file’s remote URL and its root HTML element, and returns the documents to index as a list of dicts

  • allow – a function that accepts a directory path and a file basename, and returns a boolean indicating whether to crawl the file
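
As an illustration of the two callbacks described above, the following is a minimal sketch, not the library's own implementation. The document keys (`url`, `title`, `text`), the `_static` exclusion rule and the example URL are assumptions made for this example. The root element is assumed to support the ElementTree-style `find`, `findtext` and `itertext` methods; a standard-library element stands in for it here so that the snippet runs on its own.

```python
import os
import xml.etree.ElementTree as ET


def extract(url, tree):
    """Return one document for the page: its URL, <title> text and body text.

    The keys used here are hypothetical, chosen for this example.
    """
    title = (tree.findtext(".//title") or "").strip()
    body = tree.find(".//body")
    # Join all text nodes under <body> and collapse runs of whitespace.
    text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    return [{"url": url, "title": title, "text": text}]


def allow(root, name):
    """Crawl only .html files that are not under a directory named "_static"."""
    return name.endswith(".html") and "_static" not in root.split(os.sep)


# Stand-in for a parsed root HTML element, for demonstration only.
tree = ET.fromstring(
    "<html><head><title>Guide</title></head>"
    "<body><p>Hello  world</p></body></html>"
)
documents = extract("https://example.com/en/guide.html", tree)
```

Functions like these would then be passed to the constructor following the signature above, for example `Crawler(directory, base_url, extract, allow=allow)`.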

get_documents_by_language()

Returns the documents to index for each language.

Returns:

a dict in which each key is a language code and each value is the documents to index for that language

Return type:

dict
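
To show the shape of the return value, here is a small sketch that consumes such a dict. The `documents_by_language` literal is a hand-written stand-in for what the method might return, with hypothetical language codes, URLs and document keys. `count_documents` is a helper invented for this example and is not part of the library.

```python
def count_documents(documents_by_language):
    """Map each language code to the number of documents found for it."""
    return {
        language: len(documents)
        for language, documents in documents_by_language.items()
    }


# Hand-written stand-in for the return value (assumed shape: lists of dicts).
documents_by_language = {
    "en": [
        {"url": "https://example.com/en/", "title": "Home", "text": "Welcome"},
        {"url": "https://example.com/en/guide/", "title": "Guide", "text": "How to"},
    ],
    "es": [
        {"url": "https://example.com/es/", "title": "Inicio", "text": "Bienvenido"},
    ],
}

counts = count_documents(documents_by_language)
```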

get_documents_from_file(path)

Parses the file’s HTML contents, calculates its remote URL, and returns the documents to index from the file.

Parameters:

path (str) – a file path

Returns:

the documents to index

Return type:

list
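
The "calculates its remote URL" step can be pictured as mapping a file's path, relative to the crawled directory, onto the base URL. The sketch below is one plausible way to do that and is not taken from the library's source. `remote_url` is a hypothetical helper, and the paths and URL are made up for the example. It assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def remote_url(directory, base_url, path):
    """Build the URL at which a crawled file would be served.

    The part of `path` below `directory` is appended to `base_url`.
    """
    relative = PurePosixPath(path).relative_to(directory)
    return base_url.rstrip("/") + "/" + relative.as_posix()


url = remote_url(
    "build",
    "https://example.com/docs/",
    "build/en/guide/index.html",
)
```

Under these assumptions, the resulting URL and the file's parsed root element are the two arguments that the `extract` function receives.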