Crawler

class ocdsindex.crawler.Crawler(directory, base_url, extract, *, allow=<function true>)

Crawl a directory for documents to index.

__init__(directory, base_url, extract, *, allow=<function true>)

Parameters:

directory (str) – the directory to crawl
base_url (str) – the remote URL at which the files will be available
extract – a function that accepts a file’s remote URL and its root HTML element, and returns the documents to index as a list of dicts
allow – a function that accepts a directory path and a file basename, and returns a boolean indicating whether to crawl the file
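
The two callbacks described above might be sketched as follows. This is a minimal illustration, not the library's own code: the document shape (`url` and `title` keys), the `search.html` exclusion, and the example URL are hypothetical, and the sketch assumes the root element offers an ElementTree-style API (`iter`, `itertext`). A standard-library element stands in for the parsed page.

```python
import xml.etree.ElementTree as ET


def extract(url, root):
    """Return one document per <h1> element (hypothetical document shape)."""
    documents = []
    for heading in root.iter("h1"):
        # Join the heading's text, including text inside nested elements.
        title = "".join(heading.itertext()).strip()
        documents.append({"url": url, "title": title})
    return documents


def allow(directory, basename):
    """Crawl only HTML files, skipping a hypothetical search page."""
    return basename.endswith(".html") and basename != "search.html"


# Stand-in for a parsed page; in a real crawl the root element is supplied to extract.
root = ET.fromstring("<html><body><h1>Release <em>schema</em></h1></body></html>")
documents = extract("https://example.com/en/schema/", root)
```

Following the signature above, these functions would be passed as `Crawler(directory, base_url, extract, allow=allow)`.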

get_documents_by_language()

Return the documents to index for each language.

Returns: a dict in which each key is a language code and each value is the documents to index for that language
Return type: dict
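
As a rough illustration of consuming a return value of this shape, the sketch below counts documents per language. The sample data and the `count_documents` helper are hypothetical, and the sketch assumes each value is a list of dicts, matching what `extract` returns.

```python
def count_documents(documents_by_language):
    """Map each language code to the number of documents to index."""
    return {
        language: len(documents)
        for language, documents in documents_by_language.items()
    }


# Hypothetical data in the shape described above: language code -> documents.
sample = {
    "en": [
        {"url": "https://example.com/en/", "title": "Home"},
        {"url": "https://example.com/en/schema/", "title": "Schema"},
    ],
    "es": [
        {"url": "https://example.com/es/", "title": "Inicio"},
    ],
}

counts = count_documents(sample)
```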

get_documents_from_file(path)

Parse the file’s HTML contents, calculate its remote URL, and return the documents to index from the file.
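
One plausible way to derive a remote URL from a local path is sketched below. It is an assumption for illustration only and may differ from what the library does: the `remote_url` helper is hypothetical, and it simply appends the file's path relative to the crawled directory to the base URL.

```python
import os
from urllib.parse import urljoin


def remote_url(directory, base_url, path):
    """Map a local file path to a URL under base_url (hypothetical helper)."""
    # Path of the file relative to the crawled directory, with URL-style separators.
    relative = os.path.relpath(path, directory).replace(os.sep, "/")
    # A trailing slash makes urljoin append to base_url rather than replace its last segment.
    return urljoin(base_url.rstrip("/") + "/", relative)


url = remote_url("build", "https://example.com/docs", os.path.join("build", "en", "index.html"))
```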