averell

averell.core.export_corpora(corpus_ids, granularity, corpora_folder, filename, no_download=False)
Generates a single JSON file with the chosen granularity for all of the
selected corpora
Parameters:
  • corpus_ids – IDs of the corpora that will be exported
  • granularity – Level of parsing granularity
  • corpora_folder – Local folder where the corpora are located
  • filename – Name of the output file
  • no_download – Whether to skip downloading a corpus that is missing locally
Returns:

Python dict with the chosen granularity for all of the selected corpora

averell.core.get_corpora(corpus_indices=None, output_folder=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/averell/checkouts/fix-documentation/docs/corpora'))

Download and uncompress selected corpora

Parameters:
  • corpus_indices – Indices of the corpora that will be downloaded
  • output_folder – Local folder where the corpora are going to be uncompressed
Returns:

Python dict with all corpora features

averell.utils.download_corpora(corpus_indices=None, output_folder=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/averell/checkouts/fix-documentation/docs/corpora'))

Download corpus from a list of sources to a local folder

Parameters:
  • corpus_indices (list) – Indexes into CORPORA_SOURCES that select which corpora to download
  • output_folder (string) – Folder where the corpora are going to be saved

averell.utils.download_corpus(url, filename=None)

Download the corpus zip file from an external source

Parameters: url (string) – URL of the corpus file
Returns: Local filename of the corpus
Return type: string

averell.utils.filter_corpus_features(corpus_features, corpus_id, granularity)

Get the granularity features for each poem in corpus

Parameters:
  • corpus_features (list of dicts) – Poems of the corpus, one Python dict per poem
  • corpus_id (int) – ID of the corpus to be filtered
  • granularity (string) – Level at which to filter each poem (stanza, line, word or syllable)
Returns:

List of rows with the corpus granularity info

Return type:

list

averell.utils.filter_features(features, corpus_index, granularity=None)

Select the features of a poem at the chosen granularity

Parameters:
  • features (dict) – Poem as a Python dict
  • corpus_index (int) – Index of the corpus to be filtered
  • granularity (string) – Level at which to filter the poem (stanza, line, word or syllable)
Returns:

List of rows with the poem granularity info

Return type:

list
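
To make the idea of "rows at a given granularity" concrete, the sketch below flattens a nested poem dict into one row per line. The poem layout (the `stanzas`, `lines`, `stanza_number`, and `line_text` keys) and the `line_rows` helper are assumptions made for this illustration; averell's actual poem dicts and filtering code may differ.

```python
# Hypothetical poem dict; the key names are assumptions for illustration only.
poem = {
    "poem_title": "Example",
    "stanzas": [
        {
            "stanza_number": 1,
            "lines": [
                {"line_number": 1, "line_text": "first line"},
                {"line_number": 2, "line_text": "second line"},
            ],
        },
        {
            "stanza_number": 2,
            "lines": [{"line_number": 3, "line_text": "third line"}],
        },
    ],
}


def line_rows(poem_features):
    """Flatten a nested poem dict into one row per line."""
    rows = []
    for stanza in poem_features["stanzas"]:
        for line in stanza["lines"]:
            row = dict(line)  # copy so the input poem is not modified
            row["stanza_number"] = stanza["stanza_number"]
            row["poem_title"] = poem_features["poem_title"]
            rows.append(row)
    return rows


rows = line_rows(poem)
print(len(rows))  # 3
print(rows[2]["stanza_number"])  # 2
```

Each row carries its parent stanza number and poem title, so the flat list can be analysed without walking the nested structure again.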

averell.utils.get_ids(values)

Transform numeric identifiers, corpus shortcodes (slugs), and two-letter ISO language codes into their corresponding numeric identifiers, following the order in CORPORA_SOURCES.

Returns: List of indices in CORPORA_SOURCES
Return type: list
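
As a rough illustration of this kind of mapping, the sketch below resolves a mix of indices, slugs, and language codes against a small stand-in list. The `SOURCES` entries, their `slug` and `language` keys, the 0-based indexing, and the `resolve_ids` helper are all assumptions made for this example; they are not averell's actual data or implementation.

```python
# Hypothetical stand-in for CORPORA_SOURCES; the real entries and keys may differ.
SOURCES = [
    {"slug": "corpus-a", "language": "es"},
    {"slug": "corpus-b", "language": "es"},
    {"slug": "corpus-c", "language": "en"},
]


def resolve_ids(values):
    """Map indices, slugs, and language codes to positions in SOURCES."""
    ids = []
    for value in values:
        text = str(value)
        if text.isdigit():
            # Numeric identifiers are treated as 0-based positions (assumption).
            matches = [int(text)] if int(text) < len(SOURCES) else []
        else:
            matches = [
                index
                for index, source in enumerate(SOURCES)
                if text in (source["slug"], source["language"])
            ]
        for index in matches:
            if index not in ids:  # keep first-seen order, drop duplicates
                ids.append(index)
    return ids


print(resolve_ids(["en", 0, "corpus-b", "es"]))  # [2, 0, 1]
```

A language code can match several corpora at once, which is why the sketch de-duplicates the result while preserving the order in which indices were first seen.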

averell.utils.get_line_features(features)

Filter the line features of a poem

Parameters: features (dict) – Poem dictionary
Returns: List of line dicts
Return type: list of dicts

averell.utils.get_main_corpora_info()

Create a dict with the main corpora info stored in CORPORA_SOURCES

Returns: Dictionary with the corpora info to be shown
Return type: dict

averell.utils.get_stanza_features(poem_features)

Filter the stanza features of a poem

Parameters: poem_features (dict) – Poem dictionary
Returns: List of stanza dicts
Return type: list of dicts

averell.utils.get_syllable_features(features)

Filter the syllable features of a poem

Parameters: features (dict) – Poem dictionary
Returns: List of syllable dicts
Return type: list of dicts

averell.utils.get_word_features(features)

Filter the word features of a poem

Parameters: features (dict) – Poem dictionary
Returns: List of word dicts
Return type: list of dicts

averell.utils.pretty_string(text, num_words)

Insert a line break into a text after every num_words words, producing multiline cells for use in get_main_corpora_info()

Parameters:
  • text – String to be split
  • num_words – Number of words to add a line break after
Returns:

String with a line break after every num_words words

Return type:

str
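
The behaviour described for pretty_string can be approximated in a few lines. The `break_every` helper below is a hypothetical stand-in written for this illustration, not the library's code, and it assumes that runs of whitespace in the input may be collapsed to single spaces.

```python
def break_every(text, num_words):
    """Return text with a newline inserted after every num_words words."""
    words = text.split()
    chunks = [
        " ".join(words[start:start + num_words])
        for start in range(0, len(words), num_words)
    ]
    return "\n".join(chunks)


print(break_every("one two three four five", 2))
# one two
# three four
# five
```

In this sketch num_words must be a positive integer; a value of zero makes `range` raise a ValueError.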

averell.utils.progress_bar(t)

Wraps a tqdm instance (from https://gist.github.com/leimao/37ff6e990b3226c2c9670a2cd1e4a6f5). Don’t forget to close() or __exit__() the tqdm instance once you’re done (easiest using with syntax).

averell.utils.read_features(corpus_folder)

Read the dictionary of each poem in corpus_folder and return them as a list of Python dictionaries

Parameters: corpus_folder – Local folder where the corpus is located
Returns: List of Python dictionaries with the poem features
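
A minimal sketch of this kind of folder read, assuming each poem is stored as its own `.json` file somewhere under the corpus folder. That layout and the `load_poems` helper are assumptions made for illustration; the real on-disk structure of an averell corpus may differ.

```python
import json
import tempfile
from pathlib import Path


def load_poems(corpus_folder):
    """Load every *.json file under corpus_folder into a list of dicts."""
    poems = []
    for path in sorted(Path(corpus_folder).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            poems.append(json.load(handle))
    return poems


# Build a throwaway folder with two poem files to exercise the helper.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a", "b"):
        Path(folder, f"{name}.json").write_text(
            json.dumps({"poem_title": name}), encoding="utf-8"
        )
    poems = load_poems(folder)

print([poem["poem_title"] for poem in poems])  # ['a', 'b']
```

Sorting the paths keeps the order of the returned list stable across runs and platforms.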

averell.utils.uncompress_corpus(filename, save_dir)

Uncompress the corpus zip file

Parameters:
  • filename (string) – File that is going to be uncompressed
  • save_dir (string) – Folder where the corpus is going to be uncompressed
Returns:

Filename of the uncompressed corpus

Return type:

string
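
For readers unfamiliar with the standard-library side of this step, the sketch below extracts a zip archive with `zipfile` and reports its top-level entries. The `unzip_to` helper is hypothetical and, unlike the function documented here, returns a list of names rather than a single filename.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_to(filename, save_dir):
    """Extract a zip archive into save_dir and return its top-level names."""
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(save_dir)
        # First path component of each member, de-duplicated and sorted.
        return sorted({name.split("/")[0] for name in archive.namelist()})


# Create a tiny archive in a temporary folder, then extract it.
with tempfile.TemporaryDirectory() as folder:
    zip_path = Path(folder, "corpus.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("corpus/poem.json", "{}")
    top_level = unzip_to(zip_path, Path(folder, "out"))
    extracted = Path(folder, "out", "corpus", "poem.json").exists()

print(top_level, extracted)  # ['corpus'] True
```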

averell.utils.write_json(poem_dict, filename)

Save data in JSON format

Parameters:
  • poem_dict (dict) – Python dict with poem data
  • filename (string) – Name of the JSON file that will be written with the poem data
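
A minimal sketch of saving a poem dict with the standard `json` module. The `save_json` helper, the UTF-8 encoding, and the `ensure_ascii=False` and `indent=2` settings are choices made for this illustration, not necessarily what write_json does.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, filename):
    """Write data to filename as UTF-8 JSON, keeping non-ASCII characters."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)


# Round-trip a small dict through a temporary file.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder, "poem.json")
    save_json({"poem_title": "Canción"}, target)
    restored = json.loads(target.read_text(encoding="utf-8"))

print(restored)  # {'poem_title': 'Canción'}
```

Passing `ensure_ascii=False` keeps accented characters readable in the output file, which matters for poetry corpora in languages other than English.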