Usage Details

The Rake constructor accepts the following parameters. Varying them lets this implementation of the RAKE algorithm serve several different purposes.

  • stopwords : List of words that break sentences into phrases. Defaults to None, in which case the stopwords for the chosen language are used.

  • punctuations : Punctuation characters that break sentences into phrases. Defaults to None, in which case the default punctuation set is used.

  • language : Language to use for tokenization and stopwords. Defaults to english.

  • ranking_metric : Metric to use for ranking phrases.
    • Ratio of degree of word to its frequency, d(w)/f(w). (Default)

    • Degree of word only, d(w).

    • Frequency of word only, f(w).

  • max_length : Maximum number of words in a phrase for it to be considered. Defaults to 100000.

  • min_length : Minimum number of words in a phrase for it to be considered. Defaults to 1.

to use it with a specific language supported by nltk

from rake_nltk import Rake

r = Rake(language=<language>)

The implementation automatically picks up the stopwords for that language and the default punctuation set.

to provide your own list of stop words and punctuations

from rake_nltk import Rake

r = Rake(
    stopwords=<list of stopwords>,
    punctuations=<string of punctuations to ignore>
)
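
To illustrate what the stopword and punctuation lists do, the sketch below splits a tokenized sentence into candidate phrases wherever a stopword or punctuation token occurs. It is a simplified pure-Python illustration, not the library's own code: `candidate_phrases` and its arguments are hypothetical names, and the lowercasing and single-token punctuation handling are assumptions.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def candidate_phrases(
    words: Iterable[str],
    stopwords: Iterable[str],
    punctuations: Iterable[str],
) -> List[Tuple[str, ...]]:
    """Split a token sequence into phrases at stopwords and punctuation."""
    # Assumption: matching is case-insensitive and each punctuation mark
    # arrives as its own single-character token.
    to_ignore = {s.lower() for s in stopwords} | set(punctuations)
    phrases = []
    # groupby yields maximal runs of tokens that share the same "is a break" flag.
    for is_break, group in groupby(words, key=lambda w: w.lower() in to_ignore):
        if not is_break:
            phrases.append(tuple(w.lower() for w in group))
    return phrases


tokens = ["Magic", "systems", "is", "a", "company", "."]
print(candidate_phrases(tokens, stopwords=["is", "a"], punctuations=".,!?"))
# [('magic', 'systems'), ('company',)]
```

A larger stopword list produces shorter, more numerous phrases, which is why supplying a custom list can change the extracted keywords noticeably.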

to control the metric for ranking

from rake_nltk import Metric, Rake

# Paper uses d(w)/f(w) as the final metric. You can use this API with the
# following metrics:

# 1. d(w)/f(w) (Default metric) Ratio of degree of word to its frequency.
r = Rake(ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO)

# 2. d(w) Degree of word only.
r = Rake(ranking_metric=Metric.WORD_DEGREE)

# 3. f(w) Frequency of word only.
r = Rake(ranking_metric=Metric.WORD_FREQUENCY)
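
To make the three metrics concrete, here is a small worked sketch that computes f(w), d(w), and d(w)/f(w) by hand for a list of candidate phrases. It assumes the usual RAKE reading in which a word's degree counts its co-occurrences inside each phrase, the word itself included, and a phrase's score is the sum of its words' scores. The helper names (`word_scores`, `phrase_score`) and the string values of `metric` are hypothetical and are not part of the rake_nltk API.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def word_scores(phrases: List[Phrase], metric: str = "ratio") -> Dict[str, float]:
    """Score each word by d(w)/f(w) ("ratio"), d(w) ("degree"), or f(w) ("frequency")."""
    frequency = Counter()      # f(w): number of occurrences of the word
    degree = defaultdict(int)  # d(w): co-occurrences within phrases, word itself included
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    if metric == "degree":
        return {w: float(degree[w]) for w in frequency}
    if metric == "frequency":
        return {w: float(frequency[w]) for w in frequency}
    return {w: degree[w] / frequency[w] for w in frequency}


def phrase_score(phrase: Phrase, scores: Dict[str, float]) -> float:
    """Assumption: a phrase's score is the sum of its words' scores."""
    return sum(scores[w] for w in phrase)


phrases = [("magic", "systems"), ("company",), ("magic", "systems"), ("founded",), ("garage",)]
scores = word_scores(phrases)
print(scores["magic"], phrase_score(("magic", "systems"), scores))
# 2.0 4.0
```

With the ratio metric, words that mostly appear inside longer phrases score higher, whereas the frequency metric simply favours words that occur often.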

to control the max or min words in a phrase

Only phrases whose length in words falls within the given minimum and/or maximum are considered for ranking.

from rake_nltk import Rake

r = Rake(min_length=2, max_length=4)
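
The effect of the two bounds can be pictured with a small standalone filter. This is only a sketch: `filter_by_length` is a hypothetical helper, and it assumes that both bounds are inclusive and that length is counted in words.

```python
from typing import List, Tuple

Phrase = Tuple[str, ...]


def filter_by_length(
    phrases: List[Phrase], min_length: int = 1, max_length: int = 100000
) -> List[Phrase]:
    """Keep only phrases whose word count lies within [min_length, max_length]."""
    # Assumption: both bounds are inclusive.
    return [p for p in phrases if min_length <= len(p) <= max_length]


phrases = [
    ("company",),
    ("magic", "systems"),
    ("very", "long", "candidate", "key", "phrase"),
]
print(filter_by_length(phrases, min_length=2, max_length=4))
# [('magic', 'systems')]
```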

to control whether or not to include repeated phrases in text

This lets the user choose between including every phrase generated from the text and including each distinct phrase only once. Example: “Magic systems is a company. Magic systems was founded in a garage” contains the phrase (magic, systems) twice.

from rake_nltk import Rake

# To include all phrases even the repeated ones.
r = Rake() # Equivalent to Rake(include_repeated_phrases=True)

# To include all phrases only once and ignore the repetitions
r = Rake(include_repeated_phrases=False)
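
As a rough picture of the difference, the sketch below removes repeated phrases from the candidate list for the “Magic systems” example. `drop_repeated` is a hypothetical helper, and keeping the first occurrence in its original position is a choice made for this sketch; the library may order de-duplicated phrases differently.

```python
from typing import List, Tuple

Phrase = Tuple[str, ...]


def drop_repeated(phrases: List[Phrase]) -> List[Phrase]:
    """Return each distinct phrase once, in order of first appearance."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(phrases))


phrases = [("magic", "systems"), ("company",), ("magic", "systems"), ("founded",), ("garage",)]
print(drop_repeated(phrases))
# [('magic', 'systems'), ('company',), ('founded',), ('garage',)]
```

Because word frequency and degree are accumulated over the phrase list, dropping repeats can also change the resulting scores, not just the number of phrases returned.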

to control the sentence tokenizer

This lets the user choose which sentence tokenizer to use.

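from typing import List

from rake_nltk import Rake

# To use default `nltk.tokenize.sent_tokenize` tokenizer.
r = Rake() # Equivalent to Rake(sentence_tokenizer=nltk.tokenize.sent_tokenize)

# To use a custom tokenizer.
def custom_tokenizer(text: str) -> List[str]:
    # Placeholder body for illustration: treat each non-empty line as a sentence.
    return [line for line in text.splitlines() if line.strip()]

r = Rake(sentence_tokenizer=custom_tokenizer)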
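
One possible shape for a custom sentence tokenizer is a regular-expression splitter that needs no NLTK data. The function below is a minimal sketch under the assumption that sentences end in `.`, `!`, or `?` followed by whitespace; it will mis-split abbreviations such as "e.g. this". The name `regex_sentence_tokenizer` is hypothetical.

```python
import re
from typing import List


def regex_sentence_tokenizer(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the terminating punctuation attached to its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


print(regex_sentence_tokenizer("Magic systems is a company. Magic systems was founded in a garage."))
# ['Magic systems is a company.', 'Magic systems was founded in a garage.']
```

Since it takes a string and returns a list of strings, a function like this matches the signature shown above and could be passed as `sentence_tokenizer`.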

to control the word tokenizer

This lets the user choose which word tokenizer to use.

from typing import List

from rake_nltk import Rake

# To use default `nltk.tokenize.wordpunct_tokenize` tokenizer.
r = Rake() # Equivalent to Rake(word_tokenizer=nltk.tokenize.wordpunct_tokenize)

# To use a custom tokenizer.
def custom_tokenizer(text: str) -> List[str]:
    # Placeholder body for illustration: split on whitespace.
    return text.split()

r = Rake(word_tokenizer=custom_tokenizer)
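
A custom word tokenizer is useful when hyphenated or contracted words should stay in one piece, since a tokenizer that separates word characters from punctuation would break "state-of-the-art" apart. The sketch below is one hedged way to do that with a regular expression; `hyphen_aware_tokenizer` is a hypothetical name, and the pattern only handles straight apostrophes and ASCII hyphens.

```python
import re
from typing import List

# A word, optionally extended by hyphen- or apostrophe-joined parts,
# or else a single non-space punctuation character.
_TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def hyphen_aware_tokenizer(text: str) -> List[str]:
    """Tokenize text, keeping hyphenated words and contractions intact."""
    return _TOKEN_PATTERN.findall(text)


print(hyphen_aware_tokenizer("state-of-the-art models, don't fail."))
# ['state-of-the-art', 'models', ',', "don't", 'fail', '.']
```

Punctuation still comes out as separate single-character tokens, so it can continue to act as a phrase boundary.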