Usage Details
The Rake object accepts the following constructor parameters. Varying them lets this implementation of the RAKE algorithm serve several different purposes.
- stopwords : List of words which are considered to break sentences into phrases. Defaults to None.
- punctuations : List of punctuations which are considered to break sentences into phrases. Defaults to None.
- language : Language to use for tokenization and stopwords. Defaults to english.
- ranking_metric : Metric to use for ranking phrases. One of:
  - Ratio of degree of word to its frequency d(w)/f(w). (Default)
  - Degree of word only.
  - Frequency of word only.
- max_length : Maximum number of words in a phrase to consider. Defaults to 100000.
- min_length : Minimum number of words in a phrase to consider. Defaults to 1.
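To make the role of stopwords and punctuations concrete, here is a minimal standard-library sketch of how a sentence might be broken into candidate phrases. It is an illustration, not the library's code: the helper name `candidate_phrases`, the regular expression used for tokenizing, and the tiny stopword set are all assumptions made for this example.

```python
import re
from typing import List, Set, Tuple


def candidate_phrases(
    sentence: str, stopwords: Set[str], punctuations: Set[str]
) -> List[Tuple[str, ...]]:
    """Split one sentence into phrases at stopwords and punctuation (hypothetical helper)."""
    # Tokenize into runs of word characters or runs of punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence.lower())
    breakers = stopwords | punctuations
    phrases: List[Tuple[str, ...]] = []
    current: List[str] = []
    for token in tokens:
        if token in breakers:
            # A stopword or punctuation mark closes the phrase being built.
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(tuple(current))
    return phrases


# Illustrative inputs only; real stopword lists are much longer.
stopwords = {"is", "a", "in", "was", "the"}
punctuations = {".", ","}
phrases = candidate_phrases("Magic systems is a company.", stopwords, punctuations)
print(phrases)
```

Every token that is neither a stopword nor a punctuation mark stays in the current phrase, so the sentence above yields the phrases (magic, systems) and (company).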
to use it with a specific language supported by nltk
from rake_nltk import Rake
r = Rake(language=<language>)
The implementation automatically picks up the stopwords for that language and the default punctuation set.
to provide your own list of stop words and punctuations
from rake_nltk import Rake
r = Rake(
    stopwords=<list of stopwords>,
    punctuations=<string of punctuations to ignore>
)
to control the metric for ranking
from rake_nltk import Metric, Rake
# Paper uses d(w)/f(w) as the final metric. You can use this API with the
# following metrics:
# 1. d(w)/f(w) (Default metric) Ratio of degree of word to its frequency.
r = Rake(ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO)
# 2. d(w) Degree of word only.
r = Rake(ranking_metric=Metric.WORD_DEGREE)
# 3. f(w) Frequency of word only.
r = Rake(ranking_metric=Metric.WORD_FREQUENCY)
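The three metrics can be compared with a small standard-library sketch. It assumes, following the RAKE paper, that a word's degree counts its co-occurrences within phrases (itself included) and that a phrase's score is the sum of its word scores. The function names and the string labels "ratio", "degree" and "frequency" are hypothetical stand-ins for illustration and are not part of the rake_nltk API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def word_scores(phrases: List[Phrase], metric: str = "ratio") -> Dict[str, float]:
    """Score each word by d(w)/f(w), d(w), or f(w) (hypothetical helper)."""
    frequency: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Each occurrence co-occurs with every word in its phrase, itself included.
            degree[word] += len(phrase)
    if metric == "ratio":
        return {w: degree[w] / frequency[w] for w in frequency}
    if metric == "degree":
        return {w: float(degree[w]) for w in frequency}
    if metric == "frequency":
        return {w: float(frequency[w]) for w in frequency}
    raise ValueError(f"unknown metric: {metric}")


def phrase_score(phrase: Phrase, scores: Dict[str, float]) -> float:
    """Assumed phrase score: the sum of the scores of its words."""
    return sum(scores[word] for word in phrase)


phrases = [("magic", "systems"), ("company",), ("magic", "systems"), ("founded",), ("garage",)]
for metric in ("ratio", "degree", "frequency"):
    scores = word_scores(phrases, metric)
    print(metric, phrase_score(("magic", "systems"), scores), phrase_score(("company",), scores))
```

In this sketch the ratio metric favors words that appear mostly inside longer phrases, the degree metric additionally rewards words that occur often, and the frequency metric ignores phrase length entirely.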
to control the max or min words in a phrase
Only phrases whose length in words falls within the given minimum and/or maximum are considered for ranking.
from rake_nltk import Rake
r = Rake(min_length=2, max_length=4)
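The effect of the two bounds can be pictured as a simple filter over candidate phrases. This is a sketch under the assumption that both bounds are inclusive and that length is counted in words; `filter_by_length` is a hypothetical helper, not a rake_nltk function.

```python
from typing import List, Tuple

Phrase = Tuple[str, ...]


def filter_by_length(
    phrases: List[Phrase], min_length: int = 1, max_length: int = 100000
) -> List[Phrase]:
    """Keep phrases whose word count lies in [min_length, max_length] (assumed inclusive)."""
    return [p for p in phrases if min_length <= len(p) <= max_length]


phrases = [
    ("company",),
    ("magic", "systems"),
    ("red", "black", "tree", "node", "rotation"),
]
kept = filter_by_length(phrases, min_length=2, max_length=4)
print(kept)
```

With min_length=2 and max_length=4, the one-word and five-word phrases are dropped and only (magic, systems) remains.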
to control whether or not to include repeated phrases in text
Lets the user choose whether to include every phrase generated from the text or to include each phrase only once. Example: “Magic systems is a company. Magic systems was founded in a garage” contains the phrase (magic, systems) twice.
from rake_nltk import Rake
# To include all phrases even the repeated ones.
r = Rake() # Equivalent to Rake(include_repeated_phrases=True)
# To include all phrases only once and ignore the repetitions
r = Rake(include_repeated_phrases=False)
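For the example sentence pair above, the difference between the two settings can be sketched with plain Python. The phrase list is written out by hand and `drop_repeats` is a hypothetical helper; this illustrates the idea of keeping each phrase once and is not the library's internal code.

```python
from typing import List, Tuple

Phrase = Tuple[str, ...]


def drop_repeats(phrases: List[Phrase]) -> List[Phrase]:
    """Keep the first occurrence of each phrase, preserving order (hypothetical helper)."""
    seen = set()
    unique: List[Phrase] = []
    for phrase in phrases:
        if phrase not in seen:
            seen.add(phrase)
            unique.append(phrase)
    return unique


# Hand-written candidate phrases for:
# "Magic systems is a company. Magic systems was founded in a garage"
all_phrases = [("magic", "systems"), ("company",), ("magic", "systems"), ("founded",), ("garage",)]
unique_phrases = drop_repeats(all_phrases)
print(len(all_phrases), len(unique_phrases))
```

Here the list with repeats holds five phrases, while the deduplicated list holds four because (magic, systems) is kept only once.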
to control the sentence tokenizer
Lets the user choose which sentence tokenizer to use.
from typing import List
from rake_nltk import Rake
# To use default `nltk.tokenize.sent_tokenize` tokenizer.
r = Rake() # Equivalent to Rake(sentence_tokenizer=nltk.tokenize.sent_tokenize)
# To use a custom tokenizer.
def custom_tokenizer(text: str) -> List[str]:
    ...
r = Rake(sentence_tokenizer=custom_tokenizer)
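As one possible body for such a function, the sketch below splits text after sentence-ending punctuation using only the standard library. The name `regex_sentence_tokenizer` and the splitting rule are assumptions chosen for illustration; the rule is deliberately naive and will mis-split abbreviations such as "e.g.".

```python
import re
from typing import List


def regex_sentence_tokenizer(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace (naive sketch)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty or only whitespace.
    return [s for s in sentences if s]


sentences = regex_sentence_tokenizer(
    "Magic systems is a company. Magic systems was founded in a garage."
)
print(sentences)
```

Any callable with this shape, taking a string and returning a list of strings, can then be passed as the sentence_tokenizer argument shown above.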
to control the word tokenizer
Lets the user choose which word tokenizer to use.
from typing import List
from rake_nltk import Rake
# To use default `nltk.tokenize.wordpunct_tokenize` tokenizer.
r = Rake() # Equivalent to Rake(word_tokenizer=nltk.tokenize.wordpunct_tokenize)
# To use a custom tokenizer.
def custom_tokenizer(text: str) -> List[str]:
    ...
r = Rake(word_tokenizer=custom_tokenizer)
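One reason to swap the word tokenizer is to keep hyphenated terms together, since a tokenizer that separates words from punctuation splits them at each hyphen. The sketch below is a standard-library illustration under that assumption; `hyphen_aware_tokenizer` is a hypothetical name and the regular expression is one possible choice, not a recommendation from the library.

```python
import re
from typing import List


def hyphen_aware_tokenizer(text: str) -> List[str]:
    """Tokenize into words and punctuation runs, keeping hyphenated words whole (sketch)."""
    # First alternative: a word optionally followed by "-word" parts.
    # Second alternative: a run of punctuation characters.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]+", text)


tokens = hyphen_aware_tokenizer("State-of-the-art keyword extraction, done.")
print(tokens)
```

With this tokenizer "State-of-the-art" survives as a single token, so it can appear inside a candidate phrase as one word instead of being broken apart at the hyphens.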