min_df was larger than max_df and outside of the acceptable range of 0.0-1.0 (#1601)

aserfass · aaronmarkham · web-flow · commit bc707152691a · 2020-10-29T11:31:19.000-07:00
* min_df was larger than max_df and outside the acceptable range of 0.0 to 1.0. This raised an error; changing min_df to 0.2 or 0.02 resolved it. It is unclear whether the original author intended min_df to be 0.2 or 0.02.

* Update ntm_20newsgroups_topic_model.ipynb

Remove output and change min_df to a likely better default of 0.2

Co-authored-by: Aaron Markham <markhama@amazon.com>
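For context on the thresholds this commit touches, here is a minimal sketch of how document-frequency bounds of this kind are commonly resolved. It assumes the semantics described in the scikit-learn documentation for `CountVectorizer`: an integer `min_df`/`max_df` is an absolute document count, and a float is a proportion of documents in the range 0.0 to 1.0. The helper names `resolve_doc_count` and `check_df_bounds` are hypothetical and written only for illustration; this is not scikit-learn's implementation. The commit message does not say how the error arose, so the tiny-corpus case below is just one possible scenario.

```python
import numbers


def resolve_doc_count(threshold, n_docs):
    """Convert a min_df/max_df style threshold into a document count.

    Assumed semantics: an int is an absolute number of documents,
    a float is a proportion of the corpus and must lie in [0.0, 1.0].
    """
    if isinstance(threshold, numbers.Integral):
        return threshold
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("float thresholds must be in the range [0.0, 1.0]")
    return threshold * n_docs


def check_df_bounds(min_df, max_df, n_docs):
    """Return (min_count, max_count), or raise if the bounds conflict."""
    min_count = resolve_doc_count(min_df, n_docs)
    max_count = resolve_doc_count(max_df, n_docs)
    if max_count < min_count:
        raise ValueError("max_df corresponds to fewer documents than min_df")
    return min_count, max_count


if __name__ == "__main__":
    # Values taken from the diff: min_df=2 (old), min_df=0.2 (new), max_df=0.95.
    # The corpus sizes are made up for the demonstration.
    for min_df, n_docs in [(2, 1000), (0.2, 1000), (2, 2)]:
        try:
            print(min_df, n_docs, check_df_bounds(min_df, 0.95, n_docs))
        except ValueError as err:
            print(min_df, n_docs, "error:", err)
```

Under these assumed semantics, `min_df=2` keeps terms that occur in at least two documents, while `min_df=0.2` keeps only terms that occur in at least 20% of documents, so the two settings select very different vocabularies.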
diff --git a/introduction_to_applying_machine_learning/ntm_20newsgroups_topic_modeling/ntm_20newsgroups_topic_model.ipynb b/introduction_to_applying_machine_learning/ntm_20newsgroups_topic_modeling/ntm_20newsgroups_topic_model.ipynb
@@ -279,7 +279,7 @@
     "print('Tokenizing and counting, this may take a few minutes...')\n",
     "start_time = time.time()\n",
     "vectorizer = CountVectorizer(input='content', analyzer='word', stop_words='english',\n",
-    "                             tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=2)\n",
+    "                             tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=0.2)\n",
     "vectors = vectorizer.fit_transform(data)\n",
     "vocab_list = vectorizer.get_feature_names()\n",
     "print('vocab size:', len(vocab_list))\n",