This suite of tools is designed as a semi-automatic scraper and data cleaning system specifically for Arabic content on Quora. It comprises three main scripts: `gql_scraper_ar_original.py` for scraping, `combine_all.py` for combining scraped data, and `de_duplication.py` for removing duplicates.
- Python 3.x
- Required Python packages:
  - For `gql_scraper_ar_original.py`: `requests`, `tqdm`
  - For `combine_all.py`: no additional packages required.
  - For `de_duplication.py`: `datasketch`, `re` (`re` is part of the Python standard library, so only `datasketch` needs to be installed)
- Purpose: Searches Quora for a list of keywords and retrieves questions related to each keyword.
- Usage:
  - Prepare a `query_list` text file with your desired keywords.
  - Run `gql_scraper_ar_original.py`.
  - The script generates a separate text file for each keyword, containing up to 1000 related questions.
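The keyword loop described above can be sketched roughly as follows. This is only an illustration, not the code in `gql_scraper_ar_original.py`: the function names, the one-keyword-per-line layout of `query_list`, and the `<keyword>.txt` output naming are all assumptions, and the actual Quora request is left as a caller-supplied `fetch_questions` placeholder because its details are not shown here.

```python
from pathlib import Path

MAX_QUESTIONS = 1000  # per-keyword cap mentioned in the usage notes


def read_keywords(query_list_path):
    """Read keywords, assuming one keyword per line (an assumption)."""
    lines = Path(query_list_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def save_questions(keywords, fetch_questions, output_dir):
    """Write one text file per keyword and return the question counts.

    `fetch_questions` is a hypothetical callable that takes a keyword and
    returns an iterable of question strings; the real script performs a
    GQL request to Quora at this point.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for keyword in keywords:
        questions = list(fetch_questions(keyword))[:MAX_QUESTIONS]
        # Hypothetical naming scheme; keywords containing path separators
        # would need sanitizing first.
        out_path = output_dir / f"{keyword}.txt"
        out_path.write_text("".join(q + "\n" for q in questions), encoding="utf-8")
        written[keyword] = len(questions)
    return written
```

Passing the fetch step in as a parameter keeps the file handling testable without any network access.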
- Purpose: Combines all separate files generated by `gql_scraper_ar_original.py` into a single file.
- Usage:
  - Specify a name for your combined file.
  - Run `combine_all.py`.
  - The script will create a new file combining all separate files from the first script.
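As a rough idea of what the combining step involves, here is a minimal standard-library sketch. It is not the code in `combine_all.py`: the `combine_files` name, the `*.txt` pattern, and the choice to skip blank lines are assumptions made for illustration.

```python
from pathlib import Path


def combine_files(input_dir, output_path, pattern="*.txt"):
    """Concatenate every matching file in input_dir into output_path.

    Returns the number of non-empty lines written.
    """
    output_path = Path(output_path)
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            # Skip the combined file itself if it lives in the same folder.
            if path.resolve() == output_path.resolve():
                continue
            for line in path.read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if line:
                    out.write(line + "\n")
                    count += 1
    return count
```

Opening files with `encoding="utf-8"` explicitly matters for Arabic text, since the platform default encoding is not always UTF-8.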
- Purpose: Removes duplicates from the combined file using the MinHash method.
- Usage:
  - Provide the name of the combined file.
  - Run `de_duplication.py`.
  - The script processes the file for de-duplication.
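To illustrate the MinHash idea behind the de-duplication step, here is a small standard-library sketch. It is not the code in `de_duplication.py`, which relies on the `datasketch` package; the character 3-gram shingling, the 64 hash functions, the 0.8 threshold, and all function names are assumptions chosen for the example.

```python
import hashlib
import re

NUM_PERM = 64  # number of hash functions in a signature (assumed value)


def shingles(text, n=3):
    """Return the set of character n-grams of whitespace-normalized text."""
    text = re.sub(r"\s+", " ", text.strip())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(lines, threshold=0.8):
    """Keep a line only if no previously kept line is at least threshold-similar."""
    kept, kept_sigs = [], []
    for line in lines:
        if not line.strip():
            continue
        sig = minhash_signature(line)
        if all(estimated_similarity(sig, s) < threshold for s in kept_sigs):
            kept.append(line)
            kept_sigs.append(sig)
    return kept
```

This version compares each line against every kept line, which is quadratic in the worst case. For large files, `datasketch` offers `MinHash` together with a `MinHashLSH` index that avoids the pairwise comparison.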
- These scripts are specialized for retrieving questions in Arabic. For English content, you can use https://curlconverter.com/python/ to obtain the necessary GQL information from Quora.
This software is provided "as is", without warranty of any kind. The user assumes full responsibility for the use of the software. The developer is not responsible for any direct, indirect, incidental, special, exemplary, or consequential damages resulting from the use of the software. The user must comply with Quora's Terms of Service and ensure ethical and legal use of the software. This tool is intended for non-commercial use only.