tREeFrOGcoder/quora-scraper-ar

Quora Arabic Question Scraper and Data Cleaning System

Overview

This suite of tools is a semi-automatic scraper and data cleaning system for Arabic content on Quora. It comprises three scripts: gql_scraper_ar_original.py for scraping, combine_all.py for combining the scraped data, and de_duplication.py for removing duplicates.

Prerequisites

  • Python 3.x
  • Required Python packages:
    • For gql_scraper_ar_original.py: requests, tqdm
    • For combine_all.py: No additional packages required.
    • For de_duplication.py: datasketch (re is also used, but it is part of the Python standard library and needs no installation)

Scripts and Usage

1. gql_scraper_ar_original.py

  • Purpose: Searches Quora for a list of keywords and retrieves questions related to each keyword.
  • Usage:
    • Prepare a query_list text file with your desired keywords.
    • Run gql_scraper_ar_original.py.
    • The script generates separate text files for each keyword, containing up to 1000 related questions.
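The keyword loop described above can be sketched roughly as follows. This is not the code of gql_scraper_ar_original.py: the function names, the one-keyword-per-line layout of query_list, and the "<keyword>.txt" output naming are all assumptions made for illustration, and the actual Quora GQL request is left out and passed in as a hypothetical fetch_questions callable.

```python
from pathlib import Path

MAX_QUESTIONS = 1000  # per-keyword cap mentioned in the usage notes


def load_keywords(path):
    """Read keywords, assuming one keyword per line (assumed file layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def scrape_keywords(keywords, fetch_questions, out_dir):
    """Write one text file per keyword and return the question count per keyword.

    fetch_questions is a hypothetical callable (keyword -> iterable of
    question strings) standing in for the real GQL request logic.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for keyword in keywords:
        questions = list(fetch_questions(keyword))[:MAX_QUESTIONS]
        target = out_dir / f"{keyword}.txt"  # assumed naming scheme
        body = "\n".join(questions) + ("\n" if questions else "")
        target.write_text(body, encoding="utf-8")
        written[keyword] = len(questions)
    return written
```

Passing the fetcher in as an argument keeps the file handling testable without touching the network; in the real script the request logic lives inside the script itself.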

2. combine_all.py

  • Purpose: Combines all separate files generated by gql_scraper_ar_original.py into a single file.
  • Usage:
    • Specify a name for your combined file.
    • Run combine_all.py.
    • The script will create a new file combining all separate files from the first script.
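As a rough illustration of the combining step, the sketch below concatenates every text file in a folder into one output file. It is not taken from combine_all.py: the combine_files name, the "*.txt" pattern, and the choice to skip blank lines are assumptions.

```python
from pathlib import Path


def combine_files(input_dir, output_path, pattern="*.txt"):
    """Concatenate all matching files into output_path, one question per line.

    Returns the number of lines written. Blank lines are skipped, and the
    output file itself is ignored if it sits in the same folder.
    """
    output_path = Path(output_path)
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            if path.resolve() == output_path.resolve():
                continue  # do not read the file we are writing
            for line in path.read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if line:
                    out.write(line + "\n")
                    count += 1
    return count
```

For example, combine_files("scraped", "combined.txt") would merge every .txt file under a hypothetical scraped folder into combined.txt.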

3. de_duplication.py

  • Purpose: Removes duplicates from the combined file using the MinHash method.
  • Usage:
    • Provide the name of the combined file.
    • Run de_duplication.py.
    • The script de-duplicates the questions in that file.
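To show the idea behind MinHash de-duplication, here is a self-contained sketch that uses only the standard library. It deliberately does not use datasketch, so it is a swapped-in illustration rather than the code of de_duplication.py; the character 3-gram shingles, 64 hash functions, and the 0.8 similarity threshold are assumed values.

```python
import hashlib
import re

NUM_PERM = 64  # number of hash functions in each signature (assumed value)


def shingles(text, k=3):
    """Return the set of character k-grams of whitespace-normalized text."""
    text = re.sub(r"\s+", " ", text.strip())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(lines, threshold=0.8):
    """Keep the first occurrence of each group of near-duplicate lines."""
    kept, signatures = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sig = minhash_signature(line)
        if any(estimated_jaccard(sig, other) >= threshold for other in signatures):
            continue  # too similar to a line we already kept
        kept.append(line)
        signatures.append(sig)
    return kept
```

This version compares each new line against every kept line, which is quadratic in the number of questions. The datasketch package provides MinHash and MinHashLSH classes that avoid the pairwise comparison on larger files.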

Note

  • These scripts are specialized for retrieving questions in Arabic. For English content, you can use https://curlconverter.com/python/, which converts a cURL command into Python code, to obtain the necessary GQL information from Quora.

Disclaimer

This software is provided "as is", without warranty of any kind. The user assumes full responsibility for the use of the software. The developer is not responsible for any direct, indirect, incidental, special, exemplary, or consequential damages resulting from the use of the software. The user must comply with Quora's Terms of Service and ensure ethical and legal use of the software. This tool is intended for non-commercial use only.