|
1 | 1 | # Project Proposal
|
2 | 2 |
|
3 |
| -## Finding Insights from Stackoverflow Developer Survey |
| 3 | +## Finding Insights from Stack Overflow Developer Survey |
4 | 4 |
|
5 |
| -Stack overflow is a professional community for developers, Stackoverflow conducts a survey every year the collected data from 2011 has been available for open source on the web with the latest dataset 2020 released on March 5th, 2021. If the dataset analysed professionally using modern tools, would enable us to answer real-world questions effectively. The dataset has covered 275 questions in total. |
| 5 | +Stack Overflow is a professional community for developers that conducts an annual survey. The collected data from 2011 onwards is openly available on the web, with the 2020 dataset being the latest. Analyzing this dataset with modern tools would enable us to answer real-world questions effectively. The dataset includes responses to 275 questions. |
6 | 6 |
|
7 | 7 | ### Project Goal:
|
8 | 8 |
|
9 |
| -1. To perform Analysis on 3 years Stackoverflow Dataset and get insights. |
10 |
| -2. To perform Data Analysis and answer the below questions. |
11 |
| - + Impact of igher education on salary of the surveyed developers. |
12 |
| - + Impact of education/experience/responsibilities on gender inequalities. |
13 |
| - + Impact on participation rate due to different ethnicity. |
14 |
| - + To find whether there is any difference between men and women's income. |
15 |
| - + Impact on the increase in popularity of a language in the current year due to developer’s interest in the previous year. |
16 |
| - |
17 |
| -3. To perform data visualization on |
18 |
| - |
19 |
| - - The most commonly used language. |
20 |
| - |
21 |
| - - Distribution of surveyors based on their developer role. |
22 |
| - |
23 |
| - - Factors affecting Job satisfaction. |
24 |
| - |
25 |
| - - Predicting the growth of languages for upcoming years based on the survey answers. |
26 |
| - |
27 |
| - ###### The Insights can be used to provide information regarding IT environment, hiring employees and job seekers and build a solid résumé. |
| 9 | +1. **Perform analysis on 3 years of Stack Overflow survey data:** Extract insights from the data. |
| 10 | +2. **Data Analysis Goals:** Answer the following questions: |
| 11 | + - What is the impact of higher education on the salary of surveyed developers? |
| 12 | + - How do education, experience, and responsibilities affect gender inequalities? |
| 13 | + - How does ethnicity impact participation rates? |
| 14 | + - Is there a difference in income between men and women? |
| 15 | + - How does the previous year's interest in a language affect its popularity in the current year? |
| 16 | +3. **Data Visualization Goals:** |
| 17 | + - Identify the most commonly used language. |
| 18 | + - Analyze the distribution of survey respondents based on their developer roles. |
| 19 | + - Explore factors affecting job satisfaction. |
| 20 | + - Predict the growth of languages for upcoming years based on survey answers. |
| 21 | + - Provide insights for IT environment, hiring employees, job seekers, and building a solid résumé. |
28 | 22 |
|
29 | 23 | ### Data Source and Background
|
30 | 24 |
|
31 |
| -The dataset is very diverse and came from a [Stackoverflow developer survey](https://insights.stackoverflow.com/survey/?_ga=2.208907280.304952146.1616422967-1864686930.1616422967) with 275 questions answered from 180 countries. Stackoverflow has data collected through surveys from 2011 to 2020, but for the project, the purpose is to analyze the data of the last 3 years. The people who completed the survey mostly from the US, India, and EMEA regions. The majority of the survey respondents had the background of developer/ coding experience. The data are available in the CSV format ranging from 40 to 150 MB with data of 1.5 Lakh survey participants.The dataset includes survey data gathered from 180 countries, the response ranges from Not at all important to very important/ Not at all satisfied to very satisfied. |
| 25 | +The dataset is sourced from the annual Stack Overflow developer survey, covering responses from developers in 180 countries. The data range from 2011 to 2020; this project focuses on the last 3 years. Respondents come primarily from the US, India, and the EMEA region, and most have a developer/coding background. Answers to the rating questions range from "Not at all important" to "Very important" and from "Not at all satisfied" to "Very satisfied." |
32 | 26 |
|
33 | 27 | ### Data Format
|
34 | 28 |
|
35 |
| -The data is in a schema CSV file that consists of 252,199 observations and 62 variables. |
36 |
| - |
37 |
| -### Projected work needs to be done for Insights. |
38 |
| - |
39 |
| -###### Data Wrangling |
40 |
| - |
41 |
| -**Dealing Null Values**: As this is a developer survey and few questions left unanswered by the respondents as ‘*NA*’ or ‘*Not Applicable*’ so dealing with null values is important to get precise information. Data conversion/ manipulation is also required, as the developer responded to the survey through radio buttons rather than yes or no pattern(Univariate analysis). |
42 |
| - |
43 |
| -###### Techniques expect to use in the project |
| 29 | +The data is in CSV format, consisting of 252,199 observations and 62 variables. |
44 | 30 |
|
45 |
| -Planning to use ML Algorithms like Random, may include, KNN, AUC for classification problems, training model, logistic regression,data visualization, parameter analysis, Linear Regreesion, Root Mean square. |
| 31 | +### Projected Work for Insights |
46 | 32 |
|
47 |
| -> Linear regression(RFE techniques) |
| 33 | +#### Data Wrangling |
48 | 34 |
|
49 |
| -$$ |
50 |
| -y = O_1X + O_2 |
51 |
| -$$ |
| 35 | +- **Dealing with Null Values:** Handle unanswered questions marked as ‘NA’ or ‘Not Applicable’ to ensure precise analysis. |
| 36 | +- **Data Conversion/Manipulation:** Convert responses into a form suitable for analysis, since respondents answered through radio buttons rather than yes/no fields (univariate analysis). |
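
The null-handling step above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the column names and sample rows are hypothetical, and treating an empty string as missing is an added assumption (the proposal only names 'NA' and 'Not Applicable').

```python
import csv
import io

# Markers treated as missing. 'NA' and 'Not Applicable' come from the proposal;
# the empty string is an added assumption.
MISSING_MARKERS = {"NA", "Not Applicable", ""}


def load_rows(csv_text):
    """Parse CSV text and replace missing-value markers with None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        cleaned = {}
        for col, value in row.items():
            if col is None:
                # Extra fields beyond the header are ignored in this sketch.
                continue
            if value is None or value.strip() in MISSING_MARKERS:
                cleaned[col] = None
            else:
                cleaned[col] = value.strip()
        rows.append(cleaned)
    return rows


def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[column] is None) / len(rows)


# Hypothetical sample; the column names are illustrative only.
SAMPLE = """Respondent,JobSat,Salary
1,Very satisfied,55000
2,NA,61000
3,Not Applicable,NA
4,Not at all satisfied,
"""

rows = load_rows(SAMPLE)
print("JobSat missing share:", missing_share(rows, "JobSat"))
print("Salary missing share:", missing_share(rows, "Salary"))
```

Reporting the share of missing answers per column before dropping or imputing anything helps decide which questions have enough responses to support the analysis.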
52 | 37 |
|
53 |
| -> Root Mean Squared Error Calculations |
| 38 | +#### Techniques Expected to Use in the Project |
54 | 39 |
|
55 |
| -$$ |
56 |
| -rmse = \sqrt{(\frac{1}{n})\sum_{i=1}^{n}(y_{i} - x_{i})^{2}} |
57 |
| -$$ |
| 40 | +- ML Algorithms: Utilize algorithms like Random Forest, KNN, and logistic regression for classification problems, and linear regression for numeric targets, with metrics such as AUC and RMSE for evaluation. |
| 41 | +- Data Visualization: Employ data visualization techniques for better understanding and presentation of insights. |
| 42 | +- Parameter Analysis: Analyze parameters to fine-tune models and improve accuracy. |
58 | 43 |
|
| 44 | +#### Project Plan |
59 | 45 |
|
| 46 | +**Week 8:** Project Base Setup |
| 47 | +- Source control setup on [GitHub](https://github.com/Sanjayviswa/Stackoverflow_survey_Analysis) |
| 48 | +- Project Management using tools like MS Project |
| 49 | +- Complete Data Wrangling & Basic Analysis |
60 | 50 |
|
61 |
| -#### Project plan |
| 51 | +**Week 10:** Baseline Model Building |
| 52 | +- Implement algorithms and build baseline models |
62 | 53 |
|
63 |
| -**Week 8:** Creating Project base, Source control([GitHub](https://github.com/Sanjayviswa/Stackoverflow_survey_Analysis)), Project Management(MS Project). |
| 54 | +**Week 11:** Model Evaluation |
| 55 | +- Run tests and evaluate the performance of models |
64 | 56 |
|
65 |
| -- Complete Data Wrangling & basic Analysis. |
| 57 | +**Week 12:** Finalization |
| 58 | +- Prepare video presentation summarizing the analysis and insights |
66 | 59 |
|
67 |
| -**Week 10**: Complete baseline Model building with algorithms. |
| 60 | +#### Additional Technical Details |
68 | 61 |
|
69 |
| -**Week 11:** Run tests and evaluate model. |
| 62 | +**Linear Regression (RFE techniques)** |
| 63 | +- Equation: \( y = O_1X + O_2 \) |
70 | 64 |
|
71 |
| -**Week 12:** Prepare video presentation. |
| 65 | +**Root Mean Squared Error (RMSE) Calculations** |
| 66 | +- Formula: \( rmse = \sqrt{\left(\frac{1}{n}\right)\sum_{i=1}^{n}(y_{i} - x_{i})^{2}} \) |
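
The two formulas above can be tried out with a small sketch. It assumes a single predictor, fits the slope and intercept by ordinary least squares, and reads the RMSE formula as comparing observed values with predictions. The function names and sample numbers are hypothetical, and the RFE feature-selection step is not covered here.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4]
salary = [3, 5, 7, 9]

slope, intercept = fit_line(years, salary)
predictions = [slope * x + intercept for x in years]
print("slope:", slope, "intercept:", intercept)
print("rmse:", rmse(salary, predictions))
```

A lower RMSE on held-out survey responses, rather than on the rows used for fitting, would be the more meaningful check when evaluating the baseline models.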
72 | 67 |
|
| 68 | +Crafted by @Sanjayviswa. |