PROJECT PART A: Exploratory Data Analysis

- Open the files for the Course Project and the data set in Doc Sharing.
- For each of the five variables, process, organize, present and summarize the data. Analyze each variable by itself using graphical and numerical summarization techniques. Use MINITAB as much as possible, explaining what the printout tells you. You may wish to use some of the following graphs: stem-and-leaf diagram, frequency/relative frequency table, histogram, boxplot, dotplot, pie chart, bar graph. Caution: not all of these are appropriate for each variable, nor are they all necessary. More is not necessarily better. In addition, be sure to find the appropriate measures of central tendency, the measures of dispersion, and the shapes of the distributions (for the quantitative variables). Where appropriate, use the five-number summary (Min, Q1, Median, Q3, Max). Once again, use MINITAB as appropriate, and explain what the results mean.
- Analyze the connections or relationships between the variables. There are ten possible pairings of two variables. Use graphical as well as numerical summary measures. Explain what you see. Be sure to consider all 10 pairings. Some variables show clear relationships, while others do not.
- Prepare your report in Microsoft Word, **integrating your graphs and tables with text explanations and interpretations.** Be sure that you have graphical and numerical backup for your explanations and interpretations. Be selective in what you include in the report. I'm not looking for a 20-page report on every variable and every possible relationship (that's 15 things to do).
- In particular, what I want you to do is highlight what you see for **three individual variables** (no more than one graph for each, one or two measures of central tendency and variability (as appropriate), the shapes of the distributions for quantitative variables, and two or three sentences of interpretation). For the 10 pairings, identify and report on only **three of the pairings**, again using graphical and numerical summaries (as appropriate), with interpretations. **Please note that at least one of your pairings must include the qualitative variable and at least one of your pairings must not include the qualitative variable.**
- All DeVry University policies are in effect, including the plagiarism policy.
- Project Part A report is due by the end of Week 2.
- Project Part A is worth 100 total points. See grading rubric below.
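
The numerical summaries requested for a single quantitative variable can also be sketched in R, for readers who want to cross-check MINITAB output. This is a minimal illustration, not part of the assignment: the vector `income` is hypothetical and stands in for one quantitative variable from the course data set. Quartiles are computed with `type = 6`, which R's `quantile` documentation describes as the definition used by Minitab.

```r
# Hypothetical values standing in for one quantitative variable
income <- c(32, 45, 51, 38, 60, 47, 55, 41, 39, 120)

# Measures of central tendency
mean_income   <- mean(income)
median_income <- median(income)

# Measures of dispersion
sd_income  <- sd(income)
iqr_income <- IQR(income, type = 6)

# Five-number summary: Min, Q1, Median, Q3, Max
five_num <- quantile(income, probs = c(0, 0.25, 0.5, 0.75, 1), type = 6)
print(five_num)

# A mean noticeably above the median hints at a right-skewed shape
right_skewed <- mean_income > median_income
print(right_skewed)
```

In this made-up sample, the single large value pulls the mean above the median, which is the kind of observation the two or three sentences of interpretation should point out.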

**Submission: The report including all relevant graphs and numerical analysis along with interpretations.**

**Format for report:**

- Brief Introduction
- Discuss your 1st individual variable, using graphical, numerical summary and interpretation
- Discuss your 2nd individual variable, using graphical, numerical summary and interpretation
- Discuss your 3rd individual variable, using graphical, numerical summary and interpretation
- Discuss your 1st pairing of variables, using graphical, numerical summary and interpretation
- Discuss your 2nd pairing of variables, using graphical, numerical summary and interpretation
- Discuss your 3rd pairing of variables, using graphical, numerical summary and interpretation
- Conclusion
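
The numerical side of the pairing sections might look like the following R sketch. It assumes a hypothetical data frame `d`, because the assignment text does not name the five variables: `income` and `credit` are invented quantitative columns and `location` is an invented qualitative column. A correlation coefficient suits a pairing of two quantitative variables, while a per-group summary suits a pairing that includes the qualitative variable.

```r
# Hypothetical data: two quantitative variables and one qualitative variable
d <- data.frame(
  income   = c(30, 42, 55, 61, 48, 75, 38, 66),
  credit   = c(1.2, 2.0, 3.1, 3.5, 2.4, 4.4, 1.9, 3.8),
  location = c("Urban", "Rural", "Urban", "Suburban",
               "Rural", "Suburban", "Rural", "Urban")
)

# Pairing without the qualitative variable: Pearson correlation
r <- cor(d$income, d$credit)
print(round(r, 3))

# Pairing with the qualitative variable: median income within each group
income_by_location <- tapply(d$income, d$location, median)
print(income_by_location)
```

A scatterplot would accompany the correlation, and side-by-side boxplots would accompany the group medians.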

Report:
Descriptive statistics play a key role in determining the type of analysis that should be performed on the given data. Because most advanced analyses, such as hypothesis tests and confidence intervals, rely on assumptions about the data, descriptive statistics help us to identify w...

Impact of Number of Siblings on Education Level and Family Income
Background
Only children have been the subjects of numerous studies, sometimes stigmatized as spoiled brats and other times regarded as high achievers. In this project, we will analyze the relationship between the number of siblings a person has or had and her/his level of education, as well as her/his family income. The question we will try to answer in this analysis is thus the following:
Are people with no or fewer siblings better educated and more likely to have a high income later in life?
To answer that question, we will use data from the General Social Survey (GSS) Cumulative File 1972-2012, which provides a sample of selected indicators on contemporary American society. Detailed information on this file can be found at https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html.
Variables
From the GSS file, we will use the following variables:
- sibs: respondent’s number of brothers and sisters - numeric variable;
- coninc: respondent’s total family income in constant dollars - numeric variable;
- degree: respondent’s highest degree - ordinal variable.
Data Processing
We first set the working directory and load the required libraries:
setwd("~/Repositories/Coursera_DataAnalysis_Duke/Project/")
library(reshape2)
library(ggplot2)
We then load the data (and cache it) and get some summary statistics:
load(url("http://bit.ly/dasi_gss_data"))
# only keep count of siblings, degree and constant family income
data <- gss[,c("sibs","degree","coninc")]
# get statistics
summary(data)
## sibs degree coninc
## Min. : 0.0 Lt High School:11822 Min. : 383
## 1st Qu.: 2.0 High School :29287 1st Qu.: 18445
## Median : 3.0 Junior College: 3070 Median...
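
One way to start examining the question from the Background is to group respondents by sibling count and compare income and degree across groups. The sketch below is only an illustration under stated assumptions: `toy` is an invented data frame that reuses the column names `sibs`, `degree` and `coninc`, its values are made up, and only the three degree levels visible in the output above are used. The cut points for `sibs_group` are likewise an arbitrary choice.

```r
# Toy stand-in for the gss subset; all values are invented for illustration
toy <- data.frame(
  sibs   = c(0, 1, 2, 3, 5, 8, 0, 4, 1, 6),
  degree = c("Junior College", "High School", "High School", "Lt High School",
             "High School", "Lt High School", "Junior College", "High School",
             "Junior College", "Lt High School"),
  coninc = c(60000, 45000, 38000, 20000, 30000, 15000, 80000, 35000, 52000, 25000)
)

# degree is ordinal, so store it as an ordered factor
toy$degree <- factor(toy$degree, ordered = TRUE,
                     levels = c("Lt High School", "High School", "Junior College"))

# Bin the sibling count into three groups: 0-1, 2-3, and 4 or more
toy$sibs_group <- cut(toy$sibs, breaks = c(-Inf, 1, 3, Inf),
                      labels = c("0-1", "2-3", "4+"))

# Median family income per sibling group (na.rm guards against missing values)
median_income <- tapply(toy$coninc, toy$sibs_group, median, na.rm = TRUE)
print(median_income)

# Cross-tabulation of sibling group against highest degree
degree_table <- table(toy$sibs_group, toy$degree)
print(degree_table)
```

Any pattern in this toy output reflects the invented values only; the same steps applied to `data` would show what the GSS sample actually supports.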

