import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import re
Overview
Updated on July 21, 2021.
The SAT, or Scholastic Assessment Test, is taken by many high school students in the USA and around the world before college applications. SAT scores are one of the kinds of information used by colleges to determine a student’s level of college readiness.
However, the SAT has faced controversy due to analyses which have shown that students’ demographics - such as sex, race, socio-economic status - may have influenced their SAT performance.
Thus, the goal of this project is to determine the ways in which New York City high school students’ demographics may affect SAT performance. Towards this end, we have:
- Downloaded various datasets on NYC high schools from NYC Open Data
- Cleaned, transformed, and combined these datasets into one DataFrame
- Conducted exploratory data analysis and visualizations
- Performed multiple linear regression
The final regression model explained 78.6% of the variance (R2= 0.786, F(6, 394) = 96.44, p < 0.01). The percentage of each race in a school significantly predicted each school’s average SAT score.
I wrote this notebook for the Dataquest course’s Guided Project: Analyzing NYC High School Data. The general project flow and research questions came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.
Package Installs
Data Transformation and Cleaning
In this section, we transform some columns in each dataset in order to get useful values. Then, we combine all of the datasets into one DataFrame. Some data cleaning will also be performed throughout the process. This is necessary in order to perform analysis using all of the columns available.
List of Datasets
The following are the datasets that were downloaded from NYC Open Data for this project. The files were renamed to have shorter names.
sat_results.csv
: 2012 SAT results (link)hs_directory
: 2010 - 2011 School Attendance and Enrollment Statistics by District (link)class_size.csv
: 2010-2011 Class Size - School-level detail (link)ap_2010.csv
: 2010 AP (College Board) School Level Results (link)graduation.csv
: 2005-2010 Graduation Outcomes - School Level (link)demographics.csv
: 2006 - 2012 School Demographics and Accountability Snapshot (link)survey_all.txt
,survey_d75.txt
: 2011 NYC School Survey (link)
The code below reads the CSVs into the program.
= [
data_files "ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
= {}
data
for filename in data_files:
= filename[:-4] # Remove ".csv"
key = pd.read_csv("./private/2021-06-13-HSD-Files/" + filename)
df
# Rename "dbn" to "DBN".
= df.rename(
df "dbn": "DBN"},
{= 1,
axis
)
# Assign the DataFrame to a key in the dictionary.
= df data[key]
As for the TXT files, these are tab-delimited and encoded using Windows-1252 encoding, so we have to specify that when we read them in. We will also concatenate these two datasets vertically because these have the same column names.
# Survey data on all community schools.
= pd.read_csv(
all_survey "./private/2021-06-13-HSD-Files/survey_all.txt",
= "\t",
delimiter = "windows-1252",
encoding
)
# Survey data on all District 75 schools.
= pd.read_csv(
d75_survey "./private/2021-06-13-HSD-Files/survey_d75.txt",
= "\t",
delimiter = "windows-1252",
encoding
)
# Concatenate the two datasets vertically into one dataset.
= pd.concat(
survey
[all_survey, d75_survey],= 0,
axis
)
# Rename "dbn" column to "DBN" for consistency.
= survey.rename(
survey "dbn": "DBN"},
{= 1,
axis
)
# Put the DataFrame into the main dictionary.
"survey"] = survey data[
Now, all of the datasets can be accessed via the data
Series.
= pd.Series(data)
data
list(data.index)
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results',
'survey']
To give an idea of what these datasets look like, we show the first few rows of the SAT results dataset below.
- Each row contains data on one school.
- Each column provides a different detail about the schools, such as school name, number of test takers in each school, etc.
data.sat_results.head()
DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | |
---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 |
3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 |
4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 |
SAT Total Scores
The SAT results dataset contains columns showing the average score in each section of the test.
However, there is no column representing the average SAT total score. This is important to know because it gauges a student’s overall mastery of high school knowledge and college preparedness.
Thus, we will generate a total score column by taking the sum of the three section score columns.
= [
score_cols "SAT Math Avg. Score",
"SAT Critical Reading Avg. Score",
"SAT Writing Avg. Score",
]
# Convert each column from string to numeric.
for col in score_cols:
= pd.to_numeric(
data.sat_results[col]
data.sat_results[col],= "coerce", # Set invalid parsing as NaN.
errors
)
# Get the total scores by summing the 3 section scores.
"sat_score"] = (
data.sat_results[
data.sat_results[score_cols]sum(
.= 1,
axis = False, # Return NaN if the row has a NaN.
skipna
)
)
# Drop rows with missing values.
= True)
data.sat_results.dropna(inplace
+ ["sat_score"]].head() data.sat_results[score_cols
SAT Math Avg. Score | SAT Critical Reading Avg. Score | SAT Writing Avg. Score | sat_score | |
---|---|---|---|---|
0 | 404.0 | 355.0 | 363.0 | 1122.0 |
1 | 423.0 | 383.0 | 366.0 | 1172.0 |
2 | 402.0 | 377.0 | 370.0 | 1149.0 |
3 | 401.0 | 414.0 | 359.0 | 1174.0 |
4 | 433.0 | 390.0 | 384.0 | 1207.0 |
The sat_score
column now shows each school’s average SAT total score.
School Location
Another piece of information we want to find is the set of geographical coordinates of each NYC school. We can find this using the hs_directory
dataset, which has a Location 1
column.
"DBN", "school_name", "Location 1"]].head() data.hs_directory[[
DBN | school_name | Location 1 | |
---|---|---|---|
0 | 17K548 | Brooklyn School for Music & Theatre | 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... |
1 | 09X543 | High School for Violin and Dance | 1110 Boston Road\nBronx, NY 10456\n(40.8276026... |
2 | 09X327 | Comprehensive Model School Project M.S. 327 | 1501 Jerome Avenue\nBronx, NY 10452\n(40.84241... |
3 | 02M280 | Manhattan Early College School for Advertising | 411 Pearl Street\nNew York, NY 10038\n(40.7106... |
4 | 28Q680 | Queens Gateway to Health Sciences Secondary Sc... | 160-20 Goethals Avenue\nJamaica, NY 11432\n(40... |
Let’s inspect the first value in this column to see its format.
print(data.hs_directory.loc[0, "Location 1"])
883 Classon Avenue
Brooklyn, NY 11225
(40.67029890700047, -73.96164787599963)
Each value is a string with three lines. The third line contains the school’s latitude and longitude in parentheses.
We can extract the coordinate data using the regular expression \((.+)\)
. This matches a pair of parentheses containing 1 or more characters between them.
def get_coords(text, which = 0):
"""Take a string, extract coordinates, and return one of the values."""
= r"\((.+)\)" # Regex
pattern = re.findall(pattern, text) # Returns a list of extracted strings
extracted = extracted[0].split(", ") # Split string into list of strings
coords
# Return one of the coordinates as a float64.
= np.float64(coords[which])
result return result
# Make latitude and longitude columns.
"lat"] = (
data.hs_directory["Location 1"]
data.hs_directory[apply(get_coords, which = 0)
.
)
"lon"] = (
data.hs_directory["Location 1"]
data.hs_directory[apply(get_coords, which = 1)
.
)
"lat", "lon"]].head() data.hs_directory[[
lat | lon | |
---|---|---|
0 | 40.670299 | -73.961648 |
1 | 40.827603 | -73.904475 |
2 | 40.842414 | -73.916162 |
3 | 40.710679 | -74.000807 |
4 | 40.718810 | -73.806500 |
We have successfully extracted the latitude and longitude from each string.
Survey Dataset Columns
Another issue is that there are too many columns in the survey
dataset:
data.survey.shape
(1702, 2773)
There are 2773 columns. The reason is that each column represents a different survey item or option. It appears that the survey had many items.
However, if we look at the data dictionary, we can see that there are only a few very important columns:
= [
important_fields # Unique DBN of each school
"DBN",
# Response rates
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
# Scores about various aspects of the school experience
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
Thus, we will only keep these columns in survey
and remove the rest.
= data.survey[important_fields] data.survey
DBN Column
Earlier, we saw that the first column in the sat_results
dataset was DBN
.
data.sat_results.head()
DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | sat_score | |
---|---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 |
3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414.0 | 401.0 | 359.0 | 1174.0 |
4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 |
The DBN, or District Borough Number, is a unique code that identifies each school in NYC.
Since it is unique to each school, we can use it to match rows across our datasets and combine them into one. However, the class_size
dataset doesn’t have a DBN column.
data.class_size.head()
CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN |
1 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN |
2 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | GEN ED | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
3 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
4 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN |
By comparing the sat_results
dataset to the class_size
dataset, we can see that the DBN is actually a combination of:
- the CSD (Community School District), like “01”.
- the School Code, like “M015”.
The CSD must have two digits, so single-digit CSDs need to be padded with a zero in front. This is done below.
def zero_pad(num):
"""Zero-pad a number if it only has 1 digit."""
= len(str(num))
digits
if digits == 1:
= str(num).zfill(2)
zero_padded return zero_padded
elif digits == 2:
return str(num)
else:
raise ValueError("Invalid number of digits.")
"padded_csd"] = data.class_size["CSD"].apply(zero_pad)
data.class_size[
"CSD", "padded_csd", "SCHOOL CODE"]].head() data.class_size[[
CSD | padded_csd | SCHOOL CODE | |
---|---|---|---|
0 | 1 | 01 | M015 |
1 | 1 | 01 | M015 |
2 | 1 | 01 | M015 |
3 | 1 | 01 | M015 |
4 | 1 | 01 | M015 |
Now, we can combine padded_csd
and SCHOOL CODE
to make the DBN.
"DBN"] = (
data.class_size["padded_csd", "SCHOOL CODE"]]
data.class_size[[sum(axis = 1) # Add values along rows.
.
)
# Reorder the df so that DBN is in front.
def reorder_columns(df, first_cols):
"""Take a DataFrame and a list of columns. Move those columns to the left side. Return the new DataFrame."""
= df[
result
first_cols+ [col for col in df if col not in first_cols]
]
return result
= reorder_columns(data.class_size, ["DBN"])
data.class_size
# Remove padded_csd column
data.class_size.drop("padded_csd",
= 1,
axis = True,
inplace
)
data.class_size.head()
DBN | CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN |
1 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN |
2 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | GEN ED | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
3 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
4 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN |
We were successful in creating a DBN
column for the class size dataset.
Condensing Datasets
Another issue that needs to be addressed is that some datasets have multiple rows with the same DBN. For example, look at the frequency table of DBN values in class_size
.
"DBN"].value_counts() data.class_size[
15K429 57
09X505 56
09X517 56
21K690 52
15K448 52
..
27Q273 3
03M452 3
09X090 2
27Q465 2
02M267 2
Name: DBN, Length: 1487, dtype: int64
Each DBN appears at least twice in the dataset. This would be problematic when merging it with other datasets, since each row must have a unique DBN value.
Therefore, we need to condense class_size
and other such datasets so that the DBN value is unique to each row.
The code below identifies which columns need to be condensed based on the frequency of its DBN values.
[
df_namefor df_name in data.index
if data[df_name]["DBN"].value_counts().max() > 1
# Check if any DBN values appear more than once.
]
['ap_2010', 'class_size', 'demographics', 'graduation']
The 4 datasets listed above will be condensed one by one.
Advanced Placement Dataset
First, the ap_2010
dataset will be condensed. It contains data on AP (Advanced Placement) test results in 2010.
print(data.ap_2010.shape)
data.ap_2010.head()
(258, 5)
DBN | SchoolName | AP Test Takers | Total Exams Taken | Number of Exams with scores 3 4 or 5 | |
---|---|---|---|---|---|
0 | 01M448 | UNIVERSITY NEIGHBORHOOD H.S. | 39 | 49 | 10 |
1 | 01M450 | EAST SIDE COMMUNITY HS | 19 | 21 | s |
2 | 01M515 | LOWER EASTSIDE PREP | 24 | 26 | 24 |
3 | 01M539 | NEW EXPLORATIONS SCI,TECH,MATH | 255 | 377 | 191 |
4 | 02M296 | High School of Hospitality Management | s | s | s |
From the first five rows, it is not immediately clear why DBN values repeat. The DBNs are unique in this sample.
Let’s check the frequency table of DBN values.
"DBN"].value_counts() data.ap_2010[
04M610 2
01M448 1
19K507 1
17K528 1
17K537 1
..
09X329 1
09X365 1
09X412 1
09X414 1
32K556 1
Name: DBN, Length: 257, dtype: int64
This frequency table is ordered by frequency, descending. Thus, we can tell that 04M610
is the only DBN that repeats in this dataset. Let’s inspect the 2 rows with this DBN.
data.ap_2010.loc["DBN"] == "04M610"
data.ap_2010[ ]
DBN | SchoolName | AP Test Takers | Total Exams Taken | Number of Exams with scores 3 4 or 5 | |
---|---|---|---|---|---|
51 | 04M610 | THE YOUNG WOMEN'S LEADERSHIP SCHOOL OF EAST HA... | 41 | 55 | 29 |
52 | 04M610 | YOUNG WOMEN'S LEADERSHIP SCH | s | s | s |
It looks like the duplication was caused by a simple error in data entry. Row 51 is valid, whereas row 52 has an incomplete school name. It also has a string “s” in numerical columns; this is invalid data.
Thus, we will drop row 52 from the dataset.
52, inplace = True)
data.ap_2010.drop(
"DBN"].value_counts().max() data.ap_2010[
1
Now that we’ve dropped the inaccurate row, none of the DBN values repeat.
As a side note, since there were “s” strings in the numerical columns, we realize that these are object-type columns. Thus, we have to convert these into numerical columns in order to perform analysis on them later on.
= ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
cols
for col in cols:
# Convert each column to numerical format.
= pd.to_numeric(
data.ap_2010[col]
data.ap_2010[col],= "coerce", # In case of errors, use NaN values.
errors
)
data.ap_2010.dtypes
DBN object
SchoolName object
AP Test Takers float64
Total Exams Taken float64
Number of Exams with scores 3 4 or 5 float64
dtype: object
The numerical columns were successfully converted into float64
(decimal) format.
Class Size Dataset
Next, we will condense the class_size
dataset. It contains aggregated data on the number of students in classes in NYC schools.
Why are there rows with duplicate DBNs? Let’s look at the first few rows.
print(data.class_size.shape)
data.class_size.head()
(27611, 17)
DBN | CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN |
1 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN |
2 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | GEN ED | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
3 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
4 | 01M015 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN |
Look at the GRADE
and PROGRAM TYPE
columns. Each school’s data is split by grade level, from K to 12. The data are further split by the program type (GEN ED, CTT, etc.). That’s why each DBN has multiple rows.
In our case, we are only concerned with high school students’ data (grades 9 to 12) since the SAT is taken at this level of education. Let’s see which GRADE
value/s correspond to high school.
"GRADE "].value_counts().sort_index() data.class_size[
01 1185
02 1167
03 1143
04 1140
05 1086
06 846
07 778
08 735
09 20
09-12 10644
0K 1237
0K-09 1384
MS Core 4762
Name: GRADE , dtype: int64
Conveniently, there is a 09-12
value. Rows with this value would contain aggregated data on all high school grade levels. Thus, we will filter the class_size
dataset to keep only these rows.
There is still the issue of the PROGRAM TYPE
column, however. Let’s look at its frequency table.
"PROGRAM TYPE"].value_counts() data.class_size[
GEN ED 14545
CTT 7460
SPEC ED 3653
G&T 469
Name: PROGRAM TYPE, dtype: int64
There are hundreds or thousands of rows under each program type. According to the data dictionary, the following are the meanings of these values:
GEN ED
: General EducationCTT
: Collaborative Team TeachingSPEC ED
: Special EducationG&T
: Gifted & Talented
The General Education program best represents the majority of high school students, and this is the most frequent program in the data. Thus, we will keep the GEN ED
rows and drop the rest.
To recap, we chose to filter the dataset to keep rows where the grade level is from 9 to 12 and the program type is General Education. We will do this using the code below.
= data.class_size.loc[
data.class_size "GRADE "] == "09-12")
(data.class_size[& (data.class_size["PROGRAM TYPE"] == "GEN ED")
]
print(data.class_size.shape)
data.class_size.head()
(6513, 17)
DBN | CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
225 | 01M292 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 9 | - | 63.0 | 3.0 | 21.0 | 19.0 | 25.0 | STARS | NaN |
226 | 01M292 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 10 | - | 79.0 | 3.0 | 26.3 | 24.0 | 31.0 | STARS | NaN |
227 | 01M292 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 11 | - | 38.0 | 2.0 | 19.0 | 16.0 | 22.0 | STARS | NaN |
228 | 01M292 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | ENGLISH | English 12 | - | 69.0 | 3.0 | 23.0 | 13.0 | 30.0 | STARS | NaN |
229 | 01M292 | 1 | M | M292 | Henry Street School for International Studies | 09-12 | GEN ED | MATH | Integrated Algebra | - | 53.0 | 3.0 | 17.7 | 16.0 | 21.0 | STARS | NaN |
The first few rows of the dataset show us that we were successful in filtering rows by GRADE
and PROGRAM TYPE
. From 27611 rows, we have narrowed the dataset down to 6513 rows.
However, there are still duplicate DBN values. The new problem is that the data are divided by the following columns:
CORE SUBJECT (MS CORE and 9-12 ONLY)
CORE COURSE (MS CORE and 9-12 ONLY)
These columns contain the specific subject or course of the class. We want to get class size data that accounts for all subjects, so we cannot simply filter the dataset by 1 subject.
Instead, we will:
- Group the data by DBN value.
- Aggregate the numerical data by taking the mean of each column in each group.
= (
data.class_size
data.class_size"DBN") # Group by the school's unique DBN
.groupby(# Aggregate numerical columns using mean
.agg(np.mean) # Put the DBN column back in the df
.reset_index()
)
print(data.class_size.shape)
data.class_size.head()
(583, 8)
DBN | CSD | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|
0 | 01M292 | 1.0 | 88.0000 | 4.000000 | 22.564286 | 18.50 | 26.571429 | NaN |
1 | 01M332 | 1.0 | 46.0000 | 2.000000 | 22.000000 | 21.00 | 23.500000 | NaN |
2 | 01M378 | 1.0 | 33.0000 | 1.000000 | 33.000000 | 33.00 | 33.000000 | NaN |
3 | 01M448 | 1.0 | 105.6875 | 4.750000 | 22.231250 | 18.25 | 27.062500 | NaN |
4 | 01M450 | 1.0 | 57.6000 | 2.733333 | 21.200000 | 19.40 | 22.866667 | NaN |
The following changes have occurred:
- The dataset now has only 583 rows.
- Only the DBN, CSD, and numerical columns remain.
- The numerical columns contain decimals since the data were aggregated.
Let’s check what the maximum frequency of DBN values is.
"DBN"].value_counts().max() data.class_size[
1
Each DBN only appears once in the data. We have condensed the dataset successfully.
Demographics Dataset
Next, we condense the Demographics dataset. It contains data on NYC high school students’ grade level, gender, ethnicity, etc. from 2006 to 2012.
Let’s try to identify why each DBN has multiple rows.
print(data.demographics.shape)
data.demographics.head()
(10075, 38)
DBN | Name | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k | grade1 | grade2 | ... | black_num | black_per | hispanic_num | hispanic_per | white_num | white_per | male_num | male_per | female_num | female_per | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20052006 | 89.4 | NaN | 281 | 15 | 36 | 40 | 33 | ... | 74 | 26.3 | 189 | 67.3 | 5 | 1.8 | 158.0 | 56.2 | 123.0 | 43.8 |
1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20062007 | 89.4 | NaN | 243 | 15 | 29 | 39 | 38 | ... | 68 | 28.0 | 153 | 63.0 | 4 | 1.6 | 140.0 | 57.6 | 103.0 | 42.4 |
2 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20072008 | 89.4 | NaN | 261 | 18 | 43 | 39 | 36 | ... | 77 | 29.5 | 157 | 60.2 | 7 | 2.7 | 143.0 | 54.8 | 118.0 | 45.2 |
3 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20082009 | 89.4 | NaN | 252 | 17 | 37 | 44 | 32 | ... | 75 | 29.8 | 149 | 59.1 | 7 | 2.8 | 149.0 | 59.1 | 103.0 | 40.9 |
4 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20092010 | 96.5 | 208 | 16 | 40 | 28 | 32 | ... | 67 | 32.2 | 118 | 56.7 | 6 | 2.9 | 124.0 | 59.6 | 84.0 | 40.4 |
5 rows × 38 columns
Based on the schoolyear
column, each school has a separate row for each academic year. Let’s inspect the years available.
"schoolyear"].value_counts().sort_index() data.demographics[
20052006 1356
20062007 1386
20072008 1410
20082009 1441
20092010 1475
20102011 1498
20112012 1509
Name: schoolyear, dtype: int64
One of the values represents school year 2011-2012. This is relevant to our analysis because the SAT results dataset is from 2012. Thus, we will filter demographics
to include only this year.
= data.demographics.loc[
data.demographics "schoolyear"] == 20112012
data.demographics[
]
print(data.demographics.shape)
data.demographics.head()
(1509, 38)
DBN | Name | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k | grade1 | grade2 | ... | black_num | black_per | hispanic_num | hispanic_per | white_num | white_per | male_num | male_per | female_num | female_per | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20112012 | NaN | 89.4 | 189 | 13 | 31 | 35 | 28 | ... | 63 | 33.3 | 109 | 57.7 | 4 | 2.1 | 97.0 | 51.3 | 92.0 | 48.7 |
13 | 01M019 | P.S. 019 ASHER LEVY | 20112012 | NaN | 61.5 | 328 | 32 | 46 | 52 | 54 | ... | 81 | 24.7 | 158 | 48.2 | 28 | 8.5 | 147.0 | 44.8 | 181.0 | 55.2 |
20 | 01M020 | PS 020 ANNA SILVER | 20112012 | NaN | 92.5 | 626 | 52 | 102 | 121 | 87 | ... | 55 | 8.8 | 357 | 57.0 | 16 | 2.6 | 330.0 | 52.7 | 296.0 | 47.3 |
27 | 01M034 | PS 034 FRANKLIN D ROOSEVELT | 20112012 | NaN | 99.7 | 401 | 14 | 34 | 38 | 36 | ... | 90 | 22.4 | 275 | 68.6 | 8 | 2.0 | 204.0 | 50.9 | 197.0 | 49.1 |
35 | 01M063 | PS 063 WILLIAM MCKINLEY | 20112012 | NaN | 78.9 | 176 | 18 | 20 | 30 | 21 | ... | 41 | 23.3 | 110 | 62.5 | 15 | 8.5 | 97.0 | 55.1 | 79.0 | 44.9 |
5 rows × 38 columns
Filtering the dataset has reduced it to 1509 rows. Also, if we look at the DBN column, the values appear to be distinct now. Let’s check the maximum frequency of its values:
"DBN"].value_counts().max() data.demographics[
1
Each DBN value appears only once.
Graduation Dataset
Lastly, we will condense the graduation dataset. It contains data on the graduation rate in the Cohorts of 2001 to 2006. The cohort of a year is the set of students that entered Grade 9 in that year. Thus, the cohorts in this dataset graduated between 2005 and 2010.
Below are the first five rows of the dataset.
print(data.graduation.shape)
5) data.graduation.head(
(25096, 23)
Demographic | DBN | School Name | Cohort | Total Cohort | Total Grads - n | Total Grads - % of cohort | Total Regents - n | Total Regents - % of cohort | Total Regents - % of grads | ... | Regents w/o Advanced - n | Regents w/o Advanced - % of cohort | Regents w/o Advanced - % of grads | Local - n | Local - % of cohort | Local - % of grads | Still Enrolled - n | Still Enrolled - % of cohort | Dropped Out - n | Dropped Out - % of cohort | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2003 | 5 | s | s | s | s | s | ... | s | s | s | s | s | s | s | s | s | s |
1 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2004 | 55 | 37 | 67.3% | 17 | 30.9% | 45.9% | ... | 17 | 30.9% | 45.9% | 20 | 36.4% | 54.1% | 15 | 27.3% | 3 | 5.5% |
2 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2005 | 64 | 43 | 67.2% | 27 | 42.2% | 62.8% | ... | 27 | 42.2% | 62.8% | 16 | 25% | 37.200000000000003% | 9 | 14.1% | 9 | 14.1% |
3 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006 | 78 | 43 | 55.1% | 36 | 46.2% | 83.7% | ... | 36 | 46.2% | 83.7% | 7 | 9% | 16.3% | 16 | 20.5% | 11 | 14.1% |
4 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006 Aug | 78 | 44 | 56.4% | 37 | 47.4% | 84.1% | ... | 37 | 47.4% | 84.1% | 7 | 9% | 15.9% | 15 | 19.2% | 11 | 14.1% |
5 rows × 23 columns
At first glance, the Cohort
column appears to be responsible for the duplication of DBNs. Each school has a separate row for each cohort. Below are the cohort values:
"Cohort"].value_counts().sort_index() data.graduation[
2001 2637
2002 3095
2003 3432
2004 3708
2005 3963
2006 4130
2006 Aug 4131
Name: Cohort, dtype: int64
As expected, the cohorts range from 2001 to 2006. There also appears to be an extra value representing August of 2006.
Since we can only keep 1 row from each school, it is best to keep the 2006
rows since these are the most recent.
= data.graduation.loc[
data.graduation "Cohort"] == "2006"
data.graduation[
]
print(data.graduation.shape)
data.graduation.head()
(4130, 23)
Demographic | DBN | School Name | Cohort | Total Cohort | Total Grads - n | Total Grads - % of cohort | Total Regents - n | Total Regents - % of cohort | Total Regents - % of grads | ... | Regents w/o Advanced - n | Regents w/o Advanced - % of cohort | Regents w/o Advanced - % of grads | Local - n | Local - % of cohort | Local - % of grads | Still Enrolled - n | Still Enrolled - % of cohort | Dropped Out - n | Dropped Out - % of cohort | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006 | 78 | 43 | 55.1% | 36 | 46.2% | 83.7% | ... | 36 | 46.2% | 83.7% | 7 | 9% | 16.3% | 16 | 20.5% | 11 | 14.1% |
10 | Total Cohort | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 2006 | 124 | 53 | 42.7% | 42 | 33.9% | 79.2% | ... | 34 | 27.4% | 64.2% | 11 | 8.9% | 20.8% | 46 | 37.1% | 20 | 16.100000000000001% |
17 | Total Cohort | 01M450 | EAST SIDE COMMUNITY SCHOOL | 2006 | 90 | 70 | 77.8% | 67 | 74.400000000000006% | 95.7% | ... | 67 | 74.400000000000006% | 95.7% | 3 | 3.3% | 4.3% | 15 | 16.7% | 5 | 5.6% |
24 | Total Cohort | 01M509 | MARTA VALLE HIGH SCHOOL | 2006 | 84 | 47 | 56% | 40 | 47.6% | 85.1% | ... | 23 | 27.4% | 48.9% | 7 | 8.300000000000001% | 14.9% | 25 | 29.8% | 5 | 6% |
31 | Total Cohort | 01M515 | LOWER EAST SIDE PREPARATORY HIGH SCHO | 2006 | 193 | 105 | 54.4% | 91 | 47.2% | 86.7% | ... | 22 | 11.4% | 21% | 14 | 7.3% | 13.3% | 53 | 27.5% | 35 | 18.100000000000001% |
5 rows × 23 columns
The dataset is left with 4130 rows. Let’s check if the DBNs are unique.
"DBN"].value_counts().max() data.graduation[
13
It seems that the DBNs still have up to 13 duplicates each.
This may be due to the Demographic
column. What if it contains values other than “Total Cohort”?
"Demographic"].value_counts() data.graduation[
Special Education Students 411
Total Cohort 405
General Education Students 405
English Proficient Students 401
Female 396
Hispanic 395
Male 395
Black 394
English Language Learners 340
Asian 296
White 292
Name: Demographic, dtype: int64
Indeed, each school has a separate row for each demographic. Since we want to include all or most of the students, we will take the “Total Cohort” rows.
= data.graduation.loc[
data.graduation "Demographic"] == "Total Cohort"
data.graduation[
]
print(data.graduation.shape)
data.graduation.head()
(405, 23)
Demographic | DBN | School Name | Cohort | Total Cohort | Total Grads - n | Total Grads - % of cohort | Total Regents - n | Total Regents - % of cohort | Total Regents - % of grads | ... | Regents w/o Advanced - n | Regents w/o Advanced - % of cohort | Regents w/o Advanced - % of grads | Local - n | Local - % of cohort | Local - % of grads | Still Enrolled - n | Still Enrolled - % of cohort | Dropped Out - n | Dropped Out - % of cohort | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Total Cohort | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL | 2006 | 78 | 43 | 55.1% | 36 | 46.2% | 83.7% | ... | 36 | 46.2% | 83.7% | 7 | 9% | 16.3% | 16 | 20.5% | 11 | 14.1% |
10 | Total Cohort | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 2006 | 124 | 53 | 42.7% | 42 | 33.9% | 79.2% | ... | 34 | 27.4% | 64.2% | 11 | 8.9% | 20.8% | 46 | 37.1% | 20 | 16.100000000000001% |
17 | Total Cohort | 01M450 | EAST SIDE COMMUNITY SCHOOL | 2006 | 90 | 70 | 77.8% | 67 | 74.400000000000006% | 95.7% | ... | 67 | 74.400000000000006% | 95.7% | 3 | 3.3% | 4.3% | 15 | 16.7% | 5 | 5.6% |
24 | Total Cohort | 01M509 | MARTA VALLE HIGH SCHOOL | 2006 | 84 | 47 | 56% | 40 | 47.6% | 85.1% | ... | 23 | 27.4% | 48.9% | 7 | 8.300000000000001% | 14.9% | 25 | 29.8% | 5 | 6% |
31 | Total Cohort | 01M515 | LOWER EAST SIDE PREPARATORY HIGH SCHO | 2006 | 193 | 105 | 54.4% | 91 | 47.2% | 86.7% | ... | 22 | 11.4% | 21% | 14 | 7.3% | 13.3% | 53 | 27.5% | 35 | 18.100000000000001% |
5 rows × 23 columns
After the dataset was filtered for the second time, it was left with 405 rows.
Let’s check the maximum frequency of DBNs again.
"DBN"].value_counts().max() data.graduation[
1
The DBNs are now unique to each row in the dataset.
Column Names
We’re almost at the point where we can combine the datasets. However, the final dataset would have many columns. It would be good to be able to know which dataset each column came from. Thus, we will use the following prefix system:
A_
: Theap_2010
datasetC_
: Theclass_size
datasetD_
: Thedemographics
datasetG_
: Thegraduation
datasetH_
: Thehs_directory
datasetR_
: Thesat_results
datasetS_
: Thesurvey
dataset
I add these prefixes using the code below.
for df_name in data.index:
# Take the first letter, capitalize it, and append an underscore.
if df_name == "sat_results":
= "R_"
prefix else:
= df_name[0].upper() + "_"
prefix
# Add the prefix to each column label.
= data[df_name]
df = [col for col in df if col != "DBN"]
non_dbn
= df[["DBN"]]
dbn_col = df[non_dbn].add_prefix(prefix)
other_cols
# Put the df back into the data Series.
= pd.concat(
data[df_name]
[dbn_col, other_cols],= 1,
axis )
For example, let’s see what ap_2010
looks like now:
data.ap_2010.head()
DBN | A_SchoolName | A_AP Test Takers | A_Total Exams Taken | A_Number of Exams with scores 3 4 or 5 | |
---|---|---|---|---|---|
0 | 01M448 | UNIVERSITY NEIGHBORHOOD H.S. | 39.0 | 49.0 | 10.0 |
1 | 01M450 | EAST SIDE COMMUNITY HS | 19.0 | 21.0 | NaN |
2 | 01M515 | LOWER EASTSIDE PREP | 24.0 | 26.0 | 24.0 |
3 | 01M539 | NEW EXPLORATIONS SCI,TECH,MATH | 255.0 | 377.0 | 191.0 |
4 | 02M296 | High School of Hospitality Management | NaN | NaN | NaN |
The prefixes were added successfully.
Combining Datasets
At last, we are ready to combine all of the datasets into one. Datasets be merged on the DBN
column since this is unique to each row.
We will start with the sat_results
dataset as the first “left” dataset in the merging process.
We will consider two ways to merge datasets:
- Left join: Keep all rows in the “left” dataset, but remove rows in the “right” dataset if their DBNs had no match.
- Inner join: Only keep rows where the DBN exists in both datasets.
Before we choose, though, we will have to find out whether the DBNs in sat_results
match the ones present in other datasets.
Heatmap of DBNs
In previous Dataquest guided projects, heatmaps were sometimes used to visualize correlation tables and missing values. I realized that this could be extended to the situation of the DBNs.
What if we created a DataFrame where:
- The columns represent the 7 datasets.
- The indices represent the unique DBNs.
- Each value is
True
if the DBN appears in the specific dataset, elseFalse
.
Then, we would be able to create a heatmap to visualize which DBNs were present and absent in each dataset.
I wrote the code below in order to make the DataFrame.
= []
col_list
for df_name in data.index:
= data[df_name][["DBN"]] # Get a 1-column DataFrame of DBN values.
dbn = dbn.set_index("DBN") # Set the DBNs as the indices. There are no columns now.
new_col = 1 # Create a new column of ones.
new_col[df_name]
# Append the column to the list.
col_list.append(new_col)
# Combine the columns into one DataFrame.
= pd.concat(
dbn_grid
col_list,= 1,
axis # Replace values with False if they're null, else True.
).notnull()
dbn_grid.head()
ap_2010 | class_size | demographics | graduation | hs_directory | sat_results | survey | |
---|---|---|---|---|---|---|---|
DBN | |||||||
01M448 | True | True | True | True | True | True | True |
01M450 | True | True | True | True | True | True | True |
01M515 | True | True | True | True | False | True | True |
01M539 | True | True | True | True | True | True | True |
02M296 | True | True | True | True | True | True | True |
Let’s take the top left cell as an example. Its value is True
. This means that the DBN 01M448 appears in the ap_2010
dataset.
Since SAT performance is our focus, we can filter the rows in the DataFrame to only include the ones where the DBN is present in sat_results
.
= dbn_grid.loc[
dbn_grid "sat_results"] # Keep rows where this column contains True values.
dbn_grid[ ]
Now, we can plot our heatmap. Above it, we will also print a table of the exact number of relevant DBNs present in each dataset.
print("Counts of relevant DBNs in each dataset")
print(dbn_grid.sum())
# Initiate the figure
= (15, 5))
plt.figure(figsize
# Create a heatmap of the boolean dataframe
sns.heatmap(
dbn_grid,= True, # Include a color bar.
cbar = False, # Do not show y-tick labels.
yticklabels
)
"Heatmap of Relevant DBN Values in each Dataset")
plt.title("Dataset")
plt.xlabel(= 0)
plt.xticks(rotation plt.show()
Counts of relevant DBNs in each dataset
ap_2010 251
class_size 401
demographics 412
graduation 381
hs_directory 339
sat_results 421
survey 418
dtype: int64
Each column in the heatmap corresponds to one of our 7 datasets. The y-axis represents the unique DBNs.
The DBNs present in a dataset are represented by a light brown color. The missing DBNs are represented by black.
The following can be observed:
- The
sat_results
dataset is complete since we are using it as our basis for which DBNs are “relevant.” There are 421 relevant DBNs in total. - The
ap_2010
dataset has the least DBNs, only 251. hs_directory
andgraduation
have missing DBNs scattered around. They have less than 400 DBNs each.class_size
,demographics
, andsurvey
are the most complete datasets, with over 400 DBNs.
This information will help us in choosing how to merge the datasets.
Left join
The advantage of performing a left join is apparent when the “right” dataset has many missing DBN values.
- With a left join, all of the
sat_results
rows would be kept. - With an inner join, the number of rows would be significantly reduced.
We saw that ap_2010
, hs_directory
, and graduation
have the most missing DBN values, so these are good candidates for a left join. These 3 datasets provide information that is supplementary but not very crucial to our analysis, so it is fine for some rows to contain missing values.
= data.sat_results
combined
for df_name in ["ap_2010", "hs_directory", "graduation"]:
# Merge horizontally.
= combined.merge(
combined
data[df_name],= "DBN", # Merge on DBN column
on = "left", # Left join
how
)
print(combined.shape)
combined.head()
(421, 92)
DBN | R_SCHOOL NAME | R_Num of SAT Test Takers | R_SAT Critical Reading Avg. Score | R_SAT Math Avg. Score | R_SAT Writing Avg. Score | R_sat_score | A_SchoolName | A_AP Test Takers | A_Total Exams Taken | ... | G_Regents w/o Advanced - n | G_Regents w/o Advanced - % of cohort | G_Regents w/o Advanced - % of grads | G_Local - n | G_Local - % of cohort | G_Local - % of grads | G_Still Enrolled - n | G_Still Enrolled - % of cohort | G_Dropped Out - n | G_Dropped Out - % of cohort | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 | NaN | NaN | NaN | ... | 36 | 46.2% | 83.7% | 7 | 9% | 16.3% | 16 | 20.5% | 11 | 14.1% |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 | UNIVERSITY NEIGHBORHOOD H.S. | 39.0 | 49.0 | ... | 34 | 27.4% | 64.2% | 11 | 8.9% | 20.8% | 46 | 37.1% | 20 | 16.100000000000001% |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 | EAST SIDE COMMUNITY HS | 19.0 | 21.0 | ... | 67 | 74.400000000000006% | 95.7% | 3 | 3.3% | 4.3% | 15 | 16.7% | 5 | 5.6% |
3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414.0 | 401.0 | 359.0 | 1174.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 | NaN | NaN | NaN | ... | 23 | 27.4% | 48.9% | 7 | 8.300000000000001% | 14.9% | 25 | 29.8% | 5 | 6% |
5 rows × 92 columns
The dataset still has all of the data from sat_results
due to the left join. There are some missing values where the other datasets didn’t find a matching DBN.
Inner join
Inner joins are advantageous when the “right” dataset is important to analysis. We would want each row to have complete data from this dataset.
In our case, the remaining datasets to merge are class_size
, demographics
, and survey
. We need the latter two datasets to be complete, since we aim to explore the relationships between student demographics and SAT performance. Thus, we will use an inner join.
for df_name in ["class_size", "demographics", "survey"]:
# Merge horizontally.
= combined.merge(
combined
data[df_name],= "DBN", # Merge on DBN column
on = "inner", # Inner join
how
)
print(combined.shape)
combined.head()
(401, 158)
DBN | R_SCHOOL NAME | R_Num of SAT Test Takers | R_SAT Critical Reading Avg. Score | R_SAT Math Avg. Score | R_SAT Writing Avg. Score | R_sat_score | A_SchoolName | A_AP Test Takers | A_Total Exams Taken | ... | S_eng_t_11 | S_aca_t_11 | S_saf_s_11 | S_com_s_11 | S_eng_s_11 | S_aca_s_11 | S_saf_tot_11 | S_com_tot_11 | S_eng_tot_11 | S_aca_tot_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 | NaN | NaN | NaN | ... | 6.1 | 6.5 | 6.0 | 5.6 | 6.1 | 6.7 | 6.7 | 6.2 | 6.6 | 7.0 |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 | UNIVERSITY NEIGHBORHOOD H.S. | 39.0 | 49.0 | ... | 6.6 | 7.3 | 6.0 | 5.7 | 6.3 | 7.0 | 6.8 | 6.3 | 6.7 | 7.2 |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 | EAST SIDE COMMUNITY HS | 19.0 | 21.0 | ... | 8.0 | 8.8 | NaN | NaN | NaN | NaN | 7.9 | 7.9 | 7.9 | 8.4 |
3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414.0 | 401.0 | 359.0 | 1174.0 | NaN | NaN | NaN | ... | 8.9 | 8.9 | 6.8 | 6.1 | 6.1 | 6.8 | 7.8 | 7.1 | 7.2 | 7.8 |
4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 | NaN | NaN | NaN | ... | 6.1 | 6.8 | 6.4 | 5.9 | 6.4 | 7.0 | 6.9 | 6.2 | 6.6 | 7.0 |
5 rows × 158 columns
The final combined
DataFrame has 401 rows (schools) and 158 columns. That’s only 20 rows less than what we started with.
City School Districts
Additionally, we will create a new column containing the city school district (CSD) of each school. The CSD is made up of the first 2 digits of a school’s DBN.
# Make the new column.
"school_dist"] = combined["DBN"].apply(
combined[lambda text: text[:2] # Get first two characters.
)
# Reorder the columns.
= reorder_columns(combined, ["school_dist"])
combined
combined.head()
school_dist | DBN | R_SCHOOL NAME | R_Num of SAT Test Takers | R_SAT Critical Reading Avg. Score | R_SAT Math Avg. Score | R_SAT Writing Avg. Score | R_sat_score | A_SchoolName | A_AP Test Takers | ... | S_eng_t_11 | S_aca_t_11 | S_saf_s_11 | S_com_s_11 | S_eng_s_11 | S_aca_s_11 | S_saf_tot_11 | S_com_tot_11 | S_eng_tot_11 | S_aca_tot_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 | NaN | NaN | ... | 6.1 | 6.5 | 6.0 | 5.6 | 6.1 | 6.7 | 6.7 | 6.2 | 6.6 | 7.0 |
1 | 01 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 | UNIVERSITY NEIGHBORHOOD H.S. | 39.0 | ... | 6.6 | 7.3 | 6.0 | 5.7 | 6.3 | 7.0 | 6.8 | 6.3 | 6.7 | 7.2 |
2 | 01 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 | EAST SIDE COMMUNITY HS | 19.0 | ... | 8.0 | 8.8 | NaN | NaN | NaN | NaN | 7.9 | 7.9 | 7.9 | 8.4 |
3 | 01 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414.0 | 401.0 | 359.0 | 1174.0 | NaN | NaN | ... | 8.9 | 8.9 | 6.8 | 6.1 | 6.1 | 6.8 | 7.8 | 7.1 | 7.2 | 7.8 |
4 | 01 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 | NaN | NaN | ... | 6.1 | 6.8 | 6.4 | 5.9 | 6.4 | 7.0 | 6.9 | 6.2 | 6.6 | 7.0 |
5 rows × 159 columns
The data is now cleaned and ready for analysis.
Exploratory Data Analysis
Correlation Coefficients
The goal of this project is to determine how demographic factors affect SAT scores. Thus, we can start our analysis by investigating the Pearson’s correlation coefficient between SAT score and each of the other columns.
= (
sat_corr
combined"R_sat_score"] # Take sat_score correlations
.corr()["R_sat_score"]) # Drop unnecessary variables
.drop([# Drop missing values
.dropna() )
The correlation table has one value for each of the numerical columns in combined
.
sat_corr.head()
R_SAT Critical Reading Avg. Score 0.976819
R_SAT Math Avg. Score 0.956406
R_SAT Writing Avg. Score 0.981751
A_AP Test Takers 0.625184
A_Total Exams Taken 0.614112
Name: R_sat_score, dtype: float64
We’ll analyze the data by going through each set of variables based on their original dataset. We’ll only use the most important datasets.
Before we do this, we’ll define a function that will take a Series and return values whose labels have certain prefixes. This will help us select parts of the correlation table more easily.
def filter_prefix(series, regex):
"""Takes a Series and a regex. Returns items in the Series whose labels match the regex at the start."""
= series[[
result for label in series.index
label if re.match(regex, label)
]]return result
For example, we could select all correlations for ap_2010
and class_size
variables by filtering prefixes A
and C
.
r"[AC]") filter_prefix(sat_corr,
A_AP Test Takers 0.625184
A_Total Exams Taken 0.614112
A_Number of Exams with scores 3 4 or 5 0.614752
C_CSD 0.054011
C_NUMBER OF STUDENTS / SEATS FILLED 0.400095
C_NUMBER OF SECTIONS 0.364537
C_AVERAGE CLASS SIZE 0.395964
C_SIZE OF SMALLEST CLASS 0.282388
C_SIZE OF LARGEST CLASS 0.327099
Name: R_sat_score, dtype: float64
Let’s proceed to the first set of variables.
AP Exam Results
First, we’ll look at the correlations for ap_2010
variables.
r"[A]") filter_prefix(sat_corr,
A_AP Test Takers 0.625184
A_Total Exams Taken 0.614112
A_Number of Exams with scores 3 4 or 5 0.614752
Name: R_sat_score, dtype: float64
The number of AP test takers has a positive correlation of 0.63 with SAT score.
Remember that the AP exams are “Advanced Placement” exams taken by USA high schoolers in order to get college credit. Thus, the students who take AP exams tend to have higher academic performance.
Let’s make a scatter plot of SAT score against percentage of AP test takers in each school.
# Make a new column for AP test taker percentage.
"A_ap_perc"] = combined["A_AP Test Takers "] / combined["D_total_enrollment"]
combined[
# Replace inaccurate data (above 100%) with nulls.
"A_ap_perc"] = combined["A_ap_perc"].mask(
combined["A_ap_perc"] > 1.0,
combined[
np.nan,
)
# Scatter plot
combined.plot.scatter(= "A_ap_perc",
x = "R_sat_score",
y
)
"SAT Score vs AP Test Taker Percentage")
plt.title("AP Test Taker Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
The relationship is not linear. We can only observe that:
- Most schools have less than 40% AP takers and less than 1400 SAT points.
- The schools with high SAT scores (above 1600) have less than 60% AP test takers.
- The schools with low SAT scores (under 1400) can have any percentage of AP test takers.
Therefore, it doesn’t seem that the percentage of AP test takers in a school reliably influences its average SAT score.
Survey Data
Let’s move on to the survey
variables.
r"[S]") filter_prefix(sat_corr,
S_rr_s 0.286470
S_rr_t 0.022365
S_rr_p 0.087970
S_N_s 0.438427
S_N_t 0.306520
S_N_p 0.439547
S_saf_p_11 0.111892
S_com_p_11 -0.106899
S_eng_p_11 0.016640
S_aca_p_11 0.026691
S_saf_t_11 0.315796
S_com_t_11 0.097797
S_eng_t_11 0.045746
S_aca_t_11 0.135490
S_saf_s_11 0.286153
S_com_s_11 0.157417
S_eng_s_11 0.174425
S_aca_s_11 0.292592
S_saf_tot_11 0.292468
S_com_tot_11 0.083068
S_eng_tot_11 0.093489
S_aca_tot_11 0.178388
Name: R_sat_score, dtype: float64
The variable names are abbreviated, so it is difficult to understand them. However, the data dictionary explains their meanings.
The most important ones are from S_saf_p_11
downwards. These variables represent scores of 4 aspects of each school:
- Safety and Respect
- Communication
- Engagement
- Academic Expectations
Each aspect has 4 scores based on the respondent group: parents, teachers, students, and total (all groups).
For example, S_saf_tot_11
refers to the total Safety and Respect score of a school. It had a coefficient of 0.29. Let’s see a scatterplot of this.
combined.plot.scatter(= "S_saf_tot_11",
x = "R_sat_score",
y
)
"SAT Score vs Safety and Respect Score (Total)")
plt.title("Safety and Respect Score (Total)")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
The correlation is not strong, as many schools have SAT scores under 1400 regardless of Safety and Respect score.
The schools which did have high scores (above 1500) had safety scores ranging from 7.25 to 8.75. This suggests that having sufficient safety and respect in a school may support the learning experience, though it does not always increase SAT scores.
Another interesting variable would be S_aca_s_11
, the Academic Expectations score based on student responses. It had a coefficient of 0.29.
combined.plot.scatter(= "S_aca_s_11",
x = "R_sat_score",
y
)
"SAT Score vs Academic Expectations Score (Students)")
plt.title("Academic Expectations Score (Students)")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
Again, the correlation is not strong.
It’s interesting to note, however, that none of the schools with an Academic Expectations score under 7.5 had SAT scores above 1600. We could say that a school must first have proper academic expectations in order to open the opportunity to excel.
In terms of strong conclusions, though, we can’t make any yet.
Demographics
Lastly, we go to the demographics
dataset variables. These are the most important variables to investigate for this project.
Here are the correlation coefficients:
r"[D]") filter_prefix(sat_corr,
D_frl_percent -0.718163
D_total_enrollment 0.385741
D_ell_num -0.130355
D_ell_percent -0.379510
D_sped_num 0.058782
D_sped_percent -0.420646
D_asian_num 0.483939
D_asian_per 0.552204
D_black_num 0.043408
D_black_per -0.302794
D_hispanic_num 0.052672
D_hispanic_per -0.363696
D_white_num 0.460505
D_white_per 0.646568
D_male_num 0.345808
D_male_per -0.107624
D_female_num 0.403581
D_female_per 0.107676
Name: R_sat_score, dtype: float64
The results can be divided into general topics:
- School size: The
total_enrollment
has a coefficient of 0.39. This means that bigger schools may have higher SAT scores. - English proficiency: The coefficient for ELL percentage is -0.38.
- ELL means English Language Learner. It refers to students who are learning as a second or third language because they did not grow up with it.
- SpEd percentage: The coefficient for SpEd percentage is -0.42.
- SpEd means Special Education. It refers to educational plans with special accommodations for students who are disabled or have developmental disorders.
- Sex: The following are the coefficients of the percentage of each sex:
- Male: -0.11
- Female: 0.11
- Race: The following are the coefficients related to the percentage of students belonging to specific races.
- Asian: 0.55
- Black: -0.30
- Hispanic: -0.36
- White: 0.65
- Socio-economic status: The coefficient for FRL percentage -0.72.
- FRL means Free or Reduced-Price Lunch. This is a program that supports students below a certain income threshold who do not have the means to bring their own lunch to school.
We will go through each topic one by one.
School Size
Below, we plot a scatter plot of SAT scores against the total number of enrollees in each school.
combined.plot.scatter(= "D_total_enrollment",
x = "R_sat_score",
y
)
"SAT Score vs Total Enrollment")
plt.title("Total Enrollment")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
The scatter plot doesn’t have a clear linear trend. Instead, the majority of points are clustered in one area, and the rest of the points are scattered.
Let’s inspect the names of the schools at the very bottom left, which have SAT scores under 1000 and total enrollment under 1000.
combined.loc["R_sat_score"] < 1000) & (combined["D_total_enrollment"] < 1000),
(combined["R_SCHOOL NAME"
]
102 INTERNATIONAL COMMUNITY HIGH SCHOOL
138 ACADEMY FOR LANGUAGE AND TECHNOLOGY
139 BRONX INTERNATIONAL HIGH SCHOOL
153 KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
155 INTERNATIONAL SCHOOL FOR LIBERAL ARTS
197 PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
199 HIGH SCHOOL OF WORLD CULTURES
210 BROOKLYN INTERNATIONAL HIGH SCHOOL
241 PACIFIC HIGH SCHOOL
251 INTERNATIONAL HIGH SCHOOL AT PROSPECT HEIGHTS
266 IT TAKES A VILLAGE ACADEMY
285 MULTICULTURAL HIGH SCHOOL
314 ASPIRATIONS DIPLOMA PLUS HIGH SCHOOL
320 PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: R_SCHOOL NAME, dtype: object
Interestingly, the schools with the lowest average SAT scores are mostly small international schools. These schools tend to have more students from different racial and ethnic backgrounds. These students may speak English as a second or third language.
Therefore, the factor that affected SAT performance may have been the percentage of English language learners (ELL) in each school. This makes sense since the SAT is offered only in English and includes sections on Reading and Writing.
English Proficiency
Let’s plot SAT scores against the percentage of ELLs (English Language Learners) in each school.
combined.plot.scatter(= "D_ell_percent",
x = "R_sat_score",
y
)
"SAT Score vs ELL Percentage")
plt.title("ELL Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
Once again, there isn’t a linear trend even though the correlation coefficient between the variables was -0.38. However, we can note that:
- Most schools have less than 50% ELLs and less than 1400 SAT points.
- The schools with the highest SAT scores (over 1500) have almost 0% ELLs.
- All of the schools with over 50% ELLs have very low scores (under 1300).
Therefore, it appears that ELL students are more challenged in reaching high SAT scores.
SpEd Percentage
The coefficient for SpEd percentage was -0.42, which indicates that schools with more SpEd students may have lower SAT scores. Let’s view a scatter plot:
combined.plot.scatter(= "D_sped_percent",
x = "R_sat_score",
y
)
"SAT Score vs SpEd Percentage")
plt.title("SpEd Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
The correlation is somewhat linear, but it is not very steep. The points are very scattered, especially on the left side. Interestingly, the schools with an SAT score over 1600 had 15% or less SpEd students.
However, the variable doesn’t predict SAT score well.
Sex
Another factor that’s important to investigate is sex. Do schools with more male students or more female students have higher SAT scores?
Earlier, we saw that the male and female percentages had weak correlations with SAT score.
- Male: -0.11; lower SAT scores.
- Female: 0.11; higher SAT scores.
However, these correlations are weak, so the difference may not be that great.
Let’s use the female_per
variable in a scatter plot.
combined.plot.scatter(= "D_female_per",
x = "R_sat_score",
y
)
"SAT Score vs Female Percentage")
plt.title("Female Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
As expected, the correlation is weak. We can see that:
- In most schools, 40% to 60% of students are female.
- SAT score is typically below 1400 regardless of female percentage.
- The schools with the highest SAT scores (above 1500) have a 30% to 75% female population.
This variable doesn’t seem to be very important to SAT score.
Race
Earlier, we saw that the following were the coefficients for race percentage variables. Positive values imply that a higher percentage of the given race may lead to higher average SAT score. Negative values: lower score.
- Asian: 0.55
- Black: -0.30
- Hispanic: -0.36
- White: 0.65
Out of all 4, the white percentage value has the strongest correlation, 0.65. This means that having more white students in a school can lead to a higher average SAT score. Let’s see a scatterplot of this.
combined.plot.scatter(= "D_white_per",
x = "R_sat_score",
y
)
"SAT Score vs White Percentage")
plt.title("White Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
Though the correlation coefficient was stronger than the others, the trend is loose; the points are scattered far away from each other. We can see that:
- Most of the schools in the dataset have SAT scores under 1400 and white percentage under 20%.
- The schools with the highest scores have white percentages ranging from under 10% to over 60%.
Thus, we can’t say that a higher white percentage definitely raises the average SAT score.
Since the other race-related variables are weaker than this one, we won’t investigate them further. Let’s move on to the topic of socio-economic status.
Socio-economic status
As mentioned earlier, there was a variable representing the percentage of students availing of FRL (Free or Reduced-Price Lunch). Since FRL is for students below an income threshold, we can consider it similar to a poverty rate.
The correlation coefficient for this variable is the strongest at -0.72. The FRL students may have lower SAT scores due to external difficulties associated with low socio-economic status which hinder their education.
Below is the scatterplot.
combined.plot.scatter(= "D_frl_percent",
x = "R_sat_score",
y
)
"SAT Score vs FRL Percentage")
plt.title("FRL Percentage")
plt.xlabel("Average SAT Score")
plt.ylabel(True)
plt.grid( plt.show()
This graph shows a tighter, more linear relationship than was seen in the other graphs.
At the top left of the graph are schools with low FRL percentage and high average SAT score. There are still a few schools with an SAT score over 1600 despite an FRL percentage of over 50%.
However, for the majority of the schools, the FRL percentage is over 60%. All of these schools have an SAT score less than 1500.
Generally, the results suggest that socio-economic status is the strongest predictor of average SAT score among all variables in the dataset. High poverty rate is associated with a lower average SAT score.
Inferential Statistics: Linear Regression
As an additional step to this project, we will be performing Ordinary Least Squares linear regression. Some predictors may have a low Pearson’s correlation coefficient with the response variable but turn out to be statistically significant when other variables are taken into account.
I used Linear Regression by Python for Data Science LLC, along with (Simple) Linear Regression and OLS: Introduction to the Theory (Sluijmers 2020), as my main references.
The dataset has 401 observations. The dependent variable (or response variable) will be each school’s average SAT total score, represented by R_sat_score
.
As for independent variables (or predictor variables), we will start by considering the following.
- English proficiency
D_ell_percent
- SpEd percentage
D_sped_percent
- Sex
D_female_per
D_male_per
- Race
D_asian_per
D_black_per
D_hispanic_per
D_white_per
- Socio-economic status
D_frl_percent
.
The variable names are placed in the list below.
= [
iv_list "D_ell_percent",
"D_sped_percent",
"D_female_per",
"D_male_per",
"D_asian_per",
"D_black_per",
"D_hispanic_per",
"D_white_per",
"D_frl_percent",
]
I’d like to note that none of these columns have null values, so we don’t have to drop any rows or perform imputation.
print("Null values in each column")
sum() combined[iv_list].isnull().
Null values in each column
D_ell_percent 0
D_sped_percent 0
D_female_per 0
D_male_per 0
D_asian_per 0
D_black_per 0
D_hispanic_per 0
D_white_per 0
D_frl_percent 0
dtype: int64
Moving on, the code below creates a matrix X
for IVs and vector y
for the DV.
= (
regression_data "DBN", "R_sat_score"] + iv_list]
combined[["DBN")
.set_index(
)
= regression_data[iv_list].copy()
X = regression_data["R_sat_score"].copy()
y
regression_data
R_sat_score | D_ell_percent | D_sped_percent | D_female_per | D_male_per | D_asian_per | D_black_per | D_hispanic_per | D_white_per | D_frl_percent | |
---|---|---|---|---|---|---|---|---|---|---|
DBN | ||||||||||
01M292 | 1122.0 | 22.3 | 24.9 | 38.6 | 61.4 | 14.0 | 29.1 | 53.8 | 1.7 | 88.6 |
01M448 | 1172.0 | 21.1 | 21.8 | 42.6 | 57.4 | 29.2 | 22.6 | 45.9 | 2.3 | 71.8 |
01M450 | 1149.0 | 5.0 | 26.4 | 45.3 | 54.7 | 9.7 | 23.9 | 55.4 | 10.4 | 71.8 |
01M458 | 1174.0 | 4.0 | 8.9 | 56.7 | 43.3 | 2.2 | 34.4 | 59.4 | 3.6 | 72.8 |
01M509 | 1207.0 | 11.2 | 25.9 | 53.7 | 46.3 | 9.3 | 31.6 | 56.9 | 1.6 | 80.7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32K549 | 1035.0 | 18.2 | 18.4 | 51.6 | 48.4 | 0.5 | 28.3 | 68.7 | 1.8 | 75.1 |
32K552 | 1060.0 | 23.9 | 22.1 | 45.5 | 54.5 | 1.5 | 26.4 | 70.4 | 1.0 | 77.1 |
32K554 | 1315.0 | 2.7 | 4.9 | 47.9 | 52.1 | 5.3 | 12.9 | 79.5 | 1.5 | 81.4 |
32K556 | 1055.0 | 18.2 | 18.6 | 46.8 | 53.2 | 0.9 | 21.2 | 77.3 | 0.0 | 88.0 |
32K564 | 1034.0 | 5.2 | 9.4 | 55.8 | 44.2 | 0.3 | 35.9 | 61.5 | 1.8 | 81.8 |
401 rows × 10 columns
Before we fit and interpret the model with these DataFrames, we must first test the assumptions of linear regression.
Multicollinearity
Multicollinearity occurs when multiple predictor variables are highly correlated. Variables with high multicollinearity have unreliable significance values. Thus, this is a problem that must be avoided.
The first way to check for multicollinearity is with a Pearson’s correlation coefficient table. The code below plots this table using a heatmap for easier interpretation.
# Create a correlation matrix from X.
= X.corr()
corr
# Create a mask that selects the bottom left triangle of the table.
= corr.iloc[1:, :-1]
corr = np.triu( # Replace the lower triangle with 0's.
mask # Make a DataFrame of 1's with the same shape as corr.
np.ones_like(corr), = 1, # Raise the diagonal by 1.
k
)
# Plot a heatmap of the values.
= (12, 7))
plt.figure(figsize
= sns.heatmap(
ax
corr,= -1, # Anchor colors on -1 and 1.
vmin = 1,
vmax = True, # Draw a color bar.
cbar = "RdBu", # Use Red for negatives and Blue for positives.
cmap = mask, # Use a mask to remove the upper right triangle of the matrix.
mask = True, # Write the data value in each cell.
annot
)
# Format the text in the plot to make it easier to read.
for text in ax.texts:
= float(text.get_text())
coeff
# Remove data values with low correlation.
if np.abs(coeff) < 0.3:
'')
text.set_text(# Round all visible data values to the hundredths place.
else:
round(coeff, 2))
text.set_text(
"large")
text.set_fontsize(
"Correlation Heatmap of Independent Variables")
plt.title(
# Rotate x-ticks to read from top to bottom.
= -30, ha = "left", size = "large")
plt.xticks(rotation = 0, size = "large")
plt.yticks(rotation
plt.show()
In the heatmap above, blue colors represent positive correlation; red represents negative correlation. Colors with more saturation indicate that the correlation is stronger.
I removed most of the coefficient labels except for the ones greater than 0.3, which show somewhat strong correlations. Interestingly, we can see that:
- Male percentage has a perfect negative correlation with female percentage. This is to be expected since both must sum up to 100.
- Hispanic percentage is correlated with black percentage and FRL percentage.
- White percentage is strongly negatively correlated with FRL percentage, indicating that schools with more white students also have fewer poor students.
Another way to assess multicollinearity is to use variance inflation factors or VIFs. These are numbers greater than or equal to 1.
A VIF less than 5 suggests low to moderate collinearity. If a predictor’s VIF is at least 5, this indicates high multicollinearity, so the significance value is not reliable.
The code below produces a VIF for each predictor.
from statsmodels.stats.outliers_influence import variance_inflation_factor as sm_vif
def get_vif(X):
"""Get a VIF value for each column in a DataFrame."""
= pd.DataFrame()
vif_df "label"] = X.columns
vif_df["VIF"] = [
vif_df[
sm_vif(X.values, i)for i in range(len(X.columns))
]"label", inplace = True)
vif_df.set_index(
return vif_df
get_vif(X)
VIF | |
---|---|
label | |
D_ell_percent | 2.874752 |
D_sped_percent | 8.509117 |
D_female_per | 2578.305578 |
D_male_per | 2569.217298 |
D_asian_per | 304.520086 |
D_black_per | 2196.055978 |
D_hispanic_per | 2413.721821 |
D_white_per | 231.186584 |
D_frl_percent | 60.692762 |
These values are problematic. Almost all of the VIFs indicate strong multicollinearity except for that of D_ell_percent
. This is likely due to the strong correlations we saw earlier in the heatmap.
Therefore, we must remove a few variables with high multicollinearity.
- I choose to remove
male_per
since it has -1.0 correlation withfemale_per
, which makes it redundant. - I choose to remove
hispanic_per
because it has over 0.5 correlation withblack_per
andfrl_percent
. - I choose to remove
frl_percent
since it has over 0.3 correlation with 4 other variables, including 0.7 correlation withwhite_per
.
Thus, below is the new list of predictors.
= [
iv_list "D_ell_percent",
"D_sped_percent",
"D_female_per",
"D_asian_per",
"D_black_per",
"D_white_per",
]
= (
regression_data "DBN", "R_sat_score"] + iv_list]
combined[["DBN")
.set_index(
)
= regression_data[iv_list].copy()
X = regression_data["R_sat_score"].copy()
y
= get_vif(X)
vif_df vif_df
VIF | |
---|---|
label | |
D_ell_percent | 1.600755 |
D_sped_percent | 3.824473 |
D_female_per | 7.289758 |
D_asian_per | 1.788134 |
D_black_per | 3.926733 |
D_white_per | 1.777499 |
We can see that most of the VIFs are less than 5, so the p-values in the model will be reliable. Only the female_per
variable’s p-value will be unreliable. However, I will still include this in the model since the percentages of sexes in a school may have an effect on SAT performance.
IID Errors and Normality of Residuals
Next, we will be testing two more assumptions:
- IID (identically and independently distributed) residuals
- This is tested using the Durbin-Watson test statistic, which ranges from 0 to 4.
- Ideally, the statistic should be close to 2.
- As a rule of thumb, values from 1.5 to 2.5 are considered acceptable.
- Normality of residuals
- This is tested using the Jarque-Bera test.
- The null hypothesis is that the residuals are normally distributed.
- A statistically significant value (p < 0.05) supports the claim that the residuals are not normal.
In order to test these, we must fit the model and look at its results. statsmodels
automatically calculates both the Durbin-Watson and the Jarque-Bera tests for us.
# OLS model
# Add a constant to the X matrix.
# This is done so that the y-intercept is not locked at 0.
= sm.add_constant(X)
X
# Make and fit the OLS model.
= sm.OLS(y, X)
model = model.fit()
results = results.summary()
summary
# Extract summary tables from OLS results summary.
def extract_summary(sm_summary):
"""Take a statsmodels summary instance and return a list of DataFrames."""
= []
tables = list(range(3))
t_inds
for i in t_inds:
= sm_summary.tables[i].as_html()
table_as_html
= pd.read_html(
table_df
table_as_html,= (0 if i == 1 else None), # Only set a header for table 2.
header = 0,
index_col 0]
)[
tables.append(table_df)
if i == 1:
# Combine summary table 1 with the VIF column.
= pd.concat([table_df, vif_df], axis = 1)
table_df "label", inplace = True)
table_df.rename_axis(
else:
# For tables 0 and 2, turn the index back into a column.
= tables[i].reset_index()
table_df
= table_df
tables[i]
return tables
= extract_summary(summary)
tables
# Display the third table.
2] tables[
C:\Users\migs\anaconda3\envs\new_streamlit_env2\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | Omnibus: | 28.251 | Durbin-Watson: | 1.660000e+00 |
1 | Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6.075100e+01 |
2 | Skew: | 0.378 | Prob(JB): | 6.430000e-14 |
3 | Kurtosis: | 4.751 | Cond. No. | 5.040000e+02 |
The Durbin-Watson test statistic is 1.66. This is less than 2, so it indicates positive autocorrelation among residuals. (That is, each residual has a positive correlation with the residual before it). However, it falls within the acceptable range (1.5 to 2.5), so it is not a cause for concern.
The Jarque-Bera test significance, on the other hand, is \(6.43 \cdot 10^{-14}\), which is very close to 0. It is statistically significant, so the residuals are not normally distributed. However, since the sample size (401) is greater than 30, OLS regression is robust against deviations from normality. Thus, this is not a cause for concern either.
Let’s plot a histogram of residuals to inspect the deviation from normality, just to be sure.
#%% Make a histogram of residuals.
= results.resid
residuals #bin_edges = get_bin_edges(residuals)
#bins = bin_edges)
plt.hist(residuals, )"Histogram of Residuals")
plt.title("Residual Error")
plt.xlabel("Frequency")
plt.ylabel(True)
plt.grid(
plt.show()
Though this isn’t normal, it is still somewhat bell-shaped. Most of the data converges around a residual error of 0, with fewer values going farther away. Thus, the distribution isn’t problematic.
We can also visualize normality of residuals using a Q-Q (quantile-quantile) plot.
# Plot a Q-Q plot.
sm.qqplot(= residuals,
data = "r",
line # Regression line
)"Q-Q Plot of Residuals")
plt.title(
plt.legend(= ["Actual data", "Theoretical quantiles line"],
labels = "best",
loc
)True)
plt.grid(
plt.show()
C:\Users\migs\anaconda3\envs\new_streamlit_env2\lib\site-packages\statsmodels\graphics\gofplots.py:993: UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "bo" (-> marker='o'). The keyword argument will take precedence.
ax.plot(x, y, fmt, **plot_style)
The red line is our reference for normality. The blue dots should ideally follow this line. We can see that they do, though not perfectly, and the data are less normal at the higher quantiles. Generally, though, the data is close enough to being normal.
Lastly, for IID residuals, we can plot the residuals against the fitted response variable.
# Plot residuals against fitted DV values.
= results.fittedvalues
fitted = results.resid
residuals
# Horizontal line on the x axis.
= "r", label = "Fitted SAT Score")
plt.axhline(color
# Scatter plot.
= "Residual Error")
plt.scatter(fitted, residuals, label
"Residuals against Fitted SAT Score")
plt.title("Fitted SAT Score")
plt.xlabel("Residual Error")
plt.ylabel(= "best")
plt.legend(loc True)
plt.grid(
plt.show()
In this graph, we ideally want to see the points cluster uniformly around the red line, without any trends.
However, if we follow the outline of the general mass of points, these seem to follow a curve. However, the points on the right side are loosely scattered, so we cannot be sure whether there is truly a curve or it is merely due to outliers.
Anyway, we saw earlier that the Durbin-Watson test statistic was within the acceptable range, so the distribution of residuals is alright. We can move on to the next step.
Linearity
The next assumption that we will test is linearity. The predictors must be linearly correlated with the response variable; the relationship should not be too curved or abnormal.
We can test this visually by plotting the actual average SAT scores against the fitted scores which were predicted by our model.
# Plot actual SAT scores against fitted SAT scores.
= results.fittedvalues
fitted
# Scatter plot
= "Actual data")
plt.scatter(fitted, y, label
# Use OLS model results to get trend line.
= sm.add_constant(fitted)
x_fitting = sm.OLS(y, x_fitting)
trend_model = trend_model.fit()
trend_results
= trend_results.params
trend_params = trend_params["const"]
trend_const = trend_params[0]
trend_coeff = trend_coeff * fitted + trend_const
y_plot
= "r", label = "Trend line")
plt.plot(fitted, y_plot, color
# Extra plot info
"""Actual SAT scores against Fitted SAT scores""")
plt.title("Fitted SAT Score")
plt.xlabel("Actual SAT Score")
plt.ylabel(= "best")
plt.legend(loc True)
plt.grid(
plt.show()
C:\Users\migs\anaconda3\envs\new_streamlit_env2\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
The trend line above fits the actual data quite well. Moreover, the actual data seems to follow a mostly linear shape. Thus, there is no cause for concern here.
Homoscedasticity
The last assumption that we will test is homoscedasticity. This means that the variance is consistent across all values of residuals. If this is not the case, the model has heteroscedasticity, which is problematic.
This can be tested using the Breusch-Pagan test. The null hypothesis is that heteroscedasticity is not present (which is ideal). A statistically significant p-value supports the claim that there is heteroscedasticity, in which case we would have to adjust the model.
# Breusch-Pagan test.
= ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
b_labels = sms.het_breuschpagan(
b_test
residuals,
model.exog,
)
= pd.DataFrame(zip(b_labels, b_test))
b_results = ["Information", "Value"]
b_results.columns
b_results
Information | Value | |
---|---|---|
0 | Lagrange multiplier statistic | 1.028960e+02 |
1 | p-value | 6.237518e-20 |
2 | f-value | 2.266603e+01 |
3 | f p-value | 5.711239e-23 |
Both p-values have a factor of around \(10^{-20}\). In other words, these are very close to 0, which means that these are significant. Thus, the model is heteroscedastic.
We can make the results robust against heteroscedasticity by specifying an HCCM (Heteroscedasticity Consistent Covariance Matrix). We will specifically choose HC0. It doesn’t perform well for sample sizes under 250 (Python for Data Science), but our sample size is 401, so we can use it.
# Make and fit the OLS model.
= sm.OLS(y, X)
model = model.fit(
results = 'HC0', # Specify an HCCM.
cov_type
)= results.summary()
summary = extract_summary(summary) tables
Now that we’ve tested all of the assumptions and adjusted for heteroscedasticity, we can move on to partial regression plots.
Partial regression plots
Partial regression plots are used to visualize the relationship between 1 predictor and the response variable while holding all other predictors constant.
The plots are made below.
#%% Partial regression plots.
def par_reg_plot(data, label, show_dbn = False):
"""Take a predictor label and plot its partial regression plot, which takes the other predictors into account.
Note: This function references objects from the global namespace."""
= "Partial Regression Plot\nSAT Score against `{}`".format(label)
title = "`{}`\nPartial Residuals".format(label)
x_label
# List of all IVs other than the current one.
= [col for col in iv_list if col != label]
others
= sm.graphics.plot_partregress(
fig = "R_sat_score",
endog = label,
exog_i = others,
exog_others = data,
data = show_dbn,
obs_labels
)
plt.title(title)
plt.xlabel(x_label)"SAT Score\nPartial Residuals")
plt.ylabel(True)
plt.grid(
plt.legend(= ["Actual partial residuals", "Trend line"],
labels = "best",
loc
)
plt.tight_layout(= 2,
h_pad
)
plt.show()
for label in iv_list:
par_reg_plot(regression_data, label)
We can observe that:
- English Language Learner percentage and Special Education percentage fit the linear trend best.
- Female percentage residuals all cluster around one spot and don’t follow a trend.
- The other variables, which involve race percentages, seem to be affected by outliers.
We should keep these plots in mind as we interpret the model results since these don’t show whether outliers may have led to an overestimation of the significance of a predictor.
Interpretation of Results
Below is the first table of model results.
0] tables[
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | Dep. Variable: | R_sat_score | R-squared: | 7.860000e-01 |
1 | Model: | OLS | Adj. R-squared: | 7.820000e-01 |
2 | Method: | Least Squares | F-statistic: | 9.644000e+01 |
3 | Date: | Mon, 13 Dec 2021 | Prob (F-statistic): | 3.410000e-74 |
4 | Time: | 10:27:39 | Log-Likelihood: | -2.330700e+03 |
5 | No. Observations: | 401 | AIC: | 4.675000e+03 |
6 | Df Residuals: | 394 | BIC: | 4.703000e+03 |
7 | Df Model: | 6 | NaN | NaN |
8 | Covariance Type: | HC0 | NaN | NaN |
Multiple regression analysis was used to test if a school’s percentages of English Language Learners, Special Education students, female students Asian students, black students, and white students significantly predicted the school’s average total SAT score.
The six predictors explained 78.6% of the variance (R2= 0.786, F(6, 394) = 96.44, p < 0.01).
Next is the second table of results, which shows the beta coefficient, p-value, etc. for each predictor variable.
1] tables[
coef | std err | z | P>|z| | [0.025 | 0.975] | VIF | |
---|---|---|---|---|---|---|---|
label | |||||||
const | 1430.1978 | 32.665 | 43.784 | 0.00 | 1366.175 | 1494.220 | NaN |
D_ell_percent | -5.2206 | 0.320 | -16.290 | 0.00 | -5.849 | -4.592 | 1.600755 |
D_sped_percent | -8.6750 | 0.861 | -10.076 | 0.00 | -10.362 | -6.988 | 3.824473 |
D_female_per | -0.8097 | 0.316 | -2.564 | 0.01 | -1.429 | -0.191 | 7.289758 |
D_asian_per | 3.5967 | 0.524 | 6.861 | 0.00 | 2.569 | 4.624 | 1.788134 |
D_black_per | -1.3553 | 0.182 | -7.446 | 0.00 | -1.712 | -0.999 | 3.926733 |
D_white_per | 3.8196 | 0.604 | 6.329 | 0.00 | 2.637 | 5.002 | 1.777499 |
The predicted average SAT score in a school is approximately equal to 1430 - 5.22(ELL percentage) - 8.68(SpEd percentage) - 0.81(female percentage) + 3.60(Asian percentage) - 1.36(black percentage) + 3.82(white percentage).
All of the independent variables used in the model were significant predictors of the school’s average SAT score, with p < 0.05. However, since the female percentage variable’s VIF value was greater than 5, strong multicollinearity was present, so its significance value may not be reliably interpreted.
The coefficients for ELL percentage, SpEd percentage, and black student percentage were negative. This means that these factors are associated with lower average SAT score in an NYC high school.
On the other hand, the coefficients for Asian percentage and white percentage were positive. This means that these factors are associated with higher average SAT score in an NYC high school. However, the linear relationship in the partial regression plots appeared to be affected by outliers, so the effect of these variables may not be strong for the majority of high schools.
In general, however, these results present initial evidence that race percentages in an NYC high school may significantly affect its average SAT score.
Conclusion
In this project, we cleaned, transformed, and combined multiple datasets about New York City high schools. We then conducted exploratory data analysis using Pearson’s correlation coefficients and bivariate scatterplots.
Then, multiple linear regression was performed. The model explained 78.6% of the variance (R2= 0.786, F(6, 394) = 96.44, p < 0.01). This supports the claim that the percentage of each race in a school has an effect on the school’s average SAT score.
Thanks for reading!