Working with Data

3. Working with Data#

Learning Objectives

Be able to:

open data files (.txt, .xlsx) using pandas
present the data in a table
clean and modify the data as needed
represent the cleaned data in a scatter plot using pyplot from the package matplotlib
fit the data using curve_fit from the package scipy.optimize
export a table of data into an Excel file
review functions, list comprehensions, random numbers and selecting data using conditionals

First we need to import the packages we will need in this notebook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

3.1. Importing / Reading data using `pandas`#

3.1.1. Define your file path#

First we need to define our path to the file we want to import. If you find the file on your machine, you can right click on it and select properties.

By highlighting the text in the “location” box you can copy ctrl-c and paste directly in your notebook ctrl-v. For me this would give:

C:\Users\Sean\OneDrive - Virginia Tech\Documents\Teaching\Course ebooks\Python-for-MSE\data_for_practice

now I just need to add “\filename” to the end. For example,

C:\Users\Sean\OneDrive - Virginia Tech\Documents\Teaching\Course ebooks\Python-for-MSE\data_for_practice\Student credit hours dept numbers.xlsx

It is very important to remember to include the file extension at the end.

Below I define the path separate from the filename and then join them together. This is often more convenient if you are grabbing multiple files from the same path.

import os  #we use this to join our file path to our filename
path = r'C:\Users\Sean\OneDrive - Virginia Tech\Documents\Teaching\Course ebooks\Python-for-MSE\data_for_practice'
filename='Student credit hours dept numbers.xlsx'
fullpath=os.path.join(path,filename)
fullpath

'C:\\Users\\Sean\\OneDrive - Virginia Tech\\Documents\\Teaching\\Course ebooks\\Python-for-MSE\\data_for_practice\\Student credit hours dept numbers.xlsx'

3.1.2. Read the file#

Let’s take a look at some student credit hours taught for engineering departments.

path = r"C:\Users\Sean\OneDrive - Virginia Tech\Documents\Teaching\2114 & 3114 Math I & II\git_2114_3114\data for practice"
filename='Student credit hours dept numbers.xlsx'
fullpath=os.path.join(path,filename)

raw_excel=pd.read_excel(fullpath)
raw_excel.head(7)

	Unnamed: 0	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	Fall 2018 SCH	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	# faculty	Dept	on campus	off campus	Total	SCH / Faculty

3.2. Clean the data#

# let's drop rows and columns that have no data
df=raw_excel.dropna(how='all', axis=0) #removes rows
# try changing 'all' to 'any' in these commands.  What is the result? 
df=df.dropna(how='all', axis=1) #now remove columns
df.head(6)

	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9
5	NaN	NaN	Fall 2018 SCH	NaN	NaN	NaN
6	# faculty	Dept	on campus	off campus	Total	SCH / Faculty
8	29	AOE	7313	318	7631	263.137931
9	NaN	BMES	1432	NaN	1432	NaN
10	58	CEE	10240	224	10464	180.413793
11	18	CHE	2951	NaN	2350	130.555556

looks like we have an extra row at the top. Let’s drop that one also. Notice the left column which is the row index is 5 so we drop by identifying the index.

df=df.drop(5)
# now lets rename our columns
df.columns=['num faculty','dept','on campus','off campus', 'total', 'SCH per faculty']
df.head(6)

	num faculty	dept	on campus	off campus	total	SCH per faculty
6	# faculty	Dept	on campus	off campus	Total	SCH / Faculty
8	29	AOE	7313	318	7631	263.137931
9	NaN	BMES	1432	NaN	1432	NaN
10	58	CEE	10240	224	10464	180.413793
11	18	CHE	2951	NaN	2350	130.555556
12	NaN	CNST	2951	NaN	2951	NaN

df=df.drop(6) # we don't need the first row again now labeled "6"
df=df.reset_index(drop=True) #reset row numbering
df=df.fillna(0) #replace NaN with 0
df.head(6)

	num faculty	dept	on campus	off campus	total	SCH per faculty
0	29	AOE	7313	318	7631.0	263.137931
1	0	BMES	1432	0	1432.0	0.000000
2	58	CEE	10240	224	10464.0	180.413793
3	18	CHE	2951	0	2350.0	130.555556
4	0	CNST	2951	0	2951.0	0.000000
5	55	CS	17257	1122	18379.0	334.163636

3.3. Analysis of data#

3.3.1. sorting, slicing, math on columns, adding new columns#

# sort data frame by values in column "num faculty"
sch_data=df.sort_values(['num faculty','dept']).reset_index(drop=True)
#try removing ".reset_index()" and see what you get
#or change drop=True to drop=False and see what happens
sch_data

	num faculty	dept	on campus	off campus	total	SCH per faculty
0	0	0	0	Average:	7257.928571	200.426961
1	0	0	0	Median:	5966.000000	207.760000
2	0	BMES	1432	0	1432.000000	0.000000
3	0	CNST	2951	0	2951.000000	0.000000
4	0	ENGR	2104	30	2134.000000	0.000000
5	13	MINE	1095	0	1095.000000	84.230769
6	18	CHE	2951	0	2350.000000	130.555556
7	21	MSE	3771	75	3846.000000	183.142857
8	25	ENGE	5194	0	5194.000000	207.760000
9	29	AOE	7313	318	7631.000000	263.137931
10	32	ISE	7015	118	7133.000000	222.906250
11	43	ESM + BEAM	6606	132	6738.000000	156.697674
12	55	CS	17257	1122	18379.000000	334.163636
13	58	CEE	10240	224	10464.000000	180.413793
14	62	ME	10416	2563	12979.000000	209.338710
15	83	ECE	17017	2268	19285.000000	232.349398

# we don't need lines 0 - 4 so let's take only 5-15
sch_data=sch_data[5:16]
sch_data

	num faculty	dept	on campus	off campus	total	SCH per faculty
5	13	MINE	1095	0	1095.0	84.230769
6	18	CHE	2951	0	2350.0	130.555556
7	21	MSE	3771	75	3846.0	183.142857
8	25	ENGE	5194	0	5194.0	207.760000
9	29	AOE	7313	318	7631.0	263.137931
10	32	ISE	7015	118	7133.0	222.906250
11	43	ESM + BEAM	6606	132	6738.0	156.697674
12	55	CS	17257	1122	18379.0	334.163636
13	58	CEE	10240	224	10464.0	180.413793
14	62	ME	10416	2563	12979.0	209.338710
15	83	ECE	17017	2268	19285.0	232.349398

sch_data=sch_data.reset_index(drop="True")
sch_data

	num faculty	dept	on campus	off campus	total	SCH per faculty
0	13	MINE	1095	0	1095.0	84.230769
1	18	CHE	2951	0	2350.0	130.555556
2	21	MSE	3771	75	3846.0	183.142857
3	25	ENGE	5194	0	5194.0	207.760000
4	29	AOE	7313	318	7631.0	263.137931
5	32	ISE	7015	118	7133.0	222.906250
6	43	ESM + BEAM	6606	132	6738.0	156.697674
7	55	CS	17257	1122	18379.0	334.163636
8	58	CEE	10240	224	10464.0	180.413793
9	62	ME	10416	2563	12979.0	209.338710
10	83	ECE	17017	2268	19285.0	232.349398

# Let's just take the first 5 columns
sch_data=sch_data.loc[:,'num faculty':'total']
sch_data

	num faculty	dept	on campus	off campus	total
0	13	MINE	1095	0	1095.0
1	18	CHE	2951	0	2350.0
2	21	MSE	3771	75	3846.0
3	25	ENGE	5194	0	5194.0
4	29	AOE	7313	318	7631.0
5	32	ISE	7015	118	7133.0
6	43	ESM + BEAM	6606	132	6738.0
7	55	CS	17257	1122	18379.0
8	58	CEE	10240	224	10464.0
9	62	ME	10416	2563	12979.0
10	83	ECE	17017	2268	19285.0

We can easily refer to a column of data by its header, for example:

sch_data['dept']

         MINE
          CHE
          MSE
         ENGE
          AOE
          ISE
   ESM + BEAM
           CS
          CEE
           ME
         ECE
Name: dept, dtype: object

Let’s take the total credit hours and divide by the total faculty then add this to our dataframe.

sch_data['credit_faculty']=sch_data['total']/sch_data['num faculty']
sch_data

	num faculty	dept	on campus	off campus	total	credit_faculty
0	13	MINE	1095	0	1095.0	84.230769
1	18	CHE	2951	0	2350.0	130.555556
2	21	MSE	3771	75	3846.0	183.142857
3	25	ENGE	5194	0	5194.0	207.760000
4	29	AOE	7313	318	7631.0	263.137931
5	32	ISE	7015	118	7133.0	222.906250
6	43	ESM + BEAM	6606	132	6738.0	156.697674
7	55	CS	17257	1122	18379.0	334.163636
8	58	CEE	10240	224	10464.0	180.413793
9	62	ME	10416	2563	12979.0	209.338710
10	83	ECE	17017	2268	19285.0	232.349398

3.4. Plotting the data#

now let’s plot the credit_faculty vs the num faculty

# plt.plot(x, y, 'formatting') 'ro' means red points
plt.plot(sch_data['num faculty'], sch_data['credit_faculty'], 'ro')
plt.ylabel('student credit hours per faculty')
plt.xlabel('number of faculty in department')
plt.show()

../_images/2728b1b7463081cfc37c3d7de57a987e016a195333735d44ce7463cbfc83826d.png

It looks like a linear relationship exists for smaller departments. Let’s fit the data for departments with faculty less than 30.

sch_slice=sch_data[0:5]
# plt.plot(x, y, 'formatting') 'ro' means red points
plt.plot(sch_slice['num faculty'], sch_slice['credit_faculty'], 'ko')
plt.ylabel('student credit hours per faculty')
plt.xlabel('number of faculty in department')
plt.show()

../_images/8bca0154c8a73906efc78a5aaf8efbef0e8dec0a3df6417c047804881c38b127.png

3.5. Fitting data using scipy.optimize curvefit#

#read in package for curve fitting
from scipy.optimize import curve_fit

#define a function for the fit.  In this case just a line
def lin(x, m, b):
    return m*x+b

best_vals, covar = curve_fit(lin, sch_slice['num faculty'], sch_slice['credit_faculty'])
print(f'Our fit is given by {best_vals[0]:.2f} x + {best_vals[1]:.2f}')

Our fit is given by 11.11 x + -61.66

print(best_vals) #these are in order as defined in function i.e. m , b

[ 11.10513878 -61.66351949]

plt.plot(sch_slice['num faculty'], lin(sch_slice['num faculty'],11.105,-61.6635), 'k--')
plt.plot(sch_data['num faculty'], sch_data['credit_faculty'], 'ro')
plt.ylabel('student credit hours per faculty')
plt.xlabel('number of faculty in department')
plt.show()

../_images/132f67f5216bd6d3731dd63f4a1418f64f1c8811b6096e93875b84b17c68ec43.png

Let’s add a horizontal line at 275

plt.plot(sch_slice['num faculty'], lin(sch_slice['num faculty'],11.105,-61.6635), 'k--')
plt.plot(sch_data['num faculty'], sch_data['credit_faculty'], 'ro')
plt.axhline(y=275, linewidth=1, color='b', linestyle='-')
plt.axis([10,70,50,400]) #specify axis limits like this
plt.ylabel('student credit hours per faculty')
plt.xlabel('number of faculty in department')
plt.show()

../_images/b8dadd9a2a70ec4ce4392f2b2c80cd4f591195f45e9fef7201ceb11d0766ec47.png

3.6. Exporting data to excel#

3.6.1. Double check the data that you want to export#

# .loc[rows, columns] lets us use the names of the columns and rows
#in our case rows = ":" which means all rows
# columns is a list of columns to take ["num faculty",'credit_faculty']
data_to_save=sch_data.loc[:,["num faculty",'credit_faculty']]
data_to_save

	num faculty	credit_faculty
0	13	84.230769
1	18	130.555556
2	21	183.142857
3	25	207.760000
4	29	263.137931
5	32	222.906250
6	43	156.697674
7	55	334.163636
8	58	180.413793
9	62	209.338710
10	83	232.349398

3.6.2. Export#

path = r"C:\Users\Sean\OneDrive - Virginia Tech\Documents\Teaching\2114 & 3114 Math I & II\git_2114_3114\data for practice"
filename='lesson 3 outfile.xlsx'
outpath=os.path.join(path,filename)

data_to_save.to_excel(outpath, index = True, header=True)
#try changing index=True and check your output file 

Here’s what my excel file looks like:

3.7. Exercises#

The problems below may require the ability to:

import an excel file
define a function
use if, elif, else statements
use numpy.array() to multiply a sequence of numbers
generate a random number using np.random.random()
use matplotlib to make a scatter plot

3.7.1. Review of some coding that may make the homework exercises easier.#

3.7.1.1. Using `if`, `elif`, `else` statements#

def piecewise(var1, var2, var3): # use the ratio var3/var1 to return different outputs
    # variable var2 is not actually used but the function doesn't care
    
    # first if, else statement to avoid dividing by zero
    if var1 ==0:
        return(0)
    else: ratio=var3/var1
    
    #second if, elif, else statement
    if ratio < 0:
        return(50)
    elif ratio == 0:
        return(1)
    elif ratio == 0.5:
        return(60)
    else: return(ratio)

np.linspace(-1,4,21) # linspace is used to generate a list of equally spaced points

array([-1.  , -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75,  1.  ,
        1.25,  1.5 ,  1.75,  2.  ,  2.25,  2.5 ,  2.75,  3.  ,  3.25,
        3.5 ,  3.75,  4.  ])

[piecewise(x,3,1.5) for x in [0,1,2,3]] #list comprehension -- short method to make a loop

[0, 1.5, 0.75, 60]

# compare the above to
out=[]
for x in range(4):
    out.append(piecewise(x,3,1.5))
out

[0, 1.5, 0.75, 60]

plt.plot(np.linspace(-1,4,21), [piecewise(x,3,1.5) for x in np.linspace(-1,4,21)], 'b.')
plt.show()

../_images/f2d8b05500187900bfa9081c897a823aa66d5d6710fa2ec157e4738e4410bb7b.png

3.7.1.2. Let’s determine if a few points fall inside a circle with center (0.5, 0.5) and radius 0.7#

np.random.random()

0.11949452264943106

np.random.random(5) #give me 5 random points

array([0.02570857, 0.2393304 , 0.15681669, 0.17950396, 0.23284677])

# !! important note:  I use np.array below because I'm manually typing out a few coordinates
#  you will use np.random.random() which will automatically give you an array as its output
xcoord=np.array([0, 0.5, 1,0.5]) #convert lists [] to numpy array
ycoord=np.array([0, -0.5, 0.8,0.5])
magnitude=np.sqrt((xcoord-0.5)**2+(ycoord-0.5)**2)
magnitude

array([0.70710678, 1.        , 0.58309519, 0.        ])

fig, ax = plt.subplots()
ax.plot(xcoord, ycoord, 'k^')
ax.add_patch(plt.Circle((0.5, 0.5), 0.7, color='r', alpha=0.5))
ax.set_aspect('equal')
plt.show()

../_images/6312ba232f5de1da3ea8f2d77e8fd8f70fde2c21d43542341533e6bb76d1f62c.png

test=magnitude<0.7
x=xcoord[test] #take all x values where magnitude array is <0.7
y=ycoord[test] #take all y values where magnitude array is <0.7

fig, ax = plt.subplots()
ax.plot(xcoord, ycoord, 'ro') # 'ro' means red points
ax.plot(x, y, 'ko') #'ko' means black points
ax.add_patch(plt.Circle((0.5, 0.5), 0.7, color='r', alpha=0.5))
ax.set_aspect('equal')
plt.show()

../_images/1f85f282eabfae0af19b344520fd3047c723d0ecb28eb819f454e04abe7ed696.png

# if you want to find the length of a list
len(xcoord)

3.7.2. Problem 1:#

File needed: Al7075.xlsx

(a) Import the load-length data in file “Al7075.xlsx” The header in the excel file should give you all the information you need. Plot the stress-strain data. Put the stress in units of MPa.

(b) Take the first couple of data points and determine the slope by fitting a straight line to the data. You may want to replot the first quarter or so of the original plot so that you can see how far the linear region extends.

(c) Add a line to your plot that starts at a strain of 0.002 and has the slope found previously. Plot this line so that it intersects the stress-strain data curve and read off the yield strength. You should only plot the first quarter or so of the original plot so that you can clearly see the yield point value.

(d) Add a horizontal line to your plot at the yield point value estimated above. Refine your estimate if your horizontal line does not pass correctly through the intersection point.

(e) Export your data to a new excel file with a name of your choosing.

3.7.3. Problem 2:#

Write a function that given the ionic radii for a simple ceramic compound (2 elements) returns the expected coordination number. The coordinaton numbers versus ratio of the radii can be found in Chapter 12, Table 12.2 on page 408 in Callister 10th ed. and is shown below. Your function should check if your ratio is greater than 1.0. If it is, then take the inverse of your ratio. You may want to consider the if elif else conditionals. Test your function by calculating the coordination number for NaCl. Remember to use the ionic radii not the atomic radii.

3.7.4. Problem 3:#

Pick 1000 random points in a square of edge length 2 centered at (0,0). Determine how many of these points are contained in a circle of radius 1 centered at (0,0). Multiply the ratio of points in the circle to the total number of points by 4, i.e. \(4\times \frac{\text{pts in circle}}{\text{total num pts}}\). Try 10,000 points. What do you get?