Application: Plastic wire coating data

9. Application: Plastic wire coating data#

9.1. Background#

In this lesson, we are going to look at over 2 million records containing data for the thickness of a PVC coating versus several processing parameters. These records were taken over a 3 year period.

The basic operation of an coating extruder is shown above and details of the cooling pool is shown below. Some of the machine parameters would be the wire coating speed (LineSpeed \(m\ sec^{-1}\)), temperature of the extruder (ExtrTemp \(^{\circ} C\)), temperature of the water in the cooling pool (WaterTemp \(^{\circ} C\)), date of the coating (Date) and perhaps the location of the machine (City). Each of these variables will be included in our dataset along with the measured coating thickenss (Measure \(\mu m\)).

Learning Objectives

Be able to:

Read two text files containing 1 million points each and combine into a single dataframe
Perform all necessary cleaning, organizing, and sorting steps as needed
Perform corrections on data values when spelling variations exist for the same same record (e.g. city name)
Use conditionals to filter data when multiple conditions are required
Display the data using matplotlib.pyplot

Date files needed:

First 1 million data records: insulation1.csv
Second 1 million data records: insulation2.csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

9.2. Read, Combine and Clean the Data#

Let’s read in the “insulation1.csv” data directly from the google drive it is stored. Every file in Drive has a unique identifyer that never changes. If you hover over the links above or right click and select copy link you will see “https://drive.google.com/uc?export=download&id=1bWIIA9ivM9SdZi6fzUItj8WM1tu_Tirz”. The file identifyer is everything after “id=” or in this case “1bWIIA9ivM9SdZi6fzUItj8WM1tu_Tirz”. Now that we have the identifyer we can use pandas to read the file.

import warnings
warnings.filterwarnings("ignore")

raw=pd.read_csv('https://drive.google.com/uc?id=1bWIIA9ivM9SdZi6fzUItj8WM1tu_Tirz')
# if I had a second file to read I would look at use concat to combine the files
# example usage  new_df=pd.concat([df_1,df_2,df_3, etc...],ignore_index=True)
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000003 entries, 0 to 1000002
Data columns (total 8 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   ID         1000003 non-null  int64 
 1   Date       1000003 non-null  object
 2   Shifts     1000001 non-null  object
 3   ExtrTemp   1000003 non-null  object
 4   LineSpeed  1000003 non-null  object
 5   WaterTemp  1000003 non-null  object
 6   Measure    1000003 non-null  object
 7   City       1000003 non-null  object
dtypes: int64(1), object(7)
memory usage: 61.0+ MB

You probably got a DtypeWarning but the file was read correctly. Just ignore that for now. Several of the data columns have entries other than numbers (e.g. “*”) so the data type for that column is set to “object.” This is not memory efficient. We will fix the data types now. Let’s see what we currently have.

raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000003 entries, 0 to 1000002
Data columns (total 8 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   ID         1000003 non-null  int64 
 1   Date       1000003 non-null  object
 2   Shifts     1000001 non-null  object
 3   ExtrTemp   1000003 non-null  object
 4   LineSpeed  1000003 non-null  object
 5   WaterTemp  1000003 non-null  object
 6   Measure    1000003 non-null  object
 7   City       1000003 non-null  object
dtypes: int64(1), object(7)
memory usage: 61.0+ MB

So we have just over 1 million entries. Dtype “object” means that the column contains at least some text values (or non-numeric values). We don’t want non-numeric values in Columns 3-6 so let’s convert these columns to_numeric and replace all non-numeric values with NaN using the option coerce. Likewise we will use pd.to_datetime to convert column 1.

df=raw.copy()
df['Date']=pd.to_datetime(df['Date'], errors='coerce')
df['ExtrTemp'] = pd.to_numeric(df['ExtrTemp'], errors='coerce')
df['LineSpeed'] = pd.to_numeric(df['LineSpeed'], errors='coerce')
df['WaterTemp'] = pd.to_numeric(df['WaterTemp'], errors='coerce')
df['Measure'] = pd.to_numeric(df['Measure'], errors='coerce')
df.dtypes

ID                    int64
Date         datetime64[ns]
Shifts               object
ExtrTemp            float64
LineSpeed           float64
WaterTemp           float64
Measure             float64
City                 object
dtype: object

Notice that in column “Measure” the “*” has been replaced with NaN.

df.head()

	ID	Date	Shifts	ExtrTemp	LineSpeed	WaterTemp	Measure	City
0	1	2016-02-01	Shifts1	234.83	55.28	26.67	170.67	Kairo
1	2	2016-02-01	Shifts1	235.09	54.13	26.97	182.86	Kairo
2	3	2016-02-01	Shifts1	235.09	54.33	27.46	182.74	Kairo
3	4	2016-02-01	Shifts1	235.05	55.00	26.83	175.49	Kairo
4	5	2016-02-01	Shifts1	234.99	54.03	27.04	NaN	Kairo

We want a complete data set so let’s drop all rows that contain ‘NaN’ in columns 3 - 6.

df=df.dropna(axis='rows', how='any')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 999995 entries, 0 to 1000002
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   ID         999995 non-null  int64         
 1   Date       999995 non-null  datetime64[ns]
 2   Shifts     999995 non-null  object        
 3   ExtrTemp   999995 non-null  float64       
 4   LineSpeed  999995 non-null  float64       
 5   WaterTemp  999995 non-null  float64       
 6   Measure    999995 non-null  float64       
 7   City       999995 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(2)
memory usage: 68.7+ MB

After dropping rows with missing values we are left with 999,995 records.

9.3. Insulation thickness versus line speed#

xdata=df.loc[:,'LineSpeed']
ydata=df.loc[:,'Measure']
plt.plot(xdata, ydata, 'k.')
plt.xlabel('Linespeed, m/sec')
plt.ylabel('Measured Thickness, microns')
plt.show()

../_images/c26e444721f00ef092b5bc5b99d33c8b858d2d138a3afde34fac477a668e8931.png

9.3.1. Filtering Data using Conditionals#

Note

Let’s talk about selecting subsets of the data for analysis or plotting etc. You can setup as many conditions as you want and combine them as needed. It is important here to use parentheses for grouping our conditions together between the logical operators in this case ‘&’.

Boolean operators we could use:

‘&’ = and
‘|’ = or this is the pipe symbol above ‘’ on your keyboard
‘~’ = not
‘>’ = greater than
‘>=’ = greater than or equal to
‘<’ = less than
‘<=’ = less than or equal to
‘!=’ not equal to
‘==’ equal to

That is a lot of data. What if we wanted to plot only the data from City Kairo and for the year 2016.

#define our conditions in "()" 
cond1=(df['City']=="Kairo")
cond2=(df['Date']>'2016-01-01')
cond3=(df['Date']<'2016-12-31')

#use these to filter data with "&" = AND and "|" = OR
df_slice=df[cond1&cond2&cond3]
df_slice.head()

	ID	Date	Shifts	ExtrTemp	LineSpeed	WaterTemp	Measure	City
0	1	2016-02-01	Shifts1	234.83	55.28	26.67	170.67	Kairo
1	2	2016-02-01	Shifts1	235.09	54.13	26.97	182.86	Kairo
2	3	2016-02-01	Shifts1	235.09	54.33	27.46	182.74	Kairo
3	4	2016-02-01	Shifts1	235.05	55.00	26.83	175.49	Kairo
5	6	2016-02-01	Shifts2	235.04	55.26	26.62	172.80	Kairo

# all data
xdata=df.loc[:,'LineSpeed']
ydata=df.loc[:,'Measure']
plt.plot(xdata, ydata, 'k.')
# only 2016 data in Kairo
xdata=df_slice.loc[:,'LineSpeed']
ydata=df_slice.loc[:,'Measure']
plt.plot(xdata, ydata, 'co')

plt.xlabel('Linespeed, m/sec')
plt.ylabel('Measured Thickness, microns')
plt.show()

../_images/ea4ee4232ab007b18f7ba5318295722ac906b6c0fd517d742bc86fa700ad6a56.png

Let’s take a look at what cities are included in our data. We can use the .unique() function for this.

df['City'].unique()

array(['Kairo', 'Cairo', 'CAI', 'Alexandria', 'Alex'], dtype=object)

Interesting… so “Cairo” shows up as both “Kairo” and “CAI” while “Alexandria” shows also as “Alex”

We don’t want multiple labels this way so lets make the following substitutions:
{'Kairo':'Cairo', 'CAI':'Cairo', 'Alex':'Alexandria'}

df['City']=df['City'].replace({'Kairo':'Cairo', 'CAI':'Cairo'})
df['City'].unique()

array(['Cairo', 'Alexandria', 'Alex'], dtype=object)

We fixed the entries for Cairo. We still have to fix those for Alexandria.

df.head()

	ID	Date	Shifts	ExtrTemp	LineSpeed	WaterTemp	Measure	City
0	1	2016-02-01	Shifts1	234.83	55.28	26.67	170.67	Cairo
1	2	2016-02-01	Shifts1	235.09	54.13	26.97	182.86	Cairo
2	3	2016-02-01	Shifts1	235.09	54.33	27.46	182.74	Cairo
3	4	2016-02-01	Shifts1	235.05	55.00	26.83	175.49	Cairo
5	6	2016-02-01	Shifts2	235.04	55.26	26.62	172.80	Cairo

Here’s an interesting case. Say we want all cities except those labeled “Cairo”

#define our conditions in "()" 
cond1=(df['City']=="Cairo")

# use "~" before a condition to indicate the reverse of the condition or NOT cond1
df_slice=df[~cond1]
df_slice.head()

	ID	Date	Shifts	ExtrTemp	LineSpeed	WaterTemp	Measure	City
437	438	2016-03-01	Shifts1	235.04	54.36	26.88	180.33	Alexandria
438	439	2016-03-01	Shifts1	234.86	55.66	27.15	169.56	Alexandria
439	440	2016-03-01	Shifts1	234.75	56.11	27.23	165.24	Alexandria
440	441	2016-03-01	Shifts1	235.10	54.52	26.80	179.43	Alexandria
441	442	2016-03-01	Shifts1	235.08	55.13	26.88	175.01	Alexandria

#notice we filtered out "Cairo" with the above conditional
df_slice['City'].unique()

array(['Alexandria', 'Alex'], dtype=object)

df['City'].value_counts() # unique values and number of times found in column

Alexandria    502458
Cairo         497293
Alex             244
Name: City, dtype: int64

9.4. Exercises#

Important

For all problems, use both the insulation1 and insulation2 datasets. Combine these so that you have approximately 2 million data points. For any problem below that uses a city name you should make sure you use all appropriate labels for that city. That is, fix the city names as we did above in the lesson. Use the .unique() function to make sure you fixed all the names.

Hint

If things are running slow, I would make a separate dataframe with only a few 100 rows until your code works. Then use the full dataframe. Try something like:
ran_list=[np.random.randint(0,990000) for i in range(200)]
df_short=df.loc[ran_list,:]
to make a short dataframe “df_short” with 200 rows randomly selected from a dataframe named “df”

9.4.1. Problem 1#

Plot x=‘LineSpeed’, y=‘Measure’ for the following on one plot:

Alexandria in blue circles
Cairo in red points For the conditions:
234.85 < ExtrTemp < 234.9 and 26.83 < WaterTemp < 26.9

9.4.2. Problem 2#

Plot the measured thickness ‘Measure’ vs the extruder temperature ‘ExtrTemp’ for

Alexandria as black small filled circles ‘ko’
Cairo as green small points ‘g.’

for: 55.0 < LineSpeed < 55.05 and 26.83 < WaterTemp < 26.9

9.4.3. Problem 3#

Plot the measured thickness ‘Measure’ vs the water temperature ‘WaterTemp’ for

Alexandria as blue small points ‘b.’
Cairo as red small points ‘r.’

for: 55.0 < LineSpeed < 55.05 and 234.85 < ExtrTemp < 234.9

9.4.4. Problem 4#

Plot the measured thickness ‘Measure’ vs the date ‘Date’ for

Alexandria as blue small points ‘b.’ Cairo as red small points ‘r.’ for:

55.0 < LineSpeed < 55.05 and
234.85 < ExtrTemp < 234.9 and
26.83 < WaterTemp < 26.9

9.4.5. Problem 5#

Break up the data “Measure” into 20 groups based upon its value, i.e each group will have a size = ((max of Measure)-(min of Measure))/20. For each group find the average of the “Measure” and average of the corresponding “LineSpeed”. You should have 20 pairs of points when you are done: (Measure, Linespeed). Plot these values with Linespeed on the y-axis.

There are many ways to think about doing this. You could create another column in your dataframe called “group” and assign a group number from 1 to 20 for everypoint in your dataframe based on the value of “Measure”. Then you could use a groupby(‘group’).mean() to find the mean of all points in each group.

You could find all records within a given range of “Measure” values, find the mean of this group and append the value to an array. Then iterate through all the ranges for a for loop.

You could make a new column of data in your dataframe with values equal to pd.cut(df[‘Measure’],20), then consider what groupby could do…

… and many other options.