Performing Analysis of Meteorological Data

By: Hrishikesh Dherange


Introduction:

Hello folks, welcome to my blog, where we will perform analysis on meteorological data, i.e. weather data. The dataset is taken from Kaggle, and the link for the same is: "https://www.kaggle.com/muthuj7/weather-dataset". All the operations will be done using the standard Python libraries NumPy and pandas for the analysis, with Matplotlib and Seaborn for the visualization.

Methodology:

So, let's start by importing the dataset into Google Colaboratory, commonly known as 'Colab'. You can use any environment, depending on personal preference. As Google Colab is a cloud-based environment, importing the dataset involves one extra step: first uploading the file to Colab, then reading it in the traditional way. Here's how:

from google.colab import files
upload = files.upload()

This will prompt you to select the dataset from your local computer, and you can then read that same file for your analysis.

  • weatherHistory.csv(application/vnd.ms-excel) - 13414498 bytes, last modified: 11/9/2020
  • Saving weatherHistory.csv to weatherHistory.csv

Now we can see the file uploading. Once it's done, let's read the data and store it in a variable.

    import pandas as pd

    dataset = pd.read_csv("weatherHistory.csv")

The uploaded data is now stored in the variable 'dataset', on which we can perform both the analysis and the visualization.

Coming to the data cleaning task: it's good practice to first check whether any null values are present in the dataset, since they are a common issue that affects the accuracy of our analysis. So, let's check.

    dataset.isnull().sum()

Executing the above cell gives us the number of rows containing null values for each and every feature in our dataset. Let's see whether any feature has null values.

    Formatted Date                0
    Summary                       0
    Precip Type                 517
    Temperature (C)               0
    Apparent Temperature (C)      0
    Humidity                      0
    Wind Speed (km/h)             0
    Wind Bearing (degrees)        0
    Visibility (km)               0
    Pressure (millibars)          0
    Daily Summary                 0
    dtype: int64

Here we can see that the feature named 'Precip Type' contains null values in 517 rows. There are various ways to deal with null values, but in this case it depends on whether the feature is really needed for our analysis. For now we will perform analysis only on 'Humidity' and 'Apparent Temperature (C)'.
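As a side note, if 'Precip Type' were needed, the nulls could either be dropped or filled. Here's a minimal sketch on a toy frame (the values are illustrative, not taken from the Kaggle file):

```python
import pandas as pd

# Toy frame standing in for the weather data (values are illustrative)
df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", None],
    "Humidity": [0.89, 0.86, 0.83, 0.83],
})

# Option 1: drop the rows whose 'Precip Type' is missing
dropped = df.dropna(subset=["Precip Type"])

# Option 2: fill the missing entries with the most frequent value
filled = df.fillna({"Precip Type": df["Precip Type"].mode()[0]})

print(len(dropped))                           # 2
print(filled["Precip Type"].isnull().sum())   # 0
```

Which option is better depends on how much data you can afford to lose and whether the filled value would bias the analysis.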

Let's take a look at our dataset:

    dataset.head()

Here we can clearly see that the 'Formatted Date' column is just plain text rather than a proper datetime index, which makes time-based analysis as well as visualization difficult.

       Formatted Date                   Summary        Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)  Daily Summary
    0  2006-04-01 00:00:00.000 +0200    Partly Cloudy  rain         9.472222         7.388889                  0.89      14.1197            251                     15.8263          1015.13               Partly cloudy throughout the day.
    1  2006-04-01 01:00:00.000 +0200    Partly Cloudy  rain         9.355556         7.227778                  0.86      14.2646            259                     15.8263          1015.63               Partly cloudy throughout the day.
    2  2006-04-01 02:00:00.000 +0200    Mostly Cloudy  rain         9.377778         9.377778                  0.89      3.9284             204                     14.9569          1015.94               Partly cloudy throughout the day.
    3  2006-04-01 03:00:00.000 +0200    Partly Cloudy  rain         8.288889         5.944444                  0.83      14.1036            269                     15.8263          1016.41               Partly cloudy throughout the day.
    4  2006-04-01 04:00:00.000 +0200    Mostly Cloudy  rain         8.755556         6.977778                  0.83      11.0446            259                     15.8263          1016.51               Partly cloudy throughout the day.

Let's get this sorted. As the next step in data cleaning, we will parse 'Formatted Date' as a datetime and set it as the index.

    dataset['Formatted Date'] = pd.to_datetime(dataset['Formatted Date'], utc=True)
    dataset = dataset.set_index('Formatted Date')
    dataset.head()

And we are done setting 'Formatted Date' as a proper datetime index; note that the timestamps have been normalized to UTC. Now we can work with the data efficiently and with ease.

    Formatted Date             Summary        Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)  Daily Summary
    2006-03-31 22:00:00+00:00  Partly Cloudy  rain         9.472222         7.388889                  0.89      14.1197            251                     15.8263          1015.13               Partly cloudy throughout the day.
    2006-03-31 23:00:00+00:00  Partly Cloudy  rain         9.355556         7.227778                  0.86      14.2646            259                     15.8263          1015.63               Partly cloudy throughout the day.
    2006-04-01 00:00:00+00:00  Mostly Cloudy  rain         9.377778         9.377778                  0.89      3.9284             204                     14.9569          1015.94               Partly cloudy throughout the day.
    2006-04-01 01:00:00+00:00  Partly Cloudy  rain         8.288889         5.944444                  0.83      14.1036            269                     15.8263          1016.41               Partly cloudy throughout the day.
    2006-04-01 02:00:00+00:00  Mostly Cloudy  rain         8.755556         6.977778                  0.83      11.0446            259                     15.8263          1016.51               Partly cloudy throughout the day.
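The shift in the timestamps above is worth a note: parsing with `utc=True` converts the CSV's `+0200` offsets to a uniform UTC index. A tiny sketch with two illustrative rows:

```python
import pandas as pd

# Two illustrative rows mimicking the CSV's date format
toy = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00:00.000 +0200",
                       "2006-04-01 01:00:00.000 +0200"],
    "Humidity": [0.89, 0.86],
})

# utc=True normalizes the +0200 offsets to UTC, which is why
# 00:00 local time shows up as 22:00 on the previous day
toy["Formatted Date"] = pd.to_datetime(toy["Formatted Date"], utc=True)
toy = toy.set_index("Formatted Date")
print(toy.index[0])  # 2006-03-31 22:00:00+00:00
```

A uniform timezone-aware index is also what lets resampling work cleanly in the next step.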


We are only concerned with 'Apparent Temperature (C)' and 'Humidity' here, so let's select those two columns and leave out the rest for this analysis.

    data_columns = ['Apparent Temperature (C)', 'Humidity']
    # 'MS' resamples to month-start frequency; .mean() averages each month
    df_monthly_mean = dataset[data_columns].resample('MS').mean()
    df_monthly_mean.head()

The data is much clearer now.

Now let's do the last step of data cleaning: extract only the data for one month that is common to every year, so we don't have to deal with the mess of the whole year's data, which would also make the pattern harder to analyze. Here we take the monthly mean for April only.

    # keep only the April (month == 4) rows from each year
    df1 = df_monthly_mean[df_monthly_mean.index.month==4]
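To make the resample-then-filter step concrete, here is a self-contained sketch on a toy hourly series (the dates and values are made up, not from the weather dataset):

```python
import pandas as pd
import numpy as np

# Toy hourly series spanning late March into April 2006
idx = pd.date_range("2006-03-31 22:00", periods=100, freq="h", tz="UTC")
toy = pd.DataFrame({"Humidity": np.linspace(0.8, 0.9, 100)}, index=idx)

# 'MS' buckets the series by month start; .mean() averages each bucket
monthly = toy.resample("MS").mean()

# Keep only the April bucket, mirroring df1 above
april = monthly[monthly.index.month == 4]
print(len(monthly), len(april))  # 2 1
```

The 100 hours straddle the March/April boundary, so the resample produces two monthly rows and the month filter keeps just the April one.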

Now that we have the required data, let's move forward with the visualization.

    import seaborn as sns
    import matplotlib.pyplot as plt

    plt.figure(figsize=(14,6))
    plt.title("Variation in Apparent Temperature vs Humidity")
    sns.lineplot(data=df_monthly_mean)

The output of the above cell looks like this:

Here we can see that the average apparent temperature in the month of April is approximately the same from year to year, with only slight differences, and the humidity level stays nearly constant across the whole 10 years.
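To back the "approximately the same" reading with a number, one can look at the spread of the yearly April means. A sketch with made-up values (not the real dataset's numbers):

```python
import pandas as pd

# Illustrative April mean temperatures for 2006-2016 (made-up values)
april = pd.Series([12.1, 12.4, 12.9, 14.8, 12.6, 12.2,
                   12.7, 12.5, 12.3, 11.0, 12.8],
                  index=range(2006, 2017))

# A standard deviation well under 1 C against a ~12.6 C mean
# supports the "roughly constant" reading of the plot
print(round(april.mean(), 2), round(april.std(), 2))
```

The same two lines run on `df1['Apparent Temperature (C)']` would quantify the actual plot.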

    import matplotlib.dates as mdates

    fig, ax = plt.subplots(figsize=(15,5))
    ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Apparent Temperature (C)'], marker='o', linestyle='-',label='Apparent Temperature (C)')
    ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Humidity'], marker='o', linestyle='-',label='Humidity')
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
    ax.legend(loc = 'center right')
    ax.set_xlabel('Month of April') 

Output:

We can clearly see that there is a sharp rise in temperature in the year 2009, whereas there is a fall in temperature in the year 2015. Hence we can conclude that there has been noticeable year-to-year variation in temperature over the past 10 years, possibly linked to global warming, while the average humidity has remained constant throughout.




