Performing Analysis of Meteorological Data

By: Hrishikesh Dherange


Introduction:

Hello folks, welcome to my blog, where we will perform analysis on meteorological data, i.e. weather data. The dataset is taken from Kaggle, and the link for the same is: "https://www.kaggle.com/muthuj7/weather-dataset". All the operations will be done using the standard Python libraries NumPy and pandas for the analysis, with Matplotlib and Seaborn for the visualization.

Methodology:

So, let's start by importing the dataset into Google Colaboratory, commonly known as 'Colab'. You can use any environment, depending on personal preference. As Google Colab is a cloud-based environment, importing the dataset involves one extra step: first uploading the file to Colab, then reading it in the traditional way. Here's how:

from google.colab import files
upload = files.upload()

This will prompt you to select the dataset from your local computer, and you can then read that same file for your analysis.

  • weatherHistory.csv(application/vnd.ms-excel) - 13414498 bytes, last modified: 11/9/2020
  • Saving weatherHistory.csv to weatherHistory.csv

Now we can see the file uploading. Once it's done, let's read the data and store it in a variable.

    import pandas as pd

    dataset = pd.read_csv("weatherHistory.csv")

The uploaded data is now stored in the variable 'dataset', on which we can perform both the analysis and the visualization.

Coming to the data cleaning task: it's good practice to first check whether any null values are present in the dataset, since they are a common issue that affects the accuracy of our analysis. So, let's check.

    dataset.isnull().sum()

Executing the above cell gives us the number of rows containing null values for each and every feature in our dataset. Let's see whether any feature has null values.

    Formatted Date                0
    Summary                       0
    Precip Type                 517
    Temperature (C)               0
    Apparent Temperature (C)      0
    Humidity                      0
    Wind Speed (km/h)             0
    Wind Bearing (degrees)        0
    Visibility (km)               0
    Pressure (millibars)          0
    Daily Summary                 0
    dtype: int64

Here we can see that the feature named 'Precip Type' contains null values in 517 rows. There are various ways to deal with null values, but in this case it depends on whether the feature is really needed for our analysis. For now we will perform analysis only on 'Humidity' and 'Apparent Temperature (C)'.
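As a side note, if 'Precip Type' were needed, the nulls could either be dropped or filled. Here's a minimal sketch on a toy frame (the values are illustrative, not taken from the Kaggle file):

```python
import pandas as pd

# Toy frame standing in for the weather data (values are illustrative)
df = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", None],
    "Humidity": [0.89, 0.86, 0.83, 0.83],
})

# Option 1: drop the rows whose 'Precip Type' is missing
dropped = df.dropna(subset=["Precip Type"])

# Option 2: fill the missing entries with the most frequent value
filled = df.fillna({"Precip Type": df["Precip Type"].mode()[0]})

print(len(dropped))                           # 2
print(filled["Precip Type"].isnull().sum())   # 0
```

Which option is better depends on how much data you can afford to lose and whether the filled value would bias the analysis.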

Let's take a look at our dataset:

    dataset.head()

Here we can clearly see that the 'Formatted Date' column is just plain text rather than a proper datetime index, which makes time-based analysis as well as visualization difficult.

       Formatted Date                   Summary        Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)  Daily Summary
    0  2006-04-01 00:00:00.000 +0200    Partly Cloudy  rain         9.472222         7.388889                  0.89      14.1197            251                     15.8263          1015.13               Partly cloudy throughout the day.
    1  2006-04-01 01:00:00.000 +0200    Partly Cloudy  rain         9.355556         7.227778                  0.86      14.2646            259                     15.8263          1015.63               Partly cloudy throughout the day.
    2  2006-04-01 02:00:00.000 +0200    Mostly Cloudy  rain         9.377778         9.377778                  0.89      3.9284             204                     14.9569          1015.94               Partly cloudy throughout the day.
    3  2006-04-01 03:00:00.000 +0200    Partly Cloudy  rain         8.288889         5.944444                  0.83      14.1036            269                     15.8263          1016.41               Partly cloudy throughout the day.
    4  2006-04-01 04:00:00.000 +0200    Mostly Cloudy  rain         8.755556         6.977778                  0.83      11.0446            259                     15.8263          1016.51               Partly cloudy throughout the day.

Let's get this sorted. As the next step in data cleaning, we will parse 'Formatted Date' as a datetime and set it as the index.

    dataset['Formatted Date'] = pd.to_datetime(dataset['Formatted Date'], utc=True)
    dataset = dataset.set_index('Formatted Date')
    dataset.head()

And we are done setting 'Formatted Date' as a proper datetime index; note that the timestamps have been normalized to UTC. Now we can work with the data efficiently and with ease.

    Formatted Date             Summary        Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)  Daily Summary
    2006-03-31 22:00:00+00:00  Partly Cloudy  rain         9.472222         7.388889                  0.89      14.1197            251                     15.8263          1015.13               Partly cloudy throughout the day.
    2006-03-31 23:00:00+00:00  Partly Cloudy  rain         9.355556         7.227778                  0.86      14.2646            259                     15.8263          1015.63               Partly cloudy throughout the day.
    2006-04-01 00:00:00+00:00  Mostly Cloudy  rain         9.377778         9.377778                  0.89      3.9284             204                     14.9569          1015.94               Partly cloudy throughout the day.
    2006-04-01 01:00:00+00:00  Partly Cloudy  rain         8.288889         5.944444                  0.83      14.1036            269                     15.8263          1016.41               Partly cloudy throughout the day.
    2006-04-01 02:00:00+00:00  Mostly Cloudy  rain         8.755556         6.977778                  0.83      11.0446            259                     15.8263          1016.51               Partly cloudy throughout the day.
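The shift in the timestamps above is worth a note: parsing with `utc=True` converts the CSV's `+0200` offsets to a uniform UTC index. A tiny sketch with two illustrative rows:

```python
import pandas as pd

# Two illustrative rows mimicking the CSV's date format
toy = pd.DataFrame({
    "Formatted Date": ["2006-04-01 00:00:00.000 +0200",
                       "2006-04-01 01:00:00.000 +0200"],
    "Humidity": [0.89, 0.86],
})

# utc=True normalizes the +0200 offsets to UTC, which is why
# 00:00 local time shows up as 22:00 on the previous day
toy["Formatted Date"] = pd.to_datetime(toy["Formatted Date"], utc=True)
toy = toy.set_index("Formatted Date")
print(toy.index[0])  # 2006-03-31 22:00:00+00:00
```

A uniform timezone-aware index is also what lets resampling work cleanly in the next step.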


We are only concerned with 'Apparent Temperature (C)' and 'Humidity' here, so let's select those two columns and leave out the rest for this analysis.

    data_columns = ['Apparent Temperature (C)', 'Humidity']
    # 'MS' resamples to month-start frequency; .mean() averages each month
    df_monthly_mean = dataset[data_columns].resample('MS').mean()
    df_monthly_mean.head()

The data is much clearer now.

Now let's do the last step of data cleaning: extract only the data for one month that is common to every year, so we don't have to deal with the mess of the whole year's data, which would also make the pattern harder to analyze. Here we take the monthly mean for April only.

    # keep only the April (month == 4) rows from each year
    df1 = df_monthly_mean[df_monthly_mean.index.month==4]
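To make the resample-then-filter step concrete, here is a self-contained sketch on a toy hourly series (the dates and values are made up, not from the weather dataset):

```python
import pandas as pd
import numpy as np

# Toy hourly series spanning late March into April 2006
idx = pd.date_range("2006-03-31 22:00", periods=100, freq="h", tz="UTC")
toy = pd.DataFrame({"Humidity": np.linspace(0.8, 0.9, 100)}, index=idx)

# 'MS' buckets the series by month start; .mean() averages each bucket
monthly = toy.resample("MS").mean()

# Keep only the April bucket, mirroring df1 above
april = monthly[monthly.index.month == 4]
print(len(monthly), len(april))  # 2 1
```

The 100 hours straddle the March/April boundary, so the resample produces two monthly rows and the month filter keeps just the April one.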

Now that we have the required data, let's move forward with the visualization.

    import seaborn as sns
    import matplotlib.pyplot as plt

    plt.figure(figsize=(14,6))
    plt.title("Variation in Apparent Temperature vs Humidity")
    sns.lineplot(data=df_monthly_mean)

The output of the above cell looks like this:

Here we can see that the average apparent temperature in the month of April is approximately the same from year to year, with only slight differences, and the humidity level stays nearly constant across the whole 10 years.
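To back the "approximately the same" reading with a number, one can look at the spread of the yearly April means. A sketch with made-up values (not the real dataset's numbers):

```python
import pandas as pd

# Illustrative April mean temperatures for 2006-2016 (made-up values)
april = pd.Series([12.1, 12.4, 12.9, 14.8, 12.6, 12.2,
                   12.7, 12.5, 12.3, 11.0, 12.8],
                  index=range(2006, 2017))

# A standard deviation well under 1 C against a ~12.6 C mean
# supports the "roughly constant" reading of the plot
print(round(april.mean(), 2), round(april.std(), 2))
```

The same two lines run on `df1['Apparent Temperature (C)']` would quantify the actual plot.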

    import matplotlib.dates as mdates

    fig, ax = plt.subplots(figsize=(15,5))
    ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Apparent Temperature (C)'], marker='o', linestyle='-',label='Apparent Temperature (C)')
    ax.plot(df1.loc['2006-04-01':'2016-04-01', 'Humidity'], marker='o', linestyle='-',label='Humidity')
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
    ax.legend(loc = 'center right')
    ax.set_xlabel('Month of April') 

Output:

We can clearly see that there is a sharp rise in temperature in the year 2009, whereas there is a fall in temperature in the year 2015. Hence we can conclude that there has been noticeable year-to-year variation in temperature over the past 10 years, possibly linked to global warming, while the average humidity has remained constant throughout.




