Analyzing Data Wrangling project 4

By Ekofiongo Eale

Udacity

Resources

In this project, I will work with two powerful Python packages, pandas and seaborn. Both packages have extensive online documentation. There is an extensive tutorial on visualization with pandas, the seaborn tutorial contains many examples of data visualization, and the matplotlib website has additional resources for learning to plot with Python tools.


In [77]:
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import tweepy
import requests
import os
import json
import time
%matplotlib inline
In [78]:
from requests import get
from bs4 import BeautifulSoup
In [79]:
import warnings
warnings.filterwarnings('ignore')

Gathering Data for this Project

- Importing the Twitter archive CSV file (twitter-archive-enhanced.csv) into a DataFrame

In [80]:
df_archive = pd.read_csv('./twitter-archive-enhanced.csv')
In [81]:
df_archive.shape
Out[81]:
(2356, 17)
In [82]:
df_archive.describe()
Out[82]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id rating_numerator rating_denominator
count 2.356000e+03 7.800000e+01 7.800000e+01 1.810000e+02 1.810000e+02 2356.000000 2356.000000
mean 7.427716e+17 7.455079e+17 2.014171e+16 7.720400e+17 1.241698e+16 13.126486 10.455433
std 6.856705e+16 7.582492e+16 1.252797e+17 6.236928e+16 9.599254e+16 45.876648 6.745237
min 6.660209e+17 6.658147e+17 1.185634e+07 6.661041e+17 7.832140e+05 0.000000 0.000000
25% 6.783989e+17 6.757419e+17 3.086374e+08 7.186315e+17 4.196984e+09 10.000000 10.000000
50% 7.196279e+17 7.038708e+17 4.196984e+09 7.804657e+17 4.196984e+09 11.000000 10.000000
75% 7.993373e+17 8.257804e+17 4.196984e+09 8.203146e+17 4.196984e+09 12.000000 10.000000
max 8.924206e+17 8.862664e+17 8.405479e+17 8.874740e+17 7.874618e+17 1776.000000 170.000000
In [83]:
df_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [84]:
df_archive.tweet_id.nunique()
Out[84]:
2356
In [85]:
df_archive.isnull().sum()
Out[85]:
tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

- Importing the image_predictions TSV file into a DataFrame

In [86]:
# Importing the tweet image predictions TSV file into a DataFrame
predict_img_df = pd.read_csv('image_predictions.tsv', sep='\t')
In [87]:
# Download the tweet image predictions TSV file using the Requests library
response = requests.get("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv") 
In [88]:
response
Out[88]:
<Response [200]>
In [89]:
# Download the page
#url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
#response = get(url)
#print(response.text[:1000])
In [90]:
# print statistics
# high-level overview of data shape and composition
print('Rows and columns: ' + str(predict_img_df.shape))
print('')
print(predict_img_df.dtypes)
Rows and columns: (2075, 12)

tweet_id      int64
jpg_url      object
img_num       int64
p1           object
p1_conf     float64
p1_dog         bool
p2           object
p2_conf     float64
p2_dog         bool
p3           object
p3_conf     float64
p3_dog         bool
dtype: object
In [91]:
predict_img_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [92]:
predict_img_df.head()
Out[92]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

- Importing the twitter_archive_master CSV file into a DataFrame

In [93]:
twitter_archive_master = pd.read_csv('twitter_archive_master.csv')

print('Rows and columns: ' + str(twitter_archive_master.shape))
Rows and columns: (1349, 25)
In [94]:
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1349 entries, 0 to 1348
Data columns (total 25 columns):
tweet_id                 1349 non-null int64
in_reply_to_status_id    1349 non-null int64
in_reply_to_user_id      1349 non-null int64
timestamp                1349 non-null object
source                   1349 non-null object
text                     1349 non-null object
expanded_urls            1349 non-null object
rating_numerator         1349 non-null float64
rating_denominator       1349 non-null int64
name                     1349 non-null object
dog_stages               182 non-null object
jpg_url                  1349 non-null object
img_num                  1349 non-null int64
p1                       1349 non-null object
p1_conf                  1349 non-null float64
p1_dog                   1349 non-null bool
p2                       1349 non-null object
p2_conf                  1349 non-null float64
p2_dog                   1349 non-null bool
p3                       1349 non-null object
p3_conf                  1349 non-null float64
p3_dog                   1349 non-null bool
retweet_count            1349 non-null int64
favorite_count           1349 non-null int64
display_text_range       1349 non-null object
dtypes: bool(3), float64(4), int64(7), object(11)
memory usage: 235.9+ KB
In [95]:
twitter_archive_master.describe()
Out[95]:
tweet_id in_reply_to_status_id in_reply_to_user_id rating_numerator rating_denominator img_num p1_conf p2_conf p3_conf retweet_count favorite_count
count 1.349000e+03 1.349000e+03 1.349000e+03 1349.000000 1349.000000 1349.000000 1349.000000 1.349000e+03 1.349000e+03 1349.000000 1349.000000
mean 7.446039e+17 4.950381e+14 3.111181e+06 12.048569 10.030393 1.210526 0.600582 1.338343e-01 6.050994e-02 2752.896219 9410.949592
std 6.880680e+16 1.818212e+16 1.142698e+08 48.112024 1.089386 0.574050 0.274176 1.007713e-01 5.168008e-02 3969.380179 11685.811257
min 6.664188e+17 0.000000e+00 0.000000e+00 2.000000 10.000000 1.000000 0.044333 1.011300e-08 1.740170e-10 23.000000 107.000000
25% 6.801158e+17 0.000000e+00 0.000000e+00 10.000000 10.000000 1.000000 0.367368 5.171310e-02 1.540050e-02 708.000000 2369.000000
50% 7.240499e+17 0.000000e+00 0.000000e+00 11.000000 10.000000 1.000000 0.599660 1.180890e-01 4.923690e-02 1541.000000 4878.000000
75% 8.011151e+17 0.000000e+00 0.000000e+00 12.000000 10.000000 1.000000 0.868560 1.975130e-01 9.411200e-02 3357.000000 12334.000000
max 8.924206e+17 6.678065e+17 4.196984e+09 1776.000000 50.000000 4.000000 1.000000 4.880140e-01 2.734190e-01 56625.000000 107956.000000

Tweets (Twitter API)

In [96]:
# !pip install tweepy
In [97]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions

# app login
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'

# user login
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor

# Iterate through all of the authenticated user's friends
#for friend in tweepy.Cursor(api.friends).items():
    # Process the friend here
    #print(friend)

screen_name = "dog_rates"
statuses = api.user_timeline(screen_name)

print("retrieved: ", len(statuses))
status = statuses[0]

data_list = []
for status in statuses:
    data_list.append(status._json)


df = pd.DataFrame(data_list)
df
retrieved:  20
Out[97]:
created_at id id_str text truncated entities source in_reply_to_status_id in_reply_to_status_id_str in_reply_to_user_id ... coordinates place contributors is_quote_status retweet_count favorite_count favorited retweeted lang possibly_sensitive
0 Wed Sep 16 23:58:30 +0000 2020 1306381981140226049 1306381981140226049 @melissamjwine omg False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="https://about.twitter.com/products/tw... 1.306379e+18 1306379029067554816 2.948032e+09 ... None None None False 0 6 False False und NaN
1 Wed Sep 16 18:34:30 +0000 2020 1306300446273724416 1306300446273724416 @laurDIY innocent False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.306300e+18 1306300162885607425 3.804365e+08 ... None None None False 0 588 False False en NaN
2 Wed Sep 16 18:08:24 +0000 2020 1306293875263172609 1306293875263172609 @afookinglousah that’d be wild False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.306287e+18 1306287330815365127 2.275193e+09 ... None None None False 1 265 False False en NaN
3 Wed Sep 16 16:45:03 +0000 2020 1306272902250033153 1306272902250033153 @Sm813Nancy MIKE WAGOWSKI False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.306271e+18 1306271204328771584 5.222814e+08 ... None None None False 3 1561 False False pl NaN
4 Wed Sep 16 16:28:55 +0000 2020 1306268838573424645 1306268838573424645 @abby_geniusz i do yes 🥺 False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.306268e+18 1306268049947975682 7.674052e+17 ... None None None False 0 538 False False en NaN
5 Wed Sep 16 16:25:13 +0000 2020 1306267907362492416 1306267907362492416 This is Tao and Oko. Tao had to have his eyes ... True {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN None NaN ... None None None False 10435 92838 False False en False
6 Wed Sep 16 15:15:21 +0000 2020 1306250326442299394 1306250326442299394 @SarahCAndersen the dog understood her that’s ... False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.306240e+18 1306239570799931393 9.160316e+08 ... None None None False 203 6564 False False en NaN
7 Tue Sep 15 19:07:08 +0000 2020 1305946269643190272 1305946269643190272 @bschwaaa 😭😭😭 False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305946e+18 1305945544632668160 8.591741e+08 ... None None None False 0 126 False False und NaN
8 Tue Sep 15 18:45:12 +0000 2020 1305940748865601537 1305940748865601537 @Jen_eration_x @jamiemacqueen omg False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305929e+18 1305929295949262848 2.592777e+07 ... None None None False 0 24 False False und NaN
9 Tue Sep 15 17:06:53 +0000 2020 1305916009174323201 1305916009174323201 @carrieannie3 @Jen_eration_x SO EXCITING False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="https://about.twitter.com/products/tw... 1.305906e+18 1305905992887152640 3.482786e+08 ... None None None False 0 120 False False en NaN
10 Tue Sep 15 16:22:56 +0000 2020 1305904946194259968 1305904946194259968 @darth @Jen_eration_x HE HAD A HUNCH DARTH THI... False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="https://about.twitter.com/products/tw... 1.305905e+18 1305904658783723520 1.337271e+06 ... None None None False 10 1496 False False en NaN
11 Tue Sep 15 16:16:58 +0000 2020 1305903446436933633 1305903446436933633 @ProfBreen587 @Jen_eration_x yes he can False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305902e+18 1305902444820525064 1.275259e+18 ... None None None False 0 523 False False en NaN
12 Tue Sep 15 16:04:38 +0000 2020 1305900341829935104 1305900341829935104 @JordanGreenwaId hey i like your songs False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305900e+18 1305899852866310144 8.248215e+07 ... None None None False 8 88 False False en NaN
13 Tue Sep 15 16:01:39 +0000 2020 1305899590726504449 1305899590726504449 This is Henry. He's had a very long year. Wond... True {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... NaN None NaN ... None None None False 12118 112342 False False en False
14 Tue Sep 15 02:47:47 +0000 2020 1305699806585786370 1305699806585786370 @ButteSheriff hi! any pupdates on trooper? False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.304873e+18 1304872699534008320 7.878647e+08 ... None None None False 1 11 False False en NaN
15 Mon Sep 14 16:28:48 +0000 2020 1305544036644659211 1305544036644659211 @andrea1079 @TakuroSpirit @asdfghjill no no th... False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305544e+18 1305543830356230148 2.703605e+07 ... None None None False 2 480 False False en NaN
16 Mon Sep 14 16:13:39 +0000 2020 1305540222214578185 1305540222214578185 @papscun 😍😍😍 False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305537e+18 1305536750920568837 3.351042e+09 ... None None None False 0 4 False False und NaN
17 Mon Sep 14 16:09:38 +0000 2020 1305539211156570113 1305539211156570113 @TakuroSpirit @asdfghjill i’m sorry but “dogs ... False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305538e+18 1305537906111582209 2.034178e+07 ... None None None False 23 1411 False False en NaN
18 Mon Sep 14 16:03:19 +0000 2020 1305537621108469761 1305537621108469761 @prashamparikh @asdfghjill same 😔 False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305537e+18 1305537439038021632 4.168653e+08 ... None None None False 0 372 False False en NaN
19 Mon Sep 14 16:02:35 +0000 2020 1305537435925848064 1305537435925848064 @RissRiss24 @asdfghjill it hit him at 40mph it... False {'hashtags': [], 'symbols': [], 'user_mentions... <a href="http://twitter.com/download/iphone" r... 1.305537e+18 1305537250105741314 5.495929e+08 ... None None None False 0 789 False False en NaN

20 rows × 24 columns

The Twitter API gives developers access to most of Twitter’s functionality. You can use the API to read and write information about Twitter entities such as tweets, users, and trends. The API query above runs without error.
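The pattern used in the cell above, flattening each status's `_json` dict into a DataFrame row, can be sketched without network access. The dicts below are made-up stand-ins for tweepy's `status._json` payloads:

```python
import pandas as pd

# Made-up status payloads standing in for tweepy's status._json
statuses = [
    {'id': 111, 'retweet_count': 5, 'favorite_count': 40},
    {'id': 222, 'retweet_count': 2, 'favorite_count': 9},
]

# Each dict becomes one row; keys become columns
df = pd.DataFrame(statuses)
print(df.shape)  # (2, 3)
```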

In [98]:
df.describe()
Out[98]:
id in_reply_to_status_id in_reply_to_user_id retweet_count favorite_count
count 2.000000e+01 1.800000e+01 1.800000e+01 20.000000 20.000000
mean 1.305942e+18 1.305878e+18 1.134813e+17 1140.200000 11007.300000
std 3.024863e+14 3.933680e+14 3.415700e+17 3477.566793 31512.876994
min 1.305537e+18 1.304873e+18 1.337271e+06 0.000000 4.000000
25% 1.305661e+18 1.305539e+18 1.489313e+08 0.000000 112.000000
50% 1.305910e+18 1.305905e+18 5.359372e+08 0.500000 501.500000
75% 1.306268e+18 1.306261e+18 1.935403e+09 8.500000 1432.250000
max 1.306382e+18 1.306379e+18 1.275259e+18 12118.000000 112342.000000

Reading tweet JSON data

In [99]:
# List of dictionaries to read tweet's JSON data line by line and later convert to a DataFrame
df_list = []
with open('tweet-json copy', 'r') as json_file:
    for line in json_file:
        status = json.loads(line)
        
        # Append to list of dictionaries
        df_list.append({'tweet_id': status['id'],
                        'retweet_count': status['retweet_count'],
                        'favorite_count': status['favorite_count'],
                        'display_text_range': status['display_text_range']
                       })

# Create a DataFrame with tweet ID, retweet count, favorite count and display_text_range
status_df = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count', 'display_text_range'])       
In [100]:
status_df
Out[100]:
tweet_id retweet_count favorite_count display_text_range
0 892420643555336193 8853 39467 [0, 85]
1 892177421306343426 6514 33819 [0, 138]
2 891815181378084864 4328 25461 [0, 121]
3 891689557279858688 8964 42908 [0, 79]
4 891327558926688256 9774 41048 [0, 138]
... ... ... ... ...
2349 666049248165822465 41 111 [0, 120]
2350 666044226329800704 147 311 [0, 137]
2351 666033412701032449 47 128 [0, 130]
2352 666029285002620928 48 132 [0, 139]
2353 666020888022790149 532 2535 [0, 131]

2354 rows × 4 columns

In [101]:
status_df.shape
Out[101]:
(2354, 4)
In [102]:
status_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
tweet_id              2354 non-null int64
retweet_count         2354 non-null int64
favorite_count        2354 non-null int64
display_text_range    2354 non-null object
dtypes: int64(3), object(1)
memory usage: 73.7+ KB

Assessing Data

Checking missing values using the seaborn heatmap function.

In [103]:
plt.figure(figsize=(20,10))
sns.heatmap(df_archive.isna(), cbar=False)
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb8f2afc08>

As we can see, many values are missing in the reply and retweet columns. Since we only want original tweets with images, we will delete those rows later; the missing values in the "expanded_urls" column will also disappear with that cleanup.
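A minimal sketch of that cleanup, using a small made-up frame that mirrors the archive's reply/retweet columns (the real notebook would apply the same mask to df_archive):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the archive's reply/retweet columns
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [np.nan, 99.0, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 88.0],
    'text': ['original', 'a reply', 'a retweet'],
})

# Keep only original tweets: rows that are neither replies nor retweets
originals = sample[sample.in_reply_to_status_id.isnull()
                   & sample.retweeted_status_id.isnull()]
print(len(originals))  # 1
```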

Float histograms for df_archive
In [104]:
for col in df_archive.select_dtypes('float64'):
    plt.figure()
    sns.distplot(df_archive[col])

Qualitative variables in df_archive

In [105]:
for col in df_archive.select_dtypes('object'):
    print(f'{col :-<50} {df_archive[col].nunique()}')
timestamp----------------------------------------- 2356
source-------------------------------------------- 4
text---------------------------------------------- 2356
retweeted_status_timestamp------------------------ 181
expanded_urls------------------------------------- 2218
name---------------------------------------------- 957
doggo--------------------------------------------- 2
floofer------------------------------------------- 2
pupper-------------------------------------------- 2
puppo--------------------------------------------- 2

Count the values in the doggo column

In [106]:
df_archive['doggo'].value_counts()
Out[106]:
None     2259
doggo      97
Name: doggo, dtype: int64

Count the values in the floofer column

In [107]:
df_archive['floofer'].value_counts()
Out[107]:
None       2346
floofer      10
Name: floofer, dtype: int64

Count the values in the puppo column

In [108]:
df_archive['puppo'].value_counts()
Out[108]:
None     2326
puppo      30
Name: puppo, dtype: int64

Count the values in the pupper column

In [109]:
df_archive['pupper'].value_counts()
Out[109]:
None      2099
pupper     257
Name: pupper, dtype: int64

Dog stage data is present for only about 16% of the rows. Now let's take a look at the ratings. From what we have seen so far, the ratings always seem to follow a format like 13/10 or 12/10, so we would expect a numerator > 10 and a denominator = 10.
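That expectation can be checked by re-extracting the rating from the tweet text. The texts below are made-up examples; in the notebook the same regex would run over df_archive.text:

```python
import pandas as pd

# Hypothetical tweet texts; the real ratings live in df_archive.text
texts = pd.Series([
    "This is Henry. 13/10 would pet",
    "Meet Thea. 10/10 good bear",
    "After so much debate... 9/11 it's fine",
])

# Extract the first numerator/denominator pair from each text
ratings = texts.str.extract(r'(\d+)/(\d+)').astype(float)
ratings.columns = ['numerator', 'denominator']

# Flag rows whose denominator deviates from the expected 10
odd = ratings[ratings.denominator != 10]
print(len(odd))  # 1
```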

In [110]:
df_archive.nunique()
Out[110]:
tweet_id                      2356
in_reply_to_status_id           77
in_reply_to_user_id             31
timestamp                     2356
source                           4
text                          2356
retweeted_status_id            181
retweeted_status_user_id        25
retweeted_status_timestamp     181
expanded_urls                 2218
rating_numerator                40
rating_denominator              18
name                           957
doggo                            2
floofer                          2
pupper                           2
puppo                            2
dtype: int64
In [111]:
for col in df_archive.select_dtypes('object'):
    print(f'{col :-<50} {df_archive[col].value_counts(normalize=True)}')
timestamp----------------------------------------- 2016-03-11 21:15:02 +0000    0.000424
2016-05-03 15:46:33 +0000    0.000424
2016-11-15 01:44:00 +0000    0.000424
2015-12-04 01:55:13 +0000    0.000424
2016-02-26 04:48:02 +0000    0.000424
                               ...   
2016-06-06 15:40:26 +0000    0.000424
2016-02-20 01:00:55 +0000    0.000424
2016-05-27 01:47:23 +0000    0.000424
2016-07-30 17:56:51 +0000    0.000424
2015-11-25 03:06:32 +0000    0.000424
Name: timestamp, Length: 2356, dtype: float64
source-------------------------------------------- <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     0.942699
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                        0.038625
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                     0.014007
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>    0.004669
Name: source, dtype: float64
text---------------------------------------------- RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps:/…             0.000424
This is Erik. He's fucken massive. But also kind. Let's people hug him for free. Looks soft. 11/10 I would hug Erik https://t.co/MT7Q4aDQS1                0.000424
This is Jay. He's really h*ckin happy about the start of fall. Sneaky tongue slip in 2nd pic. 11/10 snuggly af https://t.co/vyx1X5eyWI                     0.000424
We normally don't rate bears but this one seems nice. Her name is Thea. Appears rather fluffy. 10/10 good bear https://t.co/fZc7MixeeT                     0.000424
This is Sammy. Her tongue ejects without warning sometimes. It's a serious condition. Needs a hefty dose from a BlepiPen. 13/10 https://t.co/g20EmqK7vc    0.000424
                                                                                                                                                             ...   
RT @dog_rates: Say hello to Cooper. His expression is the same wet or dry. Absolute 12/10 but Coop desperately requests your help\n\nhttps://…             0.000424
This is Tito. He's on the lookout. Nobody knows for what. 10/10 https://t.co/Qai481H6RA                                                                    0.000424
This is Cash. He's officially given pup on today. 12/10 frighteningly relatable https://t.co/m0hrATIEyw                                                    0.000424
RT @dog_rates: This is Sampson. He just graduated. Ready to be a doggo now. Time for the real world. 12/10 have fun with taxes https://t.co…               0.000424
This is an Iraqi Speed Kangaroo. It is not a dog. Please only send in dogs. I'm very angry with all of you ...9/10 https://t.co/5qpBTTpgUt                 0.000424
Name: text, Length: 2356, dtype: float64
retweeted_status_timestamp------------------------ 2015-12-28 17:12:42 +0000    0.005525
2016-10-12 15:55:59 +0000    0.005525
2017-01-15 21:49:15 +0000    0.005525
2016-07-12 18:27:35 +0000    0.005525
2017-01-11 02:15:36 +0000    0.005525
                               ...   
2017-05-27 19:39:34 +0000    0.005525
2016-02-21 01:19:47 +0000    0.005525
2016-05-19 01:38:16 +0000    0.005525
2016-11-02 23:45:19 +0000    0.005525
2016-06-13 18:27:32 +0000    0.005525
Name: retweeted_status_timestamp, Length: 181, dtype: float64
expanded_urls------------------------------------- https://twitter.com/dog_rates/status/683391852557561860/photo/1                                                                    0.000871
https://twitter.com/dog_rates/status/753375668877008896/photo/1                                                                    0.000871
https://twitter.com/dog_rates/status/681523177663676416/photo/1                                                                    0.000871
https://twitter.com/dog_rates/status/700143752053182464/photo/1                                                                    0.000871
https://twitter.com/dog_rates/status/820749716845686786/photo/1,https://twitter.com/dog_rates/status/820749716845686786/photo/1    0.000871
                                                                                                                                     ...   
https://twitter.com/dog_rates/status/875144289856114688/video/1                                                                    0.000435
https://twitter.com/dog_rates/status/673689733134946305/photo/1,https://twitter.com/dog_rates/status/673689733134946305/photo/1    0.000435
https://twitter.com/dog_rates/status/772152991789019136/photo/1,https://twitter.com/dog_rates/status/772152991789019136/photo/1    0.000435
https://twitter.com/dog_rates/status/678740035362037760/photo/1                                                                    0.000435
https://twitter.com/dog_rates/status/672139350159835138/photo/1                                                                    0.000435
Name: expanded_urls, Length: 2218, dtype: float64
name---------------------------------------------- None       0.316214
a          0.023345
Charlie    0.005093
Oliver     0.004669
Lucy       0.004669
             ...   
Comet      0.000424
Ed         0.000424
Spencer    0.000424
Mimosa     0.000424
Pavlov     0.000424
Name: name, Length: 957, dtype: float64
doggo--------------------------------------------- None     0.958829
doggo    0.041171
Name: doggo, dtype: float64
floofer------------------------------------------- None       0.995756
floofer    0.004244
Name: floofer, dtype: float64
pupper-------------------------------------------- None      0.890917
pupper    0.109083
Name: pupper, dtype: float64
puppo--------------------------------------------- None     0.987267
puppo    0.012733
Name: puppo, dtype: float64
  1. Not all tweets could be classified as doggo, floofer, pupper, or puppo, and all four columns contain the string "None"
  2. The source column contains unnecessary HTML code
  3. The name column contains the string "None"
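Each of the three issues above has a straightforward fix; a sketch on a single made-up row (the real cleaning would run on df_archive):

```python
import numpy as np
import pandas as pd

# Hypothetical row illustrating the three issues listed above
df = pd.DataFrame({
    'source': ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'],
    'name': ['None'],
    'doggo': ['None'], 'floofer': ['None'], 'pupper': ['pupper'], 'puppo': ['None'],
})

# Issue 2: strip the surrounding HTML anchor, keeping only the link text
df['source'] = df['source'].str.extract(r'>([^<]+)<', expand=False)

# Issue 3: treat the string "None" as a real missing value
df['name'] = df['name'].replace('None', np.nan)

# Issue 1: collapse the four stage columns into one dog_stage column
stages = ['doggo', 'floofer', 'pupper', 'puppo']
df[stages] = df[stages].replace('None', np.nan)
df['dog_stage'] = df[stages].bfill(axis=1).iloc[:, 0]

print(df.loc[0, 'source'])     # Twitter for iPhone
print(df.loc[0, 'dog_stage'])  # pupper
```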
In [112]:
for col in df_archive.select_dtypes('int64'):
    plt.figure()
    sns.distplot(df_archive[col])

Count how many records in df_archive are retweets

In [113]:
len(df_archive[df_archive.retweeted_status_id.isnull() == False])
Out[113]:
181
In [114]:
# Count records in df_archive whose tweet_id has no matching record in predict_img_df
len(df_archive[~df_archive.tweet_id.isin(predict_img_df.tweet_id)])
Out[114]:
281
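Those 281 tweets without image predictions can be dropped during cleaning with an inner merge; a sketch on made-up IDs (in the notebook the frames would be df_archive and predict_img_df):

```python
import pandas as pd

# Hypothetical tweet IDs standing in for df_archive and predict_img_df
archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
images = pd.DataFrame({'tweet_id': [2, 4], 'jpg_url': ['a.jpg', 'b.jpg']})

# An inner merge keeps only tweets that have an image prediction
merged = archive.merge(images, on='tweet_id', how='inner')
print(len(merged))  # 2
```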
In [115]:
df_archive.name.value_counts().sort_index(ascending=False)
Out[115]:
very            5
unacceptable    1
this            1
the             8
such            1
               ..
Aiden           1
Adele           1
Acro            1
Ace             1
Abby            2
Name: name, Length: 957, dtype: int64
In [116]:
# Sort by rating_denominator values
df_archive.rating_denominator.value_counts().sort_index()
Out[116]:
0         1
2         1
7         1
10     2333
11        3
15        1
16        1
20        2
40        1
50        3
70        1
80        2
90        1
110       1
120       1
130       1
150       1
170       1
Name: rating_denominator, dtype: int64
In [117]:
df_archive.name.value_counts()
Out[117]:
None       745
a           55
Charlie     12
Oliver      11
Lucy        11
          ... 
Comet        1
Ed           1
Spencer      1
Mimosa       1
Pavlov       1
Name: name, Length: 957, dtype: int64

Descriptive statistics for numeric variables

In [118]:
print(df_archive.describe())
           tweet_id  in_reply_to_status_id  in_reply_to_user_id  \
count  2.356000e+03           7.800000e+01         7.800000e+01   
mean   7.427716e+17           7.455079e+17         2.014171e+16   
std    6.856705e+16           7.582492e+16         1.252797e+17   
min    6.660209e+17           6.658147e+17         1.185634e+07   
25%    6.783989e+17           6.757419e+17         3.086374e+08   
50%    7.196279e+17           7.038708e+17         4.196984e+09   
75%    7.993373e+17           8.257804e+17         4.196984e+09   
max    8.924206e+17           8.862664e+17         8.405479e+17   

       retweeted_status_id  retweeted_status_user_id  rating_numerator  \
count         1.810000e+02              1.810000e+02       2356.000000   
mean          7.720400e+17              1.241698e+16         13.126486   
std           6.236928e+16              9.599254e+16         45.876648   
min           6.661041e+17              7.832140e+05          0.000000   
25%           7.186315e+17              4.196984e+09         10.000000   
50%           7.804657e+17              4.196984e+09         11.000000   
75%           8.203146e+17              4.196984e+09         12.000000   
max           8.874740e+17              7.874618e+17       1776.000000   

       rating_denominator  
count         2356.000000  
mean            10.455433  
std              6.745237  
min              0.000000  
25%             10.000000  
50%             10.000000  
75%             10.000000  
max            170.000000  
In [119]:
df_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

Predictions dataset: predict_img_df

In [120]:
predict_img_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [121]:
predict_img_df.columns
Out[121]:
Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')
Checking the column dtypes of predict_img_df with a pie chart
In [123]:
plt.figure(figsize=(8,6))
predict_img_df.dtypes.value_counts().plot.pie()
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb90db53c8>
In [124]:
for col in predict_img_df.select_dtypes('float64'):
    plt.figure()
    sns.distplot(predict_img_df[col])
In [125]:
predict_img_df.describe()
Out[125]:
tweet_id img_num p1_conf p2_conf p3_conf
count 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01

The min and max of the prediction confidences are well within the 0-1 range, as expected, and confidence decreases from the first to the third prediction.

In [126]:
for col in predict_img_df.select_dtypes('object'):
    print(f'{col :-<50} {predict_img_df[col].nunique()}')
jpg_url------------------------------------------- 2009
p1------------------------------------------------ 378
p2------------------------------------------------ 405
p3------------------------------------------------ 408

The best way to find duplicates is to look at jpg_url: any value count > 1 means the data contains duplicates/retweets. We only want tweets whose pictures contain dogs, so let's also see whether there are pictures for which the ML algorithm didn't predict any dog.
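The duplicate check described above can be sketched like this; the toy jpg_url values are hypothetical stand-ins for the real data (the actual frame has 2,075 rows).

```python
import pandas as pd

# Hypothetical miniature stand-in for predict_img_df; only the jpg_url
# column matters for the duplicate check.
predict_img_df = pd.DataFrame({'jpg_url': ['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg']})

# value_counts > 1 flags URLs that occur more than once (duplicates/retweets)
dup_counts = predict_img_df['jpg_url'].value_counts()
n_duplicated_urls = int((dup_counts > 1).sum())
print(n_duplicated_urls)  # → 1 ('a.jpg' appears twice)
```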

In [127]:
for col in predict_img_df.select_dtypes('object'):
    print(f'{col :-<60} {predict_img_df[col].value_counts()}')
jpg_url----------------------------------------------------- https://pbs.twimg.com/media/CpmyNumW8AAAJGj.jpg    2
https://pbs.twimg.com/media/Cbs3DOAXIAAp3Bd.jpg    2
https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg    2
https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg    2
https://pbs.twimg.com/media/CVMOlMiWwAA4Yxl.jpg    2
                                                  ..
https://pbs.twimg.com/media/CWD_jQMWEAAdYwH.jpg    1
https://pbs.twimg.com/media/CVtvf6bWwAAd1rT.jpg    1
https://pbs.twimg.com/media/CqBiMAgWAAEJKgI.jpg    1
https://pbs.twimg.com/media/CUTl5m1WUAAabZG.jpg    1
https://pbs.twimg.com/media/CVRGDrsWsAAUWSF.jpg    1
Name: jpg_url, Length: 2009, dtype: int64
p1---------------------------------------------------------- golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
agama                   1
wooden_spoon            1
grey_fox                1
rain_barrel             1
flamingo                1
Name: p1, Length: 378, dtype: int64
p2---------------------------------------------------------- Labrador_retriever    104
golden_retriever       92
Cardigan               73
Chihuahua              44
Pomeranian             42
                     ... 
Gila_monster            1
iPod                    1
neck_brace              1
cardigan                1
otter                   1
Name: p2, Length: 405, dtype: int64
p3---------------------------------------------------------- Labrador_retriever    79
Chihuahua             58
golden_retriever      48
Eskimo_dog            38
kelpie                35
                      ..
wing                   1
eel                    1
grand_piano            1
theater_curtain        1
rapeseed               1
Name: p3, Length: 408, dtype: int64

We saw that some entries have nothing to do with dogs: minibus, fountain, desktop_computer, ... But we can identify them with the column "p1_dog".

In [128]:
predict_img_df.p1_dog.value_counts()
Out[128]:
True     1532
False     543
Name: p1_dog, dtype: int64

So, of the 2,075 pictures, 543 are identified as not being dogs.
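Rows where none of the three predictions is a dog can be flagged by combining the three boolean columns; a minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical miniature version of predict_img_df (the real one has 2,075 rows)
predict_img_df = pd.DataFrame({
    'p1_dog': [True, False, False],
    'p2_dog': [True, False, True],
    'p3_dog': [True, False, True],
})

# True where no prediction (p1, p2 or p3) identified a dog
no_dog_mask = ~(predict_img_df['p1_dog'] | predict_img_df['p2_dog'] | predict_img_df['p3_dog'])
print(no_dog_mask.sum())  # → 1 (only the middle row has no dog at all)
```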

Also checking for missing data, first for predict_img_df

In [129]:
plt.figure(figsize=(20,10))
sns.heatmap(predict_img_df.isna(), cbar=False)
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb8f50f888>
In [130]:
predict_img_df.head()
Out[130]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
  • the predictions are sometimes lowercase, sometimes uppercase
  • there is an underscore instead of a whitespace between the words
  • there are rows with no prediction of a dog (neither in p1, p2 nor p3)

Let's repeat the same pattern for the status_df dataset: missing data → visual assessment.

In [131]:
plt.figure(figsize=(20,10))
sns.heatmap(status_df.isna(), cbar=False)
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb8f704808>
In [132]:
status_df.head()
Out[132]:
tweet_id retweet_count favorite_count display_text_range
0 892420643555336193 8853 39467 [0, 85]
1 892177421306343426 6514 33819 [0, 138]
2 891815181378084864 4328 25461 [0, 121]
3 891689557279858688 8964 42908 [0, 79]
4 891327558926688256 9774 41048 [0, 138]
In [133]:
status_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
tweet_id              2354 non-null int64
retweet_count         2354 non-null int64
favorite_count        2354 non-null int64
display_text_range    2354 non-null object
dtypes: int64(3), object(1)
memory usage: 73.7+ KB
In [134]:
for col in status_df.select_dtypes('object'):
    print(f'{col :-<50} {status_df[col].value_counts()}')
display_text_range-------------------------------- [0, 140]     260
[0, 139]     147
[0, 138]      80
[0, 137]      78
[0, 116]      59
            ... 
[13, 105]      1
[0, 42]        1
[30, 60]       1
[0, 13]        1
[0, 18]        1
Name: display_text_range, Length: 143, dtype: int64

Quality

Concerning df_archive

  • unnecessary html tags in source column in place of utility name
  • df_archive_clean: In the name column, there are several values that are not dog names, like 'a', 'the', 'such', etc. We notice that all of these observations have lowercase characters. We should clean up this field.
  • Change in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to uint64 and use 0 as not-available
  • If expanded_urls is null, reconstruct it from the tweet id
  • df_archive_clean: contains retweets and therefore, duplicates
  • Remove the 343rd entry from the dataframe
  • Find when text has dog stages in it and add it to the corresponding column
  • Find entries that are retweets by matching the text pattern 'RT @...' and drop them from the table
  • Get the decimal ratings by matching the text pattern "#.#/#" and save as new rating (only denominators had decimals)
  • Most ratings are expected to have a denominator of 10 and a numerator around 0...20. We take a look at potential outliers and manually remove incorrect tweets.
  • Keep only those records in arc_df table whose tweet_id exists in img_df table.
  • Drop retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns from df_archive_clean table.
  • doggo, floofer, pupper and puppo columns in df_archive_clean table should be merged into one column named "dog_stage".
  • Replace the value 'None' with NaN (missing value).
  • 'None' previously passed as a name but isn't one. Drop tweets with missing names (previously 'None', now NaN).
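The planned merge of the four stage columns into a single dog_stage column could be sketched as follows; the rows are hypothetical and the real frame has many more columns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the doggo/floofer/pupper/puppo columns
df = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'None'],
    'floofer': ['None', 'None',  'None'],
    'pupper':  ['pupper', 'None', 'None'],
    'puppo':   ['None', 'None',  'None'],
})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
# Replace the string 'None' with a real missing value first
stages = df[stage_cols].replace('None', np.nan)
# Join whatever stages remain per row; an empty result becomes NaN
df['dog_stage'] = stages.apply(lambda r: ','.join(r.dropna()), axis=1).replace('', np.nan)
print(df['dog_stage'].tolist())  # → ['pupper', 'doggo', nan]
```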

Concerning predict_img_df

  • p1, p2, p3 inconsistent capitalization (sometimes first letter is capital)
  • no empty data cell

Concerning status_df

  • no empty data cell

Tidiness

  • The three data frames df_archive, predict_img_df and status_df should be one combined table, since each table's entries describe one tweet
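Combining the three tables on tweet_id could look like the following sketch; the miniature frames and their columns are illustrative stand-ins for the real ones.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables, keyed by tweet_id
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
images  = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['pug', 'collie']})
status  = pd.DataFrame({'tweet_id': [1, 2, 3], 'retweet_count': [5, 7, 9]})

# Inner joins keep only tweets present in every table
combined = archive.merge(images, on='tweet_id').merge(status, on='tweet_id')
print(combined.shape)  # → (2, 4)
```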

Cleaning Data

Making copies of dataframes before cleaning

In [135]:
df_archive_clean = df_archive.copy()
predict_img_df_clean = predict_img_df.copy()
status_df_clean = status_df.copy()

unnecessary html tags in source column in place of utility name

Define

Strip all HTML anchor tags (i.e. <a ...> and </a>) in the source column and retain just the text between the tags. Convert the datatype from string to categorical.

Code

In [136]:
df_archive_clean.source = df_archive_clean.source.str.replace(r'<(?:a\b[^>]*>|/a>)', '', regex=True)
df_archive_clean.source = df_archive_clean.source.astype('category')

Test

In [137]:
df_archive_clean.source.value_counts()
Out[137]:
Twitter for iPhone     2221
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                11
Name: source, dtype: int64

Define

df_archive_clean: In the name column, there are several values that are not dog names, like 'a', 'the', 'such', etc. We notice that all of these observations have lowercase characters. We should clean up this field.

Code

In [138]:
mask = [not x.islower() if isinstance(x,str) 
        and len(x)>0 else False 
        for x in df_archive_clean['name']]

df_archive_clean = df_archive_clean[mask]

Test

In [139]:
# Checking for invalid names

unique_names = df_archive_clean['name'].unique()

print("<a> in names?", "a" in unique_names)
print("<the> in names?", "the" in unique_names)

unique_names
<a> in names? False
<the> in names? False
Out[139]:
array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'Maya',
       'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey', 'Lilly',
       'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella', 'Grizzwald',
       'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey', 'Gary', 'Elliot',
       'Louis', 'Jesse', 'Romeo', 'Bailey', 'Duddles', 'Jack', 'Emmy',
       'Steven', 'Beau', 'Snoopy', 'Shadow', 'Terrance', 'Aja', 'Penny',
       'Dante', 'Nelly', 'Ginger', 'Benedict', 'Venti', 'Goose', 'Nugget',
       'Cash', 'Coco', 'Jed', 'Sebastian', 'Walter', 'Sierra', 'Monkey',
       'Harry', 'Kody', 'Lassie', 'Rover', 'Napolean', 'Dawn', 'Boomer',
       'Cody', 'Rumble', 'Clifford', 'Dewey', 'Scout', 'Gizmo', 'Cooper',
       'Harold', 'Shikha', 'Jamesy', 'Lili', 'Sammy', 'Meatball',
       'Paisley', 'Albus', 'Neptune', 'Quinn', 'Belle', 'Zooey', 'Dave',
       'Jersey', 'Hobbes', 'Burt', 'Lorenzo', 'Carl', 'Jordy', 'Milky',
       'Trooper', 'Winston', 'Sophie', 'Wyatt', 'Rosie', 'Thor', 'Oscar',
       'Luna', 'Callie', 'Cermet', 'George', 'Marlee', 'Arya', 'Einstein',
       'Alice', 'Rumpole', 'Benny', 'Aspen', 'Jarod', 'Wiggles',
       'General', 'Sailor', 'Astrid', 'Iggy', 'Snoop', 'Kyle', 'Leo',
       'Riley', 'Gidget', 'Noosh', 'Odin', 'Jerry', 'Charlie', 'Georgie',
       'Rontu', 'Cannon', 'Furzey', 'Daisy', 'Tuck', 'Barney', 'Vixen',
       'Jarvis', 'Mimosa', 'Pickles', 'Bungalo', 'Brady', 'Margo',
       'Sadie', 'Hank', 'Tycho', 'Stephan', 'Indie', 'Winnie', 'Bentley',
       'Ken', 'Max', 'Maddie', 'Pipsy', 'Monty', 'Sojourner', 'Odie',
       'Arlo', 'Sunny', 'Vincent', 'Lucy', 'Clark', 'Mookie', 'Meera',
       'Buddy', 'Ava', 'Rory', 'Eli', 'Ash', 'Tucker', 'Tobi', 'Chester',
       'Wilson', 'Sunshine', 'Lipton', 'Gabby', 'Bronte', 'Poppy',
       'Rhino', 'Willow', 'Orion', 'Eevee', 'Smiley', 'Logan', 'Moreton',
       'Klein', 'Miguel', 'Emanuel', 'Kuyu', 'Dutch', 'Pete', 'Scooter',
       'Reggie', 'Kyro', 'Samson', 'Loki', 'Mia', 'Malcolm', 'Dexter',
       'Alfie', 'Fiona', 'Mutt', 'Bear', 'Doobert', 'Beebop', 'Alexander',
       'Sailer', 'Brutus', 'Kona', 'Boots', 'Ralphie', 'Phil', 'Cupid',
       'Pawnd', 'Pilot', 'Ike', 'Mo', 'Toby', 'Sweet', 'Pablo', 'Nala',
       'Balto', 'Crawford', 'Gabe', 'Mattie', 'Jimison', 'Hercules',
       'Duchess', 'Harlso', 'Sampson', 'Sundance', 'Luca', 'Flash',
       'Finn', 'Peaches', 'Howie', 'Jazzy', 'Anna', 'Bo', 'Seamus',
       'Wafer', 'Chelsea', 'Tom', 'Moose', 'Florence', 'Autumn', 'Dido',
       'Eugene', 'Herschel', 'Strudel', 'Tebow', 'Chloe', 'Betty',
       'Timber', 'Binky', 'Dudley', 'Comet', 'Larry', 'Levi', 'Akumi',
       'Titan', 'Olivia', 'Alf', 'Oshie', 'Bruce', 'Chubbs', 'Sky',
       'Atlas', 'Eleanor', 'Layla', 'Rocky', 'Baron', 'Tyr', 'Bauer',
       'Swagger', 'Brandi', 'Mary', 'Moe', 'Halo', 'Augie', 'Craig',
       'Sam', 'Hunter', 'Pavlov', 'Maximus', 'Wallace', 'Ito', 'Milo',
       'Ollie', 'Cali', 'Lennon', 'Major', 'Duke', 'Reginald', 'Sansa',
       'Shooter', 'Django', 'Diogi', 'Sonny', 'Philbert', 'Marley',
       'Severus', 'Ronnie', 'Anakin', 'Bones', 'Mauve', 'Chef', 'Doc',
       'Sobe', 'Longfellow', 'Mister', 'Iroh', 'Baloo', 'Stubert',
       'Paull', 'Timison', 'Davey', 'Pancake', 'Tyrone', 'Snicku', 'Ruby',
       'Brody', 'Rizzy', 'Mack', 'Butter', 'Nimbus', 'Laika', 'Dobby',
       'Juno', 'Maude', 'Lily', 'Newt', 'Benji', 'Nida', 'Robin',
       'Monster', 'BeBe', 'Remus', 'Mabel', 'Misty', 'Happy', 'Mosby',
       'Maggie', 'Leela', 'Ralphy', 'Brownie', 'Meyer', 'Stella', 'Frank',
       'Tonks', 'Lincoln', 'Oakley', 'Dale', 'Rizzo', 'Arnie', 'Pinot',
       'Dallas', 'Hero', 'Frankie', 'Stormy', 'Mairi', 'Loomis', 'Godi',
       'Kenny', 'Deacon', 'Timmy', 'Harper', 'Chipson', 'Combo', 'Dash',
       'Bell', 'Hurley', 'Jay', 'Mya', 'Strider', 'Wesley', 'Solomon',
       'Huck', 'O', 'Blue', 'Finley', 'Sprinkles', 'Heinrich',
       'Shakespeare', 'Fizz', 'Chip', 'Grey', 'Roosevelt', 'Gromit',
       'Willem', 'Dakota', 'Dixie', 'Al', 'Jackson', 'Carbon', 'DonDon',
       'Kirby', 'Lou', 'Nollie', 'Chevy', 'Tito', 'Louie', 'Rupert',
       'Rufus', 'Brudge', 'Shadoe', 'Colby', 'Angel', 'Brat', 'Tove',
       'Aubie', 'Kota', 'Eve', 'Glenn', 'Shelby', 'Sephie', 'Bonaparte',
       'Albert', 'Wishes', 'Rose', 'Theo', 'Rocco', 'Fido', 'Emma',
       'Spencer', 'Lilli', 'Boston', 'Brandonald', 'Corey', 'Leonard',
       'Chompsky', 'Beckham', 'Devón', 'Gert', 'Watson', 'Rubio', 'Keith',
       'Dex', 'Carly', 'Ace', 'Tayzie', 'Grizzie', 'Fred', 'Gilbert',
       'Zoe', 'Stewie', 'Calvin', 'Lilah', 'Spanky', 'Jameson', 'Piper',
       'Atticus', 'Blu', 'Dietrich', 'Divine', 'Tripp', 'Cora', 'Huxley',
       'Keurig', 'Bookstore', 'Linus', 'Abby', 'Shaggy', 'Shiloh',
       'Gustav', 'Arlen', 'Percy', 'Lenox', 'Sugar', 'Harvey', 'Blanket',
       'Geno', 'Stark', 'Beya', 'Kilo', 'Kayla', 'Maxaroni', 'Doug',
       'Edmund', 'Aqua', 'Theodore', 'Chase', 'Rorie', 'Simba', 'Charles',
       'Bayley', 'Axel', 'Storkson', 'Remy', 'Chadrick', 'Kellogg',
       'Buckley', 'Livvie', 'Terry', 'Hermione', 'Ralpher', 'Aldrick',
       'Rooney', 'Crystal', 'Ziva', 'Stefan', 'Pupcasso', 'Puff',
       'Flurpson', 'Coleman', 'Enchilada', 'Raymond', 'Rueben',
       'Cilantro', 'Karll', 'Sprout', 'Blitz', 'Bloop', 'Lillie',
       'Ashleigh', 'Kreggory', 'Sarge', 'Luther', 'Ivar', 'Jangle',
       'Schnitzel', 'Panda', 'Berkeley', 'Ralphé', 'Charleson', 'Clyde',
       'Harnold', 'Sid', 'Pippa', 'Otis', 'Carper', 'Bowie',
       'Alexanderson', 'Suki', 'Barclay', 'Skittle', 'Ebby', 'Flávio',
       'Smokey', 'Link', 'Jennifur', 'Ozzy', 'Bluebert', 'Stephanus',
       'Bubbles', 'Zeus', 'Bertson', 'Nico', 'Michelangelope', 'Siba',
       'Calbert', 'Curtis', 'Travis', 'Thumas', 'Kanu', 'Lance', 'Opie',
       'Kane', 'Olive', 'Chuckles', 'Staniel', 'Sora', 'Beemo', 'Gunner',
       'Lacy', 'Tater', 'Olaf', 'Cecil', 'Vince', 'Karma', 'Billy',
       'Walker', 'Rodney', 'Klevin', 'Malikai', 'Bobble', 'River',
       'Jebberson', 'Remington', 'Farfle', 'Jiminus', 'Clarkus',
       'Finnegus', 'Cupcake', 'Kathmandu', 'Ellie', 'Katie', 'Kara',
       'Adele', 'Zara', 'Ambrose', 'Jimothy', 'Bode', 'Terrenth', 'Reese',
       'Chesterson', 'Lucia', 'Bisquick', 'Ralphson', 'Socks', 'Rambo',
       'Rudy', 'Fiji', 'Rilo', 'Bilbo', 'Coopson', 'Yoda', 'Millie',
       'Chet', 'Crouton', 'Daniel', 'Kaia', 'Murphy', 'Dotsy', 'Eazy',
       'Coops', 'Fillup', 'Miley', 'Charl', 'Reagan', 'Yukon', 'CeCe',
       'Cuddles', 'Claude', 'Jessiga', 'Carter', 'Ole', 'Pherb',
       'Blipson', 'Reptar', 'Trevith', 'Berb', 'Bob', 'Colin', 'Brian',
       'Oliviér', 'Grady', 'Kobe', 'Freddery', 'Bodie', 'Dunkin', 'Wally',
       'Tupawc', 'Amber', 'Edgar', 'Teddy', 'Kingsley', 'Brockly',
       'Richie', 'Molly', 'Vinscent', 'Cedrick', 'Hazel', 'Lolo', 'Eriq',
       'Phred', 'Oddie', 'Maxwell', 'Geoff', 'Covach', 'Durg', 'Fynn',
       'Ricky', 'Herald', 'Lucky', 'Ferg', 'Trip', 'Clarence', 'Hamrick',
       'Brad', 'Pubert', 'Frönq', 'Derby', 'Lizzie', 'Ember', 'Blakely',
       'Opal', 'Marq', 'Kramer', 'Barry', 'Gordon', 'Baxter', 'Mona',
       'Horace', 'Crimson', 'Birf', 'Hammond', 'Lorelei', 'Marty',
       'Brooks', 'Petrick', 'Hubertson', 'Gerbald', 'Oreo', 'Bruiser',
       'Perry', 'Bobby', 'Jeph', 'Obi', 'Tino', 'Kulet', 'Sweets', 'Lupe',
       'Tiger', 'Jiminy', 'Griffin', 'Banjo', 'Brandy', 'Lulu', 'Darrel',
       'Taco', 'Joey', 'Patrick', 'Kreg', 'Todo', 'Tess', 'Ulysses',
       'Toffee', 'Apollo', 'Asher', 'Glacier', 'Chuck', 'Champ', 'Ozzie',
       'Griswold', 'Cheesy', 'Moofasa', 'Hector', 'Goliath', 'Kawhi',
       'Emmie', 'Penelope', 'Willie', 'Rinna', 'Mike', 'William',
       'Dwight', 'Evy', 'Rascal', 'Linda', 'Tug', 'Tango', 'Grizz',
       'Jerome', 'Crumpet', 'Jessifer', 'Izzy', 'Ralph', 'Sandy',
       'Humphrey', 'Tassy', 'Juckson', 'Chuq', 'Tyrus', 'Karl',
       'Godzilla', 'Vinnie', 'Kenneth', 'Herm', 'Bert', 'Striker',
       'Donny', 'Pepper', 'Bernie', 'Buddah', 'Lenny', 'Arnold', 'Zuzu',
       'Mollie', 'Laela', 'Tedders', 'Superpup', 'Rufio', 'Jeb', 'Rodman',
       'Jonah', 'Chesney', 'Henry', 'Bobbay', 'Mitch', 'Kaiya', 'Acro',
       'Aiden', 'Obie', 'Dot', 'Shnuggles', 'Kendall', 'Jeffri', 'Steve',
       'Mac', 'Fletcher', 'Kenzie', 'Pumpkin', 'Schnozz', 'Gustaf',
       'Cheryl', 'Ed', 'Leonidas', 'Norman', 'Caryl', 'Scott', 'Taz',
       'Darby', 'Jackie', 'Jazz', 'Franq', 'Pippin', 'Rolf', 'Snickers',
       'Ridley', 'Cal', 'Bradley', 'Bubba', 'Tuco', 'Patch', 'Mojo',
       'Batdog', 'Dylan', 'Mark', 'JD', 'Alejandro', 'Scruffers', 'Pip',
       'Julius', 'Tanner', 'Sparky', 'Anthony', 'Holly', 'Jett', 'Amy',
       'Sage', 'Andy', 'Mason', 'Trigger', 'Antony', 'Creg', 'Traviss',
       'Gin', 'Jeffrie', 'Danny', 'Ester', 'Pluto', 'Bloo', 'Edd',
       'Willy', 'Herb', 'Damon', 'Peanut', 'Nigel', 'Butters', 'Sandra',
       'Fabio', 'Randall', 'Liam', 'Tommy', 'Ben', 'Raphael', 'Julio',
       'Andru', 'Kloey', 'Shawwn', 'Skye', 'Kollin', 'Ronduh', 'Billl',
       'Saydee', 'Dug', 'Tessa', 'Sully', 'Kirk', 'Ralf', 'Clarq',
       'Jaspers', 'Samsom', 'Harrison', 'Chaz', 'Jeremy', 'Jaycob',
       'Lambeau', 'Ruffles', 'Amélie', 'Bobb', 'Banditt', 'Kevon',
       'Winifred', 'Hanz', 'Churlie', 'Zeek', 'Timofy', 'Maks',
       'Jomathan', 'Kallie', 'Marvin', 'Spark', 'Gòrdón', 'Jo', 'DayZ',
       'Jareld', 'Torque', 'Ron', 'Skittles', 'Cleopatricia', 'Erik',
       'Stu', 'Tedrick', 'Filup', 'Kial', 'Naphaniel', 'Dook', 'Hall',
       'Philippe', 'Biden', 'Fwed', 'Genevieve', 'Joshwa', 'Bradlay',
       'Clybe', 'Keet', 'Carll', 'Jockson', 'Josep', 'Lugan',
       'Christoper'], dtype=object)

Define

Change in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to uint64 and use 0 as not-available

Change timestamp and retweeted_status_timestamp to datetimes

Code

Convert data types to integer but ignore null values

In [140]:
def convert_to_uint64(df, column):
    # Start with 0 everywhere so missing values become 0 ("not available")
    tmp = pd.Series(0, index=df.index, dtype="uint64")
    non_na_mask = ~df[column].isna()
    tmp[non_na_mask] = df[column][non_na_mask].astype("uint64")
    return tmp
In [141]:
for column in ["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id"]:
    df_archive_clean[column] = convert_to_uint64(df_archive_clean, column)
In [142]:
df_archive_clean.timestamp = pd.to_datetime(df_archive_clean.timestamp)
df_archive_clean.retweeted_status_timestamp = pd.to_datetime(df_archive_clean.retweeted_status_timestamp)

Test

In [143]:
# Check data types are now correct
df_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2247 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2247 non-null int64
in_reply_to_status_id         2247 non-null uint64
in_reply_to_user_id           2247 non-null uint64
timestamp                     2247 non-null datetime64[ns, UTC]
source                        2247 non-null category
text                          2247 non-null object
retweeted_status_id           2247 non-null uint64
retweeted_status_user_id      2247 non-null uint64
retweeted_status_timestamp    176 non-null datetime64[ns, UTC]
expanded_urls                 2188 non-null object
rating_numerator              2247 non-null int64
rating_denominator            2247 non-null int64
name                          2247 non-null object
doggo                         2247 non-null object
floofer                       2247 non-null object
pupper                        2247 non-null object
puppo                         2247 non-null object
dtypes: category(1), datetime64[ns, UTC](2), int64(3), object(7), uint64(4)
memory usage: 380.8+ KB

Define

Use tweet_id to get expanded_urls

Code

If expanded_urls is null, reconstruct it from the tweet id

In [144]:
df_archive_clean['expanded_urls'].isna().sum()
Out[144]:
59
In [145]:
def add_urls(row):
    # Don't do anything if expanded_urls is not null
    if pd.notnull(row['expanded_urls']):
        return row
    else:
        # Get tweet_id
        tweet_id = row['tweet_id']
        # Make new URL and save into expanded_urls
        row['expanded_urls'] = 'https://twitter.com/dog_rates/status/{}'.format(tweet_id)
        # return updated row
        return row


# Save dataframe with missing urls
df_archive_clean = df_archive_clean.apply(add_urls, axis=1)

Test

Check that there are no empty expanded_urls

In [146]:
print(df_archive_clean[df_archive_clean['expanded_urls'].isnull()])
print(df_archive_clean['expanded_urls'].isna().sum())
Empty DataFrame
Columns: [tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source, text, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls, rating_numerator, rating_denominator, name, doggo, floofer, pupper, puppo]
Index: []
0
In [147]:
df_archive_clean['expanded_urls']
Out[147]:
0       https://twitter.com/dog_rates/status/892420643...
1       https://twitter.com/dog_rates/status/892177421...
2       https://twitter.com/dog_rates/status/891815181...
3       https://twitter.com/dog_rates/status/891689557...
4       https://twitter.com/dog_rates/status/891327558...
                              ...                        
2342    https://twitter.com/dog_rates/status/666082916...
2343    https://twitter.com/dog_rates/status/666073100...
2344    https://twitter.com/dog_rates/status/666071193...
2351    https://twitter.com/dog_rates/status/666049248...
2355    https://twitter.com/dog_rates/status/666020888...
Name: expanded_urls, Length: 2247, dtype: object

Define

Quality

df_archive_clean: contains retweets and therefore, duplicates

Define

Keep only those rows in df_archive_clean table that are original tweets and NOT retweets (i.e. retweeted_status_id column is null). Delete the rest.

In [148]:
# Note: 0 represents missing ids

len(df_archive_clean[df_archive_clean.retweeted_status_id != 0])
Out[148]:
176

Code

In [149]:
df_archive_clean = df_archive_clean[df_archive_clean.retweeted_status_id==0]

Test

In [150]:
len(df_archive_clean[df_archive_clean.retweeted_status_id != 0])
Out[150]:
0

df_archive_clean: 343rd entry is not a dog rating

Define

Remove the 343rd entry from the dataframe

Code

In [151]:
# Drop 343rd entry (index=342 since starts at zero)
df_archive_clean = df_archive_clean.drop(df_archive_clean.index[342])

Test

In [152]:
# Check entries around the dropped row
df_archive_clean.iloc[340:343]
Out[152]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
410 823322678127919110 0 0 2017-01-23 00:13:17+00:00 Twitter for iPhone This is Wyatt. He's got the fastest paws in th... 0 0 NaT https://twitter.com/dog_rates/status/823322678... 11 10 Wyatt None None None None
412 822975315408461824 0 0 2017-01-22 01:12:59+00:00 Twitter for iPhone This is Albus. He's soaked as h*ck. Seems to h... 0 0 NaT https://twitter.com/dog_rates/status/822975315... 12 10 Albus None None None None
414 822859134160621569 0 0 2017-01-21 17:31:20+00:00 Twitter for iPhone This is Hobbes. He was told he was going to th... 0 0 NaT https://twitter.com/dog_rates/status/822859134... 12 10 Hobbes None None None None
df_archive_clean: some entries should be classified as puppers (missing data)

Define

Find when text has dog stages in it and add it to the corresponding column

In [153]:
for col in ['pupper', 'doggo', 'puppo', 'floofer']:
    print(col)
    print(df_archive_clean[col].value_counts())
pupper
None      1848
pupper     222
Name: pupper, dtype: int64
doggo
None     1986
doggo      84
Name: doggo, dtype: int64
puppo
None     2046
puppo      24
Name: puppo, dtype: int64
floofer
None       2060
floofer      10
Name: floofer, dtype: int64

Code

In [160]:
# Find when a dog stage is used in the text column
def extend_dog_stages(row):
    # Set the matching column whenever a stage word occurs in the tweet text
    for dog_stage in ['pupper', 'doggo', 'puppo', 'floofer']:
        if dog_stage in row['text']:
            row[dog_stage] = dog_stage

    # return row whether or not it was updated
    return row

# Note: the function must still be applied row-wise to take effect, e.g.
# df_archive_clean = df_archive_clean.apply(extend_dog_stages, axis=1)

Test

In [161]:
for col in ['pupper', 'doggo', 'puppo', 'floofer']:
    print(col)
    print(df_archive_clean[col].value_counts())
pupper
None      1848
pupper     222
Name: pupper, dtype: int64
doggo
None     1986
doggo      84
Name: doggo, dtype: int64
puppo
None     2046
puppo      24
Name: puppo, dtype: int64
floofer
None       2060
floofer      10
Name: floofer, dtype: int64
df_archive_clean: some entries are retweets

Define

Find entries that are retweets by matching the text pattern 'RT @...' and drop them from the table

Code

In [162]:
# Find all the retweets 
retweets = df_archive_clean.text.str.match('^RT @')

# Only keep the tweets that aren't retweets
df_archive_clean = df_archive_clean[~retweets]

Test

In [163]:
# Test that we have no retweets
retweets = df_archive_clean.text.str.match('^RT @')
df_archive_clean[retweets]
Out[163]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

df_archive_clean: some ratings are decimal

Define

Get the decimal ratings by matching the text pattern "#.#/#" and save as new rating (only denominators had decimals)

In [164]:
print(list(df_archive_clean['rating_numerator'].unique()))
[13, 12, 14, 5, 17, 11, 10, 420, 666, 6, 182, 15, 960, 0, 7, 84, 24, 75, 27, 3, 8, 9, 4, 165, 1776, 50, 99, 80, 45, 44, 1, 143, 121, 20, 26, 2, 144, 88]

Code

In [165]:
# Apply a function that redoes the rating if a decimal numerator is found
import re

def get_decimal_rating(row):
    # If the text contains a decimal numerator (e.g. "13.5/10"), use it
    decimal_numerator_pattern = r'(\d+\.\d+)/(\d+)'
    matches = re.search(decimal_numerator_pattern, row['text'])
    if matches is not None:
        row['rating_numerator'] = float(matches.group(1))
        row['rating_denominator'] = int(matches.group(2))

    return row


df_archive_clean = df_archive_clean.apply(get_decimal_rating, axis=1)

Test

In [166]:
print(list(df_archive_clean['rating_numerator'].unique()))
# 13.5, 11.27, 11.26 are now present
[13.0, 12.0, 14.0, 13.5, 17.0, 11.0, 10.0, 420.0, 666.0, 6.0, 182.0, 15.0, 960.0, 0.0, 7.0, 84.0, 24.0, 9.75, 5.0, 11.27, 3.0, 8.0, 9.0, 4.0, 165.0, 1776.0, 50.0, 99.0, 80.0, 45.0, 44.0, 1.0, 143.0, 121.0, 20.0, 9.5, 11.26, 2.0, 144.0, 88.0]

Define

Most ratings are expected to have a denominator of 10 and a numerator around 0...20.

We take a look at potential outliers and manually remove incorrect tweets.

Code

In [167]:
df_archive.rating_numerator.value_counts()
Out[167]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
In [168]:
df_archive.rating_denominator.value_counts()
Out[168]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [169]:
pd.set_option('display.max_colwidth', -1)

rating_ratio = df_archive_clean.rating_numerator / df_archive_clean.rating_denominator
mask = (rating_ratio < 0) | (rating_ratio > 2.0)

df_archive_clean[['rating_numerator', 'rating_denominator', 'text']][mask]
Out[169]:
rating_numerator rating_denominator text
188 420.0 10 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
189 666.0 10 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
290 182.0 10 @markhoppus 182/10
313 960.0 0 @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
516 24.0 7 Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx
979 1776.0 10 This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
2074 420.0 10 After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY
In [170]:
# Drop rows 313 and 516: "960/00" and "24/7" are not valid ratings

df_archive_clean = df_archive_clean.drop([313, 516])

Test

In [171]:
pd.set_option('display.max_colwidth', -1)

rating_ratio = df_archive_clean.rating_numerator / df_archive_clean.rating_denominator
mask = (rating_ratio < 0) | (rating_ratio > 2.0)

df_archive_clean[['rating_numerator', 'rating_denominator', 'text']][mask]
Out[171]:
rating_numerator rating_denominator text
188 420.0 10 @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
189 666.0 10 @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
290 182.0 10 @markhoppus 182/10
979 1776.0 10 This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
2074 420.0 10 After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY

Define

Keep only those records in the df_archive_clean table whose tweet_id also exists in the predict_img_df table

Code

In [172]:
df_archive_clean = df_archive_clean[df_archive_clean.tweet_id.isin(predict_img_df.tweet_id)]

Test

In [173]:
len(df_archive_clean[~df_archive_clean.tweet_id.isin(predict_img_df.tweet_id)])
Out[173]:
0
In [174]:
df_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1894 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      1894 non-null int64
in_reply_to_status_id         1894 non-null int64
in_reply_to_user_id           1894 non-null int64
timestamp                     1894 non-null datetime64[ns, UTC]
source                        1894 non-null object
text                          1894 non-null object
retweeted_status_id           1894 non-null int64
retweeted_status_user_id      1894 non-null int64
retweeted_status_timestamp    0 non-null datetime64[ns]
expanded_urls                 1894 non-null object
rating_numerator              1894 non-null float64
rating_denominator            1894 non-null int64
name                          1894 non-null object
doggo                         1894 non-null object
floofer                       1894 non-null object
pupper                        1894 non-null object
puppo                         1894 non-null object
dtypes: datetime64[ns, UTC](1), datetime64[ns](1), float64(1), int64(6), object(8)
memory usage: 266.3+ KB

Define

Drop retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns from df_archive_clean table

In [175]:
df_archive_clean.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)

Test

In [176]:
df_archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1894 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 1894 non-null int64
in_reply_to_status_id    1894 non-null int64
in_reply_to_user_id      1894 non-null int64
timestamp                1894 non-null datetime64[ns, UTC]
source                   1894 non-null object
text                     1894 non-null object
expanded_urls            1894 non-null object
rating_numerator         1894 non-null float64
rating_denominator       1894 non-null int64
name                     1894 non-null object
doggo                    1894 non-null object
floofer                  1894 non-null object
pupper                   1894 non-null object
puppo                    1894 non-null object
dtypes: datetime64[ns, UTC](1), float64(1), int64(4), object(8)
memory usage: 222.0+ KB
df_archive_clean: the text column contains the full, untruncated tweet text rather than a display-ready version

Quality

One variable in four columns (doggo, floofer, pupper, and puppo)

Define

The doggo, floofer, pupper and puppo columns in the df_archive_clean table should be merged into one column named dog_stages

Code

Number of records where both the doggo and floofer columns are not 'None'

In [177]:
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.floofer != 'None')])
Out[177]:
1

Number of records where both the doggo and pupper columns are not 'None'

In [178]:
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.pupper != 'None')])
Out[178]:
8

Number of records where both the doggo and puppo columns are not 'None'

In [179]:
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.puppo != 'None')])
Out[179]:
1

Number of records where both the floofer and pupper columns are not 'None'

In [180]:
len(df_archive_clean[(df_archive_clean.floofer != 'None') & (df_archive_clean.pupper != 'None')])
Out[180]:
0

Number of records where both the floofer and puppo columns are not 'None'

In [181]:
len(df_archive_clean[(df_archive_clean.floofer != 'None') & (df_archive_clean.puppo != 'None')])
Out[181]:
0

Number of records where both the pupper and puppo columns are not 'None'

In [182]:
len(df_archive_clean[(df_archive_clean.pupper != 'None') & (df_archive_clean.puppo != 'None')])
Out[182]:
0

Merging the dog stages into one column

In [183]:
# Merge the dog stages into a single column
df_tmp = df_archive_clean.copy()
df_tmp['dog_stages'] = None
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    mask = df_tmp[col] == col
    # .loc avoids chained-assignment warnings
    df_tmp.loc[mask, 'dog_stages'] = col
In [184]:
mask_doggo_floofer = (df_archive_clean.doggo != 'None') & (df_archive_clean.floofer != 'None')
df_tmp.loc[mask_doggo_floofer, 'dog_stages'] = 'doggo|floofer'

mask_doggo_pupper = (df_archive_clean.doggo != 'None') & (df_archive_clean.pupper != 'None')
df_tmp.loc[mask_doggo_pupper, 'dog_stages'] = 'doggo|pupper'

mask_doggo_puppo = (df_archive_clean.doggo != 'None') & (df_archive_clean.puppo != 'None')
df_tmp.loc[mask_doggo_puppo, 'dog_stages'] = 'doggo|puppo'

df_tmp['dog_stages'].value_counts()

df_archive_clean = df_tmp
In [185]:
# drop old columns
df_archive_clean = df_archive_clean.drop(['doggo', 'pupper', 'puppo', 'floofer'], axis=1)

Test

In [186]:
df_archive_clean['dog_stages'].value_counts()
Out[186]:
pupper           194
doggo            61 
puppo            21 
doggo|pupper     8  
floofer          7  
doggo|floofer    1  
doggo|puppo      1  
Name: dog_stages, dtype: int64
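An equivalent, more compact way to build the combined column is to join the non-'None' stage values per row. A minimal sketch on toy data (the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the four stage columns, with 'None' marking absence
df = pd.DataFrame({
    'doggo':   ['doggo', 'None',   'doggo'],
    'floofer': ['None',  'None',   'None'],
    'pupper':  ['None',  'pupper', 'pupper'],
    'puppo':   ['None',  'None',   'None'],
})
stages = ['doggo', 'floofer', 'pupper', 'puppo']
# Join all present stages with '|'; rows with no stage become NaN
df['dog_stages'] = df[stages].apply(
    lambda row: '|'.join(s for s in row if s != 'None') or np.nan, axis=1)
# rows become 'doggo', 'pupper', 'doggo|pupper'
```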

Define

Replace the value 'None' with NaN (missing value)

Code

In [187]:
#replace the value 'None' with NaN (missing value)
df_archive_clean =  df_archive_clean.replace('None', np.nan)

Test

In [188]:
df_archive_clean.isnull().sum()
Out[188]:
tweet_id                 0   
in_reply_to_status_id    0   
in_reply_to_user_id      0   
timestamp                0   
source                   0   
text                     0   
expanded_urls            0   
rating_numerator         0   
rating_denominator       0   
name                     545 
dog_stages               1601
dtype: int64

Quality

'None' previously passed as a dog name but is not a real name. Drop tweets with missing names (previously 'None', now NaN).

Code

In [189]:
mask = ~df_archive_clean['name'].isna()
df_archive_clean = df_archive_clean[mask]

Test

In [190]:
df_archive_clean.isnull().sum()
Out[190]:
tweet_id                 0   
in_reply_to_status_id    0   
in_reply_to_user_id      0   
timestamp                0   
source                   0   
text                     0   
expanded_urls            0   
rating_numerator         0   
rating_denominator       0   
name                     0   
dog_stages               1172
dtype: int64

predict_img_df

p1, p2, p3 inconsistent capitalization (sometimes first letter is capital)

Define

Making all strings in p1, p2, and p3 lowercase

Code

In [191]:
predict_img_df['p1'] = predict_img_df['p1'].str.lower()
predict_img_df['p2'] = predict_img_df['p2'].str.lower()
predict_img_df['p3'] = predict_img_df['p3'].str.lower()

Test

In [192]:
predict_img_df.head()
Out[192]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 welsh_springer_spaniel 0.465074 True collie 0.156665 True shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 german_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True rottweiler 0.243682 True doberman 0.154629 True
combine archive and predict_img dataframes

Define

Create a new dataframe by merging the image predictions into a copy of the Twitter archive, keeping only predictions whose tweet_id is already present in the archive

Code

In [193]:
# Only keep predictions that have ids in archive 
# (final size is as large as archive) 
df1 = df_archive_clean
df2 = predict_img_df_clean

final_df_clean  = pd.merge(df1, df2, how='left', on=['tweet_id'])
In [194]:
final_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 22 columns):
tweet_id                 1349 non-null int64
in_reply_to_status_id    1349 non-null int64
in_reply_to_user_id      1349 non-null int64
timestamp                1349 non-null datetime64[ns, UTC]
source                   1349 non-null object
text                     1349 non-null object
expanded_urls            1349 non-null object
rating_numerator         1349 non-null float64
rating_denominator       1349 non-null int64
name                     1349 non-null object
dog_stages               177 non-null object
jpg_url                  1349 non-null object
img_num                  1349 non-null int64
p1                       1349 non-null object
p1_conf                  1349 non-null float64
p1_dog                   1349 non-null bool
p2                       1349 non-null object
p2_conf                  1349 non-null float64
p2_dog                   1349 non-null bool
p3                       1349 non-null object
p3_conf                  1349 non-null float64
p3_dog                   1349 non-null bool
dtypes: bool(3), datetime64[ns, UTC](1), float64(4), int64(5), object(9)
memory usage: 214.7+ KB
Test
In [195]:
# If tweet id (from image predictions) not in archive, 
# then it isn't in final merged dataframe
tweet_ids = predict_img_df_clean.tweet_id

for tweet_id in tweet_ids:
    # Check whether the id is absent from the archive
    if tweet_id not in df_archive_clean.tweet_id.values:
        # It should then also be absent from the merged dataframe
        if tweet_id in final_df_clean.tweet_id.values:
            print('Paradox! ID#{} in final DF but shouldn\'t be'.format(tweet_id))
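The left-merge behaviour being tested can be illustrated on toy frames (all values invented):

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
preds = pd.DataFrame({'tweet_id': [2, 3, 4], 'p1': ['pug', 'chow', 'collie']})

# A left merge keeps every archive row; tweet_id 4, present only
# in the predictions, is dropped
merged = pd.merge(archive, preds, how='left', on='tweet_id')
```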

status_df (Twitter API)

No empty cells
In [196]:
status_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
tweet_id              2354 non-null int64
retweet_count         2354 non-null int64
favorite_count        2354 non-null int64
display_text_range    2354 non-null object
dtypes: int64(3), object(1)
memory usage: 73.7+ KB

Define

Merge data from tweet info into a final dataframe if the tweet info's tweet is already found in the final dataframe (twitter archive)

Note that 9 tweets are missing because they have since been deleted.

Code

In [197]:
# Only keep predictions that have ids in archive 
# (final size is as large as archive) 
df1 = final_df_clean
df2 = status_df_clean
df2.rename(columns={'id':'tweet_id'}, inplace=True)

final_df_clean = pd.merge(df1, df2, how='left', on=['tweet_id'])

Test

In [198]:
# If tweet id (from tweet info) not in archive, 
# then it isn't in final merged dataframe
tweet_ids = status_df_clean.tweet_id

for tweet_id in tweet_ids:
    # Check whether the id is absent from the archive
    if tweet_id not in df_archive_clean.tweet_id.values:
        # It should then also be absent from the merged dataframe
        if tweet_id in final_df_clean.tweet_id.values:
            print('ID#{} in final DF but shouldn\'t be'.format(tweet_id))
            
final_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1349 entries, 0 to 1348
Data columns (total 25 columns):
tweet_id                 1349 non-null int64
in_reply_to_status_id    1349 non-null int64
in_reply_to_user_id      1349 non-null int64
timestamp                1349 non-null datetime64[ns, UTC]
source                   1349 non-null object
text                     1349 non-null object
expanded_urls            1349 non-null object
rating_numerator         1349 non-null float64
rating_denominator       1349 non-null int64
name                     1349 non-null object
dog_stages               177 non-null object
jpg_url                  1349 non-null object
img_num                  1349 non-null int64
p1                       1349 non-null object
p1_conf                  1349 non-null float64
p1_dog                   1349 non-null bool
p2                       1349 non-null object
p2_conf                  1349 non-null float64
p2_dog                   1349 non-null bool
p3                       1349 non-null object
p3_conf                  1349 non-null float64
p3_dog                   1349 non-null bool
retweet_count            1349 non-null int64
favorite_count           1349 non-null int64
display_text_range       1349 non-null object
dtypes: bool(3), datetime64[ns, UTC](1), float64(4), int64(7), object(10)
memory usage: 246.4+ KB


Storing Data

Storing the final, cleaned dataframe in the CSV file twitter_archive_master.csv

In [200]:
final_df_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)

Analyzing Data

In [201]:
final_df_clean = pd.read_csv('twitter_archive_master.csv')

Most used Twitter source

In [202]:
final_df_clean['source'].value_counts()
Out[202]:
Twitter for iPhone    1324
Twitter Web Client    16  
TweetDeck             9   
Name: source, dtype: int64

WeRateDogs posted about 98% of these tweets from Twitter for iPhone.
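The 98% figure follows from the counts above; a quick arithmetic check (counts copied from the value_counts output):

```python
# Counts from the value_counts() output above
counts = {'Twitter for iPhone': 1324, 'Twitter Web Client': 16, 'TweetDeck': 9}
total = sum(counts.values())
iphone_pct = 100 * counts['Twitter for iPhone'] / total
# total is 1349, iphone_pct is about 98.1
```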

Analysis of rating of dogs

In [203]:
final_df_clean['rating_numerator'].value_counts().sort_index()
Out[203]:
2.00       3  
3.00       10 
4.00       5  
5.00       13 
6.00       17 
7.00       33 
8.00       70 
9.00       100
9.75       1  
10.00      273
11.00      301
11.27      1  
12.00      326
13.00      176
13.50      1  
14.00      17 
50.00      1  
1776.00    1  
Name: rating_numerator, dtype: int64
In [204]:
final_df_clean['rating_numerator'][final_df_clean['rating_numerator'] > 10].value_counts().sum()
Out[204]:
824

Out of the 1349 dogs rated by WeRateDogs in this dataset, 824 (almost 61%) were rated above 10/10. The single highest numerator, 1776, appears only once.
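The 61% figure can be verified the same way (counts taken from the analysis above):

```python
# Counts from the rating analysis above
rated_above_ten = 824
total_rated = 1349
above_ten_pct = 100 * rated_above_ten / total_rated
# about 61.1
```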

Most Common Dog Breeds

Motivation: Which are the most common breeds? Are some dog breeds more common in tweets?

In [241]:
# Get attributes before determining dog breed
attributes = ['retweet_count', 'favorite_count', 'rating_denominator', 'rating_numerator']
df_dog_breeds = final_df_clean[attributes].copy()

# Create rating column
def percent_rating(row):
    if row['rating_denominator'] == 0:
        return 0
    return row['rating_numerator']/row['rating_denominator']


df_dog_breeds['rating'] = df_dog_breeds.apply(percent_rating, axis=1)
In [242]:
# For each entry, determine the most likely dog breed
def best_breed_match(row):

    # Defaults to compare against
    breed = 'not_dog'
    confidence = 0

    # The three (breed, is_dog, confidence) prediction triples
    dog_preds = [row[['p1', 'p1_dog', 'p1_conf']],
                 row[['p2', 'p2_dog', 'p2_conf']],
                 row[['p3', 'p3_dog', 'p3_conf']]]
    # Positional indices for easy reference
    index_breed, index_isDog, index_conf = 0, 1, 2

    for pred in dog_preds:
        # Keep the dog breed with the highest confidence seen so far
        if pred[index_isDog] and pred[index_conf] >= confidence:
            breed = pred[index_breed]
            confidence = pred[index_conf]

    row['breed'] = breed
    row['breed_conf'] = confidence
    return row


# Build the frame of prediction columns the function needs
df_possible_breeds = final_df_clean[['p1', 'p1_dog', 'p1_conf',
                                     'p2', 'p2_dog', 'p2_conf',
                                     'p3', 'p3_dog', 'p3_conf']].copy()
df_possible_breeds = df_possible_breeds.apply(best_breed_match, axis=1)

# Save info into dog breed dataframe
df_dog_breeds['breed'] = df_possible_breeds['breed']
df_dog_breeds['breed_conf'] = df_possible_breeds['breed_conf']
In [243]:
# show 30 most common dog breeds
df_dog_breeds.breed.value_counts()[:30] # 'not_dog' is most common so keep this out of plot
Out[243]:
not_dog                           163
golden_retriever                  104
Labrador_retriever                72 
Pembroke                          71 
Chihuahua                         67 
pug                               45 
toy_poodle                        39 
chow                              36 
Pomeranian                        28 
French_bulldog                    26 
Samoyed                           25 
malamute                          24 
Chesapeake_Bay_retriever          22 
cocker_spaniel                    20 
Siberian_husky                    19 
miniature_pinscher                19 
Shih-Tzu                          17 
Staffordshire_bullterrier         17 
Cardigan                          16 
beagle                            16 
German_shepherd                   15 
Eskimo_dog                        14 
Shetland_sheepdog                 14 
Old_English_sheepdog              13 
Pekinese                          12 
schipperke                        12 
basset                            12 
Italian_greyhound                 11 
American_Staffordshire_terrier    11 
Lakeland_terrier                  11 
Name: breed, dtype: int64
In [244]:
# Plot the 12 most common dog breeds in a bar chart
plt.figure(figsize=(15,10))
only_dogs = df_dog_breeds['breed'] != 'not_dog'
df_dog_breeds[only_dogs].breed.value_counts()[11::-1].plot(kind='barh')
Out[244]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb9186e788>

We first look at the 30 most common dog breeds. 163 tweets were classified as not being a dog, about 1.5 times the count of the most common real breed. It could be that many tweets in this archive really are not of dogs, but that seems unlikely; more plausibly, the image prediction model misclassified a fair number of dog images as non-dogs.

We then plotted the 12 most common (real) dog breeds in a bar chart. According to this data, the most common breed in the tweets was the golden retriever, with just over 100 tweets. The next four most common breeds, in descending order, were Labrador retriever, Pembroke (corgi), Chihuahua, and pug. The top three fit well with my general impression from browsing dog images online.

However, we must again be careful not to draw too many conclusions from this data, as we rely on the prediction code of the images to inform us about the breed of dog. It is possible that some breeds of dogs were more difficult to determine by the prediction code and should in fact be at the top of the list. In future surveys, the effectiveness of the image prediction data could be further tested to ensure that the data accurately reflects the different breeds of dogs.

In [245]:
final_df_clean['source'].value_counts().plot(kind='barh', figsize=(11,5), title='Most used Twitter source').set_xlabel("Number of Tweets")
plt.savefig('twitter_source')
In [246]:
final_df_clean.name.value_counts()[1:7].plot(kind='barh', figsize=(11,5), title='Top 6 common dog names').set_xlabel("Number of Dogs")
plt.savefig('dog_names')
In [247]:
final_df_clean.dog_stages.value_counts()
Out[247]:
pupper          117
doggo           36 
puppo           15 
floofer         5  
doggo|pupper    4  
Name: dog_stages, dtype: int64
In [248]:
dog_counts = final_df_clean.dog_stages.value_counts()

fig,ax = plt.subplots(figsize = (15,10))
ax.bar(dog_counts.index, dog_counts.values, width = 0.8)
ax.set_ylabel('Dog Count')
ax.set_xlabel('Category')
plt.title("Most Common Dog Category")
plt.show()
In [249]:
# lmplot manages its own figure, so a separate plt.figure call is unnecessary
sns.lmplot(x = 'rating_numerator', y = 'rating_denominator',
          data = final_df_clean,
          hue = 'source',
          palette ='Set2', fit_reg = False)
Out[249]:
<seaborn.axisgrid.FacetGrid at 0x1fb917763c8>

Insights

  • The most common dog stage is "pupper"
  • The most common breed, based on the image predictions, is the golden retriever
In [250]:
retweet_count = final_df_clean.retweet_count
print("The median and mean retweet count are: ", retweet_count.median(), "and ", retweet_count.mean(), "respectively." )

favorite_count = final_df_clean.favorite_count
print("The median and mean favorite count are: ", favorite_count.median(), "and ", favorite_count.mean(), "respectively." )

rating = df_dog_breeds.rating
print("The median and mean rating are: ", rating.median(), "and ", rating.mean(), "respectively." )
The median and mean retweet count are:  1541.0 and  2752.896219421794 respectively.
The median and mean favorite count are:  4878.0 and  9410.949592290586 respectively.
The median and mean rating are:  1.1 and  1.2018445986926345 respectively.
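In all three cases the mean sits well above the median, which indicates right-skewed distributions: a few viral tweets pull the mean up. A toy illustration with invented counts:

```python
import statistics

# Invented, right-skewed retweet counts with one viral outlier
counts = [500, 800, 1200, 1500, 2000, 30000]
med = statistics.median(counts)  # 1350.0
avg = statistics.mean(counts)    # 6000
# The single outlier lifts the mean far above the median
```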
In [ ]: