Live streaming tweets for hashtags

Streaming tweets for hashtags and keywords

This notebook explains how to search the live Twitter stream for certain hashtags and keywords. The gathered tweets are stored as text files and further processed into a pandas DataFrame.

The script is designed for long-term streaming, with the gathered tweets stored in a succession of files.

Note: This notebook is not directly runnable since personalized OAuth keys are required to access the Twitter API. See below for how to obtain them. Once you have OAuth keys, save them in a JSON file and point to this file in the script described in "Harvesting tweets".

Prerequisites

To read from the Twitter stream, we will use Tweepy. Currently (as of 22.11.2017) there is a bug in the Tweepy version available on PyPI which breaks the code for certain Twitter responses. A bug-fixed version of Tweepy is available in the source code repository: https://github.com/tweepy/tweepy

To install from source, clone the repository and install locally as explained in the readme file:

git clone https://github.com/tweepy/tweepy.git
cd tweepy
python setup.py install

To start using the Twitter API, we also need OAuth keys to grant us access rights.

To generate these, log in to your Twitter account and create a new application at https://apps.twitter.com with read and write access rights. You can find further information about Twitter's OAuth in the Twitter developer guide.

Once you have tweepy installed and the OAuth keys, you are ready to go.

Harvesting tweets

In [1]:
import json
import os
import tweepy

First, we need a function providing the OAuth keys obtained above. The example function below assumes that the OAuth keys are stored in a JSON file with the following format:

{
"consumer_key": "str_sequence",
"consumer_secret": "str_sequence",
"access_token": "str_sequence",
"access_secret": "str_sequence"
}
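Such a template file can also be written directly from Python — a minimal sketch; the path '/tmp/OAuthTwitter.json' matches the one used in the streaming cell further below, and the placeholder values must of course be replaced with your own keys:

```python
import json

# Placeholder OAuth credentials -- replace with your own keys.
oauth_template = {
    "consumer_key": "str_sequence",
    "consumer_secret": "str_sequence",
    "access_token": "str_sequence",
    "access_secret": "str_sequence",
}

# Write the template to the file the streaming cell will point at.
with open('/tmp/OAuthTwitter.json', 'w') as json_file:
    json.dump(oauth_template, json_file, indent=4)
```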
In [2]:
def get_oauth_from_file(oauth_file):
    """ Provides oauth keys stored in json file 'oauth_file'
    """
    
    with open(oauth_file) as json_file:
        keys = json.load(json_file)

    auth = tweepy.OAuthHandler(keys["consumer_key"], 
                               keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], 
                          keys["access_secret"])
    return auth

Next, we design a custom Twitter listener which saves the gathered tweets in a succession of files (for further information on how to design your own StreamListener, check the tweepy documentation on this topic).

In [3]:
class MyListener(tweepy.streaming.StreamListener):

    def __init__(self, storage_folder, base_file_name, nr_tweet_per_file=1000):
        """ Twitter listener storing tweets as json entries in a file

        Note
        ----
        The tweets are stored as a list of json entries in a file.
        Thus, every line in the resulting file is a valid json entry.

        Parameters
        ----------
        storage_folder: str
            Folder for storing the tweets. Will be created if not existing.
            Tweets will be appended to already existing files matching
            the base_file_name!
        base_file_name: str
            Basename of the files for storing tweets.
            E.g. 'tweets' would result in
                'tweets_0000000000.jsons'
                'tweets_0000000001.jsons'
                ...
        nr_tweet_per_file: int, optional
            How many tweets to store in one json file (default: 1000)
        """
        os.makedirs(storage_folder, exist_ok=True)
        self.store_base = os.path.join(storage_folder, base_file_name)
        self.nr_tweet_per_file = nr_tweet_per_file
        self.counter_tweets = 0
        self.counter_files = 0

    def on_data(self, data):
        try:
            # Count the incoming tweet and roll over to the next file
            # once the current one holds nr_tweet_per_file tweets.
            self.counter_tweets += 1
            if self.counter_tweets > self.nr_tweet_per_file:
                self.counter_files += 1
                self.counter_tweets = 1

            current_file = (
                    self.store_base +
                    '_' + str(self.counter_files).zfill(10) +
                    '.jsons')
            with open(current_file, 'a') as f:
                f.write(data)
                return True
        except Exception as e:
            # Do not catch BaseException here: it would swallow the
            # KeyboardInterrupt used below to stop the streaming.
            print("Error on_data: {}".format(str(e)))
        return True

    def on_error(self, status):
        print(status)
        return True
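The file naming and rollover scheme can be checked without touching the Twitter API. The sketch below replays the intended bookkeeping in plain Python: one new file every nr_tweet_per_file tweets, with nr_tweet_per_file=2 chosen arbitrarily small for the demonstration:

```python
def target_file(counter_files, store_base='/tmp/twitter_stream/tweets'):
    """Rebuild the file name the same way the listener does."""
    return store_base + '_' + str(counter_files).zfill(10) + '.jsons'

# Replay the rollover bookkeeping for 5 simulated tweets, 2 tweets per file.
counter_tweets, counter_files, nr_tweet_per_file = 0, 0, 2
files = []
for _ in range(5):
    counter_tweets += 1
    if counter_tweets > nr_tweet_per_file:
        counter_files += 1
        counter_tweets = 1
    files.append(target_file(counter_files))

print(files[0])  # /tmp/twitter_stream/tweets_0000000000.jsons
print(files[4])  # /tmp/twitter_stream/tweets_0000000002.jsons
```

Tweets 1-2 land in file 0, tweets 3-4 in file 1, and tweet 5 opens file 2.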

Next, we define the hashtags, keywords and usernames which we would like to stream (here an example for the COP22 and COP23 streaming):

In [4]:
twitter_tracking_list = [
    '@COP22', 'cop22', '#COP', '#cop22',
    '@COP23', 'cop23', '#cop23',
    '#climatechange', '#climateaction',
    '#ParisAgreement', '#globalwarming', '#beforetheflood',
    '#actonclimate', '#climate',
]

Now everything is set up and we can start streaming:

In [5]:
try:
    print('Started streaming tweets')
    while True:
        twitter_stream = tweepy.Stream(
            auth=get_oauth_from_file(oauth_file='/tmp/OAuthTwitter.json'),
            listener=MyListener(storage_folder='/tmp/twitter_stream',
                                base_file_name='tweets'))
        twitter_stream.filter(track=twitter_tracking_list)
except KeyboardInterrupt:
    print('Stopped streaming tweets')
Started streaming tweets
Stopped streaming tweets

To stop the streaming, interrupt the kernel:

- Jupyter notebook: Press the stop button or choose "Interrupt" from the "Kernel" menu
- Script: Press Control-C

The tweets are stored as json entries in the files in the specified folder. These can easily be read in Python with:

In [6]:
with open('/tmp/twitter_stream/tweets_0000000000.jsons', 'r') as twitter_jsons:
    tweets = [json.loads(line) for line in twitter_jsons]
In [7]:
tweets[:2]
Out[7]:
[{'contributors': None,
  'coordinates': None,
  'created_at': 'Fri Nov 10 17:46:32 +0000 2017',
  'entities': {'hashtags': [],
   'symbols': [],
   'urls': [{'display_url': 'blog.education.nationalgeographic.com/2017/11/09/cli…',
     'expanded_url': 'https://blog.education.nationalgeographic.com/2017/11/09/climate-action-250-schools-69-countries-focusing-on-climate-change',
     'indices': [84, 107],
     'url': 'https://t.co/gJRNyyAky1'}],
   'user_mentions': [{'id': 149593681,
     'id_str': '149593681',
     'indices': [3, 14],
     'name': 'Koen Timmers',
     'screen_name': 'zelfstudie'},
    {'id': 136441843,
     'id_str': '136441843',
     'indices': [108, 124],
     'name': 'NatGeo Education',
     'screen_name': 'NatGeoEducation'}]},
  'favorite_count': 0,
  'favorited': False,
  'filter_level': 'low',
  'geo': None,
  'id': 929042603479269377,
  'id_str': '929042603479269377',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'possibly_sensitive': False,
  'quote_count': 0,
  'reply_count': 0,
  'retweet_count': 0,
  'retweeted': False,
  'retweeted_status': {'contributors': None,
   'coordinates': None,
   'created_at': 'Thu Nov 09 17:30:01 +0000 2017',
   'display_text_range': [0, 140],
   'entities': {'hashtags': [],
    'symbols': [],
    'urls': [{'display_url': 'blog.education.nationalgeographic.com/2017/11/09/cli…',
      'expanded_url': 'https://blog.education.nationalgeographic.com/2017/11/09/climate-action-250-schools-69-countries-focusing-on-climate-change',
      'indices': [68, 91],
      'url': 'https://t.co/gJRNyyAky1'},
     {'display_url': 'twitter.com/i/web/status/9…',
      'expanded_url': 'https://twitter.com/i/web/status/928676060069269509',
      'indices': [110, 133],
      'url': 'https://t.co/0Ua4kwQU1p'}],
    'user_mentions': [{'id': 136441843,
      'id_str': '136441843',
      'indices': [92, 108],
      'name': 'NatGeo Education',
      'screen_name': 'NatGeoEducation'}]},
   'extended_tweet': {'display_text_range': [0, 155],
    'entities': {'hashtags': [{'indices': [109, 123], 'text': 'climatechange'},
      {'indices': [124, 139], 'text': 'climateactionp'},
      {'indices': [140, 155], 'text': 'teachersmatter'}],
     'media': [{'display_url': 'pic.twitter.com/fFhRDnCbUQ',
       'expanded_url': 'https://twitter.com/zelfstudie/status/928676060069269509/photo/1',
       'id': 928674531937538048,
       'id_str': '928674531937538048',
       'indices': [156, 179],
       'media_url': 'http://pbs.twimg.com/media/DONQkQQW0AAuZYH.jpg',
       'media_url_https': 'https://pbs.twimg.com/media/DONQkQQW0AAuZYH.jpg',
       'sizes': {'large': {'h': 838, 'resize': 'fit', 'w': 945},
        'medium': {'h': 838, 'resize': 'fit', 'w': 945},
        'small': {'h': 603, 'resize': 'fit', 'w': 680},
        'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
       'type': 'photo',
       'url': 'https://t.co/fFhRDnCbUQ'}],
     'symbols': [],
     'urls': [{'display_url': 'blog.education.nationalgeographic.com/2017/11/09/cli…',
       'expanded_url': 'https://blog.education.nationalgeographic.com/2017/11/09/climate-action-250-schools-69-countries-focusing-on-climate-change',
       'indices': [68, 91],
       'url': 'https://t.co/gJRNyyAky1'}],
     'user_mentions': [{'id': 136441843,
       'id_str': '136441843',
       'indices': [92, 108],
       'name': 'NatGeo Education',
       'screen_name': 'NatGeoEducation'}]},
    'extended_entities': {'media': [{'display_url': 'pic.twitter.com/fFhRDnCbUQ',
       'expanded_url': 'https://twitter.com/zelfstudie/status/928676060069269509/photo/1',
       'id': 928674531937538048,
       'id_str': '928674531937538048',
       'indices': [156, 179],
       'media_url': 'http://pbs.twimg.com/media/DONQkQQW0AAuZYH.jpg',
       'media_url_https': 'https://pbs.twimg.com/media/DONQkQQW0AAuZYH.jpg',
       'sizes': {'large': {'h': 838, 'resize': 'fit', 'w': 945},
        'medium': {'h': 838, 'resize': 'fit', 'w': 945},
        'small': {'h': 603, 'resize': 'fit', 'w': 680},
        'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
       'type': 'photo',
       'url': 'https://t.co/fFhRDnCbUQ'}]},
    'full_text': 'The Climate Action project covered by National Geographic Education\nhttps://t.co/gJRNyyAky1 @NatGeoEducation #climatechange #climateactionp #teachersmatter https://t.co/fFhRDnCbUQ'},
   'favorite_count': 70,
   'favorited': False,
   'filter_level': 'low',
   'geo': None,
   'id': 928676060069269509,
   'id_str': '928676060069269509',
   'in_reply_to_screen_name': None,
   'in_reply_to_status_id': None,
   'in_reply_to_status_id_str': None,
   'in_reply_to_user_id': None,
   'in_reply_to_user_id_str': None,
   'is_quote_status': False,
   'lang': 'en',
   'place': None,
   'possibly_sensitive': False,
   'quote_count': 11,
   'reply_count': 7,
   'retweet_count': 34,
   'retweeted': False,
   'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
   'text': 'The Climate Action project covered by National Geographic Education\nhttps://t.co/gJRNyyAky1 @NatGeoEducation… https://t.co/0Ua4kwQU1p',
   'truncated': True,
   'user': {'contributors_enabled': False,
    'created_at': 'Sat May 29 17:52:29 +0000 2010',
    'default_profile': False,
    'default_profile_image': False,
    'description': '@TeacherPrize Top 50 teacher | Teaching Kakuma Refugees via Skype | Author | Speaker | Technology Enhanced Learning | Founder zelfstudie.be | MIE Fellow',
    'favourites_count': 8248,
    'follow_request_sent': None,
    'followers_count': 3816,
    'following': None,
    'friends_count': 2711,
    'geo_enabled': True,
    'id': 149593681,
    'id_str': '149593681',
    'is_translator': False,
    'lang': 'nl',
    'listed_count': 331,
    'location': 'Belgium',
    'name': 'Koen Timmers',
    'notifications': None,
    'profile_background_color': 'D0D0D0',
    'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/375133972/bg-twitter.jpg',
    'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/375133972/bg-twitter.jpg',
    'profile_background_tile': False,
    'profile_banner_url': 'https://pbs.twimg.com/profile_banners/149593681/1498040391',
    'profile_image_url': 'http://pbs.twimg.com/profile_images/808926736851357696/_LIvoZ7Q_normal.jpg',
    'profile_image_url_https': 'https://pbs.twimg.com/profile_images/808926736851357696/_LIvoZ7Q_normal.jpg',
    'profile_link_color': 'D9650D',
    'profile_sidebar_border_color': '0D0202',
    'profile_sidebar_fill_color': 'EDA426',
    'profile_text_color': '333333',
    'profile_use_background_image': True,
    'protected': False,
    'screen_name': 'zelfstudie',
    'statuses_count': 7064,
    'time_zone': 'Brussels',
    'translator_type': 'none',
    'url': 'http://www.timmers.me',
    'utc_offset': 3600,
    'verified': False}},
  'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'text': 'RT @zelfstudie: The Climate Action project covered by National Geographic Education\nhttps://t.co/gJRNyyAky1 @NatGeoEducation… ',
  'timestamp_ms': '1510335992103',
  'truncated': False,
  'user': {'contributors_enabled': False,
   'created_at': 'Tue Aug 18 19:59:41 +0000 2009',
   'default_profile': False,
   'default_profile_image': False,
   'description': 'Head of Beta School & Teacher. I #LoveToTeach because to teach is to empower others, to lead changes & create the future!!! FC Lead Ambassador, MIE EXPERT',
   'favourites_count': 945,
   'follow_request_sent': None,
   'followers_count': 218,
   'following': None,
   'friends_count': 121,
   'geo_enabled': True,
   'id': 66782856,
   'id_str': '66782856',
   'is_translator': False,
   'lang': 'en',
   'listed_count': 23,
   'location': 'israel',
   'name': 'karina',
   'notifications': None,
   'profile_background_color': 'BF1238',
   'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme20/bg.png',
   'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme20/bg.png',
   'profile_background_tile': False,
   'profile_banner_url': 'https://pbs.twimg.com/profile_banners/66782856/1461433147',
   'profile_image_url': 'http://pbs.twimg.com/profile_images/902939255093284865/0zsSHLvE_normal.jpg',
   'profile_image_url_https': 'https://pbs.twimg.com/profile_images/902939255093284865/0zsSHLvE_normal.jpg',
   'profile_link_color': 'BF1238',
   'profile_sidebar_border_color': 'FFFFFF',
   'profile_sidebar_fill_color': 'EFEFEF',
   'profile_text_color': '333333',
   'profile_use_background_image': True,
   'protected': False,
   'screen_name': 'karinam60',
   'statuses_count': 1519,
   'time_zone': 'Jerusalem',
   'translator_type': 'none',
   'url': 'http://karinam60.wix.com/betasefer',
   'utc_offset': 7200,
   'verified': False}},
 {'contributors': None,
  'coordinates': None,
  'created_at': 'Fri Nov 10 17:46:35 +0000 2017',
  'entities': {'hashtags': [{'indices': [0, 14], 'text': 'ClimateChange'}],
   'symbols': [],
   'urls': [{'display_url': 'ift.tt/2yq4q9t',
     'expanded_url': 'http://ift.tt/2yq4q9t',
     'indices': [59, 82],
     'url': 'https://t.co/J7o95JylhR'}],
   'user_mentions': []},
  'favorite_count': 0,
  'favorited': False,
  'filter_level': 'low',
  'geo': None,
  'id': 929042617765171201,
  'id_str': '929042617765171201',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'it',
  'place': None,
  'possibly_sensitive': False,
  'quote_count': 0,
  'reply_count': 0,
  'retweet_count': 0,
  'retweeted': False,
  'source': '<a href="https://ifttt.com" rel="nofollow">IFTTT</a>',
  'text': '#ClimateChange: Ferma i finanziamenti. Ferma il gasdotto.: https://t.co/J7o95JylhR',
  'timestamp_ms': '1510335995509',
  'truncated': False,
  'user': {'contributors_enabled': False,
   'created_at': 'Fri Jan 21 20:04:39 +0000 2011',
   'default_profile': False,
   'default_profile_image': False,
   'description': 'Watching frogs boil as #Banks, #BigCorp, #BigOil & the #MilitaryIndustrialComplex use #Brainwashing & #TwoPartyTyranny to take a little more life away each day.',
   'favourites_count': 7,
   'follow_request_sent': None,
   'followers_count': 231,
   'following': None,
   'friends_count': 1,
   'geo_enabled': False,
   'id': 241240437,
   'id_str': '241240437',
   'is_translator': False,
   'lang': 'en',
   'listed_count': 81,
   'location': 'Free Palestine | Save Gaza',
   'name': 'IronBoltBruce',
   'notifications': None,
   'profile_background_color': '000000',
   'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/443373680211406848/dWlLVPye.png',
   'profile_background_image_url_https': 'https://pbs.twimg.com/profile_background_images/443373680211406848/dWlLVPye.png',
   'profile_background_tile': False,
   'profile_banner_url': 'https://pbs.twimg.com/profile_banners/241240437/1444648350',
   'profile_image_url': 'http://pbs.twimg.com/profile_images/462512490761297920/i7V3sGFl_normal.png',
   'profile_image_url_https': 'https://pbs.twimg.com/profile_images/462512490761297920/i7V3sGFl_normal.png',
   'profile_link_color': 'FD6304',
   'profile_sidebar_border_color': '000000',
   'profile_sidebar_fill_color': 'C0DFEC',
   'profile_text_color': '333333',
   'profile_use_background_image': False,
   'protected': False,
   'screen_name': 'WatchFrogsBoil',
   'statuses_count': 100618,
   'time_zone': 'Central Time (US & Canada)',
   'translator_type': 'none',
   'url': 'http://IronBoltBruce.com',
   'utc_offset': -21600,
   'verified': False}}]

Parsing tweets into a pandas DataFrame and storing as CSV

Once we are finished streaming tweets, we can process the files into a pandas DataFrame.

In [8]:
import pandas as pd

First, we define a function which extracts the interesting information from a specific tweet:

In [9]:
def parse_tweet(tweet):
    """ Parse an individual tweet into a dict
    """
    
    def get_list_from_dict_list(list_of_dict, ik):
        """ Extract values from a list of dictionaries

        Parameters
        ----------
        list_of_dict : list of dictionaries sharing the same keys

        ik : key whose value to extract from each dict

        Returns
        -------
        list of extracted items, or None if 'list_of_dict' is not iterable
        """
        try:
            list_result = [dd.get(ik) for dd in list_of_dict]
        except TypeError:
            list_result = None
        return list_result
    
    def extractor_closure(tweet_dict):
        """Closure for extracting nested entries from a dict
        """
        def get_item(*args):
            dd_nested = tweet_dict
            for item in args:
                dd_nested = dd_nested.get(item, None)
                if dd_nested is None:
                    break
            return dd_nested
        return get_item

    extractor = extractor_closure(tweet)
       
    dd_extract = {
        'id' : extractor('id'),
        'created_at' : extractor('created_at'),
        'lang' : extractor('lang'),
        'text' : extractor('text').replace('\n', ' ').replace('\r', ' ') if extractor('text') else None,
        'user_name' : extractor('user', 'name'),
        'screen_name' : extractor('user', 'screen_name'),
        'followers_count' : extractor('user', 'followers_count'),
        'country' : extractor('place', 'country_code'),
        'place' : extractor('place', 'full_name'),
        'user_mentions_names' : get_list_from_dict_list(
                extractor('entities', 'user_mentions'), 'screen_name'),
        'hashtags' : get_list_from_dict_list(
                extractor('entities', 'hashtags'), 'text'),
        }

    return dd_extract
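The nested-extraction closure is the workhorse here: it walks a chain of keys through a nested dict and returns None as soon as any key is missing, instead of raising. A self-contained miniature of that pattern, exercised on a hand-made tweet fragment (the values are made up; real tweets carry many more fields, as the output above shows):

```python
def extractor_closure(tweet_dict):
    """Closure for extracting nested entries from a dict."""
    def get_item(*args):
        dd_nested = tweet_dict
        for item in args:
            dd_nested = dd_nested.get(item, None)
            if dd_nested is None:
                break
        return dd_nested
    return get_item

# Hand-made miniature tweet (made-up values, far fewer fields than real ones).
sample = {'user': {'screen_name': 'jane', 'followers_count': 42}, 'place': None}
extractor = extractor_closure(sample)

print(extractor('user', 'screen_name'))      # jane
print(extractor('user', 'followers_count'))  # 42
print(extractor('place', 'country_code'))    # None: missing entries never raise
```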

Next, we define a function which parses one jsons file into a pandas DataFrame:

In [10]:
def jsons_to_df(jsons_file):
    """Read the jsons file and return the content as pandas DataFrame"""
    
    with open(jsons_file, 'r') as twitter_jsons:
        list_tweets = [parse_tweet(json.loads(line)) 
                       for line in twitter_jsons]
    return pd.DataFrame(list_tweets)
    

To run the parser over all gathered files, we need a list of these files:

In [11]:
tweets_folder = '/tmp/twitter_stream/'
list_jsons_files = [os.path.join(tweets_folder, tweets_file)
                    for tweets_file in os.listdir(tweets_folder)
                    if os.path.splitext(tweets_file)[1] == '.jsons']
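Equivalently, the list can be built with the standard-library glob module; sorting keeps the files in streaming order. The sketch below demonstrates this on a throwaway folder with dummy files, since /tmp/twitter_stream only exists after a streaming session:

```python
import glob
import os
import tempfile

# Throwaway folder with dummy files standing in for a streaming session.
demo_folder = tempfile.mkdtemp()
for name in ('tweets_0000000001.jsons', 'tweets_0000000000.jsons', 'notes.txt'):
    open(os.path.join(demo_folder, name), 'w').close()

# Only the .jsons files are picked up, in streaming (i.e. sorted) order.
list_jsons_files = sorted(glob.glob(os.path.join(demo_folder, '*.jsons')))
print([os.path.basename(f) for f in list_jsons_files])
# ['tweets_0000000000.jsons', 'tweets_0000000001.jsons']
```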

We can then process all gathered tweets into a pandas DataFrame, removing duplicate tweets:

In [12]:
df_tweets = (pd.concat([jsons_to_df(ff) for ff in list_jsons_files]).
             drop_duplicates(subset='id').
             sort_values('id').
             reset_index(drop=True)) 

To store the English tweets as CSV we can then use:

In [13]:
df_eng = df_tweets[df_tweets['lang']=='en']
df_eng.to_csv('/tmp/eng_tweets.csv', sep='|')
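Since the pipe character serves as separator, the file can later be read back with the matching sep argument. A round-trip sketch on a tiny made-up frame (the file name demo_tweets.csv is hypothetical; the real /tmp/eng_tweets.csv only exists after a streaming session):

```python
import pandas as pd

# Tiny made-up frame standing in for the harvested tweets.
df_demo = pd.DataFrame({'id': [1, 2], 'lang': ['en', 'en'],
                        'text': ['hello world', 'second tweet']})
df_demo.to_csv('/tmp/demo_tweets.csv', sep='|')

# Read it back; index_col=0 restores the written index column.
df_back = pd.read_csv('/tmp/demo_tweets.csv', sep='|', index_col=0)
print(df_back.equals(df_demo))  # True
```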
In [14]:
df_eng.head()
Out[14]:
|   | country | created_at | followers_count | hashtags | id | lang | place | screen_name | text | user_mentions_names | user_name |
|---|---------|------------|-----------------|----------|----|------|-------|-------------|------|---------------------|-----------|
| 0 | None | Fri Nov 10 17:46:32 +0000 2017 | 218 | [] | 929042603479269377 | en | None | karinam60 | RT @zelfstudie: The Climate Action project cov... | [zelfstudie, NatGeoEducation] | karina |
| 2 | None | Fri Nov 10 17:46:36 +0000 2017 | 184 | [paludiculture] | 929042620462092290 | en | None | DiannaKopansky | RT @iki_bmub: “We have to make peatland wet ag... | [iki_bmub, WetlandsInt] | Dianna Kopansky |
| 4 | None | Fri Nov 10 17:46:38 +0000 2017 | 157 | [FastForward, ClimateAction, COP23] | 929042629597302789 | en | None | IreneAyaa | RT @dw_akademie: Have a walk through the Bonn ... | [dw_akademie] | Irene Ayaa |
| 5 | None | Fri Nov 10 17:46:38 +0000 2017 | 2307 | [] | 929042631983812608 | en | None | TimmonsRoberts | Snappy title on this in session document in th... | [] | Timmons Roberts |
| 6 | None | Fri Nov 10 17:46:42 +0000 2017 | 421 | [ClimateChange] | 929042646080692224 | en | None | gameing | RT @Greenpeace: #ClimateChange is making storm... | [Greenpeace] | gameing |

How to process the resulting DataFrame will be the topic of the next notebook example.