Analyzing the Social Networks of over 2.7 Billion Reddit Comments¶

This notebook gives detailed information to help better understand our paper. In this notebook, we will demonstrate how to construct the subreddits' social networks created by more than 2.7 billion comments. Additionally, we will demonstrate how to calculate various statistics related to the subreddits. This code is licensed under a BSD license. See license file.

0. Setup¶

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo. We also recommend running the code via IPython Notebook.

sudo pip install --upgrade graphlab-create # We mainly used SFrame object, which is an open source package.
sudo pip install --upgrade praw
sudo pip install --upgrade seaborn
sudo pip install --upgrade networkx

1. Preparing the Dataset¶

First, we need to download the compressed Reddit dataset files from pushshift.io website. This dataset was created by Jason Michael Baumgartner. Additional details about this dataset can be found at this Link. Downloading this hundreds-of-GB dataset can take a considerable amount of time. To save time, you can download only one month’s, or several months’, worth of data. After we download the dataset, we notice that the dataset is organized in directories, where each directory contains the posts of a specific year. These directories contain posts that were published from December 2005 to the most recent month. For this tutorial, we utilized over 2.71 billion comments that were posted from December 2005 through October 2016. Let's create a single SFrame that contains all these posts. To achieve this, we first will convert each monthly zipped file into an SFrame object using the following code:

import os
import logging
import bz2
from datetime import datetime
import graphlab as gl
import graphlab.aggregate as agg
import fnmatch

gl.canvas.set_target('ipynb')
gl.set_runtime_config('GRAPHLAB_CACHE_FILE_LOCATIONS', '/data/tmp')
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS', 128)
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 128)


basedir = "/data/reddit/raw" # Replace this with the directory which you downloaded the file into
sframes_dir = "/data/reddit/sframes/" # Replace this with the directory you want to save the SFrame to
tmp_dir =  "/data/tmp" # Replace this with the directory you want to save the SFrame to

def get_month_from_path(path):
    m = os.path.basename(path)
    m = m.split(".")[0]
    return int(m.split("-")[-1])

def get_year_from_path(path):
    y = os.path.basename(path)
    y = y.split(".")[0]
    return int(y.lower().replace("rc_","").split("-")[0])

def json2sframe(path):
    """
    Creates an SFrame object from the file in the input path
    :param path: path to a file that contains a list of JSON objects, each JSON is saved in a separate line.
        The file can also be compressed in bz2 format.
    :return: SFrame object created from the file in the input path. The SFrame also contains information regarding
     each post date & time
    :rtype: gl.SFrame
    """
    if not path.endswith(".bz2"):
        sf = gl.SFrame.read_json(path, orient="lines")
    else:
        dpath = decompress_bz2(path)
        sf = gl.SFrame.read_json(dpath, orient="lines")
        #remove the decompressed file
        os.remove(dpath)
    #add datetime information
    sf['month'] = get_month_from_path(path)
    sf['year'] = get_year_from_path(path)
    sf['datetime']= sf["created_utc"].apply(lambda utc: datetime.fromtimestamp(float(utc)))

    return sf

def decompress_bz2(inpath, outpath=None):
    """
    Decompress bz2 to the outpath, if the outpath is not provided then decompress the file to the inpath directory
    :param inpath: decompress bz2 file to the outpath
    :param outpath: output path for the decompress file
    :return: the output file path
    """
    if outpath is None:
        outpath = tmp_dir + os.path.sep + os.path.basename(inpath) + ".decompressed"
    out_file = file(outpath, 'wb')
    logging.info("Decompressing file %s to %s" % (inpath,outpath))
    in_file = bz2.BZ2File(inpath, 'rb')
    for data in iter(lambda : in_file.read(100 * 1024), b''):
        out_file.write(data)
    out_file.close()
    in_file.close()
    return outpath

def match_files_in_dir(basedir, ext):
    """
    Find all files in the basedir with 'ext' as filename extension
    :param basedir: input basedir
    :param ext: filename extension
    :return: list of file paths with the input extension
    """
    matches = []
    for root, dirnames, filenames in os.walk(basedir):
        for filename in fnmatch.filter(filenames, ext):
            matches.append(os.path.join(root, filename))
    return matches

#Creating all SFrames
for p in match_files_in_dir(basedir, "*.bz2"):
    logging.info("Analyzing of %s " % p)
    outp = sframes_dir + os.path.sep + os.path.basename(p).replace(".bz2", ".sframe")
    if os.path.isdir(outp): #if file already exists skip it
        logging.info("Skipping the analysis of %s file" % p)
        continue
    sf = json2sframe(p)
    sf.save(outp)

This non-commercial license of GraphLab Create for academic use is assigned .

[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1487345388.log

Finished parsing file /data/tmp/RC_2009-09.bz2.decompressed

Parsing completed. Parsed 100 lines in 1.59448 secs.

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

Read 88555 lines. Lines per second: 30815

Read 972023 lines. Lines per second: 112707

Read 1590501 lines. Lines per second: 114090

Finished parsing file /data/tmp/RC_2009-09.bz2.decompressed

Parsing completed. Parsed 2032276 lines in 17.5218 secs.

Now let’s join all the SFrame objects into a single object. Please notice that different posts contain different metadata information about each post. Therefore, we will create a single SFrame which contains all the various metadata information.

join_sframe_path = sframes_dir + os.path.sep + "join_all.sframe" # Where to save the join large SFrame object
def get_all_cols_names(sframes_dir):
    """
    Return the column names of all SFrames in the input path
    :param sframes_dir: directory path which contains SFrames
    :return: list of all the column names in all the sframes in the input directory
    :rtype: set()
    """
    sframes_paths = [sframes_dir + os.path.sep + s for s in  os.listdir(sframes_dir)]
    column_names = set()
    for p in sframes_paths:
        if not p.endswith(".sframe"):
            continue
        print p
        sf = gl.load_sframe(p)
        column_names |= set(sf.column_names())
    return column_names

def get_sframe_columns_type_dict(sf):
    """
    Returns a dict with the sframe column names as keys and column types as values
    :param sf: input SFrame
    :return: dict with the sframe column names as keys and column types as values
    :rtype dict[str,str]
    """
    n = sf.column_names()
    t = sf.column_types()

    return {n[i]: t[i]for i in range(len(n))}

def update_sframe_columns_types(sf, col_types):
    """
    Updates the input sframe column types according to the input types dict.
    :param sf: input SFrame object
    :param col_types: dict in which the keys are the column names and the values are the columns types
    :return: SFrame object with column types update to the col_types dict. If a column doesn't exist in the SFrame object
     then a new column is added with None values
    :rtype: gl.SFrame
    """
    sf_cols_dict = get_sframe_columns_type_dict(sf)

    for k,v in col_types.iteritems():
        if k not in sf_cols_dict:
            sf[k] = None
            sf[k] = sf[k].astype(v)
        elif v != sf_cols_dict[k]:
            sf[k] = sf[k].astype(v)
    return sf

def join_all_sframes(sframes_dir, col_types):
    """
    Joins all SFrames in the input directory where the column types are according to col_types dict
    :param sframes_dir:
    :param col_types: dict with column names and their corresponding types
    :return: merged SFrame of all the SFrames in the input directory
    "rtype: gl.SFrame
    """
    sframes_paths = [sframes_dir + os.path.sep + s for s in  os.listdir(sframes_dir) if s.endswith(".sframe")]
    sframes_paths.sort()
    sf = gl.load_sframe(sframes_paths[0])
    sf = update_sframe_columns_types(sf, col_types)
    for p in sframes_paths[1:]:
        if not p.endswith(".sframe"):
            continue
        print "Joining %s" % p
        sf2 = update_sframe_columns_types(gl.load_sframe(p), col_types)
        sf2.__materialize__()
        sf = sf.append(sf2)
        sf.__materialize__()

    return sf



# use the inferred column type according to last month posts' SFrame. Set all other columns to be 
# as type str
#col_names = get_all_cols_names(sframes_dir)
sf = gl.load_sframe(sframes_dir + '/RC_2015-05.sframe')
d = get_sframe_columns_type_dict(sf)
for c in col_names:
    if c not in d:
        print "Found new column %s" %c
    d[c] = str

#Create Single SFrame
sf = join_all_sframes(sframes_dir, d)
sf.save(join_sframe_path)

At the end of this process, we obtained an SFrame with 2,718,784,464 rows, which is about 444 GB in size. Let's use the show function to get a better understanding of the data.

gl.canvas.set_target('ipynb')
sf = gl.load_sframe(join_sframe_path)
#sf.show() # running this can take considerable amount of time

Let's clean it by removing columns that aren't useful for creating the subreddit's social network. Namely, we remove the following columns: "archived," "downs," "retrieved_on," "banned_by," "likes," "user_reports," "saved," "report_reasons," "approved_by," "body_html," "created," "mod_reports," and "num_reports.”

sf = sf.remove_columns(["archived", "downs", "retrieved_on", "banned_by", "likes","user_reports", "saved",
                        "report_reasons", "approved_by", "body_html", "created", "mod_reports", "num_reports"])
sf.__materialize__()

Let's also delete users' posts that are from users that are probably bots and from those who have posted too many messages.

#First let's find how many posts the most active users posted
posts_count_sf = sf.groupby('author', gl.aggregate.COUNT())
posts_count_sf.sort('Count',ascending=False).print_rows(50)

+----------------------+-----------+
|        author        |   Count   |
+----------------------+-----------+
|      [deleted]       | 302138049 |
|    AutoModerator     |  10976942 |
|     conspirobot      |   575576  |
|    ModerationLog     |   562192  |
|     TweetPoster      |   555976  |
|    MTGCardFetcher    |   444411  |
|        RPBot         |   436199  |
|     autowikibot      |   422769  |
|   imgurtranscriber   |   404441  |
|      PoliticBot      |   388444  |
|      dogetipbot      |   374536  |
|   ImagesOfNetwork    |   355037  |
| throwthrowawaytothee |   344583  |
| Late_Night_Grumbler  |   320383  |
|   qkme_transcriber   |   301974  |
| TweetsInCommentsBot  |   282467  |
|    TotesMessenger    |   222073  |
|     Franciscouzo     |   202346  |
|      havoc_bot       |   197761  |
|     morbiusgreen     |   195312  |
|    User_Simulator    |   194358  |
|        Lots42        |   192985  |
|       hit_bot        |   182189  |
|         -rix         |   173022  |
|      pixis-4950      |   171907  |
|     PornOverlord     |   171623  |
|    BitcoinAllBot     |   170096  |
|     UnluckyLuke      |   169971  |
|     PriceZombie      |   163710  |
|      amici_ursi      |   163288  |
|        matts2        |   156381  |
|       gifv-bot       |   155513  |
| WritingPromptsRobot  |   154295  |
|    atomicimploder    |   148904  |
|       autotldr       |   146046  |
|      TrollaBot       |   145737  |
|     SnapshillBot     |   143317  |
|   NoMoreNicksLeft    |   135738  |
|      MovieGuide      |   134567  |
|      raddit-bot      |   133615  |
|       iam4real       |   133262  |
|   youtubefactsbot    |   132521  |
|      CaptionBot      |   131743  |
|        hwsbot        |   131402  |
|       G_Morgan       |   129057  |
|      davidjl123      |   128481  |
|     noeatnosleep     |   125357  |
|     samacharbot2     |   125226  |
|    backpackwayne     |   116002  |
|     ttumblrbots      |   114590  |
+----------------------+-----------+
[20299812 rows x 2 columns]

To clean the dataset, we removed redditors who posted over 100,000 times and seemed to post automatic content. Additionally, we used the praw Python package to parse the comments posted in the BotWatchman subreddit in order to identify bots and remove them. We use the following code to assemble a bots list:

import praw


def get_bots_set():
    #Please insert your Reddit application's authentication information.
    #See more details at: https://github.com/reddit/reddit/wiki/OAuth2-Quick-Start-Example#first-steps
    secret ='<insert your secert string>'
    client_id = '<insert your client-id string>'
    user_agent = '<insert your user-agent string>'
    
    reddit = praw.Reddit(client_id=client_id,client_secret=secret, user_agent=user_agent)
    sr = reddit.subreddit('BotWatchman')
    bots_set = set()
    for p in sr.search("overview for", limit=2000):
        bots_set.add(p.title.split(" ")[2])
    return bots_set    


def get_remove_profiles_list(sf):
    # We get the bots set twice to reduce the variance of the results
    bots_set = get_bots_set()
    bots_set |= get_bots_set()
    posts_count_sf = sf.groupby('author', gl.aggregate.COUNT())
    delete_users = set(posts_count_sf[posts_count_sf['Count'] > 100000]['author'])
    delete_users |= bots_set
    return delete_users


delete_users = get_remove_profiles_list(sf)
print "Delete Users List (%s users):\n %s" % (len(delete_users), delete_users)

Delete Users List (951 users):
 set(['Removedpixel', 'SoulFire6464', u'RealtechPostBot', u'serendipitybot', u'r2d8', u'arethemodsasleep', u'squis1000', u'The_Penis_Game_Bot', u'whooshbot', u'Zyra_test', u'hitlerbotv2', u'topredditbot', u'VerseBot', u'MusicB0t', u'Explains_FTFY', u'BaddieBot', u'autourbanbot', 'NinjaDiscoJesus', u'thier_2_their_bot', u'amProbablyPooping', u'Insane_Photo_Bot', u'Auto_Neckbeard_Bot', u'shadowbanbot', u'_Rita_', u'Profanity-bot', u'gracefulclaritybot', u'DataFire_bot', 'autowikibot', u'FTFY_Cat', u'yes_answers.', u'DollarSignDouche', u'ELIMINATE_GIFS', u'PlusOneBot', u'CutoutBot', u'visual_clarification', u'triggeredbot', u'haiku_finder_bot', u'Eats_Small_Text_Bot', u'GameDealsBot', u'Craigslist-Bot', u'STEALTHM0UNTAIN', u'LegoLinkBot', 'atomicimploder', u'Dickish_Bot_Bot', 'BitcoinAllBot', u'/u/codes_comments', u'CuteBot6969', u'LittleHelperRobot', u'Anti-Brigade-Bot3', u'autowikiabot', u'Anti-Brigade-Bot8', 'backpackwayne', u'UnobtaniumTipBot', u'YOGSbot', u'Innocence_bot2', u'PreserverBot', u'givesafuckbot', u'youreyoureiswrong', u'gabenizer-bot', u'crushe69', u'Canadian-Sorry-Bot', u'Comrade__Bot', u'GunnitBot', u'checks_for_checks', u'define_bot', u'Hikelsa_bot', u'ThanksObamaBiden', u'rusetipbot', 'pixis-4950', u'autotrope_bot', u'Feathersbumbled', u'NSLbot', u'kittehcointipbot', u'ModisDead', u'ItsSpelledPrincipal', u'demobile_bot', u'acini', u'rawlaying90', u'OneBleachinBot', u'giddyp93', u'NobodyDoesThis', u'image_linker_bot', u'ReddiTron1000', u'WWE_Network_Bot', u'iTunesLinks', u'TheWallGrows', u'PokemonFlairBot', u'StockTrendsBot', u'pikachu_hsu', u'PM_ME_YOUR_COCK_', 'CaptionBot', u'FucksWithTables', u'The_Dad_Bot', u'hearing-aid_bot', u'bee2kg', u'IPostAlotBot', u'stockinfo_bot', u'ASOT_Tracklister_bot', u'resavr_bot', u'chant89', u'StackBot', u'it-is-not-would-of', u'Converts2Useless', u'OnlyRepliesLMAO', u'WillPrismWordsBot', u'magic-8-bot', '-rix', u'DUCCI__BOT', u'Product_Helper_Bot', u'CationBot', u'Readdit_Bot', u'shiftace-', u'Comment_Codebreaker', u'ToBeFairBot', u'thank_mr_skeltal_bot', u'steam_bot', u'RFootballBot', u'ratsb1000', u'RedVanguardBot', u'AVR_Modbot', u'nulli1000', u'DidSomeoneSayBoobs', u'compilebot', u'Brigade_Bot', u'/u/TotesHuman', u'league_match_bot', u'_Sweetiebot', u'much-wow-doge', u'UselessAndBot', u'disapprovalbot', u'ButWhatDoIDoWithIt', u'ASCIICockBot', u'tabledresser', u'cruise_bot', 'G_Morgan', 'autotldr', 'PornOverlord', u'topcoin_tip', u'tagstr1000', u'heads_need_bodies', 'PoliticBot', u'bedoot', u'AAbot', u'gandhiwatchesbot', u'AladeenBot', u'iscuck_bot', u'nice_meme_bot', 'samacharbot2', u'dotaherobot', u'TreeFiddyBot', u'RedditAnalysisBot', u'b6d27f0x3', u'roger_bot', u'BotOfOriginality', u'TheTopCommentBot', u'annoying_yes_bot', u'isDegenerate', u'NightMirrorMoon', u'BOTFORSUBREDDITNAME', u'gandhi_spell_bot', u'Decronym', u'WikiBotEU', u'cartri69', u'ProfanityWatch', u'livid_taco', u'loser_detector_bot', u'IthkuilRobot', u'zalgoskeletor', u'PlayStoreLinks_Bot', u'ShillForMonsanto', u'BrokenArmsBot', u'qznc_bot', 'matts2', u'PonyTipBot', u'Twitch2YouTube', u'GabenCoinTipBot', u'gabentipbot', u'DogeLotteryModBot', u'DrivePower', u'alot-of-bot', u'The-Paranoid-Android', u'YouTube_Time', u'companymi69', u'NotApostolate', u'FFBot', u'wheres_the_karma_bot', u'Wolfram_Bot', u'ImageRehoster', u'highlightsbot', u'codex561', u'garu323', 'raddit-bot', u'YT_Timestamp_Bot', u'WeAppreciateYou', u'hhhehehe_BOT', u'skitchbot', 'Late_Night_Grumbler', u'MumeBot', u'not-cleverbot', u'Gestabot', 'NoMoreNicksLeft', u'TheSpellingAsshole', u'QuoteMe-Bot', u'Dad_Jokes_Inbound', u'lolpenisbot', u'you_thejoke', u'TerperBot', 'GoTradeBot', u'Nepene', u'greeter_bot', u'TitsOrGTFO_Bot', u'saskaai', u'ArmFixerBot', u'I_Love_You_Too_Bot', u'grreviews_bot', 'qkme_transcriber', u'ASOIAFSearchBot', u'Dad_Jokes_Incoming', u'Peteusse', 'SnapshillBot', u'Honey189', u'_AyyLmaoBot_', u'SoThereYouHaveIt', u'ferrarispider52', u'publicmodlogs', u'Joasdisle', u'ItWillGetBetterBot', u'thelinkfixerbot', u'CalvinBot', u'splin69', u'Anti-Brigade-Bot7', u'Rule_13_Bot', u'Shlak2k15', u'Handy_Related_Sub', u'BELITipBot', u'scraptip', u'Antiracism_Bot', u'auto-doge', u'PuushToImgurBot', '[deleted]', u'anusBot', u'Reddit2Trend', u'ToBeFairCounter', u'JotBot', u'WikipediaLinkFixer', u'NFLVideoBot', u'UrbanDicBot', u'maybemaybemaybe_bot', u'fieldpain6969', u'MaveTipBot', u'nsfw_gfys_bot2', u'CortanaBot', u'rumblep666', u'SmallTextReader', u'KSPortBot', u'Define_It', u'_TonyWonder_', u'wrapt6969', u'PnSProtectionService', 'throwthrowawaytothee', u'Control_F', u'XPostLinker', u'phrase_bot', u'TOP_COMMENT_OF_YORE', u'smile_today_', u'bitcointip', u'templatebot', u'Dictionary__Bot', u'Downtotes_Plz', u'Mr_Vladimir_Putin', u'gracefulcharitybot', u'A_random_gif', u'trpbot', u'MetricPleaseBot', u'redditShitBots', u'Anti-Brigade-Bot-35', u'MatchThreadder', u'GrammarianBot', u'SpellingB', u'Mobile_Version', u'Metric_System_Bot', u'FriendlyCamelCaseBot', u'I_Am_Genesis', 'TweetsInCommentsBot', u'AutoNiggaBot', u'DadBot3000', u'VideopokerBot', u'InformativeButFalse', u'mcservers-bot', u'sendto', u'classhole_bot', u'DNotesTip', u'TheAssBot', u'UselessArithmeticBot', u'GoogleTranslateBot', u'MAGNIFIER_BOT', u'DealWithItbot', u'NintendoVideoBot', u'CONFUSED_COPYCAT', u'Sandstorm_Bot', 'ttumblrbots', u'classybot', u'bitofnewsbot', u'MetatasticBot', u'post_only_cocks', u'Deviantart-Mirror', u'ImprovedGrammarBot', u'ShillHill666', u'EscapistVideoBot', u'AntiBrigadeBot', u'IRCR_Info_Bot', u'swiftwee88', u'ICouldntCareLessBot', u'X_BOT', u'clay_davis_bot', u'rumbl1000', u'My_Bot', u'panderingwhore', u'GeekWhackBot', u'TryShouldHaveInstead', u'SauceHunt', u'bust1nbot', u'valkyribot', u'WordCloudBot2', u'OnlyPostsJFC', u'wooshbot', u'infiltration_bot', 'PriceZombie', u'ContentForager', u'SatoshiTipBot', u'BoobBot3000', u'TheButtonStatsBot', u'MonsterMashBot', u'subtext-bot', u'shirley_bot', u'flower_bot', u'BensonTheBot', u'SwearWordPolice', u'Reddit_JS_Bot', u'corgicointip', u'SuchModBot', u'RAOM_Bot', u'haha_bot', u'WeatherReportBot', u'xkcd37bot', u'_chao_', u'NASCARThreadBot', u'/u/TARDIS-BOT', 'AutoModerator', u'ThePoliceBot', u'coinflipbot', u'spinnelein', u'meme_transcriber', 'rollme', u'TablesWillBeFlipped', u'anagrammm', u'rarchives', u'Anti-Brigade-Bot', u'pm_me_your_bw_pics', u'last_cakeday_bot', u'SRS_History_Bot', u'Nazeem_Bot', u'murde1', u'I_Say_No_', u'navigatorbot', u'Anti-Brigade-Bot-1', 'Lots42', u'ThePictureDescribot', u'VeteransResourcesBot', u'fact_check_bot', u'milestone_upvotes', u'goodnotesbot', u'hybrid377', u'BogdanGhita', u'RaceReporter', u'gives_you_boobies', u'Anti-Brigade-Bot-5', u'yes-bot.', u'SubredditMetadataBot', u'AutoInsult', u'mvinfo', u'coinyetipper', u'clothinf89', u'only_posts_chickens', u'Meta_Bot', 'WritingPromptsRobot', u'dgctipbot', u'Correct_Apostrophe_S', u'IAMARacistBot', u'_FallacyBot_', u'decidebot', u'HCE_Replacement_Bot', u'kitabi_vig', u'GrammarCorrectingBot', u'VideoLinkBot', u'scanr', u'GoogleForYouBot', u'Anti-brigade-bot-19', u'Team60sBot', u'Anything_At_All_', u'luckoftheshibe', u'forehead92', u'faketipbot', u'NotRedditEnough', u'FapFindr', u'WatchGroupBot', u'AbixBot', u'AyyLmao2DongerBot', u'thisbotsays', u'PhoenixBot', u'sentimentviewbot', u'AntiHateBrigadingBot', u'edward_r_servo', u'Le_xD_Bot', u'frontbot', u'bRMT_Bot', u'PunknRollBot', u'ConvertsToMetric', u'PCMasterRacebot', u'nba_gif_bot', u'yourebot', u'Could_Care_Corrector', u'GrammerNazi_', u'rubycointipbot', u'/u/hearingaid_bot', u'facts_sphere', 'TweetPoster', u'curly1232', u'AstrosBot', u'chromabot', u'LocationBot', u'webpage_down_bot', u'TheChosenBot', u'c5bot', u'subredditChecker', u'flips_title', u'ObamaRobot', u'ButtTipBot', 'User_Simulator', u'battery_bot', u'/r/bot_police', u'totes_alpha_bot', u'frytipbot', u'provides-id', u'RaGodOfTheSunisaCunt', u'TrendingBot', u'EDC_automod', u'Daily_Fail_Bot', u'lilliecute', u'slapbot', u'BASICALLY_LITERALLY', u'AntiBrigadeBot2', u'tmobaird', u'Link_Demobilizer', u'Spaghetti_Robotti', u'TalkieToasterBot', u'AllahuAkbarBot', u'LinkFixerBot1', u'LinkFixerBot2', u'LinkFixerBot3', 'hit_bot', u'math_or_math_joke', u'nbtip', u'howstat', u'Relevant_News_Bot', u'flip_text_bot', u'PresidentObama___', u'giftoslideshowdotcom', u'DownvotesMcGoats', u'malen-shutup-bot', u'Some_Bot', u'approve_unmoderated', u'leveretb89', u'Gukga-anjeon-bowibu', u'swearjar_bot', u'TheHandOfOmega', u'test_bot0x00', u'URLfixerBot', u'HowIsThisBestOf_Bot', u'RonaldTheRight', u'-faggotbot', u'colorcodebot', 'Franciscouzo', u'Somalia_Bot', 'youtubefactsbot', u'NSA_for_ELS', u'bosk_tino', u'reallygoodbot', u'no_context_bot', u'NoseyPantyBot', u'Shiny-Bot', u'RemindMeBotBro', u'inside_voices', u'penguingun', u'XboxDVRBot', u'GotCrypto', u'CAH_BLACK_BOT', u'ExpectedFactorialBot', u'vertcoinbot', u'NoSobStoryBot2', u'Smile_Bot', u'IFlipCoins', u'memedad-transcriber', u'I_Like_Spaghetti', u'PleaseRespectTables', u'xkcd_transcriber', u'CockBotHDEdition', u'golferbot', u'RemindMeBot', u'tyo-translate', u'GrasiaABot', u'JiffierBot', u'JumpToBot', u'new_eden_news_bot', u'CasualMetricBot', u'Unhandy_Related_Sub', u'tipmoonbot2', u'SwitcharooInventory', u'okc_rating_bot', u'RfreebandzBOT', u'BailyBot', u'MassdropBot', 'RPBot', u'Rad_Rad_Robot', u'pandatips', u'imgur_rehosting', u'YesManBot', u'CantHearYouBotEX', u'UselessConversionBot', u'MY_GLASSES_BITCH', u'Key_Points', u'AwkwardMod', u'JiffyBot', u'FunCatFacts', u'spursgifs_xposterbot', u'SRD_Notifier', 'noeatnosleep', u'isreactionary_bot', u'Sammy-SlimBot', u'PlaylisterBot', u'rotoreuters', u'Celeb_Username_Bot', u'digitipbot', u'reddtipbot', u'ImCongratulating', u'PM_UR_NUDES-PLEASE', u'PlayStoreLinks__Bot', u'HMF_Bot', u'DotaCastingBot', u'thankyoubot', u'A858DE45F56D9BC9', u'quack_bot', u'JennyCherry18', u'ShadowBannedBot', u'Mentioned_Videos', u'fedora_tip_bot', u'Captn_Hook', u'TDTMBot', u'/u/Wraptram', u'haikub0t', u'sunhatd1000', u'NewAustrian', u'Accountabili_bot', u'ExmoBot', u'isReactionaryBot', 'nikorasu_the_great', u'picsonlybot', u'shadowman735', u'jerkbot-3hunna', u'I_Dislike_Spaghetti', u'rss_feed', 'TotesMessenger', u'AutoCorrectionBot', u'LinkDemobilizerBot', u'MemeExDe', u'TicTacToeBot', u'SubredditLinkBot', u'the__meep', u'GoldFact', u'SERIAL_JOKE_KILLER', u'gender_bot', u'peekerbot', u'YTDLBot', u'CamptownRobot', u'_perly_bot', u'tagstra93', u'Barrack_Obama__', u'MuhOutrage_Bot', 'Sir_Willis_CMS', u'cake_day_bot', u'sports_gif_bot', u'StopSayingRIPinPBot', u'nyantip', u'Finnbot', u'Link_Correction_Bot', u'That_Attitude_Bot', u'AwkwardDev', 'dogetipbot', u'YouGotGlitched', u'MarkovBot', u'magic_8_bot', u'haiku_robot', u'AlaTull', u'DeltaBot', u'malo_the_bot', u'DEAL_WITH_IT_bot', 'eaturbrainz', u'Tereshang', u'OfficialCommentRater', u'WontRemindYou', u'BeHappyBot', u'lollygag1000', u'FTFYouBot', u'cahbot', u'CouldBeWorseBot', u'totes_meta_bot', u'autoleprabot', u'AsciBoobsBot', u'freeEBOOKS-Batman', u'WordcloudYou', 'davidjl123', u'English_Helper_Bot', u'I_BITCOIN_CATS', 'morbiusgreen', u'mirror__bot', u'asmccc4123', u'MontrealBot', u'metoobot', u'ac_502002', 'andrewsmith1986', u'citation-is-needed', u'SEO_Killer', u'vertcointipbot', u'VsauceBot', u'Mattyso', u'makeswordcloudsagain', u'The_Marco_Rubio', u'SWTOR_Helper_Bot', u'Virnibot', u'pi_day_bot', u'classhole_robot', u'Shrugfacebot', u'elMatadero_bot', u'Random-ComplimentBOT', u'GoneWildResearcher', u'AndMarquez', u'matthewrobo', u'EveryThingIsSoRaven', u'GreasyBacon', u'bajneeds', u'CantHearYouBot', u'stockbot2000', u'gfycat-bot-sucksdick', u'neutlime', u'let_me_nig_that_4_u', u'Polite_Users_Bot', u'captainhindsightbot', u'AgentKoenigLMD', u'Text_Reader_Bot', u'FedoraTipAutoBot', u'/u/BadSarcasmDetector', u'lmaoRobot', u'ban_pruner', u'elwh392', u'GandhiSpellingBot', u'TwitchToYoutubeBot', u'icehock6969', u'ddlbot', u'relevantxkcd-bot', u'RequirementsBot', u'h0b0_monkey', u'ImJimmieJohnsonBot', u'nrl_cfb', u'PaidBot', u'bitcoinpartybot', u'moderator-bot', u'baseball_gif_bot', u'DropBox_Bot', u'DownvotedComments', u'slickwom-bot', u'kuilinbot', u'Gatherer_bot', u'VideoPostBot', u'WPBot', u'DirectImageLinkerBot', u'user/no_context_bot.', u'jaysbot2', u'ShittyGandhiQuotes', u'stairs1000', u'Wiki_FirstPara_bot', u'Fixes_GrammerNazi_', u'HauntedHashbrown', u'FrownUpsideDownBot', u'Anti-Brigade-Bot-11', u'Anti-Brigade-Bot-12', u'kg2bee', u'RoundUp_bot', u'cmhocbot', u'imirror_bot', u'BHObamanator', u'JokingLikeaBot', u'InternetLyingPolice', u'Raptor_Reading', u'JeopardyQBot', u'PedanticIrishBot', u'domoarigatobtfcboto', u'BritishHaikuBot', u'SurviAvi', u'WikipediaPoster', 'ModerationLog', u'pandatipbot', u'foreigneducationbot', u'non_dm_mirror_bot', u'fedoratips', u'Anti-Reactionary-Bot', u'TheLazyLinker', u'CouldCareFewer', u'InceptionBwongBot', u'/u/trollmaster-5000', u'maaz7', u'allinonebot', u'LineBreakBot', u'WhoopThereItWas', u'u/alot-of-bot', u'AlbumHelperBot', u'NoSleepAutoMod', u'Your_Post_As_A_Movie', u'rSGSpolice', u'TipMoonBot', u'GivesSilverBot300', 'MovieGuide', 'TrollaBot', u'JakeandAmirBot', u'BlackjackBot', u'CHART_BOT', u'bandage1000', u'KarmaConspiracy_Bot', u'BigLebowskiBot', u'grumblehi88', u'BuildAPC4MeBot', u'BotAntagonist', u'clevercommen', u'unitconvert', u'dragoneno6969', u'CreepierSmileBot', u'CIRCLJERK_REPOST_BOT', u'Grumbler_bot', u'DetailsMaster', u'cageypro666', u'SakuraiBot', u'openseadragonizer', u'ImproperGrammarBot', 'imgurtranscriber', 'MasterT231', u'obamabot9000', u'zunjae', u'PoliteUsersBotBot', u'translatesit43times', u'Insigible_Bot', u'VoterBot', u'sara_son19', u'CarterDugSubLinkBot', u'Nidalee_Bot', u'ProductHelperBot', u'CantHearYouBot2', u'moanrigid90', u'changetip', u'redditbots', u'ApiContraption', 'smacksaw', u'RedditMarkovBot', u'edhrec', u'Ladonnalder', u'RepeatBOT', u'dirtymindrebel', u'YT_Bot', u'xkcd_number_bot', u'hitlerbot', u'Rangers_Bot', u'SexSellsStats', u'Forgotten_News_Bot', u'HFY_Tag_Bot', u'ARandomTipAppears', u'LinkedCommentBot', u'LEGO_not_LEGOS_', u'PopcornBot', u'FelineFacts', u'request_bot', u'ReadSmallTextBot', u'MirrorNinjaBot', u'CAH_TEST_BOT', u'delaybot', u'-ecksdee-', u'timewaitsforsome', u'grumpybot', 'MTGCardFetcher', u'BreakfastCerealBot', u'DailMail_Bot', u'tara1', u'SMCTipBot', u'gocougs_bot', u'ThenThanMistakeNoted', u'keysteal_bot', u'DJ_Khaled_Best', u'anon_bot', 'UnluckyLuke', u'beecointipbot', u'QuoteMeBot', u'videos_mod', u'Tumblr_In_Starbucks', u'ExplainsRemovals', u'ComplimentingBot', u'Videos_Mentioned', u'User_History_Bot', u'__bot__', u'Anti_Vladimir_Putin', u'connipti92', u'amazedbot', u'replieswithsmokeweed', u'CreepySmileBotFanBot', u'rule_bot', u'sunhatda666', u'StarboundBot', u'saysjokes', u'IQuoteYouBot', u'colombia_bot', u'zigzagh90', u'IAgreeBot', u'Codebreakerbreaker', u'auto_help_bot', u'PrettyMuchPrettyMuch', u'ADHDbot', u'so_doge_tip', u'ConvertsToText', u'PoliteBot', u'bocketybot', u'maddie_bot', 'ImagesOfNetwork', u'stats-bot', u'imgurerbot', u'potdealer', 'gifv-bot', u'f1bet', u'stormske91', u'notoverticalvideo', u'imgurHostBot', u'amumu_bot', u'Meta_Bot2', u'flappytip', u'Anti-Brigade-Bot-2', u'Anti-Brigade-Bot-3', u'ToMetric', u'Anti-Brigade-Bot-6', u'asmrspambot', u'recursion_bot', u'thewutbot', u'QUICHE-BOT', u'grammar_nazi_bot_13', u'GimliBot', u'AtheismModBot', u'Lionelttson', u'unsta666', u'friendlygame', u'testbotjen', u'sumthenews', u'calcolatore', u'CussWordPolice', u'FixYourGrammarBot', u'Reads_Small_Text_Bot', u'DRKTipBot', u'chawalbanda33', u'ProselytizerBot', u'MultiFunctionBot', u'historymodbot', u'Makes_Small_Text_Bot', u'untouchedURL', 'iam4real', u'LazyLinkerBot', u'tennis_gif_bot', u'Cakeday-Bot', u'IsItDownBot', u'NoGrasiaABot', u'WhenisHL3', u'Gynocologist_Bot', 'havoc_bot', u'archlinkbot', u'rbutrBot', u'DirectImageLinkBot', u'Birthdaytits', u'fa_mirror', u'New_Small_Text_Bot', u'Top_Comment_Repo_Man', 'conspirobot', u'cris9696', u'siteswap-bot', u'SketchNotSkit', u'Anti-Brigade-Bot1917', u'OriginalLinkBot', u'TitleLinkerBot', u'BensonTheSecond', u'godwin_finder', u'SmallSubBot', u'whispen', u'Shrug_Bot', u'raiseyourdongersbot', u'hearing_aids_bot', u'upmo', u'IS_IT_SOLVED', u'AltCodeBot', u'Wink-Bot', u'random-imgur', 'hwsbot', u'snapshot_bot', u'replies_randomly', 'TheNitromeFan', u'khalifa111', u'GoodGuyGold', u'FrontpageWatch', u'RelevantSummary', u'redditreviewbot', u'liver1000', u'Shakespeares_Ghost', u'gfy_bot', u'FirstThingsFirst_bot', u'Stop_Insulting_Mommy', u'a_lot_vs_alot_bot', 'amici_ursi', u'L-O-S-E_not_LOOSE', u'ComradeBot_', u'myrtipbot', u'expired_link_bot', u'Translates10Times', u'BeetusBot', u'ExplanationBot', u'HelperBot_', u'comment_copier_bot', u'Shinybot', u'Bronze-Bot', u'poonta93', u'Brybot', u'SmartphonePitchforks', u'ThatsWhatSheSaidAI', u'adverb_bot', u'RelatedBot', u'youtube_unblock_bot', u'CreepySmileBot', u'rightsbot', u'fsctipbot', u'ParenthesisBot', u'IM_DEFINITELY_A_BOT', u'Clickbait_ReplierBot'])

Next, we used the following code to filter out comments posted by bot redditors that appeared in the bots list.

sf = sf[sf['author'].apply(lambda a: a not in delete_authors)]
len(sf)
sf.save("/data/reddit_data_no_txt_without_bots_and_deleted_authors.sframe")

2389764512

We are left with about 2.39 billion comments.

2. Analyzing Subreddits¶

We want to better understand the structure and evolution of subreddits. Let's calculate some interesting statistics on these subreddit communities. We will start by calculating the number of unique subreddits, and then we’ll create histograms of the number of posts on each subreddit.

#For running this section please make sure you created 'reddit_data_no_txt_without_bots_and_deleted_authors.sframe' as explained
# in the previous section
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import graphlab as gl
import graphlab.aggregate as agg

sns.set(color_codes=True)
sns.set_style("darkgrid")

print "The dataset contains %s unique subreddits." % len(sf['subreddit'].unique())
g = sf.groupby('subreddit', {'posts_num': agg.COUNT()})

The dataset contains 371833 unique subreddits.

sns.distplot(g['posts_num'],bins=20, axlabel="posts")

<matplotlib.axes._subplots.AxesSubplot at 0x7f5af854fd90>

We have 371,833 subreddits in the dataset. From the above histogram, we can see that the overwhelming majority of subreddits have very few posts. Let's look at the histogram of subreddits with at least a million posts.

g_mil = g[g['posts_num'] >= 1000000]
print "%s subreddits with at least a million posts" % len(g_mil)
sns.distplot(g_mil['posts_num'], axlabel="posts")

357 subreddits with at least a million posts

<matplotlib.axes._subplots.AxesSubplot at 0x7f5a84108ed0>

We discover that only 357 subreddits, less 0.1% of all the subreddits, contain more than a million posts. Let's calculate how many posts these subreddits contain in total.

print "The most popular subreddits contain %s posts" % g_mil['posts_num'].sum()

The most popular subreddits contain 1629867408 posts

The most popular subreddits contain over 1.62 billion posts. In other words, less than 0.1% of the subreddits contain 68.2% of the posts. This result reminds me of the fact that over 57% of the world's population lives in the ten most populous countries. Let's map the users' activity in each subbreddit. Namely, we will find how many distinct user names there are in each subreddit, and what subreddits have the most unique users.

g = sf.groupby('subreddit', {'distinct_authors_number':agg.COUNT_DISTINCT('author')})
g = g.sort('distinct_authors_number', ascending=False)
g.print_rows(100)

+----------------------+-------------------------+
|      subreddit       | distinct_authors_number |
+----------------------+-------------------------+
|      AskReddit       |         5465755         |
|        funny         |         3204002         |
|         pics         |         3076198         |
|         IAmA         |         2155718         |
|        gaming        |         2059042         |
|        videos        |         2021998         |
|         WTF          |         1962940         |
|    todayilearned     |         1902033         |
|    AdviceAnimals     |         1679658         |
|         aww          |         1525875         |
|      worldnews       |         1503139         |
|        movies        |         1341623         |
|         gifs         |         1327207         |
|        Music         |         1265940         |
|         news         |         1120131         |
|       politics       |         1078397         |
|   leagueoflegends    |          909849         |
|    Showerthoughts    |          903086         |
|       atheism        |          900926         |
|  mildlyinteresting   |          897297         |
|      technology      |          874948         |
|  explainlikeimfive   |          872558         |
|      reddit.com      |          836257         |
|        trees         |          791134         |
|       science        |          699283         |
|         tifu         |          627500         |
|     pcmasterrace     |          601513         |
|     LifeProTips      |          591906         |
|        books         |          566367         |
|       gonewild       |          552611         |
|       Fitness        |          551939         |
|         food         |          513428         |
|    relationships     |          506711         |
|      television      |          492369         |
| fffffffuuuuuuuuuuuu  |          475221         |
|        Jokes         |          458080         |
|         sex          |          452290         |
|   TwoXChromosomes    |          451801         |
|     nottheonion      |          447415         |
|   personalfinance    |          435008         |
|        creepy        |          433207         |
|       buildapc       |          423400         |
|       pokemon        |          417395         |
|      cringepics      |          379199         |
|    OldSchoolCool     |          378328         |
|        sports        |          375538         |
|         DIY          |          369555         |
|   GlobalOffensive    |          356078         |
|      Overwatch       |          353248         |
|      pokemongo       |          348015         |
|      Minecraft       |          346610         |
|    gameofthrones     |          342183         |
|       woahdude       |          342034         |
|      EarthPorn       |          331829         |
|        Games         |          327470         |
|         nfl          |          323519         |
|     hearthstone      |          318704         |
|       Android        |          314916         |
|      askscience      |          313791         |
|        DotA2         |          309898         |
|        bestof        |          309743         |
|        soccer        |          309067         |
|         wow          |          305485         |
|     friendsafari     |          298996         |
|      Futurology      |          294915         |
|   dataisbeautiful    |          289916         |
|     GetMotivated     |          288994         |
|         nba          |          285947         |
|     reactiongifs     |          282759         |
|  malefashionadvice   |          279534         |
|        4chan         |          274217         |
|        space         |          272824         |
|    DestinyTheGame    |          265169         |
|  millionairemakers   |          263846         |
|       nosleep        |          259567         |
|  interestingasfuck   |          259040         |
| ImGoingToHellForThis |          258763         |
|         Art          |          251100         |
|    tipofmytongue     |          250919         |
|        cringe        |          248369         |
|        skyrim        |          247361         |
|      circlejerk      |          246957         |
|     techsupport      |          242859         |
|  BlackPeopleTwitter  |          241724         |
|   photoshopbattles   |          235111         |
|        anime         |          233713         |
|       StarWars       |          233363         |
|      The_Donald      |          232584         |
|        AskMen        |          231575         |
|        promos        |          227740         |
|       gadgets        |          227630         |
|    Documentaries     |          227155         |
|     hiphopheads      |          224751         |
|        Frugal        |          223954         |
|         nsfw         |          218661         |
|    TumblrInAction    |          217394         |
|       Fallout        |          216836         |
|     programming      |          215156         |
|        Steam         |          214232         |
|         PS4          |          214131         |
+----------------------+-------------------------+
[371833 rows x 2 columns]

By calculating the elapsed time between users' first post and last post, we can also estimate how much time users have been active in each subreddit.

Important: running the following code block may take considerable time
sf['created_utc'] = sf['created_utc'].astype(int)
subreddit_users = sf.groupby(['subreddit', 'author'], {'start_date':agg.MIN('created_utc'), 'end_date': agg.MAX('created_utc'), 'posts_num':agg.COUNT()} )
subreddit_users['activity_time'] = subreddit_users.apply(lambda d: d['end_date'] - d['start_date'])
subreddit_users

Let's calculate the average time users have been active in each subreddit. To understand the activity time distribution across the subreddits, let's plot a histogram of average activity time.

g = subreddit_users.groupby('subreddit', {'avg_active_time_in_seconds': agg.AVG('activity_time')})
g['avg_active_time_in_days'] = g['avg_active_time_in_seconds'].apply(lambda sec: sec/(60.0*60.0*24))
sns.distplot(g['avg_active_time_in_days'], axlabel="days", bins=20 )

<matplotlib.axes._subplots.AxesSubplot at 0x7f93f80df8d0>

g['avg_active_time_in_days'].sketch_summary()

+--------------------+---------------+----------+
|        item        |     value     | is exact |
+--------------------+---------------+----------+
|       Length       |     371833    |   Yes    |
|        Min         |      0.0      |   Yes    |
|        Max         | 2208.85173611 |   Yes    |
|        Mean        | 24.0121476349 |   Yes    |
|        Sum         | 8928508.89152 |   Yes    |
|      Variance      | 3906.78028205 |   Yes    |
| Standard Deviation | 62.5042421124 |   Yes    |
|  # Missing Values  |       0       |   Yes    |
|  # unique values   |     201266    |    No    |
+--------------------+---------------+----------+

Most frequent items:
+-------+--------+-------------------+-------------------+-------------------+
| value |  0.0   | 2.31481481481e-05 | 6.94444444444e-05 | 1.15740740741e-05 |
+-------+--------+-------------------+-------------------+-------------------+
| count | 154663 |        383        |        381        |        372        |
+-------+--------+-------------------+-------------------+-------------------+
+-------------------+-------------------+-------------------+-------------------+
| 3.47222222222e-05 | 4.62962962963e-05 | 5.78703703704e-05 | 8.10185185185e-05 |
+-------------------+-------------------+-------------------+-------------------+
|        350        |        250        |        235        |        182        |
+-------------------+-------------------+-------------------+-------------------+
+-------------------+-------------------+
| 1.73611111111e-05 | 9.25925925926e-05 |
+-------------------+-------------------+
|        155        |        148        |
+-------------------+-------------------+

Quantiles: 
+-----+-----+-----+-----+-----------------+---------------+---------------+
|  0% |  1% |  5% | 25% |       50%       |      75%      |      95%      |
+-----+-----+-----+-----+-----------------+---------------+---------------+
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0447029320988 | 18.2136749847 | 130.748425926 |
+-----+-----+-----+-----+-----------------+---------------+---------------+
+---------------+---------------+
|      99%      |      100%     |
+---------------+---------------+
| 277.742471065 | 2208.85173611 |
+---------------+---------------+

We can see from the above results that most of the subreddits' users are active for a very limited time, with an average of less than a month and a median of less than a day.

3. Join-Rate Curves¶

In our study, we focused on analyzing the various patterns in which users join each subreddit (also referred to as the Join-Rate-Curve). In this section, we present the code that was used to create these curves. To create the join-rate-curve of each subreddit, we first created a TimeSeries object with the subreddit information. Throughout this section, we will use the Science subreddit as an example.

from datetime import datetime, timedelta

science_sf = sf[sf['subreddit'] =='science']
science_sf['datetime'] = science_sf['created_utc'].apply(lambda utc:datetime.utcfromtimestamp(int(utc)))
subreddit_ts = gl.TimeSeries(science_sf, index='datetime')
subreddit_ts

We will use the following function to create the subreddit user arrival curve from the TimeSeries object.

import math
def get_subreddit_users_arrival_curve(subreddit_ts, weeks_number=4):
    """
    Calculates the percent of authors that joined after X weeks from all authors that joined the subreddit
    between the date of the first comment and the date of the last comment
    :param subreddit_ts: TimeSeries with the subreddit posts information
    :param weeks_number: the number of weeks to set the time-interval between each two calculations
    :return: dict in which all the keys are the number of weeks since the first comment was posted and the
        corresponding percentage of authors that joined the subreddit up until this week
    :rtype: dict
    """
    dt = subreddit_ts.min_time
    end_dt = subreddit_ts.max_time
    authors = set()
    d = {0: 0}
    td = timedelta(days=7 * weeks_number)
    count = 1
    total_authors_num = float(len(subreddit_ts['author'].unique()))    
    while dt + td <= end_dt:
        ts = subreddit_ts.slice(dt, dt + td)
        authors |= set(ts['author'])
        print "Calculating the user arrival curve between %s and %s" % (dt, dt + td) 
        dt += td
        d[count * weeks_number] = len(authors) / total_authors_num
        count += 1
    ts = subreddit_ts.slice(dt, subreddit_ts.max_time)
    authors |= set(ts['author'])
    subreddit_age = subreddit_ts.max_time - subreddit_ts.min_time
    d[math.ceil(     subreddit_age.days/ 7.0)] = len(
        authors) / total_authors_num  # round the number of weeks up mainly for drawing the graph
    return d
d = get_subreddit_users_arrival_curve(subreddit_ts)

Calculating the user arrival curve between 2006-10-18 16:36:27 and 2006-11-15 16:36:27
Calculating the user arrival curve between 2006-11-15 16:36:27 and 2006-12-13 16:36:27
Calculating the user arrival curve between 2006-12-13 16:36:27 and 2007-01-10 16:36:27
Calculating the user arrival curve between 2007-01-10 16:36:27 and 2007-02-07 16:36:27
Calculating the user arrival curve between 2007-02-07 16:36:27 and 2007-03-07 16:36:27
Calculating the user arrival curve between 2007-03-07 16:36:27 and 2007-04-04 16:36:27
Calculating the user arrival curve between 2007-04-04 16:36:27 and 2007-05-02 16:36:27
Calculating the user arrival curve between 2007-05-02 16:36:27 and 2007-05-30 16:36:27
Calculating the user arrival curve between 2007-05-30 16:36:27 and 2007-06-27 16:36:27
Calculating the user arrival curve between 2007-06-27 16:36:27 and 2007-07-25 16:36:27
Calculating the user arrival curve between 2007-07-25 16:36:27 and 2007-08-22 16:36:27
Calculating the user arrival curve between 2007-08-22 16:36:27 and 2007-09-19 16:36:27
Calculating the user arrival curve between 2007-09-19 16:36:27 and 2007-10-17 16:36:27
Calculating the user arrival curve between 2007-10-17 16:36:27 and 2007-11-14 16:36:27
Calculating the user arrival curve between 2007-11-14 16:36:27 and 2007-12-12 16:36:27
Calculating the user arrival curve between 2007-12-12 16:36:27 and 2008-01-09 16:36:27
Calculating the user arrival curve between 2008-01-09 16:36:27 and 2008-02-06 16:36:27
Calculating the user arrival curve between 2008-02-06 16:36:27 and 2008-03-05 16:36:27
Calculating the user arrival curve between 2008-03-05 16:36:27 and 2008-04-02 16:36:27
Calculating the user arrival curve between 2008-04-02 16:36:27 and 2008-04-30 16:36:27
Calculating the user arrival curve between 2008-04-30 16:36:27 and 2008-05-28 16:36:27
Calculating the user arrival curve between 2008-05-28 16:36:27 and 2008-06-25 16:36:27
Calculating the user arrival curve between 2008-06-25 16:36:27 and 2008-07-23 16:36:27
Calculating the user arrival curve between 2008-07-23 16:36:27 and 2008-08-20 16:36:27
Calculating the user arrival curve between 2008-08-20 16:36:27 and 2008-09-17 16:36:27
Calculating the user arrival curve between 2008-09-17 16:36:27 and 2008-10-15 16:36:27
Calculating the user arrival curve between 2008-10-15 16:36:27 and 2008-11-12 16:36:27
Calculating the user arrival curve between 2008-11-12 16:36:27 and 2008-12-10 16:36:27
Calculating the user arrival curve between 2008-12-10 16:36:27 and 2009-01-07 16:36:27
Calculating the user arrival curve between 2009-01-07 16:36:27 and 2009-02-04 16:36:27
Calculating the user arrival curve between 2009-02-04 16:36:27 and 2009-03-04 16:36:27
Calculating the user arrival curve between 2009-03-04 16:36:27 and 2009-04-01 16:36:27
Calculating the user arrival curve between 2009-04-01 16:36:27 and 2009-04-29 16:36:27
Calculating the user arrival curve between 2009-04-29 16:36:27 and 2009-05-27 16:36:27
Calculating the user arrival curve between 2009-05-27 16:36:27 and 2009-06-24 16:36:27
Calculating the user arrival curve between 2009-06-24 16:36:27 and 2009-07-22 16:36:27
Calculating the user arrival curve between 2009-07-22 16:36:27 and 2009-08-19 16:36:27
Calculating the user arrival curve between 2009-08-19 16:36:27 and 2009-09-16 16:36:27
Calculating the user arrival curve between 2009-09-16 16:36:27 and 2009-10-14 16:36:27
Calculating the user arrival curve between 2009-10-14 16:36:27 and 2009-11-11 16:36:27
Calculating the user arrival curve between 2009-11-11 16:36:27 and 2009-12-09 16:36:27
Calculating the user arrival curve between 2009-12-09 16:36:27 and 2010-01-06 16:36:27
Calculating the user arrival curve between 2010-01-06 16:36:27 and 2010-02-03 16:36:27
Calculating the user arrival curve between 2010-02-03 16:36:27 and 2010-03-03 16:36:27
Calculating the user arrival curve between 2010-03-03 16:36:27 and 2010-03-31 16:36:27
Calculating the user arrival curve between 2010-03-31 16:36:27 and 2010-04-28 16:36:27
Calculating the user arrival curve between 2010-04-28 16:36:27 and 2010-05-26 16:36:27
Calculating the user arrival curve between 2010-05-26 16:36:27 and 2010-06-23 16:36:27
Calculating the user arrival curve between 2010-06-23 16:36:27 and 2010-07-21 16:36:27
Calculating the user arrival curve between 2010-07-21 16:36:27 and 2010-08-18 16:36:27
Calculating the user arrival curve between 2010-08-18 16:36:27 and 2010-09-15 16:36:27
Calculating the user arrival curve between 2010-09-15 16:36:27 and 2010-10-13 16:36:27
Calculating the user arrival curve between 2010-10-13 16:36:27 and 2010-11-10 16:36:27
Calculating the user arrival curve between 2010-11-10 16:36:27 and 2010-12-08 16:36:27
Calculating the user arrival curve between 2010-12-08 16:36:27 and 2011-01-05 16:36:27
Calculating the user arrival curve between 2011-01-05 16:36:27 and 2011-02-02 16:36:27
Calculating the user arrival curve between 2011-02-02 16:36:27 and 2011-03-02 16:36:27
Calculating the user arrival curve between 2011-03-02 16:36:27 and 2011-03-30 16:36:27
Calculating the user arrival curve between 2011-03-30 16:36:27 and 2011-04-27 16:36:27
Calculating the user arrival curve between 2011-04-27 16:36:27 and 2011-05-25 16:36:27
Calculating the user arrival curve between 2011-05-25 16:36:27 and 2011-06-22 16:36:27
Calculating the user arrival curve between 2011-06-22 16:36:27 and 2011-07-20 16:36:27
Calculating the user arrival curve between 2011-07-20 16:36:27 and 2011-08-17 16:36:27
Calculating the user arrival curve between 2011-08-17 16:36:27 and 2011-09-14 16:36:27
Calculating the user arrival curve between 2011-09-14 16:36:27 and 2011-10-12 16:36:27
Calculating the user arrival curve between 2011-10-12 16:36:27 and 2011-11-09 16:36:27
Calculating the user arrival curve between 2011-11-09 16:36:27 and 2011-12-07 16:36:27
Calculating the user arrival curve between 2011-12-07 16:36:27 and 2012-01-04 16:36:27
Calculating the user arrival curve between 2012-01-04 16:36:27 and 2012-02-01 16:36:27
Calculating the user arrival curve between 2012-02-01 16:36:27 and 2012-02-29 16:36:27
Calculating the user arrival curve between 2012-02-29 16:36:27 and 2012-03-28 16:36:27
Calculating the user arrival curve between 2012-03-28 16:36:27 and 2012-04-25 16:36:27
Calculating the user arrival curve between 2012-04-25 16:36:27 and 2012-05-23 16:36:27
Calculating the user arrival curve between 2012-05-23 16:36:27 and 2012-06-20 16:36:27
Calculating the user arrival curve between 2012-06-20 16:36:27 and 2012-07-18 16:36:27
Calculating the user arrival curve between 2012-07-18 16:36:27 and 2012-08-15 16:36:27
Calculating the user arrival curve between 2012-08-15 16:36:27 and 2012-09-12 16:36:27
Calculating the user arrival curve between 2012-09-12 16:36:27 and 2012-10-10 16:36:27
Calculating the user arrival curve between 2012-10-10 16:36:27 and 2012-11-07 16:36:27
Calculating the user arrival curve between 2012-11-07 16:36:27 and 2012-12-05 16:36:27
Calculating the user arrival curve between 2012-12-05 16:36:27 and 2013-01-02 16:36:27
Calculating the user arrival curve between 2013-01-02 16:36:27 and 2013-01-30 16:36:27
Calculating the user arrival curve between 2013-01-30 16:36:27 and 2013-02-27 16:36:27
Calculating the user arrival curve between 2013-02-27 16:36:27 and 2013-03-27 16:36:27
Calculating the user arrival curve between 2013-03-27 16:36:27 and 2013-04-24 16:36:27
Calculating the user arrival curve between 2013-04-24 16:36:27 and 2013-05-22 16:36:27
Calculating the user arrival curve between 2013-05-22 16:36:27 and 2013-06-19 16:36:27
Calculating the user arrival curve between 2013-06-19 16:36:27 and 2013-07-17 16:36:27
Calculating the user arrival curve between 2013-07-17 16:36:27 and 2013-08-14 16:36:27
Calculating the user arrival curve between 2013-08-14 16:36:27 and 2013-09-11 16:36:27
Calculating the user arrival curve between 2013-09-11 16:36:27 and 2013-10-09 16:36:27
Calculating the user arrival curve between 2013-10-09 16:36:27 and 2013-11-06 16:36:27
Calculating the user arrival curve between 2013-11-06 16:36:27 and 2013-12-04 16:36:27
Calculating the user arrival curve between 2013-12-04 16:36:27 and 2014-01-01 16:36:27
Calculating the user arrival curve between 2014-01-01 16:36:27 and 2014-01-29 16:36:27
Calculating the user arrival curve between 2014-01-29 16:36:27 and 2014-02-26 16:36:27
Calculating the user arrival curve between 2014-02-26 16:36:27 and 2014-03-26 16:36:27
Calculating the user arrival curve between 2014-03-26 16:36:27 and 2014-04-23 16:36:27
Calculating the user arrival curve between 2014-04-23 16:36:27 and 2014-05-21 16:36:27
Calculating the user arrival curve between 2014-05-21 16:36:27 and 2014-06-18 16:36:27
Calculating the user arrival curve between 2014-06-18 16:36:27 and 2014-07-16 16:36:27
Calculating the user arrival curve between 2014-07-16 16:36:27 and 2014-08-13 16:36:27
Calculating the user arrival curve between 2014-08-13 16:36:27 and 2014-09-10 16:36:27
Calculating the user arrival curve between 2014-09-10 16:36:27 and 2014-10-08 16:36:27
Calculating the user arrival curve between 2014-10-08 16:36:27 and 2014-11-05 16:36:27
Calculating the user arrival curve between 2014-11-05 16:36:27 and 2014-12-03 16:36:27
Calculating the user arrival curve between 2014-12-03 16:36:27 and 2014-12-31 16:36:27
Calculating the user arrival curve between 2014-12-31 16:36:27 and 2015-01-28 16:36:27
Calculating the user arrival curve between 2015-01-28 16:36:27 and 2015-02-25 16:36:27
Calculating the user arrival curve between 2015-02-25 16:36:27 and 2015-03-25 16:36:27
Calculating the user arrival curve between 2015-03-25 16:36:27 and 2015-04-22 16:36:27
Calculating the user arrival curve between 2015-04-22 16:36:27 and 2015-05-20 16:36:27
Calculating the user arrival curve between 2015-05-20 16:36:27 and 2015-06-17 16:36:27
Calculating the user arrival curve between 2015-06-17 16:36:27 and 2015-07-15 16:36:27
Calculating the user arrival curve between 2015-07-15 16:36:27 and 2015-08-12 16:36:27
Calculating the user arrival curve between 2015-08-12 16:36:27 and 2015-09-09 16:36:27
Calculating the user arrival curve between 2015-09-09 16:36:27 and 2015-10-07 16:36:27
Calculating the user arrival curve between 2015-10-07 16:36:27 and 2015-11-04 16:36:27
Calculating the user arrival curve between 2015-11-04 16:36:27 and 2015-12-02 16:36:27
Calculating the user arrival curve between 2015-12-02 16:36:27 and 2015-12-30 16:36:27
Calculating the user arrival curve between 2015-12-30 16:36:27 and 2016-01-27 16:36:27
Calculating the user arrival curve between 2016-01-27 16:36:27 and 2016-02-24 16:36:27
Calculating the user arrival curve between 2016-02-24 16:36:27 and 2016-03-23 16:36:27
Calculating the user arrival curve between 2016-03-23 16:36:27 and 2016-04-20 16:36:27
Calculating the user arrival curve between 2016-04-20 16:36:27 and 2016-05-18 16:36:27
Calculating the user arrival curve between 2016-05-18 16:36:27 and 2016-06-15 16:36:27
Calculating the user arrival curve between 2016-06-15 16:36:27 and 2016-07-13 16:36:27
Calculating the user arrival curve between 2016-07-13 16:36:27 and 2016-08-10 16:36:27
Calculating the user arrival curve between 2016-08-10 16:36:27 and 2016-09-07 16:36:27
Calculating the user arrival curve between 2016-09-07 16:36:27 and 2016-10-05 16:36:27

keys = d.keys()
keys.sort()
values = [d[k] for k in keys]
plt.plot(keys, values)
plt.xlabel('Weeks')
plt.ylabel('Percentage')

<matplotlib.text.Text at 0x7f93d8055290>

4. Creating the Subreddit Social Network¶

In our study, we constructed the subreddit social networks by creating links between users that replied to other users’ posts. In this section, we will present the code which was used to create the subreddits' underlying social networks. As an example, we will use the Datasets subreddit's social network.

def get_subreddit_vertices_timeseries(subreddit_sf):
    """
    Creates a vertices Timeseries object
    :return: TimeSeries with the join time of each user to each subreddit
     :rtype: gl.TimeSeries
    """
    sf =  subreddit_sf.groupby("author", {'mindate': agg.MIN("created_utc"),
                                         'maxdate': agg.MAX("created_utc")})
    sf['mindate'] = sf['mindate'].apply(lambda timestamp: datetime.fromtimestamp(timestamp))
    sf['maxdate'] = sf['maxdate'].apply(lambda timestamp: datetime.fromtimestamp(timestamp))
    sf.rename({"author": "v_id"})
    return gl.TimeSeries(sf, index='mindate')


def get_subreddit_interactions_timeseries(subreddit_sf):
    """
    Creates subreddits interactions TimeSeries. Interaction exists between two subreddit users if user A posted a comment
     and user B replied to A's comment
    :return: TimeSeries with the subreddit interactions
    :rtype: gl.TimeSeries
    """
    subreddit_sf['parent_name'] = subreddit_sf["parent_id"]
    subreddit_sf['parent_kind'] = subreddit_sf['parent_id'].apply(lambda i: i.split("_")[0] if "_" in i and i.startswith("t") else None)
    subreddit_sf['parent_id'] = subreddit_sf['parent_id'].apply(lambda i: i.split("_")[1] if "_" in i and i.startswith("t") else None)

    sf = subreddit_sf[subreddit_sf['parent_kind'] == 't1'] # only reply to comments counts
    sf = sf['author', "created_utc", "id", "parent_id", "link_id"]
    sf = sf.join(subreddit_sf, on={"parent_id": "id"})
    sf['datetime'] = sf['created_utc'].apply(lambda timestamp: datetime.fromtimestamp(timestamp))
    sf.rename({'author': 'src_id', 'author.1': 'dst_id'})
    sf = sf['src_id', 'dst_id', 'datetime']
    return gl.TimeSeries(sf, index='datetime')

def create_sgraph(v_ts, i_ts):
    """
    Creates an SGraph object from the vertices and interaction TimeSeries objects
    :param v_ts: vertices TimeSeries
    :param i_ts: interactions TimeSeries
    :return: SGraph with the input data
    """
    edges = i_ts.to_sframe().groupby(['src_id', "dst_id"], operations={'weight': agg.COUNT(),
                                                                     'mindate': agg.MIN('datetime'),
                                                                     'maxdate': agg.MAX('datetime')})

    g = gl.SGraph(vertices=v_ts.to_sframe(), edges=edges, vid_field="v_id", src_field="src_id", dst_field="dst_id")
    return g


datasets_sf = sf[sf['subreddit'] =='datasets']
datasets_sf.__materialize__()
datasets_sf["created_utc"] = datasets_sf["created_utc"].astype(int)
vertices = get_subreddit_vertices_timeseries(datasets_sf)
interactions = get_subreddit_interactions_timeseries(datasets_sf)
vertices.print_rows(10)
interactions.print_rows(10)
g = create_sgraph(vertices,interactions)

+---------------------+----------------+---------------------+
|       mindate       |      v_id      |       maxdate       |
+---------------------+----------------+---------------------+
| 2010-04-05 08:18:54 |    voltagex    | 2015-09-27 04:06:28 |
| 2010-04-05 08:50:31 |      qwak      | 2010-04-09 08:37:35 |
| 2010-04-05 14:13:55 |      brey      | 2010-04-05 14:13:55 |
| 2010-04-05 14:44:42 | actionscripted | 2013-12-01 06:50:26 |
| 2010-04-05 15:00:51 |     bcain      | 2012-11-14 17:57:45 |
| 2010-04-05 15:57:46 |    reconbot    | 2010-04-05 15:57:46 |
| 2010-04-05 19:26:25 |      jc4p      | 2010-04-06 01:05:28 |
| 2010-04-05 22:27:12 |    Orborde     | 2010-04-05 22:27:12 |
| 2010-04-05 23:57:52 |     dmwit      | 2010-04-05 23:57:52 |
| 2010-04-06 00:11:33 |   alphabeat    | 2011-09-15 02:22:17 |
+---------------------+----------------+---------------------+
[4047 rows x 3 columns]

+---------------------+----------+----------+
|       datetime      |  src_id  |  dst_id  |
+---------------------+----------+----------+
| 2010-04-05 15:00:51 |  bcain   | voltagex |
| 2010-04-05 23:57:52 |  dmwit   | Orborde  |
| 2010-04-06 00:28:06 | voltagex |   jc4p   |
| 2010-04-06 01:05:28 |   jc4p   | voltagex |
| 2010-04-06 02:23:23 | voltagex |   jc4p   |
| 2010-04-07 02:49:08 | voltagex | defrost  |
| 2010-04-07 03:44:49 | defrost  | voltagex |
| 2010-04-07 06:34:23 | voltagex | defrost  |
| 2010-04-08 21:59:55 | voltagex | Orborde  |
| 2010-04-09 00:40:58 | deserted |  tibbon  |
+---------------------+----------+----------+
[4811 rows x 3 columns]

We created an SGraph object from the SFrame. Let's visualize the constructed social network.

g.summary()

{'num_edges': 3935, 'num_vertices': 4047}

import os
import logging
import bz2
from datetime import datetime
import graphlab as gl
import graphlab.aggregate as agg
import fnmatch

#gl.canvas.set_target('ipynb')
gl.set_runtime_config('GRAPHLAB_CACHE_FILE_LOCATIONS', '/data/tmp')
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS', 128)
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 128)


basedir = "/data/reddit/raw" # Replace this with the directory which you downloaded the file into
sframes_dir = "/data/reddit/sframes/" # Replace this with the directory you want to save the SFrame to
tmp_dir =  "/data/tmp" # Replace this with the directory you want to save the SFrame to
join_sframe_path = sframes_dir + os.path.sep + "join_all.sframe" # Where to save the join large SFrame object
sf = gl.load_sframe("/data/reddit_data_no_txt_without_bots_and_deleted_authors.sframe")
subreddit_users = gl.load_sframe("/data/subreddits_users.sframe")

This non-commercial license of GraphLab Create for academic use is assigned.

[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1487650195.log

We can use GraphLab's analytics toolkit to calculate various topological properties, such as the degree distribution and the graph's number of triangles. We hope to elaborate on this in a future tutorial. For now, let's use Networkx package to draw this subreddit's social network.

import networkx as nx
def sgraph2nxgraph(sgraph):
    """ Converts a directed sgraph object into networkx object
    :param sgraph: GraphLab SGraph object
    :return: Networkx Directed Graph object
    """
    nx_g = nx.DiGraph()
    vertices = list(sgraph.get_vertices()['__id'])
    edges = [(e['__src_id'], e['__dst_id']) for e in sgraph.get_edges()]
    nx_g.add_nodes_from(vertices)
    nx_g.add_edges_from(edges)
    return nx_g
nx_g = sgraph2nxgraph(g)

Let's draw a subgraph of the subreddit

import networkx as nx
import random
def draw_graph(g,layout_func=nx.spring_layout):
    pos = layout_func(g)
    d = nx.degree(g)
    n_sizes = [v * 25 +5 for v in d.values()]
    nx.draw(g, nodelist=d.keys(), node_size=n_sizes, node_color='blue')

# Selecting only sample of vertices with degree greater than 0
d = nx.degree(nx_g)    
v_list = [v for v in d.keys() if d[v] >0]
h = nx_g.subgraph(v_list[:500])
draw_graph(h)

5. Where to Go from Here¶

The social network dataset created as a result of our study opens the door for new and exciting research opportunities. This dataset can help not only to better understand the social structure of the Reddit community in particular, but also to understand how social networks and online communities evolve over time. Moreover, this corpus can be used as a ground-truth dataset for many studies in the field of social networks. Some examples of what can be done with this corpus are:

Understanding topological factors that may help a post to go viral.
Helping to better understand diffusion models and also assisting in validating diffusion models using real-world data.
Using this dataset as an excellent ground-truth dataset for evaluating entity-matching algorithms, based on our observation that many redditors are members in several subreddits.
Helping understand the connection between content and social networks; for instance, this dataset can provide insight on what type of content makes users more likely to interact with each other.

We would love to hear other ideas on what possible work can be done with our provided datasets.

author	subreddit	start_date	end_date	posts_num	activity_time
d4mini0n	ainbow	1334038071	1427931191	9	93893120
polinreddit	news	1325830586	1325830586	1	0
goliveyourdreams	relationships	1337470570	1451506045	27	114035475
ssa111	spreading	1448388980	1451500948	2	3111968
SugarBeets	malelivingspace	1356540644	1474085882	20	117545238
Wejurt	canucks	1429683528	1430083478	8	399950
themonthofmae	relationships	1468165432	1468165480	2	48
Puckhead7	ColoradoAvalanche	1419580278	1476590616	43	57010338
throwitawaytrustme	relationships	1448225775	1448225775	1	0
spinstartshere	Lenovo	1455307194	1456503969	3	1196775

datetime	author	author_flair_css_class	author_flair_text	body
2006-10-18 16:36:27	conrad_hex	null	null	It sort of sounds like the caffeinated-baby ...
2006-10-18 16:56:54	the_seanald	null	null	The 10 year old inside me chuckled at "pinches off ...
2006-10-18 17:07:48	Mr_Smartypants	null	null	One would think a journalist (or even a ...
2006-10-18 17:10:15	tcervl	null	null	almost useful!\r\n\r\nFor me, in computer science, ...
2006-10-18 17:18:07	Lagged2Death	null	null	An entertaining rant. Definitely worth a ...
2006-10-18 18:37:48	anonymgrl	null	null	:)
2006-10-18 19:13:51	Saintstace	null	null	No, but you just gave me the idea to do so. ...
2006-10-18 19:30:02	_jjsonp	null	null	that is fascinating. i wonder how long before ...
2006-10-18 20:46:34	dsandler	null	null	Or, at least, bolder, which is a trait that ...
2006-10-18 21:09:39	NitsujTPU	compsci	PhD \| Computer Science	Mind Hacks is awesome. The book is totally w ...

controversiality	created_utc	distinguished	edited	id	link_id	parent_id	retrieved_on	score
0	1161189387	null	false	cmqc1	t3_mpuf	t3_mpuf	1473802574	2
0	1161190614	null	false	cmqf7	t3_mp5w	t3_mp5w	1473802577	7
0	1161191268	null	false	cmqha	t3_mpuf	t3_mpuf	1473802577	2
0	1161191415	null	false	cmqhk	t3_mppf	t3_mppf	1473802577	3
0	1161191887	null	false	cmqiv	t3_mpjm	t3_mpjm	1473802578	4
1	1161196668	null	false	cmqyw	t3_mp5w	t1_cmqf7	1473802587	2
0	1161198831	null	false	cmr6f	t3_mr3t	t3_mr3t	1473802591	3
0	1161199802	null	false	cmrbo	t3_mp5w	t3_mp5w	1473802593	4
0	1161204394	null	false	cmrsx	t3_mpuf	t1_cmqc1	1473802603	1
0	1161205779	null	false	cmrxx	t3_mqcp	t3_mqcp	1473802605	0

stickied	subreddit	subreddit_id	ups	month	year	banned_by	replies	removal_reason	likes	user_reports
false	science	t5_mouw	2	10	2006	None	None	None	None	None
false	science	t5_mouw	7	10	2006	None	None	None	None	None
false	science	t5_mouw	2	10	2006	None	None	None	None	None
false	science	t5_mouw	3	10	2006	None	None	None	None	None
false	science	t5_mouw	4	10	2006	None	None	None	None	None
false	science	t5_mouw	2	10	2006	None	None	None	None	None
false	science	t5_mouw	3	10	2006	None	None	None	None	None
false	science	t5_mouw	4	10	2006	None	None	None	None	None
false	science	t5_mouw	1	10	2006	None	None	None	None	None
false	science	t5_mouw	0	10	2006	None	None	None	None	None