cRedditscore package

Submodules

cRedditscore.cRedditscore module

A module of tools to train and package predictive models for the quality of comments on Reddit.

class cRedditscore.cRedditscore.TermFreqModel(comments_df, low_thresh=0, high_thresh=15)[source]

Bases: object

A Naive Bayes model predicting the quality of comments on Reddit. The data is input as a pandas dataframe with columns

  • comment_id: a unique identifier for each comment
  • score (int): the score of the comment
  • content: the comment itself
  • timestamp: the time at which the comment was observed, as in the example below

For example, we make a small comments data set

>>> import pandas as pd
>>> from cRedditscore.cRedditscore import TermFreqModel
>>> test_df = pd.DataFrame([
...     [1, 1, "That's cool", 1],
...     [2, -3, 'boo you', 2],
...     [3, 4, 'I love you', 2],
...     [3, 16, 'I love you', 4],
...     ], columns=['comment_id', 'score', 'content', 'timestamp'])

and build a TermFreqModel object from it.

>>> tfm = TermFreqModel(comments_df=test_df)
Parameters:
  • comments_df (pandas.core.frame.DataFrame) – the dataframe containing the comment data
  • low_thresh (int) – the lower bound for the score of a neutral comment. Anything lower is considered a bad comment
  • high_thresh (int) – the upper bound for the score of a neutral comment. Anything higher is considered a good comment
add_qual_feature(df)[source]

Add the comment quality feature to the data as a new column named qual. This will be our outcome variable.

Parameters: df (pandas.core.frame.DataFrame) – the data frame to add the qual column to
dump_model(pickle_name='text_mnb_model')[source]

Dump the model object to file with pickle.

Parameters: pickle_name (string) – the name of the file to dump the object to
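
The round trip that dump_model relies on can be sketched with the standard pickle module. This is a minimal sketch rather than the package's own code: the dictionary model_stub is a hypothetical stand-in for a fitted model, and writing into a temporary directory is an assumption made so the example leaves no files behind. Only the file name text_mnb_model is taken from the default above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object works.
model_stub = {"name": "text_mnb_model", "classes": ["bad", "good"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_mnb_model")

    # Serialize the object to a file opened in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(model_stub, handle)

    # Load it back to confirm the round trip preserves the object.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.
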
fit()[source]

Fit the model.

get_data()[source]

Get the full data set of the model.

Returns: the full data set underlying the model
Return type: pandas.core.frame.DataFrame
get_good_bad(df)[source]

Get the good and bad comments in a data set.

Parameters: df (pandas.core.frame.DataFrame) – the comments data set
Returns: the dataframes containing only the good and bad comments from df
Return type: pandas.core.frame.DataFrame, pandas.core.frame.DataFrame
make_model(test_size=0.2, ngram_range=(1, 4), max_features=1000)[source]

This function adds a new class attribute

  • model:

    the Naive Bayes model as an sklearn.pipeline.Pipeline object

For example, we make a small comments data set,

>>> import pandas as pd
>>> from cRedditscore.cRedditscore import TermFreqModel
>>> test_df = pd.DataFrame([
...     [1, 1, "That's cool", 1],
...     [2, -3, 'boo you', 2],
...     [3, 4, 'I love you', 2],
...     [3, 16, 'I love you', 4],
...     ], columns=['comment_id', 'score', 'content', 'timestamp'])

build a TermFreqModel object from it,

>>> tfm = TermFreqModel(test_df)

and train a model on it that predicts the quality of a comment.

>>> tfm.train_test(test_size=0.1)
>>> tfm.make_model(ngram_range=(1, 3), max_features=10)
>>> tfm.fit()
>>> prediction = tfm.model.predict(['Thanks for a great post!'])
>>> prediction in ['good', 'bad']
True
Parameters:
  • test_size (float) – the fraction of data points to hold out for testing (for example, 0.2 holds out 20%)
  • ngram_range (tuple) – the range of n for ngrams to include as features
  • max_features (int) – the maximum number of features to include
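
One plausible shape for such a pipeline is a bag-of-n-grams vectorizer feeding a multinomial Naive Bayes classifier. The sketch below is an assumption, not the package's code: the documentation states only that the model is a Naive Bayes model stored as an sklearn.pipeline.Pipeline, so the choice of CountVectorizer and MultinomialNB, the step names vect and clf, and the toy training data are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, made up for illustration.
texts = ["I love you", "That's cool", "boo you", "you are awful"]
labels = ["good", "good", "bad", "bad"]

# Assumed pipeline: n-gram counts followed by multinomial Naive Bayes.
sketch_model = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 3), max_features=10)),
    ("clf", MultinomialNB()),
])
sketch_model.fit(texts, labels)

# The pipeline accepts raw strings because the vectorizer is its first step.
prediction = sketch_model.predict(["Thanks for a great post!"])[0]
```

Here ngram_range=(1, 3) keeps unigrams, bigrams and trigrams, and max_features=10 caps the vocabulary at the ten most frequent of them.
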
most_recent_obs(df)[source]

Select the most recent observation of each comment in a data set.

Parameters: df (pandas.core.frame.DataFrame) – the data set of comments
Returns: the data set containing only the most recent observation of each comment in df
Return type: pandas.core.frame.DataFrame
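
The selection described above can be sketched in pandas by sorting on the timestamp and keeping the last row for each comment_id. This is a sketch under the assumption that a larger timestamp means a later observation; the package may implement the step differently.

```python
import pandas as pd

# The same small data set used in the class example above.
obs = pd.DataFrame(
    [
        [1, 1, "That's cool", 1],
        [2, -3, "boo you", 2],
        [3, 4, "I love you", 2],
        [3, 16, "I love you", 4],
    ],
    columns=["comment_id", "score", "content", "timestamp"],
)

# Sort by timestamp, then keep the last (most recent) row per comment_id.
latest = (
    obs.sort_values("timestamp")
    .drop_duplicates(subset="comment_id", keep="last")
    .sort_values("comment_id")
    .reset_index(drop=True)
)
```

Comment 3 appears twice in the input; only its later observation, with score 16, remains in latest.
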
setup_data()[source]

Set up the data for model training. In detail:

  • Remove all but the most recent observation for each comment
  • Add the quality feature to the data, which will be our outcome variable
  • Separate out the good and bad comments. This will be the data we train the model on.

This function adds a new class attribute

  • good_bad_df:

    the dataframe containing only the most recent observations of the good and bad comments

train_test(test_size=0.2)[source]

Split the data into train and test sets.

This function adds new class attributes

  • X_train and X_test,

    The features of the training and test parts of the data set

  • y_train and y_test,

    The outcomes of the training and test parts of the data set

Parameters: test_size (float) – the fraction of data points to hold out for testing (for example, 0.2 holds out 20%)
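
To make the meaning of test_size concrete, here is a dependency-free sketch of a random hold-out split. The helper split_indices and its seed parameter are hypothetical; the package may well delegate to an existing splitting utility instead. The sketch treats test_size as a fraction, which is what the default of 0.2 suggests.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Return (train, test) index lists; test_size is a fraction in (0, 1)."""
    indices = list(range(n_rows))
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    # Hold out at least one row so the test set is never empty.
    n_test = max(1, int(round(n_rows * test_size)))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2)
```
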
cRedditscore.cRedditscore.get_quality(score, low_thresh=0, high_thresh=15)[source]

Get the quality (good, bad, neutral) of a score based on the score thresholds.

For example,

>>> get_quality(score=15)
'neutral'
>>> get_quality(score=2, low_thresh=5, high_thresh=20)
'bad'
>>> get_quality(score=2, low_thresh=-10, high_thresh=1)
'good'
Parameters:
  • score (int) – the score of the comment
  • low_thresh (int) – the low threshold of a neutral comment
  • high_thresh (int) – the high threshold of a neutral comment
Returns:

the quality of the comment (good, bad or neutral)

Return type:

string
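
The thresholding rule implied by the examples above can be written out as follows. The function quality_sketch is a hypothetical re-statement for illustration, not the package's implementation; it assumes both thresholds are inclusive bounds of the neutral range, which is consistent with a score of 15 being neutral under the defaults.

```python
def quality_sketch(score, low_thresh=0, high_thresh=15):
    """Classify a score as 'bad', 'neutral' or 'good' (illustrative only)."""
    if score < low_thresh:
        return "bad"      # strictly below the neutral range
    if score > high_thresh:
        return "good"     # strictly above the neutral range
    return "neutral"      # low_thresh <= score <= high_thresh
```
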

cRedditscore.collection module

A module of tools to collect comment data from Reddit.

class cRedditscore.collection.Collect(user_agent=None, subreddits=['dataisbeautiful', 'programming', 'technology', 'python', 'cpp', 'funny', 'news', 'science'], subm_table=None, comm_table=None, conn=None, debug=False)[source]

Bases: object

A class to handle the collecting and storing of comment data from Reddit.

In practice, collection requires the name of a Reddit web app to connect to the Reddit API. By default, no web app is passed to the class, in which case it cannot collect any data.

The database for storing comments has a table subm_table for storing submission data with columns

  • submission_id (varchar(45)),
  • subreddit (varchar(45)), and
  • timestamp (int(11))

and a table comm_table for storing comment data with columns

  • submission_id (varchar(45)),
  • subm_title (varchar(45)),
  • subm_content (blob),
  • subm_created (int(11)),
  • subm_created_local (int(11)),
  • subm_score (int(11)),
  • subm_author (varchar(45)),
  • subm_num_comments (int(11)),
  • comment_id (varchar(45)),
  • user_id (varchar(45)),
  • prev_comment_id (varchar(45)),
  • created (int(11)),
  • created_local (int(11)),
  • timestamp (int(11)),
  • content (blob),
  • subreddit (varchar(45)),
  • score (int(11)),
  • ups (int(11)),
  • downs (int(11)), and
  • controversiality (int(11)).
Parameters:
  • user_agent (string) – the name of the web app to connect to the API through
  • subreddits (list) – the list of subreddits to collect from
  • subm_table (sqlalchemy.sql.schema.Table) – the submissions SQL table
  • comm_table (sqlalchemy.sql.schema.Table) – the comments SQL table
  • conn (sqlalchemy.engine.base.Connection) – the connection to the SQL database

For example,

>>> col = Collect()
>>> col.subreddits[0]
'dataisbeautiful'
comment_to_db(comment=None, subm_values=None)[source]

Add comment to the database.

fetch_subm_from_table(subm)[source]

Fetch the rows corresponding to subm in self.subm_table.

Returns: the rows in self.subm_table that come from subm
Return type: list
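
The kind of lookup this method performs can be illustrated with the standard sqlite3 module standing in for the SQLAlchemy table and connection that the class actually takes. Everything specific here is an assumption: the table name submissions, the sample ids, and the use of SQLite at all. Only the three column names come from the subm_table description above.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions ("
    "submission_id VARCHAR(45), subreddit VARCHAR(45), timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?)",
    [("abc123", "python", 1400000000), ("def456", "science", 1400000100)],
)

# Fetch every row recorded for one submission id, using a bound parameter.
rows = conn.execute(
    "SELECT submission_id, subreddit, timestamp FROM submissions "
    "WHERE submission_id = ?",
    ("abc123",),
).fetchall()
conn.close()
```
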
get_random_subm()[source]

Get a random submission from Reddit.

Returns: a random submission from a random subreddit in self.subreddits if self.red exists, else None
Return type: praw.objects.Submission
rand_subm_to_db()[source]

Add a random submission to the database.

subm_to_db(subm)[source]

Add subm to the database with all of its comments.

cRedditscore.evaluation module

A module of tools to evaluate predictive models.

exception cRedditscore.evaluation.EvalError(msg)[source]

Bases: exceptions.Exception

Errors in evaluating the model, for example a missing predict function.

class cRedditscore.evaluation.Evaluate(model=None, data_features=None, data_responses=None, pos_label=None)[source]

Bases: object

Evaluate a predictive binary classification model on a choice of metrics including accuracy, AUC, precision and recall. The model is assumed to have functions

  • fit (for cross validation): takes as input the training set features and responses and fits the model to the training set
  • predict (for accuracy, precision, recall and cross validation): takes as input a list of observations and outputs a list of predictions
  • predict_proba (for AUC and drawing the ROC curve): takes as input a list of observations and outputs a list of probabilities that the response belongs to the first class

For example, we make a dummy test set and model:

>>> import numpy as np
>>> from cRedditscore.evaluation import Evaluate, GenModel
>>> model = GenModel(predict=lambda x: ['blue' for i in x])
>>> test_features = range(20)
>>> test_responses = np.array(
...     ['blue' if i%2==0 else 'green' for i in range(20)]
...     )

and make an Evaluate object to test its accuracy.

>>> evaluator = Evaluate(model)
>>> evaluator.accuracy(test_features=test_features,
...     test_responses=test_responses)
0.5
Parameters:
  • model – the model to evaluate, described above
  • data_features (array-like) – the features of the data to evaluate the model on, generally the training set features
  • data_responses (array-like) – the responses of the data to evaluate the model on
  • pos_label – the class to be considered positive for AUC, precision and recall; if None, the first class is picked
accuracy(test_responses, test_features=None, predictions=None, train=None)[source]

Find the accuracy of the model on a given test set.

Parameters:
  • test_responses (array-like) – the responses of the test set
  • test_features (array-like) – the features of the test set to predict on. If None, use the pre-made predictions
  • predictions (array-like) – the predictions to test. If None, use test_features to predict on
  • train (array-like) – the data set to train the model on (optional)
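
Accuracy here is the fraction of predictions that match the true responses. The helper accuracy_sketch below is a hypothetical, dependency-free illustration of that definition, reusing the data from the class example; it is not the method's actual code.

```python
def accuracy_sketch(responses, predictions):
    """Return the fraction of predictions equal to the true responses."""
    if len(responses) != len(predictions):
        raise ValueError("responses and predictions must have equal length")
    matches = sum(1 for r, p in zip(responses, predictions) if r == p)
    # float() keeps the division exact on Python 2 as well as Python 3.
    return float(matches) / len(responses)


# Half the responses are 'blue', and the dummy model always predicts 'blue'.
responses = ["blue" if i % 2 == 0 else "green" for i in range(20)]
predictions = ["blue"] * 20
score = accuracy_sketch(responses, predictions)
```
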
build_scores_df()[source]

Build the scores dataframe.

compute_curves()[source]

Plot ROC and precision-recall curves.

compute_scores(test_features=None, test_responses=None, metrics=['accuracy', 'auc', 'precision', 'recall', 'f1'])[source]

Compute the metric scores for the model on the cross-validation folds of the training data or on the test data, storing the results in a dataframe.

Add a new class attribute

  • scores,

    the results of the computations as a pandas.core.frame.DataFrame object

Parameters:
  • test_features (array-like) – The features of the test data. If None, evaluate the model on the cv folds.
  • test_responses (array-like) – The responses of the test data. If None, evaluate the model on the cv folds.
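
For reference, three of the listed metrics can be computed from counts of true positives, false positives and false negatives. The function precision_recall_f1 is a hypothetical sketch of those standard definitions, not the package's code; returning 0.0 when a denominator is zero is an assumption about how that edge case should be handled.

```python
def precision_recall_f1(responses, predictions, pos_label):
    """Return (precision, recall, f1) for the class pos_label."""
    pairs = list(zip(responses, predictions))
    tp = sum(1 for r, p in pairs if p == pos_label and r == pos_label)
    fp = sum(1 for r, p in pairs if p == pos_label and r != pos_label)
    fn = sum(1 for r, p in pairs if p != pos_label and r == pos_label)

    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = ["good", "good", "bad", "bad", "good"]
preds = ["good", "bad", "bad", "good", "good"]
precision, recall, f1 = precision_recall_f1(truth, preds, pos_label="good")
```
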
cv_split(k=10)[source]

Make k folds of the data using stratified cross validation.

Parameters: k (int) – the number of folds to divide the data into
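
Stratification means each fold keeps roughly the same class proportions as the full data set. The helper stratified_folds below is a hypothetical, dependency-free sketch of that idea; it deals indices out class by class without shuffling, whereas the package may rely on an existing cross-validation utility.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign indices to k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0
    # Deal each class round-robin; the counter carries over between classes
    # so that fold sizes stay balanced.
    for label in sorted(by_class):
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds


labels = ["blue" if i % 2 == 0 else "green" for i in range(20)]
folds = stratified_folds(labels, k=5)
```
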
class cRedditscore.evaluation.GenModel(fit=None, predict=None, predict_proba=None)[source]

Bases: object

A general (binary classification) model class, mostly for testing and explanatory purposes.

Parameters:
  • fit (function) – the fit function of the model
  • predict (function) – the prediction function of the model
  • predict_proba (function) – the probability prediction function of the model, computes the probability that an observation belongs to the first class
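
A class with this interface can be as small as a holder for the callables it is given. CallableModel below is a hypothetical stand-in written for illustration; it mirrors the constructor signature documented above but is not the package's GenModel.

```python
class CallableModel(object):
    """Hypothetical stand-in that exposes whichever callables it is given."""

    def __init__(self, fit=None, predict=None, predict_proba=None):
        # Each argument becomes an attribute that can be called like a method.
        self.fit = fit
        self.predict = predict
        self.predict_proba = predict_proba


# A dummy model that predicts 'blue' for every observation.
always_blue = CallableModel(predict=lambda rows: ["blue" for _ in rows])
predictions = always_blue.predict(range(3))
```
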

Module contents