How it works
The Web App
The web app is a simple Pyramid application that uses SQLAlchemy to talk to a PostgreSQL database.
No main logic (machine learning, analysis) runs in the web app; it simply reads rows from the database and presents them with very minimal formatting.
The default front page mirrors '/new' on Reddit's /r/DirtyPenPals, because there is some machine-learning-friendly data to be had there.
A view that reflects /top as well is planned, but that is an extra.
To fill the database there are several Postgres-enabled micro-services that communicate with each other via the asynchronous NOTIFY chain in PostgreSQL.
Because of this architecture, each running service needs at least two connections to the database: one for the poller (main loop), and one or more for eventual workers.
Each listening service can be multi-threaded or event-based in a mainloop.
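As a rough illustration of the poller side, here is a minimal sketch of a select-based main loop. It assumes `conn` is a psycopg2 connection (the driver is not named above, so that is an assumption), and the channel name and `handle` callback are hypothetical.

```python
import select


def drain_notifications(conn):
    """Return all pending NOTIFY messages from a psycopg2-style connection."""
    conn.poll()  # psycopg2: reads socket data and fills conn.notifies
    pending = []
    while conn.notifies:
        pending.append(conn.notifies.pop(0))
    return pending


def listen_forever(conn, channel, handle, timeout=5.0):
    """Block on the connection socket and dispatch each notification.

    `channel` is interpolated into SQL, so it must be a trusted identifier.
    """
    conn.autocommit = True  # notifications are only delivered outside a transaction
    with conn.cursor() as cur:
        cur.execute(f"LISTEN {channel};")
    while True:
        # Wait until the socket is readable, or wake up after `timeout` seconds.
        ready, _, _ = select.select([conn], [], [], timeout)
        if not ready:
            continue
        for note in drain_notifications(conn):
            handle(note.channel, note.payload)
```

A service would call something like `listen_forever(conn, "new_post", print)` on its dedicated listener connection and hand real work to a second connection, which matches the two-connection requirement described above.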
There are currently three archivers. The first polls /r/new every minute, grabs all posts, inserts them into the database, and sends a notification.
The other two are batch jobs to pre-fill the database. One uses the search interface and bisects the history of a subreddit to find posts. While this is functional, it fails to find all posts; the guess is that we only see about 20% of all historical posts.
The third batch job iterates over the database and locates authors. It then uses the reddit search API to find more posts by these authors and inserts them. This is a slow process, entirely capped by the reddit API limits, but it finds quite a lot of the missing posts.
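One way to read the "bisect the history" batch job is as a time-window bisection: query a window, and if the result count hits the search cap, split the window in half and retry each half. The sketch below is only that interpretation, not necessarily what the archiver does; `fetch`, `cap` and `min_span` are hypothetical names, and `fetch(start, end)` stands in for whatever search call returns posts in that time range.

```python
def bisect_window(fetch, start, end, cap, min_span=1):
    """Yield posts in [start, end), splitting any window that hits the cap.

    fetch(start, end) is assumed to return at most `cap` posts; a full
    page is treated as a sign that the window was truncated.
    """
    posts = fetch(start, end)
    if len(posts) < cap or end - start <= min_span:
        # Either the window was small enough, or it cannot be split further.
        yield from posts
        return
    mid = (start + end) // 2
    yield from bisect_window(fetch, start, mid, cap, min_span)
    yield from bisect_window(fetch, mid, end, cap, min_span)
```

Posts that cluster inside a single `min_span` window are still lost, which is one plausible reason a search-based archiver cannot recover everything.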
There are currently two machine-learning activities. These are long-running processes that have some fairly significant internal state, using a few gigs of memory.
Duplicates (reposts and copied posts) are handled with MinHash and locality-sensitive hashing (LSH). Each post is shingled and stripped of common stop-words, then a MinHash is calculated over this set of shingles and queried against the LSH cache. This gives us an approximate set of matching documents.
For each of these matching documents, the Jaccard Similarity is approximated from the MinHashes, and stored as the similarity score.
The above process should probably be split into two, one that finds matches using LSH, and one that calculates the differences and scores them together.
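The shingling, MinHash, LSH lookup and similarity steps can be sketched with the standard library alone. This is an illustrative toy, not the project's implementation: the stop-word list, shingle size, signature length and band count are all assumptions, and a real service would more likely use a dedicated library.

```python
import hashlib
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "i", "for"}  # toy list
NUM_PERM = 64   # signature length (assumed)
BANDS = 16      # 16 bands of 4 rows each (assumed)


def shingles(text, k=3):
    """Drop stop-words, then build the set of k-word shingles."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """One member of a family of 64-bit hashes, selected by `seed`."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set):
    """MinHash signature of a non-empty shingle set."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class LSHIndex:
    """Banded LSH: signatures sharing any full band become candidates."""

    def __init__(self):
        self.rows = NUM_PERM // BANDS
        self.buckets = defaultdict(set)
        self.signatures = {}

    def _keys(self, sig):
        for band in range(BANDS):
            yield band, sig[band * self.rows:(band + 1) * self.rows]

    def insert(self, post_id, sig):
        self.signatures[post_id] = sig
        for key in self._keys(sig):
            self.buckets[key].add(post_id)

    def query(self, sig):
        """Return {post_id: estimated similarity} for candidate matches."""
        candidates = set()
        for key in self._keys(sig):
            candidates |= self.buckets.get(key, set())
        return {pid: estimate_jaccard(sig, self.signatures[pid])
                for pid in candidates}
```

Here `query` does both jobs at once, candidate lookup and scoring. Splitting the process in two, as suggested above, would roughly mean separating the bucket lookup from the `estimate_jaccard` call.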
The other learner processes are Naive Bayes classifiers for a few different set-ups.
The best-functioning one builds a corpus of gender-tagged posts based on the DPP verification system. (This is the part that was very well suited to machine learning.) These scores are then written to a tally table, together with the confidence score.
The topic-guesser is trained on a sample of M4F/F4M/F4F/F4A/M4A/M4M posts, up to 15000 posts each. The sampling makes each instance behave non-deterministically, and I am reconsidering its existence. The topic-guesser is also too memory-hungry to fit on my 2GiB server.
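For readers unfamiliar with the technique, a multinomial Naive Bayes classifier that returns a label together with a confidence score can be sketched as follows. This is a generic textbook version with add-one smoothing and whitespace tokenisation, both of which are assumptions; the class and method names are hypothetical and the real classifiers may differ.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return (best_label, confidence), confidence being the posterior."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Normalise in a numerically stable way to get posterior probabilities.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())
```

The returned pair maps naturally onto a tally row: the label is the guess, and the second value is the confidence score stored next to it.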
There are several backend tools to display scores and sentiments in real-time as they drop in via the async notifications.
There are also pre-build versions of many of the tools that will fill up/recalculate scores based on the current max state.
What is left
The Bayesian models need to be restarted regularly, because they do not re-learn when new, verified content drops in. This should be automated.
Currently the Title guesser Bayesian model is disabled, as it uses too much RAM for my server.
More Bayesian models. There is a scoring model that guesses the end score of a post, but it performs really, really badly: we cannot read the proper scores from reddit, and we would need to pick a learning time range to base our scoring on.
More frontend. Doh. Just look at it.
Score updater. Something that grabs current scores for posts and updates those fields, without losing too much data or getting rate-limited by the API.
/r/talesfromtechsupport and /r/nosleep. Those are what I started with, and where the score-guessing came from.
Meta-guesser. Using titles as a correction, make a guesser for how META a post is as it comes in.
Verifier. Something that can trawl other subreddits (/r/gonewild?) for posters that did not verify gender and see if we can get a better set of "facts" there.