How it works

The Web App

The web app is a simple Pyramid web app that uses SQLAlchemy to interface with a PostgreSQL database.

No main logic (machine learning, analysis) is done in the web app; it simply reads rows from the database and presents them with some very minimal formatting.
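
As a rough illustration, a view in this style boils down to a read-only query plus a template render. The sketch below is an assumption of how such a view could look, not the actual code; the `Post` model, route name, template path, and `request.dbsession` wiring are made up.

```python
# Minimal sketch of a read-only Pyramid view backed by SQLAlchemy.
# The Post model, route name, template, and request.dbsession setup are assumptions.
from pyramid.view import view_config
from sqlalchemy import Column, DateTime, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Post(Base):
    __tablename__ = 'posts'
    id = Column(Integer, primary_key=True)
    title = Column(Text)
    author = Column(Text)
    created = Column(DateTime)

@view_config(route_name='new', renderer='templates/new.jinja2')
def new_posts(request):
    # No analysis here: just read rows and hand them to the template.
    posts = (request.dbsession.query(Post)
             .order_by(Post.created.desc())
             .limit(100)
             .all())
    return {'posts': posts}
```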

The default front page matches '/new' on Reddit's /r/DirtyPenPals, because there is some machine-learning-friendly data to be had there.

Upcoming is a view that reflects /top as well, but that is extra.

The Backend

To fill the database there are several Postgres-enabled micro-services, intercommunicating via PostgreSQL's asynchronous NOTIFY/LISTEN mechanism.
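
Each of these services follows roughly the same pattern: block on the connection until a NOTIFY arrives, then hand the payload off for processing. The following is a generic psycopg2 sketch of that pattern; the channel name and the handler are assumptions.

```python
# Generic sketch of a service main loop built on PostgreSQL LISTEN/NOTIFY,
# using psycopg2. The channel name and handler are assumptions.
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect('dbname=dpp')
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute('LISTEN new_post;')

def handle(payload):
    # A real worker would do its processing on a separate connection.
    print('got notification:', payload)

while True:
    # Wait until the socket becomes readable (a notification arrived).
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timeout, loop again
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        handle(notify.payload)
```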

Because of this architecture, each running service needs at least two connections to the database: one for the poller (main loop), and more for any workers.

Each listening service can be multi-threaded, or event-based around a single main loop.

Archivers

There are currently three archivers. One polls the subreddit's /new listing every minute, grabs all posts, inserts them into the database, and sends a notification.
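
A stripped-down version of that poller could look like the sketch below, using PRAW to fetch the listing and pg_notify to announce newly inserted rows. The table layout, channel name and credentials are assumptions.

```python
# Sketch of the /new poller: fetch recent submissions with PRAW, insert any
# new ones, and NOTIFY listeners. Table, channel and credentials are assumptions.
import time
import praw
import psycopg2

reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='dpp-archiver')
conn = psycopg2.connect('dbname=dpp')

while True:
    with conn, conn.cursor() as cur:
        for post in reddit.subreddit('DirtyPenPals').new(limit=100):
            cur.execute(
                """INSERT INTO posts (reddit_id, title, author, body, created)
                   VALUES (%s, %s, %s, %s, to_timestamp(%s))
                   ON CONFLICT (reddit_id) DO NOTHING""",
                (post.id, post.title, str(post.author), post.selftext, post.created_utc))
            if cur.rowcount:  # only notify for rows we actually inserted
                cur.execute('SELECT pg_notify(%s, %s)', ('new_post', post.id))
    time.sleep(60)
```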

The other two are batch jobs to pre-fill the database. One uses the search interface and bisects the history of a subreddit to find posts. While this is functional, it fails to find **all** posts; the guess is that it only sees about 20% of all historical posts.
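
The bisection works roughly as follows: search a time window, and if the result count hits the API's cap, split the window and recurse. The sketch below only illustrates that idea; `search_by_timerange` is a hypothetical helper standing in for whatever timestamp-bounded search is actually used, and the cap is assumed.

```python
# Sketch of the history-bisection batch job. search_by_timerange is a
# hypothetical helper wrapping a timestamp-bounded subreddit search.
RESULT_CAP = 1000  # assumed per-query cap on returned posts

def backfill(subreddit, start, end, store):
    posts = search_by_timerange(subreddit, start, end)
    if len(posts) < RESULT_CAP or end - start <= 60:
        # Either we got everything in this window, or it cannot be split further.
        for post in posts:
            store(post)
        return
    mid = (start + end) // 2
    backfill(subreddit, start, mid, store)
    backfill(subreddit, mid, end, store)
```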

The third batch job iterates over the database and locates authors. It then uses the Reddit search API to find more posts by these authors and inserts them. This is a slow process, entirely capped by the Reddit API rate limit, but it finds quite a few of the missing posts.
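
Sketched out, the author pass could look something like this, walking each known author's submission history and keeping only posts from the subreddit of interest. The table name, the `insert_post` helper and the use of PRAW's redditor listing are assumptions.

```python
# Sketch of the author back-fill job: for every author already in the
# database, pull their submissions and keep the on-topic ones.
# Table name and the insert_post helper are assumptions.
import praw
import psycopg2

reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='dpp-backfill')
conn = psycopg2.connect('dbname=dpp')

with conn.cursor() as cur:
    cur.execute('SELECT DISTINCT author FROM posts')
    authors = [row[0] for row in cur.fetchall()]

for name in authors:
    # PRAW transparently honours Reddit's rate limit, which is what makes
    # this pass slow.
    for post in reddit.redditor(name).submissions.new(limit=None):
        if post.subreddit.display_name == 'DirtyPenPals':
            insert_post(conn, post)  # hypothetical helper, same insert as the poller
```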

Learners

There are currently two machine-learning type activities. These are long-running processes with some fairly significant internal state, using a few gigabytes of memory.

Duplicates (reposts and post copies) are handled with MinHash and locality-sensitive hashing (LSH): each post is stripped of common stop-words and shingled, a MinHash is calculated over this set of shingles, and that MinHash is queried against the LSH cache. This gives us an approximate set of matching documents.

For each of these matching documents, the Jaccard similarity is approximated from the MinHashes and stored as the similarity score.
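
As a concrete sketch of the idea, the datasketch library implements exactly this MinHash + LSH combination; the shingle size, permutation count, threshold and stop-word list below are assumed values, not the project's actual settings.

```python
# Illustration of the MinHash/LSH duplicate check with the datasketch library.
# Shingle size, num_perm, threshold and stop-words are assumed values.
from datasketch import MinHash, MinHashLSH

STOPWORDS = {'the', 'a', 'and', 'to', 'of'}  # abbreviated stop-word list

def minhash_of(text, shingle_size=3, num_perm=128):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    shingles = {' '.join(words[i:i + shingle_size])
                for i in range(len(words) - shingle_size + 1)}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode('utf8'))
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # the persistent LSH cache
cache = {}  # post_id -> MinHash, kept alongside the LSH index

def score_post(post_id, text):
    m = minhash_of(text)
    # Approximate candidates first, then estimate Jaccard similarity per pair.
    scores = {other: m.jaccard(cache[other]) for other in lsh.query(m)}
    lsh.insert(post_id, m)
    cache[post_id] = m
    return scores
```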

The above process should probably be split in two: one part that finds matches using LSH, and one that calculates the differences and scores them.

The other learner process runs Naive Bayes classifiers for a few different set-ups.

The best-functioning one builds a corpus of gender-tagged posts based on the DPP verification system. (This is the part that was very well suited for machine learning.) The resulting scores are then written to a tally table, together with the confidence score.
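
In outline, that classifier is the textbook bag-of-words Naive Bayes setup; the sketch below uses scikit-learn, and the training data, labels and the way the confidence is derived are assumptions.

```python
# Sketch of the gender classifier: bag-of-words features into a multinomial
# Naive Bayes model. Training data and labels are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# texts: posts by authors whose gender is known via the verification system.
texts = ['example verified post by a female author',
         'example verified post by a male author']
labels = ['F', 'M']

model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(texts, labels)

def classify(post_text):
    # Returns the predicted label and its probability, i.e. the confidence
    # score that would be written to the tally table.
    probs = model.predict_proba([post_text])[0]
    best = probs.argmax()
    return model.classes_[best], float(probs[best])
```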

The topic-guesser is trained on a sample of up to 15000 posts each from the M4F/F4M/F4F/F4A/M4A/M4M categories. The sampling makes each instance behave non-deterministically, and I am reconsidering its existence. The topic-guesser is also too memory-hungry to fit on my 2 GiB server.

Misc

There are several backend tools to display scores and sentiments in real time as they drop in via the async notifications.

There are also pre-built versions of many of the tools that back-fill or recalculate scores based on the current state.

What is left