StopStalk Submission Retrieval Logic

Behind the scenes of submission retrieval

The numbers presented here are as of 01 Jan, 2018

What used to happen before ?

Currently we have 5700+ users along with 680+ custom users. This means we need to retrieve submissions of 6380+ users from all the 6 profile sites (CodeChef, CodeForces, Spoj, HackerRank, HackerEarth, UVa).

Total requests every 24 hours = 6380 * 6 = 38280

Note: This number is actually larger than this as most of the websites have pagination and we need to make extra requests to get all the submissions. Along with these requests other processing power is required to parse the submission details from the page html received followed by entering these details into the StopStalk submission table. There are also a few optimizations to reduce the computation -

last_retrieved optimization

There is a great optimization here which involves storing a last_retrieved for every site profile of every user which indicates what was the last timestamp the submissions from that site were retrieved. This helps in the next day's retrieval and breaking out from the processing of submissions once we hit a timestamp lesser than the last_retrieved value.

Invalid handles optimization

We also store invalid handles entered by users and once found that the handle does not exist on the site we store it in invalid_handles table and skip the subsequent retrievals for these handles to avoid unnecessary 404s. These handles are checked regularly so as to consider the cases when the user actually registers on the site with this handle which didn't exist before.

Even with these optimizations the time taken to completely retrieve the submissions for all the users across all the profile sites is about 8 hours which is tremendously huge. This time is directly proportional to the number of users implying as the user base increases we need a better optimization to serve the purpose.

New Heuristic approach

Insights to the logs of every day submission retrieval reveiled an interesting fact - why should crawling happen for every site for every user. Putting some manual input to the system explains the fact that a user wouldn't submit on every site every day. Even when a user has registered across sites it is practically not possible that he/she will submit on all the websites every day.
How do we capture this information and build an algorithm around it to reduce the number of requests everyday ?
We store a delay corresponding to every user every profile site whose default value is 0. Now we will use this <site>_delay value along with the <site>_lr (site last retrieved) to come up with a delayed retrieval procedure.

Let's have a look at the pseudocode -

// user_record is the record object of a specific user
ALL_SITES = ["CodeChef", "CodeForces", "SPOJ", "HackerRank", "HackerEarth", "UVa"]
for site in ALL_SITES
    site_delay = user_record[site + "_delay"]
    site_lr = user_record[site + "_last_retrieved"]
    if site_delay / 5 + 1 + site_lr.date > today.date
       // Skip the retrieval today
       continue
    else
        // Get how many new submissions were retrieved
        submission_count = retrieve_submissions(user_object[site + "_handle"], site_lr)
        
    if submission_count == 0
        // No submissions made for site and hence increment delay
        site_delay += 1
    else
        // Reset the delay as user is active today
        site_delay = 0

    // Update site last retrieved, regardless of whether we actually retrieved something.
    site_lr.date = today.date

Let's see it in action:
Say, we fetched details of one user yesterday (date.today - site_lr.date = 1) and the delay is 0 implying: 0 / 5 + 1 + site_lr.date would be equal to today's date, Therefore we will not skip this user and proceed to retrieve his submissions. Now if the user didn't submit today, we increment the delay by 1 and update site_lr.date to today's date.
Dry running this further we understand that until the delay value is 5 we would be retrieving submissions every day (Because the difference between site_lr.date and date.today will always remain 1).
Once the value is 5, we get 1 + 1 + site_lr.date > today's date, therefore we will skip this user today (i.e. we won't retrieve submissions and won't update site_lr.date). Now, the next day we will try to retrieve the user's submissions. If he has some submissions, we will reset the site_delay (and update site_lr.date) and go back to our initial state. In case, the user still hasn't made a submission we increment site_delay to 6 (and update site_lr.date). You can observe that we start retrieving the submissions on alternate days as the delay value goes from 5 to 9.
As the value reaches 10, we start retrieving every third day - and this proceeds accordingly unless the user submits a problem on the site after which the cycle resets and the process is repeated for every site profile.
Code in action - Link

Are we penalizing rare submitting programmers ?

Certainly not. We understand that there might be cases when user submits to a site when there is an interesting contest on the site. The user's can click on the "Retrieve my submissions" on the profile page and we will retrieve the submissions within 5 minutes. Not only this, we will also reset all the site_delay values to 0 :)

Please feel free to write to us in case of any queries - Contact us page or Facebook Group

Behind the scenes of submission retrieval

The numbers presented here are as of 01 Jan, 2018

What used to happen before ?

last_retrieved optimization

Invalid handles optimization

New Heuristic approach

Are we penalizing rare submitting programmers ?

Download submission

All caught up!

View submission

Is your resume even shortlisted?