Demo files are a requirement for analyzing players at the professional level of Counter-Strike: Global Offensive. Currently, using the HLTV Demos page for each event is a cumbersome and labor-intensive process. Leveraging Python, we can eliminate all manual interaction and streamline the process of acquiring these demo files.
Python is the obvious choice for a short script that will run with no dependencies. Writing the script in pure Python also means it will be cross-platform across macOS, Linux, and Windows. The downside is that it will not easily be multithreaded; however, with a fast internet connection, that matters less.
The HLTV events archive lists every event that HLTV has coverage for. Each event has a unique ID that is displayed in the URL: for example, IEM Sydney has a URL of http://www.hltv.org/?pageid=28&eventid=2713. There are two arguments in this URL: pageid= and eventid=. pageid specifies what type of page to display, while eventid specifies which event to query the data for. This means the event ID is the number after &eventid=, in this case 2713.
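As a quick illustration, the event ID can be pulled out of such a URL with a short regular expression (extract_event_id is a hypothetical helper, not part of the script itself):

```python
import re

# Hypothetical helper: pull the numeric event ID out of an HLTV event URL.
def extract_event_id(url):
    match = re.search(r'eventid=(\d+)', url)
    return match.group(1) if match else None

print(extract_event_id('http://www.hltv.org/?pageid=28&eventid=2713'))  # 2713
```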
The matches that populate the resulting list are URLs that follow a similar pattern. For instance, the FaZe vs. SK match has a URL of http://www.hltv.org/?pageid=28&&eventid=2713&demoid=28247. This has three arguments: the same pageid and eventid from before, with an added demoid argument. This demoid is what serves the demo when you query the download URL: hltv.org/interfaces/download.php?demoid=.
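Putting those pieces together, the download URL for the example match above would be built like this (a sketch using the example demo ID from that URL):

```python
# Build the download URL described above for the FaZe vs. SK example demo.
demoid = 28247
url = 'http://www.hltv.org/interfaces/download.php?demoid=%s' % (demoid)
print(url)  # http://www.hltv.org/interfaces/download.php?demoid=28247
```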
In summary, the program must build a URL to query, parse the result to find the proper Demo IDs, and query the download URL for each.
The page with the demo files is pageid=28, and that is constant across all events. Thus, we can query a list by simply telling the program to build a URL that starts with http://www.hltv.org/?pageid=28&eventid= and concatenates a given eventid to the end.
eventid = eventID
offset = 0
url = 'http://www.hltv.org/?pageid=28&&eventid=%s' % (eventid)
To see what the server returns to that query, we build a URL opener object and add in some dummy headers. Luckily, this is quite simple to do using urllib and urllib2:
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open(url)
html = response.read()
This opens the URL we created and then saves the resulting HTML to a string variable called html.
Regular expressions can be leveraged to determine which URLs hold the Demo IDs to download. Since the match URLs are all mostly the same, we can create an array called demoIDs like so:
demoIDs = re.findall('"(.*?eventid=%s&offset=%d&demoid=.*?)"' % (eventid, offset), html)
This fills an array with all of the URLs that exist for our current event on the current page. To clean the array so that we only have the Demo IDs, simply remove the parts from each item in the array that are the same:
for i in range(0, len(demoIDs)):
    demoIDs[i] = demoIDs[i].replace('" href="?pageid=28&&eventid=%s&offset=%s&demoid=' % (eventid, offset), "")
After that, the array demoIDs is filled only with five-digit Demo IDs.
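An alternative worth noting: instead of stripping the shared prefix with replace(), the regular expression could capture just the digits directly. A minimal sketch against a hypothetical fragment of the page's HTML:

```python
import re

# Hypothetical fragment of an event page's HTML, matching the URL pattern above.
html = '<a href="?pageid=28&&eventid=2713&offset=0&demoid=28247">FaZe vs. SK</a>'

# Capture only the digits of each demoid, skipping the cleanup loop entirely.
demoIDs = re.findall(r'demoid=(\d+)', html)
print(demoIDs)  # ['28247']
```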
HLTV throws a wrench in the simplicity of this, however. Demo pages are restricted to only showing twenty-five matches at a time, and many events have more than twenty-five matches. To handle this, HLTV adds another argument to the URL for paginated events: offset. This specifies a multiple of 25 to offset the list of matches by: for example, page one is offset=0, page two is offset=25, and so forth.
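The relationship between page number and offset can be captured in a one-line helper (page_offset is a hypothetical name, assuming 1-based page numbers):

```python
# offset is a multiple of 25: page one -> 0, page two -> 25, and so on.
def page_offset(page):
    return (page - 1) * 25

print(page_offset(1))  # 0
print(page_offset(2))  # 25
```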
eventid = eventID
offset = 0
url = 'http://www.hltv.org/?pageid=28&&eventid=%s&offset=%s' % (eventid, offset)
To account for this, the script must test for when the list of demos is equal to twenty-five so it knows to check for a subsequent page. This can be accomplished with an if statement:
if len(demoIDs) == 25:
    morePages = True
    page = 1
    while morePages:
        offset += 25
        url = 'http://www.hltv.org/?pageid=28&&eventid=%s&offset=%s' % (eventid, offset)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
        response = opener.open(url)
        html = response.read()
        moreDemoIDs = re.findall('"(.*?eventid=%s&offset=%d&demoid=.*?)"' % (eventid, offset), html)
        for i in range(0, len(moreDemoIDs)):
            moreDemoIDs[i] = moreDemoIDs[i].replace('" href="?pageid=28&&eventid=%s&offset=%s&demoid=' % (eventid, offset), "")
            demoIDs.append(moreDemoIDs[i])
        if len(moreDemoIDs) < 25:
            morePages = False
            print "Parsing final page. Found %s IDs" % (len(demoIDs))
        else:
            page += 1
            print "Parsing next page: %s. %s IDs so far." % (page, len(demoIDs))
This only activates if the page we parse has exactly twenty-five demos. If it does, it builds the URL in the same way as before and then loops through until there are no more pages.
elif len(demoIDs) < 25:
    print "Total demos: %s" % len(demoIDs)
elif len(demoIDs) > 25:
    print "HLTV altered demo page layout"
These statements catch our edge cases. The first ends the script when the first page contains all of the demos and the second activates if more than twenty-five demos are found, which should currently be impossible.
All of this can be stored in a defined method. For example, using def getIDs(eventID): we can call getIDs(2713) and return the resultant array of IDs to download.
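The pagination logic above can also be sketched as a pure function, with the network fetch factored out so the loop itself is easy to test (get_ids and fetch_page are hypothetical names; the real script's getIDs does the fetching inline):

```python
# Sketch of the pagination loop: keep requesting pages of at most 25 IDs
# until a page comes back with fewer than 25.
def get_ids(fetch_page):
    demo_ids = []
    offset = 0
    while True:
        page_ids = fetch_page(offset)
        demo_ids.extend(page_ids)
        if len(page_ids) < 25:
            break
        offset += 25
    return demo_ids

# Fake event with 30 demos split across two pages of at most 25.
pages = {0: ['%05d' % n for n in range(25)],
         25: ['%05d' % n for n in range(25, 30)]}
print(len(get_ids(lambda off: pages.get(off, []))))  # 30
```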
As outlined above, to serve a specific demo, HLTV queries hltv.org/interfaces/download.php with the argument demoid. Thus, to make the script download the demo file, we must build that URL. Since we have an array of Demo IDs, this is simple to do with a for loop:
counter = 0
for i in range(0, len(demoIDs)):
    url = "http://www.hltv.org/interfaces/download.php?demoid=%s" % (demoIDs[i])
From here we use the same method as before to build an opener and open the URL:
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
However, HLTV again throws a wrench in the mix. The URL we query does not go directly to the file we need to download. Thus, we must capture the real URL it redirects to by using geturl():
    finalurl = response.geturl()
    filename = finalurl.rsplit('/', 1)[-1]
    urllib.urlretrieve(finalurl, directory+"/"+filename)
    counter += 1
    print "Downloaded %s demos" % (counter)
Once we get the real URL that leads to the compressed demo file, the program splits the URL on the last / to get the text after it. This is the filename: for example, something like ESLProLeague hellraisers vs penta bo3.rar.
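With a hypothetical final URL, the split looks like this:

```python
# Hypothetical redirect target; only the last path segment is kept.
finalurl = 'http://www.hltv.org/demos/ESLProLeague-hellraisers-vs-penta-bo3.rar'
filename = finalurl.rsplit('/', 1)[-1]
print(filename)  # ESLProLeague-hellraisers-vs-penta-bo3.rar
```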
The final URL is passed to the urllib.urlretrieve() method along with the filename.
Before the script can download the demos, it needs a place to put them. Before downloading begins, the script uses os to create a folder inside the script's directory. It takes the user's input and creates a folder of that name: for example, if the script is on the desktop and a user enters IEM Sydney 2017, the script will save all of the downloaded demos to the new IEM Sydney 2017 folder.
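That folder-creation step might look like the following sketch (the directory variable matches the one used in the download loop; in the actual script the name comes from user input):

```python
import os

# In the real script this name comes from user input, e.g. 'IEM Sydney 2017'.
directory = 'IEM Sydney 2017'

# Create the destination folder next to the script if it doesn't already exist.
if not os.path.exists(directory):
    os.makedirs(directory)
```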
The code is available with a GNU-GPL license in this repository on my GitHub.
To run the code, download and unzip or clone the repository. From there, use your CLI of choice to cd into the directory and run python Downloader.py. The script will ask for an event ID, which is described above. It will then ask for the name of the event and download all of the relevant demo files.