Using Python to Download HLTV Demos
Demo files are a requirement for analyzing players at the professional level of Counter-Strike: Global Offensive. Currently, using the HLTV Demos page for each event is a cumbersome and labor-intensive process. Leveraging Python, we can eliminate all manual interaction and streamline the process of acquiring these demo files.
Writing the Script
Python is the obvious choice for a short script that runs with no third-party dependencies. Writing the script in pure Python also makes it cross-platform across macOS, Linux, and Windows. The downside is that it will not easily be multithreaded; however, with a fast internet connection that becomes less important.
Determining How HLTV Serves Demos
The HLTV events archive lists every event that HLTV has coverage for. Each event has a unique ID that is displayed in the URL: for example, IEM Sydney has a URL of
http://www.hltv.org/?pageid=28&eventid=2713. There are two arguments in this URL:
pageid specifies what type of page to display, while
eventid specifies which event to query the data for. This means the event ID is the number after
&eventid=, in this case 2713.
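As a quick illustration, the event ID can be pulled out of such a URL with a regular expression. This is just a sketch; extract_event_id is a hypothetical helper, not part of the final script:

```python
import re

# Hypothetical helper: pull the eventid parameter out of an HLTV event URL.
def extract_event_id(url):
    match = re.search(r'[?&]eventid=(\d+)', url)
    return match.group(1) if match else None

event_url = 'http://www.hltv.org/?pageid=28&eventid=2713'
print(extract_event_id(event_url))  # 2713
```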
The matches that populate the resulting list have URLs that follow a similar pattern. For instance, the FaZe vs. SK match has a URL of
http://www.hltv.org/?pageid=28&eventid=2713&demoid=28247. This has three arguments: the same
pageid and eventid from before, with an added
demoid argument. This
demoid is what identifies the demo when you query the download page.
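To make the URL structure concrete, here is a small sketch that splits a match URL into its query parameters. parse_params is a hypothetical helper for illustration only:

```python
import re

# Sketch: split an HLTV match URL into a dict of its query parameters.
# The [?&]+ pattern also tolerates accidental doubled ampersands (&&).
def parse_params(url):
    return dict(re.findall(r'[?&]+(\w+)=(\w+)', url))

match_url = 'http://www.hltv.org/?pageid=28&eventid=2713&demoid=28247'
params = parse_params(match_url)
print(params['demoid'])  # 28247
```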
Getting the Demo IDs
In summary, the program must build a URL to query, parse the result to find the proper Demo IDs, and then query the download URL for each.
Querying the URL
The page with the demo files is
pageid=28, and that is constant across all events. Thus, we can query an event's match list simply by telling the program to build a URL that starts with
http://www.hltv.org/?pageid=28&eventid= and concatenates a given
eventid to the end.
eventid = eventID
offset = 0
url = 'http://www.hltv.org/?pageid=28&eventid=%s' % (eventid)
Saving the Result
To see what the server returns for that query, we build a URL opener object and add in some dummy headers. Luckily, this is quite simple to do using urllib2:
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open(url)
html = response.read()
This opens the URL we created and then saves the resulting HTML to a string variable called html.
Getting the Demo IDs
Regular expressions can be leveraged to determine which URLs hold the Demo IDs to download. Since the match URLs are mostly the same, we can create an array called
demoIDs like so:
import re

demoIDs = re.findall('"(.*?eventid=%s&offset=%s&demoid=.*?)"' % (eventid, offset), html)
This fills an array with all of the URLs that exist for our current event on the current page. To clean the array so that we only have the Demo IDs, simply strip the parts of each item that are the same across all of them:
for i in range(0, len(demoIDs)):
    demoIDs[i] = demoIDs[i].replace('" href="?pageid=28&eventid=%s&offset=%s&demoid=' % (eventid, offset), "")
After that, the
demoIDs array contains only the five-digit Demo IDs.
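To make the extraction step concrete, here is a simplified, self-contained sketch of the same idea. The HTML below is a dummy snippet standing in for the real page source, and the pattern uses a capturing group around the digits so the separate cleanup pass isn't needed:

```python
import re

eventid = 2713
offset = 0
# Dummy HTML standing in for the real HLTV page source.
html = ('<a href="?pageid=28&eventid=2713&offset=0&demoid=28247">FaZe vs. SK</a>'
        '<a href="?pageid=28&eventid=2713&offset=0&demoid=28248">Astralis vs. G2</a>')

# Capture only the demoid digits directly, instead of trimming afterwards.
demoIDs = re.findall('eventid=%s&offset=%s&demoid=(\\d+)' % (eventid, offset), html)
print(demoIDs)  # ['28247', '28248']
```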
HLTV throws a wrench in the simplicity of this, however. Demo pages are restricted to only showing twenty-five matches at a time, and many events have more than twenty-five matches. To handle this, HLTV adds another argument to the URL for paginated events:
offset. This specifies a multiple of 25 to offset the list of matches by: for example, page one is
offset=0 while page two is
offset=25 and so forth.
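The page-to-offset mapping is simple arithmetic, sketched here as a hypothetical helper:

```python
# Sketch: offset for a given 1-indexed page, 25 matches per page.
def page_offset(page, per_page=25):
    return (page - 1) * per_page

print(page_offset(1))  # 0
print(page_offset(2))  # 25
```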
eventid = eventID
offset = 0
url = 'http://www.hltv.org/?pageid=28&eventid=%s&offset=%s' % (eventid, offset)
To account for this, the script must test for when the list of demos is exactly twenty-five long so it knows to check for a subsequent page. This can be accomplished with an if statement:
if len(demoIDs) == 25:
    morePages = True
    page = 1
    while morePages:
        offset += 25
        url = 'http://www.hltv.org/?pageid=28&eventid=%s&offset=%s' % (eventid, offset)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
        response = opener.open(url)
        html = response.read()
        moreDemoIDs = re.findall('"(.*?eventid=%s&offset=%s&demoid=.*?)"' % (eventid, offset), html)
        for i in range(0, len(moreDemoIDs)):
            moreDemoIDs[i] = moreDemoIDs[i].replace('" href="?pageid=28&eventid=%s&offset=%s&demoid=' % (eventid, offset), "")
            demoIDs.append(moreDemoIDs[i])
        if len(moreDemoIDs) < 25:
            morePages = False
            print "Parsing final page. Found %s IDs" % (len(demoIDs))
        else:
            page += 1
            print "Parsing next page: %s. %s IDs so far." % (page, len(demoIDs))
This only activates if the page we parse has exactly twenty-five demos. If it does, it builds the URL in the same way as before and then loops through until there are no more pages.
elif len(demoIDs) < 25:
    print "Total demos: %s" % len(demoIDs)
elif len(demoIDs) > 25:
    print "HLTV altered demo page layout"
These statements catch our edge cases. The first handles the case where the first page contains all of the demos, and the second fires if more than twenty-five demos are found on one page, which should currently be impossible.
Making a Method
All of this can be wrapped in a function. For example, with
def getIDs(eventID): we can call
getIDs(2713) and get back the resulting array of IDs to download.
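To show how the pieces fit together, here is a stripped-down sketch of such a function. It is not the author's actual script: the regex is simplified to a capturing group, and the network fetch is stubbed out as a fetch_page callable so the logic can be followed (and tested) offline:

```python
import re

def getIDs(eventID, fetch_page):
    # fetch_page(eventid, offset) is assumed to return the page HTML.
    demoIDs = []
    offset = 0
    while True:
        html = fetch_page(eventID, offset)
        found = re.findall('demoid=(\\d+)', html)
        demoIDs.extend(found)
        if len(found) < 25:   # a short page means it was the last one
            break
        offset += 25
    return demoIDs

# Usage with a fake one-page fetcher:
fake = lambda e, o: '<a href="?eventid=%s&offset=%s&demoid=28247">m</a>' % (e, o)
print(getIDs(2713, fake))  # ['28247']
```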
Downloading the Files
As outlined above, to serve a specific demo, HLTV queries
hltv.org/interfaces/download.php with the argument
demoid. Thus, to make the script download the demo file, we must build that URL. Since we have an array of Demo IDs, this is simple to do with a for loop:
counter = 0
for i in range(0, len(demoIDs)):
    url = "http://www.hltv.org/interfaces/download.php?demoid=%s" % (demoIDs[i])
From here we use the same method as before to build an opener and open the URL:
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open(url)
However, HLTV again throws a wrench in the mix. The URL we query does not point directly at the file we need to download, so we must capture the real URL it redirects to by using response.geturl():
finalurl = response.geturl()
filename = finalurl.rsplit('/', 1)[-1]
urllib.urlretrieve(finalurl, directory + "/" + filename)
counter += 1
print "Downloaded %s demos" % (counter)
Once we get the real URL that leads to the compressed demo file, the program splits on the last
/ in the URL to take the text after it. This is the filename: for example, something like
ESLProLeague hellraisers vs penta bo3.rar.
The final URL is passed to the
urllib.urlretrieve() method along with the filename.
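The filename step is worth seeing in isolation. The URL below is made up for illustration:

```python
# Sketch: everything after the last '/' in the final URL is the filename.
finalurl = 'http://www.hltv.org/demos/ESLProLeague_hellraisers_vs_penta_bo3.rar'
filename = finalurl.rsplit('/', 1)[-1]
print(filename)  # ESLProLeague_hellraisers_vs_penta_bo3.rar
```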
Saving the Demo Files
Before the script can download the demos, it needs a place to put them. Before downloading anything, the script uses the
os module to create a folder inside the directory of the script. It takes the user's input and creates a folder of that name: for example, if the script is on the desktop and a user enters
IEM Sydney 2017, the script will save all of the downloaded demos to the new
IEM Sydney 2017 folder.
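A minimal sketch of that folder-creation step, assuming a hypothetical make_event_folder helper rather than the script's exact code:

```python
import os

# Sketch: create a folder named after the user's input, next to the script.
def make_event_folder(name, base='.'):
    directory = os.path.join(base, name)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return directory

# e.g. make_event_folder('IEM Sydney 2017') creates ./IEM Sydney 2017
```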
The code is available under the GNU GPL license in this repository on my GitHub.
To run the code, download and unzip or clone the repository. From there, use your CLI of choice to cd to the directory and run
python Downloader.py. The script will ask for an event ID, as described above. It will then ask for the name of the event and will download all of the relevant demo files.