I started redesigning my website several weeks ago. My objective was to create a centralized hub for sharing written information, code, video, and photography. Most of those problems were easy to solve, and sharing my latest YouTube video was simple at first.
I had a niggling feeling that my new website was on the heavyweight side; after all, it’s WordPress based and I had a few plugins installed. The annoying reCAPTCHA logo was popping up everywhere, even on pages where it wasn’t used. After using the Coverage tab in Chrome and installing yet more WordPress plugins to trim the fat, I tried to get the site down to as small a footprint as I could. Then came Google PageSpeed Insights. Sometimes we are blissfully unaware of our problems and go through life with blinders on; PageSpeed Insights simultaneously woke me up and gave me yet another obsession to chase.
Down the YouTube v3 API rabbit hole
You would think that “show my latest video” is a problem that has already been solved, and that YouTube would have made it a high-priority, easy-to-solve problem. Well, I can tell you that it’s not as easy as it looks, and it is not “solved” in any easy sense. YouTube provides the embedded player, but the embedded player doesn’t use the YouTube v3 API; it uses an older AJAX API that anyone can query. The old API returns JSON, just like the new API, but it doesn’t have a way to constrain the result set to the most interesting data. It literally fetches EVERYTHING about your channel: it returned over 100 videos from my channel, totaling over 79KB for a single REST query. You are beginning to see why the embedded player is so heavy!
The modern way to query data about a YouTube account is the v3 API. It is a relatively lightweight API that can do the same things as the AJAX API, but with better support for querying specific data and limiting the number of results. I learned which API calls I needed to implement my video thumbnail viewer:
- /search endpoint for querying the latest video from my channel
- /channels endpoint for querying my thumbnail URL and channel name
To query the latest video from your channel, you send a request to the /search endpoint with your channel ID, part (which data you want), maxResults (how many results you want), and order (sort by date). This is all pretty straightforward, and you can compose the query using the interactive explorer on the Google API developer site. What isn’t immediately apparent is that this relatively simple piece of information costs 100 API units at minimum, and each part you ask for incurs additional API unit costs. You may be thinking, “but the YouTube embedded player does the same query through the old API and it returns a whole bunch of data, yet there are no restrictions on that?!”, and you would be rational to think that way. Then you may think, “well, 100 units isn’t much; if the embedded player returns a bunch of data and there are millions of page views a day for that player, it has to be less costly to use the more efficient API?”, and here is where logic and reason collide with reality: YouTube only gives you 10,000 API units per day to query the v3 API. When you consider sites like SocialBlade, they must use hundreds of millions of units per day!
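For reference, here is a minimal sketch of that search request in Python. The channel ID and API key are placeholders, not real credentials; the endpoint and parameter names come from the YouTube Data API v3 documentation:

```python
from urllib.parse import urlencode

# Build the search.list request that fetches the newest video for a channel.
# CHANNEL_ID and API_KEY are placeholders for your own values.
BASE = "https://www.googleapis.com/youtube/v3/search"
params = {
    "part": "snippet",         # each additional part adds quota cost
    "channelId": "CHANNEL_ID",
    "maxResults": 1,           # only the single newest video
    "order": "date",           # newest first
    "type": "video",
    "key": "API_KEY",
}
url = BASE + "?" + urlencode(params)
print(url)
```

The response is JSON; the video ID you want for a thumbnail link is under `items[0].id.videoId`.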
10,000 units ought to be enough for everyone
As a new YouTube v3 API user, it seemed contradictory to me that the quota per day was 10,000 units, yet the quota per user per 100 seconds was 300,000 units, and the overall quota per 100 seconds was 3,000,000 units?! Why are the per-user and per-100-second quotas more than an order of magnitude larger than the daily quota? Obviously you can’t exceed 300,000 units in 100 seconds as a single user, because you’d exhaust all of your daily units before that quota even triggered! I think you can see the dilemma: if a simple query like “tell me the newest video ID for this channel” costs at least 100 units, that limits you to just 100 queries per day. I am the single most active visitor to my own site, doing site development and writing articles, and I easily exhausted my API quota two days in a row just from developing the embedded YouTube thumbnailer.
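To put the numbers in perspective, a quick back-of-the-envelope calculation. The 100-unit minimum for /search is from the article above; the 1-unit cost I assume for a basic /channels request is typical for simple list calls:

```python
DAILY_QUOTA = 10_000     # default YouTube v3 API daily budget
SEARCH_COST = 100        # minimum cost of one /search query
CHANNELS_COST = 1        # assumed cost of one basic /channels query

cost_per_page_view = SEARCH_COST + CHANNELS_COST
views_per_day = DAILY_QUOTA // cost_per_page_view
print(views_per_day)  # 99 uncached page views exhaust the quota
```

Fewer than a hundred uncached page loads per day, and that’s before any development or testing traffic.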
It is possible to ask for an increase in your API quota, but the process is onerous and not streamlined. It should be simple to automate quota increases and then reactively limit abusers. I put in a request to increase my quota tenfold, but I’m not holding my breath. When I’m faced with a limitation or challenge, I look for ways to overcome or work around it on my own. This quota limitation was something I could work around, but not overcome, on my own. I broke down my needs and concluded that I don’t upload YouTube videos often, and I didn’t need my website to show my latest video the moment it became available. Relaxing the requirements leads to a few possible solutions:
- Create a page on my site with a static thumbnail and link, update this page when I upload a video
- Create a script that draws data from a file that I update periodically with a cron job
- Remove my video thumbnail altogether
- Create a database-backed API cache that automatically queries the YouTube v3 API, processes the JSON queries, and stores the JSON results
I think you know where this is heading; I wouldn’t write an article about a plebeian solution such as static HTML or a cron job. I wrote a PHP script that proxies the YouTube v3 API and caches the results in a MySQL database.
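The core cache-or-fetch flow is simple. My implementation is PHP and MySQL, but the idea can be sketched in Python with an in-memory SQLite table and a stand-in fetch function; the names here are illustrative, not the ones in my actual script:

```python
import json
import sqlite3
import time

TTL = 3600  # seconds before a cached response is considered stale

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE api_cache (cache_key TEXT PRIMARY KEY,"
    " response TEXT, fetched_at INTEGER)"
)

def fetch_from_youtube(cache_key):
    # Stand-in for the real HTTPS request to the YouTube v3 API.
    return json.dumps({"items": [{"id": {"videoId": "PLACEHOLDER"}}]})

def cached_query(cache_key):
    row = db.execute(
        "SELECT response, fetched_at FROM api_cache WHERE cache_key = ?",
        (cache_key,),
    ).fetchone()
    now = int(time.time())
    if row and now - row[1] < TTL:
        return row[0]  # fresh enough: serve straight from the cache
    # Stale or missing: hit the API once and store the result.
    response = fetch_from_youtube(cache_key)
    db.execute(
        "REPLACE INTO api_cache (cache_key, response, fetched_at)"
        " VALUES (?, ?, ?)",
        (cache_key, response, now),
    )
    return response

print(cached_query("search:latest"))
```

Every visitor after the first within the TTL window costs zero API units, which turns the 100-unit /search query into at most 24 queries per day per cached request.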
Overview of the API proxy cache
I wanted the API cache to be as simple as possible, which meant it had to be a single-file service that ran on my site and used the same database instance as my WordPress install. I also wanted to make it clean enough to publish on my GitHub account (without accidentally divulging sensitive data). I took the approach that the cache wouldn’t hold an API key of its own; it proxies the API key sent by the JS client, a restricted key protected by the referrer restriction registered in the Google Developer Console.
The API cache is two PHP files: one stores protected credentials for accessing the database, and the other is the functional code. The main PHP script expects the user and database to already exist, but will create the table schema if it doesn’t find the table. The caching is somewhat novel in that it leverages the JSON datatype introduced in MySQL 5.7, along with proper indexing, so queries are lower impact.
MySQL 5.7 introduced the JSON datatype. It looks like a text string from the outside, but the server can understand queries against the data stored inside the JSON blob. You would be right in thinking that an opaque datatype will result in a table scan if you perform a JSON query, but that’s only true if you don’t have an index on the data. MySQL 5.7 lets you create a VIRTUAL column from data in a JSON column, which allows you to break out key/value pairs from the JSON data and create indexes on them. Indexes store static strings of text, so when you insert or modify a row, the VIRTUAL column values are extracted and stored in the index tree. This approach lets you create indexes on JSON data, which is exactly what I did.
The YouTube v3 API only has a few keys I can rely on for each query: part, API key, and API endpoint. I created the API endpoint as a native varchar column, while the part and API key are VIRTUAL columns generated from the JSON query data. I then created a compound index on (endpoint, part, apikey); that tuple is not unique, but it lets me distinguish between different endpoint and part queries. The result is a two-stage lookup: the index narrows the table to a subset of rows, which is then scanned to produce the final result. It’s not the most elegant solution, but better than no index at all.
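A sketch of what that schema looks like as MySQL 5.7 DDL; the table and column names here are illustrative, not necessarily the ones in my published script:

```sql
CREATE TABLE api_cache (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    endpoint VARCHAR(64) NOT NULL,   -- native column, e.g. 'search'
    query    JSON NOT NULL,          -- the original request parameters
    response JSON NOT NULL,          -- the cached API response
    fetched  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    -- VIRTUAL columns extracted from the JSON query data
    part     VARCHAR(64)  GENERATED ALWAYS AS (query->>'$.part') VIRTUAL,
    apikey   VARCHAR(128) GENERATED ALWAYS AS (query->>'$.key') VIRTUAL,
    -- compound index: not unique, but narrows the scan to matching queries
    KEY idx_endpoint_part_key (endpoint, part, apikey)
);
```

The `->>` operator is shorthand for `JSON_UNQUOTE(JSON_EXTRACT(...))`, so the VIRTUAL columns hold plain strings that the secondary index can store.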
As written, the API cache refreshes any element every 3600 seconds, but adding a per-query cache lifetime would be trivial and might be something I do in the future if I expand my use of the YouTube v3 API.
You can find the code for my YouTube v3 API cache and my embedded YouTube thumbnailer below.