What Do We Have to Work With?
Key | Type | Description |
---|---|---|
key_id | uint64 | A unique identifier across all drawings. |
word | string | Category the player was prompted to draw. |
recognized | boolean | Whether the word was recognized by the game. |
timestamp | datetime | When the drawing was created. |
countrycode | string | A two-letter country code of where the player was located. |
drawing | string | A JSON array representing the vector drawing. |
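To make the schema concrete, here is a minimal sketch of turning one record into typed Python values. It assumes the record arrives as a dictionary of strings (as a CSV reader would produce), that `recognized` is spelled `True`/`False`, and that `timestamp` follows the `YYYY-MM-DD HH:MM:SS.ffffff` pattern. The `parse_row` helper and the sample values are made up for illustration and are not taken from the dataset.

```python
import json
from datetime import datetime


def parse_row(row):
    """Convert one record of strings into typed values (hypothetical helper)."""
    return {
        "key_id": int(row["key_id"]),
        "word": row["word"],
        # Assumption: booleans are serialized as the text "True" / "False".
        "recognized": row["recognized"].strip().lower() == "true",
        # Assumption: timestamps look like "2017-03-01 20:11:36.123456".
        "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S.%f"),
        "countrycode": row["countrycode"],
        # The drawing column holds a JSON array stored as a string.
        "drawing": json.loads(row["drawing"]),
    }


# Illustrative record only; every value here is invented.
sample_row = {
    "key_id": "5891796615823360",
    "word": "airplane",
    "recognized": "True",
    "timestamp": "2017-03-01 20:11:36.123456",
    "countrycode": "US",
    "drawing": "[[[10, 50, 90], [20, 25, 20]], [[50, 50], [5, 45]]]",
}

parsed = parse_row(sample_row)
print(parsed["word"], parsed["countrycode"], len(parsed["drawing"]), "strokes")
```

If the data is read from a format that already carries native types, most of these conversions become unnecessary and only the `drawing` field may still need decoding.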
Digging Deeper
We identify that the most important features are no doubt the word, the drawing itself, and the countrycode. The word would naturally be highly correlated with the drawing; after all, the drawing is a literal manifestation of the word based on the player's imagination and drawing skills. As such, it would be trivial to train some form of artificial intelligence to recognize the topic being drawn. That is a task already solved.
If we want to infer any kind of demographic information (such as country) from the drawing alone, that would sound like an impossible task. However, should the category be given beforehand and the agent be provided with many examples of this category to learn from, it is conceivable that the agent could glean some degree of information about the player's demographics, however little.
Let us explore each feature in more depth to see what obstacles they may yet present.
Feature One: word
Our team tallies a total of 345 different categories for the doodles, ranging anywhere from airplanes ✈️ to zebras 🦓.
So how are these categories distributed? Do we have the same number of drawings for each category? We will answer that by plotting the number of drawings per category with a bubble chart. The size of bubbles reflects the number of drawings in that category.
We can see from the bubble chart that snowman, banana, calendar, potato, marker, and yoga are among the more popular categories. We observe a small degree of variance among the categories. While this variance is not very large, it may still present a potential hazard if left unnoticed.
We believe the dataset is this balanced because the collection mechanism is controlled by the Google QuickDraw team. They have complete control over which categories they wish to collect.
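The tally behind a chart like this can be sketched in a few lines. The example below counts drawings per category and reports the ratio between the largest and smallest category as a rough balance check. The word list is a toy stand-in rather than real data, and `drawings_per_category` and `imbalance_ratio` are hypothetical helper names.

```python
from collections import Counter


def drawings_per_category(words):
    """Count how many drawings were collected for each category."""
    return Counter(words)


def imbalance_ratio(counts):
    """Largest category size divided by the smallest (1.0 means perfectly balanced)."""
    return max(counts.values()) / min(counts.values())


# Toy input: one entry per drawing, holding that drawing's `word`.
words = ["snowman", "zebra", "snowman", "airplane", "snowman", "zebra"]

counts = drawings_per_category(words)
for word, n in counts.most_common():
    print(f"{word}: {n}")
print("largest / smallest:", imbalance_ratio(counts))
```

A ratio close to 1 suggests the categories are evenly represented; the counts themselves could feed the bubble sizes of a chart.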
Feature Two: drawing
Doodle of an airplane:
Doodle of a zebra:
Some doodles are quite articulate and we could easily tell what the author was drawing. On the other hand, some are... rather simplistic, and we would be hard-pressed to actually figure out what the topic was.
But can you really tell what country a player is from just by looking at their doodle? At this point we may be tempted to say no, and conclude that this exploration was fruitless after all. Is there really any hope left?
Before plotting these out, we might have thought that the drawings were just equivalent to a simple 2D image, perhaps just an outline of a sketch. After plotting them out as a time sequence, however, we quickly gain the insight that there are yet more features to be uncovered. The number of strokes, the order of the strokes, the amount of time spent, the amount of detail, and the intricacies within each stroke may all be compelling evidence that helps us glean more information (perhaps cultural habits) about the player.
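A few of the stroke-level features mentioned above can be extracted with a short helper. This is a sketch under a stated assumption: each stroke is a list of the form `[xs, ys, ts]`, where `ts` holds per-point times in milliseconds. If the drawings carry only `[xs, ys]`, the duration cannot be recovered and is reported as `None`. The function name `stroke_features` and the sample drawing are invented for illustration.

```python
def stroke_features(drawing):
    """Summarize a vector drawing given as a list of strokes.

    Assumption: each stroke is [xs, ys] or [xs, ys, ts], with ts in milliseconds.
    """
    points_per_stroke = [len(stroke[0]) for stroke in drawing]

    # Only strokes that carry a non-empty time axis contribute to the duration.
    timed = [stroke for stroke in drawing if len(stroke) >= 3 and stroke[2]]
    if timed:
        start = min(stroke[2][0] for stroke in timed)
        end = max(stroke[2][-1] for stroke in timed)
        duration_ms = end - start
    else:
        duration_ms = None

    return {
        "n_strokes": len(drawing),
        "n_points": sum(points_per_stroke),
        # Kept in drawing order, so the sequence of strokes is preserved.
        "points_per_stroke": points_per_stroke,
        "duration_ms": duration_ms,
    }


# Made-up drawing: two strokes with x, y, and time arrays.
drawing = [
    [[0, 10, 20], [0, 5, 0], [0, 40, 90]],
    [[5, 5], [10, 30], [300, 420]],
]
features = stroke_features(drawing)
print(features)
```

The point count per stroke is only a crude proxy for the amount of detail; richer descriptors (curvature, pauses between strokes) would need their own definitions.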
Feature Three: countrycode
Where are the players from anyway?
To show which countries the players are from, we plot the country distribution as a sorted bar chart. You can also hover your mouse over each bar to see the full country name and the number of drawings contributed by that country.
We observe a staggering number from the United States, trumping the rest of the countries by a huge margin. In fact, the United States alone contributed more than 40% of the entire dataset! Talk about contribution.
On one hand, it's great news that we have such a large dataset to work with. On the other hand, it is terrible that our dataset is so skewed. This poses a serious challenge to building an effective model: some groups will have more than enough data, while other groups will have extremely little in comparison. If left unaddressed, this single problem is enough to bias any machine learning model strongly towards predicting the over-represented groups.
We must come up with some form of clustering to aggregate enough data points for the rest of the world. In the next plot, we group the circles by continent, with the size of each circle proportional to the logarithm of the number of submissions from that country. Hover your mouse over the bubbles to see details.
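To see why the skew matters, consider a trivial baseline that always predicts the most common country: its accuracy equals that country's share of the data, so a lazy model can look deceptively good without learning anything from the drawings. The sketch below computes that share. The label counts are toy numbers chosen for illustration, not figures from the dataset, and `class_shares` is a hypothetical helper.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's fraction of the whole dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy labels (invented): one country code per drawing.
labels = ["US"] * 45 + ["GB"] * 20 + ["CA"] * 15 + ["DE"] * 12 + ["BR"] * 8

shares = class_shares(labels)
majority_label, majority_share = max(shares.items(), key=lambda kv: kv[1])

# A model that always answers `majority_label` is right this often.
print(f"always predicting {majority_label} scores {majority_share:.0%} accuracy")
```

Any real model should therefore be judged against this baseline, or with a metric that weighs every group equally, rather than on raw accuracy alone.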
- North America:
- South America:
- Europe:
- Africa:
- Asia:
- Oceania:
We observe that through this method of clustering, if we were to bunch up the countries into their respective continents, we can realistically deal with the skew in the data!
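The grouping described above amounts to mapping each country code to a continent and summing the counts. A minimal sketch follows; the lookup table is deliberately partial and hand-written for illustration, the per-country counts are invented, and `totals_by_continent` is a hypothetical helper. Codes missing from the table are collected under `"Unknown"` instead of being dropped silently.

```python
from collections import defaultdict

# Partial, hand-written lookup for illustration; a real one would cover every code.
CONTINENT_OF = {
    "US": "North America",
    "CA": "North America",
    "BR": "South America",
    "GB": "Europe",
    "DE": "Europe",
    "ZA": "Africa",
    "JP": "Asia",
    "AU": "Oceania",
}


def totals_by_continent(country_counts, lookup=CONTINENT_OF):
    """Sum per-country drawing counts into per-continent totals."""
    totals = defaultdict(int)
    for code, n in country_counts.items():
        totals[lookup.get(code, "Unknown")] += n
    return dict(totals)


# Invented counts, not taken from the dataset. "ZZ" is a placeholder unmapped code.
country_counts = {
    "US": 4000, "CA": 350, "BR": 150, "GB": 700,
    "DE": 300, "ZA": 40, "JP": 120, "AU": 130, "ZZ": 5,
}

totals = totals_by_continent(country_counts)
for continent, n in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{continent}: {n}")
```

Each drawing's continent can then serve as the prediction target in place of its country code.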
This also means that instead of guessing the country of the player, we will be aiming for lower-hanging fruit: predicting the continent that the player is associated with.