May 10, 2016 by
Alexi Hawk's Impossible Data Set
This was originally posted on Blogger here.
As the author of the only unsolved puzzle in the DBIR Cover Challenge this year, I figured I should provide a bit of a write-up. I'll apologize to all of the cover challenge participants, as it's quite literally 10 lines of code to solve, only two of which are actually functional (vs. loading packages and naming stuff).
The Idea
First, where the puzzle came from. I wanted to have a data-y puzzle in the challenge, but I also wanted it to be challenging for data science-y people. To that end, I suggested, and the team approved, a puzzle based on a dataset, but with a twist: the solution would not come from analyzing the data statistically. Even then, our estimate going in was that it was the hardest puzzle of the bunch and likely wouldn't be solved.
The Setup
To create the puzzle, I used GIMP to create a raster image with the key text. I then opened the image in Python using the PIL package, which lets you step through the individual pixels and read each one's RGB value. I took all the pixels with RGB values less than 10 (i.e. black) and saved them as a CSV of (x, y) coordinates.
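A minimal sketch of that pixel filter, using only the Python standard library. It assumes the image has already been read into a flat, row-major list of (R, G, B) tuples (the kind of sequence PIL can hand back for an image) and that "less than 10" means all three channels are below 10. The helper names `dark_pixel_coords` and `to_csv` are made up for illustration and are not from the original script.

```python
import csv
import io


def dark_pixel_coords(pixels, width, threshold=10):
    """Return (x, y) for every pixel whose R, G and B are all below threshold.

    `pixels` is assumed to be a flat, row-major sequence of (R, G, B) tuples.
    """
    coords = []
    for index, (r, g, b) in enumerate(pixels):
        if r < threshold and g < threshold and b < threshold:
            # Row-major layout: index = y * width + x
            coords.append((index % width, index // width))
    return coords


def to_csv(coords):
    """Serialise the coordinates as CSV text with an x,y header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["x", "y"])
    writer.writerows(coords)
    return buffer.getvalue()


# Tiny 3x2 stand-in image: white background with two black pixels.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
demo_pixels = [WHITE, BLACK, WHITE,
               WHITE, WHITE, BLACK]
demo_coords = dark_pixel_coords(demo_pixels, width=3)
print(to_csv(demo_coords))
```

On a real image the same function would be fed the full pixel list and the image width, and the CSV text would be written to a file rather than printed.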
From there I transferred it to R. Since each point is a single pixel (i.e. neighboring points sit closer together than the size of a circle drawn at that location), I filtered down to 10% of the points. Now, the first thing a good data scientist does is look at the data, so we can't have it be that obvious. Instead, I added a third column of random values in the range of the first two columns, then swapped the first and third columns. If a scatter plot of the original data was like looking at the text straight on, a scatter plot of the first two columns now shows each pixel's vertical location against a completely random horizontal location.
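The original step was done in R and its code isn't shown, so what follows is only a rough Python sketch of the same idea. It assumes the 10% was drawn as a simple random sample and that the noise column is uniform between the smallest and largest values found in the x and y columns; `obfuscate` and its parameters are hypothetical names.

```python
import random


def obfuscate(points, keep_fraction=0.10, seed=None):
    """Subsample (x, y) points, add a random column, swap columns 1 and 3."""
    rng = random.Random(seed)

    # Keep roughly 10% of the pixels so plotted markers don't overlap.
    keep_count = max(1, round(len(points) * keep_fraction))
    kept = rng.sample(points, keep_count)

    # Assumed meaning of "in the range of the first two columns":
    # uniform noise between the overall min and max of x and y.
    low = min(min(x, y) for x, y in kept)
    high = max(max(x, y) for x, y in kept)

    rows = []
    for x, y in kept:
        noise = rng.uniform(low, high)
        # Unswapped order would be (x, y, noise); swapping the first and
        # third columns hides the real x behind the noise column.
        rows.append((noise, y, x))
    return rows


# Stand-in data: a 20x10 block of pixel coordinates.
demo_points = [(x, y) for x in range(20) for y in range(10)]
demo_rows = obfuscate(demo_points, seed=1)
```

Plotting the first two columns of `demo_rows` gives the real vertical position against noise, while the second and third columns together still hold the picture.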
As we discussed the puzzle, someone else had suggested doing something with polar coordinates, so I did just that: I converted the current Cartesian coordinates into spherical form. (Hopefully all the hints about spheres and looking at the ranges now make sense: two columns, the angles in radians, range from about 0 to 1.6, and one, the vector length, ranges from 0 to about 500.)
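For readers who want to see the conversion spelled out, here is a small Python sketch of a Cartesian-to-spherical round trip. It assumes the azimuth/elevation convention (angle in the x-y plane, then angle up from that plane, then length), which is my understanding of what pracma's `cart2sph` and `sph2cart` use; treat that as an assumption rather than a statement about the library. The demo point is invented.

```python
import math


def cart2sph(x, y, z):
    """Cartesian -> (azimuth, elevation, radius), angles in radians."""
    theta = math.atan2(y, x)                # azimuth in the x-y plane
    phi = math.atan2(z, math.hypot(x, y))   # elevation above the x-y plane
    r = math.sqrt(x * x + y * y + z * z)    # vector length
    return theta, phi, r


def sph2cart(theta, phi, r):
    """Inverse of cart2sph under the same convention."""
    x = r * math.cos(phi) * math.cos(theta)
    y = r * math.cos(phi) * math.sin(theta)
    z = r * math.sin(phi)
    return x, y, z


# Made-up point with non-negative coordinates, like pixel positions.
demo_point = (120.5, 40.0, 310.0)
theta, phi, r = cart2sph(*demo_point)
recovered = sph2cart(theta, phi, r)
```

With non-negative coordinates both angles land between 0 and π/2 (about 1.57), which lines up with the "about 0 to 1.6" range mentioned above.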
The Payout
So, the solution to the dataset (in R) is as follows:
    # Read in the file
    alexi <- read.csv("http://cybercdc.global/static/alexi.csv")
    # Convert each from spherical coordinates to cartesian
    library(pracma)
    back <- apply(alexi, MARGIN=1, sph2cart)
(At this point if you look at the data you'll notice two int rows and one numeric. That wasn't intended and gives away the correct rows a bit.)
    # The output of above is 1 point per column. Change it to rows using the 't' (transpose) command.
    back <- t(back)
    # Convert it back to a dataframe to make it easily plottable with ggplot
    back <- as.data.frame(back)
    # Give it column names to make it easy to refer to
    names(back) <- c("V1", "V2", "V3")
    # Scatter plot the correct two dimensions to view the data
    library(ggplot2)
    ggplot(as.data.frame(back)) + aes(x=V3, y=V2) + geom_point()
You may have to squish the vertical dimension a bit to read the text, but you'll see it.