May 10, 2016 by

Alexi Hawk's Impossible Data Set

blog-feature-image

This was originally posted on blogger here.

As the author of the only unsolved puzzle in the DBIR Cover Challenge this year, I figured I should provide a bit of a write up.  I'll apologize to all of the cover challenge participants as it's quite literally 10 lines of code to solve,  only two of which are actually functional (vs loading packages and naming stuff).

The Idea

First, where the puzzle came from.  I wanted to have a data-y puzzle in the challenge, but I also wanted it to be challenging for data science-y people.  To that end, I suggested, and the team approved, a puzzle based on a dataset, but with a twist.  The solution would not be from analyzing the data statistically.  Even then, our estimate going in was that it was the hardest puzzle of the bunch and likely wouldn't be solved.

The Setup

To create the puzzle, I used gimp to create a raster image with the key text.  I then opened the image in python using the PIL package.  It lets you parse through each of the individual pixels and determine its RGB.   I took all the pixels with RGB less than 10 (i.e. black) and saved them as a csv of (x, y) coordinates.

From there I transferred it to R.  Since each point is a pixel (i.e. closer than the size of a circle drawn at that location), I filtered down to 10% of the points.  Now, the first thing a good data scientist does is looks at the data, so we can't have it be that obvious.  Instead, I added a third column with random points in the range of the first two columns.  Then I swapped the first and third column.  If creating a scatter plot of the data would have been looking straight on, now doing so (on the first two columns) is like looking at the vertical location and a completely random horizontal location of the pixel.  

As we discussed the puzzle, someone else had suggested doing something with polar coordinates.  So I did just that.  I converted the current cartesian coordinates into spherical form.  (Hopefully all the hints about spheres and looking at the ranges now make sense as two columns, the angles in radians, range from about 0 to 1.6 and one, the vector length, ranges from 0 to about 500).

The Payout

So, the solution to the dataset (in R) is as follows:
# Read in the file
alexi <- read.csv("http://cybercdc.global/static/alexi.csv")

# Convert each from spherical coordinates to cartesian
library(pracma)
back <- apply(alexi, MARGIN=1, sph2cart) 
(At this point if you look at the data you'll notice two int rows and one numeric.  That wasn't intended and gives away the correct rows a bit.)
# The output of above is 1 point per column.  Change it to rows using the 't' (transpose) command. 
back <- t(back)
# Convert it back to a dataframe to make it easily plottable with ggplot
back <- as.data.frame(back)

# Give it column names to make it easy to refer to
names(back) <- c("V1", "V2", "V3")
# Scatter plot the correct two dimensions to view the data
library(ggplot2)
ggplot(as.data.frame(back)) + aes(x=V3, y=V2) + geom_point()


You may have to squish the vertical dimension a bit to read the text, but you'll see it.

Prologue

Incidentally, I actually tested that the spherical points to make sure that they wouldn't reveal the clue when visualized directly.  The sampling had to be adjusted so that if you visualized columns V1 and V3, it didn't reveal the activation text.

LET’S WORK TOGETHER