# Sunday, March 18, 2007

Just over a week ago, I came across this posting on the 37Signals blog that discusses some of the resources they used to populate testing databases for their new product, Highrise. Given that this product is a contact manager, they wanted contact names with details... and lots of 'em. In the comments to that post, "Jes" mentioned yet another resource -- the "Fake Name Generator" web site. He mentioned that you get full contact details for a fake identity and that you could get up to 20,000 for free. Hmm.

This interested me because I always like getting hold of useful data to tinker with on side projects. One of my passions in development is for data visualization, or "infoporn," so the more data to look at, the better. I've downloaded data that includes the Netflix Prize data set, the Enron internal emails released by FERC, and geo-coded zipcode lists. You never know what might be useful, right?

But now you're thinking... "if those contacts are fake, then why would they be interesting?"

The reason is that the person/people behind the Fake Name Generator have gone out of their way to make it credible-looking fake data. For example,

  • The cities match the states.
  • The zip codes match the cities.
  • The area codes (mostly) match the zip codes (I found a Bakersfield area code with an LA zip code).
  • The names are more than just random letters and resemble names you'd find in any US-based list of contacts.

Having a set of data like this greatly improves the testing of code that works with contact details. Who among us developers hasn't created fake records for "Donald Duck", "John Smith", and "Joe Blow"?

My understanding is that the data is created from various legitimate sources, but the values across columns are randomized -- so that someone's real first name is used with someone else's last name, someone else's address, someone else's city, and so on. A few searches turn up other discussions of this data, including a set of contacts uploaded to Swivel.

The data is provided free for up to 20,000 fake identities, provided that you're willing to wait up to a week to download your data. If you need it sooner, you can pay $10 US to expedite the process.

A few other cool things about this service:

  • You can specify which columns you'd like in your data, including credit card numbers (fake - but numerically valid), SSN/National ID numbers (also fake - but numerically valid), and gender.
  • Email addresses use domains from various temporary email services (mailinator.com, mytrashmail.com, etc). Again, they validate but aren't useful as anything other than test data.
  • You can get the data in various formats, including HTML, Excel XLS, SQL script, or delimited text files.
  • You can specify the countries and name types for your data... so if you need some data that includes Swiss addresses and Hispanic name sets, you could request it.
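
For the curious: when a card number is called "numerically valid," that usually means it passes the Luhn checksum. I'm assuming that's what the site means here, since it doesn't say so in anything quoted above. A minimal sketch of the check (the function name `luhn_valid` is my own, and the sample number is a widely used test value, not something from the downloaded data):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Spaces, dashes, and other non-digit characters are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(luhn_valid("4111 1111 1111 1112"))
```

Passing this check only means the number is well-formed, which is exactly what you want from test data: it gets past form validation without belonging to anyone.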

I also found the data to be reasonably well distributed, at least in the US-centric set of data I received. For example, across 20,000 contacts, I found:

  • The bulk of addresses were in California, Texas, and New York. The fewest were in Wyoming, Delaware, and New Hampshire. I had one record whose state was 'NN' -- ??
  • Most surnames started with the letters M, S, and B. The letters with fewest surnames were X, Q, and U.
  • The zipcode with the largest set of contacts was 90017 (Los Angeles), but the Area Code with the most contacts was 703 (Virginia). As I dug in further, it seemed somewhat logical because the LA area has numerous area codes spread across it.
  • Social security numbers had starting numbers that were evenly distributed from 0 to 6 (2500-3500 each), with just 700 of them beginning with the number 7. There were none that started with the number 8 or 9. I learned from this CodeProject article that SSNs beginning with 9 are reserved for special government use (Witness Protection, I'm sure... hah!), but I'm not sure why there were none starting with an 8.
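
Tallies like the ones above are easy to reproduce if you request the delimited text format. Here's a rough sketch using Python's `collections.Counter`. The column names (`Surname`, `State`, `SSN`) and the sample rows are made up for illustration; the real file's headers depend on which columns you request.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the downloaded file (hypothetical headers).
SAMPLE = """GivenName,Surname,State,ZipCode,SSN
Ann,Miller,CA,90017,123-45-6789
Bob,Stone,CA,90017,234-56-7890
Cy,Baker,TX,75201,345-67-8901
Di,Moore,NY,10001,123-00-0000
"""


def tally(rows, key_func):
    """Count how many rows fall under each value returned by key_func."""
    return Counter(key_func(row) for row in rows)


# For a real download, swap io.StringIO(SAMPLE) for open("yourfile.csv", newline="").
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

by_state = tally(rows, lambda r: r["State"])
by_initial = tally(rows, lambda r: r["Surname"][0].upper())
by_ssn_digit = tally(rows, lambda r: r["SSN"][0])

if __name__ == "__main__":
    print(by_state.most_common(3))
    print(by_initial.most_common(3))
    print(sorted(by_ssn_digit.items()))
```

`most_common()` gives you the top of the list; sorting the counter's items shows the full spread, which is how oddities like a stray 'NN' state would surface.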

Anyway, I've been impressed. It's an interesting service and seems worth bookmarking/tagging the site for later... you never know when you'll need a bunch of bogus (but real looking!) data.

Note: I've got no affiliation with this site whatsoever, aside from requesting a set of 20K fake identities and getting an email with download details a week later.


posted on Sunday, March 18, 2007 7:22 PM Mountain Daylight Time

# Tuesday, March 13, 2007

Jeff Atwood, of Coding Horror, is on a blogging tear lately... I don't know how he manages to knock out such frequent posts on such consistently interesting topics. Today, I read his post on building your own hardware (with an interesting intro on how Google's servers have always been custom machines).

I've modified my machines in the past, adding RAM or drives here and there, but I've never built a machine from the basic components. For the last several years, I've purchased Dell machines (often from their Outlet, with great results) and I've never had a problem with their quality (and have yet to need customer support, knock wood). Prior to that, I'd purchased Toshibas, Microns, and beige-box generic machines from local vendors.

That said, I'm not opposed to building my own machine. I can certainly connect the parts and troubleshoot various issues. So why don't I? Because I'm scared. That's right... I'm afraid.

My primary concern with building a machine from scratch is all the fine print I see in hardware compatibility. Whenever I read detailed specs or reviews for hardware components, I get the impression that it's VERY easy to build a door-stop. Front-side bus speeds here, parity errors there, chipset compatibility back here, and so on. And tracking down those types of problems scares the bejeebus out of me. I know how to debug software. I can find and fix memory leaks. But random reboots or POST errors? Cripes, where do I begin?

Reading Jeff's post earlier today, it struck me that there ought to be a way for a guy like him, who really follows the hardware world and enjoys spec'ing out machines, to make a little cash at it. Not enough to retire to the beach and I doubt Michael Dell will lose any sleep -- but if it's easy to set up, doesn't require any support, and it's something you're already interested in... why not?

On a whim, I checked NewEgg to see if they have an affiliate program... and sure enough, they do. I think it'd be awesome for Jeff (or a similar hardware guru) to spec out a few machines on his site.

My primary machines for the last couple of years have been laptops. Currently, my work machine and personal machine are both fully-loaded Dell Inspiron 9400s -- a back-breaking desktop replacement that's very fast. The two are only physically distinguishable by a little fish sticker I let my daughter put on my personal machine. And I'm generally happy with these machines... but who has just one or two machines? Our house also has the aging "Wife Laptop" (due to be replaced with a Tablet, I think), an old file-server, and my music-and-video production machine (not to mention a couple of Linux boxes, a.k.a. TiVos).

But... show me the list of parts to build a "Little Bang" machine, a Media Center PC, a Windows Home Server box, the end-result of the Hanselman Developer Machine, etc. I've purchased and assembled electronics kits in the past and it's great fun... mostly because the parts-list and compatibility issues are taken care of for me and I can focus on the actual building part.

You'd want to make it clear that you're NOT supporting these parts or the resulting machine. It's strictly "do it yourself" and "at your own risk"... but if it's a parts list from someone who actually groks this stuff, I'd be happy to add those parts to my NewEgg shopping cart using your affiliate links.

There's lots of precedent and you have a decent audience of gear-heads... so what do ya say, Jeff?

Please?  ;-)


posted on Tuesday, March 13, 2007 11:42 PM Mountain Daylight Time