👤Anon84🕑12y🔼53🗨️17

(Replying to PARENT post)

Oh boy, this is going to end badly. Remember when AOL released a huge anonymized dataset of their searches? People were still identified because naturally people searched for their own names, the names of their friends and families, local businesses, personal websites, etc.

This dataset is even worse, since it includes both the referrer and the destination.

Keep in mind that websites often put usernames right in the URL.

E.g.: http://www.facebook.com/Your.Name

http://www.reddit.com/user/USERNAME/

http://slashdot.org/~USERNAME

http://news.ycombinator.com/user?id=USERNAME

So no matter how much you think you have it anonymized, a person's browsing history could reveal a lot more than you think.
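To make that concrete, here is a rough sketch of how little effort it takes to pull usernames out of logged URLs. It only knows the four URL shapes listed above; the function names `extract_username` and `leaked_usernames` are made up for illustration, and the Facebook rule in particular is a crude heuristic that would also match non-profile pages.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_username(url):
    """Return (site, username) if the URL matches one of the four
    example patterns from the comment above, else None.
    Hypothetical helper, not part of any real tool."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path

    if host == "facebook.com":
        # Crude: treats any single path segment as a profile name,
        # so pages like /login.php would be false positives.
        m = re.fullmatch(r"/([A-Za-z0-9.]+)/?", path)
        return (host, m.group(1)) if m else None
    if host == "reddit.com":
        m = re.fullmatch(r"/user/([^/]+)/?", path)
        return (host, m.group(1)) if m else None
    if host == "slashdot.org":
        m = re.fullmatch(r"/~([^/]+)/?", path)
        return (host, m.group(1)) if m else None
    if host == "news.ycombinator.com" and path == "/user":
        ids = parse_qs(parsed.query).get("id")
        return (host, ids[0]) if ids else None
    return None


def leaked_usernames(clicks):
    """Scan (referrer, destination) pairs and collect every
    (site, username) found on either side of a click."""
    found = set()
    for referrer, destination in clicks:
        for url in (referrer, destination):
            hit = extract_username(url)
            if hit:
                found.add(hit)
    return found
```

Since both sides of each click are logged, a profile URL showing up as a referrer is enough to tie the username to whatever was visited next.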

👤nivla🕑12y🔼0🗨️0

(Replying to PARENT post)

I'm really surprised anyone would take the risk to release data like this, even with their security protocols in place. It just doesn't seem worth it:

- The potential upside is a few citations in research papers.

- The potential downside is a wide-scale invasion of privacy of IU students and staff, and a huge PR disaster.

👤IvyMike🕑12y🔼0🗨️0

(Replying to PARENT post)

This shit is going to be available on TPB before I can even click 'add comment'.
👤weareconvo🕑12y🔼0🗨️0

(Replying to PARENT post)

Marc Smith at Microsoft Research had a Usenet DB for research porpoises created about 6 years ago, and provided it to any researchers who wanted it. Although I didn't care about Usenet for my stuff, it was a good and useful offering for various researchers, and I hope this newer DB also proves useful! Thanks to Indiana for going to the trouble.
👤triplesec🕑12y🔼0🗨️0

(Replying to PARENT post)

How did they collect this data without someone raising privacy flags? Releasing this data is almost certainly a bad idea, since it will likely reveal who the people are who made those requests. Anonymized data usually isn't.
👤afhof🕑12y🔼0🗨️0

(Replying to PARENT post)

If you are interested in this kind of data, it's worth noting that there are some older but, in a sense, more manageable datasets at the Internet Traffic Archive [1]. The data there can be downloaded and does not need to be physically shipped through the post.

The largest dataset consists of 1.3 billion requests (for the 1998 World Cup website).

[1] http://ita.ee.lbl.gov/html/traces.html

👤kmregan🕑12y🔼0🗨️0

(Replying to PARENT post)

Can someone actually post the real data?
👤berlinbrown🕑12y🔼0🗨️0