Learning Lists

Nine free, brilliant resources to learn data mining

I’m a big fan of playing with data.

In my earlier corporate life, I often used Excel to look through thousands of lines of spreadsheet goodness. I assumed what I was doing was “big data”, and I prided myself on my association with a trendy buzzword.

I know better now. A lot better.

If you’ve ventured here, you’re probably looking into data science, the mysterious science that seems to verge on mysticism in the press. The virtues of data are constantly praised as innovative and disruptive. Data science seems like the domain of an exclusive few practitioners who turn numbers into actionable insight.

Harvard Business Review went as far as saying that data scientist was the sexiest job of the 21st century.

It seems that data scientists create many of the most exciting projects at the cutting-edge of technology. The people you may know on LinkedIn appear thanks to data mining. Amazon’s book recommendations rely on computers to mine your book preferences and select the one book that is most likely to appeal to you. Facebook finds what posts you like, and serves you more of the same. Google finds out who you are, and filters search results and ads for you.

If I like computers, the search term Python will return me the programming language. If I like snakes, it will return me a whole bunch of snakes.

This is all down to the magic of data mining. You’re here because you want to look behind the veil and learn how to do all this.

It’s hard, but not as hard as you think. Data science, at its core, is all about using computing power to parse through huge data sets.

Learn Data Mining with code(love)

Here are nine free, brilliant resources to do just that.

1- Coursera’s Specialization in Data Mining (level: beginner) 

https://www.coursera.org/specialization/datamining/20

Coursera brings the best from the University of Illinois at Urbana-Champaign, ranked in the top 5 for computer science schools in America. It’s a useful introduction to data mining: the application of data science and computing power to find patterns in large collections of data.

2- A UCLA professor’s overview of data mining (level: beginner)

http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

This blog post delves deep into the specifics of data mining. It provides an overview and a set of definitions that will help bring you up to scratch.

3- Introduction to R (level: beginner)

https://www.codeschool.com/courses/try-r

The coding language R is the workhorse of scientific data analysis and visualization. Code School offers an interactive and gamified approach to learning it, similar to Codecademy. Working with R will give you insight into how to move and dance with digital data, a skill that is the foundation of data science.

4- Kaggle’s Wiki on Python (level: beginner) 

https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience

Kaggle is a platform for crowdsourced data challenges. The website has a ton of resources on how to get started with data science. This particular link leads to their guide on Python, one of the most versatile programming languages for data analysis.

5- Data Science 101 (level: beginner)

http://101.datascience.community/

This blog knows how to describe itself: “Data Science 101 is about learning to become a data scientist.” Simple, clear and to the point.

6- W3Schools’ Tutorial on SQL (level: beginner)

http://www.w3schools.com/sql/

W3Schools hosts a bunch of interactive tutorials on the basics of programming. This set of tutorials goes through SQL, a language that allows you to access data from most web databases. The tutorials will give you a glimpse into how data is structured for many websites, and enough knowledge to start playing with data yourself.

7- Hortonworks’ Hadoop Sandbox (level: intermediate)

http://hortonworks.com/products/hortonworks-sandbox/

Have you ever wanted to play with big data? Learn the basics here and experiment with them. Hadoop helps distribute data across multiple servers, helping to process large amounts of data as seamlessly as possible.

8- Machine Learning on Coursera with Andrew Ng (level: intermediate)

https://www.coursera.org/course/ml

Learn about data mining, and the algorithms you can create to make your data analysis job so much easier, from a master in the field: Andrew Ng, a co-founder of Coursera and a Stanford professor who has recently become Baidu’s chief scientist.

9- A Programmer’s Guide to Data Mining (level: advanced) 

http://guidetodatamining.com/

If you can work with Python at a proficient level, this book will help you implement different algorithms that will sort, filter, and manipulate your data for you. A must-read for people looking into the practical applications of data mining.

I hope that helps get you set on the path to data mining. What resources do you think I’m missing? Comment below. 🙂

Defining the Future

Defining Big Data in Less Than Three Minutes

I remember the first time I said the words “big data” with pride when describing my work. The term, like every good buzzword, meant nothing to me, but conveyed a lot to my imagined prospective audience. It said something about my intelligence that I was working in “big data”, plying away at Excel sheets with way too many lines: a sure sign of a “big data” expert!

I know better now. After doing some research, I’m proud to say that I knew absolutely nothing about the topic at the time. In many ways, I still don’t—but I know enough to talk about the basics of “big data” and what it really represents, so you can explore with me.

The first step is to realize that big data is data so large and complex that conventional data tools, such as table-based SQL databases, cannot handle the load. Big data is not simply a big dataset that can be handled with Excel. Think, for example, of someone tracking every comment on Ahnold’s accent on social media, along with the commenter’s location and other user attributes, in a mad quest to find who had the best “get to the choppa!” or “there is no bathroom!” quote variations. You’d quickly go mad trying to pass through every single one of those data points in a relational table or an Excel file, even if you worked for a large Arnold-watching company with a set data process.

An easy rule of thumb is that big data refers to data sets that become difficult for an organization with a conventional data process to handle. That threshold can differ by several orders of magnitude: a smaller business may struggle at a much lower volume than a larger one. Nevertheless, it is the beginning of that struggle, and the search for alternatives to bread-and-butter SQL and Excel, that are at the core of big data.

Traditional data tools tend to group data into tables and run on a small number of servers. Big data tools tend to ungroup data, and to organize and analyze it through parallel processing across a larger number of servers.
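To make the parallel-processing idea a little more concrete, here is a minimal Python sketch of the map-and-reduce pattern behind tools like Hadoop. Everything in it is an assumption made for illustration: the comments are invented, the "servers" are just lists in memory, and the function names are hypothetical. Real systems spread these steps across many machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical social-media comments, pre-split into shards as if each
# shard lived on a different server.
SHARDS = [
    ["get to the choppa", "there is no bathroom"],
    ["get to the choppa", "get to the choppa"],
    ["there is no bathroom", "I'll be back"],
]


def map_shard(comments):
    """Map step: count the quotes found in a single shard."""
    return Counter(comments)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_quotes(shards):
    # Each map_shard call is independent of the others, so in a real
    # cluster the calls could run on separate machines at the same time.
    partial_counts = map(map_shard, shards)
    return reduce(reduce_counts, partial_counts, Counter())


totals = count_quotes(SHARDS)
print(totals.most_common(1))  # [('get to the choppa', 3)]
```

The point of the split is that no single machine ever has to hold every comment: each one counts its own shard, and only the small partial counts get merged at the end.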

When people in the field talk about the possibilities of big data, they mean the unfathomable amount of detail we now leave on the web. Collecting it was impossible five or ten years ago, because there were not so many details on the web and there were no tools to collect them. Now, with smartphones, sensors, and social media, data points are multiplying exponentially. Those who take a dragnet over all of this data, run it through tools not traditionally used in data collection that spread the volume and velocity of data over many servers instead of one or two, and emerge with finely combed, actionable insights despite the massive amount of data, are dealing with big data. This includes the NSA, but also the data scientists who helped win the 2012 election, and health analysts working to ensure better care for all.

Please contribute to big data by commenting or forwarding me your terabytes of favorite Ahnold quotes.

It’s probably big data: new tools and terms

Hadoop

NoSQL

MapReduce

MongoDB

Look at me in very not-tabled JavaScript Object Notation, a favorite of web-based Big Data databases:

[Image: JSON in relation to Big Data]
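For a rough idea of what non-tabled data looks like, here is a minimal Python sketch that builds one JSON document with the standard json module. The record and all of its field names are hypothetical, made up for illustration. Note how a single record can nest lists and other records inside itself, which a flat table row cannot do directly.

```python
import json

# A hypothetical record describing one social-media comment.
comment = {
    "user": "example_user",
    "location": "Montreal",
    "quote": "get to the choppa",
    "tags": ["accent", "predator"],
    "replies": [
        {"user": "another_user", "quote": "there is no bathroom"},
    ],
}

# Serialize the record to JSON text, then parse it back into Python objects.
encoded = json.dumps(comment, indent=2)
decoded = json.loads(encoded)

print(encoded)
```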

 

It’s probably not big data

Your Excel spreadsheets of political enemies, no matter how many you have

Your Excel spreadsheets of dateable people, no matter how many you have

Your SQL tables of your favorite Arnold movies, and quotes contained within

Your handwritten list of things you would do for a Klondike bar

Look at me in traditional SQL table form:

[Image: SQL in relation to Big Data]
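For comparison, here is a minimal sketch of the traditional table form, using Python's built-in sqlite3 module and an in-memory database. The table name, the column names, and the two rows are assumptions chosen for illustration. Every row has to fit the same fixed set of columns.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of favorite Arnold quotes: fixed columns, one row per quote.
conn.execute(
    "CREATE TABLE quotes ("
    "id INTEGER PRIMARY KEY, "
    "movie TEXT NOT NULL, "
    "quote TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO quotes (movie, quote) VALUES (?, ?)",
    [
        ("Predator", "get to the choppa"),
        ("Kindergarten Cop", "there is no bathroom"),
    ],
)
conn.commit()

# Query the rows back in insertion order.
rows = conn.execute("SELECT movie, quote FROM quotes ORDER BY id").fetchall()
for movie, quote in rows:
    print(f"{movie}: {quote}")

conn.close()
```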