Introduction to RAPIDS and GPU Data Science: cuDF/Dask vs. Pandas

BY Roger Huang

RAPIDS is the new framework for distributed data science and machine learning provided by NVIDIA. You can use software optimized to do distributed work over GPU hardware rather than just standard CPU cores.

This can speed up machine learning training and related tasks considerably: many people report speedups on the order of 10x to 100x on large datasets and common machine learning tasks.

RAPIDS is actually a set of APIs in both Python and C++ that implement common machine learning tasks on GPUs instead of CPUs. It integrates with CUDA, NVIDIA’s platform for parallel computing.

The purpose of this article is to build something like the Pandas Cookbook together for RAPIDS. I want to make it easy and intuitive to go from the Pandas and CPU ecosystem to taking advantage of GPUs and the increased computational power they can deliver.

GPUs vs CPUs

GPUs are an odd product of the need for humans to game. Gaming is a computationally expensive activity. Tons of memory and processing are needed behind the scenes to simulate nearly real universes for gamers.

This typically means that GPUs work with many cores (hundreds or even thousands) that process data simultaneously and in parallel, while CPUs focus on a few threads running sequential calculations. While each individual GPU thread may be slower than a CPU thread, GPUs can vastly outperform CPUs on many shallow calculations by working on them all at once. Moving data to the GPU and launching the work adds some overhead, but on sufficiently large datasets or data pipelines, the differences lead to large speedups.

Getting Access To GPUs or TPUs

Getting access to GPUs and TPUs can be quite difficult. TPUs (Tensor Processing Units) are Google’s custom accelerators, originally built for TensorFlow workloads. For GPUs, the options are to work with the cloud or to build your own GPU machine.

On the cloud, there are several services that offer free GPU cloud hours, most prominently Kaggle and Google Colaboratory.

Google Colab offers the ability to use GPUs for free. However, the GPUs are randomly allocated and it’s hard to get good ones. The free tier also cuts sessions off after 12 hours.

The pro version of Google Colab, at $9.99 a month, is available in the United States and offers premium availability for Nvidia GPUs. It also offers more uptime (up to 24 hours) and more lax restrictions when it comes to idle times.

You can also build your own deep learning hardware. This tutorial shows you how to do it for under $1000. It’ll require some setup and some patience on your side, but at the end you’ll have a machine that saves you some variable cost. In practice, you’ll only want to do this if you’re serious about machine learning use cases and plan to use as much of that fixed-cost compute as you can.

Of course, if you use AWS, Microsoft Azure, or Google Cloud solutions, you can pay for GPU access on those platforms, though that may end up costing a lot.

For the purposes of this playbook, we’re going to start using Kaggle, which comes with access to both TPUs and GPUs as accelerators, though you’ll need to verify a phone number to get access to that. Once you do, you can set up GPUs then install RAPIDS through the handy dataset.

Then you can import datasets from Kaggle datasets and you’re off and running. However, Kaggle has a 41-hour weekly quota on GPU usage, which means that it’s best suited to short experiments and learning examples.

The Kaggle instance will pause every 40 minutes or so. The best practice would be to pause the instance and turn it off when you’re not using it.

The above is a screenshot of my GPU usage. The usage will reset every week. You also won’t get access to the latest NVIDIA architecture, but it will be free. There’s an easy-to-access dashboard on your GPU and CPU usage as well.

If you want to get started right away with powerful infrastructure, BlazingSQL offers a free hosted Jupyter environment with the latest version of the RAPIDS stack pre-installed and a bit of GPU memory to play with. They’re also offering beta access to clusters of cloud GPUs.

RAPIDS will be pre-installed, but it’ll be harder to get intuitive access to different datasets than it is on Kaggle, so we’ll stick with Kaggle for now for the data import part. BlazingSQL can still be used in practice, especially since Kaggle’s data and compute limits are set more towards learning than production.

cuDF’s role in RAPIDS

cuDF is meant to be the data manipulation layer of RAPIDS, allowing for the rapid manipulation of dataframes on GPUs. The documentation describes cuDF as being useful for loading, joining, aggregating and filtering data.

You can think of it as a Pandas equivalent within RAPIDS, and in fact many of the functions from cuDF map pretty closely to their Pandas equivalents.

In practice, when you’re dealing with data or wrangling data, you’re likely going to have to deal with cuDF if you want to work on RAPIDS on a GPU.

Dask-cuDF vs. cuDF vs. Pandas

Dask is a parallel processing library that slices up Pandas dataframes into chunks and processes them in parallel on CPUs. It can also be combined with cuDF: Dask-cuDF chunks cuDF dataframes so that work can be spread across multiple GPUs. This documentation from the RAPIDS team summarizes the difference and goes into detail on the different functions available, from your standard filtering and value_counts() to more complex groupbys and aggregations.

The syntax here is very similar to Pandas. In practice, you’ll find that working with cuDF and Dask-cuDF feels very similar to working with the Pandas API, just with slightly less function completeness.

It’s of course important to also note when it’s best to use each framework:

Dask-cuDF for when you have very large datasets that need multiple GPUs to train on, because the dataset takes up more memory than a single GPU can handle
cuDF for when you have a large dataset that can be wrangled and trained on a single GPU, or when you don’t have access to multiple GPUs, such as our example on Kaggle, where you only have access to one GPU
Pandas for a small enough dataset that can fit and be trained on CPU only. In practice, for most standard setups, unless you have a particularly strong computer with a GPU installed, Pandas will be “good enough” for now, especially with smaller datasets.

Dask-cuDF, cuDF Import/Export Data With Pandas and CSV

It’s relatively simple to go from different datatypes into the three frameworks, and pass dataframes from framework to framework. Let’s discuss how to transfer between the different frameworks.

It’s quite simple to go from a Pandas dataframe to a cuDF dataframe: it’s a one-line command. In this case, we take our predefined dataframe (seattlelibrarydf) of the Seattle Library inventory and convert it into a cuDF dataframe with many of the same properties (seattlelibrarycudf).

This simple function helps turn a Pandas Dataframe into the cuDF equivalent.
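
As a minimal sketch of that one-line conversion: the snippet below assumes cuDF’s DataFrame.from_pandas constructor and uses a tiny made-up dataframe in place of the real Seattle Library inventory (the column values are hypothetical). It falls back to plain Pandas when cuDF or a usable GPU isn’t available, so it still runs on a CPU-only machine.

```python
import pandas as pd

try:
    import cudf  # needs an NVIDIA GPU and a RAPIDS install
except Exception:  # cuDF is missing, or no usable GPU was found
    cudf = None

# Hypothetical stand-in for the Seattle Library inventory dataframe.
seattlelibrarydf = pd.DataFrame(
    {
        "BibNumber": [101, 102, 103],
        "ItemCollection": ["nanf", "cafic", "nanf"],
    }
)

if cudf is not None:
    # The one-line conversion: copy the Pandas dataframe into GPU memory.
    seattlelibrarycudf = cudf.DataFrame.from_pandas(seattlelibrarydf)
    # to_pandas() copies the data back to the CPU.
    roundtrip = seattlelibrarycudf.to_pandas()
else:
    # CPU-only fallback so the sketch still runs without a GPU.
    roundtrip = seattlelibrarydf.copy()

print(roundtrip.shape)
```

Going the other way, to_pandas() is handy when a downstream library only understands Pandas objects.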

Common Functions

It’s time to put cuDF to the test and actually get working on a large dataset. In the case of Kaggle, we’re going to work with the Seattle Public Library dataset, a large collection of CSV files that tabulate the inventory of the Seattle Public Library as well as a set of CSVs that describe the yearly checkout patterns. Specifically, we’re going to join together the yearly checkout data and then do analysis on the inventory.

Let’s now look at a common function, the ubiquitous value_counts in Pandas. This takes a column and returns an aggregated count of cell values. In this case, we’ll run it on the collection codes, which we can later join onto descriptions of the categories.

Note here that the syntax for the function in both Pandas and cuDF is essentially the same. By using cuDF on an average-power GPU and a slightly larger than 1 GB dataset, with more than 2.5 million records, we achieve close to a 9x speedup even in pre-processing the data, from 679 milliseconds to 78.9 milliseconds.
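
The comparison can be sketched roughly as follows. This is an illustration under assumptions, not the original notebook: the column name ItemCollection, the sample codes, and the timed_value_counts helper are all hypothetical, and a dataset this small will not reproduce the timings quoted above (on tiny inputs the GPU overhead usually dominates).

```python
import time

import pandas as pd

try:
    import cudf  # optional: only used when a GPU and RAPIDS are present
except Exception:
    cudf = None

# Hypothetical sample of collection codes.
codes = pd.DataFrame(
    {"ItemCollection": ["nanf", "cafic", "nanf", "nadvd", "nanf", "cafic"]}
)


def timed_value_counts(df, column):
    """Run value_counts() on one column and report elapsed milliseconds."""
    start = time.perf_counter()
    counts = df[column].value_counts()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return counts, elapsed_ms


pandas_counts, pandas_ms = timed_value_counts(codes, "ItemCollection")
print(pandas_counts.to_dict())

if cudf is not None:
    # Same call, same syntax, but on a GPU-backed dataframe.
    gpu_df = cudf.DataFrame.from_pandas(codes)
    gpu_counts, gpu_ms = timed_value_counts(gpu_df, "ItemCollection")
    print(f"Pandas: {pandas_ms:.3f} ms, cuDF: {gpu_ms:.3f} ms")
```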

Groupby/Aggregations

Now let’s get to the meat of the dataset and join together the different files that hold the yearly checkouts for each item. This is not a trivial exercise. By the end, we’ll have combined a large dataset of multiple GB (about 7 GB, or slightly under 50% of the GPU memory allocation Kaggle gives us) with about 90 million rows.
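
One way to stitch yearly files together is a plain concatenation. The sketch below is an assumption about how that step could look, not the exact code used here: two tiny in-memory frames stand in for the yearly CSVs, the column names are hypothetical, and the xdf alias points at cuDF when it is available and at Pandas otherwise (both expose a concat function with an ignore_index option).

```python
import pandas as pd

try:
    import cudf as xdf  # GPU dataframes when RAPIDS is installed
except Exception:
    xdf = pd  # CPU fallback with the same calls

# Hypothetical stand-ins for two yearly checkout files.
checkouts_2016 = xdf.DataFrame(
    {"BibNumber": [101, 102, 101], "CheckoutYear": [2016, 2016, 2016]}
)
checkouts_2017 = xdf.DataFrame(
    {"BibNumber": [103, 101], "CheckoutYear": [2017, 2017]}
)

# Stack the yearly frames into one table with a fresh 0..n-1 index.
all_checkouts = xdf.concat([checkouts_2016, checkouts_2017], ignore_index=True)

print(len(all_checkouts))
```

With real data, each frame would come from read_csv on one yearly file, and the combined frame has to fit in GPU memory.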

The read_csv function of cuDF will also have some issues, specifically with the datatypes you need to define and validate. It will sometimes mix up the datatypes of columns, meaning you have to set them manually with the dtype parameter.

However, cuDF is fairly finicky about how you do this. So far, I’ve found that ‘int64’, ‘timestamp’, and ‘str’ (and it’s important that they be passed as strings) work, unlike the numpy variants suggested in the basic documentation. You can track the progress on this open GitHub issue.
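
Here is a small sketch of passing dtypes as strings to read_csv. It is illustrative only: the CSV content and column names are made up, and the xdf alias falls back to Pandas when cuDF is unavailable. Only ‘int64’ and ‘str’ are used, since those strings are accepted by both libraries; the date column is left out of the dtype map rather than guessing at a timestamp spelling.

```python
import io

import pandas as pd

try:
    import cudf as xdf  # GPU dataframes when RAPIDS is installed
except Exception:
    xdf = pd  # CPU fallback with the same read_csv signature

# Hypothetical three-row extract of a checkout file.
csv_text = (
    "BibNumber,ItemType,CheckoutDateTime\n"
    "101,acbk,2017-01-05 10:15:00\n"
    "102,acdvd,2017-01-05 11:40:00\n"
    "101,acbk,2017-02-11 14:05:00\n"
)

# Dtypes are given as plain strings, keyed by column name.
dtypes = {"BibNumber": "int64", "ItemType": "str"}

checkouts = xdf.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(checkouts.dtypes)
```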

Let’s now do some data wrangling and joins. We want to see which ten items are most frequently checked out in the dataset. We can do this really quickly by slicing the result of a value_counts() call, just like you might in Pandas. We’ll do this on the BibNumber column, which serves as a primary key that unites the checkout data with the underlying information about each inventory item.
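
A minimal sketch of that top-ten lookup, written against Pandas with made-up BibNumber values. It uses head(10) to take the first ten rows of the sorted counts, which is one way to express the slice; the same method names are expected to carry over to cuDF, though that isn’t verified here.

```python
import pandas as pd

# Hypothetical checkout records: one row per checkout.
checkouts = pd.DataFrame({"BibNumber": [101, 102, 101, 103, 101, 102, 104]})

# value_counts() sorts by frequency (highest first), so the first ten
# rows are the ten most frequently checked-out items.
top_items = checkouts["BibNumber"].value_counts().head(10)

print(top_items)
```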

We get a bunch of item numbers. But what are the actual items here? Who are the authors? Can we say anything about these items beyond their key numbers?

To find that insight, we’ll have to join the inventory information with the aggregated checkout data, and we’ll have to clean up the dates and times represented in the last column, CheckoutDateTime. This is something we may cover in another tutorial.
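
As a rough preview of what that join could look like, here is a Pandas sketch on two miniature, made-up tables. The titles, authors, and the timestamp format string are all assumptions for illustration; cuDF aims to mirror these calls, but the exact behaviour on the GPU isn’t verified here.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
inventory = pd.DataFrame(
    {
        "BibNumber": [101, 102, 103],
        "Title": ["Title A", "Title B", "Title C"],
        "Author": ["Author A", "Author B", "Author C"],
    }
)
checkouts = pd.DataFrame(
    {
        "BibNumber": [101, 102, 101, 103, 101],
        "CheckoutDateTime": [
            "01/05/2017 10:15:00 AM",
            "01/05/2017 11:40:00 AM",
            "02/11/2017 02:05:00 PM",
            "03/02/2017 09:30:00 AM",
            "03/09/2017 04:45:00 PM",
        ],
    }
)

# Clean up the timestamp strings; the format string is an assumption.
checkouts["CheckoutDateTime"] = pd.to_datetime(
    checkouts["CheckoutDateTime"], format="%m/%d/%Y %I:%M:%S %p"
)

# Aggregate checkouts per item, then join on the shared BibNumber key.
counts = checkouts.groupby("BibNumber").size().reset_index(name="Checkouts")
top_items = counts.merge(inventory, on="BibNumber", how="left").sort_values(
    "Checkouts", ascending=False
)

print(top_items.head(10))
```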

For now, hopefully, you’ve learned enough to get set up on RAPIDS and cuDF and to see why you might want to use them.

Uncategorized

What is Digital Literacy? A Comprehensive Guide

BY Roger Huang

What is digital literacy?

The American Library Association’s textbook definition reads: “Digital literacy is the ability to use information and communication technologies to find, evaluate, create, and communicate information, requiring both cognitive and technical skills.” At code(love), we think it has to go further.

Digital literacy involves a set of foundational skills that are required to navigate the 21st century. These new 21st century skills will allow anybody to navigate the emerging technologies of today. They will empower everybody to fully interface with the rich ecosystem of applications and digital services that are being developed.

Why does it matter?

With high job satisfaction and high compensation levels for technical jobs such as data scientist, the ability to create and interact with new digital technologies has never been more important.

Digital literacy skills are needed to thrive in a world where many of the world’s richest companies are software and hardware technology companies such as Facebook, Google, and Microsoft.

It also matters because of the flipside. 72% of Americans are scared of a future where they think robots and machines do most of the jobs accorded to humans. That’s almost twice as many as those excited about that possibility. The divide in politics doesn’t seem to be between liberals and conservatives so much as between people who embrace the future and people who are afraid of it.

Today’s students are going to be confronting a world that is very different than what their high schools and universities are preparing them for. Even these so-called digital natives will need to quickly up their information literacy skills for the 21st century.

The digital divide between those who are digitally literate and those who are not will soon extend to wealth and life outcomes across the board as the digital world takes over.

Given how much it matters, we have to dig deeper into the specific components that underlie digital literacy and these new literacy skills.

What are the specific components of digital literacy?

The ability to find relevant and reliable information
The ability to work with applications
The ability to build a website
The ability to build a relevant audience
The ability to make payments and hold balances securely
The ability to understand and control your own data
The ability to understand new technologies

Let’s look at each item in depth:

1- The ability to find relevant and reliable information

The ability to find relevant information is how search engine Google has built a multi-billion dollar business. In 2017, people were producing about 2.5 quintillion bytes of data a day, most of it unstructured and hard to query. The Internet isn’t just the world’s largest container of data: it is also its largest attempt at structuring and classifying that data.

In order to be digitally literate, you should be able to navigate that large realm of data on the web and pick out the pieces you need.

This is an increasingly relevant skill in a world where media sources are disputed and where more and more authentic replicas of human behavior are being created: take a look at this photorealistic video of President Obama whose words were completely faked using artificial intelligence. The ability to tell what information is relevant, credible and substantive is critical for digital literacy.

Digital content can be filled with inaccuracies. Determining reliable sources is a critical digital skill to have. It’s a critical part of 21st-century skills to have this new form of media literacy and understand digital media to be able to get the best information possible.

A nation with many digital citizens should have ready internet access, a way to curate and access information, and a way to quickly get relevant data.

Sample Stat: Only 17% of people are illiterate now in 2018. This is a reversal from 1820, when only 12% of the world could read and write. Hopefully, digital literacy will follow the same trend and 80% of people will be able to find relevant information on the Internet.

Skills Required:

Reading comprehension
Writing or voice-to-text capability
The ability to quickly navigate search engines and get the most relevant results
The ability to authenticate information via secondary sources
The ability to verify providers of information and data

In general, you should be able to write out or communicate your search intent in a way that helps frame the most helpful results, understand how search engines surface certain results and the algorithms they use to determine the best results, and you should be able to quickly evaluate new sources of data for authenticity and reliability.

Resources:

Global Search Engine Market Share In The Top 15 Countries By GDP

This Medium article uses StatCounter to suss out which search engines have the most penetration and market share per each market. Google tends to dominate in most countries with above 70% search engine market share — though Yandex leads in Russia, and Baidu leads in China, while Yahoo has a significant share as a search engine in Japan.

How to Search on Google: 31 Google Advanced Search Tips

This guide for search modifiers will help you tailor down your search patterns to exactly the sort of information you’re looking for on the world’s most popularly used search engine.

2- The ability to work with applications

The world runs on digital applications. If you’re a salesperson or somebody who has to chase down a list of people as part of your work, you’ve probably used customer relationship management software to keep track of everybody.

Your day-to-day routine might involve looking through social media applications and all sorts of different work and productivity apps, from spreadsheet software to document processors. Understanding how to work with these tools is a critical part of digital literacy.

The ability to navigate online communities, social networks and more and leave your own digital footprints is a critical part of digital citizenship as well — without participating in the digital discourse and lending your voice to it, your perspective may get lost in a world that has shifted from analog to digital.

Sample Stat: There were 171.8 billion mobile app downloads worldwide in 2017.

Skills Required:

Reading comprehension
Writing or voice-to-text interface capability
The ability to quickly navigate application user interfaces
The ability to navigate accessibility issues
The ability to use shortcuts
The ability to recognize app interface cues

Resources:

Usability 101: Introduction to Usability

This handy guide dives into what makes a website easier to access and lays down a process for how to make apps more usable. It then runs over why usability itself is critical. These ten usability heuristics help dive into the rules behind making sites easy-to-access.

Accessibility for iPhone and iPad: the Ultimate Guide

This guide runs through how to interact with an iPhone or iPad, two of the most popular screen interfaces for browsing the web. Learn how to do everything from accessing voice commands to increasing the legibility of text.

3- The ability to build your own website

From being an application user, the next important step for digital literacy is to be able to build your own online media. In order to be fully digitally literate, it’s important not just to be a consumer and user, but also a producer or curator.

Having the ability to build your own website brings a whole new world of potential. It is akin to the writing aspect of literacy. It is the difference between merely absorbing and browsing the Internet and being able to broadcast one’s thoughts on it, taking full advantage of the two-way street the Internet was always meant to be.

You can build simple webpages that help you do everything from displaying your CV and portfolio to sharing your thoughts on different matters, without a line of code. You might build a virtual store to sell your wares. Or you might share your business. With some basic knowledge of code, you can build so much more.

Sample Stat: 1 billion websites were created in 2015. There are close to 2 billion in 2018. Out of those 2 billion, only about 200 million (or 10%) are active.

Skills Required:

Reading comprehension
Writing or voice-to-text interface capability
Ability to work with applications/landing page generators
Ability to interact with text editors
Ability to understand basic HTML/CSS and ideally some JavaScript

Resources:

HTML and CSS Basics

This interactive tutorial helps cover the steps and resources you’d need to understand HTML and CSS, the building blocks of the modern Internet. Once you understand HTML and CSS, you’ll understand how the skeletons of websites are built, and you’ll be able to analyze different webpages.

Website Builders

This review of different website builders gives you a handy way to build your own webpages even if you don’t know any code.

4- The ability to build a relevant audience

Reddit co-founder Aaron Swartz once said that “Everybody has the right to speak on the Internet, what matters is who is heard.”

The ability to create a website or application means very little if you don’t understand how to draw a relevant audience to it, and if you don’t understand how content is surfaced to users around the world.

Writing something, after all, isn’t the same as sharing it with millions of people around the world. The ability to make an impact on the Internet means getting your content seen by a targeted audience at scale.

This means working with digital marketing techniques and understanding how to spread content with social media and a variety of digital tools. It means knowing how search engines rank content and then using that knowledge to help showcase your content to people around the Internet.

Sample Stat: Out of the Alexa Top 50 websites by visitor traffic, the top ten only has three countries represented: India, China, and the United States.

Skills Required:

Reading comprehension
Writing or voice-to-text interface capability
Ability to use analysis and statistics tools for web traffic such as Google Analytics
Understanding of social media platforms and how to use them to distribute content
Understanding of search engines and how to use them to distribute content
Understanding of how social communities evolve on the Internet, and how to post and distribute content within those communities (ex: Reddit).

Resources:

Digital Marketing Made Simple: A Step-By-Step Guide

Neil Patel has made his living building large audiences for his ventures. Here he walks through all of the different tactics and approaches you can use to build your own relevant audience on the web.

SEO Starter Guide

This guide by Google will help you understand what it takes to rank in their search engine index. While everybody can create content, it’s really content that holds staying power in search engine rankings that creates lasting impact. Getting ranked on Google and other search engines the right way and with the right relevant keywords will certainly help you drive relevant audiences.

5- The ability to make payments and hold balances securely

As the Internet gradually moves to a place where payments become part of the infrastructure, to become digitally literate is to combine your financial ability with your technological capabilities.

A decade ago, only about 5% of all retail operations were conducted on the Internet in the United States: now in those same categories, about 13% of retail sales are conducted online. In 2017, online retail sales to American customers crossed the $450bn mark, with rapid year-on-year growth of 16% from 2016.

With a growing number of payment processors vying to help you send money online, from Apple Pay to China’s WeChat Pay, it’s clear that e-commerce, unlike the heady days of the early 2000s Internet bust, is here to stay.

This has only been accentuated with the rise of blockchain technologies and cryptocurrencies, new entirely virtual monetary technologies. It’s been accelerated with a drive to online banking. With virtual assets coming into play and more real-world assets being digitized, the critical skill of being able to understand how to securely maintain balances online and to deal with transactions online will grow ever more important.

Sample Stat: According to a survey of 2,000 Americans, only about 8% of Americans hold digital cryptocurrencies.

Skills Required:

Reading comprehension
Ability to write or give voice-to-text commands
Basic statistics knowledge
Understanding of safe practices around authentication and passwords
Understanding financial interfaces around value transfer
Basic knowledge on how to maintain privacy and security on the Internet

Resources:

The following list of payment solutions will get you introduced to the services that help you both receive and send payments online.

Free Introductory Course to Digital Currencies

This online video series will teach you about the foundations behind digital currencies and how they have evolved into the current stage of financial and technological innovation. It will run over the basics of the blockchain, Bitcoin, and cryptocurrencies.

6- The ability to understand and control your own data

We all generate data as we interact with the Internet. A critical part of understanding the Internet and how to use it safely and consensually is to understand what data is captured from us, and to navigate how and where we can consent to particular uses of our data. We can then navigate the trade-off between our attention and the data we generate for a company with the utility that the company provides us.

We can also make sure that our data stays private, deliberately choose who we share it with and for what purpose, and make conscious choices to avoid companies that violate our data principles. By browsing the Web, we give away data about ourselves constantly. Having control over that data lets us keep our privacy and security while benefitting from applications.

Sample Stat: 93% of Americans believe it is important to be in control of who gets information about them.

Skills Required:

Reading comprehension
Ability to write or give voice-to-text commands
Understanding of how data is processed on the web and transmitted
Understanding of what data is used for
Basic knowledge on how to maintain privacy and security on the Internet

Resources:

How to Encrypt Your Entire Life in One Hour

This handy guide will walk you through how to leave as little of a digital profile as possible by using encrypted chat and by making sure that the data you share with the world is the sort of data that you want shared.

Europe’s New Privacy Law Will Change the Web

This article talks about the sweeping changes that new European privacy legislation (GDPR) will bring and serves as a case study of how legislation can affect collective and individual data rights.

7- The ability to understand new technologies

As new technologies evolve, the ability to master them serves as the ultimate foundation of digital literacy. In order to be fully digitally literate, you need to have the foundation to be able to anticipate new technological advances, and to be fully ready to be an early adopter or creator with new trends.

We live in an age where each year brings drastic innovation, from biotechnology advances that allow individuals the power of modifying genomes to artificial intelligence models that can help individuals do tasks that once would have taken thousands of humans to do. To be able to understand those advances and create with them will help take and extend your digital literacy to the point where it is flexible and malleable to new advances, just like a full grasp of literacy allows you to understand and take in new ideas.

Sample Stat: Americans are more afraid of robots than death.

Skills Required:

Reading comprehension
Ability to write or give voice-to-text commands
Basic statistics knowledge
Ability to work with applications/landing page generators
Ability to understand basic HTML/CSS and ideally some JavaScript
The ability to quickly navigate search engines and get the most relevant results
The ability to authenticate information via secondary sources
The ability to verify who is a provider of information and data

Resources:

Learning How to Learn

Drawing from her background learning engineering, Dr. Oakley introduces powerful mental frameworks and tools to quickly and efficiently work with new information and challenges. It’s a powerful primer on how to adapt to an ever-changing world where information is king.

The Gartner Hype Cycle

The Gartner Hype Cycle walks through the different stages of excitement a new technology brings, and how it can solidify to lasting change. You can use it as a framework to place new technologies into a certain mindset.

—

Digital literacy shouldn’t just be a rehash of literacy principles for the digital age and our new digital world. It should be a whole new set of metrics and capabilities that can be measured as an indicator of whether countries and nation-states and their citizens are ready for the 21st century. By evolving our understanding of what digital literacy means, we can more meaningfully prepare people for a future too many are currently afraid of.

Data Science/Artificial Intelligence, Learning Lists, Uncategorized

Learn Machine Learning With These Six Great Resources

BY Roger Huang

Learn Machine Learning

A friend of code(love), Matt Fogel is doing awesome things with machine learning at fuzzy.io. He’s shared this valuable list of resources to learn machine learning that he usually gives his friends who ask him for more information.

You’ll see his original post here: https://medium.com/@mattfogel/master-the-basics-of-machine-learning-with-these-6-resources-63fea5a21c1c#.ta2bhsq8y

Learn machine learning with code(love)

Great blog posts, podcasts and online courses to help you get started

It seems like machine learning and artificial intelligence are topics at the top of everyone’s mind in tech. Be it autonomous cars, robots, or machine intelligence in general, everyone’s talking about machines getting smarter and being able to do more.

Yet for many developers, machine learning and artificial intelligence are dense terms representing complex problems they just don’t have time to learn.

I’ve spoken with lots of developers and CTOs about Fuzzy.io and our mission to make it easy for developers to start bringing intelligent decision-making to their software without needing huge amounts of data or AI expertise. A lot of them were curious to learn more about the greater landscape of machine learning.

You can describe machine learning as using techniques to help computers learn new ways of uncovering insights from data. This deep dive into the topic will explore many elements outside of this short guide if you’re interested in learning more.

What you need to understand before you learn machine learning is that it’s not a magic buzzword that will solve every problem for you. Machine learning is a practical way to get more data insights with less work. Nothing more, nothing less.

To quote a professor in the field, “Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.”

If that excites you, here are some of the links to articles, podcasts and courses about machine learning that I’ve shared with my friends who were eager to learn more. I hope you enjoy!

1– A Gentle Guide to Machine Learning

This guide, written by the awesome Raul Garreta of MonkeyLearn, is perhaps one of the best I’ve read. In one easy-to-read article, he describes a number of applications of machine learning, the types of algorithms that exist, and how to choose which algorithm to use.

2– A Visual Introduction to Machine Learning

This piece by Stephanie Yee and Tony Chu of the R2D3 project gives a great visual overview of the creation of a machine learning model that determines whether an apartment is located in San Francisco or New York based on its traits. It’s a great look into how machine learning models are created and how they work in practice.

Podcasts

3– Data Skeptic

A great starting point on some of the basics of data science and machine learning. Every other week, they release a 10–15 minute episode where the hosts (Kyle and Linhda Polich) give a short primer on topics like k-means clustering, natural language processing and decision tree learning. They often use analogies related to their pet parrot, Yoshi. This is the only place where you’ll learn about k-means clustering via placement of parrot droppings.

4– Linear Digressions

This weekly podcast, hosted by Katie Malone and Ben Jaffe, covers diverse topics in data science and machine learning. They teach specific advanced concepts like Hidden Markov Models and how they apply to real-world problems and datasets. They make complex topics extremely accessible, and teach you new words like clbuttic.

Online Courses

5– Intro to Artificial Intelligence

Plan for this online course to take several months, but you’d be hard-pressed to find better teachers than Peter Norvig and Sebastian Thrun. Norvig quite literally wrote the book on AI, having co-authored Artificial Intelligence: A Modern Approach, the most popular AI textbook in the world. Thrun’s no slouch either. He previously led the Google driverless car initiative.

6– Machine Learning

This 11-week long Stanford course is available online via Coursera. Its instructor is Andrew Ng, Chief Scientist at Chinese internet giant Baidu and one of the pioneers of online education.

—

This list really only scratches the surface of the complex and multifaceted topic that is machine learning. If you have your own favorite resource, please suggest it in the comments and start a discussion around it!