Friday, November 8, 2013

IPython introduction

Why do we need IPython?

The IPython project aims to make Python more interactive. The default interactive shell you get when you type "python" is very limited in functionality. For example, you cannot explore files and directories with the "ls" command, and once you have imported a module there is no easy way to see what it contains. The error messages it shows are also not very informative.

How to install?
Make sure you already have pip set up, since it makes installing other libraries easy. If you don't have it, refer to this.
> pip install ipython 
Install a few other useful packages:
> pip install nose
> pip install pexpect 
> pip install pandoc
For the IPython notebook:
> pip install pyzmq
> pip install Tornado   # Web server used by the notebook.
> pip install Jinja2    # Templating tool used to render HTML pages.
If you get an error saying that a specific module is missing, install it using pip. For example, if you get the error "ImportError: No module named jinja2", use:
> pip install Jinja2 
Starting IPython

If all packages installed successfully, just type ipython at the command prompt to open an IPython interactive session. You should see something like:
 

This is similar to the standard interactive Python shell but richer in features, so let's try a few basic Python commands. Declare a few variables:
# Let's declare 2 integer variables and one string variable
> varA = 10
> varB = 20
> myName = 'Akshay'

As you can see, tab completion is also available in IPython. Just type "var" and press Tab, and it will list all variables starting with "var" (refer to line 4 in the screenshot above).

If you want a list of all variables, you can use the "who" command.
# To get a list of all variables use,
> who
# Then to get only the variables of type integer use,
> who int
# To get a list of variables with name, type and value use,
> whos
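
To make it clearer what "who" and "who int" report, here is a rough plain-Python approximation. It is only a sketch: list_names is a hypothetical helper written for this post, not an IPython function, and the session dictionary simply stands in for the variables declared above.

```python
# Plain-Python approximation of IPython's "who" and "who int".
# list_names is a hypothetical helper, not part of IPython.

def list_names(namespace, type_filter=None):
    """Return sorted variable names, optionally filtered by type name."""
    names = []
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip private/internal names
        if type_filter is not None and type(value).__name__ != type_filter:
            continue  # keep only values of the requested type
        names.append(name)
    return sorted(names)

# Stand-in for the interactive session used in the examples above.
session = {"varA": 10, "varB": 20, "myName": "Akshay"}

all_names = list_names(session)         # like "who"
int_names = list_names(session, "int")  # like "who int"
print(all_names)
print(int_names)
```

In a real session IPython keeps track of the user namespace itself, so none of this bookkeeping is needed.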




Now let's see how to clear variables from the session. Use the "reset" command, which clears all variables currently in memory.
# To check existing objects,
> who
# To reset the session and delete all variables,
> reset
# To verify that all objects (variables) are deleted,
> who


# We can enable logging with the following command,
> logstart
# To temporarily switch off logging,
> logoff
# To switch logging back on,
> logon


There are many other useful commands similar to the ones we have seen above. These are called magic commands, and you can get the full list with:
> lsmagic



IPython has predefined 'magic functions' which can be called with a command-line style syntax. There are two kinds of magics, as you can see in the screenshot above: line-oriented magics and cell-oriented magics.
- Line magics are prefixed with a single % character and work like regular command-line calls: they get as an argument the rest of the line, where arguments are passed without parentheses or quotes. The commands used earlier (who, reset, logstart) worked without the % because IPython's automagic setting, which is on by default, lets you omit it.
- Cell magics are prefixed with a double %%, and they are functions that get as an argument not only the rest of the line, but also the lines below it in a separate argument.
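
The difference between the two kinds of magics can be illustrated with a toy dispatcher. This is only a sketch of the idea and not how IPython is implemented: LINE_MAGICS, CELL_MAGICS, run_magic and the two example magics are hypothetical names made up for illustration.

```python
# Toy dispatcher showing what a line magic and a cell magic receive.
# All names here are hypothetical; IPython's real machinery is different.

def shout(line):
    """Line magic: receives only the rest of the line."""
    return line.upper()

def count_lines(line, cell):
    """Cell magic: receives the rest of the line AND the lines below it."""
    return f"{line}: {len(cell.splitlines())} lines"

LINE_MAGICS = {"shout": shout}
CELL_MAGICS = {"count": count_lines}

def run_magic(text):
    first, _, rest = text.partition("\n")
    if first.startswith("%%"):
        # Cell magic: name, line arguments, plus the body below.
        name, _, line = first[2:].partition(" ")
        return CELL_MAGICS[name](line, rest)
    if first.startswith("%"):
        # Line magic: name and line arguments only.
        name, _, line = first[1:].partition(" ")
        return LINE_MAGICS[name](line)
    raise ValueError("not a magic command")

line_result = run_magic("%shout hello world")
cell_result = run_magic("%%count demo\na = 1\nb = 2")
print(line_result)
print(cell_result)
```

Note how the arguments arrive as plain text without parentheses or quotes, which is what gives magics their command-line feel.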

Monday, October 28, 2013

Hadoop Hands-On exercise with Hortonworks: Setting up Hadoop sandbox and loading the data

Requirements:
1. You have virtualization software such as VirtualBox, VMware or Hyper-V installed to run the virtual machine.
2. You have downloaded the appropriate sandbox image from here.
3. You have at least 4 GB of RAM.
4. Optional: To get familiar with a few basic concepts referred to in this post (like HDFS and nodes), you might like to read the previous post here.

Let's get started!

Load the downloaded file, which is "Hortonworks+Sandbox+1.3+VirtualBox+RC6.ova" in my case, and start the virtual machine.


After a successful boot you should see something similar to the following:

 
Now open a browser (preferably Chrome, Safari or Internet Explorer) and enter the URL "http://127.0.0.1:8888/". You will get a registration form where you can enter your details and click submit.

The next web page is something super-cool! It brings many leading big data projects together in a single web page. It is called Hue, a graphical interface for Hadoop and related projects. It should look something like the following:


You can see the split display of the web page. On the left side there is a collapsible pane, which you can go ahead and hide. All you should see now is the right pane, which is Hue.

Now you can download the test data (11 MB) from here.

Once you have downloaded the file, go to the file browser (5th icon from the left).

You should see a file browser like the following. Go to upload > files and upload the file NYSE-2000-2001.tar.gz. Note that you don't need to unzip/untar the file.

Once you upload the data you should see the following, which is similar to the output of "ls -l" on Linux. It lists all files in HDFS along with size, user, group, permissions and modification timestamp.

You can click on the file "NYSE-2000-2001.tar.gz" and see the detailed data in tabular format like the following. At the top you can see the path where this file is stored. On the left-hand side a few more operations, like view as binary and download, are available.


Now we have loaded the data into HDFS successfully. The next step is loading the data into HCatalog so that it is accessible for querying by Hive or Pig. HCatalog manages metadata such as the schema so that it can be shared with other programs.

Let's use an analogy to understand exactly what is going on here and why we are doing it.
  • Say you are trying to load a dataset from your local machine onto a server (the sandbox here)
  • You upload and store the data on the server's file system (HDFS in our case)
  • You load the data into a database (here this is roughly the role of HCatalog, which maintains schema-like information)
  • You use SQL to query the data (here we will use Pig and Hive).

Click on HCatalog.



Click on "Create a new table from a file". You will get a few input fields. Enter a table name and choose the same file (NYSE-2000-2001.tar.gz). In my case the "choose a file" option was not visible in Firefox, but it worked fine in Chrome.


You can change the data type from double to float when the file is loaded and then click "create table".

Tutorial Summary: 

At this point we have set up a self-contained virtual machine with a single node. An easy-to-use graphical user interface (Hue) can be used as soon as the virtual machine is up. Using the file browser we uploaded a data set into HDFS. To make the data set accessible for querying by Hive or Pig, we loaded the data into HCatalog. In the next tutorial we will try to query the data.

Note: This post is my implementation of the Hortonworks tutorial, written for learning purposes.

Sunday, October 27, 2013

What is Hadoop?

Need: As we generate more and more data every day, we need tools to deal with this huge scale. The tools can be programming languages, software or infrastructure. The infrastructure is the foundation for everything else.

Short history: Google, one of the internet giants, faced large-scale computation problems pretty early. So Google researchers like Sanjay Ghemawat and Jeff Dean did some work to solve this issue. They published two research papers, on the Google File System (2003) and MapReduce (2004). These papers were basically trying to solve large-scale computation in a distributed manner. In 2005 Doug Cutting (who later joined Yahoo) and Mike Cafarella built something based on GFS and MapReduce, called Hadoop.



Hadoop basics: The Hadoop file system is called HDFS, the Hadoop Distributed File System. It runs on a group of machines called a cluster, which forms a distributed computing environment. Every single machine is referred to as a node. HDFS splits data into chunks called blocks; the default block size is 64 MB. So if we have 640 MB of data, it will be divided into 10 blocks, which are then distributed across the data nodes. There are two types of nodes, called name node and data node. The name node decides how the data is split and where it is stored. The name node also manages the metadata and the directory tree associated with it. The data nodes physically store the data blocks. In a single cluster we will have only one name node and multiple data nodes.
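
The block arithmetic above can be sketched in a few lines of Python. This is a simplified illustration only: split_into_blocks is a hypothetical helper, the 700 MB file is an extra example I made up, and real HDFS also replicates each block, which is ignored here.

```python
# Simplified illustration of how a file is divided into HDFS-style blocks.
# split_into_blocks is a hypothetical helper; replication is ignored.

BLOCK_SIZE_MB = 64  # default block size quoted in the text

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block can be smaller
    return sizes

blocks_640 = split_into_blocks(640)  # the example from the text
blocks_700 = split_into_blocks(700)  # made-up example with a partial block
print(len(blocks_640), len(blocks_700), blocks_700[-1])
```

A file that is not an exact multiple of the block size simply ends with one smaller block.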

Pitfalls: Based on the discussion above we can say the name node acts as the authority or gateway for accessing the data stored on the data nodes. If the name node is down, we cannot access the data on its data nodes. This is why the name node is called Hadoop's single point of failure. The risk can be reduced with a secondary name node, which keeps periodic checkpoints of the name node's metadata that can be used for recovery; note that it is a checkpointing helper rather than an automatic standby that takes over when the primary goes down. Hadoop was also built for batch processing and not real-time processing. Though a few distributions offer real-time processing capability, it is not available in Apache Hadoop as of now.

How it is used: Now that we have discussed the high-level concepts of the infrastructure, let's see how it is used. Companies dealing with huge amounts of data configure Hadoop clusters to hold their distributed data. For example, the retail giant Walmart runs a 250-node Hadoop cluster. That covers the data storage and distribution part. The next thing is to run jobs (programs) on top of it which do the actual computation. MapReduce or similar techniques are used for this.




Wednesday, October 23, 2013

Introduction to Python and Data Science

I was exploring the LinkedIn profiles of well-known data scientists like Jeff Hammerbacher, Hilary Mason, DJ Patil and Gilad Lotan to get an idea of their technical skill set. The first common thing I could find was Python. So I decided to explore the general projects and the data mining / machine learning libraries associated with Python.
General Python Projects:
Python: A general-purpose, high-level programming language. Python supports multiple programming paradigms, including object-oriented, imperative, functional and procedural styles. [1] The Python implementation is under an open source license that makes it freely usable and distributable, even for commercial use. [2]
Created by: Guido van Rossum
CPython: It is the default, most widely used implementation of Python.
Written in: C
Maintained by: Python core developers and the Python community, supported by the Python Software Foundation
Difference between Python and CPython: Python is a programming language and CPython is its default implementation. So when we refer to the Python programming language in general, we are usually talking about CPython. There are several other implementations as well, like Jython and IronPython.
Jython: Implementation of Python in Java. It has several differences and incompatibilities with CPython.
Written in: Java and Python
Successor of : JPython
RPython and PyPy: RPython (restricted Python) is a restricted subset of Python. PyPy is an interpreter written in RPython.
Project goal: Speed, efficiency and compatibility with the CPython interpreter.
IronPython: It is Python implementation targeted at .NET framework. 
Written in: C#
Created by: Jim Hugunin
Currently maintained by: Volunteers at Microsoft's CodePlex open-source repository
Cython: It lets you write Python-like code that can call back and forth, from and to C or C++ code natively. Cython code is compiled into C extension modules for Python.
IPython: It is interactive Python. The motivation is scientific computing and exploratory analysis, where we can directly play with data and files. The default interactive environment has limited functionality, which IPython addresses.
Created by: Fernando Perez and others.
Getting started: IPython: Python at your fingertips (talk at Pycon 2012 by IPython creators)
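
Since the same language can run on several implementations (CPython, Jython, PyPy, IronPython), it can be handy to check which one is executing your code. A minimal sketch using only the standard library:

```python
# Report which Python implementation and version is running this script.
import platform
import sys

implementation = platform.python_implementation()  # e.g. the default CPython
version = sys.version_info[:3]                      # (major, minor, micro)

print("Implementation:", implementation)
print("Version:", ".".join(str(part) for part in version))
```

On a standard installation downloaded from python.org this reports CPython.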
Specific project related to Data (Processing, Mining, Visualization and Machine Learning):
SciPy Ecosystem: It is a computing environment and open source ecosystem of Python packages used by scientists, analysts and engineers for scientific and technical computing.
  • Pandas: Python library providing high-performance, easy-to-use data structures and data analysis tools.
  • NumPy: A Python library that supports large, multi-dimensional arrays and high-level mathematical functions that operate on these arrays. [written in: C and Python]
  • SciPy: The name also refers to a Python package (library) of algorithms and mathematical functions which is a core element of the SciPy environment for technical computing.
  • matplotlib: A Python library used for 2D plotting (creating various types of graphs and charts)
  • IPython: The IPython project mentioned above is also part of the core SciPy stack.
  • scikit: Another family of Python packages for scientific computing. These are not a core part of SciPy but add-on packages.

StatsModels: It is a python module that enables users to explore data, estimate statistical models and perform statistical tests.
scikit-learn: Open source machine learning library built on top of NumPy, SciPy and matplotlib. Note that it is different from Scikit.
PypeR: It enables us to use R (a language widely used by data scientists) from Python through pipes.

NetworkX: Python package for the creation, manipulation and study of the structure and functions of complex networks.

Note: I wish to make this post comprehensive over time, so please post comments if I am missing any great projects here.

Thursday, October 3, 2013

Deep Learning: What is that?

What is it?


"Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence." [1]

What does that mean?

"Teaching machines to think has been a dream/nightmare of scientists for a long time. Rather than teaching a machine explicitly, Deep Learning uses simpler, core ideas and then builds upon them — much as a baby learns sounds, then words, then sentences." [2]


So just another technology?

Maybe not. MIT Technology Review ranked it among the top 10 breakthrough technologies of 2013. [3]

Who else thinks it can be a game changer?

Facebook Launches Advanced AI Effort to Find Meaning in Your Posts using deep learning [4]
Google is building one of the most ambitious artificial-intelligence systems to date, the so-called Google Brain with deep learning techniques. [5]
Recent Advances in Deep Learning for Speech Research at Microsoft [6]   
Baidu Opens Lab in Silicon Valley Devoted to Research into ‘Deep Learning’ [7]


Where can I find more about it?

This is an interview with Peter Norvig (director of research at Google) and Yann LeCun (professor of Computer Science and Neural Science at NYU) about deep learning.

Friday, September 27, 2013

Mongodb: Loading and querying the data

This blog post explains how we can load data from JSON files and perform simple queries on it. If you wish to install and configure MongoDB, please refer to the previous post about the same.


Building database

You can open the mongo shell by clicking on "mongo" or running "./mongo" in the bin folder. You can use the following simple command to get a list of all databases,

> show dbs
admin   0.03125GB
local   (empty)

Now you can create a database with the following command,

> use twitterdb
switched to db twitterdb


However, if you run "show dbs" now it will not show the new database. MongoDB
only creates the database when you first save something in it.

You can create a collection in the database. A collection is the logical equivalent of a
table in an RDBMS. The hierarchy is Database > Collection > Document.

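You can create and authenticate a user with the following commands in the mongo shell. (db.addUser is the helper from the MongoDB version used here; later releases replace it with db.createUser.)

> db.addUser("username", "password");
> db.auth("username", "password");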

Loading the data 

I had around 3000 JSON files containing Twitter feeds gathered from the Twitter API. First I tested that the import works for a single file with the following command. [Note: you have to run this command from the terminal, not the mongo shell]

> ./mongoimport --host localhost --db <testdb> --collection <collection-name> --username <user> --password <password> --type json --file <filename>.json

After loading the data from one JSON file, we can use a simple shell script in the terminal to load the other files as follows. [Note: This script is written for files named "file1.json", "file2.json"..."file3000.json"]

> for x in {1..3000}; do
./mongoimport --host localhost --db <testdb> --collection <collection-name> --username <user> --password <password> --type json --file file$x.json
done;
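
If you prefer Python to a shell loop (for example on Windows), the same idea can be sketched as below. This is only an illustration under assumptions: build_import_commands and run_all are hypothetical helpers, and "testdb", "tweets", "user" and "secret" are placeholder values standing in for your own database, collection and credentials. The mongoimport flags are the ones used in the command above.

```python
# Build (and optionally run) one mongoimport command per JSON file.
# Helper names and the placeholder values below are hypothetical.
import subprocess

def build_import_commands(count, db, collection, user, password):
    """Return a list of argument lists for file1.json .. file<count>.json."""
    commands = []
    for x in range(1, count + 1):
        commands.append([
            "./mongoimport", "--host", "localhost",
            "--db", db, "--collection", collection,
            "--username", user, "--password", password,
            "--type", "json", "--file", f"file{x}.json",
        ])
    return commands

def run_all(commands):
    """Run each import in turn; stops with an error if one fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = build_import_commands(3000, "testdb", "tweets", "user", "secret")
print(len(commands), commands[0][-1], commands[-1][-1])
# Call run_all(commands) from the MongoDB bin folder to actually import.
```

Building the commands first makes it easy to print and inspect a few of them before importing anything.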

If you are using Windows you might like to write a similar batch script. Another simple way is to store the key-value information in an object called data and execute the following command from the mongo shell.

> db.collection.save(data)

Querying the data

The following command is equivalent of "SELECT * FROM test". Here test is the collection name.

> db.test.find({})

Now we can add a field name and its expected value,

> db.test.find( { id_str: '1836728' } ) // find documents where id_str is 1836728
> db.test.find( { city: 'Tucson' , username: 'Akshay' } ) // Find document with user name Akshay and city as Tucson
> db.test.find( { age: {$lt : 18} } ) // find documents where age is less than 18

These queries show all fields of the documents that satisfy the given conditions. We can limit the fields displayed for matching documents as follows,

> db.test.find( { age: {$lt : 18} } , { username: 1} ) // Display user names where age is less than 18
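
To see what the filter and the projection do, here is a small plain-Python imitation of find() over a list of dictionaries. It is a sketch for understanding only, not MongoDB's implementation and not a driver: matches and find are hypothetical helpers, only equality and $lt are handled, and the three documents are made-up sample data.

```python
# Plain-Python imitation of find(query, projection) over a list of dicts.
# Hypothetical helpers and made-up data; only equality and $lt are supported.

def matches(doc, query):
    """Return True if doc satisfies every condition in query."""
    for field, condition in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op != "$lt":
                    raise ValueError(f"unsupported operator: {op}")
                if not value < operand:
                    return False
        elif value != condition:
            return False
    return True

def find(collection, query, projection=None):
    """Return matching documents, keeping only projected fields (plus _id)."""
    results = []
    for doc in collection:
        if not matches(doc, query):
            continue
        if projection:
            doc = {k: v for k, v in doc.items()
                   if k == "_id" or projection.get(k) == 1}
        results.append(doc)
    return results

docs = [
    {"_id": 1, "username": "Akshay", "city": "Tucson", "age": 25},
    {"_id": 2, "username": "Sam", "city": "Tucson", "age": 16},
    {"_id": 3, "username": "Lee", "city": "Phoenix", "age": 17},
]

minors = find(docs, {"age": {"$lt": 18}}, {"username": 1})
tucson_akshay = find(docs, {"city": "Tucson", "username": "Akshay"})
print(minors)
print(tucson_akshay)
```

As in the mongo shell, several conditions in one query must all hold, and the _id field is kept alongside the projected fields.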