Friday, September 27, 2013

MongoDB: Loading and querying the data

This blog post explains how we can load data from JSON files and perform simple queries on it. If you wish to install and configure MongoDB, please refer to the previous post about the same.


Building database

You can open the mongo shell by clicking on "mongo" or running "./mongo" in the bin folder. Use the following simple command to get a list of all databases:

show dbs
admin   0.03125GB
local   (empty)

Now you can create a database with the following command:

use twitterdb
switched to db twitterdb


However, if you run "show dbs" now, it will not list the new database. MongoDB
actually creates the database only when you first save something in it.

You can create a collection in the database. A collection is the logical equivalent of
a table in an RDBMS. The hierarchy is Database > Collection > Document.

You can create and authenticate a user with the following commands in the mongo shell.

> db.addUser("username", "password");
> db.auth("username", "password");

Loading the data 

I had around 3000 JSON files containing Twitter feeds gathered from the Twitter API. First I tested that the import works for a single file with the following command. [Note: run this command from the terminal, not from the mongo shell.]

./mongoimport --host localhost --db <testdb> --collection <collection-name> --username <user> --password <password> --type json --file <filename>.json

Now, after loading data from one JSON file, we can use a simple shell script to load the other files from the terminal. [Note: This script is written for files named "file1.json", "file2.json" ... "file3000.json".]

for x in {1..3000}; do
    ./mongoimport --host localhost --db <testdb> --collection <collection-name> --username <user> --password <password> --type json --file "file$x.json"
done
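
If your files are not numbered consecutively, a glob-based loop is a bit more forgiving. The following is only a minimal sketch and not part of my original setup: the function name import_all, the MONGOIMPORT variable, and the default database and collection names (twitterdb, tweets) are assumptions for illustration. The authentication flags are left out; add --username and --password as in the command above if your server needs them. The loop skips a directory with no matching files and stops at the first file that fails to import.

```shell
# Sketch: import every *.json file in a directory with mongoimport.
# MONGOIMPORT, import_all, "twitterdb" and "tweets" are illustrative names.
MONGOIMPORT="${MONGOIMPORT:-./mongoimport}"

import_all() {
  local dir="$1" db="${2:-twitterdb}" coll="${3:-tweets}" f
  for f in "$dir"/*.json; do
    # With no matches the pattern stays literal, so skip anything that is not a file.
    [ -f "$f" ] || continue
    "$MONGOIMPORT" --host localhost --db "$db" --collection "$coll" \
      --type json --file "$f" || { echo "import failed: $f" >&2; return 1; }
  done
}
```

You would call it from the bin folder as, for example, import_all /path/to/json-files twitterdb tweets.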

If you are using Windows, you might like to write a similar batch script. Another simple way is to store the key-value information in an object called data and execute the following command from the mongo shell.

> db.collection.save(data)

Querying the data

The following command is the equivalent of "SELECT * FROM test". Here test is the collection name.

db.test.find({})

Now we can add a field name and its expected value to the query:

> db.test.find( { id_str: '1836728' } ) // find documents where id_str is 1836728
> db.test.find( { city: 'Tucson' , username: 'Akshay' } ) // find documents with username Akshay and city Tucson
> db.test.find( { age: {$lt : 18} } ) // find documents where age is less than 18

These queries show all fields of the documents that satisfy the given conditions. We can limit the fields displayed for matching documents as follows:

> db.test.find( { age: {$lt : 18} } , { username: 1 } ) // display only username (plus _id, which is included by default) where age is less than 18

MongoDB: Installation and configuration

This post talks about installing and configuring MongoDB.

Installation:

Go to http://www.mongodb.org/downloads and refer to the production release section. Go to the corresponding operating system, in this case OS X, and click on download. Prefer the 64-bit version.[1]

Decompress the downloaded file and rename the folder to mongodb. You will get a bin directory and a few other files. In the bin directory there are two files which we will use most frequently, called "mongo" and "mongod". The "mongo" executable starts the interactive mongo shell, and "mongod" starts the MongoDB server.

If you click on "mongod" to start the server, you might get this error: "error :dbpath (/data/db/) does not exist". Let's see how we can fix this.

Configuration:

We get the error because mongod cannot find the folder /data/db that is set as the dbpath. So we can either change the dbpath or create those folders with the following commands.

sudo mkdir /data
sudo mkdir /data/db
sudo chmod 777 /data/db

We use sudo because we are creating a directory at the root level. With chmod 777 we make sure the directory can be read, written, and entered by every user.
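
If you would rather not create a directory at the root level, the other option is to change the dbpath: mongod accepts a --dbpath option pointing at any directory you own. Here is a minimal sketch, assuming a data directory under your home folder. The function name make_dbpath and the default path are illustrative and not something MongoDB requires.

```shell
# Sketch: create a private data directory instead of /data/db.
# make_dbpath and the default path are illustrative names.
make_dbpath() {
  local dir="${1:-$HOME/mongodb-data}"
  mkdir -p "$dir" || return 1   # -p creates missing parents and is fine if the directory exists
  chmod 700 "$dir"              # owner-only access is enough when you run mongod yourself
  echo "$dir"
}
```

You can then start the server from the bin folder with ./mongod --dbpath "$(make_dbpath)" and no sudo is needed.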

Start the server

Now you can click on "mongod" or run "./mongod" from the command line in the bin folder to start the MongoDB server. The service listens on port 27017. You will get a notification and a lot of status messages in the terminal when the server starts. You can leave this terminal window as it is and use another window for your other terminal tasks.

To cross-check that the server has started, enter this URL in your browser: "http://localhost:28017/". You should see a status page for the running server.


Sunday, September 15, 2013

Sunday, September 8, 2013

Optimizing your Digital Footprints in the Age of Data

"The age of Men is over. The time of the Orc Data has come"

It won't be an exaggeration at all to say this, as human intuition is being replaced by (computer-aided) data-driven decisions. To make more accurate decisions we need more detailed data, so companies are collecting data from every possible interface. Microsoft said the reason for killing the "Start" button in the new Windows version is that usage data gathered from systems says it's not being used. So Microsoft collected "our" data to make "our" life easy; a good thing, right? Let's take another example. Microsoft sends every search term you type for local or network searches to that big Bing engine.[1] And who knows, it might then go to an NSA data warehouse. Even retail companies like Target know your web browsing history.[2]

Companies are working really hard to collect more and more data. Our every keystroke is being recorded by internet companies, browsers, and operating systems.[3][4][5][6][7] On top of that, US intelligence is spending more than $25B on data collection annually.[8]

Long story short, big data is not bad, but as individuals we must have options to preserve our privacy! I would like to talk about a few simple ways to do that in this post.

1] Shift to Firefox. It's stable, fast and secure. All the other major browsers, Chrome, Safari and IE (if you use this), track your activities for commercial use. Additionally, Firefox provides really good add-ons for security and privacy. If you want a more secure service, you can choose JonDonym or the Tor Browser Bundle, which offer anonymous browsing/IP anonymization. Similar services can be used on mobile operating systems as well, namely Orbot (Android) and Onion Browser (iOS).

2] As long as you are getting satisfactory results, you can use DuckDuckGo as your search engine. It's more than enough for regular activities. After the PRISM controversy, DuckDuckGo has been getting really high traffic.[9]

3] If you do not wish to switch from Google as your search engine, log in to Google search, go to "Accounts > Dashboard > Manage your web history", and make sure you turn the web history off. Otherwise it logs your every search and maps it to your email ID (provided you are logged in to some Google service such as Gmail, YouTube, or Blogger). You can clear and pause your search history in YouTube as well if you don't want Google to use or collect your data.

4] If you are comfortable with Linux/Unix operating systems, prefer them over Windows, Chrome OS and OS X. A few really good options are Linux Mint, Ubuntu, and Fedora. If you don't wish to install these systems, you can use live CDs as well.

5] You can use the following browser plugins to protect your data:
  • AdBlock Edge: Blocks advertisements and trackers across the web. (Firefox)
  • Disconnect.me: Disconnect lets you visualize and block the invisible websites that track you. (Safari, Chrome, Opera)
  • HTTPS Everywhere: An extension that encrypts your communications with many major websites, making your data transfer more secure. (Firefox, Chrome)
  • DoNotTrackMe: Perhaps my favorite add-on. It blocks more than 600 tracking services, including identity thieves, advertisers, social networks, and spammers, from tracking you. The same provider offers one more service, MaskMe, which masks your contact details when you enter them in a web form. On techcrunch.com, for example, it showed 14 analytics services set to track your activities.

Note: I am a big data fanboy and certainly not against it, but I strongly believe we must have a choice to protect our privacy.