Menu bar

Sunday, August 31, 2014

Python Tutorials 2 - Control Flow

if Statements
if x < 0:
...     x = 0
...     print 'Negative changed to zero'
... elif x == 0:
...     print 'Zero'
... else:
...     print 'More'
for Statements
>>> # Measure some strings:
... words = ['cat', 'window', 'defenestrate']
>>> for w in words:
...     print w, len(w)
The range() Function 
If you do need to iterate over a sequence of numbers, the built-in function range() comes in handy.
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
It is possible to let the range start at another number, or to specify a different increment (even negative; sometimes this is called the ‘step’):
>>> range(5, 10)
[5, 6, 7, 8, 9]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(-10, -100, -30)
[-10, -40, -70]
To iterate over the indices of a sequence, you can combine range() and len() as follows:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print i, a[i]
break and continue Statements, and else Clauses on Loops
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement
>>> for n in range(2, 10):
...     for x in range(2, n):
...         if n % x == 0:
...             print n, 'equals', x, '*', n/x
...             break
...     else:
...         # loop fell through without finding a factor
...         print n, 'is a prime number'
...
2 is a prime number
3 is a prime number 

Reference: https://docs.python.org/2/tutorial/

Saturday, August 30, 2014

Python Tutorials 1 - Number,String,List

Python is just the language for you.
Programs written in Python are typically much shorter than equivalent C, C++, or Java programs, for several reasons:
  • the high-level data types allow you to express complex operations in a single statement;
  • statement grouping is done by indentation instead of beginning and ending brackets;
  • no variable or argument declarations are necessary.The language is named after the BBC show “Monty Python’s Flying Circus” and has nothing to do with reptiles.
Numbers
>>> 5 ** 2  # 5 squared
25
>>> 2 ** 7  # 2 to the power of 7
128
Python also has built-in support for complex numbers, and uses the j or J suffix to indicate the imaginary part (e.g. 3+5j).

String
Strings can be concatenated (glued together) with the + operator, and repeated with *
>>> # 3 times 'un', followed by 'ium'
>>> 3 * 'un' + 'ium'
'unununium'
Two or more string literals (i.e. the ones enclosed between quotes) next to each other are automatically concatenated.
>>> 'Py' 'thon'
'Python'
Strings can be indexed (subscripted), with the first character having index 0. There is no separate character type; a character is simply a string of size one:
>>> word = 'Python'
>>> word[0]  # character in position 0
'P'

Indices may also be negative numbers, to start counting from the right:
>>> word[-1]  # last character
'n'
>>> word[-2]  # second-last character
'o'
slicing is also supported. While indexing is used to obtain individual characters, slicing allows you to obtain a substring:
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
an omitted first index defaults to zero, an omitted second index defaults to the size of the string being sliced.
>>> word[:2]  # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:]  # characters from position 4 (included) to the end
'on'
>>> word[-2:] # characters from the second-last (included) to the end
'on'
 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1
Python strings cannot be changed — they are immutable. Therefore, assigning to an indexed position in the string results in an error:
>>> word[0] = 'J'
  ...
TypeError: 'str' object does not support item assignment\

Unicode Strings
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
>>> u'Hello\u0020World !'
u'Hello World !'
The built-in function unicode() provides access to all registered Unicode codecs (COders and DECoders)
When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'
To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred.
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

 Lists
 list, which can be written as a list of comma-separated values (items) between square brackets
>>> squares = [1, 4, 9, 16, 25]
>>> squares
[1, 4, 9, 16, 25]
Unlike strings, which are immutable, lists are a mutable type
>>> cubes = [1, 8, 27, 65, 125]
>>> cubes[3] = 64  
>>> cubes
[1, 8, 27, 64, 125
You can also add new items at the end of the list, by using the append() method
>>> cubes.append(216)  # add the cube of 6
It is possible to nest lists (create lists containing other lists), for example:
>>> a = ['a', 'b', 'c']
>>> n = [1, 2, 3]
>>> x = [a, n]
>>> x
[['a', 'b', 'c'], [1, 2, 3]]
>>> x[0]
['a', 'b', 'c']
>>> x[0][1]
'b'

Reference : https://docs.python.org/2/tutorial/


Thursday, August 21, 2014

Wednesday, August 20, 2014

Data Science: How do I become a data scientist?

The current buzzword of tech industry is "Data Science".Here is the best answer from Quora about "How do I become a data scientist?".

The best way to become a data scientist is to learn - and do - data science. There are a many excellent courses and tools available online that can help you get there.

Here is an incredible list of resources compiled by Jonathan Dinu, Co-founder of Zipfian Academy, which trains data scientists and data engineers in San Francisco via immersive programs, fellowships, and workshops.

EDIT: I've had several requests for a permalink to this answer. See here: A Practical Intro to Data Science from Zipfian Academy

EDIT2: See also: "How to Become a Data Scientist" on SlideShare:http://www.slideshare.net/ryanor...

Environment
Python is a great programming language of choice for aspiring data scientists due to its general purpose applicability, a gentle (or firm) learning curve, and — perhaps the most compelling reason — the rich ecosystem of resourcesand libraries actively used by the scientific community.

Development
When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.

STATISTICS
Data scientists are better at software engineering than statisticians and better at statistics than any software engineer. As such, statistical inference underpins much of the theory behind data analysis and a solid foundation of statistical methods and probability serves as a stepping stone into the world of data science.

Courses
edX: Introduction to Statistics: Descriptive Statistics: A basic introductory statistics course.

Coursera Statistics, Making Sense of Data: A applied Statistics course that teaches the complete pipeline of statistical analysis

MIT: Statistical Thinking and Data Analysis: Introduction to probability, sampling, regression, common distributions, and inference.

While R is the de facto standard for performing statistical analysis, it has quite a high learning curve and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, @SciPy, @Matplotlib, and @Python Data Analysis Library

Books
Well-written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

O'Reilly Think Stats: An Introduction to Probability and Statistics for Python programmers

Introduction to Probability: Textbook for Berkeley’s Stats 134 class, an introductory treatment of probability with complementary exercises.

Berkeley Lecture Notes, Introduction to Probability: Compiled lecture notes of above textbook, complete with exercises.

OpenIntro: Statistics: Introductory text book with supplementary exercises and labs in an online portal.

Think Bayes: An simple introduction to Bayesian Statistics with Python code examples.

MACHINE LEARNING/ALGORITHMS
A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there are a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.

Courses
Coursera Machine Learning: Stanford’s famous machine learning course taught by Andrew Ng.

Coursera: Computational Methods for Data Analysis: Statistical methods and data analysis applied to physical, engineering, and biological sciences.

MIT Data Mining: An introduction to the techniques of data mining and how to apply ML algorithms to garner insights.

Edx: Introduction to Artificial Intelligence: Introduction to Artificial Intelligence: The first half of Berkeley’s popular AI course that teaches you to build autonomous agents to efficiently make decisions in stochastic and adversarial settings.

Introduction to Computer Science and Programming: MIT’s introductory course to the theory and application of Computer Science.

Books
UCI: A First Encounter with Machine Learning: An introduction to machine learning concepts focusing on the intuition and explanation behind why they work.

A Programmer's Guide to Data Mining: A web based book complete with code samples (in Python) and exercises.

Data Structures and Algorithms with Object-Oriented Design Patterns in Python: An introduction to computer science with code examples in Python — covers algorithm analysis, data structures, sorting algorithms, and object oriented design.

An Introduction to Data Mining: An interactive Decision Tree guide (with hyperlinked lectures) to learning data mining and ML.

Elements of Statistical Learning: One of the most comprehensive treatments of data mining and ML, often used as a university textbook.

Stanford: An Introduction to Information Retrieval: Textbook from a Stanford course on NLP and information retrieval with sections on text classification, clustering, indexing, and web crawling.

DATA INGESTION AND CLEANING
One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.

Courses
School of Data: A Gentle Introduction to Cleaning Data: A hands on approach to learning to clean data, with plenty of exercises and web resources.

Tutorials
Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats.

Tools
OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.

Data Wrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.

sed - an Introduction and Tutorial: “The ultimate stream editor,” used to process files with regular expressions often used for substitution.

awk - An Introduction and Tutorial: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.

VISUALIZATION
The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and while being one of the most qualitative aspects of data science its methods and tools are well documented.

Courses
UC Berkeley Visualization: Graduate class on the techniques and algorithms for creating effective visualizations.

Rice University Data Visualization: A treatment of data visualization and how to meaningfully present information from the perspective of Statistics.

Harvard University Introduction to Computing, Modeling, and Visualization: Connects the concepts of computing with data to the process of interactively visualizing results.

Books
Tufte: The Visual Display of Quantitative Information: Not freely available, but perhaps the most influential text for the subject of data visualization. A classic that defined the field.

Tutorials
School of Data: From Data to Diagrams: A gentle introduction to plotting and charting data, with exercises.

Predictive Analytics: Overview and Data Visualization: An introduction to the process of predictive modeling, and a treatment of the visualization of its results.

Tools
D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).

Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating “ or SVG based graphics.

Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.

Modest Maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).

Chart.js: Very simple (only six charts) HTML5 “ based plotting library with beautiful styling and animation.

COMPUTING AT SCALE
When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever increasing amount of data, Google developed the MapReduceparadigm. This programming model has become the de facto standard for large scale batch processing since the release of Apache Hadoop in 2007, the open-source MapReduce framework.

Courses
UC Berkeley: Analyzing Big Data with Twitter: A course — taught in close collaboration with Twitter — that focuses on the tools and algorithms for data analysis as applied to Twitter microblog data (with project based curriculum).

Coursera: Web Intelligence and Big Data: An introduction to dealing with large quantities of data from the web; how the tools and techniques for acquiring, manipulating, querying, and analyzing data change at scale.

CMU: Machine Learning with Large Datasets: A course on scaling machine learning algorithms on Hadoop to handle massive datasets.

U of Chicago: Large Scale Learning: A treatment of handling large datasets through dimensionality reduction, classification, feature parametrization, and efficient data structures.

UC Berkeley: Scalable Machine Learning: A broad introduction to the systems, algorithms, models, and optimizations necessary at scale.

Books
Mining Massive Datasets: Stanford course resources on large scale machine learning and MapReduce with accompanying book.

Data-Intensive Text Processing with MapReduce: An introduction to algorithms for the indexing and processing of text that teaches you to “think in MapReduce.”

Hadoop: The Definitive Guide: The most thorough treatment of the Hadoop framework, a great tutorial and reference alike.

Programming Pig: An introduction to the Pig framework for programming data flows on Hadoop.

PUTTING IT ALL TOGETHER
Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for individuals who have such abilities is growing, we are seeing universities and private companies creating custom classes.

Courses
UC Berkeley: Introduction to Data Science: A course taught by Jeff Hammerbacher and Mike Franklin that highlights each of the varied skills that a Data Scientist must be proficient with.

How to Process, Analyze, and Visualize Data: A lab oriented course that teaches you the entire pipeline of data science; from acquiring datasets and analyzing them at scale to effectively visualizing the results.

Coursera: Introduction to Data Science: A tour of the basic techniques for Data Science including SQL and NoSQL databases, MapReduce on Hadoop, ML algorithms, and data visualization.

Columbia: Introduction to Data Science: A very comprehensive course that covers all aspects of data science, with an humanistic treatment of the field.

Columbia: Applied Data Science (with book): Another Columbia course — teaches applied software development fundamentals using real data, targeted towards people with mathematical backgrounds.

Coursera: Data Analysis (with notes and lectures): An applied statistics course that covers algorithms and techniques for analyzing data and interpreting the results to communicate your findings.

Books
An Introduction to Data Science: The companion textbook to Syracuse University’s flagship course for their new Data Science program.

Tutorials
Kaggle: Getting Started With Python For Data Science: A guided tour of setting up a development environment, an introduction to making your first competition submission, and validating your results.

CONCLUSION
Data science is infinitely complex field and this is just the beginning.

If you want to get your hands dirty and gain experience working with these tools in a collaborative environment, check out our programs athttp://zipfianacademy.com.

There's also a great SlideShare summarizing these skills: How to Become a Data Scientist

You're also invited to connect with us on Twitter @zipfianacademy and let us know if you want to learn more about any of these topics.