Hustle Hub #12
🛖 Top 10 Python Libraries to Know to Become a Data Scientist
Hey friends,
I still remember the first question I asked myself when I wanted to go into data science:
What's the programming language that I want to learn?
Fast forward to today, I'm glad that I picked Python. It has been an indispensable tool throughout my data science career for data analysis, data visualisation, machine learning, and general automation tasks. The ease of use, versatility, wide integration with other software, and active support from the community are the main reasons why I'm still using Python.
Don't just take my word for it. If you look across the job descriptions of data science roles, most of the roles require knowledge of Python - if not all.
Besides, a 2018 survey also ranked Python (83%) as the top programming language for data analysis and data science.
Throughout my data science career, there are a handful of Python libraries I keep coming back to, and these are the ones I believe you should know to become a data scientist.
Let's get started! 🚀
📈 Data Cleaning & Analysis
1. Pandas
Pandas is my favourite library when it comes to data cleaning because I can do all sorts of data manipulation, analysis and EDA using Pandas. If you've ever used Excel before, think of Pandas like Excel - but on steroids (seriously).
🤪 Fun Fact: Pandas is a high-level library built on top of NumPy, which makes Pandas easy to use, more flexible, and efficient.
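To give you a taste, here's a minimal sketch of loading a CSV and summarising it with Pandas (the file and column names are made up, just for illustration):

import pandas as pd

df = pd.read_csv("sales.csv")                        # load tabular data into a DataFrame
df = df.dropna(subset=["revenue"])                   # drop rows with missing revenue
summary = df.groupby("region")["revenue"].mean()     # average revenue per region
print(summary)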
Want to learn Pandas? Here's how you can get started.
2. NumPy
NumPy is the fundamental package for scientific computing in Python. In simple words, for anything that involves numbers, including calculation, analysis, aggregation, or computing in general, you can use NumPy.
It's extremely versatile and fast. However, most data scientists (at least from my experience) still prefer Pandas when it comes to dealing with tabular data.
NumPy is still useful when you want to clean and analyse data in a way that Pandas doesn't support, or when you want to build a deep learning model from scratch using NumPy.
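Here's a tiny sketch of the kind of vectorised number-crunching NumPy is built for (toy numbers, just for illustration):

import numpy as np

data = np.array([12.0, 15.5, 9.8, 20.1])             # a small array of numbers
print(data.mean(), data.std())                       # aggregations in one call
normalised = (data - data.mean()) / data.std()       # vectorised, no explicit loops
print(normalised)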
Want to learn NumPy? Here's how you can get started.
📊 Data Visualisation
3. Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualisations in Python.
As a data scientist, you often need to visualise data to identify interesting patterns and trends as well as to generate insights from the graphs. This is where Matplotlib comes in handy.
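Here's a minimal sketch of what plotting with Matplotlib looks like (toy data, just for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 9, 16, 25]
plt.plot(x, y, marker="o")            # a simple line chart
plt.xlabel("x")
plt.ylabel("y")
plt.title("A quick Matplotlib plot")
plt.show()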
Want to learn Matplotlib? Here's how you can get started.
4. Seaborn
Seaborn is another Python library for data visualisation.
🤪 Fun Fact: Seaborn is a high-level library built on top of Matplotlib, which makes Seaborn easy to use and lets you generate aesthetically pleasing visualisations with very little code.
Personally, I mostly use Seaborn for presentation purposes (or when I want good-looking graphs during EDA) because its charts are simpler to read and more aesthetically pleasing out of the box.
However, if you want more flexibility to customise your graphs, you'd still fall back on Matplotlib, which takes a few more steps.
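For a taste, here's a minimal sketch using Seaborn's built-in tips sample dataset (loading it requires an internet connection the first time):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                                   # built-in sample dataset
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")    # one line, nicely styled
plt.show()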
Want to learn Seaborn? Here's how you can get started.
🧑🏻‍💻 Statistics
5. SciPy
SciPy is a library for scientific computing in Python. You can think of SciPy as the extension of NumPy beyond normal numerical computing, including optimisation, integration, differential equations, and statistics.
Therefore, SciPy is one of the most popular Python libraries in academia because of its ability to handle optimisation and many other complex scientific computations.
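Here's a small sketch of the kind of things SciPy handles, using toy numbers just for illustration:

from scipy import optimize, stats

# Find the minimum of a simple one-variable function
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)                                      # close to 3

# A quick two-sample t-test
t_stat, p_value = stats.ttest_ind([2.1, 2.5, 2.8], [3.0, 3.4, 3.1])
print(t_stat, p_value)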
Want to learn SciPy? Here's how you can get started.
6. Statsmodels
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
In simple words, whenever you want to do statistical analysis (regression, R-squared, probability analysis etc.), you can use Statsmodels.
Of course, you can also use NumPy to do the same, but it involves manual calculation whereas Statsmodels can output everything you need in one line of code.
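For example, here's a minimal sketch of fitting an ordinary least squares regression on toy data and getting the full statistical summary:

import statsmodels.api as sm

x = [1, 2, 3, 4, 5]                      # toy data, just for illustration
y = [2.1, 4.3, 6.2, 7.9, 10.1]
X = sm.add_constant(x)                   # add an intercept term
model = sm.OLS(y, X).fit()               # ordinary least squares in one line
print(model.summary())                   # coefficients, R-squared, p-values, etc.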
🤪 Fun Fact: Statsmodels was originally part of SciPy until 2009, when it was split off, improved, and released as a separate library called Statsmodels.
Want to learn Statsmodels? Here's how you can get started.
🤖 Machine Learning
7. Scikit-learn
So you want to build a machine learning model? Use Scikit-learn.
Earlier I mentioned that you can also use NumPy to build machine learning models, but that involves writing many lines of code even for a simple model. With Scikit-learn, you can do it within a few lines of code (seriously).
And this is just the tip of the iceberg of what Scikit-learn can do. Besides building various machine learning models, Scikit-learn also allows you to do feature engineering, hyperparameter tuning, and many more.
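Here's a minimal sketch using Scikit-learn's built-in iris dataset, just to show how few lines a working model takes:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                           # built-in sample dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier().fit(X_train, y_train)      # train a model
print(accuracy_score(y_test, model.predict(X_test)))        # evaluate it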
Want to learn Scikit-learn? Here's how you can get started.
8. TensorFlow
TensorFlow is an open-source machine learning library developed by Google that provides an end-to-end platform for machine learning.
It is the first machine learning library that I learned when I started learning how to build deep learning models. Needless to say, TensorFlow had everything I needed.
If you want to get started with building deep learning models without getting too technical, I'd recommend Keras - a high-level deep learning library built on top of TensorFlow.
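To give a feel for it, here's a minimal sketch of a tiny neural network defined with the Keras API (the input shape and layer sizes are arbitrary, just for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                         # 10 input features (arbitrary)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # binary classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5)                  # training would go here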
Want to learn TensorFlow? Here's how you can get started.
🛠️ Web Scraping
9. Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is commonly used when you scrape data from the web (web scraping).
Why is it important? Because most data in the world is unstructured, rather than the nicely tabular data you receive in CSV format. This is especially true for web data, which you usually get from the Internet as raw HTML.
With Beautiful Soup, you can extract structured data from that raw HTML for your analysis within seconds, not hours.
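Here's a minimal sketch of parsing a page and pulling out its links (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text      # fetch the page (placeholder URL)
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)                               # the page title
for link in soup.find_all("a"):                      # every link on the page
    print(link.get("href"))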
Want to learn Beautiful Soup? Here's how you can get started.
10. Selenium
If you want to automate web browsers in Python, use Selenium. Say goodbye to opening web browsers manually, clicking those buttons etc.
Why is it important? Because you can use Selenium for web scraping, automated testing on your web applications, and other interesting use cases. The best part? Selenium can be used to automate various web browsers, including Chrome, Safari, Firefox etc. (you get it).
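Here's a minimal sketch of driving Chrome with Selenium (it assumes Chrome is installed on your machine; the URL is just a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                          # assumes Chrome is installed
driver.get("https://example.com")                    # open a page
print(driver.title)                                  # read the page title
heading = driver.find_element(By.TAG_NAME, "h1")     # grab the first <h1>
print(heading.text)
driver.quit()                                        # close the browser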
Want to learn Selenium? Here's how you can get started.
Conclusion
In summary, here are the top 10 Python libraries to know to become a data scientist:
Pandas
NumPy
Matplotlib
Seaborn
SciPy
Statsmodels
Scikit-learn
TensorFlow
Beautiful Soup
Selenium
Master the above libraries and you'll have the right tools in your tool belt to do data science with Python.
Hopefully, you found this helpful. The list here is by no means exhaustive, but I think it's good enough to get you started.
By the way, what other Python libraries do you think I should include on this list? Reply to this email and let me know! FOMO is real 😂
🚀 My Journey from Physics into Data Science
Finally, my first YouTube video is out! 🥳
In this video, I shared how I transitioned from physics into data science, including the internships that I took, competitions that I joined, and resources that I learned from. Hopefully, this video will help you learn more about how to go into data science.
Enjoy!
PS: Since this is my first YouTube video, I'd appreciate it if you could share your feedback with me (via email) on how I can improve my next videos. Thanks in advance! 🫡🙏🏻
🚀 Whenever you’re ready, there are 4 ways I can help you:
1. Book a coaching call with me if you need help in the following:
• How To Get Into Data Science
• LinkedIn Growth, Content Strategy & Personal Branding
• 1:1 Mentorship & Career Guidance
• Resume Review
2. Promote your brand to ~1000 subscribers in the data/tech space by sponsoring this newsletter.
3. Watch my YouTube videos where I talk about data science tips, programming, and my tech life (P.S. Don’t forget to like and subscribe 💜).
4. Follow me on LinkedIn and Twitter for more data science career insights, my mistakes and lessons learned from building a startup.
That's all for today
Thanks for reading. I hope you enjoyed today's issue. More than that, I hope it has helped you in some ways and brought you some peace of mind.
You can always write to me by simply replying to this newsletter and we can chat.
See you again next week.
- Admond