Creating a Data Science Portfolio

Getting a job as a data scientist is becoming more and more difficult. After data science was dubbed the “sexiest job of the 21st century”, many jumped on the bandwagon and started looking for a job in the field.

Now, the data science job market is quite saturated for juniors fresh out of college. Vacancies attract hundreds of junior data scientists and sometimes even senior data scientists.

Contradictory requirements in data science job postings make life difficult for entry-level applicants

The question is, what can you do to separate yourself from the hundreds of applicants?

Although there are many ways to do so, like additional internships, courses, MOOCs, etc., the one thing that has helped me out tremendously is creating a portfolio.

Building a data science portfolio has two goals. First, it allows you to demonstrate your technical expertise to a hiring manager. This is especially helpful if you are new to the field. Second, actively building a portfolio is a great learning opportunity. You will need to spend time building algorithms, deploying solutions, and communicating results in a meaningful manner.

This article will give you my view on the question:

“What is necessary to create a solid data science portfolio?”

A data science portfolio can contain many things but typically revolves around what you, the data science professional, have created. Examples are building an algorithm from scratch, researching new methods, doing meaningful analyses, etc.

A few examples of my projects. A full overview can be found on https://www.maartengrootendorst.com/projects/ or on https://github.com/MaartenGr/projects

It also allows you to demonstrate the expertise that you did not develop in your past work experience but did focus on in personal projects.

This also means that a portfolio is not just a tool for making up for gaps in work experience. It can also give you an edge when applying for the more sought-after positions in the field.

When building your portfolio, I believe you benefit greatly from developing a T-shaped profile.

The T-shaped profile represents two things. First, the horizontal bar represents generalist skills. A Data Scientist should have basic knowledge of statistics, programming, deployment, business, etc. Second, the vertical bar represents a specialized skill. This can be a focus on NLP, Computer Vision, Time Series, etc.

In other words, although a Data Scientist can focus on specializing in one specific field, the basics should not be ignored. Finding a balance between these two should help you both career-wise and in personal development.

1. Creating your Projects

The foundation of a portfolio consists of a number of projects. These projects can make or break your portfolio as they are the face of your work.

Thus, these projects should be selected carefully and need to communicate a number of things: quality of analyses, communication skills, code quality, business relevance, etc.

Having another Titanic analysis is not going to separate you from the pack.

Here, I will go through these aspects and demonstrate how I would approach them in my portfolio if I were to create a new one.

The idea

When creating a project, I would advise starting not from a dataset but from an idea. Analyzing without a purpose is typically meaningless and should be avoided as much as possible.

Try to answer questions such as:

What problem do I want to solve?
Who would benefit from my analyses and/or product?
Is there a technique that I would be able to improve upon?

Even if you cannot think of something on your own, there are plenty of sources that can help you get started:

12 Data Science Project Ideas
Reproduce, in PyTorch for example, a popular paper whose implementation was previously only available in TensorFlow
Create a package that fills a certain gap, like AutoScraper or DrawData

But most importantly:

Choose projects that are interesting to you, otherwise they will become too much of a chore

Sometimes, we do not need data to create a data-related project. You can start building algorithms, reproducing papers, or creating a package that fills a specific gap.

https://github.com/alirezamika/autoscraper is a great example of an elegant, yet simple package that cleverly fills a gap.

As an example, in the last few years, I have focused on developing algorithms like BERTopic, KeyBERT, c-TF-IDF, and VLAC. These helped me demonstrate that I could take analyses to the next level.

Data sources

Although you can find a lot of interesting datasets on Kaggle, I would advise you to be a bit more creative in your selection of datasets. For example, you can analyze data from your local government (Dutch, U.S., French, etc.) or from the WHO, or search for data in the large Mendeley database.

In other words, try to collect the data yourself. In many organizations, there are typically quite a few hoops to jump through before you get data in a clean format. Showing that you can find and process data yourself demonstrates valuable skill.

Scraping your data using BeautifulSoup, Scrapy, AutoScraper, or any other method will definitely add points to creativity. It also shows you can work with real-world data since the resulting data is typically messy.
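
To make the scraping step a bit more concrete, here is a minimal sketch of turning a messy HTML table into clean records. It uses only Python's standard-library html.parser so that it runs without extra installs; BeautifulSoup or Scrapy would usually be more convenient on real pages. The HTML snippet, its made-up values, and the helper names (TableParser, clean_rows) are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a scraped page; values are made up.
HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td> Amsterdam </td><td>921,402</td></tr>
  <tr><td>Utrecht</td><td> n/a </td></tr>
  <tr><td>Rotterdam</td><td>655 468</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Strip the stray whitespace that scraped cells often carry.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def clean_rows(rows):
    """Turn raw rows into dicts, parsing numbers and keeping gaps as None."""
    header, *body = rows
    records = []
    for city, population in body:
        # Keep digits only, so "921,402" and "655 468" both parse.
        digits = "".join(ch for ch in population if ch.isdigit())
        records.append({
            header[0].lower(): city,
            header[1].lower(): int(digits) if digits else None,
        })
    return records


parser = TableParser()
parser.feed(HTML)
records = clean_rows(parser.rows)
for record in records:
    print(record)
```

Even in a toy case like this, most of the work is in the cleaning step (stray whitespace, inconsistent number formats, missing values), which is exactly the part worth showing in a portfolio project.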

Repositories

Share your work in a public place like Github. It is a nice way of demonstrating skills like version control, OOP, model evaluation, documentation, etc.

These repositories should have sufficient documentation for someone not familiar with your project to understand how to use or interpret it. Try to communicate clearly and effectively what the project is about, what the results are, and especially the implications of your analyses.

Although it is unlikely that the hiring manager will run your code, it gives a nice overview of the quality of work that you typically deliver.

Here are some examples of good ReadMe pages:

Clumper, part of a video series on calmcode.io.
An amazing overview of quality ReadMe pages

NOTE: If you are interested in examples of quality documentation, I would highly advise looking through the projects of https://github.com/koaning. The skill of proper documentation is severely underestimated.

Deployment

One of the best ways to demonstrate that your skills can create an impact is by deploying your code. The demand for data scientists with software engineering skills seems to be increasing rapidly.

There are several ways to approach this:

First, you can create an application out of the analyses you have created. For example, if you were to create a keyword extraction tool, you could build a Streamlit app demonstrating the usability of your application.
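
As a rough illustration of that idea, the sketch below pairs a deliberately naive, frequency-based keyword extractor with a few lines of Streamlit UI. The extractor, the stop-word list, and the function names are assumptions made for this example and are not how a tool like KeyBERT works. The UI function receives the streamlit module as an argument, so the extraction logic can be imported and tested even where Streamlit is not installed.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real tool would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-words in text (naive baseline)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def render_app(st):
    """Draw the UI. `st` is the streamlit module, passed in by the caller."""
    st.title("Keyword extraction demo")
    text = st.text_area("Paste some text")
    top_n = st.slider("Number of keywords", 1, 10, 5)
    if text:
        st.write(extract_keywords(text, top_n=top_n))


# Quick check of the extraction logic without any UI.
sample = "Topic models find topics in documents. Good topics make documents easier to explore."
print(extract_keywords(sample, top_n=2))
```

In an app.py file you would add `import streamlit as st`, call `render_app(st)`, and start the app with `streamlit run app.py`. Keeping the model logic separate from the UI also makes it easier to swap in a stronger extractor later.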

Second, you can make your Python package available through PyPI or Anaconda. How great would it be if the only thing someone had to do in order to use your Python package was pip install my_package? It shows that you understand the processes leading up to and including production.

Demonstrating software engineering skills is key to separating yourself from the pack

It is important to realize that there is much more to data science than just creating a model. How are users going to use it? How does it need to be deployed? What should it look like? Etc.

2. Tutorials & Blogs

Aside from projects, you can start adding tutorials and blog posts to your portfolio. These are essential in demonstrating that you can explain technical matters in a way that is understandable to those not familiar with the content.

Whenever you create an analysis, model, package, or anything of interest, you can wrap that up in a blog post to make your work more public.

You truly understand something if you are able to teach it to others

Similarly, writing tutorials is an amazing way to help you understand the material. Whenever I learn a new topic I make sure that I can explain it to others.

Honing your writing skills will help you become a better communicator. You will slowly start to gain an intuition for your stakeholders' needs and wishes, which is an amazing skill to have.

Since you are most likely reading this on Medium, it is unsurprising that I would suggest starting out here. Medium, specifically the TowardsDataScience publication, is a great place for posting your technical tutorials and thoughts on the AI field.

My work on TowardsDataScience has not gone unnoticed and has helped tremendously in almost all interviews I have had in the last few years.

And it is not only me who seems to benefit from writing. Megan Dibble explained how writing on Medium got her a job in data analytics. Likewise, Dario Radečić wrote about the benefits of having a blog as a Data Scientist. As a final example, David Robinson explained the benefits of starting a blog.

There are a few more places where you could start writing:

Hackernoon
LinkedIn
Reddit — r/machinelearning or r/datascience
Personal Website — here’s mine as an example

3. Social Media Presence

As you might have already guessed by now, these days your social media presence is becoming more and more important.

Making your work public helps the community understand your skillset. For hiring managers, it is difficult to truly know what you are capable of unless they can see it for themselves.

Although there are quite a few ways of approaching this, I believe that choosing something you enjoy is the most effective. Having said that, below are some communities that are interesting to add to your portfolio.

Kaggle

Competitions are what made Kaggle great. They are an excellent way to practice your modeling skills. Winning competitions and showing that you can go in-depth with predictive modeling is a nice bonus to your resume.

Personally, I believe that the kernels and notebooks provided by users are what keep Kaggle great. The opportunity to help others by sharing valuable EDA is much underestimated.

Being a discussion expert on Kaggle is highly underestimated

There is quite a bit of value to be gained from writing and publishing kernels. Answering questions, starting discussions, and developing kernels are vital for training your communication capabilities.

A great example is Andrey Lukyanenko, who is a Notebooks and Discussion Grandmaster, which is the highest rank you can get on that platform. In an interview, he talks about the challenges and benefits of working on kernels and discussions.

Stackoverflow

Contributing to questions on Stackoverflow can be a great experience. Not only will you learn quite a bit when it comes down to communicating complicated subjects, but it also provides you with an opportunity to further develop your technical skills.

Jon Skeet, one of the best-known users on Stackoverflow, explains how he became famous on the platform and why communication is so important when coding. It is yet again proof that communication skills should not be ignored.

The responses to this StackOverflow question indicate that a good reputation helps you “shine” in an interview. Hiring managers are unlikely to ask for your StackOverflow profile but having one might give you a competitive edge.

Anything that adds to the proof of your expertise is helpful.

It shows that you understand the material quite well, especially if you have gained a reputation on the platform. Imagine you often answer NLP-specific questions on StackOverflow. If I were on the hiring committee, I would definitely see that as a plus.

Twitter & LinkedIn

Whenever you create an open-source package or finish up an analysis, share it through Twitter or LinkedIn. By promoting your content on these platforms you create an interesting opportunity to collaborate and interact with people in your field.

Here, it is mostly about networking and building up your personal “brand”. To give you a few pointers, Admond Lee talks about why it is important to build your personal brand and how it can affect your career.

There is more to Data Science than machine learning… statistics, evaluation, theory, experimentation, model serving, deployment, etc.

Philip Vollet is a good example of someone who has built a huge follower base on these platforms. With 80k+ followers on LinkedIn, he has created a name for himself as an open-source advocate. I would highly advise following him to learn more about interesting open-source projects.

Thus, use these platforms mostly for making your work public and to connect with others. It is unlikely that sharing your Twitter and LinkedIn accounts will land you your dream job.

Github

Github is an excellent place for demonstrating your skills, networking with peers, and learning from mentors. As with most things, if you have the time, I would highly advise either creating a few repositories of your own or contributing to open-source.

When you create your own repositories it forces you to think about the audience and users. Are they able to understand what I am trying to achieve? Can they easily work with the codebase? This means that your README should be a large focus when creating a repository.

With many articles stating how to build up your Github portfolio (here, here, and here) there should be enough sources to get started.

Spend significant time on your README, first impressions last!

Being actively involved in the open-source community is not only a great way to learn but truly shows that you can go in-depth with certain technical aspects of the job. It shows good coding practices, deep algorithmic knowledge, and the ability to work together.

To get started, pick a package that you often use, like scikit-learn, and look for its CONTRIBUTING.md file. There, you will typically find extensive instructions on how to contribute. Scikit-learn, for instance, has an excellent tutorial on how to contribute to the package. Make sure to start with a simple issue and build up from there.

This interview with Sebastian Raschka, the author of the mlxtend package and the popular “Python Machine Learning” book, gives a nice overview of the benefits of working on open-source projects for your career.

As with most things, whether employers look at your Github profile depends on how you have it positioned on your resume. If you put it on your resume, there is a good chance that someone will skim through your Github profile.

4. What does a Portfolio look like?

After having created a set of projects, packages, blog posts, and/or tutorials, it is time to put it all together. What does a fully-formed portfolio look like?

Fortunately, there is not a single answer to this! This gives you some creative freedom when developing one. Whether it will be through a personal website or Github is up to you.

Github

Github is an excellent platform for creating and sharing your portfolio. There are several ways to approach this:

First, by spending some time optimizing your profile's README page. When someone opens up your Github profile, the first thing they will see is the README page. It is an excellent way to tell something about yourself, the projects you have done, and your experience.

I would highly advise looking through this excellent overview of portfolio READMEs for inspiration.

Preview of my Github portfolio page: https://github.com/MaartenGr/projects

Second, you can create a specific page for your portfolio that links to your repositories. This allows you to create a single overview of all the projects you have done without the need to go through all irrelevant forks in your profile.

One example is that of Andrey Lukyanenko, whose Github portfolio is often seen as a clear overview of projects.

LinkedIn

Do not forget to put your projects and experience on LinkedIn. These days, it is the core of your professional network. Try to build up your network so that each time you share a finished project, it gets seen in the right places.

There are quite a few ways to improve your LinkedIn profile, but they all boil down to the same thing: communication. Communicate your skills in such a way that they are clear to the reader without having to sift through pages of experience.

Be concise, to the point, and focus on the results.

You can also put your projects on your LinkedIn profile. Make sure that you focus on skills and metrics but, most importantly, on impact.

Everything you do with data science is meant to have some impact. Make sure that comes through in everything you are working on.

Personal website

Creating a personal website can go a long way in developing your personal brand. You can easily manage the content that you want to publish and focus on the things that are important to you.

The great thing about a personal website is that you can more easily control the narrative of your life story. Focus on the content that you believe is interesting such as projects, posts, or even public talks.

A famous example is Variance Explained, the personal website of David Robinson. His open-sourced website is built using Minimal Mistakes, a Jekyll theme for easily deploying a personal site.

If you are not familiar with Jekyll, I would advise starting out with a simple Github Pages site. You only need a repository with a README to get started!

Resume

Out of all the ways to present your projects, the one that gets the most attention is your resume. It is the one place where everything comes together, and it is typically the main document you are judged on in interviews.

Thus, it is vital that your projects are well-represented in your resume and get the attention they deserve.

I highly recommend the following guides on developing your resume:

Data Science Resume Round-up series by Ken Jee
How to Build a Compelling Data Science Portfolio by William Chen (Kaggle)
A step-by-step breakdown of the resume Jonathan Javier had after his first job at Snap Inc. (Snapchat)
A video of Tina Huang describing how she got an entry data science job at a FAANG with this resume.

By focusing on technology, results, and impact you can communicate the importance of your work.

Focus on the impact of your work

Overleaf is a great source of resume templates. Not only are the resumes of great quality, but it is also a nice opportunity to learn LaTeX, which is commonly used in academia.

There is no one format that works best, as recruiters may differ in opinion on what should be mentioned where. What seems to be a common denominator, though, is that you would want to focus on the tools, technology, and results.
