What is your data science development environment?

Data Science Toolkit

Tools are part of a data scientist's everyday work. In this article I explain which tools are used for which task.

You have probably read a lot about the skills a data scientist has to master. It is much less common to read about the tools a data scientist works with every day, and the range of available software has become overwhelming. Below I give a "small" overview of the programs that, in my experience, belong in a data scientist's toolkit. The list is not exhaustive and may change over time. If you think a relevant tool is missing, please write to me.


The tools in my personal toolkit are marked with the following emoji: 🙌

Development environments (IDEs)

An integrated development environment (IDE) is where data scientists spend a large part of their working time. This is where data is analyzed and models are programmed.

  • PyCharm: A Python IDE from JetBrains with a free Community edition. Particularly suitable for professional coders who value good maintainability and tight testing integration.
  • Visual Studio Code: 🙌 A lightweight, language-agnostic code editor from Microsoft that is available completely free of charge. The built-in features of the basic version are limited to the essentials, but can be expanded with numerous extensions.
  • Jupyter Notebooks: 🙌 A browser-based live document in which code can be executed directly. Text, math formulas and charts are all in one place, which largely removes the need for separate documentation. Supports Python, R and C++, among others.
  • RStudio: 🙌 The most popular IDE among R lovers. In addition to the code, the data and objects stored in the working environment are always in view at the time of execution. The standard version is free.
  • Spyder: The Scientific Python Development Environment, by scientists for scientists. In my opinion the Python counterpart to RStudio.
  • Notepad++: A simple editor that runs on Windows. Not comparable to a real IDE, but very useful for checking file encodings and for simple text manipulation.
  • Atom: 🙌 Also an easy-to-use text editor, but cross-platform and therefore also suitable for Mac and Linux.

Analysis tools

Not everyone has to be a dyed-in-the-wool techie to do data science. Analyses can often be carried out very well with existing analysis tools, especially when it comes to exploring and visualizing data. There are tons of analysis tools out there. Here is a handful of selected solutions that you, as a data scientist, will come across more often.

  • Excel: The classic MS Office product is certainly not one of the most popular tools among data scientists. Nevertheless, it very often does the job and is available by default in almost every company.
  • Google Sheets: 🙌 The Excel alternative for people without an Office license. Runs exclusively in the cloud and allows several people to work on the same sheet at the same time. Sheets can be linked to Google Data Studio for more polished visualizations.
  • Tableau: One of the most popular BI tools on the market. It lets you explore data in interactive dashboards or through visual analyses.
  • QlikView: Very similar to Tableau. It is also used to visualize data and analysis results.
  • Power BI: The BI tool from Microsoft. Combines well with other Microsoft products.

Git

As soon as program code grows larger and more complex, you should think about a good strategy for managing it. This applies to data science just as much as to software development. Git is a version control system for tracking code and smaller datasets; combined with a hosting platform, it also stores them in a central location. This is particularly advisable when working in teams, because Git records every change and flags conflicting edits when branches are merged.

  • GitHub: 🙌 The most popular platform for Git-based code management. GitHub has been part of Microsoft since late 2018.
  • GitLab: An alternative to GitHub, often seen as its little brother. I won't go into the subtle differences here.
  • Bitbucket: Atlassian's Git management tool. Anyone who already uses other Atlassian services, such as Jira and Confluence, benefits from the easy integration with this tool.
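
To make "Git records every change" a little more concrete, here is a minimal Python sketch of how Git identifies a file version by its content. It assumes a repository that uses Git's default SHA-1 object format; the function name git_blob_id and the two sample file contents are purely illustrative and not part of any Git tooling.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git gives to a file's content (a "blob").

    Assumes the default SHA-1 object format; the helper name is hypothetical.
    """
    # Git hashes a small header ("blob <size>\0") followed by the raw content.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# Two illustrative versions of the same file.
version_1 = b"hello\n"
version_2 = b"hello, world\n"

id_1 = git_blob_id(version_1)
id_2 = git_blob_id(version_2)

print(id_1)          # ce013625030ba8dba906f756967f9e9ca394464a
print(id_1 != id_2)  # True: any change in content yields a new ID
```

Because the ID depends only on the content, every edit to a file produces a new, distinct object, which is one reason Git can tell exactly what changed between two versions.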

Project organization

As in any other project, it is important in data science to keep track of the individual tasks and the project status. Agreements and coordination become even more important when you work on a project as a team. Project management tools help here.

  • Jira: A very mature project management tool from Atlassian. Well suited for larger teams and IT companies. Combines well with other Atlassian products.
  • Trello: 🙌 For organizing tasks in small and medium-sized teams. Much more lightweight than Jira.
  • Asana: A good middle ground between Jira and Trello.

Wiki

Even if some people don't like it much, documentation is important, in data science too. In addition to documenting the program well in the code itself, it also means writing down important project information. A wiki bundles valuable knowledge in one place. If you find the right tool for you, documenting can even be fun.

  • Confluence: 🙌 An extensive wiki from Atlassian with a lean editor and many add-ons in the marketplace.
  • OneNote: Not a wiki, but at least a digital notebook. OneNote from Microsoft can be used by several people at the same time in Office 365, and the notes are always synchronized.
  • Evernote: A digital notebook and a great alternative to Confluence or OneNote. This tool is widely used in schools.

Communication

Communication is the be-all and end-all, and data science is no exception. As a data scientist, you have to coordinate with team members or clients on a daily basis. It is not always possible or useful to meet your colleagues and contacts in person. There are now great solutions that enable real-time communication around the world.

  • Slack: 🙌 Currently one of the most widely used communication tools in the corporate world. Includes group chat, direct chat, telephony and document sharing.
  • Microsoft Teams: The Slack alternative from Microsoft, included in many Office 365 licenses.
  • Zoom: A dedicated video conferencing system with numerous functions for smaller and larger team meetings.
  • WebEx: A tool from Cisco that also specializes in video conferencing.
  • Google Meet: 🙌 Google's video conferencing system included with the G Suite license.

Design

Data scientists are not designers. Nevertheless, from time to time there are tasks that resemble those of digital designers or UX designers. Design tools are very useful for data scientists too, for example when drafting data flow concepts, visualizing database models or sketching dashboards.

  • PowerPoint: The good old PowerPoint from Microsoft. Loved by consultants, frowned upon by developers. However, it can be used to create diagrams and concepts fairly quickly, and almost every company has it installed.
  • Keynote: The PowerPoint alternative for Mac users. You can work with it just as well as with PowerPoint.
  • Draw.io: 🙌 For the quick and easy creation of diagrams. Easy to integrate with Confluence and Jira.
  • Balsamiq: A tool for UI design and wireframing. First drafts of dashboards can be put together quickly here.
  • Figma: A purely online design tool specializing in collaborative work on drafts. Works well for creating mock-ups of dashboards.
  • Miro: An online whiteboard for collaborative brainstorming with team members. With it, workshops can also be run remotely.

Having tools is one thing; using them properly is another. Time and again I find that data scientists and analysts do not use their existing toolbox efficiently. Employees often lack the time to get properly acquainted with the products, and productivity suffers as a result. Companies frequently pay very high license fees, yet many employees do not know the most important features, let alone how to link the tools with one another efficiently.

I have been working with many of the tools above day in, day out for a number of years. If you need training or would like advice on a suitable toolkit for your data science department, just contact me.