
DataTalks.Club
By DataTalks.Club

DataTalks.Club, Jul 16, 2021

AI for Digital Health - Maria Bruckert
Links:
- Free ML Engineering course: http://mlzoomcamp.com
- Join DataTalks.Club: https://datatalks.club/slack.html
- Our events: https://datatalks.club/events.html

Cracking the Code: Machine Learning Made Understandable - Christoph Molnar
We talked about:
- Christoph’s background
- Kaggle and other competitions
- How Christoph became interested in interpretable machine learning
- Interpretability vs Accuracy
- Christoph’s current competition engagement
- How Christoph chooses topics for books
- Why Christoph started the writing journey with a book
- Self-publishing vs via a publisher
- Christoph’s other books
- What is conformal prediction?
- Christoph’s book on SHAP
- Explainable AI vs Interpretable AI
- Working alone vs with other people
- Christoph’s other engagements and how to stay hands-on
- Keeping a logbook
- Does one have to be an expert on the topic to write a book about it?
- Writing in the open and other feedback gathering methods
- Advice for those who want to be technical writers
- Self-publishing tools
- Finding Christoph online
Links:
- LinkedIn: https://www.linkedin.com/in/christoph-molnar/
- Website: https://christophmolnar.com/

The Unwritten Rules for Success in Machine Learning - Jack Blandin
We talked about:
- Jack’s background
- Transitioning from IC to management
- Lessons not taught in traditional school
- The importance of people’s perception, trust, and respect
- How soft skills are relevant to machine learning
- How to put on a salesman hat in machine learning management
- The importance of visuals and building a POC as fast as possible
- 1st Rule of Machine Learning – don’t be afraid to start without machine learning
- The importance of understanding the reality that data represents
- The importance of putting yourself in the shoes of customers
- The importance of software engineering skills in machine learning
- Where to find Jack’s content
- Jack’s next venture
Links:
- Jack's LinkedIn profile: https://www.linkedin.com/in/jackblandin/

From a Research Scientist at Amazon to a Machine learning/AI Consultant - Verena Webber
Links:
- Mini sound bath: https://www.youtube.com/watch?v=g-lDrcSqcrQ

From Marketing to Product Owner in Search - Lera Kaimashnikova
We talked about:
- Lera’s background
- Lera’s move from Ukraine to Germany
- The transition from Marketing to Product Ownership
- The importance of communication and one-on-ones
- The role of Product Owner
- Utilizing Scrum as a Product Owner
- Building teams and cross-functionality
- Lera’s experience learning about search
- The importance of having both technical knowledge and business context
- Open developer positions at AUTODOC
- What experience Lera came to AUTODOC with
- How marketing skills helped Lera in her current role
- Lera’s resource recommendations
- Everything is possible
Links:
- Post: https://www.linkedin.com/posts/leracaiman_elasticsearch-ecommerce-activity-7106615081588674560-5WQO

Collaborative Data Science in Business - Ioannis Mesionis
Links:
- LinkedIn: https://www.linkedin.com/in/ioannis-mesionis/
- Github: https://github.com/ioannismesionis
- Website: https://ioannismesionis.github.io/

Bridging Data Science and Healthcare - Eleni Stamatelou

DataTalks.Club Anniversary Interview - Alexey Grigorev, Johanna Bayer

Data Engineering for Fraud Prevention - Angela Ramirez
We talked about:
- Angela's background
- Angela's role at Sam's Club
- The usefulness of knowing ML as a data engineer
- Angela's career path
- Transitioning from data analyst to data engineer/system designer
- Best practices for system design and data engineering
- Working with document databases
- Working with network-based databases
- Detecting fraud with a network-based database
- Selecting the database type to work with
- Neo4j vs Postgres
- The importance of having software engineering knowledge in data engineering
- Data quality check tooling
- The greatest challenges in data engineering
- Debugging and finding the root cause of a failed job
- What kinds of tools Angela uses on a daily basis
- Working with external data sources
- Angela's resource recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/aramirez1305/
- Twitter: https://twitter.com/angelamaria__r
- Github: https://github.com/aramir62
- Previous podcast talk: https://twitter.com/i/spaces/1OwGWwZAZDnGQ?s=20

From Data Manager to Data Architect - Loïc Magnien
We talked about:
- Loïc's background
- Data management
- Loïc's transition to data engineer
- Challenges in the transition to data engineering
- What is a data architect?
- The output of a data architect's work
- Establishing metrics and dimensions
- The importance of communication
- Setting up best practices for the team
- Staying relevant and tech-watching
- Setting up specifications for a pipeline
- Be agile, create a POC, iterate ASAP, and build reusable templates
- Reaching out to Loïc for questions
Links:
- Loïc's LinkedIn: https://www.linkedin.com/in/loicmagnien/

Pragmatic and Standardized MLOps - Maria Vechtomova
We talked about:
- Maria's background
- Marvelous MLOps
- Maria's definition of MLOps
- Alternate team setups without a central MLOps team
- Pragmatic vs non-pragmatic MLOps
- Must-have ML tools (categories)
- Maturity assessment
- What to start with in MLOps
- Standardized MLOps
- Convincing DevOps to implement
- Understanding what the tools are used for instead of knowing all the tools
- Maria's next project plans
- Is LLM Ops a thing?
- What Ahold Delhaize does
- Resource recommendations to learn more about MLOps
- The importance of data engineering knowledge for ML engineers
Links:
- LinkedIn: https://www.linkedin.com/company/marvelous-mlops/
- Website: https://marvelousmlops.substack.com/
Free MLOps course: https://github.com/DataTalksClub/mlops-zoomcamp

Democratizing Causality - Aleksander Molak
We talked about:
- Aleksander's background
- Aleksander as a Causal Ambassador
- Using causality to make decisions
- Counterfactuals and Judea Pearl
- Meta-learners vs classical ML models
- Average treatment effect
- Reducing causal bias, the super efficient estimator, and model uplifting
- Metrics for evaluating a causal model vs a traditional ML model
- Is the added complexity of a causal model worth implementing?
- Utilizing LLMs in causal models (text as outcome)
- Text as treatment and style extraction
- The viability of A/B tests in causal models
- Graphical structures and nonparametric identification
- Aleksander's resource recommendations
Links:
- The Book of Why: https://amzn.to/3OZpvBk
- Causal Inference and Discovery in Python: https://amzn.to/46Pperr
- Book's GitHub repo: https://github.com/PacktPublishing/Causal-Inference-and-Discovery-in-Python
- The Battle of Giants: Causality vs NLP (PyData Berlin 2023): https://www.youtube.com/watch?v=Bd1XtGZhnmw
- New Frontiers in Causal NLP (papers repo): https://bit.ly/3N0TFTL

Mastering Data Engineering as a Remote Worker - José María Sánchez Salas
We talked about:
- José's background
- How José relocated to Norway and his schedule
- Tech companies in Norway and José's role
- Challenges of working as a remote data engineer
- José's newsletter on how to make use of data
- The process of making data useful
- Where José gets inspiration for his newsletter
- Dealing with burnout
- When in Norway, do as the Norwegians do
- The legalities of working remotely in Norway
- The benefits of working remotely
Links:
- LinkedIn: https://www.linkedin.com/in/jmssalas
- Github: https://github.com/jmssalas
- Website & Newsletter: https://jmssalas.com

The Good, the Bad and the Ugly of GPT - Sandra Kublik
We talked about:
- Sandra's background
- Making a YouTube channel to break into the LLM space
- The business cases for LLMs
- LLMs as amplifiers
- The benefits of keeping a human in the loop when using LLMs (AI limitations)
- Using LLMs as assistants
- Building an app that uses an LLM
- Prompt whisperers and how to improve your prompts
- Sandra's 7-day LLM experiment
- Sandra's LLM content recommendations
- Finding Sandra online
Links:
- LinkedIn: https://www.linkedin.com/in/sandrakublik/
- Twitter: https://twitter.com/sandra_kublik
- Youtube: https://www.youtube.com/@sandra_kublik

LLMs for Everyone - Meryem Arik
We talked about:
- Meryem's background
- The constant evolution of startups
- How Meryem became interested in LLMs
- What is an LLM (generative vs non-generative models)?
- Why LLMs are important
- Open source models vs API models
- What TitanML does
- How fine-tuning a model helps in LLM use cases
- Fine-tuning generative models
- How generative models change the landscape of human work
- How to adjust models over time
- Vector databases and LLMs
- How to choose an open source LLM or an API
- Measuring input data quality
- Meryem's resource recommendations
Links:
- Website: https://www.titanml.co/
- Beta docs: https://titanml.gitbook.io/iris-documentation/overview/guide-to-titanml...
- Using llama2.0 in TitanML Blog: https://medium.com/@TitanML/the-easiest-way-to-fine-tune-and-inference-llama-2-0-8d8900a57d57
- Discord: https://discord.gg/83RmHTjZgf
- Meryem LinkedIn: https://www.linkedin.com/in/meryemarik/

Investing in Open-Source Data Tools - Bela Wiertz
We talked about:
- Bela's background
- Why startups even need investors
- Why open source is a viable go-to-market strategy
- Building a bottom-up community
- The investment thesis for the TKM Family Office and the blurriness of the funding round naming convention
- Angel investors vs VC Funds vs family offices
- Bela's investment criteria and GitHub stars as a metric
- Inbound sourcing, outbound sourcing, and investor networking
- Making a good impression on an investor
- Balancing open and closed source parts of a product
- The future of open source
- Recent successes of open source companies
- Bela's resource recommendations
Links:
- Understand who is engaging with your open source project article: https://www.crowd.dev/
- Top 6 Books on Developer Community Building: https://www.crowd.dev/post/top-6-books-on-developer-community-building
- Which open source software metrics matter: https://www.bvp.com/atlas/measuring-the-engagement-of-an-open-source-software-community#Which-open-source-software-metrics-matter

Why Machine Learning Design is Broken - Valerii Babushkin
Links:
- Book: https://www.manning.com/books/machine-learning-system-design?utm_source=AGMLBookcamp&utm_medium=affiliate&utm_campaign=book_babushkin_machine_4_25_23&utm_content=twitter
- Discount: poddatatalks21 (35% off)
- Evidently: https://www.evidentlyai.com/
- Article: https://medium.com/people-ai-engineering/design-documents-for-ml-models-bbcd30402ff7

Interpretable AI and ML - Polina Mosolova
We talked about:
- Polina's background
- How common it is for PhD students to build ML pipelines end-to-end
- Simultaneous PhD and industry experience
- Support from both the academic and industry sides
- How common the industrial PhD setup is and how to get into one
- Organizational trust theory
- How price relates to trust
- How trust relates to explainability
- The importance of actionability
- Explainability vs interpretability vs actionability
- Complex glass box models
- Does the explainability of a model follow explainability?
- What explainable AI brings to customers and end users
- Can all trust be turned into a KPI?
Links:
- LinkedIn: https://www.linkedin.com/in/polina-mosolova/
- Neural Additive Models paper: https://proceedings.neurips.cc/paper/2021/file/251bd0442dfcc53b5a761e050f8022b8-Paper.pdf
- Neural Basis Model paper: https://arxiv.org/pdf/2205.14120.pdf
- Interpretable Feature Spaces paper: https://kdd.org/exploration_files/vol24issue1_1._Interpretable_Feature_Spaces_revised.pdf

From Scratch to Success: Building an MLOps Team and ML Platform - Simon Stiebellehner
We talked about:
- Simon's background
- What MLOps is and what it isn't
- Skills needed to build an ML platform that serves 100s of models
- Ranking the importance of skills
- The point where you should think about building an ML platform
- The importance of processes in ML platforms
- Weighing your options with SaaS platforms
- The exploratory setup, experiment tracking, and model registry
- What comes after deployment?
- Stitching tools together to create an ML platform
- Keeping data governance in mind when building a platform
- What comes first – the model or the platform?
- Do MLOps engineers need to have deep knowledge of how models work?
- Is API design important for MLOps?
- Simon's recommendations for furthering MLOps knowledge
Links:
- LinkedIn: https://www.linkedin.com/in/simonstiebellehner/
- Github: https://github.com/stiebels
- Medium: https://medium.com/@sistel

From MLOps to DataOps - Santona Tuli
We talked about:
- Santona's background
- Focusing on data workflows
- Upsolver vs DBT
- ML pipelines vs Data pipelines
- MLOps vs DataOps
- Tools used for data pipelines and ML pipelines
- The “modern data stack” and today's data ecosystem
- Staging the data and the concept of a “lakehouse”
- Transforming the data after staging
- What happens after the modeling phase
- Human-centric vs Machine-centric pipeline
- Applying skills learned in academia to ML engineering
- Crafting user personas based on real stories
- A framework of curiosity
- Santona's book and resource recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/santona-tuli/
- Upsolver website: upsolver.com
- Why we built a SQL-based solution to unify batch and stream workflows: https://www.upsolver.com/blog/why-we-built-a-sql-based-solution-to-unify-batch-and-stream-workflows

Data Developer Relations - Hugo Bowne-Anderson
We talked about:
- Hugo's background
- Why do tools and the companies that run them have wildly different names?
- Hugo's other projects besides Metaflow
- Transitioning from educator to DevRel
- What is DevRel?
- DevRel vs Marketing
- How DevRel coordinates with developers
- How DevRel coordinates with marketers
- What skills a DevRel needs
- The challenges that come with being an educator
- Becoming a good writer: nature vs nurture
- Hugo's approach to writing and suggestions
- Establishing a goal for your content
- Choosing a form of media for your content
- Is DevRel intercompany or intracompany?
- The Vanishing Gradients podcast
- Finding Hugo online
Links:
- Hugo Bowne-Anderson's GitHub: http://hugobowne.github.io/
- Vanishing Gradients: https://vanishinggradients.fireside.fm/
- MLOps and DevOps: Why Data Makes It Different: https://www.oreilly.com/radar/mlops-and-devops-why-data-makes-it-different/
- Evaluate Metaflow for free, right from your Browser: https://outerbounds.com/sandbox/

Lessons Learned from Freelancing and Working in a Start-up - Antonis Stellas
We talked about:
- Antonis' background
- The pros and cons of working for a startup
- Useful skills for working at a startup and the Lean way to work
- How Antonis joined the DataTalks.Club community
- Suggestions for students joining the MLOps course
- Antonis contributing to Evidently AI
- How Antonis started freelancing
- Getting your first clients on Upwork
- Pricing your work as a freelancer
- The process after getting approved by a client
- Wearing many hats as a freelancer and while working at a startup
- Other suggestions for getting clients as a freelancer
- Antonis' thoughts on the Data Engineering course
- Antonis' resource recommendations
Links:
- Lean Startup by Eric Ries: https://theleanstartup.com/
- Lean Analytics: https://leananalyticsbook.com/
- Designing Machine Learning Systems by Chip Huyen: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/
- Kafka Streaming with Python by Kris Jenkins (tutorial video): https://youtu.be/jItIQ-UvFI4

Data Access Management - Bart Vandekerckhove
We talked about:
- Bart's background
- What is data governance?
- Data dictionaries and data lineage
- Data access management
- How to learn about data governance
- What skills are needed to do data governance effectively
- When an organization needs to start thinking about data governance
- Good data access management processes
- Data masking and the importance of automating data access
- DPO and CISO roles
- How data access management works with a data mesh approach
- Avoiding the role explosion problem
- The importance of data governance integration in DataOps
- Terraform as a stepping stone to data governance
- How Raito can help an organization with data governance
- Open-source data governance tools
Links:
- LinkedIn: https://www.linkedin.com/in/bartvandekerckhove/
- Twitter: https://twitter.com/Bart_H_VDK
- Github: https://github.com/raito-io
- Website: https://www.raito.io/
- Data Mesh Learning Slack: https://data-mesh-learning.slack.com/join/shared_invite/zt-1qs976pm9-ci7lU8CTmc4QD5y4uKYtAA#/shared-invite/email
- DataQG Website: https://dataqg.com/
- DataQG Slack: https://dataqgcommunitygroup.slack.com/join/shared_invite/zt-12n0333gg-iTZAjbOBeUyAwWr8I~2qfg#/shared-invite/email
- DMBOK (Data Management Book of Knowledge): https://www.dama.org/cpages/body-of-knowledge
- DMBOK Wheel describing the data governance activities: https://www.dama.org/cpages/dmbok-2-wheel-images

Data Strategy: Key Principles and Best Practices - Boyan Angelov
We talked about:
- Boyan's background
- What is data strategy?
- Due diligence and establishing a common goal
- Designing a data strategy
- Impact assessment, portfolio management, and DataOps
- Data products
- DataOps, Lean, and Agile
- Data Strategist vs Data Science Strategist
- The skills one needs to be a data strategist
- How does one become a data strategist?
- Data strategist as a translator
- Transitioning from a Data Strategist role to a CTO
- Using ChatGPT as a writing co-pilot
- Using ChatGPT as a starting point
- How ChatGPT can help in data strategy
- Pitching a data strategy to a stakeholder
- Setting baselines in a data strategy
- Boyan's book recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/angelovboyan/
- Twitter: https://twitter.com/thinking_code
- Github: https://github.com/boyanangelov
- Website: https://boyanangelov.com/

Practical Data Privacy - Katharine Jarmul
We talked about:
- Katharine's background
- Katharine's ML privacy startup
- GDPR, CCPA, and the “opt-in as the default” approach
- What is data privacy?
- Finding Katharine's book – Practical Data Privacy
- The various definitions of data privacy and “user profiles”
- Privacy engineering and privacy-enhancing technologies
- Why data privacy is important
- What is differential privacy?
- The importance of keeping privacy in mind when designing systems
- Data privacy using the example of ChatGPT
- Katharine's resource suggestions for learning about data privacy
Links:
- LinkedIn: https://www.linkedin.com/in/katharinejarmul/
- Twitter: https://twitter.com/kjam
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp

Building Scalable and Reliable Machine Learning Systems - Arseny Kravchenko
We talked about:
- Arseny's background
- Working on machine learning in startups
- What is Machine Learning System Design?
- Constraints and requirements
- Known unknowns vs unknown unknowns (Design stage)
- Writing a design document
- Technical problems vs product-oriented problems
- The solution part of the Design Document
- What motivated Arseny to write a book on ML System Design
- Examples of a Design Document in the book
- The types of readers for ML System Design
- Working with the co-author
- Reacting to constraints and feedback when writing a book
- Arseny's favorite chapter of the book
- Other resources where you can learn about ML System Design
- Twitter Giveaway
Links:
- Book: https://www.manning.com/books/machine-learning-system-design?utm_source=AGMLBookcamp&utm_medium=affiliate&utm_campaign=book_babushkin_machine_4_25_23&utm_content=twitter
- Discount: poddatatalks21 (35% off)

Building an Open-Source NLP Tool - Johannes Hötter
We talked about:
- Johannes’s background
- Johannes’s Open Source Spotlight demos – Refinery and Bricks
- The difficulties of working with natural language processing (NLP)
- Incorporating ChatGPT into a process as a heuristic
- What is Bricks?
- The process of starting a startup – Kern
- Making the decision to go with open source
- Pros and cons of launching as open source
- Kern’s business model
- Working with enterprises
- Johannes as a salesperson
- The team at Kern
- Johannes’s role at Kern
- How Johannes and Henrik separate responsibilities at Kern
- Working with very niche use cases
- The short story of how Kern got its funding
- Johannes’s resource recommendation
Links:
- Refinery's GitHub repo: https://github.com/code-kern-ai/refinery
- Bricks' Github repo: https://github.com/code-kern-ai/bricks
- Bricks Open Source Spotlight demo: https://www.youtube.com/watch?v=r3rXzoLQy2U
- Refinery Open Source Spotlight demo: https://www.youtube.com/watch?v=LlMhN2f7YDg
- Discord: https://discord.com/invite/qf4rGCEphW
- Kern's website: https://www.kern.ai

Navigating Industrial Data Challenges - Rosona Eldred
We talked about:
- Rosona’s background
- How mathematics knowledge helps in industry
- What is industrial data?
- Setting up an industrial process using blue paint
- Internet companies’ data vs industrial data
- Explaining industrial processes using packing peanuts
- Why productive industry needs data
- Measuring product qualities
- How data specialists use industrial data
- Defining and measuring sustainability
- Using data to react to changing regulations
- Types of industrial data
- Solving problems and optimizing with industrial data
- Industrial solvers
- Tiny data vs Big data in productive industry
- The advantages of coming from academia into productive industry
- Materials and resources for industrial data
- Women in industry
- Why Rosona decided to shift to industrial data
Links:
- Kaggle dataset: https://www.kaggle.com/datasets/paresh2047/uci-semcom

Mastering Self-Learning in Machine Learning - Aaisha Muhammad
We talked about:
- Aaisha’s background
- How homeschooling affects self-study
- Deciding on what to learn about
- Establishing whether a resource is good
- How Aaisha focuses on learning
- Deciding on what kind of project to build
- Finding research materials
- Aaisha’s experience with the DataTalks.Club ML Zoomcamp
- ML Zoomcamp projects
- Aaisha’s interest in bioinformatics
- Keeping motivated with deadlines
- Notes and time-tracking tools
- Drawbacks to self-studying
- Aaisha’s interest in machine learning
- Aaisha’s least favorite part of ML Zoomcamp
- Helping people as a way to learn
- Using ChatGPT as a “study group”
- Is it possible to use self-studying to learn high-level topics?
- Switching topics to avoid burnout
- Aaisha’s resource recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/aaisha-muhammad/
- Twitter: https://twitter.com/ZealousMushroom
- Github: https://github.com/AaishaMuhammad
- Website: http://www.aaishamuhammad.co.za/

The Secret Sauce of Data Science Management - Shir Meir Lador
We talked about:
- Shir’s background
- Debrief culture
- The responsibilities of a group manager
- Defining the success of a DS manager
- The three pillars of data science management
- Managing up
- Managing down
- Managing across
- Managing data science teams vs business teams
- Scrum teams, brainstorming, and sprints
- The most important skills and strategies for DS and ML managers
- Making sure proof of concepts get into production
Links:
- The secret sauce of data science management: https://www.youtube.com/watch?v=tbBfVHIh-38
- Lessons learned leading AI teams: https://blogs.intuit.com/2020/06/23/lessons-learned-leading-ai-teams/
- How to avoid conflicts and delays in the AI development process (Part I): https://blogs.intuit.com/2020/12/08/how-to-avoid-conflicts-and-delays-in-the-ai-development-process-part-i/
- How to avoid conflicts and delays in the AI development process (Part II): https://blogs.intuit.com/2021/01/06/how-to-avoid-conflicts-and-delays-in-the-ai-development-process-part-ii/
- Leading AI teams deck: https://drive.google.com/drive/folders/1_CnqjugtsEbkIyOUKFHe48BeRttX0uJG
- Leading AI teams video: https://www.youtube.com/watch?app=desktop&v=tbBfVHIh-38

SE4ML - Software Engineering for Machine Learning - Nadia Nahar
We talked about:
- Nadia’s background
- Academic research in software engineering
- Design patterns
- Software engineering for ML systems
- Problems that people in industry have with software engineering and ML
- Communication issues and setting requirements
- Artifact research in open source products
- Product vs model
- Nadia’s open source product dataset
- Failure points in machine learning projects
- Finding solutions to issues using Nadia’s dataset and experience
- The problem of siloing data scientists and other structure issues
- The importance of documentation and checklists
- Responsible AI
- How data scientists and software engineers can work in an Agile way
Links:
- Model Card: https://arxiv.org/abs/1810.03993
- Datasheets: https://arxiv.org/abs/1803.09010
- Factsheets: https://arxiv.org/abs/1808.07261
- Research Paper: https://www.cs.cmu.edu/~ckaestne/pdf/icse22_seai.pdf
- Arxiv version: https://arxiv.org/pdf/2110.

Starting a Consultancy in the Data Space - Aleksander Kruszelnicki
We talked about:
- Aleksander’s background
- The difficulty of selling data stack as a service
- How Aleksander got into consulting
- The Mom Test – extracting feedback from people
- User interviews
- Why Aleksander’s data stack as a service startup was not viable
- How Aleksander decided to switch to consulting
- Finding clients to consult
- Figuring out how to position your services
- Geographical limitations
- Figuring out your target audience
- The importance of networking and marketing
- Pricing your services
- The pitfalls of daily and hourly pricing and how to balance incentives
- Is Germany a good place to found a company?
- Aleksander’s book recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/alkrusz/
- Twitter: https://twitter.com/alkrusz
- Website: www.leukos.io

Biohacking for Data Scientists and ML Engineers - Ruslan Shchuchkin
We talked about:
- Ruslan’s background
- Fighting procrastination and perfectionism
- What is biohacking?
- The role of dopamine and other hormones in daily life
- How meditation can help
- The influence light has on our bodies
- Behavioral biohacking
- Daylight lamps and using light to wake up
- Sleep cycles
- How nutrition affects productivity
- Measuring productivity
- Examples of unsuccessful biohacking attempts
- Stoicism, voluntary discomfort, and self-challenges
- Biohacking risks and ways to prevent them
- Coffee and tea biohacking
- Using self-reflection and tracking to measure results
- Mindset shifting
- Stoicism book recommendation
- Work/life balance
- Ruslan’s biohacking resource recommendation
Links:
- LinkedIn: https://www.linkedin.com/in/ruslanshchuchkin/

Analytics for a Better World - Parvathy Krishnan
We talked about:
- Parvathy’s background
- Brainstorming sessions with nonprofits to establish data maturity
- Example of an Analytics for a Better World project
- The overall data maturity situation of nonprofits vs private sector
- Solving the skill gap
- Publicly available content
- The Analytics for a Better World Academy
- The Academy’s target audience
- How researchers can work with Analytics for a Better World
- Improving data maturity in nonprofit organizations
- People, processes, and technology
- Typical tools that Analytics for a Better World recommends to nonprofits
- Profiles in nonprofits
- Does Analytics for a Better World have a need for data engineers?
- The Analytics for a Better World team
- Factors that help organizations become more data-driven
- Parvathy’s resource recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/parvathykrishnank/
- Twitter: https://twitter.com/ABWInstitute
- Github: https://github.com/Analytics-for-a-Better-World
- Website: https://analyticsbetterworld.org/
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Accelerating the Adoption of AI through Diversity - Dânia Meira
We talked about:
- Dânia’s background
- Founding the AI Guild
- Datalift Summit
- Coming up with meetup topics
- Diversity in Berlin
- Other types of diversity besides gender
- The pitfalls of lacking diversity
- Creating an environment where people can safely share their experiences
- How the AI Guild helps organizations become more diverse
- How the AI Guild finds women in the fields of AI and data science
- Advice for people in underrepresented groups
- Organizing a welcoming environment and creating a code of conduct
- AI Guild’s consulting work and community
- AI Guild team
- Dânia’s resource recommendations
- Upcoming Datalift Summit
Links:
- Call for Speakers for the #datalift summit (Berlin, 14 to 16 June 2023): https://eu1.hubs.ly/H02RXvX0
- Coded Bias documentary on Netflix: https://www.netflix.com/de/title/81328723
- Book Weapons of Math Destruction by Cathy O'Neil: https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction
- Book Lean In by Sheryl Sandberg: https://en.wikipedia.org/wiki/Lean_In
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Staff AI Engineer - Tatiana Gabruseva
We talked about:
- Tatiana’s background
- Going from academia to healthcare to the tech industry
- What staff engineers do
- Transferring skills from academia to industry and learning new ones
- The importance of having mentors
- Skipping junior and mid-level straight into the staff role
- Convincing employers that you can take on a lead role
- Seeing failure as a learning opportunity
- Preparing for coding interviews
- Preparing for behavioral and system design interviews
- The importance of having a network and doing mock interviews
- How much do staff engineers work with building pipelines, data science, ETL, MLOps, etc.?
- Context switching
- Advice for those going from academia to industry
- The most exciting thing about working as an AI staff engineer
- Tatiana’s book recommendations
Links:
- LinkedIn: https://www.linkedin.com/in/tatigabru/
- Twitter: https://twitter.com/tatigabru
- Github: https://github.com/tatigabru
- Website: http://tatigabru.com/
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

The Journey of a Data Generalist: From Bioinformatics to Freelancing - Jekaterina Kokatjuhha
We talked about:
- Jekaterina’s background
- How Jekaterina started freelancing
- Jekaterina’s initial ways of getting freelancing clients
- How being a generalist helped Jekaterina’s career
- Connecting business and data
- How Jekaterina’s LinkedIn posts helped her get clients
- Jekaterina’s work in fundraising
- Cohorts and KPIs
- Improving communication between the data and business teams
- Motivating every link in the company’s chain
- The cons of freelancing
- Balancing projects and networking
- The importance of enjoying what you do
- Growing the client base
- Working in the office vs working remotely
- Jekaterina’s advice for people who feel stuck
- Jekaterina’s resource recommendations
Links:
- Jekaterina's LinkedIn: https://www.linkedin.com/in/jekaterina-kokatjuhha/
Join DataTalks.Club: https://datatalks.club/slack.html

Navigating Career Changes in Machine Learning - Chris Szafranek
We talked about:
- Chris’s background
- Switching careers multiple times
- Freedom at companies
- Chris’s role as an internal consultant
- Chris’s sabbatical
- ChatGPT
- How being a generalist helped Chris in his career
- The cons of being a generalist and the importance of T-shaped expertise
- The importance of learning things you’re interested in
- Tips to enjoy learning new things
- Recruiting generalists
- The job market for generalists vs for specialists
- Narrowing down your interests
- Chris’s book recommendations
Links:
- Lex Fridman: science, philosophy, media, AI (especially earlier episodes): https://www.youtube.com/lexfridman
- Andrej Karpathy, former Senior Director of AI at Tesla, who's now focused on teaching and sharing his knowledge: https://www.youtube.com/@AndrejKarpathy
- Beautifully done videos on engineering of things in the real world: https://www.youtube.com/@RealEngineering
- Chris' website: https://szafranek.net/
- Zalando Tech Radar: https://opensource.zalando.com/tech-radar/
- Modal Labs, new way of deploying code to the cloud, also useful for testing ML code on GPUs: https://modal.com
- Excellent Twitter account to follow to learn more about prompt engineering for ChatGPT: https://twitter.com/goodside
- Image prompts for Midjourney: https://twitter.com/GuyP
- Machine Learning Workflows in Production - Krzysztof Szafranek: https://www.youtube.com/watch?v=CO4Gqd95j6k
- From Data Science to DataOps: https://datatalks.club/podcast/s11e03-from-data-science-to-dataops.html
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Preparing for a Data Science Interview - Luke Whipps
We talked about:
- Luke’s background
- Luke’s podcast - AI Game Changers
- How Luke helps people get jobs
- What’s changed in the recruitment market over the last 6 months
- Getting ready for the interview process
- Stage “zero” – the filter between the candidate and the company
- Preparing for the introduction stage – research and communication
- Reviewing the fundamentals during preparation
- Preparing for the technical part of the interview
- Establishing the hiring company’s expectations
- Depth vs breadth
- Overly theoretical and mathematical questions in interviews
- Bombing (failing) in the middle of an interview
- Applying to different roles within the same company
- Luke’s resource recommendations
Links:
- Luke's LinkedIn: https://www.linkedin.com/in/lukewhipps/
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Indie Hacking - Pauline Clavelloux
We talked about:
- Pauline’s background
- Pauline’s work as a manager at IBM
- What is indie hacking?
- Pauline’s initial indie hacking projects
- Getting ready for launch
- Responsibilities and challenges in indie hacking
- Pauline’s latest indie hacking project
- Going live and marketing
- Challenges with Unreal Me
- Staying motivated with indie hacking projects
- Skills Pauline picked up while doing indie hacking projects
- Balancing a day job and indie hacking
- Micro SaaS and AboutStartup.io
- How Pauline comes up with ideas for projects
- Going from an idea on paper to building a project
- Pauline’s Twitter success
- Connecting with Pauline online
- Pauline’s indie hacking inspiration
- Pauline’s resource recommendation
Links:
- Website: https://wintopy.io/
- Pauline's Twitter: https://twitter.com/Pauline_Cx
- Pauline's LinkedIn: https://www.linkedin.com/in/paulineclavelloux/
- Blog about Indiehacking: https://aboutstartup.io
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Doing Software Engineering in Academia - Johanna Bayer
We talked about:
- Johanna’s background
- Open science course and reproducible papers
- Research software engineering
- Convincing a professor to work on software instead of papers
- The importance of reproducible analysis
- Why academia is behind on software engineering
- The problems with open science publishing in academia
- The importance of standard coding practices
- How Johanna got into research software engineering
- Effective ways of learning software engineering skills
- Providing data and analysis for your project
- Johanna’s initial experience with software engineering in a project
- Working with sensitive data and the nuances of publishing it
- How often Johanna does hackathons, open source, and freelancing
- Social media as a source of repos and Johanna’s favorite communities
- Contributing to Git repos
- Publishing in the open in academia vs industry
- Johanna’s book and resource recommendations
- Conclusion
Links:
- The Society of Research Software Engineering, plus regional chapters: https://society-rse.org/
- The RSE Association of Australia and New Zealand: https://rse-aunz.github.io/
- Research Software Engineers (RSEs), the people behind research software: https://de-rse.org/en/index.html
- The Software Sustainability Institute: https://www.software.ac.uk/
- The Carpentries (beginner git and programming courses): https://carpentries.org/
- The Turing Way Book of Reproducible Research: https://the-turing-way.netlify.app/welcome
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Data-Centric AI - Marysia Winkels
We talked about:
- Marysia’s background
- What data-centric AI is
- Data-centric Kaggle competitions
- The mindset shift to data-centric AI
- Data-centric does not mean you should not iterate on models
- How to implement the data-centric approach
- Focusing on the data vs focusing on the model
- Resources to help implement the data-centric approach
- Data-centric AI vs standard data cleaning
- Making sure your data is representative
- Knowing when your data is good enough
- The importance of user feedback
- “Shadow Mode” deployment
- What to do if you have a lot of bad data or incomplete data
- Marysia’s role at PyData
- How Marysia joined PyData
- The difference between PyData and PyCon
- Finding Marysia online
Links:
- Embetter & Bulk Demo: https://www.youtube.com/watch?v=L---nvDw9KU
Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Business Skills for Data Professionals - Loris Marini
We talked about:
- Loris’ background
- Transitioning from physics to data
- Aligning people on concepts
- Lead indicators and stickiness
- Context, semantics, and meaning
- Communication and being memorable
- Making data digestible for business and building trust
- The importance of understanding the language of business
- Stakeholder mapping
- Attending business meetings as a data professional
- Organizing your stakeholder map
- Prioritizing
- How to support the business strategy
- Learning to speak online
- Resource recommendations from Loris
Links:
- Discovering Data Discord server: https://bit.ly/discovering-data-discord
- Loris' LinkedIn: https://www.linkedin.com/in/lorismarini/
- Loris' Twitter: https://twitter.com/LorisMarini

From Software Engineer to Data Science Manager - Sadat Anwar
We talked about:
- Sadat’s background
- Sadat’s backend engineering experience
- Sadat’s pivot point as a backend engineer
- Sadat’s exposure to ML and Data Science
- Sadat’s Act Before you Think approach (with safety nets)
- Sadat’s street cred and transition into management
- The hiring process as an internal candidate
- The importance of people management skills
- The Brag List
- The most difficult part of transitioning to management
- Focusing on projects and setting milestones
- Sadat’s transition from EM to data science management
- How much domain knowledge is needed for management?
- The main difference between engineering and management
- How being an EM helped Sadat transition to DS management
- Transitioning to DS management from other roles
- How to feel accomplished as a manager
- Sadat’s book recommendations
- Sadat’s meetups
Links:
- Sadat's Meetup page: https://www.meetup.com/berlin-search-technology-meetup/
- Meetup event "Bias in AI: how to measure it and how to fix it": https://www.meetup.com/data-driven-ai-berlin-meetup/events/289927565/
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Teaching and Mentoring in Data Analytics - Irina Brudaru
We talked about:
- Irina’s background
- Irina as a mentor
- Designing curriculum and program management at AI Guild
- Other things Irina taught at AI Guild
- Why Irina likes teaching
- Students’ reluctance to learn cloud
- Irina as a manager
- Cohort analysis in a nutshell
- How Irina started teaching formally
- Irina’s diversity project in the works
- How DataTalks.Club can attract more female students to the Zoomcamps
- How to get technical feedback at work
- Antipatterns and overrated/overhyped topics in data analytics
- Advice for young women who want to get into data science/engineering
- Finding Irina online
- Fundamentals for data analysts
- Suggestions for DataTalks.Club collaborations
- Conclusions
Links:
- LinkedIn Account: https://www.linkedin.com/in/irinabrudaru/
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Technical Writing and Data Journalism - Angelica Lo Duca
We talked about:
- Angelica’s background
- Angelica’s books
- Data journalism
- How Angelica got into data journalism
- The field of digital humanities and Angelica’s data journalism course
- Technical articles vs data journalism articles
- Transforming reports into data storytelling
- Are reports to stakeholders considered technical writing?
- Data visualization in articles
- Article length
- The process of writing an article
- Finding writing topics
- How Angelica got into writing a book (communication with publishers)
- The process for writing a book
- Brainstorming
- Reviews and revisions
- Conclusion
Links:
- Data Journalism examples (FENCED OUT): https://www.washingtonpost.com/graphics/world/border-barriers/europe-refugee-crisis-border-control/??noredirect=on
- Data Journalism examples (La tierra esclava): https://latierraesclava.eldiario.es/
- Small Medium publication aiming to be the Stack Overflow of Medium: https://medium.com/syntaxerrorpub
- Example of a self-published book on Data Visualization: https://www.amazon.com/Introduction-Data-Visualization-Storytelling-Scientist-ebook/dp/B07VYCR3Z6/ref=sr_1_4?crid=4JRJ48O7K8TK&keywords=joses+berengueres&qid=1668270728&sprefix=joses+beremguere%2Caps%2C273&sr=8-4
- My novels (in Italian) La bambina e il Clown: https://www.amazon.it/Bambina-Clown-Angelica-Lo-Duca/dp/1500984515/ref=sr_1_9?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=2KGK9GMN0FAHI&keywords=la+bambina+e+il+clown&qid=1668270769&sprefix=la+bambina+e+il+clown%2Caps%2C88&sr=8-9
- My novels (in Italian) Il Violinista: https://www.amazon.it/Violinista-1-Angelica-Lo-Duca/dp/1501009672/ref=sr_1_1?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=12KTF9EF5UKIG&keywords=il+violinista+lo+duca&qid=1668270791&sprefix=il+violinista+lo+duca%2Caps%2C81&sr=8-1
- Course on Data Journalism: https://www.coursera.org/learn/visualization-for-data-journalism
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

From Digital Marketing to Analytics Engineering - Nikola Maksimovic
We talked about:
- Nikola’s background
- Making the first steps towards a transition to BI and Analytics Engineering
- Learning the skills necessary to transition to Analytics Engineering
- The in-between period – from Marketing to Analytics Engineering
- Nikola’s current responsibilities
- Understanding what a Data Model is
- Tools needed to work as an Analytics Engineer
- The Analytics Engineering role over time
- The importance of DBT for Analytics Engineers
- Where can one learn about data modeling theory?
- Going from Ancient Greek and Latin to understanding Data (Just-In-Time Learning)
- The importance of having domain knowledge in analytics engineering
- Suggestion for those wishing to transition into analytics engineering
- The importance of having a mentor when transitioning
- Finding a mentor
- Helpful newsletters and blogs
- Finding Nikola online
Links:
- Nikola's LinkedIn account: https://www.linkedin.com/in/nikola-maksimovic-40188183/
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Product Owners in Data Science - Anna Hannemann
We talked about:
- About Anna and METRO
- Anna’s background
- The importance of a technical background for data product owners
- What are product owners?
- Product owners vs product managers
- Anna’s work on recommender systems at METRO
- Expanding the data team
- Types of algorithms used for recommender systems
- What kind of knowledge and skills data product owners need to have
- Problems and ideas should come from the business
- How Anna handles all her responsibilities
- The process for starting work on new domains
- Product portfolio management
- ProductTank and Anna’s role in it
- Anna’s resource recommendations
Links:
- Data Science for Business Book: https://www.amazon.de/-/en/Foster-Provost/dp/1449361323/ref=sr_1_1?keywords=data+science+for+business&qid=1666404807&qu=eyJxc2MiOiIxLjg3IiwicXNhIjoiMS41MiIsInFzcCI6IjEuNDYifQ%3D%3D&sr=8-1
- Article on Data Science Products: https://www.linkedin.com/pulse/way-create-data-science-products-lessons-learnt-anna-hannemann-phd/
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Building Data Science Practice - Andrey Shtylenko
We talked about:
- Audience Poll
- Andrey’s background
- What data science practice is
- Best DS practice in a traditional company vs IT-centric companies
- Getting started with building data science practice (finding out who you report to)
- Who the initiative comes from
- Finding out what kind of problems you will be solving (Centralized approach)
- Moving to a semi-decentralized approach
- Resources to learn about data science practice
- Pivoting from the role of a software engineer to data scientist
- The most impactful realization from data science practice
- Advice for individual growth
- Finding Andrey online
Links:
- Data Teams book: https://www.amazon.com/Data-Teams-Management-Successful-Data-Focused/dp/1484262271/
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html

Large-Scale Entity Resolution - Sonal Goyal
We talked about:
- Sonal’s background
- How the idea for Zingg came about
- What Zingg is
- The difference between entity resolution and identity resolution
- How duplicate detection relates to entity resolution
- How Sonal decided to start working on Zingg
- How Zingg works
- What Zingg runs on
- Switching from consultancy to working on a new open source solution
- Why Zingg is open source
- Open source licensing
- Working on Zingg initially vs now
- Zingg’s current and future team
- Sonal’s biggest current challenge
- Avoiding problems with entity/identity resolution through database design
- Identity resolution vs basic joins, data fusions, and fuzzy joins
- Deterministic matching vs probabilistic machine learning
- Identity and entity resolution applications for fraud detection
- Graph algorithms vs classic ML in entity resolution
- Identity resolution success stories
- What Sonal would do differently given the chance to start over with Zingg
- Advice for those seeking to realize their own solution to a data problem
- Reading suggestion from Sonal
- Conclusion
Links:
- Open-Source Spotlight demo "Zingg": https://www.youtube.com/watch?v=zOabyZxN9b0
- Creative Selection: Inside Apple's Design Process During the Golden Age of Steve Jobs book: https://www.amazon.com/Creative-Selection-Inside-Apples-Process/dp/1250194466
ML Zoomcamp: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html