Resume

This is a summary of my resume. A more detailed version is available for download via the PDF icon at the top right of the page.

Basics

Name: Sam Voisin
Label: Machine Learning Engineer and Software Developer
Email: samvoisin@protonmail.com
URL: https://www.samvoisin.com/
Summary: Seasoned Machine Learning Engineer and Software Developer with a strong background in creating robust data pipelines, scalable architectures, and efficient algorithms. Proven expertise in translating advanced data science and machine learning techniques into production-ready software solutions.

Work

  • 2023.10 - Present
    Senior Data Scientist
    Tradewind Data Science, Chicago, IL
    Implement modern ML algorithms (Gaussian Process, XGBoost, Random Forest) and DNN architectures (RNNs, LSTMs, Transformers) for time series forecasting. Design ML system architectures based on contemporary design patterns. Construct ETL pipelines using Python, SQL, and PySpark on large-scale data clusters. Established company-wide DevOps program.
  • 2022.03 - 2023.10
    Data Scientist
    Infinia ML, Durham, NC
    Developed flexible document processing pipelines for information extraction and classification. Designed and implemented large language model (LLM) infrastructure.
  • 2020.06 - 2022.03
    Data Scientist
    Geometric Data Analytics, Inc, Durham, NC
    Research and development for clients including DARPA, NRL, and AFRL. Developed novel algorithms for oceanographic research, remote sensing, and pattern-of-life modeling. See publications.
  • 2019.04 - 2019.09
    Research Assistant
    Duke University, Durham, NC
    Crafted research goals, planned and executed experiments, designed data pipelines, and optimized MCMC samplers for Bayesian hierarchical regression models.
  • 2015.01 - 2018.06
    Analyst
    Ally Financial Services, Charlotte, NC
    Analyzed financial market data and business metrics to mitigate business risk. Automated data gathering and processing. Acted as program lead and mentor for department internship program.

Education

  • 2018.08 - 2020.05
    M.S.
    Duke University, Trinity College of Arts and Sciences
    Statistical Science
  • 2010.08 - 2014.12
    B.S.
    Clemson University, College of Business and Behavioral Science
    Financial Management

Awards

  • 2020.02.11
    Data Health Science Conference 2020 Case Study Competition
    UofSC Big Data Health Science Center
    Earned first place. The objective of the case study was to develop a platform to aid first responders in diagnosing chemical exposure. We developed an interpretable nearest-neighbors model to transparently diagnose exposure to a wide variety of chemical agents.

Publications

  • 2023.03.01
    Topological Simplification of Signals for Inference and Approximate Reconstruction
    2023 IEEE Aerospace Conference
    As Internet of Things (IoT) devices become both cheaper and more powerful, researchers are increasingly finding solutions to their scientific curiosities both financially and computationally feasible. When operating with restricted power or communications budgets, however, devices can only send highly compressed data. Such circumstances are common for devices placed away from electric grids that can only communicate via satellite, a situation particularly plausible for environmental sensor networks. These restrictions can be further complicated by potential variability in the communications budget, for example a solar-powered device needing to expend less energy when transmitting data on a cloudy day. We propose a novel, topology-based, lossy compression method well-equipped for these restrictive yet variable circumstances. This technique, Topological Signal Compression, allows sending compressed signals that utilize the entirety of a variable communications budget. To demonstrate our algorithm's capabilities, we perform entropy calculations as well as a classification exercise on increasingly topologically simplified signals from the Free Spoken Digit Dataset and explore the stability of the resulting performance against common baselines.
  • 2022.09.27
    Topological Feature Tracking for Submesoscale Eddies
    Geophysical Research Letters
    Current state-of-the-art procedures for studying modeled submesoscale oceanographic features have made a strong assumption of independence between features identified at different times. Therefore, all submesoscale eddies identified in a time series were studied in aggregate. Statistics from these methods are illuminating but oversample identified features and cannot determine the lifetime evolution of the transient submesoscale processes. To this end, the authors apply the Topological Feature Tracking (TFT) algorithm to the problem of identifying and tracking submesoscale eddies over time. TFT identifies critical points on a set of time-ordered scalar fields and associates those points between consecutive timesteps. The procedure yields tracklets which represent spatio-temporal displacement of eddies. In this way we study the time-dependent behavior of submesoscale eddies, which are generated by a 1-km resolution submesoscale-permitting model. We summarize the submesoscale eddy data set produced by TFT, which yields unique, time-varying statistics.
  • 2021.12.01
    [Whitepaper] Automation is All You Need: Faster Earth Systems Models with AI/ML
    US Department of Energy
    Tropical cyclones can induce extreme water cycle events through dramatic precipitation and storm surge. More reliable models of intensity will translate into better prediction of the impact of extreme events in large scale Earth systems simulations. We demonstrate and describe AI/ML methodologies for rapid assimilation of new, in situ data products.

Skills

Machine Learning
  • Deep learning
  • Ensemble methods
  • Large Language Models (LLM)
  • Retrieval Augmented Generation (RAG)
  • Natural Language Processing
  • Computer Vision
  • PyTorch
  • Scikit-learn

Probability and Statistics
  • Bayesian inference
  • Statistical analysis
  • Predictive modeling
  • Time series analysis
  • Hypothesis testing
  • Nonparametric methods

Software Development
  • Object-oriented programming
  • Functional programming
  • RESTful APIs
  • Design patterns
  • Containerization
  • Agile methodology

Cloud Infrastructure
  • AWS
  • EC2
  • S3
  • DynamoDB
  • Lambda

Data Management
  • PostgreSQL
  • NoSQL
  • PySpark
  • Neo4j
  • ChromaDB

Programming
  • Python
  • SQL
  • R

DevOps/ML Ops
  • Continuous Integration
  • Continuous Deployment
  • MLflow
  • Docker
  • GitHub Actions
  • GitLab CI/CD