Resume
This is a summary of my resume. A more detailed version is available for download via the PDF icon in the top right of the page.
Basics
Name | Sam Voisin |
Label | Machine Learning Engineer and Software Developer |
Email | samvoisin@protonmail.com |
Url | https://www.samvoisin.com/ |
Summary | Seasoned Machine Learning Engineer and Software Developer with a strong background in creating robust data pipelines, scalable architectures, and efficient algorithms. Proven expertise in translating advanced data science and machine learning techniques into production-ready software solutions. |
Work
-
2023.10 - Present
Senior Data Scientist
Tradewind Data Science, Chicago, IL
Implement modern ML algorithms (Gaussian Process, XGBoost, Random Forest) and DNN architectures (RNNs, LSTMs, Transformers) for time series forecasting. Design ML system architectures based on contemporary design patterns. Construct ETL pipelines using Python, SQL, and PySpark on large-scale data clusters. Established company-wide DevOps program.
-
2022.03 - 2023.10
Data Scientist
Infinia ML, Durham, NC
Developed flexible document processing pipelines for information extraction and classification. Designed and implemented large language model (LLM) infrastructure.
-
2020.06 - 2022.03
Data Scientist
Geometric Data Analytics, Inc, Durham, NC
Research and development for clients including DARPA, NRL, and AFRL. Developed novel algorithms for oceanographic research, remote sensing, and pattern-of-life modeling. See publications.
-
2019.04 - 2019.09
Research Assistant
Duke University, Durham, NC
Crafted research goals, planned and executed experiments, designed data pipelines, and optimized MCMC samplers for Bayesian hierarchical regression models.
-
2015.01 - 2018.06
Analyst
Ally Financial Services, Charlotte, NC
Analyzed financial market data and business metrics to mitigate business risk. Automated data gathering and processing. Acted as program lead and mentor for department internship program.
Education
Awards
-
2020.02.11
Data Health Science Conference 2020 Case Study Competition
UofSC Big Data Health Science Center
Earned first place. The objective of the case study was to develop a platform to aid first responders in diagnosing chemical exposure. We developed an interpretable nearest-neighbors model to transparently diagnose exposure to a wide variety of chemical agents.
Publications
-
2023.03.01 Topological Simplification of Signals for Inference and Approximate Reconstruction
2023 IEEE Aerospace Conference
As Internet of Things (IoT) devices become both cheaper and more powerful, researchers are increasingly finding solutions to their scientific curiosities both financially and computationally feasible. When operating with restricted power or communications budgets, however, devices can only send highly compressed data. Such circumstances are common for devices placed away from electric grids that can only communicate via satellite, a situation particularly plausible for environmental sensor networks. These restrictions can be further complicated by potential variability in the communications budget, for example a solar-powered device needing to expend less energy when transmitting data on a cloudy day. We propose a novel, topology-based, lossy compression method well-equipped for these restrictive yet variable circumstances. This technique, Topological Signal Compression, allows sending compressed signals that utilize the entirety of a variable communications budget. To demonstrate our algorithm's capabilities, we perform entropy calculations as well as a classification exercise on increasingly topologically simplified signals from the Free Spoken Digit Dataset and explore the stability of the resulting performance against common baselines.
-
2022.09.27 Topological Feature Tracking for Submesoscale Eddies
Geophysical Research Letters
Current state-of-the-art procedures for studying modeled submesoscale oceanographic features have made a strong assumption of independence between features identified at different times. Therefore, all submesoscale eddies identified in a time series were studied in aggregate. Statistics from these methods are illuminating but oversample identified features and cannot determine the lifetime evolution of the transient submesoscale processes. To this end, the authors apply the Topological Feature Tracking (TFT) algorithm to the problem of identifying and tracking submesoscale eddies over time. TFT identifies critical points on a set of time-ordered scalar fields and associates those points between consecutive timesteps. The procedure yields tracklets which represent spatio-temporal displacement of eddies. In this way we study the time-dependent behavior of submesoscale eddies, which are generated by a 1-km resolution submesoscale-permitting model. We summarize the submesoscale eddy data set produced by TFT, which yields unique, time-varying statistics.
-
2021.12.01 [Whitepaper] Automation is All You Need: Faster Earth Systems Models with AI/ML
US Department of Energy
Tropical cyclones can induce extreme water cycle events through dramatic precipitation and storm surge. More reliable models of intensity will translate into better prediction of the impact of extreme events in large scale Earth systems simulations. We demonstrate and describe AI/ML methodologies for rapid assimilation of new, in situ data products.
Skills
Machine Learning | |
Deep learning | |
Ensemble methods | |
Large Language Models (LLM) | |
Retrieval Augmented Generation (RAG) | |
Natural Language Processing | |
Computer Vision | |
PyTorch | |
Scikit-learn |
Probability and Statistics | |
Bayesian inference | |
Statistical analysis | |
Predictive modeling | |
Time series analysis | |
Hypothesis testing | |
Nonparametric methods |
Software Development | |
Object-oriented programming | |
Functional programming | |
RESTful APIs | |
Design patterns | |
Containerization | |
Agile methodology |
Cloud Infrastructure | |
AWS | |
EC2 | |
S3 | |
DynamoDB | |
Lambda |
Data Management | |
PostgreSQL | |
NoSQL | |
PySpark | |
Neo4j | |
ChromaDB |
Programming | |
Python | |
SQL | |
R |
DevOps/MLOps | |
Continuous Integration | |
Continuous Deployment | |
MLflow | |
Docker | |
GitHub Actions | |
GitLab CI/CD |