My name is Chih-Hsu Lin. You can call me Jack.
I am a senior data scientist at C3.ai (NYSE: AI) in the San Francisco Bay Area.
I love data and have won 🏆top 3%–6% finishes in 3 Kaggle competitions (ranking in the top 2.9% of active users).
I received a 🎓PhD with a quantitative concentration.
During my 3-month internship at Illumina, a top-tier biotech company with more than 7,000 employees, I not only reduced manual curation time by 94% with a machine learning pipeline but also led an 11-person team to win 🏆 first place in a business case competition.
Designing Interpretable Neural Networks by Prior Knowledge to Predict Cancer Drug Targets. 2021. <Code>
📖Published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
❓Problem (classification): How to predict personalized drug targets? How to design a better neural network architecture?
🤔Why it's important: Better algorithms can accelerate therapeutic development and explain their predictions, earning people's trust.
📝What I did: I invented and implemented a new, interpretable neural network algorithm in PyTorch that converges 35% faster, uses 200× fewer parameters, and marginally outperforms a traditional neural network (AUROC > 0.88).
💡Findings: Leveraging high-quality prior knowledge makes it possible to build efficient, robust, and interpretable neural networks.
📂Data type: tabular data
🛠️Skills: Deep learning, PyTorch, statistical tests
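The core idea of a prior-knowledge-constrained network can be illustrated with a minimal sketch (not the published implementation): a linear layer whose connectivity is restricted by a binary mask, so only biologically plausible gene-to-pathway connections carry effective weights. The gene/pathway sizes and the random mask below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_pathways = 8, 3
# Hypothetical prior knowledge: mask[i, j] = 1 if gene i belongs to pathway j.
mask = rng.integers(0, 2, size=(n_genes, n_pathways)).astype(float)

# Dense weight matrix, but only the masked-in entries are effective.
weights = rng.normal(size=(n_genes, n_pathways))

def masked_forward(x, weights, mask):
    """Forward pass: masked-out connections contribute nothing to the output."""
    return x @ (weights * mask)

x = rng.normal(size=(4, n_genes))  # batch of 4 samples
out = masked_forward(x, weights, mask)

# The effective parameter count is the number of allowed connections,
# not the full dense count -- the source of the parameter reduction.
dense_params = weights.size
effective_params = int(mask.sum())
print(out.shape, dense_params, effective_params)
```

In a real PyTorch layer the same mask would be applied to the weight tensor in `forward`, which also zeroes the gradients of masked-out entries, keeping the architecture interpretable by construction.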
Analysis of 5,500 Data Science Jobs. 2020. <Blog>. 1,100+ views in a week.
📂Data type: tabular and text
❓Problem: What skills do data science jobs need?
🤔Why it's important: Understanding hiring trends helps job seekers target the right skills.
📝What I did: I extracted and cleaned 5,500 job descriptions from the internet, summarized the results, and generated interactive plots to investigate required skills, locations, and the differences between data analyst, data scientist, and data engineer roles.
💡Findings:
🛠️Skills: Plotly, Seaborn, web scraping
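The skill-counting step of such an analysis can be sketched as below, assuming the job descriptions have already been scraped; the skill list and sample postings here are hypothetical, not the blog's actual data.

```python
import re
from collections import Counter

# Hypothetical skill vocabulary to search for in job descriptions.
SKILLS = ["python", "sql", "spark", "tableau", "tensorflow"]

def count_skills(descriptions, skills=SKILLS):
    """Count how many job descriptions mention each skill (word-boundary match)."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            if re.search(rf"\b{re.escape(skill)}\b", lowered):
                counts[skill] += 1
    return counts

jobs = [
    "Data Scientist: Python, SQL, and Spark required.",
    "Data Analyst: SQL and Tableau; Python is a plus.",
    "ML Engineer: Python, TensorFlow, Spark.",
]
print(count_skills(jobs))
```

Counting per-posting mentions (rather than raw occurrences) avoids inflating a skill's rank when one posting repeats it many times.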
Kaggle Recursion Cellular Image Classification. 2019. 🏆 Top 3.0% (26/866) <Code & Solution>
❓Problem (multiclass classification): How to classify 1,108 treatments based on the images of 4 different cell types?
🤔Why it's important: Accurate and precise image classification can expedite the drug discovery process and improve the understanding of drug effects on cells.
📝What I did: I built a deep learning pipeline using state-of-the-art convolutional neural networks, achieving an accuracy of 0.97757.
💡Findings: Different cell types produce distinctly different images, so cell type-specific models are necessary.
📂Data type: 6-channel images
🛠️Skills: PyTorch, data augmentation, image processing, convolutional neural networks
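One preprocessing step common to 6-channel cell-image pipelines is per-channel standardization before feeding images to a CNN; a minimal sketch follows, with illustrative shapes and random data rather than the competition's actual images or statistics.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical batch: (N, channels, H, W) with 6 fluorescence channels.
batch = rng.uniform(0, 255, size=(4, 6, 64, 64))

def standardize_per_channel(images, eps=1e-8):
    """Zero-mean, unit-variance scaling computed independently per channel."""
    mean = images.mean(axis=(0, 2, 3), keepdims=True)
    std = images.std(axis=(0, 2, 3), keepdims=True)
    return (images - mean) / (std + eps)

normed = standardize_per_channel(batch)
print(normed.shape)  # (4, 6, 64, 64)
```

Normalizing each channel separately matters here because different fluorescence channels have very different intensity distributions, unlike standard 3-channel RGB photos.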
Accelerating Variant Triaging by Machine Learning. 2019. Internship project.
❓Problem (classification): How to predict the clinically relevant variants to automate the triaging process?
🤔Why it's important: Successful predictions can reduce turnaround time and provide timely information to facilitate clinical decisions.
📝What I did: I parsed JSON files and converted them to tabular data, then cleaned and merged internal data with external data. I developed a machine learning pipeline that reduced manual triaging time by 94%. I presented the results as a poster at an international annual conference (8,500 attendees, ~250 exhibiting companies).
💡Findings: Communication is key to tailoring the pipeline to colleagues' needs and existing frameworks.
📂Data type: JSON and tabular data + external data collection and cleaning
🛠️Skills: Scikit-learn
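The JSON-to-tabular step can be sketched as a recursive flattener that turns nested records into flat rows of dotted-key columns; the record layout below is hypothetical, not Illumina's internal schema.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single row of dotted-key columns."""
    row = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            row.update(flatten(value, new_key, sep=sep))
        else:
            row[new_key] = value
    return row

# Hypothetical variant record, illustrative values only.
raw = '{"variant": {"gene": "BRCA1", "position": 43045677}, "quality": 99}'
row = flatten(json.loads(raw))
print(row)  # {'variant.gene': 'BRCA1', 'variant.position': 43045677, 'quality': 99}
```

A list of such rows drops straight into a tabular frame, after which standard Scikit-learn tooling applies.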
Multimodal Network Diffusion Predicts Future Disease-Gene-Chemical Associations. 2018. <Code>
🎓 PhD thesis published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
❓Problem (edge prediction): How to predict future disease-gene-drug interactions based on existing network data?
🤔Why it's important: Predicting interactions between diseases, genes and drugs can accelerate the drug development process.
📝What I did: I merged and cleaned data from 3 databases and generated a network of 215,000+ drug-gene-disease associations. I implemented and validated graph-based kernel machine learning methods in Python to predict associations with >90% precision.
💡Findings: Adding more data would improve performance only if the method is good enough.
📂Data type: graph/network
🛠️Skills: graph kernel machine learning algorithms (self-implemented), graph building and analysis
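The flavor of network diffusion can be shown with a minimal random-walk-with-restart sketch on a toy 4-node graph; this is an illustration of the general technique, not the paper's 215,000-association network or its exact kernels.

```python
import numpy as np

# Toy undirected association network (4 nodes).
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Diffuse a seed vector over the column-normalized adjacency matrix."""
    transition = adj / adj.sum(axis=0, keepdims=True)
    p = seed / seed.sum()
    p0 = p.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Seed the walk at node 0; resulting scores rank nodes by proximity to the seed.
scores = random_walk_with_restart(adj, seed=np.array([1.0, 0, 0, 0]))
print(scores)
```

Ranking candidate nodes by these diffusion scores is how unseen disease-gene-drug associations get prioritized: edges to high-scoring nodes are predicted as likely future associations.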
Kaggle Mercedes-Benz Greener Manufacturing. 2017. 🏆 Top 4.9% (188/3,835)
❓Problem (regression): How to predict the time for the car to pass the manufacturing test based on anonymized car features?
🤔Why it's important: Successful predictions can lead to faster testing and lower carbon dioxide emissions without compromising Daimler's standards.
📝What I did: I applied dimensionality reduction methods to compress 386 anonymized variables. I developed a machine learning pipeline using gradient boosting and ensemble methods to achieve an R² of 0.55227, only 0.00323 below first place.
💡Findings: Overfitting the public leaderboard can lead to a poor ranking on the final leaderboard.
📂Data type: tabular data (anonymized features)
🛠️Skills: Scikit-learn, dimensionality reduction, stacking, gradient boosting, XGBoost
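The dimensionality-reduction step can be sketched as PCA via SVD, compressing many anonymized features into a few components before fitting a regressor; the data below is random and the component count is an arbitrary choice, not the competition setup.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 386))  # hypothetical: 200 cars x 386 anonymized features

def pca_transform(X, n_components):
    """Project mean-centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X_reduced = pca_transform(X, n_components=12)
print(X_reduced.shape)  # (200, 12)
```

In practice `sklearn.decomposition.PCA` does the same projection; the compressed components then feed the gradient-boosting models in the pipeline.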
Kaggle Sberbank Russian Housing Market. 2017. 🏆 Top 6.1% (201/3,274)
❓Problem (regression): How to predict Russian house prices based on house features and location amid the country’s volatile economy?
🤔Why it's important: Successful predictions can provide more certainty to the market in an uncertain economy.
📝What I did: I built a machine learning pipeline to predict house prices using gradient boosting, artificial neural networks, and ensemble methods.
💡Findings: Filtering outliers can improve predictions.
📂Data type: tabular data + external data collection and cleaning
🛠️Skills: Scikit-learn, gradient boosting, XGBoost, fully connected neural network, Keras, ensemble
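The final ensembling step in a pipeline like this can be sketched as a weighted blend of per-model predictions; the prediction values and weights below are hypothetical, not the competition submission.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of per-model prediction arrays."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize weights to sum to 1
    return np.tensordot(weights, np.asarray(predictions), axes=1)

# Hypothetical price predictions (in rubles) from two models.
xgb_pred = np.array([5.2e6, 7.1e6, 6.0e6])
nn_pred = np.array([5.0e6, 7.5e6, 5.8e6])

blended = blend([xgb_pred, nn_pred], weights=[0.7, 0.3])
print(blended)  # each value lies between the two models' predictions
```

Weighting is typically tuned on a validation split; blending models with different error profiles (trees vs. neural networks) is what gives the ensemble its edge over either model alone.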