
Hello! Welcome to Jack's page

My name is Chih-Hsu Lin. You can call me Jack.
I am a senior data scientist at C3.ai (NYSE: AI) in the San Francisco Bay Area.
I love data and have placed 🏆 in the top 3%–6% in 3 Kaggle competitions (top 2.9% of active users).
I received a 🎓PhD with a quantitative concentration.
During my 3-month internship at Illumina, a top-tier biotech company with >7,000 employees, I not only reduced manual curation time by 94% with a machine learning pipeline but also led an 11-person team to win 🏆 1st place in a business case competition.

Projects

  1. Designing Interpretable Neural Networks by Prior Knowledge to Predict Cancer Drug Targets. 2021. <Code>
    📖Published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
    ❓Problem (classification): How to predict personalized drug targets? How to design better neural network architecture?
    🤔Why it's important: Better algorithms can accelerate therapeutic development, and interpretable predictions earn people's trust.
    📝What I did: I invented and implemented a new, interpretable neural network algorithm in PyTorch that converges 35% faster, uses 200× fewer parameters, and marginally outperforms (AUROC > 0.88) a traditional neural network.
    💡Findings: Leveraging high-quality prior knowledge can build efficient, robust and interpretable neural networks.
    📂Data type: tabular data
    🛠️Skills: Deep learning, PyTorch, statistical tests
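The prior-knowledge idea can be sketched as a masked linear layer (hypothetical code, not the published architecture; `MaskedLinear` and the toy mask are illustrative): a fixed 0/1 mask restricts which connections exist, which is one way to get far fewer effective parameters than a dense layer.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose connectivity is fixed by a prior-knowledge mask."""

    def __init__(self, in_features, out_features, mask):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # 0/1 buffer: which input->output connections are allowed to exist.
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Disallowed weights are zeroed on every forward pass.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Toy prior: 4 genes feed 2 pathways, each pathway sees only 2 genes,
# so the layer has 4 effective weights instead of a dense layer's 8.
mask = torch.tensor([[1., 1., 0., 0.],
                     [0., 0., 1., 1.]])
layer = MaskedLinear(4, 2, mask)
out = layer(torch.ones(1, 4))
print(out.shape)  # torch.Size([1, 2])
```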

  2. Analysis of 5,500 Data Science Jobs. 2020. <Blog>. 1,100+ views in a week.
    ❓Problem: What skills do data science jobs need?
    🤔Why it's important: Understanding hiring trends helps job seekers target the right skills.
    📝What I did: I extracted and cleaned 5,500 job descriptions from the internet, then summarized the results and generated interactive plots to investigate skills, locations, and the differences between data analysts, scientists, and engineers.
    💡Findings:

    📂Data type: tabular and text
    🛠️Skills: Plotly, Seaborn, web scraping
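A minimal sketch of the skill-extraction step (the job strings and skill list here are made up; the real project scraped 5,500 postings): count how many postings mention each skill.

```python
import re
from collections import Counter

# Invented mini-corpus standing in for the 5,500 scraped descriptions.
jobs = [
    "Requires Python, SQL and Tableau experience",
    "Data engineer: Python, Spark, SQL pipelines",
    "Analyst role using Excel and SQL",
]
skills = ["python", "sql", "spark", "tableau", "excel"]

counts = Counter()
for text in jobs:
    # Tokenize once per posting so a skill counts at most once per job.
    tokens = set(re.findall(r"[a-z+#]+", text.lower()))
    counts.update(s for s in skills if s in tokens)

print(counts["sql"])  # 3: SQL appears in every posting
```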

  3. Kaggle Recursion Cellular Image Classification. 2019. 🏆 Top 3.0% (26/866) <Code & Solution>
    ❓Problem (multiclass classification): How to classify 1,108 treatments based on the images of 4 different cell types?
    🤔Why it's important: Accurate and precise image classification can expedite the drug discovery process and improve the understanding of drug effects on cells.
    📝What I did: I built a deep learning pipeline using state-of-the-art convolutional neural networks to achieve an accuracy of 0.97757.
    💡Findings: Different cell types produce quite distinct images, so cell type-specific models are necessary.
    📂Data type: 6-channel images
    🛠️Skills: PyTorch, data augmentation, image processing, convolutional neural networks
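One practical detail such pipelines must handle is the 6-channel input, since off-the-shelf CNNs expect 3 RGB channels. A toy sketch (not the competition model) of a network whose first convolution accepts all 6 fluorescence channels:

```python
import torch
import torch.nn as nn

# Minimal illustrative CNN: first conv takes 6 channels instead of RGB's 3,
# and the head emits one logit per treatment class (1,108 classes).
model = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pooling over spatial dims
    nn.Flatten(),
    nn.Linear(16, 1108),
)

logits = model(torch.randn(2, 6, 64, 64))
print(logits.shape)  # torch.Size([2, 1108])
```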

  4. Accelerating Variant Triaging by Machine Learning. 2019. Internship project.
    ❓Problem (classification): How to predict the clinically relevant variants to automate the triaging process?
    🤔Why it's important: Successful predictions can reduce turnaround time and provide timely information to facilitate clinical decisions.
    📝What I did: I parsed JSON files and converted them to tabular data. I cleaned and merged internal data with external data. I developed a machine learning pipeline that reduced manual curation time by 94%. I presented the results as a poster at an international annual conference (8,500 attendees, ~250 exhibiting companies).
    💡Findings: Communication is key to customizing the pipeline to colleagues' needs and existing frameworks.
    📂Data type: JSON and tabular data + external data collection and cleaning
    🛠️Skills: Scikit-learn
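The JSON-to-tabular step can be sketched like this (the records, feature names, and labels are invented stand-ins for the internal variant data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Invented stand-ins for the internal variant JSON records.
records = [
    {"variant": {"af": 0.01, "depth": 40}, "label": 1},
    {"variant": {"af": 0.30, "depth": 12}, "label": 0},
]
# json_normalize flattens nested JSON into dotted tabular columns.
df = pd.json_normalize(records)
X = df[["variant.af", "variant.depth"]]
y = df["label"]

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # real records have gaps
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
]).fit(X, y)
preds = clf.predict(X)
```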

  5. Multimodal Network Diffusion Predicts Future Disease-Gene-Chemical Associations. 2018. <Code>
    🎓 PhD thesis published in Bioinformatics (ranking in math & computational biology: 🏆3/59, top 5.1%)
    ❓Problem (edge prediction): How to predict future disease-gene-drug interactions based on existing network data?
    🤔Why it's important: Predicting interactions between diseases, genes and drugs can accelerate the drug development process.
    📝What I did: I merged and cleaned data from 3 databases and generated a network of 215,000+ drug-gene-disease associations. I implemented and validated graph-based kernel machine learning methods in Python to predict associations with >90% precision.
    💡Findings: Adding more data would improve performance only if the method is good enough.
    📂Data type: graph/network
    🛠️Skills: graph kernel machine learning algorithms (self-implemented), graph building and analysis
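The diffusion idea can be illustrated with a random walk with restart, a standard network-propagation scheme (this toy graph and function are illustrative, not the published method): scores decay with distance from the seed node and are used to rank candidate association partners.

```python
import numpy as np

def rwr_scores(A, seed, restart=0.5, tol=1e-8, max_iter=1000):
    """Random walk with restart on adjacency matrix A.

    Returns steady-state visiting probabilities, which rank candidate
    association partners for the seed node(s).
    """
    W = A / A.sum(axis=0, keepdims=True)  # column-normalized transitions
    p0 = seed / seed.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# Toy 4-node path graph 0-1-2-3, seeded at node 0: scores should
# decrease monotonically with distance from the seed.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = rwr_scores(A, np.array([1., 0., 0., 0.]))
```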

  6. Kaggle Mercedes-Benz Greener Manufacturing. 2017. 🏆 Top 4.9% (188/3,835)
    ❓Problem (regression): How to predict the time for the car to pass the manufacturing test based on anonymized car features?
    🤔Why it's important: Successful predictions can lead to speedier testing and lower carbon dioxide emissions without lowering Daimler’s standards.
    📝What I did: I applied dimensionality reduction methods to compress 386 anonymized variables. I developed a machine learning pipeline using gradient boosting and ensemble methods to achieve an R² of 0.55227, only 0.00323 behind first place.
    💡Findings: Fitting the public leaderboard may lead to bad ranking in final leaderboard.
    📂Data type: tabular data (anonymized features)
    🛠️Skills: Scikit-learn, dimension reduction, stacking, gradient boosting, XGBoost
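A compress-then-boost pipeline can be sketched as follows (synthetic data with one dominant-variance feature; the real inputs were 386 anonymized columns):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 0] *= 5.0                      # one dominant-variance feature
y = X[:, 0] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([
    ("pca", PCA(n_components=10)),  # compress many columns into few components
    ("gbr", GradientBoostingRegressor(random_state=0)),
]).fit(X, y)
r2 = pipe.score(X, y)               # training R²; use cross-validation in practice
```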

  7. Kaggle Sberbank Russian Housing Market. 2017. 🏆 Top 6.1% (201/3,274)
    ❓Problem (regression): How to predict Russian house prices based on house features and location under the country’s volatile economy?
    🤔Why it's important: Successful predictions can provide more certainty to the market in an uncertain economy.
    📝What I did: I built a machine learning pipeline to predict house prices using gradient boosting, artificial neural networks, and ensemble methods.
    💡Findings: Filtering outliers can improve predictions.
    📂Data type: tabular data + external data collection and cleaning
    🛠️Skills: Scikit-learn, gradient boosting, XGBoost, fully connected neural network, Keras, ensemble
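The ensembling step can be sketched as a simple prediction average (synthetic data; `Ridge` stands in for the neural network member of the ensemble):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
lin = Ridge().fit(X_train, y_train)  # stand-in for the neural network member
# Equal-weight ensemble: average the two models' validation predictions.
blend = 0.5 * gbr.predict(X_val) + 0.5 * lin.predict(X_val)
```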

📈 GitHub Stats

Jack's GitHub Stats
