I am an AI & ML Engineer with a keen interest in Programming, Statistics, Machine Learning and Deep Learning. My academic background is a mix of Engineering and Data Science.
Recently, I have been working with LLMs @JDoodle, running them efficiently and getting them to generate code that passes HumanEval questions :)
I enjoy competing in data science competitions. I have found them a great way to sharpen Data Science & ML skills and to stay up to date.
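HumanEval results are usually reported with the pass@k metric. The sketch below implements the standard unbiased estimator from the HumanEval paper; it is illustrative only, not the evaluation code used at JDoodle.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper):
    n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0  # fewer failures than the budget: at least one pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=20, c=5, k=1), 3))  # 0.25
```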
Flockserve - LLM Inference Endpoint
Purpose: Most production LLM workloads run on closed-source solutions from cloud providers, such as Google's Vertex AI or Azure ML. The purpose was to develop an open-source, cloud-agnostic, cost-efficient and flexible alternative to those services.
Challenge: Handling dynamic request rates and high volumes of traffic.
Key Strategies Applied: Asynchronous processing of requests was key to handling high volumes. Developing a custom metric, "Queue Length Running Mean", as the basis for up/down scaling decisions worked effectively. Using SkyPilot for node provisioning was very helpful in achieving a cloud-agnostic solution.
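A minimal sketch of how a "Queue Length Running Mean" scaling signal could work; the window size and thresholds are illustrative assumptions, not Flockserve's actual values.

```python
from collections import deque

class QueueLengthRunningMean:
    """Keep a running mean of recent queue lengths and scale up/down
    when it crosses thresholds (window/thresholds are assumed values)."""
    def __init__(self, window=30, up=8.0, down=2.0):
        self.samples = deque(maxlen=window)
        self.up, self.down = up, down

    def observe(self, queue_length):
        self.samples.append(queue_length)

    def decision(self):
        mean = sum(self.samples) / len(self.samples)
        if mean > self.up:
            return "scale_up"      # sustained backlog: provision another node
        if mean < self.down:
            return "scale_down"    # sustained idleness: release a node
        return "hold"

scaler = QueueLengthRunningMean(window=5)
for q in [12, 15, 11, 14, 13]:
    scaler.observe(q)
print(scaler.decision())  # scale_up (running mean 13.0 > 8.0)
```

Averaging over a window rather than reacting to instantaneous queue length damps oscillation between scale-up and scale-down.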
M5 - Walmart Sales Forecasting Challenge
Purpose: Forecasting sales of 30,000 items in 12 Walmart stores for 28 days, using the last 6 years' sales data together with calendar and product-related information.
Challenge: Intermittent demand for products was the main challenge in this dataset. Also, sales at the single-product level were highly variable.
Key Strategies Applied: 200 features were generated, mainly statistics on the sales data and interactions between sales and the calendar. Clustering based on intermittent-demand-related features was applied to group the products and train in-group products together. Finally, 28×3 gradient boosting machines were trained to forecast the different horizons from 1 to 28 days.
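The clustering step relies on descriptors of how sporadic each series is. Below is a hedged sketch using two standard intermittent-demand statistics, ADI and CV²; whether these exact features were used in the solution is an assumption.

```python
import statistics

def intermittency_features(series):
    """Standard intermittent-demand descriptors:
    ADI = average interval between successive nonzero sales,
    CV2 = squared coefficient of variation of nonzero sale sizes."""
    nonzero = [(i, v) for i, v in enumerate(series) if v > 0]
    idx = [i for i, _ in nonzero]
    sizes = [v for _, v in nonzero]
    adi = (idx[-1] - idx[0]) / (len(idx) - 1) if len(idx) > 1 else float("inf")
    cv2 = (statistics.pstdev(sizes) / statistics.mean(sizes)) ** 2
    return adi, cv2

# Toy daily sales: mostly zeros, occasional demand spikes.
adi, cv2 = intermittency_features([0, 0, 3, 0, 0, 0, 2, 0, 5, 0])
print(round(adi, 2), round(cv2, 2))  # 3.0 0.14
```

Series with similar (ADI, CV²) profiles can then be grouped and trained together, which is the in-group training idea described above.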
Product Matching
Purpose: Using an e-commerce platform's (Shopee's) product listing images and the textual descriptions written by listing owners, identify identical products listed by different vendors.
Data: 35,000 listing images and descriptions in English, Indonesian, or both.
Strategy: Creating a combined embedding space of image and text, then quantifying the similarity of listings by cosine distance.
Model Architecture:EfficientNet-b3 & BERT + FC + ArcFace
Key Properties: Micro-averaged F1-score of ~0.73 on unseen test data.
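The matching step reduces to similarity search in the joint embedding space. A toy sketch with stand-in embeddings (the real ones come from the EfficientNet-b3/BERT + ArcFace model) and an assumed similarity threshold:

```python
import numpy as np

# Stand-in joint image+text embeddings for three listings; listings 0 and 1
# are near-duplicates, listing 2 is a different product.
emb = np.array([[1.0, 0.0, 0.2],
                [0.9, 0.1, 0.3],
                [0.0, 1.0, 0.0]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise rows

sims = emb @ emb.T          # cosine similarity matrix (dot of unit vectors)
threshold = 0.9             # assumed matching threshold, tuned on validation in practice
matches = [np.where(row >= threshold)[0].tolist() for row in sims]
print(matches)              # [[0, 1], [0, 1], [2]]
```

With ArcFace training, same-product embeddings are pushed to high cosine similarity, so a simple threshold on this matrix yields the product groups.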
Scraping car listings and images
Purpose: Scraping, transforming and storing car images together with other relevant information.
Scope: 1.5 million images
Storage: Amazon Web Services (AWS) – S3
Key Properties: Scraped responsibly by obeying robots.txt and limiting to 1 API request per second.
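The politeness policy can be sketched with Python's standard library; the robots.txt content and URLs below are placeholders, not the actual target site's.

```python
import time
import urllib.robotparser

# Placeholder robots.txt; a real crawler would fetch the site's own file.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 1
Disallow: /private/
""".splitlines())

def fetch_politely(urls, agent="car-scraper", delay=1.0):
    """Yield only robots.txt-permitted URLs, at most one request per `delay` seconds."""
    for url in urls:
        if rp.can_fetch(agent, url):
            yield url          # real code would download here and upload to S3
            time.sleep(delay)  # rate limit: 1 request per second by default

urls = ["https://example.com/listings/1", "https://example.com/private/admin"]
print(list(fetch_politely(urls, delay=0)))  # ['https://example.com/listings/1']
```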
Predicting m-RNA Folding Probabilities
Purpose: Given m-RNA molecule base pair sequences and properties of each base pair, predicting the folding probability of each base pair.
Data: Sequential data, as the order of the m-RNA molecule is critical to understanding its behavior; transformers and recurrent neural networks are therefore useful.
Model Architecture: Embedding + LSTM with 3 hidden layers + linear output layer
Key Properties: GPU training, Data augmentation, Weighted training by measurement errors, use of experiment tracking tools.
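A minimal PyTorch sketch of the described architecture (Embedding + 3-layer LSTM + linear output); vocabulary size, hidden sizes and sequence length are illustrative assumptions, not the trained model's actual hyper-parameters.

```python
import torch
import torch.nn as nn

class FoldingModel(nn.Module):
    """Per-base regression: embed each base, run a stacked LSTM over the
    sequence, and predict one value per position with a linear head."""
    def __init__(self, vocab_size=5, emb_dim=64, hidden=128, n_layers=3, n_targets=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=n_layers,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_targets)  # 2x for bidirectional

    def forward(self, seq):               # seq: (batch, seq_len) base indices
        x = self.embed(seq)
        out, _ = self.lstm(x)             # (batch, seq_len, 2*hidden)
        return self.head(out)             # (batch, seq_len, n_targets)

model = FoldingModel()
demo = torch.randint(0, 5, (2, 107))      # 2 toy sequences of 107 bases
print(tuple(model(demo).shape))           # (2, 107, 1)
```

Training on GPU would move `model` and batches with `.to("cuda")`; the weighted-by-measurement-error training mentioned above corresponds to a per-position weighted loss.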
Predicting the Critical Temperature of Superconductors
Purpose: Understanding the affecting factors and predicting the critical temperature of superconductors.
Data: 20 thousand rows and 81 columns of data representing the chemical properties of superconductors.
Model Development: Regression models were developed using stepwise feature selection and L1 & L2 parameter shrinkage. XGBoost hyper-parameter tuning with grid search was also performed, and the resulting XGBoost model was compared with the regression models.
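L2 shrinkage has a convenient closed form that illustrates the effect used above: a larger penalty pulls coefficients toward zero. A self-contained sketch on synthetic data (not the superconductor dataset):

```python
import numpy as np

# Synthetic regression problem with known coefficients [3, 0, -2, 0.5].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    """Closed-form ridge: beta = (X^T X + lam*I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(X, y, lam=0.0)      # ordinary least squares
b_l2 = ridge(X, y, lam=100.0)     # heavily shrunk coefficients
print(np.round(b_ols, 2))         # ≈ [ 3.  0. -2.  0.5]
print(np.linalg.norm(b_l2) < np.linalg.norm(b_ols))  # True: shrinkage reduces norm
```

L1 shrinkage (lasso) has no such closed form but additionally zeroes out weak coefficients, which complements the stepwise feature selection mentioned above.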
Processing, Visualising and Storing real-time fire data
Purpose: Create 3 streams of temperature data, then process, join and pipeline them to feed a dynamic visualisation showing the most recent highest temperature values and a static visualisation showing fire locations on a map.
Data: Historic surface temperature data coming from different NASA satellites.
System Architecture: 3 Kafka event producers were created to simulate real-time data with variable broadcasting frequencies. The data is processed in parallel by a Spark Streaming application, and the results are visualised and saved into MongoDB.
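One simulated producer can be sketched as below; the broker interaction is elided, and the topic and field names are assumptions, not the project's actual schema.

```python
import json
import random
import time

def make_event(stream_id, temperature):
    """Serialise one temperature reading as a JSON event (assumed schema)."""
    return json.dumps({
        "stream": stream_id,
        "temperature": temperature,
        "ts": time.time(),
    })

def simulate_stream(stream_id, n_events, period_s):
    """Emit events at this stream's own broadcasting frequency."""
    for _ in range(n_events):
        event = make_event(stream_id, random.uniform(20.0, 60.0))
        # A real producer would do: producer.send("climate", event.encode())
        yield event
        # time.sleep(period_s)  # per-stream variable frequency (disabled in demo)

events = list(simulate_stream("satellite-A", 3, period_s=0.5))
print(len(events))  # 3
```

Running three such generators with different `period_s` values reproduces the variable broadcasting frequencies the Spark Streaming job has to join.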
Comparison of Online Movie Platforms
Purpose: Create an interactive data visualization tool to compare for-profit (IMDb) and non-profit (TMDb) movie platforms' user ratings.
Data: 25 million rows of user ratings from both platforms. Also detailed information about each movie.
Key Achievements: Used a movie metadata API to render 100K movie posters instantly on user interaction.
Implementation of Machine Learning Algorithms From Scratch
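Since the repository contents aren't listed here, below is a representative example of the "from scratch" style: simple linear regression fitted by batch gradient descent, using no ML libraries.

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Fit y = w*x + b by minimising mean squared error with
    batch gradient descent, implemented with plain Python only."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = mean((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]             # generated by y = 2x + 1
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```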