Module 11 · Quantitative Methods

Introduction to Big Data Techniques

Mostly conceptual — fintech taxonomy, ML categories, NLP, and the data analytics pipeline.

Note: the sheet says "Simple Linear Regression" — the official CFA Institute curriculum for M11 is Big Data Techniques.

1. Fintech & Big Data Sources Concept

About: Big Data draws on three data sources: traditional (market data, filings), non-traditional (sensors, social media), and alternative (web traffic, geolocation). Remember the 4 V's: Volume, Velocity, Variety, Veracity.

Three data sources

  • Traditional — financial markets, trade data, corporate filings.
  • Non-traditional — social media, sensors, satellite imagery, credit-card transactions.
  • Alternative — web traffic, app usage, geolocation data.

4 V's of Big Data: Volume, Velocity, Variety, Veracity.

2. Machine Learning Categories Concept

About: Three ML categories: supervised (labeled data → regression/classification), unsupervised (find structure → clustering, PCA), and deep learning (neural nets for image/speech/NLP). Watch for overfitting.

Three ML categories

  • Supervised — labeled training data; regression, classification (e.g. credit-risk scoring).
  • Unsupervised — unlabeled data; clustering (k-means), dimension reduction (PCA).
  • Deep learning — neural networks with many hidden layers; image, speech, NLP.

Overfitting: Model memorizes training data; performs poorly on new data. Mitigated by cross-validation and regularization.
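A minimal numpy sketch of the overfitting symptom, using toy data and hypothetical polynomial degrees (the degrees and noise level are illustrative assumptions, not from the curriculum): a model far more flexible than the data warrants fits the training set better but generalizes worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy quadratic relationship, split into train and test halves.
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0, 0.1, 40)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def train_test_mse(degree):
    # Fit a polynomial of the given degree on the training set only,
    # then measure mean squared error in-sample and out-of-sample.
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return mse_train, mse_test

simple_train, simple_test = train_test_mse(2)    # matches the true relationship
complex_train, complex_test = train_test_mse(15)  # flexible enough to memorize noise
```

The degree-15 model drives training error down by chasing noise, which is exactly the gap cross-validation is designed to expose.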

3. Natural Language Processing (NLP) Concept

About: NLP turns text into numbers via cleansing, tokenization, stemming, and vectorization (bag-of-words / TF-IDF), unlocking earnings-call transcripts, news, and social posts as inputs to ML models.

Steps

  • 1. Cleansing — remove HTML, punctuation, stopwords.
  • 2. Tokenization — split text into words/tokens.
  • 3. Stemming / lemmatization — reduce words to root form.
  • 4. Bag of words / TF-IDF — vector representation.
  • 5. Sentiment / topic modeling.
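Steps 1–4 can be sketched in plain Python. The two-document corpus, stopword list, and crude suffix-stripping "stemmer" below are illustrative assumptions (real pipelines use libraries such as NLTK or spaCy):

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "on"}  # tiny illustrative list

def preprocess(text):
    # 1. Cleansing: lowercase, drop HTML-like tags and punctuation.
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    # 2. Tokenization: split on whitespace, remove stopwords.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # 3. Crude "stemming": strip a few common suffixes (illustration only).
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

def tf_idf(docs):
    # 4. Bag of words -> TF-IDF vector (one term->weight dict per document).
    tokenized = [preprocess(d) for d in docs]
    n = len(docs)
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["Earnings beat expectations.", "Revenue missed expectations."]
vecs = tf_idf(docs)
```

Note how "expectations" appears in both documents, so its IDF — and hence its TF-IDF weight — is zero: terms common to every document carry no discriminating signal.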

4. Data Analytics Pipeline Concept

About: Five-step pipeline: conceptualize → collect → prepare/wrangle → explore (visualize, feature engineer) → train/evaluate/deploy. Most time goes to data preparation, not modeling.

Five steps

  • 1. Conceptualization — define problem and KPIs.
  • 2. Data collection — pull from sources.
  • 3. Data preparation & wrangling — clean, transform, normalize.
  • 4. Data exploration — visualize, feature selection & engineering.
  • 5. Model training & evaluation — fit, validate, deploy.
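The five steps above can be sketched end to end in a few lines of numpy. The data (marketing spend vs. revenue) and the choice of ordinary least squares are made-up illustrations, not a prescribed method:

```python
import numpy as np

# 1. Conceptualization: predict revenue from marketing spend (KPI: MSE).
# 2. Data collection: pretend these rows were pulled from a data source.
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
revenue = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# 3. Preparation & wrangling: normalize the feature (zero mean, unit std).
x = (spend - spend.mean()) / spend.std()

# 4. Exploration / feature engineering: here just adding a bias column.
X = np.column_stack([np.ones_like(x), x])

# 5. Model training & evaluation: ordinary least squares, in-sample MSE.
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
mse = np.mean((X @ beta - revenue) ** 2)
```

Steps 2–4 are a couple of lines here only because the toy data arrives clean; in practice the preparation and wrangling step dominates the effort.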

Practice problem

An asset manager builds a credit-scoring model that uses 200 features. The model achieves 99% accuracy on training data but only 65% on out-of-sample data. Which problem is most likely?

Solution
Overfitting — the model memorized the training data, including its noise. High training accuracy combined with low out-of-sample accuracy is the classic sign.
Mitigation: cross-validation, regularization (LASSO/Ridge), fewer features, more data.