Train-Test Split (Fixed) - Machine Learning Basics

Partition a dataset into a training set and a test set using a fixed, predetermined index order. The first six positions of the order are appended to X_train and y_train one point at a time; the final two go to X_test and y_test.

By hand

With scikit-learn

train_test_split handles the shuffle and the partition in one call. test_size=0.25 requests 25% of the 8 points, i.e. 2 test items; random_state=42 makes the shuffle reproducible. The rows selected differ from those chosen by the hand-coded fixed order in naive.py.

naive.py

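X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]
# Fixed, hand-written visiting order of the row indices (no randomness).
order = [2, 5, 0, 7, 4, 1, 3, 6]
X_train = []
y_train = []
X_test = []
y_test = []
# i is the position within the order, idx is the row it points to.
for i, idx in enumerate(order):
    if i < 6:
        # The first six positions go to the training set.
        X_train.append(X[idx])
        y_train.append(y[idx])
    else:
        # The last two positions go to the test set.
        X_test.append(X[idx])
        y_test.append(y[idx])
print('RESULT:', (len(X_train), len(X_test)))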
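
Because the order is fixed, the same partition can also be written with slices instead of a loop. The following is a minimal sketch rather than part of the original listing; the names n_train, train_idx, and test_idx are introduced here purely for illustration.

```python
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]
order = [2, 5, 0, 7, 4, 1, 3, 6]

n_train = 6  # illustrative name: number of positions assigned to training

# Split the index order first, then gather the matching rows.
train_idx, test_idx = order[:n_train], order[n_train:]
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

print('train X:', X_train)
print('test  X:', X_test)
print('RESULT:', (len(X_train), len(X_test)))
```

With this order, the training set receives the X values 3, 6, 1, 8, 5, 2 and the test set receives 4 and 7, the same rows the loop version selects.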

library.py

from sklearn.model_selection import train_test_split
from dalib.display import set_display
set_display()

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]
# Shuffle and split in one call; the return order is
# X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)
print('train X:', X_train)
print('test  X:', X_test)
print('RESULT:', (len(X_train), len(X_test)))

train X: [1, 8, 3, 5, 4, 7]
test  X: [2, 6]
RESULT: (6, 2)
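
The reproducibility that random_state=42 provides can be checked directly. The sketch below is an illustrative addition, not part of library.py: it calls train_test_split twice with the same seed and compares the results, then confirms that X and y are shuffled together. The variable names first, second, same_split, and pairs_aligned are made up for this example.

```python
from sklearn.model_selection import train_test_split

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]

# Two independent calls with the same seed.
first = train_test_split(X, y, test_size=0.25, random_state=42)
second = train_test_split(X, y, test_size=0.25, random_state=42)

# Each call returns [X_train, X_test, y_train, y_test]; compare element-wise.
same_split = all(a == b for a, b in zip(first, second))
print('same split:', same_split)

# X and y are indexed with the same permutation, so each y stays with its x.
X_train, X_test, y_train, y_test = first
pairs_aligned = all(
    yi == 10 * xi for xi, yi in zip(X_train + X_test, y_train + y_test)
)
print('pairs aligned:', pairs_aligned)
```

Both checks should print True within a single environment. The comparison says nothing about whether a different sklearn or NumPy version would pick the same rows.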

Implementation notes

RESULT reports only (train_size, test_size). Both versions produce a 6/2 split, but the specific rows in each set differ: the naive version uses a hand-written order, while sklearn applies its own seeded permutation. Aligning the two selections exactly would require hard-coding sklearn's internal shuffle, which would be contrived.
random_state=42 makes the sklearn split deterministic across runs within a pinned environment (same sklearn and NumPy versions); omitting it draws from NumPy's global random state, so the split usually changes from run to run.
A 75/25 train/test ratio is a common starting point; real workflows often also carve out a validation set (train_test_split called twice) or use cross-validation instead of a single hold-out.
shuffle=True is the default but is stated explicitly here for clarity.
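
The note about carving out a validation set by calling train_test_split twice can be sketched as follows. This is one possible arrangement under stated assumptions, not a recommended recipe: the 4/2/2 proportions, the reuse of seed 42 for the second call, and the names X_rest, y_rest, X_val, and y_val are all choices made for this example.

```python
from sklearn.model_selection import train_test_split

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]

# First call: hold out 25% (2 of 8 points) as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Second call: split the remaining 6 points into 4 train and 2 validation.
# An integer test_size is an absolute count rather than a fraction.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2, random_state=42
)

print('RESULT:', (len(X_train), len(X_val), len(X_test)))
```

The three sets are disjoint and together cover all 8 points, so RESULT here is (4, 2, 2). On a dataset this small the split is only a demonstration of the mechanics.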