RL with verifiable reward
Note Trained on a small sample of the data
Note Added more training samples
Note Trained on the full dataset
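
For context, a reward of this kind is usually computed by a programmatic check on the model output rather than by a learned reward model. The notes do not say which check is used here, so the following is only a minimal sketch, assuming a binary exact-match check on the last number in the completion. The names `extract_final_answer` and `verify_reward` and the answer format are hypothetical, not taken from this project.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number found in the completion, or None if there is none."""
    # Assumption: the final answer is the last integer or decimal in the text.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verify_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference string, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```

A call such as `verify_reward("The answer is 42", "42")` compares the extracted string with the reference and yields a reward that an RL trainer could consume directly. A real setup would likely need answer normalization (units, fractions, formatting) beyond this string comparison.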