1. It should be possible, but it may be a difficult task. If they did it in the paper you provided, that is proof that it can be done. Wide transistors (large W/L ratio) should be used so that they operate only slightly above Vth, e.g., VGS = 0.5 V. For a Vth of about 0.4 V, that leads to an overdrive voltage Vov = VGS - Vth = 0.1 V, so your transistor needs only VDS >= 0.1 V to operate in saturation.
2. There should be opamps that are not folded cascode and have both NMOS and PMOS inputs. However, I cannot point to specific architectures, as I do not have time to search the web. Have you tried books?
3. A folded cascode is a folded cascode; the cascode cannot be removed. Look in books on opamp architectures, e.g. Baker and Razavi. Online, you may look at page 49 of the free preview here
https://payhip.com/b/5Srt , where opamps are categorized to some extent (maybe not every category is covered, but it should be useful, especially, though not only, for beginners).
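
To put rough numbers on point 1, here is a minimal sketch using the simple square-law model, ID = 0.5 * k' * (W/L) * Vov^2. The values Vth = 0.4 V, k' = 200 uA/V^2 and ID = 10 uA are my assumptions for illustration only, not taken from your paper or any particular process, and the square law is only approximate this close to Vth, so treat the result as a starting point for simulation.

```python
# Square-law estimate of the W/L needed for a small overdrive voltage.
# All numeric values below are illustrative assumptions, not process data.

def overdrive_voltage(vgs, vth):
    """Overdrive voltage Vov = VGS - Vth, in volts."""
    return vgs - vth

def required_w_over_l(i_d, k_prime, vov):
    """W/L from the square law: ID = 0.5 * k' * (W/L) * Vov^2."""
    return 2.0 * i_d / (k_prime * vov ** 2)

VGS = 0.5         # gate-source voltage from point 1, in volts
VTH = 0.4         # assumed threshold voltage, in volts
K_PRIME = 200e-6  # assumed process transconductance k' = mu * Cox, in A/V^2
I_D = 10e-6       # assumed drain current per transistor, in amperes

vov = overdrive_voltage(VGS, VTH)
w_over_l = required_w_over_l(I_D, K_PRIME, vov)

# In saturation the transistor needs VDS of at least Vov.
print(f"Vov = {vov:.2f} V, minimum VDS = {vov:.2f} V, W/L = {w_over_l:.1f}")
```

With these assumed numbers the ratio comes out at about 10. Halving Vov at the same current would require about four times the W/L, which is why low-headroom designs end up with wide transistors.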
I hope this helps a little. If you have more questions, just ask.