Enable column pruning for the MR execution engine #34

Open
jphalip opened this issue Dec 30, 2022 · 0 comments


jphalip commented Dec 30, 2022

The MR execution engine does not seem to provide a reliable value for the hive.io.file.readcolumn.names property when multiple tables are read in the same query. As a result, we can't properly support column pruning with MR and have to select all the columns (i.e. SELECT *).

This is unfortunately quite inefficient. Tez, however, does not have that issue.

See more info here: https://lists.apache.org/thread/g464zybq4g6c7p2h6nd9jmmznq472785

We need to investigate whether we can come up with a workaround, or find some other property or variable that exposes the subset of columns actually being read.

Relevant part of the codebase here:

Set<String> selectedFields;
String engine = HiveConf.getVar(jobConf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE);
if (engine.equals("mr")) {
  // Unfortunately the MR engine does not provide a reliable value for the
  // "hive.io.file.readcolumn.names" when multiple tables are read in the
  // same query. So we have to select all the columns (i.e. `SELECT *`).
  // This is unfortunately quite inefficient. Tez, however, does not have that issue.
  // See more info here: https://lists.apache.org/thread/g464zybq4g6c7p2h6nd9jmmznq472785
  // TODO: Investigate to see if we can come up with a workaround. Maybe try
  // using the new MapRed API (org.apache.hadoop.mapreduce) instead of the old
  // one (org.apache.hadoop.mapred)?
  selectedFields = new HashSet<>(columnNames);
} else {
  selectedFields =
      new HashSet<>(Arrays.asList(ColumnProjectionUtils.getReadColumnNames(jobConf)));
}
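
For anyone who wants to experiment with this logic without a Hadoop/Hive classpath, here is a minimal, dependency-free sketch of the same decision. It is not the connector's actual code: the class name ColumnSelectionSketch and the method selectFields are hypothetical, a plain Map stands in for the JobConf, and the comma-splitting is my own simplification of what ColumnProjectionUtils.getReadColumnNames returns. The two property names are the real Hive ones (hive.execution.engine and hive.io.file.readcolumn.names).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, dependency-free illustration of the engine-dependent column
// selection described in this issue. A Map<String, String> stands in for JobConf.
public class ColumnSelectionSketch {

    static final String ENGINE_KEY = "hive.execution.engine";
    static final String READ_COLUMN_NAMES_KEY = "hive.io.file.readcolumn.names";

    // Returns the columns to read: every table column under MR (no pruning),
    // otherwise the comma-separated names found in the read-column property.
    static Set<String> selectFields(Map<String, String> conf, List<String> allColumns) {
        if ("mr".equals(conf.get(ENGINE_KEY))) {
            // MR: the property is not trusted, so fall back to all columns.
            return new LinkedHashSet<>(allColumns);
        }
        Set<String> selected = new LinkedHashSet<>();
        String raw = conf.getOrDefault(READ_COLUMN_NAMES_KEY, "");
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                selected.add(trimmed);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> allColumns = Arrays.asList("id", "name", "price");

        Map<String, String> mrConf = new HashMap<>();
        mrConf.put(ENGINE_KEY, "mr");
        mrConf.put(READ_COLUMN_NAMES_KEY, "id");
        System.out.println(selectFields(mrConf, allColumns)); // prints [id, name, price]

        Map<String, String> tezConf = new HashMap<>();
        tezConf.put(ENGINE_KEY, "tez");
        tezConf.put(READ_COLUMN_NAMES_KEY, "id,price");
        System.out.println(selectFields(tezConf, allColumns)); // prints [id, price]
    }
}
```

The MR case shows the cost: even though the query only needs id, all three columns end up selected.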
