public class Hfs extends Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector> implements FileType<org.apache.hadoop.mapred.JobConf>
Use Hfs with the HadoopFlowConnector when creating Hadoop executable Flow instances.
Paths should typically point to a directory; all of the "part" files immediately within that directory will
be included. This is the practice Hadoop expects. Sub-directories are not included and typically result in a failure.
To include sub-directories, Hadoop supports "globing". Globing is a frustrating feature and is supported more
robustly by GlobHfs and less so by Hfs. Hfs will accept /* (wildcard) paths, but not all convenience methods, like
getSize(org.apache.hadoop.mapred.JobConf), will behave properly or reliably, nor can an Hfs instance with a
wildcard path be used as a sink to write data. In those cases use GlobHfs, since it is a sub-class of MultiSourceTap.
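For illustration, a minimal sketch of the two approaches. The package layout (cascading.scheme.hadoop, cascading.tap.hadoop), the SourceTaps class name, and the hdfs://namenode paths are assumptions made for this example, not values taken from this page:

```java
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.GlobHfs;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SourceTaps
  {
  public static Tap directorySource()
    {
    // Preferred: point at the directory; all "part" files directly inside it are read.
    return new Hfs( new TextLine( new Fields( "offset", "line" ) ), "hdfs://namenode/data/logs" );
    }

  public static Tap globSource()
    {
    // For wildcards over sub-directories, prefer GlobHfs (a MultiSourceTap sub-class).
    // A glob pattern tap is read-only; it cannot be used as a sink.
    return new GlobHfs( new TextLine( new Fields( "offset", "line" ) ), "hdfs://namenode/data/logs/*/part-*" );
    }
  }
```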
Optionally use Dfs or Lfs for resources specific to the Hadoop Distributed File System or the local file system,
respectively. Using Hfs is the best practice when possible; Lfs and Dfs are conveniences.
Use the Hfs class if the 'kind' of resource is unknown at design time. To use, prefix a scheme to the 'stringPath',
where hdfs://... will denote Dfs and file://... will denote Lfs.
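A short sketch of scheme prefixing. The concrete URIs, the SchemePrefixedTaps class name, and the Cascading 2.x package names are illustrative assumptions:

```java
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SchemePrefixedTaps
  {
  public static void main( String[] args )
    {
    TextLine lines = new TextLine( new Fields( "offset", "line" ) );

    // The path's URI scheme decides the concrete file system at runtime.
    Tap onHdfs = new Hfs( lines, "hdfs://namenode:8020/input/data" ); // same behavior as Dfs
    Tap onLocal = new Hfs( lines, "file:///tmp/input/data" );         // same behavior as Lfs
    Tap onDefault = new Hfs( lines, "input/data" );                   // default FileSystem decides

    System.out.println( onHdfs + " / " + onLocal + " / " + onDefault );
    }
  }
```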
Call setTemporaryDirectory(java.util.Map, String) to use a temporary file directory path other than the current
Hadoop default path.
By default, Cascading on Hadoop will assume any source or sink Tap using the file:// URI scheme intends to read
files from the local client filesystem (for example when using the Lfs Tap) where the Hadoop job jar is started,
so the Tap will force any MapReduce jobs reading or writing to file:// resources to run in Hadoop "standalone mode"
so that the file can be read.
To change this behavior, call HfsProps.setLocalModeScheme(java.util.Map, String) to set a different scheme value,
or set it to "none" to disable this behavior entirely for the case where the file to be read is available on every
Hadoop processing node at the exact same path.
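A hedged sketch of overriding that default. HfsProps.setLocalModeScheme is referenced above; the HashMap-based properties handoff, the LocalModeSchemeSetup class name, and the HadoopFlowConnector constructor are the conventional Cascading 2.x pattern assumed here:

```java
import java.util.HashMap;
import java.util.Map;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.tap.hadoop.HfsProps;

public class LocalModeSchemeSetup
  {
  public static FlowConnector connector()
    {
    Map<Object, Object> properties = new HashMap<Object, Object>();

    // Disable the standalone-mode fallback entirely; only safe when every file:// path
    // is present at the same location on every Hadoop processing node.
    HfsProps.setLocalModeScheme( properties, "none" );

    return new HadoopFlowConnector( properties );
    }
  }
```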
Hfs can optionally combine multiple small files (or a series of small "blocks") into larger "splits". This reduces
the number of resulting map tasks created by Hadoop and can improve application performance.
This is enabled by calling HfsProps.setUseCombinedInput(boolean) with true. By default, merging or combining splits
into larger ones is disabled.
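A minimal sketch of enabling combined input. setUseCombinedInput(boolean) is referenced above; the hfsProps() fluent factory, buildProperties() chaining, and the CombinedInputSetup class name are assumptions about the Props-style API, not values confirmed by this page:

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.tap.hadoop.HfsProps;

public class CombinedInputSetup
  {
  public static FlowConnector connector()
    {
    // Combine many small files/blocks into larger splits to reduce the number of map tasks.
    // hfsProps() and buildProperties() are the assumed fluent Props helpers.
    Properties properties = HfsProps.hfsProps()
      .setUseCombinedInput( true )
      .buildProperties();

    return new HadoopFlowConnector( properties );
    }
  }
```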
Modifier and Type | Field and Description
---|---
protected java.lang.String | stringPath  Field stringPath
static java.lang.String | TEMPORARY_DIRECTORY  Deprecated.
Modifier | Constructor and Description
---|---
protected | Hfs()
 | Hfs(Fields fields, java.lang.String stringPath)  Deprecated.
 | Hfs(Fields fields, java.lang.String stringPath, boolean replace)  Deprecated.
 | Hfs(Fields fields, java.lang.String stringPath, SinkMode sinkMode)  Deprecated.
protected | Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)
 | Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath)  Constructor Hfs creates a new Hfs instance.
 | Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath, boolean replace)  Deprecated.
 | Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath, SinkMode sinkMode)  Constructor Hfs creates a new Hfs instance.
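For orientation, a minimal end-to-end sketch using the non-deprecated Hfs(Scheme, String) and Hfs(Scheme, String, SinkMode) constructors. The scheme class, paths, pipe name, and CopyFlow class name are illustrative assumptions:

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CopyFlow
  {
  public static void main( String[] args )
    {
    Tap source = new Hfs( new TextLine( new Fields( "offset", "line" ) ), "hdfs://namenode:8020/input/logs" );

    // SinkMode.REPLACE deletes any existing output before the flow writes to it.
    Tap sink = new Hfs( new TextLine(), "hdfs://namenode:8020/output/logs", SinkMode.REPLACE );

    Pipe copy = new Pipe( "copy" );

    Flow flow = new HadoopFlowConnector().connect( "copy-flow", source, sink, copy );
    flow.complete();
    }
  }
```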
Modifier and Type | Method and Description
---|---
protected void | applySourceConfInitIdentifiers(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf, java.lang.String... fullIdentifiers)
boolean | createResource(org.apache.hadoop.mapred.JobConf conf)  Method createResource creates the underlying resource.
boolean | deleteChildResource(org.apache.hadoop.mapred.JobConf conf, java.lang.String childIdentifier)
boolean | deleteResource(org.apache.hadoop.mapred.JobConf conf)  Method deleteResource deletes the resource represented by this instance.
long | getBlockSize(org.apache.hadoop.mapred.JobConf conf)  Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
java.lang.String[] | getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf)  Method getChildIdentifiers returns an array of child identifiers if this resource is a directory.
java.lang.String[] | getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf, int depth, boolean fullyQualified)
protected static boolean | getCombinedInputSafeMode(org.apache.hadoop.mapred.JobConf conf)
protected org.apache.hadoop.fs.FileSystem | getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
java.net.URI | getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)  Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.
protected org.apache.hadoop.fs.FileSystem | getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
java.lang.String | getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)  Method getFullIdentifier returns a fully qualified resource identifier.
java.lang.String | getIdentifier()  Method getIdentifier returns a String representing the resource this Tap instance represents.
protected static java.lang.String | getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf, java.lang.String defaultValue)
long | getModifiedTime(org.apache.hadoop.mapred.JobConf conf)  Method getModifiedTime returns the date this resource was last modified.
org.apache.hadoop.fs.Path | getPath()
int | getReplication(org.apache.hadoop.mapred.JobConf conf)  Method getReplication returns the replication specified by the underlying file system for this resource.
long | getSize(org.apache.hadoop.mapred.JobConf conf)  Method getSize returns the size of the file referenced by this tap.
static java.lang.String | getTemporaryDirectory(java.util.Map<java.lang.Object,java.lang.Object> properties)  Deprecated. see HfsProps
static org.apache.hadoop.fs.Path | getTempPath(org.apache.hadoop.mapred.JobConf conf)
java.net.URI | getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
protected static boolean | getUseCombinedInput(org.apache.hadoop.mapred.JobConf conf)
boolean | isDirectory(org.apache.hadoop.mapred.JobConf conf)  Method isDirectory returns true if the underlying resource represents a directory or folder instead of an individual file.
protected java.lang.String | makeTemporaryPathDirString(java.lang.String name)
protected java.net.URI | makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
TupleEntryIterator | openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.RecordReader input)  Method openForRead opens the resource represented by this Tap instance for reading.
TupleEntryCollector | openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.OutputCollector output)  Method openForWrite opens the resource represented by this Tap instance for writing.
boolean | resourceExists(org.apache.hadoop.mapred.JobConf conf)  Method resourceExists returns true if the path represented by this instance exists.
protected void | setStringPath(java.lang.String stringPath)
static void | setTemporaryDirectory(java.util.Map<java.lang.Object,java.lang.Object> properties, java.lang.String tempDir)  Deprecated. see HfsProps
protected void | setUriScheme(java.net.URI uriScheme)
void | sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)  Method sinkConfInit initializes this instance as a sink.
void | sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)  Method sourceConfInit initializes this instance as a source.
protected void | sourceConfInitAddInputPath(org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.fs.Path qualifiedPath)
protected void | sourceConfInitComplete(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
protected static void | verifyNoDuplicates(org.apache.hadoop.mapred.JobConf conf)
Methods inherited from class Tap: commitResource, createResource, deleteResource, equals, flowConfInit, getConfigDef, getFullIdentifier, getModifiedTime, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hashCode, hasStepConfigDef, id, isEquivalentTo, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForWrite, outgoingScopeFor, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, resourceExists, retrieveSinkFields, retrieveSourceFields, rollbackResource, setScheme, taps, toString
@Deprecated public static final java.lang.String TEMPORARY_DIRECTORY
Deprecated. See HfsProps.TEMPORARY_DIRECTORY
protected java.lang.String stringPath
protected Hfs()
@ConstructorProperties(value="scheme") protected Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)
@Deprecated @ConstructorProperties(value={"fields","stringPath"}) public Hfs(Fields fields, java.lang.String stringPath)
fields - of type Fields
stringPath - of type String

@Deprecated @ConstructorProperties(value={"fields","stringPath","replace"}) public Hfs(Fields fields, java.lang.String stringPath, boolean replace)
fields - of type Fields
stringPath - of type String
replace - of type boolean

@Deprecated @ConstructorProperties(value={"fields","stringPath","sinkMode"}) public Hfs(Fields fields, java.lang.String stringPath, SinkMode sinkMode)
fields - of type Fields
stringPath - of type String
sinkMode - of type SinkMode

@ConstructorProperties(value={"scheme","stringPath"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath)
scheme - of type Scheme
stringPath - of type String

@Deprecated @ConstructorProperties(value={"scheme","stringPath","replace"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath, boolean replace)
scheme - of type Scheme
stringPath - of type String
replace - of type boolean

@ConstructorProperties(value={"scheme","stringPath","sinkMode"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, java.lang.String stringPath, SinkMode sinkMode)
scheme - of type Scheme
stringPath - of type String
sinkMode - of type SinkMode

@Deprecated public static void setTemporaryDirectory(java.util.Map<java.lang.Object,java.lang.Object> properties, java.lang.String tempDir)
Deprecated. See HfsProps
properties - of type Map
tempDir - of type String

@Deprecated public static java.lang.String getTemporaryDirectory(java.util.Map<java.lang.Object,java.lang.Object> properties)
Deprecated. See HfsProps
properties - of type Map

protected static java.lang.String getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf, java.lang.String defaultValue)
protected static boolean getUseCombinedInput(org.apache.hadoop.mapred.JobConf conf)
protected static boolean getCombinedInputSafeMode(org.apache.hadoop.mapred.JobConf conf)
protected void setStringPath(java.lang.String stringPath)
protected void setUriScheme(java.net.URI uriScheme)
public java.net.URI getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
protected java.net.URI makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
public java.net.URI getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
jobConf - of type JobConf

protected org.apache.hadoop.fs.FileSystem getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
protected org.apache.hadoop.fs.FileSystem getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
public java.lang.String getIdentifier()
Description copied from class: Tap
getIdentifier in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
public org.apache.hadoop.fs.Path getPath()
public java.lang.String getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
getFullIdentifier in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf - of type Config

public void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
This method may be called more than once if this Tap instance is used outside the scope of a Flow
instance or if it participates multiple times in a given Flow or across different Flows in a Cascade.
In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow).
Note that no resources or services should be modified by this method.
sourceConfInit in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
process - of type FlowProcess
conf - of type Config

protected static void verifyNoDuplicates(org.apache.hadoop.mapred.JobConf conf)
protected void applySourceConfInitIdentifiers(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf, java.lang.String... fullIdentifiers)
protected void sourceConfInitAddInputPath(org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.fs.Path qualifiedPath)
protected void sourceConfInitComplete(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
public void sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
This method may be called more than once if this Tap instance is used outside the scope of a Flow
instance or if it participates multiple times in a given Flow or across different Flows in a Cascade.
Note this method will be called in the context of this Tap being used as a traditional 'sink' and as a 'trap'.
In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow).
Note that no resources or services should be modified by this method. If this Tap instance returns true for
Tap.isReplace(), then Tap.deleteResource(Object) will be called by the parent Flow.
sinkConfInit in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
process - of type FlowProcess
conf - of type Config

public TupleEntryIterator openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.RecordReader input) throws java.io.IOException
Description copied from class: Tap
The input value may be null; if so, sub-classes must inquire with the underlying Scheme via
Scheme.sourceConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper input type and instantiate it
before calling super.openForRead().
Note the returned iterator will return the same instance of TupleEntry on every call; thus a copy must be made of
either the TupleEntry or the underlying Tuple instance if they are to be stored in a Collection.
openForRead in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
flowProcess - of type FlowProcess
input - of type Input
java.io.IOException - when the resource cannot be opened
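Because the iterator reuses the same TupleEntry, a sketch of copying tuples before collecting them. The single-argument openForRead(FlowProcess) variant inherited from Tap is used; HadoopFlowProcess, the CollectTuples class name, and the TextLine scheme are assumptions for this example:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.JobConf;

import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntryIterator;

public class CollectTuples
  {
  public static List<Tuple> readAll( String path ) throws java.io.IOException
    {
    Hfs tap = new Hfs( new TextLine( new Fields( "offset", "line" ) ), path );
    TupleEntryIterator iterator = tap.openForRead( new HadoopFlowProcess( new JobConf() ) );

    List<Tuple> tuples = new ArrayList<Tuple>();

    while( iterator.hasNext() )
      tuples.add( new Tuple( iterator.next().getTuple() ) ); // copy; the iterator reuses its TupleEntry

    iterator.close();

    return tuples;
    }
  }
```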
public TupleEntryCollector openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.OutputCollector output) throws java.io.IOException
Description copied from class: Tap
This call does not honor the SinkMode setting. If SinkMode is SinkMode.REPLACE, this call may fail. See
Tap.openForWrite(cascading.flow.FlowProcess).
The output value may be null; if so, sub-classes must inquire with the underlying Scheme via
Scheme.sinkConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper output type and instantiate it
before calling super.openForWrite().
openForWrite in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
flowProcess - of type FlowProcess
output - of type Output
java.io.IOException - when the resource cannot be opened

public boolean createResource(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from class: Tap
createResource in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf - of type Config
java.io.IOException - when there is an error making directories

public boolean deleteResource(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from class: Tap
deleteResource in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf - of type Config
java.io.IOException - when the resource cannot be deleted

public boolean deleteChildResource(org.apache.hadoop.mapred.JobConf conf, java.lang.String childIdentifier) throws java.io.IOException
java.io.IOException
public boolean resourceExists(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from class: Tap
resourceExists in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf - of type Config
java.io.IOException - when the status cannot be determined

public boolean isDirectory(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from interface: FileType
isDirectory in interface FileType<org.apache.hadoop.mapred.JobConf>
conf - of JobConf
java.io.IOException
public long getSize(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from interface: FileType
public long getBlockSize(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
conf - of JobConf
java.io.IOException - when

public int getReplication(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Method getReplication returns the replication specified by the underlying file system for this resource.
conf - of JobConf
java.io.IOException - when

public java.lang.String[] getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from interface: FileType
Method getChildIdentifiers returns an array of child identifiers if this resource is a directory. This method will
skip Hadoop log directories (_log).
getChildIdentifiers in interface FileType<org.apache.hadoop.mapred.JobConf>
conf - of JobConf
java.io.IOException
public java.lang.String[] getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf, int depth, boolean fullyQualified) throws java.io.IOException
getChildIdentifiers in interface FileType<org.apache.hadoop.mapred.JobConf>
java.io.IOException
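A brief sketch of inspecting a directory with the FileType methods above. The JobConf construction, path, and ListChildren class name are illustrative assumptions:

```java
import org.apache.hadoop.mapred.JobConf;

import cascading.scheme.hadoop.TextLine;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ListChildren
  {
  public static void main( String[] args ) throws java.io.IOException
    {
    JobConf conf = new JobConf();
    Hfs tap = new Hfs( new TextLine( new Fields( "offset", "line" ) ), "hdfs://namenode:8020/data/logs" );

    if( tap.resourceExists( conf ) && tap.isDirectory( conf ) )
      {
      // one level deep, fully qualified child identifiers
      for( String child : tap.getChildIdentifiers( conf, 1, true ) )
        System.out.println( child );
      }
    }
  }
```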
public long getModifiedTime(org.apache.hadoop.mapred.JobConf conf) throws java.io.IOException
Description copied from class: Tap
getModifiedTime in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf - of type Config
java.io.IOException
public static org.apache.hadoop.fs.Path getTempPath(org.apache.hadoop.mapred.JobConf conf)
protected java.lang.String makeTemporaryPathDirString(java.lang.String name)