
Reading Text Data

A typical project begins by reading data from somewhere into a data frame. A typical data source might be a text file, but it is also possible to import data from binary data files such as those used by Stata, SAS, and SPSS.

Text files come in many forms. It is always a good idea to look at any documentation you have first. Then it can be informative to look at the text file itself, preferably in a dedicated text editor (on SSCC computers, use Notepad++).

Text Data Concepts

You are looking for a few things when you examine the file.

  • Data, metadata, extra text

    The file includes data values. Does it also include variable names or other information that helps define the data? Is there a header or a footer with explanatory text about the file contents?

  • Observation delimiter

    What separates one observation from the next? Commonly, each observation has a separate line in the text file, but it is possible to have multiple observations per line, or multiple lines per observation.

  • Data value delimiter

    Within an observation, what separates one data value from the next? Very commonly the data value delimiter will be a space or a comma. Tabs used to be common, and are hard to distinguish visually from spaces.

    Especially in older data sets, it used to be common for data values to appear in specified columns - e.g. state in columns 3-4 and county in columns 5-7 - with no character delimiting data values.

  • Character value quote

    Where data value delimiters are used, how are the same characters included in character data values? For example, if the data values are separated by spaces, how do you include a space within a data value? The typical answer is that character data values are enclosed in quotes, either double (") or single (') quotes.

  • Missing value string

    How are missing values indicated? This might be by having two data value delimiters with no data value between them. Or there might be a special string that denotes missing data, such as NA, -99, or BBBBBBB. There may be more than one missing value indicator as well, such as -98 and -99.
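If no text editor is handy, a first look at a file can also be taken from within R. The sketch below is only an illustration: it writes a small made-up file to a temporary location (the contents are hypothetical, not one of the files used later), prints the raw lines, and counts the fields on each line.

```r
# A small made-up file written to a temporary location (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Age",
             "Alfred,M,14",
             "Alice,F,NA"), tmp)

# Look at the raw text of the first lines, exactly as stored.
first_lines <- readLines(tmp, n = 2)
print(first_lines)

# Count the comma-separated fields on each line. Unequal counts can
# point to a header, a footer, or a different delimiter.
n_fields <- count.fields(tmp, sep = ",")
print(n_fields)
```

If the counts differ from line to line, that is a hint to look more closely at the delimiter, the quoting, or any extra text in the file.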

Reading Data Files

The examples below read six CSV and text files from our website. If you would rather download the files and read them from a local directory, you can download them all by clicking here.

CSV Examples

The Simple Case

Consider this file:

    url_simple <- "https://sscc.wisc.edu/sscc/pubs/data/dwr-read/class_simple.csv"

The first few lines look like this:

    Name,Sex,Age,Height,Weight
    Alfred,M,14,69,112.5
    Alice,F,13,56.5,84
    Barbara,F,13,65.3,98
    Carol,F,14,62.8,102.5
    Henry,M,14,63.5,102.5

In this file,

  • The first line has variable names, and the rest is data.
  • There is one observation per line.
  • Data values are separated by commas.
  • There appear to be no character quotes.
  • There appear to be no missing values.

Data like this is very easy to read into R with the read.csv() function:

    class_simple <- read.csv(url_simple)
    head(class_simple)

         Name Sex Age Height Weight
    1  Alfred   M  14   69.0  112.5
    2   Alice   F  13   56.5   84.0
    3 Barbara   F  13   65.3   98.0
    4   Carol   F  14   62.8  102.5
    5   Henry   M  14   63.5  102.5
    6   James   M  12   57.3   83.0

    str(class_simple)

    'data.frame':   19 obs. of  5 variables:
     $ Name  : chr  "Alfred" "Alice" "Barbara" "Carol" ...
     $ Sex   : chr  "M" "F" "F" "F" ...
     $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
     $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
     $ Weight: num  112 84 98 102 102 ...

Characters to Factors

Prior to R-4.0, Name and Sex in the previous example would have been turned into factors automatically. This is no longer the default, but remains an option.

    class_simple <- read.csv(url_simple, as.is = FALSE) # all character vars to factors
    str(class_simple)

    'data.frame':   19 obs. of  5 variables:
     $ Name  : Factor w/ 19 levels "Alfred","Alice",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ Sex   : Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 2 ...
     $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
     $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
     $ Weight: num  112 84 98 102 102 ...

To convert specific columns to factors, do the conversion as a separate step later on, or use a named vector of column classes in a colClasses argument. The vector values are the classes to assign, and the names are variable names:

    cc <- c(Sex = "factor") # a named vector of column classes
    class_simple2 <- read.csv(url_simple, colClasses = cc)
    str(class_simple2)

    'data.frame':   19 obs. of  5 variables:
     $ Name  : chr  "Alfred" "Alice" "Barbara" "Carol" ...
     $ Sex   : Factor w/ 2 levels "F","M": 2 1 1 1 2 2 1 1 2 2 ...
     $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
     $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
     $ Weight: num  112 84 98 102 102 ...
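The other route mentioned above, converting as a separate step after reading, might look like the following. This is a minimal sketch that uses a few made-up rows passed through the text argument, so that it runs without a network connection; the object name class_demo is hypothetical.

```r
# Made-up rows in the same layout as the class file (hypothetical data).
csv_text <- "Name,Sex,Age
Alfred,M,14
Alice,F,13
Henry,M,14"

class_demo <- read.csv(text = csv_text)   # Sex is read as character
class_demo$Sex <- factor(class_demo$Sex)  # convert one column afterward

str(class_demo)
levels(class_demo$Sex)
```

Converting afterward keeps the reading step simple, and leaves room to set the factor levels or labels explicitly if needed.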

Quoted Character Values

Quoting character values is seldom a problem … but sometimes it is. So consider this file:

    url_quotes <- "https://sscc.wisc.edu/sscc/pubs/data/dwr-read/class_quotes.csv"

The first few lines look like this:

    "Name","Sex","Age","Height","Weight"
    "B, Alfred","M",14,69,112.5
    "Y, Alice","F",13,56.5,84
    "M, Barbara","F",13,65.3,98
    "P, Carol","F",14,62.8,102.5
    "A, Henry","M",14,63.5,102.5

This is what we'd like to see. There are commas within the data values for Name, but these are all in quotes. The default use of read.csv() assumes that double quotes enclose character values (read.table() accepts single quotes as well by default). If another character is used, we have the quote argument we can use. If nothing is used, we could be in trouble! (We might need a new strategy.)

    class_quotes <- read.csv(url_quotes)
    str(class_quotes)

    'data.frame':   19 obs. of  5 variables:
     $ Name  : chr  "B, Alfred" "Y, Alice" "M, Barbara" "P, Carol" ...
     $ Sex   : chr  "M" "F" "F" "F" ...
     $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
     $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
     $ Weight: num  112 84 98 102 102 ...
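When a file uses a different quote character, the quote argument can be set explicitly. The sketch below assumes a made-up file in which names are wrapped in single quotes; the data and the object name quoted_demo are hypothetical.

```r
# Made-up data where character values are enclosed in single quotes.
csv_text <- "Name,Age
'B, Alfred',14
'Y, Alice',13"

# Tell read.csv() that the single quote is the quoting character,
# so the comma inside each name is kept as part of the value.
quoted_demo <- read.csv(text = csv_text, quote = "'")
str(quoted_demo)
```

With the quote character declared, each row is read as two values rather than three.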

Missing Values

Next, consider this file:

    url_missing <- "https://sscc.wisc.edu/sscc/pubs/data/dwr-read/class_missing.csv"

The first few lines look like this:

    Name,Sex,Age,Height,Weight
    Alfred,M,14,,112.5
    Alice,F,13,56.5,84
    Barbara,F,13,65.3,98
    Carol,F,14,62.8,102.5
    Henry,M,14,63.5,102.5

Here we have a missing value for Height in the first observation (and more later in the data set).

    class_missing <- read.csv(url_missing)
    head(class_missing)

         Name Sex Age Height Weight
    1  Alfred   M  14     NA  112.5
    2   Alice   F  13   56.5   84.0
    3 Barbara   F  13   65.3   98.0
    4   Carol   F  14   62.8  102.5
    5   Henry   M  14   63.5  102.5
    6   James   M  12   57.3   83.0

Depending on the software used to produce the text file, a missing value might be denoted by two data delimiters with no text in between, as in this example. In text files produced by R itself, missing values are usually denoted by NA, so by default these are turned into missing values as well. Other software might use another symbol (periods are common, and dashes are sometimes used), for which we have the na.strings argument.
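As a sketch of how na.strings might be used, suppose a file marks missing values with a period in one column and with -99 in another. The data below are made up for illustration, and missing_demo is a hypothetical object name.

```r
# Made-up data with two different missing value indicators.
csv_text <- "Name,Sex,Height,Weight
Alfred,M,.,112.5
Alice,F,56.5,-99
Henry,M,63.5,102.5"

# List every string that should be treated as missing.
# "NA" is included so the usual default keeps working too.
missing_demo <- read.csv(text = csv_text,
                         na.strings = c(".", "-99", "NA"))
print(missing_demo)
```

Because the indicators are converted to NA while reading, Height and Weight are still read as numeric columns.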

Space Delimited

For space delimited data, we use a related function, read.table(). Here a few of the assumptions (defaults) are different. Now spaces are assumed to delimit data values where before they were assumed to be part of data values, and vice versa for commas. Files are assumed to have no headers. The other major arguments work as before.

    url_space <- "https://sscc.wisc.edu/sscc/pubs/data/dwr-read/class_space.txt"

The first few lines look like this:

    Name Sex Age Height Weight
    Alfred M 14 69 112.5
    Alice F 13 56.5 84
    Barbara F 13 65.3 98
    Carol F 14 62.8 102.5
    Henry M 14 63.5 102.5

Here we have a header with variable names, which we need to indicate.

    class_space <- read.table(url_space, header = TRUE)
    str(class_space)

    'data.frame':   19 obs. of  5 variables:
     $ Name  : chr  "Alfred" "Alice" "Barbara" "Carol" ...
     $ Sex   : chr  "M" "F" "F" "F" ...
     $ Age   : int  14 13 13 14 14 12 12 15 13 12 ...
     $ Height: num  69 56.5 65.3 62.8 63.5 57.3 59.8 62.5 62.5 59 ...
     $ Weight: num  112 84 98 102 102 ...
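The same function can read tab-delimited text if the delimiter is named with the sep argument. The following is a minimal sketch with made-up data; note that once tabs are declared as the delimiter, a space can appear inside a value without quotes.

```r
# Made-up tab-delimited data; "\t" is a tab and "\n" ends a line.
tab_text <- "Name\tSex\tAge\nJohn Paul\tM\t14\nAlice\tF\t13"

# Declare the tab as the data value delimiter.
tab_demo <- read.table(text = tab_text, header = TRUE, sep = "\t")
str(tab_demo)
```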

Fixed-Width Text

Data in fixed columns is easy to recognize when the data values run together. Even if they do not, this can be a solution when spaces or commas are valid data values and there are no character value quotes. Missing values are typically spaces, too.

    url_fixed <- "https://sscc.wisc.edu/sscc/pubs/data/dwr-read/class_fixed.txt"

The first few lines look like this:

    Alfred M1469  112.5
    Alice  F1356.584
    BarbaraF1365.398
    Carol  F1462.8102.5
    Henry  M1463.5102.5
    James  M1257.383

Here we need to know how many columns each variable occupies (including spaces). Data documentation is a huge help here.

Notice that the width includes the decimal character.

    myWidths <- c(7,1,2,4,5)   # how many columns wide each of the variables is

    class_fixed <- read.fwf(url_fixed, myWidths)
    head(class_fixed)

           V1 V2 V3   V4    V5
    1 Alfred   M 14 69.0 112.5
    2 Alice    F 13 56.5  84.0
    3 Barbara  F 13 65.3  98.0
    4 Carol    F 14 62.8 102.5
    5 Henry    M 14 63.5 102.5
    6 James    M 12 57.3  83.0
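Fixed-width files carry no variable names, so the columns above are called V1 through V5. One possible refinement, sketched below on a few made-up lines in the same layout, is to supply names with col.names and to trim the padding from character values with strip.white (both are passed along to read.table()). The variable names are an assumption based on the earlier CSV files.

```r
# Made-up fixed-width lines written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Alfred M1469  112.5",
             "Alice  F1356.584   ",
             "BarbaraF1365.398   "), tmp)

# Widths as before; names and whitespace trimming are added.
fixed_demo <- read.fwf(tmp,
                       widths = c(7, 1, 2, 4, 5),
                       col.names = c("Name", "Sex", "Age", "Height", "Weight"),
                       strip.white = TRUE)
str(fixed_demo)
```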

Writing Data Files

After you have read in and manipulated your data in R, you should save your data. As you might have guessed, the opposite of read.csv() is write.csv().

The write.csv() function has the default option of row.names = TRUE, which will create a column with our row names. Since we did not name the rows in class_fixed, they contain the default vector 1:nrow(class_fixed). If we do not want this, we can specify row.names = FALSE.

    write.csv(class_fixed, "class_fixed.csv", row.names = FALSE)
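To see what the row.names option changes, it can help to write the same data frame both ways and compare the header lines. This sketch uses a small made-up data frame and temporary files, so nothing is written to your working directory.

```r
# A small made-up data frame.
demo <- data.frame(Name = c("Alfred", "Alice"), Age = c(14L, 13L))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(demo, with_rows)                        # default: row names written
write.csv(demo, without_rows, row.names = FALSE)  # row names left out

# Compare the first line of each file.
header_with <- readLines(with_rows, n = 1)
header_without <- readLines(without_rows, n = 1)
print(header_with)
print(header_without)
```

The file written with the default has an extra, unnamed first column, which will come back as a column of its own when the file is read again.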

To customize the separators and file format, see help(write.table).
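As one example of such customization, the sketch below writes a tab-delimited file without quotes around character values. The data frame is made up, and the choice of options is only one possibility.

```r
# A small made-up data frame.
demo <- data.frame(Name = c("Alfred", "Alice"), Age = c(14L, 13L))

tmp <- tempfile(fileext = ".txt")

# Tabs between values, no row names, no quotes around character values.
write.table(demo, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

written <- readLines(tmp)
print(written)
```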

You may also choose to save your data in the RDS format. R is able to read this file type faster than CSVs, but the disadvantage is that you cannot easily preview the file in an application such as Excel or a text editor. With smaller datasets, you may not notice a difference in loading time. However, if you work with larger datasets, you may be able to save a lot of time by using RDS files.

    saveRDS(class_fixed, file = "class_fixed.rds")

To load the file back into R,

    dat <- readRDS("class_fixed.rds")
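Apart from speed, an RDS file stores the R object itself, so column types such as factors come back unchanged, while a CSV round trip has to guess the types again. The sketch below illustrates this with a made-up data frame and temporary files; it assumes R 4.0 or later, where read.csv() reads text as character by default.

```r
# A small made-up data frame with a factor column.
demo <- data.frame(Name = c("Alfred", "Alice"),
                   Sex = factor(c("M", "F")))

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(demo, rds_file)
write.csv(demo, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

class(from_rds$Sex)  # the factor survives the RDS round trip
class(from_csv$Sex)  # the CSV round trip returns plain text
```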

Paths and Working Directories

Data files have a name and are located in a folder. (A folder is the same as a directory. You will see both of these names in common use.) The folder containing the file may be nested inside another folder, and that folder may be in yet another folder, and so on. The specification of the list of folders to travel and the file name is called a path. A path that starts at the root folder of the computer is called an absolute path. A relative path starts at a given folder and provides the folders and file starting from that folder. Using relative paths will make a number of things easier when writing programs and is considered a good programming practice.

A path is made up of folder names. If the path is to a file, then the path will end with a file name. The folders and files of a path are separated by a directory separator (e.g., / or \). Different operating systems use different directory separators. In R, the function file.path() is used to fill in the directory separator. It knows which separator to use for the operating system it is running on.
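To make the parts of a path concrete, the sketch below builds a made-up relative path and then splits it back into its folder part and its file name with dirname() and basename(). The folder and file names are hypothetical.

```r
# A hypothetical relative path, built piece by piece.
p <- file.path("project", "data", "data1.csv")
print(p)

# Split the path back into its folder part and its file name.
folder_part <- dirname(p)
file_part <- basename(p)
print(folder_part)
print(file_part)
```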

There are a few special directory names. A single period, ., indicates the current working directory. Two periods, .., indicate moving up a directory. The following image shows how .. would be used to get a data file in the folder structure used in the project organization section.

Relative file paths.

When R starts a session, it has a location to look for other files. This path is called the current working directory, and this is often shortened to the working directory. Relative paths in a program are specified as starting at the current working directory. To print your current working directory, use the getwd() function. To change your working directory, supply setwd() with a relative path in quotes (or an absolute path, but relative paths are preferred).

    getwd()

    [1] "U:/schoolhouse/my_class"

    setwd("hw_1")
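The following sketch shows the same idea in a form that can be run anywhere. It creates a hypothetical hw_1 folder inside R's temporary directory, moves into it, moves back up with .., and finally restores the original working directory.

```r
old_wd <- getwd()                      # remember where we started

# A hypothetical folder inside R's temporary directory.
demo_dir <- file.path(tempdir(), "hw_1")
dir.create(demo_dir, showWarnings = FALSE)

setwd(demo_dir)                        # move into hw_1
inside <- basename(getwd())
print(inside)

setwd("..")                            # relative path: up one folder
parent <- basename(getwd())
print(parent)

setwd(old_wd)                          # put the working directory back
```

Saving the starting directory first and restoring it at the end keeps a script from leaving the session somewhere unexpected.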

In the image above, if your working directory is the folder hw_1, you can reach the data1.csv file with the path "../data/data1.csv". This path could be given to read.csv() (or another data-reading function) to read in the data, or to write.csv() to write over the file.

You can also have file.path() create file paths for you:

    data_path <- file.path("..", "data", "data1.csv")
    data_path

    [1] "../data/data1.csv"

Then use the objects created by file.path() to make reading and writing files simpler:

    read.csv(data_path)

    # create a partial path so we can customize the file name
    data_folder <- file.path("..", "data")

    # path is "../data/data1.csv"
    read.csv(file.path(data_folder, "data1.csv"))

    # path is "../data/data2.csv"
    write.csv(dat, file.path(data_folder, "data2.csv"))

Exercises

  1. Set your working directory to your Desktop.

  2. With dir.create(), create a folder called "Project". Then, inside of "Project", create three folders: "Data_raw", "Data_formatted", and "Scripts".

  3. Download a dataset from the Census and drag-and-drop it into the Data_raw folder.

  4. Set your working directory to the Scripts folder.

  5. Create a new R script and save it in Scripts.

  6. Without changing your working directory from Scripts, read in your downloaded file from Data_raw and then save it as a CSV file to Data_formatted. Save these commands in your R script.

This is the start of a well-organized project!